[Development] Why Coin?

Tue Nov 20 08:17:15 CET 2018

Hi all,

From time to time I get questions about Coin, and about why we started developing our own CI system for Qt instead of using an existing solution.

To understand it, we probably need to go back a few years to the time when we started developing it. At that point, we had a Jenkins-based CI system that was giving us quite a few problems. Amongst them were:

* We had lots of stability issues with the system. Much later, we saw that some of those issues were problems in the networking and virtualisation layer, but we didn’t know this at that time.
* CI machines were constantly running, making it very hard to balance resource requirements and leading to rather bad hardware utilisation. 
* The long-running CI machines could easily accumulate a lot of garbage (again leading to instabilities and hard-to-debug problems)
* We could only ever do one release at a time, as switching branches required us to switch the VM templates
* We couldn’t deal with modularised repositories in a decent way, so we always had to compile all dependencies from scratch, leading to extremely long turnaround times
* The branch configurations were managed by hand, sometimes leading to problems when creating new branches
* CI and packaging were disconnected, so we had to compile everything once again from scratch to create binary packages (with slightly different configurations). This often led to build errors during packaging that weren’t visible in CI. In addition, this wasted a lot of time and resources. The minimum turnaround time from a fix being staged on Gerrit until it ended up in a package was 48 hours.
* Lack of provisioned build/test VMs.
* Developers couldn’t access the build/test VMs themselves for debugging (now we can at least provide it to people working at TQtC)
* We didn’t have any tests for the CI system itself, making it difficult to change and maintain.
* There were probably other issues that I’ve forgotten about now…

So some years ago we sat down to work out how best to solve those problems. We looked at a variety of existing solutions and at whether they could solve the problems we were having. In the end we came to the conclusion that none of the solutions that existed at that point in time were what we really needed and wanted.

That left us with the option of either implementing a lot of new functionality for an existing system or doing our own solution. We ended up going for our own, as we couldn’t see how to easily bring the existing solutions to where we wanted them to be.

That turned out to be a larger effort than we initially estimated. Nevertheless, Coin is nowadays a much better system than what we had some years ago with Jenkins. Most of the problems mentioned above are solved today.

So where are we with Coin today? Let’s have a quick overview of what we have and also of the remaining issues we’re seeing:

The CI system contains several layers. As the basis, we have a cluster of rather powerful blades and a large set of Macs. Those are running inside a separate DMZ inside the Qt Company’s network. Each of those blades runs Linux with KVM as the hypervisor. The whole cluster is administered through OpenNebula/MAAS.

Coin itself runs on a separate powerful machine and brings up VMs as needed through OpenNebula. It listens to staging requests from Gerrit and contains all the logic to determine how and on which platforms to test a set of changes (the list of platforms and how they are provisioned is stored in qt5.git). In addition, it has a large storage area where we cache generated binary artefacts for the different repositories/branches/sha1s. Those are used to test changes in dependent repositories, so we don’t need to compile qtbase every time. The same binary artefacts are also used to create our binary packages.
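
To illustrate the caching idea, here is a minimal Python sketch. It is not Coin’s actual code: the class, the function names, the in-memory dictionary and the platform component of the key are all assumptions made for this example. It only shows the principle that an artefact built once for a given repository/branch/sha1 can be reused by later builds that depend on it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

# Hypothetical cache key: (repository, branch, sha1, platform).
# The platform component is an assumption added for this example.
Key = Tuple[str, str, str, str]


@dataclass
class ArtifactCache:
    """In-memory stand-in for the storage area holding built binaries."""
    store: Dict[Key, str] = field(default_factory=dict)
    builds: int = 0  # counts how often we actually had to compile

    def get_or_build(self, key: Key, build: Callable[[Key], str]) -> str:
        # Reuse the stored artefact if this exact revision was built before.
        if key not in self.store:
            self.store[key] = build(key)
            self.builds += 1
        return self.store[key]


def fake_build(key: Key) -> str:
    # Placeholder for a real compile step; returns an artefact name.
    repo, branch, sha1, platform = key
    return f"{repo}-{branch}-{sha1[:7]}-{platform}.tar.gz"


cache = ArtifactCache()
qtbase = ("qtbase", "5.12", "0123456789abcdef", "linux-x86_64")

# Two changes in dependent repositories need the same qtbase revision.
first = cache.get_or_build(qtbase, fake_build)
second = cache.get_or_build(qtbase, fake_build)
```

In this sketch the second request is served from the cache, so qtbase is compiled only once, and the same stored artefact could later be handed to a packaging step.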

But as you know, it’s certainly not perfect, and we have our regular share of bugs and problems with the system. Coin itself is actually running pretty nicely and doesn’t generate too many problems. We have decent control over it, and most bugs that we notice in the Coin codebase itself are not too hard to fix.

There are however a couple of other issues that are still creating problems for us:

* The network was causing lots of problems: we were seeing random packets being dropped and random disconnects of TCP connections. We made some changes here last week, and are optimistic that this has now been fixed.

* Windows 10 VMs are sometimes extremely slow when being run on top of our current host Linux/KVM combination. The root cause has still not been fully identified, but we are currently working on upgrading the host OS to a newer Ubuntu version. Judging from similar bug reports by others, there’s a good chance that this will resolve the problem.

* Flaky tests are a recurring problem. We’ve spent a lot of time trying to identify them and fix things where tests rely on specific timing or other non-deterministic behaviour. A second source of flakiness comes from the underlying system, something we hope will be resolved with fixes for the two points above. Another issue is maintenance of the VMs and the fact that we have to be careful that those machines don’t start doing heavy work (such as auto updates) on their own.

* We still have some issues in the interaction between Coin and OpenNebula, where Coin fails to acquire machines and Tier2 images get corrupted. This is being worked on by the CI team.

Coin’s SW development is now moving towards being able to build, test and package not only Qt, but also the other products we have (Qt Creator, 3D Studio, Design Studio, Automotive, etc.). This will also make it a lot easier to test and package additional frameworks that are not part of qt5.git in Coin.

We should continue to evaluate alternatives from time to time, but currently Coin is the best option we have for our CI.

Cheers,
Lars