[Development] CI stability

Wed Feb 8 10:46:36 CET 2017

Hi everybody,

I guess all of you know the frustration of not getting your change in because an unrelated autotest failed, or because changes in other modules broke things for you.

After quite a few discussions, we’ve now decided that we will try to tackle this and hopefully make the whole process smoother in the longer term. The most important item is that our CI system is never blocked, so that we all can get our patches and bug fixes in as smoothly and quickly as possible. We need to put a very high priority on any issue that prevents us from getting (valid) changes in.

As a first item, I’d like to make it clear that reverting changes is something we should do more often. If a change is found to cause problems in other modules (like autotest failures), and the author of the change can’t fix it immediately (for whatever reason), it’s ok for anybody to revert that change. In that case, simply re-open the corresponding bug and send the author of the problematic change an email explaining what you reverted and why.

The second larger issue is flaky tests, i.e. tests that fail randomly from time to time. These tests are causing huge issues in CI, and make qt5.git integrations, which are required for releasing and for getting updated packages, especially painful. The flakiness of the tests and the flakiness in the CI system itself (which is a separate item for the team working on the CI) are among the main reasons why releases are getting delayed.

So we will try a new policy for those flaky tests from now on:

Anybody who identifies a flaky test (i.e. a test that is randomly failing in CI) can blacklist that test, under one condition: they need to create a P0 bug report about it at the same time. Please also add the labels ‘autotest’ and ‘flaky’ to the bug report, so that we can follow up on those.

Flaky tests might be badly written tests, but they can also hide real problems in Qt. So we need to put a very high priority on looking at them and keeping our blacklist minimal. Making them a P0 ensures that the issue will be looked at quickly (as they will block the next release).

Anybody who gets one of those bug reports assigned, please look at it immediately and try to see what’s going wrong. This is not always easy or straightforward, as the failures aren’t always reproducible outside the CI system. Here are a few pointers to help:

Have a look at https://wiki.qt.io/Writing_good_tests, and check that the test is following the rules there. Often tests fail more easily under high load, so this is something to check as well. If you’re working for The Qt Company, you have the additional option of creating a VM inside the CI system or running test builds of a pushed change in the CI system. If you’re not working for TQtC, ask someone who does and we can schedule a build of any patch you push to gerrit inside the CI on the platform of your choice (with results being reported back to the gerrit change).
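
As a rough illustration of the high-load check, here is a minimal Python sketch that re-runs a test command many times, optionally with several copies in parallel to put some load on the machine, and counts how many runs fail. It is not part of the CI tooling; the helper names (run_once, count_failures) and the test binary name in the usage comment are made up for this example, and it only looks at the exit code of the command.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_once(command):
    """Run the command once, discard its output, and return its exit code."""
    result = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def count_failures(command, runs, parallel=1):
    """Run `command` `runs` times, up to `parallel` copies at once.

    Returns the number of runs that exited with a non-zero code.
    """
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        exit_codes = list(pool.map(run_once, [command] * runs))
    return sum(1 for code in exit_codes if code != 0)


# Hypothetical usage (the binary name is an assumption, not a real test):
#   failed = count_failures(["./tst_example"], runs=100, parallel=8)
#   print(f"{failed} of 100 runs failed")
```

A test that never fails with parallel=1 but fails now and then with a higher value is a hint that it depends on timing. Whether running copies of the same test side by side is safe depends on the test (shared ports, temporary files and so on), so treat the result as a pointer rather than proof.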

If the test is flaky due to some external dependency (e.g. the network test server), you might want to file the bug as a subtask for getting a better test server in place and keep the test blacklisted. In almost all other cases, you should try to fix the test (or the bug in Qt). If it's not possible to fix the test, think about how it could be rewritten. If the test is worthless (for example because it doesn’t test anything we haven’t covered in other ways), remove it.

In any case, please handle those bug reports quickly. As said above, they will block the release until handled. Please don’t down-prioritize these bug reports without very good reasons (and without talking to the module maintainer).

I hope that as many people as possible will help in the effort. Fixing those flaky tests is quite some work right now, but in the longer term it will benefit us all when integrations go in more smoothly and we can more easily update qt5.git and get releases out.

Cheers,
Lars