[Development] results of July flaky tests fixing

Volker Hilsheimer volker.hilsheimer at qt.io
Thu Sep 1 11:28:55 CEST 2022

Thanks for sharing that overview, Anna!

While chasing failures of my integrations caused by flaky tests over the summer, I noticed a couple of patterns that I think are worth keeping in mind when investigating flaky tests, or when writing new ones:

* QTest::qWaitForWindowActive - very often, a test doesn’t need an active window at all; an exposed window is enough. Use QTest::qWaitForWindowExposed instead.

* stress tests for data races: if your test doesn’t expose any race conditions when run with QThread::idealThreadCount threads, then it’s unlikely to expose races when run with more threads. But with time sharing, the extra threads might make the test run a lot longer than you expect. See e.g. https://codereview.qt-project.org/c/qt/qtbase/+/421391

* hardcoded waiting times are an anti-pattern. I know it’s not always possible to avoid them (we don’t have qWaitFor… helpers for everything), but when testing high-level functionality that relies on lower-level functionality, it’s a good idea to check that the lower-level bits worked. E.g. https://codereview.qt-project.org/c/qt/qtbase/+/421658

To the last point - tests can use our private APIs, so adding private infrastructure that makes it easier to write robust tests is a good idea!


> On 30 Aug 2022, at 18:53, Anna Wojciechowska <anna.wojciechowska at qt.io> wrote:
> Hello,
> I would like to present the results of the July fixing of flaky tests.
> short version:
> 19 - the number of cases of platform-specific flakiness to start with (from 14 different tests)
> as a result:
> 11 cases of platform-specific flakiness were fixed (caused by 8 different tests)
> 4 cases of platform-specific flakiness were still flaky (from 4 different tests)
> 4 cases of flakiness were blacklisted (2 different tests)
> The table under the link below shows more detailed information about fixed tests.
> https://wiki.qt.io/Fixed_flaky_tests_in_July_2022
> long version:
> How was the problem approached? 
> We collected data about flakiness from June; in July we created a list of the top "worst" cases that failed integrations and contacted module maintainers. We gave the changes some time to be merged and to run a sufficient number of times to gain confidence that the fixes actually worked - and in late August we checked the results again.
> The complete list of flaky tests from June that were being fixed in July can be found at this link:
> https://testresults.qt.io/grafana/d/000000007/flaky-summary-ci-test-info?orgId=1&from=1656626400000&to=1659304799000&viewPanel=65
> Which tests were taken into analysis? 
> The tests from the dev branch that negatively impacted the integration system by causing at least 1 failure in any integration and at least 1 flaky event.
> What is the difference between a failed and a flaky test?
> You can find a good explanation here:
> https://testresults.qt.io/grafana/d/000000007/flaky-summary-ci-test-info?orgId=1&viewPanel=55
> and here:
> https://testresults.qt.io/grafana/d/000000007/flaky-summary-ci-test-info?orgId=1&viewPanel=41
> What is understood by a "test"?
> A test is an umbrella term for a pair: test case and test function. A test case (usually a cpp file) contains several test functions that return results (pass, fail, or xfail). We collected and analyzed these results. Additionally, some test functions take data tags - arguments that provide even more detailed results; however, we do not store them, so the granularity of the data ends at the test function level.
> What is understood by a "platform-specific flakiness"?
> A test runs on a specific platform - described by "target operating system" and "target architecture". In most cases, flakiness is tied to a particular test run on a specific platform.
> E.g., the test case "tst_qmutex" with the test function "more stress" can return stable results on most platforms but be flaky on MacOS_11 X86_64 or on Windows_10_21H2 X86_64. In such a case, it is counted as 2 cases of "platform-specific flakiness" (MacOS_11 and Windows_10_21H2) caused by a single (unique, distinct) test.
> Since the July fixing provided good results, in August we repeated the procedure: we gathered data about the most damaging flaky tests (those failing integrations) and compared it to July's data, to make sure only "new" tests are on the list. August's failing flakiness can be viewed via the link below. Developers and maintainers are welcome to check whether their tests are on the list.
> https://wiki.qt.io/Flaky_tests_that_caused_failures_in_August
> Big thanks to everyone participating in fixing the tests!
> Anna Wojciechowska
> The notebooks used to prepare this analysis can be found at:
> https://git.qt.io/qtqa/notebooks/-/tree/main/flakiness/august_2022
> _______________________________________________
> Development mailing list
> Development at qt-project.org
> https://lists.qt-project.org/listinfo/development
