[Development] results of July flaky tests fixing

Tue Aug 30 18:53:20 CEST 2022

Hello,

I would like to present the results of the July fixing of flaky tests.

short version:

We started with 19 cases of platform-specific flakiness (caused by 14 different tests).

As a result:

11 cases of platform-specific flakiness were fixed (caused by 8 different tests)

4 cases of platform-specific flakiness are still flaky (caused by 4 different tests)

4 cases of flakiness were blacklisted (caused by 2 different tests)

The table under the link below shows more detailed information about fixed tests.

https://wiki.qt.io/Fixed_flaky_tests_in_July_2022

long version:

How was the problem approached?

We collected flakiness data from June. In July we compiled a list of the top "worst" cases, those that failed integrations, and contacted the module maintainers. We then allowed time for the changes to be merged and for the tests to run often enough to give confidence that each fix actually worked. In late August we checked the results again.

The complete list of flaky tests from June that were targeted for fixing in July can be found at this link:

https://testresults.qt.io/grafana/d/000000007/flaky-summary-ci-test-info?orgId=1&from=1656626400000&to=1659304799000&viewPanel=65
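The "from" and "to" parameters in the link above are timestamps in epoch milliseconds. As a minimal sketch, they can be decoded as below to check which time window the panel displays. The helper name dashboard_window is made up for illustration, and the fixed UTC+2 offset is an assumption based on the CEST timestamp of this mail.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse

URL = (
    "https://testresults.qt.io/grafana/d/000000007/flaky-summary-ci-test-info"
    "?orgId=1&from=1656626400000&to=1659304799000&viewPanel=65"
)

# Assumption: a fixed UTC+2 offset (CEST), not a full time zone database lookup.
CEST = timezone(timedelta(hours=2), "CEST")


def dashboard_window(url, tz=CEST):
    """Return the (start, end) datetimes encoded in the URL's from/to parameters."""
    query = parse_qs(urlparse(url).query)

    def to_datetime(key):
        # The values are epoch milliseconds; convert to whole seconds first.
        return datetime.fromtimestamp(int(query[key][0]) // 1000, tz=tz)

    return to_datetime("from"), to_datetime("to")


start, end = dashboard_window(URL)
print(start.isoformat())
print(end.isoformat())
```

Under these assumptions the link decodes to the window from 2022-07-01 00:00:00 to 2022-07-31 23:59:59 CEST.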

Which tests were taken into analysis?

Tests from the dev branch that negatively impacted the integration system by causing at least 1 failure in any integration and at least 1 flaky event.
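The selection rule can be sketched as a simple filter. This is only an illustration: the record layout, the field names, and the sample values are hypothetical and are not taken from the actual notebooks or database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionStats:
    """Hypothetical per-test summary; all field names are assumptions."""
    test_case: str
    test_function: str
    branch: str
    integration_failures: int  # integrations this test caused to fail
    flaky_events: int          # flaky events recorded for this test


def select_for_analysis(rows):
    """Keep dev-branch tests with at least 1 integration failure and 1 flaky event."""
    return [
        row for row in rows
        if row.branch == "dev"
        and row.integration_failures >= 1
        and row.flaky_events >= 1
    ]


# Made-up sample rows, for illustration only.
sample = [
    FunctionStats("tst_a", "funcA", "dev", 2, 5),    # selected
    FunctionStats("tst_b", "funcB", "dev", 0, 7),    # flaky, but never failed an integration
    FunctionStats("tst_c", "funcC", "6.4", 3, 3),    # not on the dev branch
    FunctionStats("tst_d", "funcD", "dev", 1, 0),    # failed, but no flaky event
]
selected = select_for_analysis(sample)
print([row.test_case for row in selected])
```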

What is the difference between a failed and a flaky test?

You can find a good explanation here:

https://testresults.qt.io/grafana/d/000000007/flaky-summary-ci-test-info?orgId=1&viewPanel=55

and here:
https://testresults.qt.io/grafana/d/000000007/flaky-summary-ci-test-info?orgId=1&viewPanel=41

What is understood by a "test"?

A "test" is an umbrella term for a pair: a test case and a test function. A test case (usually a cpp file) contains several test functions that return results (pass, fail, or xfail). We collected and analyzed these results. Some tests also use data tags, that is, test function arguments that yield more detailed results. However, we do not store them: the granularity of the data ends at the test function level.

What is understood by "platform-specific flakiness"?

A test runs on a specific platform - we describe it as "target operating system" and "target architecture". In most cases, flakiness is related to a particular test run on a specific platform.

E.g., for test case "tst_qmutex", the test function "more stress" can return stable results on most platforms but be flaky on MacOS_11 X86_64 and on Windows_10_21H2 X86_64. In such a case, it is counted as 2 cases of "platform-specific flakiness" (MacOS_11 and Windows_10_21H2) caused by a single (unique, distinct) test.
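The counting in this example can be sketched as follows. The tuple layout and the function name count_flakiness are assumptions made for illustration, not the format used in the notebooks.

```python
def count_flakiness(records):
    """Count platform-specific flakiness and the distinct tests behind it.

    Each record is (test_case, test_function, target_os, target_arch).
    """
    # One flaky test on one platform is one case of platform-specific flakiness.
    platform_specific = set(records)
    # A distinct test is the (test case, test function) pair, whatever the platform.
    distinct_tests = {(case, function) for case, function, _os, _arch in records}
    return len(platform_specific), len(distinct_tests)


# The example from the text: one test, flaky on two platforms.
flaky_records = [
    ("tst_qmutex", "more stress", "MacOS_11", "X86_64"),
    ("tst_qmutex", "more stress", "Windows_10_21H2", "X86_64"),
]
cases, tests = count_flakiness(flaky_records)
print(cases, tests)
```

This is why the summary reports two numbers for each outcome: the count of platform-specific cases and the count of different tests that caused them.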

Since the July round of fixing gave good results, we repeated the procedure in August: we gathered data about the most damaging flaky tests (those failing integrations) and compared it with the July list, to make sure that only "new" tests are on the list. The August flaky tests that caused failures can be viewed under the link below. Developers and maintainers are welcome to check whether their tests are on the list.

https://wiki.qt.io/Flaky_tests_that_caused_failures_in_August

Big thanks to everyone participating in fixing the tests!

Anna Wojciechowska

The notebooks used to prepare this analysis can be found at:

https://git.qt.io/qtqa/notebooks/-/tree/main/flakiness/august_2022
