[Development] Splitting Qt Network out of qtbase (was: QtBase network failures)

Sun Jun 26 15:38:30 CEST 2022

> On 25 Jun 2022, at 18:33, Thiago Macieira <thiago.macieira at intel.com> wrote:
> 
> On Saturday, 25 June 2022 01:57:24 PDT Volker Hilsheimer wrote:
>> Perhaps this is a good time to discuss whether we should move Qt Network
>> into its own repository. This would make qtbase integrations less exposed
>> to network failure, which - even without certificates expiring - are a fact
>> of life. And qtbase integrations already suffer from plenty of flakiness.
>> And that an operational issue might require patches to merge and to get
>> cherry picked, which might take several attempts, each taking several
>> hours, just amplifies that problem further.
> 
>> Conceptually, we have made that kind of change before (when taking Qt
>> Positioning out of the qtlocation repo). But there are some challenges.
> 
>> One challenge is that several of our Qt Core tests are using networking
>> features (tests outside of tests/auto/network that include
>> network-settings.h: tst_qdir, tst_qdiriterator, tst_qfile, tst_qfileinfo,
>> tst_qiodevice, tst_qtextstream, tst_qfiledialog2). Without having looked
>> into the details, I’d assume that we might not need an actual server to
>> test many of those codepaths (or that those tests can be moved into a
>> qtnetwork repo, ie. QTextStream::stillOpenWhenAtEnd doesn’t seem to test
>> QTextStream, which never closes a QIODevice).
> 
> Personally, I'd prefer if those Core tests ddn't use Networking. The majority 
> of them aren't actually using QtNetwork, they are the Windows portion that 
> deals with the SMB server provided by the Network Test Server. So the issue 
> isn't that of QtNetwork, but of the NTS and would remain anyway. That would 
> leave a few tests like QTextStream that use QTcpSocket for some particular 
> QIODevice sequential condition, but which could be replaced with an identical 
> condition with a different class, like QProcess.
> 
> But what's the gain? This looks like a lot of effort to me, particularly if we 
> don't move the UNC path tests in the file classes.
> 
> Not looking scientifically at it, but from memory, the network test server and 
> the networking tests haven't been the majority of spurious failures in the CI. 
> They're a big contributor, but not the majority. From a random sampling of 
> test failures in the past week, I see:
> 
> Non-test failures:
> * general CI failures - "failed to acquire machine" [1]
> * sccache network failures [2] 
> * licensing issues with the INTEGRITY compiler
> * timeouts [3]
> * weird unexplained failures like [4] or [5]
> Test failures:
> * flaky tests on timing (QMutex, QDeadlineTimer, etc.)
> * QFSModel on macOS on ARM [6]
> * a std::filesystem unexplored issue on Windows [7]
> * some widget issues like [8] or [9]
> 
> And yes, network test failures in 
> https://testresults.qt.io/coin/integration/qt/qtbase/tasks/1655473411
> 
> But they are nowhere near the majority, even the plurality. The CI general 
> failures, sccache failures and timeouts appear to be far more common and 
> deserve more attention. 
> 
> Even among pure test failures the network ones don't appear to be the largest 
> contributor. So I have to ask: is the effort worth the benefit?

It’s not going to be a silver-bullet, and I agree that there are other sources of flakiness that are likely larger contributors to failing integrations. Anecdotally, it seems that every single patch that I was involved in during the last couple of weeks was blocked in some branch by either tst_qtcpsocket or tst_qnetworkreply failure (or both [1]). Perhaps it’s likely that things are related - if the network is unstable enough, then we might either get an sccache failure, or a Qt Network test failure.

[1] https://codereview.qt-project.org/c/qt/qtbase/+/417978

However, and again anecdotally, qtbase seems to suffer from the dependency on a stable network a lot more than other repositories. Maybe because it’s large enough for any glitch to likely hit either a build or a test run. In which case making qtbase smaller, and esp taking away those tests that might take a lot of time. In which case making qtbase smaller might improve things as well.

So, I think that it might be worth it. Cleaning up the QtCore tests that don’t just test file I/O with UNC paths (*) seems like an almost trivial refactoring - move the relevant tests to a test case in qtnetwork. The git work is also mostly a mechanical exercise, repeating what we did with Qt Positioning. We might not have to solve any hard engineering problem to take at least a little step forward.

Whereas making sccache fault tolerant, or making the CI system run test VMs with certain performance guarantees, or generally writing reliable tests that nevertheless depend on non-deterministic subsystems, seem like much harder engineering problems (that are still worth trying).

Volker

(*) As for the UNC stuff - it seems that we are testing only string parsing code. We are not taking care of any of the actual network traffic or SMB protocol. So do we need to access a share from a remote server at all? Would it be an option to create and share a folder on the Windows VMs running those tests during provisioning, and then use '\\$(COMPUTERNAME)’? That works for me on a local VM at least, all QFile tests pass (and we could probably even enable tst_QFile::largeUncFileSupport and simplify tst_QFile::writeLargeDataBlock_data) after running this as admin:

New-Item 'c:\testshare' -ItemType Directory
New-SmbShare -Name testshare -Path 'c:\testshare’ -ReadAccess Users
Write-Output $(“." * 32) > test.pri # test expects size of 34 bytes
New-Item c:\testshare\readme.txt -ItemType File

New-Item 'c:\testsharewritable' -ItemType Directory
New-SmbShare -Name testsharewritable -Path 'c:\testsharewritable’ -ChangeAccess Users