[Development] Need advise on acceptable timeouts for autotests

Thu Mar 16 16:53:12 CET 2017

Hi

1. To be frank, we don't know. The VMs aren't live-migrated, and the SAN hasn't been proven to be down or even to be the bottleneck, at least on standard hardware. Apple hardware is another matter: there the SAN seems to slow down, or possibly even drop out, at times. Some Macs slow down to the point that reading or writing to the hard drive, which goes through the SAN, drops to 10 MB/s. I did once see curl fail to download a tarball, but that was on a buggy version of 10.12, and the problem behind it was fixed in 10.12.2. I can't confirm or deny whether it has happened since.

We also don't overallocate resources. If we have a 20-core server, we assign only 5 x 4-vCPU VMs to it, even if hyperthreading makes the host appear to have 40 cores. However, I realized a week ago that provisioning doesn't follow this "rule". If a server is already maxed out and we launch provisioning for a template that is assigned to that host, the VM to be provisioned is launched there as well. And if we launch 2 or 3 branches simultaneously, we may end up with 6-8 VMs on that 20-core server. It _should_ cope with that easily, but in theory things start to slow down, because the VMware hypervisor allocates all the CPU cycles the guest might need, even if the guest only uses 1 core in reality; the hypervisor can't know whether the guest really needs all 4 or not. But even if we doubled the number of VMs on a host and all of them ran at 100%, that would only halve the CPU cycles available to each VM. Even then, an action that takes a split second wouldn't take 2 seconds, let alone 2 minutes.
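To put rough numbers on that, here is a minimal back-of-the-envelope sketch. The 20 cores and 4 vCPUs per VM come from the figures above; the function name and the assumption that the hypervisor shares cores perfectly fairly (ignoring hyperthreading and scheduler overhead) are mine, purely for illustration.

```python
PHYSICAL_CORES = 20  # host size mentioned above
VCPUS_PER_VM = 4     # VM size mentioned above


def worst_case_slowdown(vm_count, vcpus_per_vm=VCPUS_PER_VM, cores=PHYSICAL_CORES):
    """Upper bound on slowdown if every vCPU is 100% busy.

    Assumes (hypothetically) that physical cores are shared fairly,
    so the slowdown is simply demanded vCPUs divided by real cores.
    """
    demanded = vm_count * vcpus_per_vm
    return max(1.0, demanded / cores)


if __name__ == "__main__":
    for vms in (5, 8, 10):
        print(f"{vms} VMs -> at most {worst_case_slowdown(vms):.1f}x slower")
    # A 1 s call hitting a 120 s timeout would need a 120x slowdown,
    # far beyond anything this kind of overcommit could produce.
```

Under these assumptions, even 10 fully loaded VMs only double the runtime, so CPU overcommit alone does not account for a one-second call taking two minutes.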

Two years ago we debugged one of these oddities, where network tests failed for no good reason. It looked as if vSphere's underlying virtual LAN was causing problems by dropping packets. However, while debugging, we found that it was in fact Qt's network code that was buggy. It had something to do with 2 IP packets arriving one after another so fast that both were in our network code's buffer at the same time. The bug was that only the first one was ever handled; the second one was discarded. At the time no one knew how to fix it, so it was left there. I still don't know if it has been fixed. Someone said that the entire network code should be rewritten, because the current one was a mess.
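The general shape of that kind of bug can be sketched as follows. This is not Qt's actual code; it is a hypothetical illustration using newline-terminated messages in a byte buffer, and both function names are made up for the example.

```python
def handle_first_only(buffer: bytearray):
    """Buggy pattern: handle one message, then throw the rest of the buffer away."""
    messages = []
    end = buffer.find(b"\n")
    if end != -1:
        messages.append(bytes(buffer[:end]))
    buffer.clear()  # a second message that arrived in the same read is lost here
    return messages


def handle_all(buffer: bytearray):
    """Drain every complete message; keep a trailing partial one for the next read."""
    messages = []
    while (end := buffer.find(b"\n")) != -1:
        messages.append(bytes(buffer[:end]))
        del buffer[:end + 1]  # remove the handled message and its terminator
    return messages


if __name__ == "__main__":
    print(handle_first_only(bytearray(b"one\ntwo\n")))
    print(handle_all(bytearray(b"one\ntwo\n")))
```

A handler of the first kind works fine as long as messages arrive far enough apart, which is why such a bug can look like a flaky network rather than a code problem.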

We've also seen GUI tests failing where sleeps seem to help. Here something like waitUntilExposed (or whatever) didn't work as expected on Macs. And obviously this caused problems once or twice in thousands of runs, making it look as if something in the hardware was causing these flaky runs. It is still possible that only certain servers cause these small timing differences. It is even likely, since we have lots of servers several generations apart, and reproducing these failures on one VM on one host turns out to be very difficult once you actually start debugging the tests. But we can't replace all the hardware we have in one go and throw away perfectly fine hardware just because it's not identical to the next one. In an ideal world we'd have controlled runs on older and newer generations etc., but in reality I think we have to treat the hardware as a constant, non-changing factor beneath the virtualization layer. Broken hardware is another story, and for that we need to gather metrics to see what failed and where. If it's always the same machine, and not a whole generation of hardware, then it's most likely a broken unit.
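As a rough idea of what such metrics could look like, here is a small sketch that counts failures per host and flags a host that accounts for most of them. The record format, the host and generation names, and the 80% threshold are all hypothetical; our CI does not necessarily store its data this way.

```python
from collections import Counter

# Hypothetical failure records: (host, hardware_generation)
FAILURES = [
    ("host-a", "gen1"),
    ("host-a", "gen1"),
    ("host-a", "gen1"),
    ("host-a", "gen1"),
    ("host-b", "gen1"),
]


def suspect_hosts(failures, min_share=0.8):
    """Return hosts that account for at least min_share of all failures."""
    per_host = Counter(host for host, _generation in failures)
    total = sum(per_host.values())
    if total == 0:
        return []
    return sorted(host for host, n in per_host.items() if n / total >= min_share)


if __name__ == "__main__":
    print(suspect_hosts(FAILURES))
```

If the failures are spread evenly over many hosts, nothing is flagged, which would point away from a single broken unit and back toward the tests or the software.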

With all that said, I didn't really provide you with any answers, did I? :P

Regards,
-Tony
-----Original Message-----
From: marc at kdab.com [mailto:marc at kdab.com] On Behalf Of Marc Mutz
Sent: Thursday, 16 March 2017 11:01
To: development at qt-project.org
Cc: Qt CI <qt.ci at qt.io>
Subject: Need advise on acceptable timeouts for autotests

Hi,

We repeatedly have the problem that timeouts that developers think are ample (because they exceed typical runtime by, say, two orders of magnitude) are found to be insufficient on the CI.

Latest example: 
http://testresults.qt.io/coin/integration/qt/qtbase/tasks/1489618366

The timeout to run update-mime-database was recently increased to 2mins. But that still does not seem to be enough, for a call that hardly takes a second to run on unloaded machines.

We can of course crank up timeouts to insane amounts like 1h, but that means that a test will just sit there idling for an hour in the worst case.

I have two questions:

1. Where do these huge slowdowns come from? Is the VM live-migrated? Is the
   SAN, if any, down? At this point it looks like no overcommitting of CPU/RAM
   could ever explain how update-mime-database can take 2mins to run.

2. What should we choose as timeouts? I understand that tests which are stuck
   are killed after some time (how long?). Maybe timeouts should be set to the
   same value?
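
For reference, the trade-off can be sketched like this: a wait with a timeout returns as soon as the command finishes, so a generous limit only costs time when the command really is stuck. This is a minimal illustration with a made-up helper name, not how the autotests actually invoke update-mime-database.

```python
import subprocess
import sys
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (finished, elapsed_seconds).

    finished is False if the command was killed for exceeding timeout_s.
    """
    start = time.monotonic()
    try:
        subprocess.run(cmd, timeout=timeout_s, check=False,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        finished = True
    except subprocess.TimeoutExpired:
        finished = False  # subprocess.run kills the child before raising
    return finished, time.monotonic() - start


if __name__ == "__main__":
    # A quick command returns long before a generous 120 s limit.
    print(run_with_timeout([sys.executable, "-c", "pass"], 120))
    # A stuck command is cut off at the limit.
    print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], 0.5))
```

Recording the elapsed time for every run, not just the failures, would also show how close normal runs come to the limit on the CI.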

Thanks,
Marc

--
Marc Mutz <marc.mutz at kdab.com> | Senior Software Engineer KDAB (Deutschland) GmbH & Co.KG, a KDAB Group Company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts