[Development] qtbase CI now takes 6 hours

Sarajärvi Tony Tony.Sarajarvi at digia.com
Fri May 17 08:14:12 CEST 2013


Hi

I'm responsible for keeping the CI up and running and stable, so in cases like this you may point a finger at me and ask what's going on. Currently quite a few things are, and I'll try to summarize them here.

To begin with, we had been keeping a few "improvements" on hold for a week or two until we got the 5.1 beta out. Now that it's out, we took a risk and pushed the changes forward, and something brown hit the fan. We also have a few long-term problems that didn't appear just now, but they cause pain nevertheless.

What we try to do there is puppetise everything, so that we can destroy and set up new virtual machines quickly, and so that everything is reproducible by anyone in the community without a lot of documentation to read through. This time we went for puppetising the Java installation on the Windows machines. We still don't know why, or whether that actually had something to do with this, but for some reason the build nodes now get fatal errors right in the middle of something. It's always a temporary file that can't be deleted for some odd reason. We don't even know why this temporary file exists or what it is for, but according to its name it has something to do with Hudson.

We guessed that we had installed a Java version that causes this problem. However, the error appears both on servers that got the update and on servers that haven't yet received the new Java. So currently we have no clue what causes these Windows node errors, as they don't seem to be related to our Java change at all.

One old thing is our network test server. Again, for some reason we keep having builds that occasionally fail the network tests. The test log states that it can't connect to the server, or to a service on the server. However, the next build passes just fine. It seems some network issue causes this, but narrowing things like that down is painful, especially as they are all running as virtualized servers on the same vSphere host. I could create a virtualized server to keep running pings back and forth for a few weeks and see if anything breaks down, but what if the connection only gets severed between the CI node and the network test server? :)
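A probe like the one described above could be sketched roughly as follows. This is only a sketch: the hostname, port, and probe interval are placeholders, not our real CI setup, and it checks TCP connectivity rather than ICMP pings so it doesn't need root privileges.

```python
import socket
import time
from datetime import datetime

def probe(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except OSError:
        return False

def monitor(host, port, probes, interval=10.0):
    """Run `probes` connectivity checks, logging each failure with a
    timestamp so intermittent outages leave a trace. Returns the
    number of failed probes."""
    failures = 0
    for _ in range(probes):
        if not probe(host, port):
            failures += 1
            print("%s FAILED to reach %s:%d"
                  % (datetime.now().isoformat(), host, port))
        time.sleep(interval)
    return failures

# Placeholder target; in practice this would be the network test
# server's address and one of the service ports the tests use.
if __name__ == "__main__":
    monitor("network-test-server.example", 80, probes=5, interval=10.0)
```

Running one of these on the CI node itself (rather than a fresh VM) would at least tell us whether the CI-node-to-test-server path specifically is the one being severed.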

It could also be a capacity issue on the network test server; perhaps it gets too many connections from parallel builds. I have already addressed this with a new network test server, but that one is still only in testing. It looks like it could be swapped in now, but I'd have to make sure, and these failing builds rank above that on my priority list.

We have also had serious problems with cloning from Git, and we tried making improvements there last week. The implementation caused a few build breaks, but with help we got it back up. Still, to you it looked like our CI was broken again.

One very old problem is the famous 'cmd.exe' problem appearing in random Windows 8 builds. This has been looked into several times, and we can't find the cause. Something goes haywire, and despite several good ideas about what it could be, we haven't been able to pin it down.

But now back to fixing the problems...

Regards,
-Tony
-----Original Message-----
From: development-bounces+tony.sarajarvi=digia.com at qt-project.org [mailto:development-bounces+tony.sarajarvi=digia.com at qt-project.org] On Behalf Of Stephen Kelly
Sent: 16 May 2013 15:00
To: development at qt-project.org
Subject: [Development] qtbase CI now takes 6 hours


Hi there,

Adding to the other major problem of CI (non-deterministic results), CI in 
qtbase now takes 6 hours. See the most recent successful build here: 

 https://codereview.qt-project.org/#change,55977

and mouse-over the date to get the timestamps.

As a reminder, you can see all CI reports here:

 http://thread.gmane.org/gmane.comp.lib.qt.ci-reports

As you can see, things fail very often. If you look into the failures, you can 
see that many of them do not relate to the patches under test (they are 
flaky, or the test machines hit a problem, etc.).

I recommend that anyone who can do anything about the CI keep track of it by 
subscribing to the reports mailing list:

 http://lists.qt-project.org/mailman/listinfo/ci-reports

It is very easy to see if failures are related to the patch under test or not, 
and then that can be reacted to. The reaction might be to mark a test as 
insignificant, or to mark a broken platform as having insignificant tests (as 
was done recently for macx-clang_developer-build_qtnamespace_OSX_10.7 for the 
qt5.git repo).

I really don't know if 'the CI people' are already tracking failures like 
that. The CI really is a mess these days, and I get the feeling that it's not 
something anyone is tracking or systematically trying to fix. I could notify 
about individual problems, but a) they should be so obvious that I don't have 
to and b) that doesn't solve the systematic problems. 

Someone who can get more information and actually solve the problems needs to 
be on top of it without being notified. I have no way of finding out why 
integrations suddenly take 6 hours instead of 2. I don't have the karma to 
mark macx-clang_developer-build_qtnamespace_OSX_10.7 or anything else as 
insignificant (or significant). I don't have any way to get the information 
that a widgets test is failing because the widget is bigger than the 'screen 
size' used by the VM (e.g. https://codereview.qt-project.org/51289). But someone 
must have access to all that information and karma, and they should be on top 
of the CI failures (by following the above mailing list) without being 
notified by me or someone else.

Currently qtwebkit stable can't be merged to dev, which means qt5.git dev 
branch can't be updated at all. We need a better way to track issues like 
that. It is important that qt5.git can be updated. So, I think submodule 
merges and qt5.git integrations should be of particular importance for any 'CI 
people' tracking the reports mailing list.

If such tracking is not possible by the people with more-direct access to the 
CI system, or no such person exists who can track that stuff, then we need a 
better way to track these failures as a community. We at least need to all be 
aware that there is no person who is going to take responsibility for CI 
issues like that, and it's up to everyone else to 'scramble and do what they 
can to fix it', using publicly available information. If that's the best 
we're going to be able to do, then we should know that. 

I also recommend adding the duration of an integration run to the mails sent 
to the reports mailing list. That way, the mailing list is a 'one stop shop' 
for seeing when CI is having systematic or unusual problems.
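Deriving that duration from the integration's start and finish timestamps is straightforward; a rough sketch, where the timestamp format is an assumption (whatever format the CI system records would do):

```python
from datetime import datetime

def integration_duration(started, finished, fmt="%Y-%m-%d %H:%M:%S"):
    """Return a human-readable duration between two timestamp strings,
    suitable for including in a ci-reports mail."""
    delta = datetime.strptime(finished, fmt) - datetime.strptime(started, fmt)
    hours, rem = divmod(int(delta.total_seconds()), 3600)
    minutes, _ = divmod(rem, 60)
    return "%dh %02dm" % (hours, minutes)

# Hypothetical timestamps for illustration only.
print(integration_duration("2013-05-16 08:00:00", "2013-05-16 14:12:00"))
```

With the duration in every report mail, a jump from ~2 hours to ~6 hours would be visible at a glance without anyone having to mouse over timestamps in Gerrit.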

Thanks,

-- 
Stephen Kelly <stephen.kelly at kdab.com> | Software Engineer
KDAB (Deutschland) GmbH & Co.KG, a KDAB Group Company
www.kdab.com || Germany +49-30-521325470 || Sweden (HQ) +46-563-540090
KDAB - Qt Experts - Platform-Independent Software Solutions


