[Development] qtbase CI now takes 6 hours

Sarajärvi Tony Tony.Sarajarvi at digia.com
Tue May 21 09:19:40 CEST 2013


Hi

I finally have some time to give you a proper answer :)

> -----Original Message-----
> From: Stephen Kelly [mailto:stephen.kelly at kdab.com]
> Sent: 17. toukokuuta 2013 10:34
> To: Sarajärvi Tony
> Cc: development at qt-project.org
> Subject: Re: [Development] qtbase CI now takes 6 hours
> 
> On Friday, May 17, 2013 06:14:12 you wrote:
> > Hi
> >
> > I'm responsible for keeping the CI up and running and having it stable. So
> > in cases like this, you may point a finger at me and ask what's going on.
> > Currently quite a few things, and I'll try to summarize here what's going
> > on.
> 
> Hi Tony,
> 
> Thanks for responding. I have some follow-ups.
> 
> 1) Can you add the time taken to integrate to the report emails?

I think so. I haven't tested this, but I reckon something like this would work.
Add these lines after line 1040 of 'qt-jenkins-integrator.pl':

---
    # Jenkins reports the build duration in milliseconds
    my $duration_in_seconds = $build->{ duration } / 1000;
    # In list context gmtime returns (sec, min, hour, ..., yday);
    # the slice [7,2,1,0] picks out (days, hours, minutes, seconds)
    my ($days, $hours, $minutes, $seconds) = (gmtime $duration_in_seconds)[7, 2, 1, 0];
    $hours += $days * 24;

    $formatted .= "\nDuration of build: $hours h, $minutes min, $seconds sec.";
---

That's a bit difficult to test, since our development version doesn't send emails at all. I guess I'd have to enable them temporarily or something.
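For reference, the same milliseconds-to-duration conversion can be sketched in Python (a stand-alone illustration of the logic, not part of the integrator script; the function name is mine):

```python
def format_duration(duration_ms):
    """Convert a Jenkins build duration in milliseconds to an h/min/sec string."""
    total_seconds = duration_ms // 1000
    # divmod splits off whole hours, then whole minutes, like the
    # gmtime slice in the Perl snippet above (days folded into hours)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"Duration of build: {hours} h, {minutes} min, {seconds} sec."

# A 6-hour integration, the subject of this thread:
print(format_duration(6 * 3600 * 1000))  # Duration of build: 6 h, 0 min, 0 sec.
```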

> 
> 2) Can you notify this list when you know a CI blocker has been introduced?
> (like the improvements you mentioned which were held back until after beta1
> and which cause some problems, and the current network tests failing, which
> you might be aware of) That way we don't have to point a finger at you and ask
> what's going on.

I'll take this as a lesson learned. I'll try to keep you more up to speed on what's going on :)

> 
> 3) Can you disable staging when such problems occur, so that everyone knows
> that it won't work anyway?

I wish I could have done that, but I don't know how. The same thing crossed my mind, though; I'll investigate how it could be achieved.

> 
> 4) You say you're responsible for keeping CI up and running. Does that include
> keeping branches merging and keeping qt5.git up to date? If not, then as I
> wrote before I think we still need a better way of tracking that.

No. The CI team doesn't look at or touch the content of the repos at all, not counting the qtqa and sysadmin repos that contain the tools we keep up to date.

> 
> 5) It's quite easy to see when the integration failed due to cloning, and when
> it failed due to a problematic integration machine (eg macx-clang_developer-
> build_qtnamespace_OSX_10.7), or a network server issue. The problem is we
> don't really know if anything is being done to address some of those issues.
> Is anything being done about https://bugreports.qt-project.org/browse/QTBUG-
> 30646 ? Either fixing the problems that are occuring
> on it, or disabling it in places where it still causes problems (qtbase and
> qtdeclarative)?

Flaky test cases are trouble, we know. We try to mark them as insignificant as we stumble upon them, although it's not exactly our job in the CI. We can't look into every failing build, examine the details behind it, and keep metrics of failing cases. For that we have created the QtMetrics web page, which I will announce probably right after I answer you here.

Who should be the person to mark them as insignificant, then? I'm not quite sure. Sergio manually runs a script that works the other way around: if a test passes while it is marked as insignificant, the script creates a task to remove that flag. This could of course cause problems with flaky tests that fail only once every 10 builds.
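To illustrate the idea (a hypothetical sketch only, not Sergio's actual script; the function and the result format are assumptions of mine):

```python
def flags_to_clear(results, insignificant):
    """Return the test names whose 'insignificant' flag should be removed
    because the test passed in the latest build.

    `results` maps test name -> 'pass' / 'fail';
    `insignificant` is the set of currently flagged tests."""
    return sorted(name for name in insignificant if results.get(name) == "pass")

# A flaky test that passes 9 builds out of 10 would usually show up here,
# which is exactly the problem described above:
print(flags_to_clear({"tst_qnetwork": "pass", "tst_qwidget": "fail"},
                     {"tst_qnetwork", "tst_qwidget"}))  # ['tst_qnetwork']
```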

> 
> 6) I see that some Windows machines seems to be persistently failing in a way
> which reminds me of macx-clang_developer-build_qtnamespace_OSX_10.7. Are
> you
> aware of that/have it on your needs-fix radar?

Windows machines have a few problems that are persistent:

One is Windows 8 64-bit and its cmd.exe problem: https://bugreports.qt-project.org/browse/QTQAINFRA-597. That one causes builds to simply fail at random. We haven't found a root cause for it yet, nor are we actively looking for one; instead we fix the other problems that we currently do know how to fix.

Another is that on Windows 8 64-bit, shutdown.exe crashes at shutdown (no bug report). This causes our machines not to restart properly after a build has completed, so the nodes go offline and builds waiting in the queue never start. Very annoying, and we have no idea what causes it to happen randomly.

Then we had the Java issue last week. Apparently, when we tried to upgrade our Java from Update 7 to Update 17, Jenkins started disconnecting the nodes. The disconnections happened seemingly at random, anywhere from 10 minutes to 3 hours into a build. For some reason, Puppet didn't upgrade all the nodes to Update 17; still, the ones remaining at U7 had problems as well. The only logical explanation we found was that the update had done something to the U7 installation without completely upgrading it to U17. The fix was to make Puppet downgrade the U17 nodes back to U7, and now they just seem to work again.

So we are working with a house of cards here :)

> 
> 7) Does your responsibility for keeping CI up and running include anything to
> do with solving problems of integration which are not related to the patches
> under test? Does it involve marking tests or platforms insignificant where
> appropriate and in a timely way? If not, then that's something we need to
> know, as I wrote before, so we can see if a 'scramble and fix it' solution can
> work.
> 

The configuration of Jenkins and the builds is our responsibility. So when we create a new platform (say, Windows 8 32-bit might be coming soon), we initially mark it as 'forcesuccess'. If we notice after the build that it passes, and perhaps even all tests pass, we change the 'forcesuccess' and/or 'insignificant' flags appropriately.

We also take care of the build nodes / servers, so that they have the tools we need: we update Visual Studios and MinGWs, install Perl modules, update Puppet manifests, and so on. If we notice in a build log that some server for some reason lacked something, it's our job to investigate why it lacked it (was Puppet out of sync, or something else?) and fix it.

If we notice that some platform constantly fails (say, if the Windows 8 cmd.exe problem occurred every single time), we would just go ahead and mark it as 'forcesuccess'. Currently it passes from time to time, so we are keeping it as it is, although I know you'd like to see it changed ;) Perhaps we should change this, however...
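As a purely illustrative sketch of that lifecycle (the syntax and property names here are hypothetical, not the actual CI configuration format; only the 'forcesuccess' and 'insignificant' terms come from our real setup):

```text
# 1. Platform just added; any result is accepted:
win8-32_developer-build:   forcesuccess

# 2. Builds pass but some autotests are still flaky:
win8-32_developer-build:   insignificant tests

# 3. Platform is stable; build and test results are fully enforced:
win8-32_developer-build:   (no flags)
```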

> 
> The current situation of CI failing for reasons unrelated to the patches under
> test is very frustrating. We need to find out who is responsible for which
> parts of that and able to fix the issues.

True. We will also look into the option of adding a button somewhere so that, even though a build failed, if we can tell by looking at the results that the failure was unrelated to the change, we could bypass the checkpoint and forcefully merge it in Gerrit. This would help you in the cases where we fail you :)

A lot of what I wrote here should perhaps be on a web page. But with the resources we have available, communication and information sharing toward the community is also one of the things that suffers. I'll add it to our backlog, however.

With regards,
-Tony

Tony Sarajärvi
Senior Software Designer
Digia, Qt

Digia Plc
Elektroniikkatie 10 
FI-90590 Oulu 

Email: tony.sarajarvi at digia.com 
Mobile: +358 050 482 1416
http://qt.digia.com 


> 
> Thanks,
> 
> --
> Stephen Kelly <stephen.kelly at kdab.com> | Software Engineer
> KDAB (Deutschland) GmbH & Co.KG, a KDAB Group Company
> www.kdab.com || Germany +49-30-521325470 || Sweden (HQ) +46-563-
> 540090
> KDAB - Qt Experts - Platform-Independent Software Solutions


