[Development] On the reliability of CI

Thu Oct 25 05:00:47 CEST 2012

Replying here to some comments on IRC, since I'm rarely online at the
same time as the others, but I don't want to let all the comments go
unanswered...

> steveire> [06:32:44] CI is seriously depressing. For the last 24 hours
> there has been one successful merge. Many of the others are failing
> because of something in network.

> richmoore1> [06:40:03] steveire: a lot of this seems to be caused by the
> moving of the CI infrastructure to digia. it doesn't seem to be working
> fully yet

I don't think that's true.  As far as I know, the projects migrated to
Digia have been working fine, and only the Nokia system (which has lost
the majority of its support staff) has been having problems.

The Nokia qt-test-server kernel recently started to produce
"kernel: [14564774.569761] swapper: page allocation failure. order:4, mode:0x4020"
after 377 days of uptime with few problems.  I don't think this is directly
related to any Nokia -> Digia transition activities, rather an
unfortunate coincidence of timing.

> ThorbjornTux> [06:54:20] steveire: there was a discussion not that long ago
> in here ... I think that the conclusion were that tests that failed without reason
> was ok to be marked as insignificant .... (as you suggested).
> ThorbjornTux> [06:54:47] steveire: the problem seems only to be if anybody does it ..

> steveire> [06:55:10] Exactly. There used to be people who did things like that.

True, there used to be Nokia employees reading every failure report and
chasing up apparently unstable tests, either trying to fix the tests or
acknowledging them via bug reports and marking them insignificant.
Those people are gone and the test results are likely to be less stable
until they're replaced - either by more people doing the same job, or
an automated solution to achieve stable test results from an unstable
product.

> jpnurmi> [07:30:23] steveire: np, those tests have been annoying me several times :)

> steveire> [07:31:01] Yes. But why did I get so much pushback on fixing it? Something
> for qt-project to think about.

> sahumada> [07:31:35] because you are not fixing it .. you are hiding it :)

> steveire> [07:32:23] I'm fixing the problem that nothing has any chance of integrating.
> With your attitude, insignificant_test and QSKIP would not exist or be needed.

I think it's great to have more people actively doing something about
failing tests, as long as they take responsibility for their actions.
The alternative, waiting for "someone" to do something when you see a
flaky test, is not going to work (any more?).

It might be good to have some guidelines about the best ways to handle
flaky tests, since there are several options.

> torarne> [09:18:47] anyone got powers to put things into qt5.git
> without the ci getting in the way?

There's no built-in mechanism to bypass the system.  We haven't needed
one so far; we've always managed to handle problems as they arise.
If it were an acceptable option to bypass the system when problems occur,
it seems to me it would greatly reduce the incentive to fix the
problems.

> steveire> [10:34:22] Right. Anyone who can do anything doesn't really
> care. This is the kind of thing that should be fixable quickly

The first part is false.  I care, and I can do something - just not at
the time you reported the issue (although I was probably awake, I
made a choice a while ago to minimize time spent fixing problems outside
of normal working hours, because I felt it was burning me out).

Actually, every CI failure which is not related to any of the changes
under test slightly erodes my soul.
I can guarantee I've been at least as frustrated as any users of the
system, during its most unreliable times.

The latter part is true and the problem was fixed quickly (for some
subjective value of "quickly"), once it was known.

> richmoore1> [10:36:01] doing CI from one side of the world to the other
> is optimistic

Yup, we used to have the Pulse server and all clients located together
in Brisbane.
The migration to Jenkins meant the server was moved to Europe.
We have suffered a little bit from that.
Luckily, this will soon be over; just a few more days and everything
will be operating out of Europe.

> steveire> [10:36:50] And yet, there's been no communication on the mailing
> list about the network problems (affecting everyone staging anything),
> despite the fact that it's been known since Monday at least.
> steveire> [10:39:19] The insignification should have been done on
> Monday imo

I didn't understand this part.
No problem has been known since Monday; this seems to be a false
assumption.
The specific network problem you're complaining about was reported to
JIRA by you, last night at 9pm my time, and fixed by me within the first
30 minutes of my working day today.
Reporting problems greatly increases the likelihood of a timely fix.

You'll be able to get technical support within your own timezone once
the transition to Digia completes.

Please note that several days of instability doesn't imply several days
of the same problem going unfixed. In the last few days I've also been
debugging mysterious OOM conditions from the kernel on some Linux
builders, metacity crashes caused by Qt autotests (which do not
themselves fail but cause later tests to fail), and exacerbation of
these conditions by test machines mysteriously failing to reboot
themselves between builds.
So, although it might look from your point of view that there have been
several days of "generic instability" with no activity, in fact there
are a few different things going on.

> torarne> [10:42:41] what about blames, bisecting integration runs, incremental
> builds, testing subsets of tests, single-patch integrations, paralysing building
> and testing, any work in those areas?

Not that I'm aware of.

Unfortunately in the Nokia times, the standing directive from management
(at least for the last 1-2 years) was to spend the minimum time possible
on the CI system.
That's why it had virtually no feature improvements in that time. (The
last notable feature added was to allow tests marked with
CONFIG+=parallel_test to run in parallel.)

I hope Digia will put more resources toward improving the system rather
than merely maintaining it as-is.

=====

I know it's frustrating to have some tool blocking your work and to be
unable to do much about it.  Maybe this is why discussions about the
CI so often veer into toxic semi-rants and baseless assumptions. Please
do try to make a conscious effort to avoid this, because it acts as a
disincentive to work on the system.  This kind of thing is probably one
of the reasons why sysadmins tend to stay aloof from developers.