[Development] On the reliability of CI

Stephen Kelly stephen.kelly at kdab.com
Thu Oct 25 09:52:04 CEST 2012


On Thursday, October 25, 2012 13:00:47 Rohan McGovern wrote:
> Replying here to some comments on IRC, since I'm rarely online at the
> same time as the others, but I don't want to let all the comments go
> unanswered...

> > jpnurmi> [07:30:23] steveire: np, those tests have been annoying me
> > several times :)
> > 
> > steveire> [07:31:01] Yes. But why did I get so much pushback on fixing it?
> > Something for qt-project to think about.
> > 
> > sahumada> [07:31:35] because you are not fixing it .. you are hiding it :)
> > 
> > steveire> [07:32:23] I'm fixing the problem that nothing has any chance of
> > integrating. With your attitude, insignificant_test and QSKIP would not
> > exist or be needed.
> 
> I think it's great to have more people actively doing something about
> failing tests, as long as they take responsibility for their actions.
> The alternative of, when you see a flaky test, waiting for "someone" to
> do something, is not going to work (any more?)

Yes, I agree.

> 
> It might be good to have some guidelines about the best ways to handle
> flaky tests, since there are several options.

Yes, I agree.

> > <steveire> [10:34:22] Right. Anyone who can do anything doesn't really
> > care. This is the kind of thing that should be fixable quickly
> 
> The first part is false.  I care, and I can do something - just not at
> the time you've reported the issue (although I was probably awake, I
> made a choice a while ago to minimize time spent fixing problems outside
> of normal working hours, because I felt it was burning me out).

Not working outside normal working hours makes a lot of sense. I wouldn't 
encourage anyone to change that.

I'm glad you care about the CI system too. My comment was intended to refer to 
the fact that everyone kept re-staging the same changes without realizing that 
it wasn't going to work until the network tests were excluded somehow. Also, 
when I raised the issue to try to do something about it, I didn't exactly get 
full support: I got a few +1s and 'let's wait for the maintainer to show up'.

As you said, there are no longer Nokia employees chasing this stuff down, so 
we probably need a guideline stating either 1) that what I was trying to do 
was correct and should be supported, or 2) that something else concrete should 
be done and supported instead.

Many of us are capable of banding together to fix issues like this when 
needed, even if Lars/you/someone else is not around. I think we should be 
confident enough to do that among the people who are around. That's the kind 
of 'care' that I meant. Sorry for any misunderstanding.

> > steveire> [10:36:50] And yet, there's been no communication on the mailing
> > list about the network problems (affecting everyone staging anything),
> > despite the fact that it's been known since Monday at least.
> > <steveire> [10:39:19] The insignification should have been done on
> > Monday imo
> 
> I didn't understand this part.

As was pointed out, I had missed the start of this thread with the heads-up 
posted on Friday. The reference to Monday was about this related bug that was 
posted as a comment: https://bugreports.qt-project.org/browse/QTBUG-27673

I had no idea whether they were separate issues, so I assumed they were the 
same. That bug appears to say that the network autotests were broken, so I 
thought that at that point (if they were known to be broken) they should 
already have been marked insignificant.

I hope that clarifies what I wrote at least. I understand now that it was not 
the same issue.

> Please note that several days of instability doesn't imply several days
> of the same problem going unfixed. In the last few days I've also been
> debugging mysterious OOM conditions from the kernel on some Linux
> builders, metacity crashes caused by Qt autotests (which do not
> themselves fail but cause later tests to fail), and exacerbation of
> these conditions by test machines mysteriously failing to reboot
> themselves between builds.
> So, although it might look from your point of view that there have been
> several days of "generic instability" with no activity, in fact there
> are a few different things going on.

> I know it's frustrating to have some tool blocking your work and not
> being able to do much about it.  Maybe this is why discussions about the
> CI so often veer into toxic semi-rants and baseless assumptions. Please
> do try to make a conscious effort to avoid this, because it acts as a
> disincentive to work on the system.  This kind of thing is probably one
> of the reasons why sysadmins tend to stay aloof from developers.

Thanks for all your work here! Thanks also for the clarifications and sorry 
for the rant.

Thanks,

-- 
Stephen Kelly <stephen.kelly at kdab.com> | Software Engineer
KDAB (Deutschland) GmbH & Co.KG, a KDAB Group Company
www.kdab.com || Germany +49-30-521325470 || Sweden (HQ) +46-563-540090
KDAB - Qt Experts - Platform-Independent Software Solutions
** Qt Developer Conference: http://qtconference.kdab.com/ **