[Development] On the reliability of CI

Fri Oct 26 01:45:14 CEST 2012

Shawn Rutledge said:
> On Thu, Oct 25, 2012 at 01:00:47PM +1000, Rohan McGovern wrote:
> > Replying here to some comments on IRC, since I'm rarely online at the
> > same time as the others, but I don't want to let all the comments go
> > unanswered...
> > 
> > > steveire> [06:32:44] CI is seriously depressing. For the last 24 hours
> > > there has been one successful merge. Many of the others are failing
> > > because of something in network.
> 
> Personally I think the fundamental thing which CI could do better is to
> triage problems.  Often patches get staged in large batches, so when the
> whole batch fails it's very easy to take a quick look at the failure,
> think "that can't possibly be because of what I did", and leave it to 
> someone else who presumably understands his own code better.  But maybe
> that person also takes a while to realize that his code was really the 
> cause.  
> 
> I think when a test fails, the CI system should try to break down the 
> patch set in some way.  For example it could divide the patch set in half, 
> arbitrarily, and see if half of them will integrate successfully, then
> the other half, and continue this recursively until the one bad patch is
> found, or at least a smaller subset.
> 

I think there are two major flaws in need of attention.

One of them is the batching of changes, as you say; that's clearly the
least scalable part of the system right now.

The other is the fact that the results are not accurate enough
(primarily due to flaky tests).

The two problems are related; the presence of flaky tests effectively
poisons any kind of advanced test strategy, like the bisection you
mentioned. Sometimes it's already a challenge to get one test run with
accurate results; a bisection process needs a whole string of accurate
results, so I think it's unlikely to give a satisfactory outcome without
improving the test stability somehow.
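
To put rough numbers on that, here is a minimal Python sketch of the
halving strategy described above.  Everything in it is hypothetical and
for illustration only: the names find_bad_changes, passes and
chance_all_accurate, the 16-change batch and the 90% per-run accuracy
figure are assumptions, not measurements from our CI.  It also assumes
that changes are independent (a failure caused only by two changes
interacting would not be found) and that spurious failures in different
test runs are independent of each other.

```python
from typing import Callable, List, Sequence

Change = str  # hypothetical: a change is identified by a plain string


def find_bad_changes(
    changes: Sequence[Change],
    passes: Callable[[Sequence[Change]], bool],
) -> List[Change]:
    """Recursively halve a batch until single failing changes remain.

    `passes` stands in for one full (and assumed accurate) test run of
    the given subset.  Each half is tested on its own, so a failure
    that needs changes from both halves is not detected.
    """
    if passes(changes):
        return []
    if len(changes) == 1:
        return list(changes)
    mid = len(changes) // 2
    return (find_bad_changes(changes[:mid], passes)
            + find_bad_changes(changes[mid:], passes))


def chance_all_accurate(per_run_accuracy: float, runs: int) -> float:
    """Probability that every one of `runs` independent runs is accurate."""
    return per_run_accuracy ** runs


if __name__ == "__main__":
    batch = [f"change-{i}" for i in range(16)]
    run_log = []  # one entry per simulated test run

    def passes(subset: Sequence[Change]) -> bool:
        run_log.append(len(subset))
        return "change-11" not in subset  # pretend change-11 is the bad one

    print(find_bad_changes(batch, passes), "found in", len(run_log), "runs")
    # Assumed figure: each run gives an accurate result 90% of the time.
    print(chance_all_accurate(0.9, len(run_log)))
```

With one bad change in a batch of 16, this sketch needs 9 test runs.
If each run were accurate 90% of the time, the chance that all 9 are
accurate would be about 39%, so more often than not at least one step
of the bisection would give a wrong answer.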

That's also why I am resistant to some features commonly suggested to save
test time at the cost of accuracy, such as incremental builds and applying
heuristics to decide what to build/test.

I also wonder if one might spend a lot of time implementing a complex
post-test bisection process, then find it is easily outperformed by a
single human who reads the failure message and the one-line summary of
each change and makes a judgment call about what to re-stage... of
course, human time is worth a lot more than computer time.

Personally I think the logical next step, which could be implemented
in parallel to the current setup without drastic new hardware purchases,
would be to provide a service where changes in gerrit can be submitted
for testing separately from the "please merge to stable branch"
mechanism.  Right now, if you know your change is risky and, with the
best intentions, would like it tested separately, you have no mechanism
for doing that.

If that system worked well, it seems to me that a de facto rule would
naturally arise: you should not merge your change to staging unless it
has already passed in this system or you have a good reason for doing
so. Then we could consider making the system semi-mandatory, or attempt
to build something more complex on top of it (such as described in
Charley's mail).

Extending the Early Warning System (sanity bot and other similar bots)
to also cover compilation and autotests might help, but I have a feeling
that system could never be extended to cover as many configurations as
the full CI, which is likely to always have some un-sandboxable or
un-virtualizable platforms.

> At previous jobs I've seen various more or less unpleasant social regimes
> to prevent "breaking the build", but didn't like any of those, and they
> are not amenable to distributed projects anyway.  For example, the 
> build master does a build every day, or every Friday, and personally 
> nags people if it fails (that was mid-90s, before continuous
> integration).  Later with CI, maybe you have to pay a fine when you break 
> the build; you have to wear a rubber chicken around your neck for a day; 
> or maybe the rubber chicken is used as a token, you can only integrate a 
> set of patches if you possess the chicken, you must control which set of 
> patches are in the batch (ensure you understand them, or at least 
> understand that they are definitely independent), and you cannot pass 
> on the chicken to someone else until all the tests pass.  Probably 
> most commercial development is done under some such brute-force
> regime.  But this is a technical problem, seems like it should have 
> a technical solution.  I can only imagine for example that Google 
> has a better system for internal development, I just don't know what it is.
> 

Trolltech also had a history of some approaches like that.  All were
well intentioned, but I think they were relatively ineffective compared to
not letting changes in until after they've passed testing.