[Development] New bot: LLM CI Failure Analysis

Daniel Smith daniel.smith at qt.io
Wed Oct 30 10:58:09 CET 2024


A couple of months ago, I trialled a new proof-of-concept during a bugfix sprint to help reduce the investigation time required after an integration fails in CI. It was met with a very good reception internally at TQtC, so work continued on improving its reliability. The new CI Failure Analysis Bot is now deployed in production and will comment on most changes that fail CI, with some intentional exceptions such as submodule updates. The bot runs in all repos.

So what is it?
The bot operates in a few stages:

  1. For each change in the failed integration, the log is gathered and the last 1000 or so lines are fed into an LLM (today GPT-4o) with a prompt to extract the most relevant failure from the log and snip it out, along with some other relevant data points such as an error type classification, relevant filenames, etc.
  2. The bot then collects the source of the relevant files, as well as the test source if a test was involved.
  3. The diff of the change itself is also collected.
  4. Once all of the needed data has been collected, the log snippet, sources, and change diff are fed into an LLM (today also GPT-4o), which is asked to produce a brief summary of the failure along with an attempted determination of whether the failure was directly caused by the change being analysed.
  5. The resulting summary is posted to the change that failed CI.
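
For the curious, here is a rough sketch of that flow in Python. It is purely illustrative and not the bot's actual code: the prompts are paraphrased, the client setup is simplified, and helpers like fetch_source and parse_filenames stand in for the real repository and Gerrit plumbing.

# Illustrative sketch only -- not the bot's real implementation.
# Assumes the OpenAI Python client; in production an Azure OpenAI
# deployment is used instead.
from openai import OpenAI

client = OpenAI()

def parse_filenames(extraction: str) -> list[str]:
    # Hypothetical helper: the real bot gets structured output from the
    # model; here we just grab tokens that look like file paths.
    return [tok for tok in extraction.split() if "/" in tok and "." in tok]

def analyse_failed_change(log_text: str, change_diff: str, fetch_source) -> str:
    """Run the two LLM stages described above and return a summary."""
    # Stage 1: feed the tail of the CI log to the model and ask it to
    # extract the most relevant failure plus metadata (error type, files).
    log_tail = "\n".join(log_text.splitlines()[-1000:])
    extraction = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": ("Extract the most relevant failure from this CI log, "
                        "its error type, and the source files involved:\n\n"
                        + log_tail),
        }],
    ).choices[0].message.content

    # Stages 2-3: collect the sources named in the extraction; the diff of
    # the change itself is passed in by the caller. fetch_source is a
    # hypothetical callable returning file contents from the repository.
    sources = "\n\n".join(fetch_source(name) for name in parse_filenames(extraction))

    # Stage 4: ask the model for a brief summary and a verdict on whether
    # the change under analysis directly caused the failure.
    summary = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": ("Log excerpt:\n" + extraction
                        + "\n\nRelevant sources:\n" + sources
                        + "\n\nChange diff:\n" + change_diff
                        + "\n\nSummarise the failure briefly and state whether "
                          "this change is likely to have caused it."),
        }],
    ).choices[0].message.content

    # Stage 5: the real bot posts this as a comment on the failed change.
    return summary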

Why?
Historically, the shortcut for determining whether your change caused the CI failure has been to simply restage it until it becomes obvious that your change isn't passing for some reason. Running pre-checks has reduced this practice significantly, but changes may still cause failures on platforms that can't be tested in a standard pre-check.

This bot aims to give users a quick pointer. If it is fairly clear that the failure was not caused by the change, it will simply say so and suggest a restage. If there is a clear indication that the change was the cause, the bot will attempt to point that out, but it intentionally avoids giving specific solutions.
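
To make that behaviour concrete, the verdict-to-comment step could be imagined as something like the following (again purely illustrative; the actual verdict representation and comment wording are not shown here):

# Hypothetical mapping from the analysis verdict to the posted comment.
def comment_for(verdict: str, summary: str) -> str:
    if verdict == "not_caused_by_change":
        return ("This failure does not appear to be caused by your change; "
                "a restage is probably sufficient.\n\n" + summary)
    # Otherwise point at the likely connection, but offer no specific fix.
    return ("This failure appears to be related to your change.\n\n" + summary)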

Is it accurate?
Fairly, but each analysis is generated in a single shot; only one result is produced, and it isn't cross-checked by running the analysis multiple times. If TQtC invests in on-premises LLM hardware, or if the price of online models continues to fall, accuracy could be improved with a multi-shot consensus. No matter what, it's still not a human and isn't actually intelligent. It may make mistakes or produce a misleading analysis, so always bear that in mind when troubleshooting a change.
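
As an illustration of what a multi-shot consensus could look like (this is not something the bot does today, just a sketch of the idea): run the final analysis several times independently and keep the verdict that the majority of runs agree on.

from collections import Counter

def consensus_verdict(analyse_once, n: int = 5) -> str:
    # 'analyse_once' is a hypothetical callable that runs one full analysis
    # and returns its verdict, e.g. "caused by change" / "not caused by change".
    verdicts = [analyse_once() for _ in range(n)]
    # Keep the verdict the majority of runs agreed on.
    return Counter(verdicts).most_common(1)[0][0]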

What about data privacy and copyright?
The current online LLM analysis is performed using Microsoft Azure OpenAI services. No data is stored, and no data is used for LLM training purposes. The service used is fully GDPR compliant and has been cleared for use at TQtC and with open-source Qt Project code.

What's next?
If you see an analysis that isn't correct, please contact me. Either the bot didn't find the correct error in the log, it failed to retrieve the sources correctly, or it may otherwise simply benefit from changes to the prompts.


Best regards,
-Daniel

