Pungi 4: the new generation of the Fedora compose tools, and what it means for QA

Adam Williamson

2016-02-15 16:08

There's a big change coming to Fedora 24. The way Fedora composes are built is changing.

How things look now

Currently we have three distinct types of Fedora composes. Probably everyone knows about 'nightly composes' and TCs/RCs. You may not know about the post-release nightly Cloud composes. (I'm not counting the live respins, which are demi-semi-official and not produced by releng).

Nightly composes

'Nightly composes' are an interesting concept - in fact, they hardly exist from any perspective but that of the actual compose process. What really happens nightly is:

buildrawhide (or buildbranched, for Branched nightlies) calls pungify (note that all the heavy lifting is done by build-functions.sh, which is sourced by both buildrawhide and buildbranched)
pungify 'pungifies' the Rawhide repositories (it doesn't know anything about variants - that's Workstation / Server / Cloud). That is, it creates network installer images and provides a kernel and initramfs for direct kernel boots, and writes the old-style metadata like the .treeinfo file
buildrawhide / buildbranched calls a livecd script, a cloud image script, an arm image script and a few other bits
buildrawhide / buildbranched syncs the pungified tree to the public mirrors (each day's Rawhide and Branched trees are also kept around here for a bit)

So we wind up with the pungi outputs and a bunch of Koji tasks that try to build some live images, ARM disk images, and Cloud disk images. There's nothing that really ties the Koji tasks to the pungified repositories, and after the fact there's no metadata about the compose as a whole. There are some fedmsg signals sent during the compose process, but there's no fedmsg signal sent after all the Koji tasks complete.

The pungified repos live on the main public Fedora servers for one day, then get replaced with the next day's compose. Each day's tree is also kept around for a few weeks, but the location where these older trees live is not really documented anywhere, and there's no signalling of it via fedmsg. The images built with Koji never get put anywhere else and nothing in the releng process communicates their location (or the Koji task IDs or anything like that) - if you want them, you have to go find them in Koji somehow, and you're on your own with the "how".

Finally, the old compose processes are both fairly slow. A nightly compose takes approximately 9 hours (including Koji tasks) - it usually starts around 0515 UTC with the final Koji task completing around 1415 UTC. TC and RC composes are similar.

TCs / RCs

TCs and RCs are built using a different script from the same releng repo - it uses the same livecd/arm/cloud build scripts, but builds the variant install trees rather than pungifying the main repositories, and builds the Server DVD and Cloud_Atomic installer image as well as network install images. The Koji task outputs and installer trees are then rather messily glommed together and sync'ed out in a way which requires some manual intervention. There are no fedmsg signals sent at all as part of TC/RC creation. There is no useful metadata for the compose as a whole, only the bits of metadata pungi produces in the installer trees.

Post-release nightly Cloud composes

For the last few months we have also had nightly composes built from the current stable release. These composes contain only Cloud images. They are created to support the two-week Atomic process: every two weeks, one of these composes is 'blessed' and released (yes, the Atomic downloads on the official page are not the ones from the initial Fedora 23 release, but the latest 'two-week Atomic' images). These are created with yet another script.

What's changing

Starting relatively soon (according to Dennis), all Fedora composes will be built with the new Pungi 4 tool and the newer scripts and configuration that go with it. Though it's still called Pungi, this 'new version' is an almost completely different tool (it's actually the union of old-pungi and Red Hat's distribution build tool used for RHEL). With Pungi 4:

Composes will happen frequently - multiple times per day - and take less time
The same process will be used for all composes (at least, I'm assuming the post-release nightlies will use it too)
All composes will (try to) build all images (inc. all the variant installer images)
All image builds happen in Koji (even installer media, which did not before)
An 'Everything' variant will provide a generic network install image
Live images will be created with livemedia-creator
fedmsg signals will be sent throughout the process, including one after the whole compose process is done and the compose is available, with the compose ID and location
Composes will include much better and more comprehensive metadata and logs. Some metadata is still inside the installer trees (these links are to a current Pungi 4 compose at the time of writing, they will likely go stale in future and may not represent the future state of the metadata)
The various reports generated as part of the compose process itself (such as the ones combined into the 'rawhide report' email) will be improved in several ways. For instance, when a package changes, the report will give not only the new NEVR but also the previous one
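To make the 'compose is done' signal a bit more concrete, here is a minimal sketch of how a consumer might pick the compose ID and location out of such a message. The topic suffix, the field names (`compose_id`, `location`, `status`) and the `FINISHED` value are assumptions made for illustration, not something this post specifies, and a real consumer would be handed the message by the fedmsg bus rather than building a dict by hand.

```python
from typing import Optional, Tuple

# Assumed topic suffix and status value (hypothetical; check real messages).
COMPOSE_TOPIC_SUFFIX = "pungi.compose.status.change"
DONE_STATUS = "FINISHED"


def finished_compose(message: dict) -> Optional[Tuple[str, str]]:
    """Return (compose_id, location) if the message announces a finished compose."""
    # Ignore anything that is not a compose status message.
    if not message.get("topic", "").endswith(COMPOSE_TOPIC_SUFFIX):
        return None
    body = message.get("msg", {})
    # Ignore status changes other than "done".
    if body.get("status") != DONE_STATUS:
        return None
    compose_id = body.get("compose_id")
    location = body.get("location")
    # Both pieces are needed to schedule follow-up work.
    if not compose_id or not location:
        return None
    return compose_id, location
```

Anything that wants to run on compose completion (automated tests, nomination, and so on) could call something like `finished_compose()` on each incoming message and only act when it returns a value.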

What it means

Easy task scheduling (RIP fedfind...ish)

The most obvious consequence of the fedmsg and metadata improvements is that it gets much easier to find out when a compose completes and where you can find the bits of it once it's done. We are already scheduling various things to happen on completion of a compose at present (more on exactly what later), but we had to build a whole messy project to make it possible - fedfind.

Fedfind compensates for the lack of fedmsg signals by working out (quite painfully, in the case of nightly composes) when a compose is complete, and compensates for the lack of consistent compose locations, contents and metadata by having lots of hardcoded knowledge about where composes live (which needs updating whenever that changes) and having some quite ugly capabilities for crawling through compose trees and querying Koji and figuring out what images can be considered to be part of a given 'compose'.

With Pungi 4, all of this becomes unnecessary. To find out when a compose is complete, you can simply listen for a fedmsg signal. If you don't want to do that, there's a STATUS file in the compose tree which can tell you the current status of it. As images that come from Koji tasks are properly pulled into the final compose tree, there's no need to go and poke Koji to find lives or Cloud images or ARM images. And as there's comprehensive metadata about the images in the compose, there's no need to crawl through the compose tree to find images and try to infer their identity from their filenames.
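As a rough sketch of the non-fedmsg route, the snippet below checks a STATUS file and flattens image metadata into a simple list. The status strings and the `payload` / `images` / variant / arch layout of the metadata are assumptions for illustration; the real files may differ, and the function names are made up for this example.

```python
from pathlib import Path
from typing import List, Tuple

# Assumed values that mean "the compose is done" (hypothetical).
DONE_STATES = {"FINISHED", "FINISHED_INCOMPLETE"}


def status_means_done(status_text: str) -> bool:
    """Interpret the contents of a STATUS file."""
    return status_text.strip() in DONE_STATES


def compose_is_done(compose_dir: Path) -> bool:
    """Check the STATUS file in a locally accessible compose tree."""
    status_file = compose_dir / "STATUS"
    if not status_file.is_file():
        return False
    return status_means_done(status_file.read_text())


def list_images(metadata: dict) -> List[Tuple[str, str, str, str]]:
    """Flatten an assumed variant -> arch -> [image] layout.

    Returns sorted (variant, arch, type, path) tuples, so nothing has
    to be inferred from filenames.
    """
    images = []
    for variant, arches in metadata["payload"]["images"].items():
        for arch, entries in arches.items():
            for entry in entries:
                images.append((variant, arch, entry["type"], entry["path"]))
    return sorted(images)
```

The point of the second function is that identifying an image becomes a dictionary lookup rather than a guess based on its filename.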

This is great news for me, because I no longer need to maintain the messy ball of hacks that is fedfind! Well, mostly - there are some odds and ends in there that we'll probably still need. But it gets much smaller, and we might be able to move the remaining bits into python-fedora or similar.

More frequent and rapid compose testing

With composes happening multiple times a day and taking significantly less time, we can really shorten the time between a package being tagged in Koji and it appearing in a compose. This is of course great news for Rawhide users in general, but for QA it also means we can get finer grained with our compose testing: each compose can be run through the automated tests we have available (e.g. the openQA tests) as soon as it appears, and thus the longest time between a package being tagged and a compose including it being tested might be seven or eight hours rather than over 32.

Validation process changes: the end of TCs?

I recently floated an idea on the mailing lists: dropping TCs. I won't rehash the mail, but basically, if we have regular composes that look just like release composes, there's no meaning to TCs any more. TCs at first existed because we simply didn't have anything like nightly composes - TCs and RCs were all the composes we had. If we wanted to see how a Fedora build would work right now, hey, we went to releng and asked for a TC. Then when we started getting nightly composes, we kept TCs because nightlies still didn't really look like real composes. Pungi 4 solves both those problems, and so there's no real reason to have TCs any more.

No-one seems to be opposed to this, so it looks like we'll be going ahead and killing off TCs once we officially switch to Pungi 4. RCs will likely remain, as there are issues with identification and certain settings that are flipped for 'released' images, but the difference between an RC and a regular compose will be much smaller than before and we can start to think about whether we might want to move away from a milestone-based development / test process in future.

How the release validation process will likely actually work is that we'll keep 'nominating' nightly composes for manual testing (a process which is really just about controlling the compose firehose, because humans can't cope with running complex tests every six hours and we don't really want seven thousand wiki pages per release) all the way up until Alpha RC, then we'll do a series of RC composes just as usual, then after Alpha release we'll switch back to nominating nightlies until Beta RC, and so forth. openQA - and in future any other automated tests we run on composes - will run on every compose that comes out of releng, and report its results to the wiki when a compose is nominated.

Consolidating post-compose tasks

The final thing I wanted to talk about is the fact that this change gives us a great opportunity: we can consolidate all the various things that happen after a compose. Over time, and especially over the last year or so, we've accreted kind of a lot of these. My list - probably not exhaustive - is below. Of course, all of these things should be emitting fedmsgs on start and on completion.

There might be opportunities for reconciling some capabilities of the bits below, especially the Stage 2 bits: I've been working on making check-compose capable of replacing the two-week atomic check bits. But more importantly, I think it would be a good idea to run as many of these things as possible out of Taskotron. One of taskotron's main strengths, after all, is running tasks based on fedmsg messages, and in an ideal world (I reckon) all of these things would run off fedmsg.

Once we get Pungi 4 deployed, I'd really like it if we could work to have a nice clean fedmsg and Taskotron-based process for running all of these various things, so if you're involved in any of them and I haven't talked to you already, I'd love to hear from you! I'd also love to hear about anything that should be listed below but isn't (and any corrections to the things that are listed there, or anything else in this post).

Stage 0

This is the start of the whole process, just here for completeness (and because it's where 'rawhide report' comes from).

The compose itself
The reports generated as part of the compose: 'rawhide report'

Stage 1

These are the things that happen (or at least ought to happen - not all of them currently do) immediately on compose completion.

Automated testing

We have openQA and autocloud now. Taskotron and Beaker will likely be running automated tests on composes in future (Taskotron already runs several tests, but none of them are part of compose testing). openQA currently uses fedfind to run when composes complete; autocloud listens for Koji task fedmsgs and runs when it sees one that looks like it was for one of the images it tests.

Manual validation test nomination(?)

relval currently does this simply as a cron job: it wakes up at a given time each day and decides whether to nominate that day's compose for manual testing. This is marked ? as it may be appropriate to move it to stage 2, and only nominate composes that pass certain automated tests.

Submission of compose information to PDC(?)

PDC, if you've never heard of it, is a store of information on composes and a web service for accessing it. Information for all Fedora composes should be stored in PDC in future. I don't know how this is handled at present. It is marked ? as it may be appropriate to move it to stage 0 (i.e. have it happen as part of the compose process itself).

Stage 2

These are the things that happen (or ought to happen) after one or more of the things in stage 1.

check-compose

We have a few things that do something along the lines of a "status check" for a compose. check-compose reports when a compose is 'missing' expected images, summarizes the results of automated testing (currently only openQA), and compares the images in the compose to those in the previous compose. It uses fedfind to wait for the compose to complete, and openQA-python-client to wait for openQA tests to be complete.
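The 'missing images' and 'compare to the previous compose' checks boil down to set arithmetic. This is only a sketch of the idea, not check-compose's actual code; identifying an image by a (variant, type, arch) tuple is an assumption made for the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical image identity: (variant, image type, arch).
ImageId = Tuple[str, str, str]


def missing_images(expected: Set[ImageId], present: Set[ImageId]) -> List[ImageId]:
    """Images we expected to find in the compose but did not."""
    return sorted(expected - present)


def compare_composes(previous: Set[ImageId], current: Set[ImageId]) -> Dict[str, List[ImageId]]:
    """Images that appeared or disappeared since the previous compose."""
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }
```

For instance, if the previous compose had a Workstation live image and the current one does not, it would show up under `removed`, and also in `missing_images()` if it is on the expected list.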

Two-week Atomic check

The two-week Atomic push script queries Datagrepper (a cache of fedmsg messages, more or less) to check the autocloud results and find the 'latest successful' post-release nightly compose, in order to release it.
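The 'latest successful' selection could look something like the sketch below. This is not the push script's real logic: the record fields (`compose_id`, `status`, `timestamp`) are hypothetical, and treating a compose as successful only when every recorded test result for it is a success is an assumption.

```python
from collections import defaultdict
from typing import Iterable, Optional


def latest_successful_compose(results: Iterable[dict]) -> Optional[str]:
    """Pick the newest compose whose recorded results all succeeded.

    Each result is assumed to look like:
    {"compose_id": str, "status": "success" or other, "timestamp": number}
    """
    statuses = defaultdict(list)
    newest = {}
    for result in results:
        cid = result["compose_id"]
        statuses[cid].append(result["status"])
        # Remember the most recent result time seen for this compose.
        newest[cid] = max(newest.get(cid, 0), result["timestamp"])
    good = [cid for cid, seen in statuses.items() if all(s == "success" for s in seen)]
    if not good:
        return None
    return max(good, key=lambda cid: newest[cid])
```

In practice the list of results would come from cached test messages rather than being built by hand.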

compose-utils(?)

compose-utils has a changelog tool that identifies the packages that changed between two composes, and produces a diff of the changelogs of all changed packages (thus identifying all the changes). This is similar to something the Rawhide compose report does, but that only prints the changelog for each new build (so if that package actually had more than one new build since the previous compose, it does not show the changelog for the earlier new builds). I'm not sure if this is intended for standalone use, or if it's expected to be integrated with pungi somehow (and thus should be in stage 0).
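To illustrate the difference between 'changelog of the newest build' and 'diff of the changelogs', here is a small sketch. It is not compose-utils' implementation; representing a compose as a package name to NEVR mapping and a changelog as a list of entry strings are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple


def changed_packages(
    previous: Dict[str, str], current: Dict[str, str]
) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Return (name, old NEVR, new NEVR) for each package that differs.

    None as the old NEVR means the package is new; None as the new NEVR
    means it was removed.
    """
    changes = []
    for name in sorted(set(previous) | set(current)):
        old, new = previous.get(name), current.get(name)
        if old != new:
            changes.append((name, old, new))
    return changes


def new_changelog_entries(old_entries: List[str], new_entries: List[str]) -> List[str]:
    """Entries present in the new changelog but not the old one.

    Because this diffs whole changelogs, entries from intermediate builds
    between the two composes are included, not just the latest one.
    """
    seen = set(old_entries)
    return [entry for entry in new_entries if entry not in seen]
```

If a package went through two builds between composes, `new_changelog_entries()` would return both builds' entries, which is the case the per-build report misses.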

Submission of test information to PDC

PDC can also store some information on the test status of a given compose. It hasn't yet been decided whether doing this would be of any use to us, but if so, it should be done after the relevant tests have completed, of course.