Using Zuul CI with Pagure.io

I attended Devconf.cz again this year - I'll try and post a full blog post on that soon. One of the most interesting talks, though, was CI/CD for Fedora packaging with Zuul, where Fabien Boucher and Matthieu Huin introduced the work they've done to integrate a specific Zuul instance (part of the Software Factory effort) with the Pagure instance Fedora uses for packages and also with Pagure.io, the general-purpose Pagure instance that many Fedora groups use to host projects, including us in QA.

They've done a lot of work to make it as simple as possible to hook up a project in either Pagure instance to run CI via Zuul, and it looked pretty cool, so I thought I'd try it on one of our projects and see how it compares to other options, like the Jenkins-based Pagure CI.

I wound up more or less following the instructions on this Wiki page, but it does not give you an example of a minimal framework in the project repository itself to actually run some checks. However, after I submitted the pull request for fedora-project-config as explained on the wiki page, Tristan Cacqueray was kind enough to send me this as a pull request for my project repository.

So, all that was needed to get a kind of 'hello world' process running was:

  1. Add the appropriate web hook in the project options
  2. Add the 'zuul' user as a committer on the project in the project options
  3. Get a pull request merged to fedora-project-config to add the desired project
  4. Add a basic Zuul config which runs a single job

After that, the next step was to have it run useful checks. I set the project up such that all the appropriate checks could be run just by calling tox (which is a great test runner for Python projects) - see the tox configuration. Then, with a bit more help from Tristan, I was able to tweak the Zuul config to run it successfully. This mainly required a couple of things:

  1. Adding nodeset: fedora-31-vm to the Zuul config - this makes the CI job run on a Fedora 31 VM rather than the default CentOS 7 VM (CentOS 7's tox is too old for a modern Python 3 project)
  2. Modifying the job configuration to ensure tox is installed (there's a canned role for this, called ensure-tox) and also all available Python interpreters (using the package module)

This was all pretty small and easy stuff, and we had the whole thing up and running in a few hours. Now it all works great, so whenever a pull request is submitted for the project, the tests are automatically run and the results shown on the pull request.

You can set up more complex workflows where Zuul takes over merging of pull requests entirely - an admin posts a comment indicating a PR is ready to merge, whereupon Zuul will retest it and then merge it automatically if the test succeeds. This can also be used to merge series of PRs together, with proper testing. But for my small project, this simple integration is enough so far.

It's been a positive experience working with the system so far, and I'd encourage others to try it for their packages and Pagure projects!

Uptime

OK, so that was two days longer than I was expecting! Sorry for the extended downtime, folks, especially Fedora folks. It was rather beyond my control. But now I'm (just barely) back, through the single working cable outlet in the house and a powerline ethernet connection to the router, at least until the cable co can come and fix all the other outlets!

Downtime

happyassassin.net - this blog, all email addresses, and anything else hosted there (which should only matter to me...) - will be down for about the next day and a half or so, just For The Record. I'm moving house again, and am now at the whims of the fickle gods of Shaw Cable. Make all homage unto them.

In praise of WebAuthn

tl;dr: I just got a Yubikey 5 and set it up on a bunch of things. You should too, because WebAuthn is awesome.

Now the long version!

Two-factor authentication has been a thing for a while now. You're probably familiar with it in various forms. There's SMS-based 2FA, commonly used by banks, where they text you a code you have to re-type when logging in. Then there are token/one-time-password based systems where you can use a hardware key like a Yubikey or a software authenticator like Google Authenticator or FreeOTP to generate and enter a one-time password when logging into a system.

If you're like I was yesterday, maybe you've got two Yubikeys on your keyring and another in a drawer somewhere and you have to remember which four systems you have set up on which slots on which key, and you've got FreeOTP as a backup.

Maybe you've also kinda heard about "U2F" and had the vague idea that it sounded neat. And also maybe you've read some stuff about "WebAuthn" recently and thought it sounded maybe cool but also maybe confusing and what's going on and maybe this isn't the most important thing you could be figuring out today?

Well, prodded by a couple of mailing list threads I did figure it out a bit, and here's the thing: WebAuthn is spreading, and it's awesome. If you are just a person who wants to log into stuff - it's 2FA done way better.

Here's the cool stuff:

  • You can use one key ('authenticator') to log into as many different WebAuthn-supporting sites as you want (and this is secure, they're not all sharing the same seed or anything)
  • You can register multiple authenticators per site
  • An authenticator can be a hardware key (mainly a Yubikey 5, at this point, but the Solokey is supposed to be a fully open-source WebAuthn-supporting key, only available to backers so far), but you can also use a phone or laptop with a fingerprint reader or facial ID system
  • It works on Linux. Really easily. It works on Firefox (not just Chrome). It works on Firefox on Android. Yeah, all the stuff you kinda automatically assume is going to be a pain in the ass...isn't! It actually fricking works!
  • WebAuthn-compatible keys can still support other systems too...specifically, you can get a Yubikey and use it for WebAuthn but it also still has two OTP slots, and no you don't have to do something stupid to pick which system you're using, it all just magically works somehow, I don't know how and I don't care. The Yubikey and Firefox are also backwards-compatible with U2F, so sites that implemented U2F but didn't update to WebAuthn yet work fine

Seriously, it's awesome. And it actually works, like, right now, really well. On useful sites! Try it! Github supports it, for instance. Go to your Github account, go to the Settings page, go to Security, enable 2FA if you don't have it enabled already, and hit edit on 'Security keys'. Then click 'Register new security key'. Give it a name (like 'phone' or 'yubikey #1' or whatever). If you're using a Yubikey, plug it in and hit the button. If you're using a phone with a fingerprint sensor or facial ID, there'll be an option for 'use this device with a fingerprint' or something like that. Pick it, and touch the sensor or show it your face. And that's it. You're done. Then when you login you just do the same thing (plug in, push button, touch sensor, or show face) and you're in. It's like the fricking future or something.

You can even use a Yubikey via NFC to log in with Firefox on Android (and I assume Chrome too, but I didn't try that). Yeah, I tried it, it worked. First time. (Once I figured out where the NFC sensor was, anyway). You can even apparently use your phone connected via Bluetooth to login on a computer, though I didn't try that yet - the browser should let you pick the Bluetooth-connected phone as the authenticator, then the phone will ask you for your fingerprint or face.

It's all so much frickin' better than re-typing codes from text messages or remembering Yubikey slot numbers. I really did not realize it was gonna be this nice. It is also more secure than OTP-based systems and much more secure than SMS-based systems, which is great, but even if it wasn't it's just nicer. I really hope W3C and Mozilla and Google and Apple and whoever go out and sell it that way as hard as they can.

So far I've set up my Google account (I think Google is still technically using U2F not WebAuthn, but as far as the user experience went it didn't make any difference), Github, Bitwarden (which is a great open-source password management service), and Gandi (I use them for domain registration and DNS, they're great for that), and now I'm busy writing to a ton of other sites to demand they get on the bandwagon already. I used the OTP slots for Fedora and Red Hat internal systems (neither supports WebAuthn yet, unfortunately - one limitation of WebAuthn is that it is fairly 'web-y', it's less suited to systems where you need to authenticate in non-web-protocol scenarios, so FAS and RH auth can't just switch over to it that easily). And my three pre-U2F Yubikeys are wiped and on their way to hardware heaven...

AdamW's Debugging Adventures: "dnf is locked by another application"

Gather round the fire, kids, it's time for another Debugging Adventure! These are posts where I write up the process of diagnosing the root cause of a bug, where it turned out to be interesting (to me, anyway...)

This case - Bugzilla #1750575 - involved dnfdragora, the package management tool used on Fedora Xfce, which is a release-blocking environment for the ARM architecture. It was a pretty easy bug to reproduce: any time you updated a package, the update would work, but then dnfdragora would show an error "DNF is locked by another process. dnfdragora will exit.", and immediately exit.

The bug sat around on the blocker list for a while; Daniel Mach (a DNF developer) looked into it a bit but didn't have time to figure it out all the way. So I got tired of waiting for someone else to do it, and decided to work it out myself.

Where's the error coming from?

As a starting point, I had a nice error message - so the obvious thing to do is figure out where that message comes from. The text appears in a couple of places in dnfdragora - in an exception handler and also in a method for setting up a connection to dnfdaemon. So, if we didn't already know (I happened to) this would be the point at which we'd realize that dnfdragora is a frontend app to a backend - dnfdaemon - which does the heavy lifting.

So, to figure out in more detail how we were getting to one of these two points, I hacked both the points where that error is logged. Both of them read logger.critical(errmsg). I changed this to logger.exception(errmsg). logger.exception is a very handy feature of Python's logging module which logs whatever message you specify, plus a traceback to the current state, just like the traceback you get if the app actually crashes. So by doing that, the dnfdragora log (it logs to a file dnfdragora.log in the directory you run it from) gave us a traceback showing how we got to the error:

2019-10-14 17:53:29,436 <a href="ERROR">dnfdragora</a> dnfdaemon client error: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)
Traceback (most recent call last):
File "/usr/bin/dnfdragora", line 85, in <module>
main_gui.handleevent()
File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 1273, in handleevent
if not self._searchPackages(filter, True) :
File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 949, in _searchPackages
packages = self.backend.search(fields, strings, self.match_all, self.newest_only, tags )
File "/usr/lib/python3.7/site-packages/dnfdragora/misc.py", line 135, in newFunc
rc = func(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/dnfdragora/dnf_backend.py", line 464, in search
newest_only, tags)
File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 508, in Search
fields, keys, attrs, match_all, newest_only, tags))
File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 293, in _run_dbus_async
result = self._get_result(data)
File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 277, in _get_result
self._handle_dbus_error(user_data['error'])
File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 250, in _handle_dbus_error
raise DaemonError(str(err))
dnfdaemon.client.DaemonError: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)</module>

So, this tells us quite a bit of stuff. We know we're crashing in some sort of 'search' operation, and dbus seems to be involved. We can also see a bit more of the architecture here. Note how we have dnfdragora/dnf_backend.py and dnfdaemon/client/init.py included in the trace, even though we're only in the dnfdragora executable here (dnfdaemon is a separate process). Looking at that and then looking at those files a bit, it's quite easy to see that the dnfdaemon Python library provides a sort of framework for a client class called (oddly enough) DnfDaemonBase which the actual client - dnfdragora in our case - is expected to subclass and flesh out. dnfdragora does this in a class called DnfRootBackend, which inherits from both dnfdragora.backend.Backend (a sort of abstraction layer for dnfdragora to have multiple of these backends, though at present it only actually has this one) and dnfdaemon.client.Client, which is just a small extension to DnfDaemonBase that adds some dbus signal handling.

So now we know more about the design we're dealing with, and we can also see that we're trying to do some sort of search operation which looks like it works by the client class communicating with the actual dnfdaemon server process via dbus, only we're hitting some kind of error in that process, and interpreting it as 'dnf is locked by another application'. If we dig a little deeper, we can figure out a bit more. We have to read through all of the backtrace frames and examine the functions, but ultimately we can figure out that DnfRootBackend.Search() is wrapped by dnfdragora.misc.ExceptionHandler, which handles dnfdaemon.client.DaemonError exceptions - like the one that's ultimately getting raised here! - by calling the base class's own exception_handler() on them...and for us, that's BaseDragora.exception_handler, one of the two places we found earlier that ultimately produces this "DNF is locked by another process. dnfdragora will exit" text. We also now have two indications (the dbus error itself, and the code in exception_handler() that the error we're dealing with is "LockedError".

A misleading error...

At this point, I went looking for the text LockedError, and found it in two files in dnfdaemon that are kinda variants on each other - daemon/dnfdaemon-session.py and daemon/dnfdaemon-system.py. I didn't actually know offhand which of the two is used in our case, but it doesn't really matter, because the codepath to LockedError is the same in both. There's a function called check_lock() which checks that self._lock == sender, and if it doesn't, raises LockedError. That sure looks like where we're at.

So at this point I did a bit of poking around into how self._lock gets set and unset in the daemon. It turns out to be pretty simple. The daemon is basically implemented as a class with a bunch of methods that are wrapped by @dbus.service.method, which makes them accessible as DBus methods. (One of them is Search(), and we can see that the client class's own Search() basically just calls that). There are also methods called Lock() and Unlock(), which - not surprisingly - set and release this lock, by setting the daemon class' self._lock to be either an identifier for the DBus client or None, respectively. And when the daemon is first initialized, the value is set to None.

At this point, I realized that the error we're dealing with here is actually a lie in two important ways:

  1. The message claims that the problem is the lock being held "by another application", but that's not what check_lock() checks, really. It passes only if the caller holds the lock. It does fail if the lock is held "by another application", but it also fails if the lock is not held at all. Given all the code we looked at so far, we can't actually trust the message's assertion that something else is holding the lock. It is also possible that the lock is not held at all.
  2. The message suggests that the lock in question is a lock on dnf itself. I know dnf/libdnf do have locking, so up to now I'd been assuming we were actually dealing with the locking in dnf itself. But at this point I realized we weren't. The dnfdaemon lock code we just looked at doesn't actually call or wrap dnf's own locking code in any way. This lock we're dealing with is entirely internal to dnfdaemon. It's really a lock on the dnfdaemon instance itself.

So, at this point I started thinking of the error as being "dnfdaemon is either locked by another DBus client, or not locked at all".

So what's going on with this lock anyway?

My next step, now I understood the locking process we're dealing with, was to stick some logging into it. I added log lines to the Lock() and Unlock() methods, and I also made check_lock() log what sender and self._lock were set to before returning. Because it sets self._lock to None, I also added a log line to the daemon's init that just records that we're in it. That got me some more useful information:

2019-10-14 18:53:03.397784 XXX In DnfDaemon.init now!
2019-10-14 18:53:03.402804 XXX LOCK: sender is :1.1835
2019-10-14 18:53:03.407524 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:07.556499 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
[...snip a bunch more calls to check_lock where the sender is the same...]
2019-10-14 18:53:13.560828 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:13.560941 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:16.513900 XXX In DnfDaemon.init now!
2019-10-14 18:53:16.516724 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is None

so we could see that when we started dnfdragora, dnfdaemon started up and dnfdragora locked it almost immediately, then throughout the whole process of reproducing the bug - run dnfdragora, search for a package to be updated, mark it for updating, run the transaction, wait for the error - there were several instances of DBus method calls where everything worked fine (we see check_lock() being called and finding sender and self._lock set to the same value, the identifier for dnfdragora), but then suddenly we see the daemon's init running again for some reason, not being locked, and then a check_lock() call that fails because the daemon instance's self._lock is None.

After a couple of minutes, I guessed what was going on here, and the daemon's service logs confirmed it - dnfdaemon was crashing and automatically restarting. The first attempt to invoke a DBus method after the crash and restart fails, because dnfdragora has not locked this new instance of the daemon (it has no idea it just crashed and restarted), so check_lock() fails. So as soon as a DBus method invocation is attempted after the dnfdaemon crash, dnfdragora errors out with the confusing "dnf is locked by another process" error.

The crash was already mentioned in the bug report, but until now the exact interaction between the crash and the error had not been worked out - we just knew the daemon crashed and the app errored out, but we didn't really know what order those things happened in or how they related to each other.

OK then...why is dnfdaemon crashing?

So, the question now became: why is dnfdaemon crashing? Well, the backtrace we had didn't tell us a lot; really it only told us that something was going wrong in libdbus, which we could also tell from the dnfdaemon service log:

Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: dbus[226042]: arguments to dbus_connection_unref() were incorrect, assertion "connection->generation == _dbus_current_generation" failed in file ../../dbus/dbus-connection.c line 2823.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: This is normally a bug in some application using the D-Bus library.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]:   D-Bus not built with -rdynamic so unable to print a backtrace

that last line looked like a cue, so of course, off I went to figure out how to build DBus with -rdynamic. A bit of Googling told me - thanks "the3dfxdude"! - that the trick is to compile with --enable-asserts. So I did that and reproduced the bug again, and got a bit of a better backtrace. It's a long one, but by picking through it carefully I could spot - in frame #17 - the actual point at which the problem happened, which was in dnfdaemon.server.DnfDaemonBase.run_transaction(). (Note, this is a different DnfDaemonBase class from dnfdaemon.client.DnfDaemonBase; I don't know why they have the same name, that's just confusing).

So, the daemon's crashing on this self.TransactionEvent('end-run', NONE) call. I poked into what that does a bit, and found a design here that kinda mirrors what happens on the client side: this DnfDaemonBase, like the other one, is a framework for a full daemon implementation, and it's subclassed by a DnfDaemon class here. That class defines a TransactionEvent method that emits a DBus signal. So...we're crashing when trying to emit a dbus signal. That all adds up with the backtrace going through libdbus and all. But, why are we crashing?

At this point I tried to make a small reproducer (which basically just set up a DnfDaemon instance and called self.TransactionEvent in the same way, I think) but that didn't work - I didn't know why at the time, but figured it out later. Continuing to trace it out through code wouldn't be that easy because now we're in DBus, which I know from experience is a big complex codebase that's not that easy to just reason your way through. We had the actual DBus error to work from too - "arguments to dbus_connection_unref() were incorrect, assertion "connection->generation == _dbus_current_generation" failed" - and I looked into that a bit, but there were no really helpful leads there (I got a bit more understanding about what the error means exactly, but it didn't help me understand why it was happening at all).

Time for the old standby...

So, being a bit stuck, I fell back on the most trusty standby: trial and error! Well, also a bit of logic. It did occur to me that the dbus broker is itself a long-running daemon that other things can talk to. So I started just wondering if something was interfering with dnfdaemon's connection with the dbus broker, somehow. This was in my head as I poked around at stuff - wherever I wound up looking, I was looking for stuff that involved dbus.

But to figure out where to look, I just started hacking up dnfdaemon a bit. Now this first part is probably pure intuition, but that self._reset_base() call on the line right before the self.TransactionEvent call that crashes bugged me. It's probably just long experience telling me that anything with "reset" or "refresh" in the name is bad news. :P So I thought, hey, what happens if we move it?

I stuck some logging lines into this run_transaction so I knew where we got to before we crashed - this is a great dumb trick, btw, just stick lines like self.logger('XXX HERE 1'), self.logger('XXX HERE 2') etc. between every significant line in the thing you're debugging, and grep the logs for "XXX" - and moved the self._reset_base() call down under the self.TransactionEvent call...and found that when I did that, we got further, the self.TransactionEvent call worked and we crashed the next time something else tried to emit a DBus signal. I also tried commenting out the self._reset_base() call entirely, and found that now we would only crash the next time a DBus signal was emitted after a subsequent call to the Unlock() method, which is another method that calls self._reset_base(). So, at this point I was pretty confident in this description: "dnfdaemon is crashing on the first interaction with DBus after self._reset_base() is called".

So my next step was to break down what _reset_base() was actually doing. Turns out all of the detail is in the DnfDaemonBase skeleton server class: it has a self._base which is a dnf.base.Base() instance, and that method just calls that instance's close() method and sets self._base to None. So off I went into dnf code to see what dnf.base.Base.close() does. Turns out it basically does two things: it calls self._finalize_base() and then calls self.reset(True, True, True).

Looking at the code it wasn't immediately obvious which of these would be the culprit, so it was all aboard the trial and error train again! I changed the call to self._reset_base() in the daemon to self._base.reset(True, True, True)...and the bug stopped happening! So that told me the problem was in the call to _finalize_base(), not the call to reset(). So I dug into what _finalize_base() does and kinda repeated this process - I kept drilling down through layers and splitting up what things did into individual pieces, and doing subsets of those pieces at a time to try and find the "smallest" thing I could which would cause the bug.

To take a short aside...this is what I really like about these kinds of debugging odysseys. It's like being a detective, only ultimately you know that there's a definite reason for what's happening and there's always some way you can get closer to it. If you have enough patience there's always a next step you can take that will get you a little bit closer to figuring out what's going on. You just have to keep working through the little steps until you finally get there.

Eventually I lit upon this bit of dnf.rpm.transaction.TransactionWrapper.close(). That was the key, as close as I could get to it: reducing the daemon's self._reset_base() call to just self._base._priv_ts.ts = None (which is what that line does) was enough to cause the bug. That was the one thing out of all the things that self._reset_base() does which caused the problem.

So, of course, I took a look at what this ts thing was. Turns out it's an instance of rpm.TransactionSet, from RPM's Python library. So, at some point, we're setting up an instance of rpm.TransactionSet, and at this point we're dropping our reference to it, which - point to ponder - might trigger some kind of cleanup on it.

Remember how I was looking for things that deal with dbus? Well, that turned out to bear fruit at this point...because what I did next was simply to go to my git checkout of rpm and grep it for 'dbus'. And lo and behold...this showed up.

Turns out RPM has plugins (TIL!), and in particular, it has this one, which talks to dbus. (What it actually does is try to inhibit systemd from suspending or shutting down the system while a package transaction is happening). And this plugin has a cleanup function which calls something called dbus_shutdown() - aha!

This was enough to get me pretty suspicious. So I checked my system and, indeed, I had a package rpm-plugin-systemd-inhibit installed. I poked at dependencies a bit and found that python3-dnf recommends that package, which means it'll basically be installed on nearly all Fedora installs. Still looking like a prime suspect. So, it was easy enough to check: I put the code back to a state where the crash happened, uninstalled the package, and tried again...and bingo! The crash stopped happening.

So at this point the case was more or less closed. I just had to do a bit of confirming and tidying up. I checked and it turned out that indeed this call to dbus_shutdown() had been added quite recently, which tied in with the bug not showing up earlier. I looked up the documentation for dbus_shutdown() which confirmed that it's a bit of a big cannon which certainly could cause a problem like this:

"Frees all memory allocated internally by libdbus and reverses the effects of dbus_threads_init().

libdbus keeps internal global variables, for example caches and thread locks, and it can be useful to free these internal data structures.

...

You can't continue to use any D-Bus objects, such as connections, that were allocated prior to dbus_shutdown(). You can, however, start over; call dbus_threads_init() again, create new connections, and so forth."

and then I did a scratch build of rpm with the commit reverted, tested, and found that indeed, it solved the problem. So, we finally had our culprit: when the rpm.TransactionSet instance went out of scope, it got cleaned up, and that resulted in this plugin's cleanup function getting called, and dbus_shutdown() happening. The RPM devs had intended that call to clean up the RPM plugin's DBus handles, but this is all happening in a single process, so the call also cleaned up the DBus handles used by dnfdaemon itself, and that was enough (as the docs suggest) to cause any further attempts to communicate with DBus in dnfdaemon code to blow up and crash the daemon.

So, that's how you get from dnfdragora claiming that DNF is locked by another process to a stray RPM plugin crashing dnfdaemon on a DBus interaction!

Converting fedmsg consumers to fedora-messaging

So in case you hadn't heard, the Fedora infrastructure team is currently trying to nudge people in the direction of moving from fedmsg to fedora-messaging.

Fedmsg is the Fedora project-wide messaging bus we've had since 2012. It backs FMN / Fedora Notifications and Badges, and is used extensively within Fedora infrastructure for the general purpose of "have this one system do something whenever this other system does something else". For instance, openQA job scheduling and result reporting are both powered by fedmsg.

Over time, though, there have turned out to be a few issues with fedmsg. It has a few awkward design quirks, but most significantly, it's designed such that message delivery can never be guaranteed. In practice it's very reliable and messages almost always are delivered, but for building critical systems like Rawhide package gating, the infrastructure team decided we really needed a system where message delivery can be formally guaranteed.

There was initially an idea to build a sort of extension to fedmsg allowing for message delivery to be guaranteed, but in the end it was decided instead to replace fedmsg with a new AMQP-based system called fedora-messaging. At present both fedmsg and fedora-messaging are live and there are bridges in both directions: all messages published as fedmsgs are republished as fedora-messaging messages by a 0MQ->AMQP bridge, and all messages published as fedora-messaging messages are republished as fedmsgs by an AMQP->0MQ bridge. This is intended to ease the migration process by letting you migrate a publisher or consumer of fedmsgs to fedora-messaging at any time without worrying about whether the corresponding consumers and/or publishers have also been migrated.

This is just the sort of project I usually work on in the 'quiet time' after one release comes out and before the next one really kicks into high gear, so since Fedora 30 just came out, last week I started converting the openQA fedmsg consumers to fedora-messaging. Here's a quick write-up of the process and some of the issues I found along the way!

I found these three pages in the fedora-messaging docs to be the most useful:

  1. Consumers
  2. Messages
  3. Configuration (especially the 'consumer-config' part)

Another important bit you might need are the sample config files for the production broker and stable broker.

All the fedmsg consumers I wrote followed this approach, where you essentially write consumer classes and register them as entry points in the project's setup.py. Once the project is installed, the fedmsg-hub service provided by fedmsg runs all these registered consumers (as long as a configuration setting is set to turn them on).

This exact pattern does not exist in fedora-messaging - there is no hub service. But fedora-messaging does provide a somewhat-similar pattern which is the natural migration path for this type of consumer. In this approach you still have consumer classes, but instead of registering them as entry points, you write configuration files for them and place them in /etc/fedora-messaging. You can then run an instantiated systemd service that runs fedora-messaging consume with the configuration file you created.

So to put it all together with a specific example: to schedule openQA jobs, we had a fedmsg consumer class called OpenQAScheduler which was registered as a moksha.consumer called fedora_openqa.scheduler.prod in setup.py, and had a config_key named "fedora_openqa.scheduler.prod.enabled". As long as a config file in /etc/fedmsg.d contained 'fedora_openqa.scheduler.prod.enabled': True, the fedmsg-hub service then ran this consumer. The consumer class itself defined what messages it would subscribe to, using its topic attribute.

In a fedora-messaging world, the OpenQAScheduler class is tweaked a bit to handle an AMQP-style message, and the entrypoint in setup.py and the config_key in the class are removed. Instead, we create a configuration file /etc/fedora-messaging/fedora_openqa_scheduler.toml and enable and start the fm-consumer@fedora_openqa_scheduler.service systemd service. Note that all the necessary bits for this are shipped in the fedora-messaging package, so you need that package installed on the system where the consumer will run.

That configuration file looks pretty much like the sample I put in the repository. This is based on the sample files I mentioned above.

The amqp_url specifies which AMQP broker to connect to and what username to use: in this sample we're connecting to the production Fedora broker and using the public 'fedora' identity. The callback specifies the Python path to the consumer callback class (our OpenQAScheduler class). The [tls] section points to the CA certificate, certificate and private key to be used for authenticating with the broker: since we're using the public 'fedora' identity, these are the files shipped in the fedora-messaging package itself which let you authenticate as that identity. For production use, I think the intent is that you request a separate identity from Fedora infra (who will generate certs and keys for it) and use that instead - so you'd change the amqp_url and the paths in the [tls] section appropriately.

The other key things you have to set are the queue name - which appears twice in the sample file as 00000000-0000-0000-0000-000000000000, for each consumer you are supposed to generate a random UUID with uuidgen and use that as the queue name, each consumer should have its own queue - and the routing_keys in the [[bindings]] section. Those are the topics the consumer will subscribe to - unlike in the fedmsg system, this is set in configuration rather than in the consumer class itself. Another thing you may wish to take advantage of is the consumer_config section: this is basically a freeform configuration store that the consumer class can read settings from. So you can have multiple configuration files that run the same consumer class but with different settings - you might well have different 'production' and 'staging' configurations. We do indeed use this for the openQA job scheduler consumer: we use a setting in this consumer_config section to specify the hostname of the openQA instance to connect to.

So, what needs changing in the actual consumer class itself? For me, there wasn't a lot. For a start, the class should now just inherit from object - there is no base class for consumers in the fedora-messaging world, there's no equivalent to fedmsg.consumers.FedmsgConsumer. You can remove things like the topic attribute (that's now set in configuration) and validate_signatures. You may want to set up a init, which is a good place to read in settings from consumer_config and set up a logger (more on logging in a bit). The method for actually reading a message should be named call() (so yes, fedora-messaging just calls the consumer instance itself on the message, rather than explicitly calling one of its methods). And the message object itself the method receives is slightly different: it will be an instance of fedora_messaging.api.Message or a subclass of it, not just a dict. The topic, body and other bits of the message are available as attributes, not dict items. So instead of message['topic'], you'd use message.topic. The message body is message.body.

Here I ran into a significant wrinkle. If you're consuming a native fedora-messaging message, the message.body will be the actual body of the message. However, if you're consuming a message that was published as a fedmsg and has been republished by the fedmsg->fedora-messaging bridge, message.body won't be what you'd probably expect. Looking at an example fedmsg, we'd probably expect the message.body of the converted fedora-messaging message to be just the msg dict, right? Just a dict with keys repo and agent. However, at present, the bridge actually publishes the entire fedmsg as the message.body - what you get as message.body is that whole dict. To get to the 'true' body, you have to take message.body['msg']. This is a problem because whenever the publisher is converted to fedora-messaging, there won't be a message.body['msg'] any more, and your consumer will likely break. It seems that the bridge's behavior here will likely be changed soon, but for now, this is a bit of a problem.

Once I figured this out, I wrote a little helper function called _find_true_body to fudge around this issue. You are welcome to steal it for your own use if you like. It should always find the 'true' body of any message your consumer receives, whether it's native or converted, and it will work when the bridge is fixed in future too so you won't need to update your consumer when that happens (though later on down the road it'll be safe to just get rid of the function and use message.body directly).

Those things, plus rejigging the logging a bit, were all I needed to do to convert my consumers - it wasn't really that much work in the end.

To dig into logging a bit more: fedmsg consumer class instances had a log() method you could use to send log messages, you didn't have to set up your own logging infrastructure. (Although a problem of this system was that it gave no indication which consumer a log message came from). fedora-messaging does not have this. If you want a consumer to log, you have to set up the logging infrastructure within the consumer, and tweak the configuration file a bit.

The pattern I chose was to import logging and then init a logger instance for each consumer class in its init(), like this:

self.logger = logging.getLogger(self.__class__.__name__)

Then you can log messages with self.logger.info("message") or whatever. I thought that would be all I'd need, but actually, if you just do that, there's nothing set up to actually receive the messages and log them anywhere. So you have to add a bit to the TOML config file that looks like this:

[log_config.loggers.OpenQAScheduler]
level = "INFO"
propagate = false
handlers = ["console"]

the OpenQAScheduler there is the class name; change it to the actual name of the consumer class. That will have the messages logged to the console, which - when you run the consumer as a systemd service - means they wind up in the system journal, which was enough for me. You can also configure a handler to send email alerts, for instance, if you like - you can see an example of this in Bodhi's config file.

One other wrinkle I ran into was with authenticating to the staging broker. The sample configuration file has the right URL and [tls] section for this, but the files referenced in the [tls] section aren't actually in the fedora-messaging package. To successfully connect to the staging broker, as fedora.stg, you need to grab the necessary files from the fedora-messaging git repo and place them into /etc/fedora-messaging.

To see the whole of the changes I had to make to the openQA consumers, you can look at the commits on the fedora-messaging branch of the repo and also this set of commits to the Fedora infra ansible repo.

Wifi display (Miracast) support on Linux!

I was really pleased to see this blog post today, because I was looking for exactly this a few weeks ago and was sad to find it didn't exist. Now it does!

Displaying on an external screen via wifi is a really handy thing that is increasingly commonly available. Lately I'm finding most TVs I run across - including hotel room TVs! - have the feature available. For whatever reason I don't really see many people using or talking about it, but it's great - you can watch videos from your phone or laptop without having to carry an HDMI cable around everywhere and hope the inputs aren't blocked or full on the TV.

So I'm looking forward to testing that out, and hope all the bits land in official distribution packages soon :) Thanks Benjamin!