happyassassin.net - this blog, all email addresses, and anything else hosted there (which should only matter to me...) - will be down for about the next day and a half or so, just For The Record. I'm moving house again, and am now at the whims of the fickle gods of Shaw Cable. Make all homage unto them.
tl;dr: I just got a Yubikey 5 and set it up on a bunch of things. You should too, because WebAuthn is awesome.
Now the long version!
Two-factor authentication has been a thing for a while now. You're probably familiar with it in various forms. There's SMS-based 2FA, commonly used by banks, where they text you a code you have to re-type when logging in. Then there are token/one-time-password based systems where you can use a hardware key like a Yubikey or a software authenticator like Google Authenticator or FreeOTP to generate and enter a one-time password when logging into a system.
If you're like I was yesterday, maybe you've got two Yubikeys on your keyring and another in a drawer somewhere and you have to remember which four systems you have set up on which slots on which key, and you've got FreeOTP as a backup.
Maybe you've also kinda heard about "U2F" and had the vague idea that it sounded neat. And also maybe you've read some stuff about "WebAuthn" recently and thought it sounded maybe cool but also maybe confusing and what's going on and maybe this isn't the most important thing you could be figuring out today?
Well, prodded by a couple of mailing list threads I did figure it out a bit, and here's the thing: WebAuthn is spreading, and it's awesome. If you are just a person who wants to log into stuff - it's 2FA done way better.
Here's the cool stuff:
- You can use one key ('authenticator') to log into as many different WebAuthn-supporting sites as you want (and this is secure, they're not all sharing the same seed or anything)
- You can register multiple authenticators per site
- An authenticator can be a hardware key (mainly a Yubikey 5, at this point, but the Solokey is supposed to be a fully open-source WebAuthn-supporting key, only available to backers so far), but you can also use a phone or laptop with a fingerprint reader or facial ID system
- It works on Linux. Really easily. It works on Firefox (not just Chrome). It works on Firefox on Android. Yeah, all the stuff you kinda automatically assume is going to be a pain in the ass...isn't! It actually fricking works!
- WebAuthn-compatible keys can still support other systems too...specifically, you can get a Yubikey and use it for WebAuthn but it also still has two OTP slots, and no you don't have to do something stupid to pick which system you're using, it all just magically works somehow, I don't know how and I don't care. The Yubikey and Firefox are also backwards-compatible with U2F, so sites that implemented U2F but didn't update to WebAuthn yet work fine
Seriously, it's awesome. And it actually works, like, right now, really well. On useful sites! Try it! Github supports it, for instance. Go to your Github account, go to the Settings page, go to Security, enable 2FA if you don't have it enabled already, and hit edit on 'Security keys'. Then click 'Register new security key'. Give it a name (like 'phone' or 'yubikey #1' or whatever). If you're using a Yubikey, plug it in and hit the button. If you're using a phone with a fingerprint sensor or facial ID, there'll be an option for 'use this device with a fingerprint' or something like that. Pick it, and touch the sensor or show it your face. And that's it. You're done. Then when you login you just do the same thing (plug in, push button, touch sensor, or show face) and you're in. It's like the fricking future or something.
You can even use a Yubikey via NFC to log in with Firefox on Android (and I assume Chrome too, but I didn't try that). Yeah, I tried it, it worked. First time. (Once I figured out where the NFC sensor was, anyway). You can even apparently use your phone connected via Bluetooth to login on a computer, though I didn't try that yet - the browser should let you pick the Bluetooth-connected phone as the authenticator, then the phone will ask you for your fingerprint or face.
It's all so much frickin' better than re-typing codes from text messages or remembering Yubikey slot numbers. I really did not realize it was gonna be this nice. It is also more secure than OTP-based systems and much more secure than SMS-based systems, which is great, but even if it wasn't it's just nicer. I really hope W3C and Mozilla and Google and Apple and whoever go out and sell it that way as hard as they can.
So far I've set up my Google account (I think Google is still technically using U2F not WebAuthn, but as far as the user experience went it didn't make any difference), Github, Bitwarden (which is a great open-source password management service), and Gandi (I use them for domain registration and DNS, they're great for that), and now I'm busy writing to a ton of other sites to demand they get on the bandwagon already. I used the OTP slots for Fedora and Red Hat internal systems (neither supports WebAuthn yet, unfortunately - one limitation of WebAuthn is that it is fairly 'web-y', it's less suited to systems where you need to authenticate in non-web-protocol scenarios, so FAS and RH auth can't just switch over to it that easily). And my three pre-U2F Yubikeys are wiped and on their way to hardware heaven...
Gather round the fire, kids, it's time for another Debugging Adventure! These are posts where I write up the process of diagnosing the root cause of a bug, where it turned out to be interesting (to me, anyway...)
This case - Bugzilla #1750575 - involved dnfdragora, the package management tool used on Fedora Xfce, which is a release-blocking environment for the ARM architecture. It was a pretty easy bug to reproduce: any time you updated a package, the update would work, but then dnfdragora would show an error "DNF is locked by another process. dnfdragora will exit.", and immediately exit.
The bug sat around on the blocker list for a while; Daniel Mach (a DNF developer) looked into it a bit but didn't have time to figure it out all the way. So I got tired of waiting for someone else to do it, and decided to work it out myself.
Where's the error coming from?
As a starting point, I had a nice error message - so the obvious thing to do is figure out where that message comes from. The text appears in a couple of places in dnfdragora - in an exception handler and also in a method for setting up a connection to dnfdaemon. So, if we didn't already know (I happened to) this would be the point at which we'd realize that dnfdragora is a frontend app to a backend - dnfdaemon - which does the heavy lifting.
So, to figure out in more detail how we were getting to one of these two points, I hacked both the points where that error is logged. Both of them read
logger.critical(errmsg). I changed this to
logger.exception is a very handy feature of Python's logging module which logs whatever message you specify, plus a traceback to the current state, just like the traceback you get if the app actually crashes. So by doing that, the dnfdragora log (it logs to a file
dnfdragora.log in the directory you run it from) gave us a traceback showing how we got to the error:
2019-10-14 17:53:29,436 <a href="ERROR">dnfdragora</a> dnfdaemon client error: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36) Traceback (most recent call last): File "/usr/bin/dnfdragora", line 85, in <module> main_gui.handleevent() File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 1273, in handleevent if not self._searchPackages(filter, True) : File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 949, in _searchPackages packages = self.backend.search(fields, strings, self.match_all, self.newest_only, tags ) File "/usr/lib/python3.7/site-packages/dnfdragora/misc.py", line 135, in newFunc rc = func(*args, **kwargs) File "/usr/lib/python3.7/site-packages/dnfdragora/dnf_backend.py", line 464, in search newest_only, tags) File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 508, in Search fields, keys, attrs, match_all, newest_only, tags)) File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 293, in _run_dbus_async result = self._get_result(data) File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 277, in _get_result self._handle_dbus_error(user_data['error']) File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 250, in _handle_dbus_error raise DaemonError(str(err)) dnfdaemon.client.DaemonError: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)</module>
So, this tells us quite a bit of stuff. We know we're crashing in some sort of 'search' operation, and dbus seems to be involved. We can also see a bit more of the architecture here. Note how we have
dnfdaemon/client/init.py included in the trace, even though we're only in the dnfdragora executable here (dnfdaemon is a separate process). Looking at that and then looking at those files a bit, it's quite easy to see that the dnfdaemon Python library provides a sort of framework for a client class called (oddly enough)
DnfDaemonBase which the actual client - dnfdragora in our case - is expected to subclass and flesh out. dnfdragora does this in a class called
DnfRootBackend, which inherits from both
dnfdragora.backend.Backend (a sort of abstraction layer for dnfdragora to have multiple of these backends, though at present it only actually has this one) and
dnfdaemon.client.Client, which is just a small extension to
DnfDaemonBase that adds some dbus signal handling.
So now we know more about the design we're dealing with, and we can also see that we're trying to do some sort of search operation which looks like it works by the client class communicating with the actual
dnfdaemon server process via dbus, only we're hitting some kind of error in that process, and interpreting it as 'dnf is locked by another application'. If we dig a little deeper, we can figure out a bit more. We have to read through all of the backtrace frames and examine the functions, but ultimately we can figure out that
DnfRootBackend.Search() is wrapped by
dnfdragora.misc.ExceptionHandler, which handles
dnfdaemon.client.DaemonError exceptions - like the one that's ultimately getting raised here! - by calling the base class's own
exception_handler() on them...and for us, that's BaseDragora.exception_handler, one of the two places we found earlier that ultimately produces this "DNF is locked by another process. dnfdragora will exit" text. We also now have two indications (the dbus error itself, and the code in
exception_handler() that the error we're dealing with is "LockedError".
A misleading error...
At this point, I went looking for the text
LockedError, and found it in two files in dnfdaemon that are kinda variants on each other - daemon/dnfdaemon-session.py and daemon/dnfdaemon-system.py. I didn't actually know offhand which of the two is used in our case, but it doesn't really matter, because the codepath to
LockedError is the same in both. There's a function called
check_lock() which checks that
self._lock == sender, and if it doesn't, raises
LockedError. That sure looks like where we're at.
So at this point I did a bit of poking around into how
self._lock gets set and unset in the daemon. It turns out to be pretty simple. The daemon is basically implemented as a class with a bunch of methods that are wrapped by
@dbus.service.method, which makes them accessible as DBus methods. (One of them is
Search(), and we can see that the client class's own
Search() basically just calls that). There are also methods called
Unlock(), which - not surprisingly - set and release this lock, by setting the daemon class'
self._lock to be either an identifier for the DBus client or
None, respectively. And when the daemon is first initialized, the value is set to
At this point, I realized that the error we're dealing with here is actually a lie in two important ways:
- The message claims that the problem is the lock being held "by another application", but that's not what
check_lock()checks, really. It passes only if the caller holds the lock. It does fail if the lock is held "by another application", but it also fails if the lock is not held at all. Given all the code we looked at so far, we can't actually trust the message's assertion that something else is holding the lock. It is also possible that the lock is not held at all.
- The message suggests that the lock in question is a lock on dnf itself. I know dnf/libdnf do have locking, so up to now I'd been assuming we were actually dealing with the locking in dnf itself. But at this point I realized we weren't. The dnfdaemon lock code we just looked at doesn't actually call or wrap dnf's own locking code in any way. This lock we're dealing with is entirely internal to dnfdaemon. It's really a lock on the dnfdaemon instance itself.
So, at this point I started thinking of the error as being "dnfdaemon is either locked by another DBus client, or not locked at all".
So what's going on with this lock anyway?
My next step, now I understood the locking process we're dealing with, was to stick some logging into it. I added log lines to the
Unlock() methods, and I also made
check_lock() log what
self._lock were set to before returning. Because it sets
None, I also added a log line to the daemon's
init that just records that we're in it. That got me some more useful information:
2019-10-14 18:53:03.397784 XXX In DnfDaemon.init now! 2019-10-14 18:53:03.402804 XXX LOCK: sender is :1.1835 2019-10-14 18:53:03.407524 XXX CHECK LOCK: sender is :1.1835 XXX CHECK LOCK: self._lock is :1.1835 2019-10-14 18:53:07.556499 XXX CHECK LOCK: sender is :1.1835 XXX CHECK LOCK: self._lock is :1.1835 [...snip a bunch more calls to check_lock where the sender is the same...] 2019-10-14 18:53:13.560828 XXX CHECK LOCK: sender is :1.1835 XXX CHECK LOCK: self._lock is :1.1835 2019-10-14 18:53:13.560941 XXX CHECK LOCK: sender is :1.1835 XXX CHECK LOCK: self._lock is :1.1835 2019-10-14 18:53:16.513900 XXX In DnfDaemon.init now! 2019-10-14 18:53:16.516724 XXX CHECK LOCK: sender is :1.1835 XXX CHECK LOCK: self._lock is None
so we could see that when we started dnfdragora, dnfdaemon started up and dnfdragora locked it almost immediately, then throughout the whole process of reproducing the bug - run dnfdragora, search for a package to be updated, mark it for updating, run the transaction, wait for the error - there were several instances of DBus method calls where everything worked fine (we see
check_lock() being called and finding
self._lock set to the same value, the identifier for dnfdragora), but then suddenly we see the daemon's
init running again for some reason, not being locked, and then a
check_lock() call that fails because the daemon instance's
After a couple of minutes, I guessed what was going on here, and the daemon's service logs confirmed it - dnfdaemon was crashing and automatically restarting. The first attempt to invoke a DBus method after the crash and restart fails, because dnfdragora has not locked this new instance of the daemon (it has no idea it just crashed and restarted), so
check_lock() fails. So as soon as a DBus method invocation is attempted after the dnfdaemon crash, dnfdragora errors out with the confusing "dnf is locked by another process" error.
The crash was already mentioned in the bug report, but until now the exact interaction between the crash and the error had not been worked out - we just knew the daemon crashed and the app errored out, but we didn't really know what order those things happened in or how they related to each other.
OK then...why is dnfdaemon crashing?
So, the question now became: why is dnfdaemon crashing? Well, the backtrace we had didn't tell us a lot; really it only told us that something was going wrong in libdbus, which we could also tell from the dnfdaemon service log:
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system: dbus: arguments to dbus_connection_unref() were incorrect, assertion "connection->generation == _dbus_current_generation" failed in file ../../dbus/dbus-connection.c line 2823. Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system: This is normally a bug in some application using the D-Bus library. Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system: D-Bus not built with -rdynamic so unable to print a backtrace
that last line looked like a cue, so of course, off I went to figure out how to build DBus with
-rdynamic. A bit of Googling told me - thanks "the3dfxdude"! - that the trick is to compile with --enable-asserts. So I did that and reproduced the bug again, and got a bit of a better backtrace. It's a long one, but by picking through it carefully I could spot - in frame #17 - the actual point at which the problem happened, which was in
dnfdaemon.server.DnfDaemonBase.run_transaction(). (Note, this is a different
DnfDaemonBase class from
dnfdaemon.client.DnfDaemonBase; I don't know why they have the same name, that's just confusing).
So, the daemon's crashing on this
self.TransactionEvent('end-run', NONE) call. I poked into what that does a bit, and found a design here that kinda mirrors what happens on the client side: this
DnfDaemonBase, like the other one, is a framework for a full daemon implementation, and it's subclassed by a
DnfDaemon class here. That class defines a
TransactionEvent method that emits a DBus signal. So...we're crashing when trying to emit a dbus signal. That all adds up with the backtrace going through libdbus and all. But, why are we crashing?
At this point I tried to make a small reproducer (which basically just set up a
DnfDaemon instance and called
self.TransactionEvent in the same way, I think) but that didn't work - I didn't know why at the time, but figured it out later. Continuing to trace it out through code wouldn't be that easy because now we're in DBus, which I know from experience is a big complex codebase that's not that easy to just reason your way through. We had the actual DBus error to work from too - "arguments to dbus_connection_unref() were incorrect, assertion "connection->generation == _dbus_current_generation" failed" - and I looked into that a bit, but there were no really helpful leads there (I got a bit more understanding about what the error means exactly, but it didn't help me understand why it was happening at all).
Time for the old standby...
So, being a bit stuck, I fell back on the most trusty standby: trial and error! Well, also a bit of logic. It did occur to me that the dbus broker is itself a long-running daemon that other things can talk to. So I started just wondering if something was interfering with dnfdaemon's connection with the dbus broker, somehow. This was in my head as I poked around at stuff - wherever I wound up looking, I was looking for stuff that involved dbus.
But to figure out where to look, I just started hacking up dnfdaemon a bit. Now this first part is probably pure intuition, but that
self._reset_base() call on the line right before the
self.TransactionEvent call that crashes bugged me. It's probably just long experience telling me that anything with "reset" or "refresh" in the name is bad news. :P So I thought, hey, what happens if we move it?
I stuck some logging lines into this
run_transaction so I knew where we got to before we crashed - this is a great dumb trick, btw, just stick lines like
self.logger('XXX HERE 1'),
self.logger('XXX HERE 2') etc. between every significant line in the thing you're debugging, and grep the logs for "XXX" - and moved the
self._reset_base() call down under the
self.TransactionEvent call...and found that when I did that, we got further, the
self.TransactionEvent call worked and we crashed the next time something else tried to emit a DBus signal. I also tried commenting out the
self._reset_base() call entirely, and found that now we would only crash the next time a DBus signal was emitted after a subsequent call to the
Unlock() method, which is another method that calls
self._reset_base(). So, at this point I was pretty confident in this description: "dnfdaemon is crashing on the first interaction with DBus after
self._reset_base() is called".
So my next step was to break down what
_reset_base() was actually doing. Turns out all of the detail is in the
DnfDaemonBase skeleton server class: it has a
self._base which is a
dnf.base.Base() instance, and that method just calls that instance's
close() method and sets
None. So off I went into dnf code to see what
dnf.base.Base.close() does. Turns out it basically does two things: it calls
self._finalize_base() and then calls
self.reset(True, True, True).
Looking at the code it wasn't immediately obvious which of these would be the culprit, so it was all aboard the trial and error train again! I changed the call to
self._reset_base() in the daemon to
self._base.reset(True, True, True)...and the bug stopped happening! So that told me the problem was in the call to
_finalize_base(), not the call to
reset(). So I dug into what
_finalize_base() does and kinda repeated this process - I kept drilling down through layers and splitting up what things did into individual pieces, and doing subsets of those pieces at a time to try and find the "smallest" thing I could which would cause the bug.
To take a short aside...this is what I really like about these kinds of debugging odysseys. It's like being a detective, only ultimately you know that there's a definite reason for what's happening and there's always some way you can get closer to it. If you have enough patience there's always a next step you can take that will get you a little bit closer to figuring out what's going on. You just have to keep working through the little steps until you finally get there.
Eventually I lit upon this bit of
dnf.rpm.transaction.TransactionWrapper.close(). That was the key, as close as I could get to it: reducing the daemon's
self._reset_base() call to just
self._base._priv_ts.ts = None (which is what that line does) was enough to cause the bug. That was the one thing out of all the things that
self._reset_base() does which caused the problem.
So, of course, I took a look at what this
ts thing was. Turns out it's an instance of
rpm.TransactionSet, from RPM's Python library. So, at some point, we're setting up an instance of
rpm.TransactionSet, and at this point we're dropping our reference to it, which - point to ponder - might trigger some kind of cleanup on it.
Remember how I was looking for things that deal with dbus? Well, that turned out to bear fruit at this point...because what I did next was simply to go to my git checkout of rpm and grep it for 'dbus'. And lo and behold...this showed up.
Turns out RPM has plugins (TIL!), and in particular, it has this one, which talks to dbus. (What it actually does is try to inhibit systemd from suspending or shutting down the system while a package transaction is happening). And this plugin has a cleanup function which calls something called
dbus_shutdown() - aha!
This was enough to get me pretty suspicious. So I checked my system and, indeed, I had a package
rpm-plugin-systemd-inhibit installed. I poked at dependencies a bit and found that
python3-dnf recommends that package, which means it'll basically be installed on nearly all Fedora installs. Still looking like a prime suspect. So, it was easy enough to check: I put the code back to a state where the crash happened, uninstalled the package, and tried again...and bingo! The crash stopped happening.
So at this point the case was more or less closed. I just had to do a bit of confirming and tidying up. I checked and it turned out that indeed this call to
dbus_shutdown() had been added quite recently, which tied in with the bug not showing up earlier. I looked up the documentation for
dbus_shutdown() which confirmed that it's a bit of a big cannon which certainly could cause a problem like this:
"Frees all memory allocated internally by libdbus and reverses the effects of dbus_threads_init().
libdbus keeps internal global variables, for example caches and thread locks, and it can be useful to free these internal data structures.
You can't continue to use any D-Bus objects, such as connections, that were allocated prior to dbus_shutdown(). You can, however, start over; call dbus_threads_init() again, create new connections, and so forth."
and then I did a scratch build of rpm with the commit reverted, tested, and found that indeed, it solved the problem. So, we finally had our culprit: when the
rpm.TransactionSet instance went out of scope, it got cleaned up, and that resulted in this plugin's cleanup function getting called, and
dbus_shutdown() happening. The RPM devs had intended that call to clean up the RPM plugin's DBus handles, but this is all happening in a single process, so the call also cleaned up the DBus handles used by dnfdaemon itself, and that was enough (as the docs suggest) to cause any further attempts to communicate with DBus in dnfdaemon code to blow up and crash the daemon.
So, that's how you get from dnfdragora claiming that DNF is locked by another process to a stray RPM plugin crashing dnfdaemon on a DBus interaction!
Fedmsg is the Fedora project-wide messaging bus we've had since 2012. It backs FMN / Fedora Notifications and Badges, and is used extensively within Fedora infrastructure for the general purpose of "have this one system do something whenever this other system does something else". For instance, openQA job scheduling and result reporting are both powered by fedmsg.
Over time, though, there have turned out to be a few issues with fedmsg. It has a few awkward design quirks, but most significantly, it's designed such that message delivery can never be guaranteed. In practice it's very reliable and messages almost always are delivered, but for building critical systems like Rawhide package gating, the infrastructure team decided we really needed a system where message delivery can be formally guaranteed.
There was initially an idea to build a sort of extension to fedmsg allowing for message delivery to be guaranteed, but in the end it was decided instead to replace fedmsg with a new AMQP-based system called fedora-messaging. At present both fedmsg and fedora-messaging are live and there are bridges in both directions: all messages published as fedmsgs are republished as fedora-messaging messages by a 0MQ->AMQP bridge, and all messages published as fedora-messaging messages are republished as fedmsgs by an AMQP->0MQ bridge. This is intended to ease the migration process by letting you migrate a publisher or consumer of fedmsgs to fedora-messaging at any time without worrying about whether the corresponding consumers and/or publishers have also been migrated.
This is just the sort of project I usually work on in the 'quiet time' after one release comes out and before the next one really kicks into high gear, so since Fedora 30 just came out, last week I started converting the openQA fedmsg consumers to fedora-messaging. Here's a quick write-up of the process and some of the issues I found along the way!
I found these three pages in the fedora-messaging docs to be the most useful:
All the fedmsg consumers I wrote followed this approach, where you essentially write consumer classes and register them as entry points in the project's
setup.py. Once the project is installed, the
fedmsg-hub service provided by fedmsg runs all these registered consumers (as long as a configuration setting is set to turn them on).
This exact pattern does not exist in fedora-messaging - there is no hub service. But fedora-messaging does provide a somewhat-similar pattern which is the natural migration path for this type of consumer. In this approach you still have consumer classes, but instead of registering them as entry points, you write configuration files for them and place them in
/etc/fedora-messaging. You can then run an instantiated systemd service that runs
fedora-messaging consume with the configuration file you created.
So to put it all together with a specific example: to schedule openQA jobs, we had a fedmsg consumer class called
OpenQAScheduler which was registered as a
setup.py, and had a
config_key named "fedora_openqa.scheduler.prod.enabled". As long as a config file in
'fedora_openqa.scheduler.prod.enabled': True, the
fedmsg-hub service then ran this consumer. The consumer class itself defined what messages it would subscribe to, using its
In a fedora-messaging world, the
OpenQAScheduler class is tweaked a bit to handle an AMQP-style message, and the entrypoint in
setup.py and the
config_key in the class are removed. Instead, we create a configuration file
/etc/fedora-messaging/fedora_openqa_scheduler.toml and enable and start the
fm-consumer@fedora_openqa_scheduler.service systemd service. Note that all the necessary bits for this are shipped in the
fedora-messaging package, so you need that package installed on the system where the consumer will run.
That configuration file looks pretty much like the sample I put in the repository. This is based on the sample files I mentioned above.
amqp_url specifies which AMQP broker to connect to and what username to use: in this sample we're connecting to the production Fedora broker and using the public 'fedora' identity. The
callback specifies the Python path to the consumer callback class (our
OpenQAScheduler class). The
[tls] section points to the CA certificate, certificate and private key to be used for authenticating with the broker: since we're using the public 'fedora' identity, these are the files shipped in the fedora-messaging package itself which let you authenticate as that identity. For production use, I think the intent is that you request a separate identity from Fedora infra (who will generate certs and keys for it) and use that instead - so you'd change the
amqp_url and the paths in the
[tls] section appropriately.
The other key things you have to set are the queue name - which appears twice in the sample file as
00000000-0000-0000-0000-000000000000, for each consumer you are supposed to generate a random UUID with
uuidgen and use that as the queue name, each consumer should have its own queue - and the
routing_keys in the
[[bindings]] section. Those are the topics the consumer will subscribe to - unlike in the fedmsg system, this is set in configuration rather than in the consumer class itself. Another thing you may wish to take advantage of is the
consumer_config section: this is basically a freeform configuration store that the consumer class can read settings from. So you can have multiple configuration files that run the same consumer class but with different settings - you might well have different 'production' and 'staging' configurations. We do indeed use this for the openQA job scheduler consumer: we use a setting in this
consumer_config section to specify the hostname of the openQA instance to connect to.
So, what needs changing in the actual consumer class itself? For me, there wasn't a lot. For a start, the class should now just inherit from
object - there is no base class for consumers in the fedora-messaging world, there's no equivalent to
fedmsg.consumers.FedmsgConsumer. You can remove things like the
topic attribute (that's now set in configuration) and
validate_signatures. You may want to set up a
init, which is a good place to read in settings from
consumer_config and set up a logger (more on logging in a bit). The method for actually reading a message should be named
call() (so yes, fedora-messaging just calls the consumer instance itself on the message, rather than explicitly calling one of its methods). And the message object itself the method receives is slightly different: it will be an instance of
fedora_messaging.api.Message or a subclass of it, not just a dict. The topic, body and other bits of the message are available as attributes, not dict items. So instead of
message['topic'], you'd use
message.topic. The message body is
Here I ran into a significant wrinkle. If you're consuming a native fedora-messaging message, the
message.body will be the actual body of the message. However, if you're consuming a message that was published as a fedmsg and has been republished by the fedmsg->fedora-messaging bridge,
message.body won't be what you'd probably expect. Looking at an example fedmsg, we'd probably expect the
message.body of the converted fedora-messaging message to be just the
msg dict, right? Just a dict with keys
agent. However, at present, the bridge actually publishes the entire fedmsg as the
message.body - what you get as
message.body is that whole dict. To get to the 'true' body, you have to take
message.body['msg']. This is a problem because whenever the publisher is converted to fedora-messaging, there won't be a
message.body['msg'] any more, and your consumer will likely break. It seems that the bridge's behavior here will likely be changed soon, but for now, this is a bit of a problem.
Once I figured this out, I wrote a little helper function called
_find_true_body to fudge around this issue. You are welcome to steal it for your own use if you like. It should always find the 'true' body of any message your consumer receives, whether it's native or converted, and it will work when the bridge is fixed in future too so you won't need to update your consumer when that happens (though later on down the road it'll be safe to just get rid of the function and use
Those things, plus rejigging the logging a bit, were all I needed to do to convert my consumers - it wasn't really that much work in the end.
To dig into logging a bit more: fedmsg consumer class instances had a
log() method you could use to send log messages, you didn't have to set up your own logging infrastructure. (Although a problem of this system was that it gave no indication which consumer a log message came from). fedora-messaging does not have this. If you want a consumer to log, you have to set up the logging infrastructure within the consumer, and tweak the configuration file a bit.
The pattern I chose was to
import logging and then init a logger instance for each consumer class in its
init(), like this:
self.logger = logging.getLogger(self.__class__.__name__)
Then you can log messages with
self.logger.info("message") or whatever. I thought that would be all I'd need, but actually, if you just do that, there's nothing set up to actually receive the messages and log them anywhere. So you have to add a bit to the TOML config file that looks like this:
[log_config.loggers.OpenQAScheduler] level = "INFO" propagate = false handlers = ["console"]
OpenQAScheduler there is the class name; change it to the actual name of the consumer class. That will have the messages logged to the console, which - when you run the consumer as a systemd service - means they wind up in the system journal, which was enough for me. You can also configure a handler to send email alerts, for instance, if you like - you can see an example of this in Bodhi's config file.
One other wrinkle I ran into was with authenticating to the staging broker. The sample configuration file has the right URL and
[tls] section for this, but the files referenced in the
[tls] section aren't actually in the
fedora-messaging package. To successfully connect to the staging broker, as
fedora.stg, you need to grab the necessary files from the fedora-messaging git repo and place them into
To see the whole of the changes I had to make to the openQA consumers, you can look at the commits on the fedora-messaging branch of the repo and also this set of commits to the Fedora infra ansible repo.
I was really pleased to see this blog post today, because I was looking for exactly this a few weeks ago and was sad to find it didn't exist. Now it does!
Displaying on an external screen via wifi is a really handy thing that is increasingly commonly available. Lately I'm finding most TVs I run across - including hotel room TVs! - have the feature available. For whatever reason I don't really see many people using or talking about it, but it's great - you can watch videos from your phone or laptop without having to carry an HDMI cable around everywhere and hope the inputs aren't blocked or full on the TV.
So I'm looking forward to testing that out, and hope all the bits land in official distribution packages soon :) Thanks Benjamin!