Wednesday, 26 February 2014

Traffic control for multimedia devices


Multimedia devices traditionally don't manage the network bandwidth required by applications. This causes a problem when users try to watch a streaming video or listen to a web radio seamlessly while other applications are downloading other content in the background. The background downloads can use too much bandwidth for the streaming video or web radio to keep up and users notice unnecessary interruptions in the playback.

I have been working on an approach to improve this using traffic control on Linux. This work was sponsored by Collabora.

What is traffic control


Traffic control is a technique to control network traffic in order to optimise or guarantee performance, low-latency, and/or bandwidth. This includes deciding which packets to accept at what rate in an input interface and determining which packets to transmit in what order at what rate on an output interface. 

On Linux, applications can send the traffic control configuration to the kernel using a Netlink socket with  the NETLINK_ROUTE protocol. By default, traffic control on Linux consists of a single queue which collects entering packets and dequeues them as quickly as the underlying device can accept them. The tc tool (from the iproute2 package) or the more recent "nl-*" tools (part of libnl) are different implementations but they can both be used to configure traffic control. Libnl has an incomplete support for traffic control but is in active development and progressing quickly.

Difficulty of shaping ingress traffic


Traffic control and shaping comes in two forms, the control of packets being received by the system (ingress) and the control of packets being sent out by the system (egress). Shaping outgoing traffic is reasonably straight-forward, as the system is in direct control of the traffic sent out through its interfaces. Shaping incoming traffic is however much harder as the decision on which packets to sent over the medium is controlled by the sending side and can't be directly controlled by the system itself.

However, for multimedia devices, control over incoming traffic is far more important then controlling outgoing traffic. Our use-case is ensuring glitch-free playback of a media stream (e.g. internet radio). In such a case, essentially, a minimal amount of incoming bandwidth needs to be reserved for the media stream.

For shaping (or rather influencing or policing) incoming traffic, the only practical approach is to put a fake bottleneck in place on the local system and rely on TCP congestion control to adjust its rate to match the intended rate as enforced by this bottleneck. With such a system it's possible to, for example, implement a policy where traffic that is not important for the current media stream (background traffic) can be limited, leaving the remaining available bandwidth for the more critical streams.

On Linux, ingress traffic control ("ingress qdisc" on the graph) happens before the Netfilter subsystem:
Netfilter-packet-flow
By Jengelh (Own work, Origin SVG PNG) [CC-BY-SA-3.0], via Wikimedia Commons



Difficulty of shaping on mobile networks


However, to complicate matters further, in mobile systems which are connected wirelessly to the internet and have a tendency to move around it's not possible to know the total amount of available bandwidth at any specific time as it's constantly changing. Which means, a simple strategy of capping background traffic at a static limit simply can't work.

The implemented solution


To cope with the dynamic nature, a traffic control daemon (tcmmd) has been implemented which can dynamically update the kernel configuration to match the current needs of the playback applications and adapt to the current network conditions. Furthermore to address the issues mentioned above, the implementation will use the following strategy:

  • Split the traffic streams into critical traffic and background traffic. Police the incoming traffic by limiting the bandwidth available to background traffic with the goal of leaving enough bandwidth available for critical streams.
  • Instead of having static configuration, let applications (e.g. a media player) indicate when the current traffic rate is too low for their purposes. This both means the daemon doesn't have to actively measure the traffic rate and allows it cope with streams that don't have a constant bitrate more naturally.
Communication between the traffic control daemon and the applications is done via D-Bus. The  D-Bus interface allow applications to register critical streams by passing the standard 5-tuple (source ip and port, destination ip and port and protocol) which uniquely identify a stream and indicate when a particular stream bandwidth is too low.

To allow the daemon to effectively control the incoming traffic, a so-called Intermediate Functional Block device (ifb0) is used to provide a virtual network device to provide an artificial bottleneck. This is done by transparently redirecting the incoming traffic from the physical network device through the virtual network device and shape the traffic as it leaves the virtual device again. The reason for the traffic redirection is to allow the usage of the kernels egress traffic control to effectively be used on incoming traffic. The results in the example setup shown below (with eth0 being a physical interface and ifb0 the accompanying virtual interface).




To demonstrate the functionality as described above, a simple demonstration media application using Gstreamer (tcdemo) has been written that communicates with the Traffic control daemon in the manner described.

Testing, the set-up


The traffic control feature in tcdemo can be enabled or disabled on the command line. This allowed me to compare the behaviour in both cases. 

On my left, I have a web server serving both the files for a video stream and the files for background downloads. On my right, I have a multimedia device rendering a video stream while downloading other files on the same web server.

Traffic control is only useful when the available bandwidth is limited. In order to have meaningful tests, I simulated a low bandwidth with the following commands on the web server:
tc qdisc add dev wlan0 root handle 1: cbq avpkt 1000 bandwidth 10Mbit
tc class add dev wlan0 parent 1: classid 1:1 cbq rate 3Mbit allot 1500 prio 3 bounded isolated
tc filter add dev wlan0 parent 1: protocol ip u32 match ip protocol 6 0xff match ip sport 80 0xffff flowid 1:1

Only the traffic from port 80/http was limited. It is important to note that the background traffic and the stream traffic were both going through the same bottleneck.

Tcdemo was playing a video file streamed over http while 8 wgets were downloading the same file continuously. The 9 connections were competing for the limited bandwidth. Without traffic control, tcdemo would not have got enough bandwidth.

The following graph shows what happened with traffic control. The video streaming is composed of several phases:
  1. tcdemo opened the HTTP connection and its GStreamer pipeline started downloading. At the same time, tcmmd was notified there was a new stream connection and it restricted any potential background traffic to a very low limit. As long as the initial GStreamer queue was buffering, the background traffic limit did not change.
  2. The GStreamer queue became full at t=4s and the video started to be played on the screen. The daemon increased the limit on the background traffic exponentially and the stream bandwidth got reduced as a consequence.
  3. Despite the stream bandwidth degrading slowly, GStreamer managed to keep its queue over 75% full until t=25s. When the queue is more than 75% full, Gstreamer does not report it because tcdemo chose that threshold with the low-percent property on GstQueue2 (the graph shows 100% in this case). 
  4. At t=30s, the GStreamer queue was less than 70% full and that threshold triggered tcmmd to restrict the background traffic to its minimum.
  5. The stream could use most of the bandwidth and the GStreamer queue became full quickly at t=31s. The background traffic could start its exponential growth again.
traffic control stats
Thanks to traffic control, the GStreamer queue never got empty in my test.

Get the sources


git clone git://git.collabora.co.uk/git/user/alban/tcmmd
git clone https://github.com/alban/tcmmd

FAQ


Q: Do I need any privileges to run this?
A: No privileges required for tcdemo, the GStreamer application. But tcmmd needs CAP_NET_ADMIN to change the TC rules.

Q: The 5-tuple contains the TCP source port. How does the application know that number?
A: The application can either call bind(2) before connect(2) to choose a TCP source port, or call getsockname(2) after connect(2) to retrieve the TCP source port assigned automatically by the kernel. The former allows to install the traffic control rules before the call to connect(2) triggers the emission and reception of the first packets on the network. The latter means the first few packets will be exchanged without being shaped by the traffic control. Tcdemo implements the latter to avoid more invasive changes in the souphttpsrc GStreamer element and libsoup. See bgo#721807.

Q: What happens if an application forgets to unregister a 5-tuple when the video stream finishes?
A: That would be bad manners from the application. The current traffic control settings would remain.  And if the application notifies tcmmd that its buffer was empty and forgets to notify any changes, the background traffic would be severely throttled. However, if the application just terminates or crashes, tcmmd would notice it immediately on D-Bus and the traffic control rules would be removed.

Q: Does tcmmd remove its traffic control rules when terminated?
A: It depends how it is terminated. Tcmmd removes its traffic control rules on SIGINT and SIGTERM. But the rules remain in other cases (SIGSEGV, SIGKILL, etc.). If it is a problem in case of crash, tcmmd initialisation properly removes previous rules, so you could start tcmmd and interrupt it with ctrl-c.

Q: Instead of using the 5-tuple, why not using setsockopt-SO_MARK?
A: First, SO_MARK requires CAP_NET_ADMIN which is not something that media player should have. It could be worked around by fd-passing the socket to a more privileged daemon to call setsockopt-SO_MARK but it's not elegant. More importantly, tcmmd's goal is not to shape the egress traffic but the ingress traffic. The shaping of incoming packets is performed very early in the Linux network stack: it happens before Netfilter, and before the packet is associated to a socket. So we can't check the SO_MARK of a socket to shape incoming packets.

Q: Instead of using the 5-tuple, why not using cgroups?
A: The granularity of cgroups are only per-process. So the traffic control would not be able to distinguish between different HTTP connections in a web browser used to render a video stream and used for background downloads. And for the same reason as setsockopt-SO_MARK, it would not work for shaping ingress traffic: we would not be able to link the packet to any process or cgroup.

Q: Instead of sending the 5-tuple to tcmmd, why not set the IP type-of-service (TOS) on outgoing packets with setsockopt-SO_PRIORITY to avoid changes in the application and have an iptables target to feed that information about connections back to the ingress traffic control?
A: It could be possible if the bandwidth was fixed, but on mobile networks, the application needs to be changed anyway to give feedback when the queue in the GStreamer pipeline get emptied.

Q: Why not play with the TCP windows instead shaping the ingress traffic?
A: As far as I know, Linux does not have the infrastructure for that. The TCP windows to manipulate would not be from the GStreamer application but from all other connections, so it can't be done from userspace.

Q: Does tcdemo require any new feature in GStreamer?
A: Yes, souphttpsrc needs this patch: bgo#721807

Q: Does tcmmd require any new feature in the Linux kernel?
A: No.

Q: Does tcmmd work on several network interfaces (e.g. eth0 + wlan0)?
A: No, at the moment tcmmd only support one interface and it has to be started after the interface is up. Patches welcome!

Q: Tcmmd uses both libnl and /sbin/tc via system() calls. Why?
A: My goal is to use libnl and avoid spawning processes to call /sbin/tc. I just didn't have time to finish this. It will involve checking that libnl has the right features. Some needed features such as u32 action support were implemented recently in the last version.

Q: How did you get the graphs?
A: I used tcmmd's --save-stats option and the script tests/plot-tcmmd-log.sh.

Q: Why is there so frequent Netlink communication between tcmmd and the kernel?
A: One part of this is to gather regular statistics in order to generate graphs if the option --save-stats was used. The other part is for implementing the exponential progression of the bandwidth allocated to the background traffic: at regular interval, tcmmd changes the rate assigned to a qdisc. It could be avoided by implementing a specialised qdisc in the kernel for our use case. It would require more thinking how to design the API for that new qdisc.

Q: Does it work with IPv6?
A: No. The architecture is not specific to IPv4 but it is just not implemented yet for IPv6. Tcmmd would need to generate new TC rules because the IP headers are different between IPv4 and IPv6.


Thanks Sjoerd for the architecture diagram and proof-reading.

Friday, 25 February 2011

Importance of the scheduling priority in D-Bus

Context switches are expensive. When do they happen in D-Bus communication?

application1 sends a D-Bus message to application2 through dbus-daemon

application1 sends a D-Bus message to application2 through dbus-daemon

The kernel may perform a context switch:

  • During the sendmsg() system call, when application1 sends a D-Bus message to dbus-daemon
  • During the sendmsg() system call, when dbus-daemon sends a D-Bus message to application2
  • At the end of a time slice

I instrumented libdbus in order to trace events such as message parsing, dispatching messages in dbus-daemon,  sendmsg(), recvmsg(). I took inspiration from Federico’s performance scripts but without using strace because straced processes have a big overhead compared to the duration of events I want to measure. My instrumentation doesn’t save traced events in a file to avoid overhead as much as possible, but just keep them in memory and save them after the experiment is finished. This is triggered:

  1. when a process using libdbus exits. The save function is registered by atexit().
  2. when a process using libdbus receives a SIGUSR2 signal. The save function is registered by sigaction().

I collect logs generated by dbus-daemon and all D-Bus peers and feed them to plot-timeline.py to generate timelines. I estimate the overhead of logging between 3 microseconds and 5 microseconds by logging consecutively 2 events. All the experiments are run in a virtual machine with one processor.

Experiment #1: a D-Bus signal delivered to 10 recipients

I sent one D-Bus signal with dbus-send and received by 10 dbus-monitor. You can reuse this script.

In this experiment, dbus-send was context switched off during the send() system call and that system call returned after dbus-daemon received the message, sent it to the 10 recipients, and all of them received it.

We can see dbus-daemon iterating over every recipients and sending the message on each recipient’s socket.

There is 12 context switches in this experiment:

  • 1 context switch from the sender to dbus-daemon
  • 10 context switches to each recipient
  • 1 final context switch to the sender

Experiment #2: 10 D-Bus signal delivered to 10 recipients

It is the same experiment as before but the sender emits 10 D-Bus signals (as fast as it could) instead of a single one. dbus-send cannot do that, so I used dbus-ping-pong.

The first three D-Bus signal are delivered in the same way; this timeline looks like the first timeline. There is also 12 context switches per D-Bus signal emitted.

But something different happen for the 7 next D-Bus signals: dbus-daemon is context switched off during the send() on every recipient.

The scheduler tries to maintain fairness along all the processes by giving them the same amount of CPU time. But dbus-daemon has about 10 times more work to do than every single recipient because it sends 10 messages and the recipients only receive 1 message. So when dbus-daemon writes on a recipient’s socket, there is two processes in the running state and the scheduler has to take a decision between a CPU hungry process (dbus-daemon) and a recipient. The recipient is chosen and it causes a context switches.

The context switches per D-Bus signal emitted are:

  • 1 context switch from the sender to dbus-daemon
  • 1 context switch from dbus-daemon to the first recipient
  • 1 context switch from the first recipient back to dbus-daemon
  • again 2 times 9 context switches between dbus-daemon and the remaining recipients
  • 1 context switch from dbus-daemon to the sender

That’s 22 context switches in total instead of 12 for the same amount of work and the same amount of messages delivered.

The experiment was run in 0.036121 seconds.

Experiment #3: 10 D-Bus signal delivered to 10 recipients with a high priority dbus-daemon

With the command renice, I change the priority of dbus-daemon to -15.

In this case, the kernel chose the first scheduling strategy with 12 context switches per D-Bus signal emitted. The experiment was run in 0.019320 seconds. It is 1.8 times faster.

Most D-Bus messages don’t have so many recipients. But still, it is interesting to see how much the priority of dbus-daemon influences the performance.

On Debian, dbus-daemons (session and system bus) run with the default priority (nice level: 0). On the N900, they run with a higher priority (nice level: -5).

Experiment #4: details of dispatching

Let’s zoom on dbus-daemon’s dispatching for two different sizes of D-Bus signals:277 bytes and 17kB.

277 bytes D-Bus signal:

  • Total: 524 microseconds
  • Receiving: 36 microseconds (7%)
  • Dispatch: 164 microseconds (31%)
    • 79 microseconds in match rules
  • Sending: 324 microseconds (62%)

17kB D-Bus signal:

  • Total: 1026 microseconds
  • Receiving: 413 microseconds (40%)
  • Dispatch: 230 microseconds (22%)
    • 79 microseconds in match rules
  • Sending: 383 microseconds (37%)

Thursday, 24 February 2011

D-Bus traffic pattern with Telepathy

Telepathy is a big user of D-Bus, both on the GNOME desktop with Empathy and on the N900. When I have a lot of Telepathy accounts and contacts, the start-up time could be about 10 seconds. How much is the D-Bus communication responsible for the start-up time? When I tried the D-Bus in the kernel prototype on the N900, I win 1.3 seconds. It shows it is not negligible and there is room for improvements. But the slowness may not be due only to D-Bus itself but the way it is used. It is interesting to look at the D-Bus traffic pattern. This topic was already largely covered by Will’s  talk on Profiling and Optimizing D-Bus APIs but I want to give a preview on future D-Bus analysis tools. The existing D-Bus tools dbus-monitor, D-Feet and Bustle are still very useful of course.

A glance at D-Bus messages

The easiest to begin with is to record all the D-Bus traffic with dbus-monitor while I start Empathy, send a few messages, and quit. Let’s see if there is any surprise. dbus-monitor generated a 13MB text file containing 8313 D-Bus messages:

  • 2850 method calls
  • 761 methods returns
  • 4699 signals
  • 3 errors

The number of method calls and method returns are quite different! This is because dbus-monitor doesn’t record method returns from the bus driver (org.freedesktop.DBus, implemented by dbus-daemon). Indeed, my log file contains 2066 method calls to the bus driver:

Method calls for D-Bus match rules (AddMatch/RemoveMatch) generate a lot of traffic. The test was done with the upstream dbus-glib 0.88-2.1, without the patch on DBusGProxy (Bug #23846) to have per-member match rules. If that patch was applied, the number of D-Bus match rules could be even bigger.

If the D-Bus match rules were implemented with socket filters on multicast Unix sockets, AddMatch and RemoveMatch messages would be completely replaced by the system call setsockopt(), reducing the number of context switches from D-Bus peers to dbus-daemon. But the situation could be improved with the current dbus-daemon implementation: 628 AddMatch are match rules on NameOwnerChanged (48%), generated by dbus-glib, but there is actually only 25 unique match rules. The duplicate match rules are caused by Bug #33646 on DBusGProxy.

Performance depends on the size of message

Now, let’s have a look on the size of messages and the impact on performances. I can use dbus-ping-pong to estimate how long it takes to send a D-Bus message with a specific size. The following script starts the ping-pong messages between two D-Bus peers, with several message sizes, and I gathered the results in the next graph:

#!/bin/bash

./dbus-ping-pong server org.Echos &
PID=$!
echo "Server is PID=$PID"
sleep 1

for i in 1 3 10 31 100 316 1000 3162 10000 31622 100000 316227 1000000 3162277 10000000 ; do
  echo $i
  ./dbus-ping-pong client org.Echos 100 $i
done

kill $PID

Performance of dbus-ping-pong with various message sizes

Figure 1

I ran the test twice with different values of max_bytes_read_per_iteration, the size of the buffer used to read on the socket; the first time with 2048 bytes and the second time with 65536. When the read buffer is 32 times bigger, there is less recvmsg() system calls for big incoming D-Bus messages and it makes the test 28% faster.

Usual message sizes in Telepathy

Let’s see what are the usual sizes of messages in the context of Telepathy. I just run bustle-dbus-monitor with a patch to get the size of each message (Bug #34067).



Figure 2

Most messages are between 101 bytes and 316 bytes (in orange on the graph).

With piecewise linear interpolation (spreadsheet source) based on the results of dbus-ping-pong, I can approximate how long takes the transmission of each individual message, and then what size of messages are the bottleneck and would be worth optimizing.

It may not be a good estimation because there is more computation in the real Telepathy scenario than in the synthetic dbus-ping-pong scenario: the number of match rules in dbus-daemon is different, there is more connections, and the Telepathy processes are not only sending D-Bus messages so the CPU caches are not behaving the same way. The real time spent in D-Bus communication should be larger than what is computed here with piecewise linear interpolation. But it can still give an idea.



Figure 3

For messages with a small size (under 3000 bytes), the two histograms (figure 2 and figure 3) have the same shape because the ping-pong time is almost constant for those messages. For the larger messages, they take a bit more time than their distribution because the ping-ping time is getting bigger on those messages.

Messages between 101 bytes and 316 bytes are the ones causing the most delay in Telepathy start-up.

The Stats interface

We can look at those messages in more details with the new Stats interface in dbus-daemon. The script messages_stats.py creates a csv file from the Stats interface containing the peak size of messages, the peak number of recipients and number of each D-Bus method call and signal. With that data, I can check that no messages from Telepathy are received too broadly, and if they are, I could use the GetConnectionMatchRules patch to find the application which subscribes to a too broad match rule.

Most of the D-Bus messages generated by Telepathy have a reasonable number of recipients. Only 2 different D-Bus signals have more than 3 recipients, and it is only a few messages in total:

  • org.freedesktop.Telepathy.Connection.Interface.Aliasing.AliasesChanged (4 recipients)
  • org.freedesktop.Telepathy.Connection.Interface.SimplePresence.PresencesChanged (4 recipients)

Repartition of messages per number of recipients

Figure 4

The most sent D-Bus messages by Telepathy are:

  • org.freedesktop.Telepathy.Connection.Interface.Contacts.GetContactAttributes (17 calls)
  • org.freedesktop.Telepathy.Account.AccountPropertyChanged (11 signals)

I don’t see anything obviously bad here. Telepathy is a good D-Bus citizen!

Wednesday, 16 February 2011

Introducing multicast Unix sockets

I have been working on implementing multicast Unix sockets in the Linux kernel. This allows a process to send a message on a socket to a multicast group with one system call sendmsg() and the message will be received by all sockets member of the multicast group.

This work has been sponsored by my employer Collabora.

Several projects could benefit from this new IPC system:

  1. D-Bus

    D-Bus is a message bus system. Applications exchange D-Bus messages traditionally through a central process, dbus-daemon. When dbus-daemon receives the message, it determines the recipients and delivers the message to each recipient’s socket. This architecture causes dbus-daemon to wake up for every single message causing expensive context switches, memory copies and processing. If the D-Bus peers were part of a multicast group, the kernel could deliver D-Bus messages directly to the recipients. It could use socket filters to deliver them only to the correct recipients, according to the D-Bus match rules.
  2. The T IPC system

    In the same manner as D-Bus, multicast Unix sockets and socket filters could be used by the T IPC system.
  3. Udev

    Udev uses Linux’ netlink sockets to send multicast messages from udevd to libudev listeners. Netlink sockets are usually used for communication between the kernel and userspace, but can also be used for userspace-only communication. It has limitations though; there are only 32 multicast groups, system-wide, and only root can send multicast messages.

    Update: netlink does not have that limit anymore since 2005.

My implementation aims to be a general purpose multicast IPC system, without the limitations of netlink multicast. The kernel patches and a test suite are available in git:

git clone git://git.collabora.co.uk/git/user/alban/linux-2.6.35.y/.git unix-multicast18
git clone git://git.collabora.co.uk/git/user/alban/check-unix-multicast

Multicast is implemented on datagram and seqpacket sockets, but not on stream sockets. It would not make sense on stream sockets because the messages are not delimited and there would be no guarantee that several senders’ messages would not be mixed. The semantics are different between datagram and seqpacket sockets.

Multicast on datagram sockets

Communication on datagram multicast Unix sockets

The setsockopt() call which creates the multicast group binds the socket on the multicast address. Messages sent to the group are received by all members, including the sender, if it joined with the “loopback” feature. Socket filtering may be used by a recipient to avoid receiving messages, however this does not affect delivery of the message to other peers in the group.

The daemon controlling the multicast group can receive the messages from the peers if the feature “send-to-peer” is enabled.

Multicast on seqpacket sockets

 Communication on seqpacket multicast Unix sockets

Seqpacket sockets are connection-based and the daemon can control who is able to join the group. The daemon can receive the messages from the peers on its accepted sockets (A1, A2, A3 on the diagram above) with the “send-to-peer” feature. It is useful for D-Bus: dbus-daemon can reply to the method calls on the bus driver.

Socket filter for D-Bus

Each socket can upload a socket filter (or Berkeley Packet Filter, BPF) in the kernel. Socket filters are small programs, executed for every message sent to a socket. If the socket filter returns zero, the message is discarded and the process does not need to wake up.

The socket filter could be modified and uploaded in the kernel every time the D-Bus peer wants to add or remove a D-Bus match rule and get or lose a unique or well-known name. So D-Bus messages are not delivered to every D-Bus peer but only the right recipients.

Socket filters may be applied on SOCK_DGRAM and SOCK_SEQPACKET only. They make little sense on SOCK_STREAM because there are no message boundaries. This limits the size of D-Bus messages to about 110kB, although it can be changed with setsockopt(SO_SNDBUF) up to a maximum of 219kB (or more by tuning /proc/sys/net/core/rmem_max).

Atomic delivery

Messages are delivered atomically to all recipients. This is true even when the sender is interrupted by a signal, killed, lacks memory or is blocked because of the flow control. I don’t want a message to be partially delivered to some recipients. When the system call sendmsg() returns an error (such as EAGAIN), it is guaranteed that nobody received the message.

Ordering

When several senders are sending messages concurrently, the recipients need to receive messages in the same order. Here is a scenario I want to avoid:

Message ordering

A and B are sending one message concurrently to recipients C and D. Without proper locking, the recipients could receive the messages A and B in a different order. My patches take care of this and the test suite checks that messages are received in the same order by all recipients.

Flow control

When a reader is too slow to consume messages from its receiving queue, the receiving queue could be full. There is several ways to manage this situation:

  • Have infinite sized receiving queues. This is not really an option, is it?
  • Drop messages, either silently or with a notification to the recipient (”you have lost some messages”). This is the correct semantic for udev. Netlink sockets notifies recipients about lost messages with ENOBUFS in recvmsg(2).
  • Block the sender. The sender will block or send() will return EAGAIN with non-blocking sockets. Poll() or select() will tell when a message can be delivered again.
  • Disconnect the slow recipient
  • Disconnect the spammy sender

The correct solution for D-Bus is not trivial. This is not a new problem: even without multicast Unix sockets, dbus-daemon already has the same problem. Discussion in bug #33606. The current implementation of multicast Unix sockets either drops messages silently or blocks the sender, depending on the setting of the multicast group.

Wednesday, 15 September 2010

D-Bus in the kernel, faster!

In the last few weeks at Collabora, I worked on having D-Bus directly in the Linux kernel to improve its performances. I based my work on what Ian Molton did. For now it’s just a prototype, but some benchmarks I did show a good performance improvement.

The cost of context switches

When an application sends a message on the bus to another application, the message is first sent to dbus-daemon through an Unix socket. The kernel copies the message to the receiving queue of dbus-daemon and dbus-daemon wakes up. Then dbus-daemon adds the sender field in the header of the message, and sends it to the recipients according to the destination field and the match rules (usually one recipient but there could be more for signals or in case of eavesdropping).

D-Bus message transmission with dbus-daemon

D-Bus message transmission with dbus-daemon

So dbus-daemon wakes up on every message, it costs a context switch and a memory copy.

Don’t wake up, dbus-daemon!

The idea of D-Bus in the kernel is to deliver the message directly to the right recipients without waking up any intermediary process. We added a new kind of socket, “AF_DBUS”, that behaves in a similar way to Unix sockets. Every application using DBus would need to use the new socket type, but that just means changing libdbus and the few other libraries for D-Bus.

Introducing AF_DBUS

Introducing AF_DBUS

The kdbus kernel module reads all the messages and check for the messages “Hello“, “NameAcquired“, “NameLost” and “AddMatch” to get to know the unique names, the well-known names and the match rules. Then it is able to deliver the messages only to the right applications and shortcut dbus-daemon. If there are several recipients, the message is still memcpied only one time thanks to skb_clone().

Introducing AF_DBUS

AF_DBUS does not wake up dbus-daemon unnecessarily

The prototype still uses dbus-daemon for authentication and D-Bus activation. The bus driver org.freedesktop.DBus is still implemented in dbus-daemon.

New performances

The first benchmark is dbus-ping-pong. It is a simple tool to call a D-Bus method and wait for the reply 10000 times. I tried it both on a KVM/i386 virtual machine and on a N900/arm.

  • KVM/i386, without kdbus: 3.887s
  • KVM/i386, with kdbus: 2.085s (x1.8)
  • N900/arm, without kdbus: 28.00s
  • N900/arm, with kdbus: 9.23s (x3)

I tried Adrien Bustany’s benchmark tool designed with Tracker in mind. My test is on KVM/i386 with #define CHUNK_SIZE 8 in order to have D-Bus messages of 6905 bytes (the kdbus prototype has a limitation at the moment: messages bigger than SKB_MAX_ALLOC or 8kB are still delivered to dbus-daemon so it does not bring better performances).

  • KVM/i386, without kdbus: Fetched 300000 rows in 32874ms
  • KVM/i386, with kdbus:Fetched 300000 rows in 24368ms (26% faster - x1.35)

I also tested how long does a N900 take to connect to Jabber and show my contacts’ presence on the desktop widgets. I measured the time manually but ran enough tests to be sure it is consistent.

  • N900/arm, without kdbus:  avg 11.87s
  • N900/arm, with kdbus: avg 10.56s (x1.12)

Sources

Keep in mind this is only a proof-of-concept. It is not ready for merging and has limitations, including security ones. However,  I managed to use kdbus for both the system and session bus on a N900 and, with just a few hacks, everything worked fine.