Saturday, 22 January 2011

Concepting a new IPC system

This is going to be a long post, and probably obscure to anyone not familiar with D-Bus. When I started thinking about this idea, I decided that I wouldn't write about it until I had some code ready; but a recent thread on the D-Bus mailing list urged me to publish my ideas, in the hope that some people might be able to see through all this vapourware and possibly (though here maybe I'm hoping for too much) encourage my employer to let me work on this as an R&D project. It's about an IPC system with functionality somewhat similar to D-Bus's, but whose main distinguishing features would be:

  • low latency
  • local only
  • message data passing happens via shared memory
  • simple
  • lock-free
  • match rules are tag lists
  • peer-to-peer, with central server acting as a baby sitter :-)
  • client API as compatible with libdbus as possible

Well, if any of the points above captured your curiosity, read on! I hope it won't be too disappointing in the end. Oh, before we continue, let's give it a name: let's temporarily call it T — once (and if ever) this project sees the light we'll surely find a better name. So, let's walk through some of the concepts, and hope not to lose too many readers along the way. :-)

The big picture

Every client is connected to the T-daemon via a socket. The T-daemon maintains the list of registered service names, monitors their lifetime, maintains the match rules, and very little more. It doesn't come into the picture when a message is passed between two clients, not even when the message has multiple recipients. All the information about the active services which is needed to route a message (the list of services, their addresses, and the match rules) resides in shared memory blocks maintained by the T-daemon.

The service name concept from D-Bus lives on in T almost unchanged: service names are unique across the net, a client can register any number of them at any time, and their appearance and disappearance from the net is advertised by the T-daemon with messages to all interested clients (but in a slightly finer-grained way than D-Bus does). Service names are typically used to route method calls.

Normally, messages are created in shared memory blocks, and a very short message containing the location of the block is sent into the receiver's FIFO (or an alternate transport chosen by the receiver, such as a POSIX message queue). Messages are tagged with a list of labels, specified by the sender. On the other side, clients register match rules with the T-daemon, specifying which labels they are interested in (or, better said, one client can specify multiple sets of labels). A match happens if one set of labels defined in a rule is a subset of the message's labels. Shared memory areas used for message data passing are not reference counted, and might not always be explicitly freed by the clients; instead, the T-daemon runs a garbage collector which will unlink the oldest ones.

If all this seems cloudy, hairy and dark at the same time, it's because I described it that way. :-) In the next paragraphs I'll try to cover all these concepts in greater detail, at least to the level of detail that got me confident about T's feasibility. But it's very likely that I'll forget to write about some issues which I considered a long time ago; so, in case some parts of this idea seem unfeasible to you, please let me know in a comment.

Tags and match rules

A tag, or label, is a string. A match rule is an unordered list of tags, plus an optional sender. If all the tags listed in a match rule are also listed in a message, then the message is delivered to the client which registered that rule. For the sake of my fellow D-Bus expert readers, here is how I would convert a D-Bus match rule into a T match rule: "type='signal',sender='org.freedesktop.DBus',interface='org.freedesktop.DBus',member='Foo',path='/bar/foo',destination=':452345.34',arg2='bar'" would become a T match rule listing these tags (one tag per line):

type='signal'
interface='org.freedesktop.DBus'
member='Foo'
path='/bar/foo'
destination=':452345.34'
arg2='bar'

and having org.freedesktop.DBus as sender.
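In code, the matching semantics boil down to a subset test. Here is a minimal sketch in Python; the function name and data shapes are my own invention for illustration, not a T API:

```python
def rule_matches(rule_tags, rule_sender, msg_tags, msg_sender):
    """A rule matches when every tag it lists is carried by the message,
    and, if the rule names a sender, the senders coincide too."""
    if rule_sender is not None and rule_sender != msg_sender:
        return False
    return set(rule_tags) <= set(msg_tags)

# A message tagged with both interfaces A and B reaches clients
# listening on either interface, with a single send:
msg_tags = {"type='signal'", "interface='A'", "interface='B'"}
assert rule_matches({"interface='A'"}, None, msg_tags, None)
assert rule_matches({"interface='B'"}, None, msg_tags, None)
assert not rule_matches({"interface='C'"}, None, msg_tags, None)
```

This is the whole matching model; everything else in the design is about where the rules live and who reads them.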

This way of handling match rules is simpler and more powerful than D-Bus's: for instance, with D-Bus, if you want to send a message to clients listening on interface A or on interface B, you need to send the same message twice, once per interface (which is what I had to do in libaccounts-glib, for instance). With T, you could just tag your message with both A and B, and get it delivered to both clients at once. This taglist form of match rules would also cover the functionality offered by the argX and argXpath components of D-Bus matches, which are, uh, a bit hackish. :-) Moreover, the tagged approach would allow for finer-grained matches than what D-Bus currently provides with its NameOwnerChanged signal; to explicitly mention a case which forces D-Bus clients to listen broadly to this signal, I recall the Telepathy ChannelDispatcher service, which needs to keep track of service names matching org.freedesktop.Telepathy.Client.* — and, not being able to install such a rule, gets woken up whenever NameOwnerChanged is emitted (at least until argXprefix becomes available). In T, one could just install a match rule with the labels T.NewService and org.freedesktop.Telepathy.Client, and be done with it (more about this case in the paragraph "Registering a service name").

Messaging protocol

Clients in the T network exchange messages in a p2p way. The sender creates a shared memory area and writes the contents of the message into it. Once it's done, it sends the handle of this area to the destination(s), using the transport mechanism which each destination requested when registering its service or match rule. In human-readable form, we could imagine the T-address of a client to be fifo://my_fifo or mq://my_message_queue (or something else). The receiver reads the shared memory handle from the input channel, mmaps the area in read-only mode, and unmaps it when it doesn't need it anymore. The memory area is unlinked only at a later stage, by the garbage collector running in the T-daemon: the handle of the memory area has been written by the sender into another memory page which the T-daemon handed to the client when the client first entered the T network. So, the T-daemon knows about all memory areas created by any client, and can go through them and delete the old ones (the minimum validity time might even be chosen by the sender).
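As a rough single-process illustration of the handle-passing scheme, here is a sketch using Python's stdlib wrapper around POSIX shared memory, with a plain in-process queue standing in for the receiver's FIFO; names and sizes are made up, and the explicit unlink at the end is done by hand where T would leave it to the garbage collector:

```python
from multiprocessing import shared_memory
from queue import Queue

fifo = Queue()  # stands in for the receiver's FIFO / message queue

# Sender: write the message body into a shared memory block, then
# send only the short handle (name + size) through the transport.
payload = b"hello from T"
block = shared_memory.SharedMemory(create=True, size=len(payload))
block.buf[:len(payload)] = payload
fifo.put((block.name, len(payload)))  # the "hundred bytes" on the wire

# Receiver: read the handle, map the block, copy the data out, unmap.
name, size = fifo.get()
view = shared_memory.SharedMemory(name=name)
data = bytes(view.buf[:size])
view.close()

block.close()
block.unlink()  # in T, unlinking would be the T-daemon GC's job
assert data == b"hello from T"
```

The point of the design is visible even in this toy: the transport only ever carries a tiny handle, so its throughput never depends on message size.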

To determine what destinations should be reached by a message, a logic like this might be applied:

  1. if the message has a destination name set on it (like in a D-Bus method), the message is delivered to it; all destinations, along with their T-address, are found on a shared memory block maintained by the T-daemon.
  2. if the message has a sender name set on it (like in a D-Bus signal), the sending client reads the table of match rules set for this sender name — which, needless to say, resides on a shared memory area named after the service name and maintained by the T-daemon — and computes the list of destination addresses.
  3. the sending client reads the table of generic match rules: this is another table maintained by the T-daemon, consisting of all match rules which do not have a sender set.

So, in a T net with N registered service names, we have N+1 shared memory objects holding match rules: one for each service, plus one for the rules which are not bound to a service. When sending a message, the list of the recipients is composed by the union of the recipients found according to the three rules above (so, in the worst case, three tables have to be read).
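The three-step lookup above could be sketched like this; the table shapes and names are illustrative assumptions, not a defined format:

```python
def compute_recipients(msg, services, per_sender_rules, generic_rules):
    """Union of recipients from the three lookups: explicit destination,
    rules bound to the sender's service name, and sender-less rules."""
    recipients = set()
    # 1. explicit destination name (method-call style)
    if msg.get("destination") in services:
        recipients.add(services[msg["destination"]])
    # 2. rules installed against the sender's service name (signal style)
    for tags, addr in per_sender_rules.get(msg.get("sender"), []):
        if set(tags) <= msg["tags"]:
            recipients.add(addr)
    # 3. generic rules with no sender set (monitors, sniffers)
    for tags, addr in generic_rules:
        if set(tags) <= msg["tags"]:
            recipients.add(addr)
    return recipients

services = {"org.example.Svc": "fifo://svc"}
generic_rules = [({"type='signal'"}, "fifo://monitor")]
msg = {"destination": "org.example.Svc", "sender": None,
       "tags": {"type='signal'"}}
print(compute_recipients(msg, services, {}, generic_rules))
```

Since each lookup only reads tables, the sending client can do all of this without ever talking to the T-daemon.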

Registering to T

A T client enters the T net in a similar way as it enters D-Bus: it connects to the local socket of the T-daemon and after some pleasant handshaking it is given the handles to the following shared memory areas:

  • the table holding the list of the connected clients, with the service names they registered and their T-address.
  • the table holding the match rules which are not bound to a service.
  • an empty table where the client will write the handles of the shared memory areas it creates for the IPC messages.

The client is also registered in the connected clients table, with a unique service name assigned to it. The client keeps the socket connection to the T-daemon open for as long as it intends to be part of the network. This connection is used to register match rules (only the T-daemon is allowed to write to those shared memory tables) and service names.

Registering a service name

Clients can register any number of service names, at any time, by issuing a request to the T-daemon. The request will contain the desired service name (which must be unique), the T-address which will listen to incoming messages, and a list of labels. If the T-daemon accepts the request, it will prepare a message containing all these labels, plus a predefined label whose meaning is “a new service has been registered”, and will send it to all the clients which have set up a matching rule. The same labels (plus another predefined one) will be set on a similar message which will be sent when the service is unregistered.

In addition to the above, the T-daemon will tell the client the handle of the shared memory area where the T-daemon will write the match rules that the client must read when emitting signals from that service name.
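A small sketch of the notification broadcast on (un)registration; the label names T.NewService and T.ServiceGone are invented here for illustration (the text above only speaks of "a predefined label"):

```python
def registration_notice(service_labels, registered=True):
    """Labels carried by the T-daemon's broadcast when a service name
    appears or disappears: the service's own labels plus one
    predefined event label (names invented for this sketch)."""
    event = "T.NewService" if registered else "T.ServiceGone"
    return set(service_labels) | {event}

# The Telepathy ChannelDispatcher case: a rule listing these two labels
# matches exactly the registrations it cares about, and nothing else.
notice = registration_notice({"org.freedesktop.Telepathy.Client",
                              "org.freedesktop.Telepathy.Client.Empathy"})
assert {"T.NewService", "org.freedesktop.Telepathy.Client"} <= notice
```

Because the notification is an ordinary tagged message, no special-purpose signal like NameOwnerChanged is needed.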

Lock-free shared memory handling

During its lifetime, a given shared memory area can have many readers, but only one writer. All the match rule tables and the active clients table are maintained by the T-daemon, which is the only process allowed to modify them. All records are fixed size, and have one field which tells whether they are empty, filled or deleted. Records can only be added or marked for deletion, but never reused. When the table needs to be compacted, the T-daemon creates a new shared memory area with a table holding only the valid records from the original table and, once the new table is ready, its handle is written into a special header of the old table, and after this the old table is flagged as no longer valid, and unlinked. So, when clients need to read data from one table, they would follow this logic:

  1. Check if the table is marked for deletion. If it is, get the handle of the new table and repeat this step; if the new table does not exist yet, get the new handle by asking the T-daemon directly.
  2. Find your data, ignoring all records which are not filled or which are marked for deletion.
  3. (under consideration — I don't think this is necessary) After reading the data, check if the record and the table are still valid. If not, go back to step 1.
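The reader's side of this protocol might look like the following sketch, with a table modeled as an ordinary dict and next_handle playing the role of the special header pointing at the compacted copy (all names here are illustrative):

```python
EMPTY, FILLED, DELETED = 0, 1, 2  # per-record status field

def read_table(tables, handle, key):
    """Follow forwarding headers to the current table, then scan it,
    skipping records that are empty or marked for deletion."""
    while True:
        table = tables[handle]
        if table.get("next_handle") is not None:
            handle = table["next_handle"]  # table was compacted; follow
            continue
        for status, k, value in table["records"]:
            if status == FILLED and k == key:
                return value
        return None

tables = {
    1: {"next_handle": 2, "records": []},  # old, superseded table
    2: {"next_handle": None,
        "records": [(FILLED, "org.example.Svc", "fifo://svc"),
                    (DELETED, "org.example.Old", "fifo://old")]},
}
print(read_table(tables, 1, "org.example.Svc"))
```

Since records are append-only and never reused, a reader racing with the writer sees either the old table or the forwarding header, never a half-updated record.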

To keep this efficient and to reduce memory waste, string data will be accessed by index, with the actual data residing in another shared memory area. For instance, the labels and the service names will stay in their own tables, consisting of one column with the integer hash, one column with the unique numeric ID of the string (which will remain constant over time) and one column with the actual string (a maximum length for labels and service names will be enforced). So, clients which need to refer to a string (to tag a message, for instance) will always use its unique integer ID. The concept is the same as GLib quarks, except that it lives in shared memory. Match rules would also have a limit on the number of labels, so that their records can be of a fixed size.
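A quark-style string table, sketched here in ordinary process memory rather than shared memory (the real thing would add the precomputed-hash column and fixed-size records; a Python dict plays that role in this sketch):

```python
class StringTable:
    """Interns strings to stable integer IDs, like GLib quarks.
    IDs never change once assigned, so tags on messages can be
    lists of small integers instead of strings."""

    def __init__(self):
        self._by_string = {}  # string -> ID (stands in for the hash column)
        self._by_id = []      # ID -> string

    def intern(self, s):
        if s not in self._by_string:
            self._by_string[s] = len(self._by_id)
            self._by_id.append(s)
        return self._by_string[s]

    def lookup(self, sid):
        return self._by_id[sid]

labels = StringTable()
a = labels.intern("interface='A'")
assert labels.intern("interface='A'") == a  # same string, same ID
assert labels.lookup(a) == "interface='A'"
```

With this in place, the subset test on a message's tags becomes a comparison of small integer sets, which keeps match-rule records fixed-size and cheap to scan.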

The most delicate case concerns the shared memory areas where clients store the handles of the memory areas they allocated for their IPC messages. These tables are maintained by the clients (one table per client, typically), and the T-daemon accesses them following the steps described above. In addition to the status flag and the data (the T-handle of the resource), each record would have a field holding its expiry time (computed from the monotonic clock). The T-daemon GC will ignore records which haven't expired yet; for the expired ones, it will unlink the shared memory area and inform the client (via the socket connection) of the deleted records, so that the client can update the table. But this is just one option: I haven't fully decided on how to implement this GC accounting. Note that a Linux implementation might simplify things here, and have the T-daemon directly inspect /dev/shm/ to find the stale memory areas and unlink them, without the need for the clients to bookkeep them.
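The GC pass over one client's handle table could be sketched as follows; unlink here is a stand-in for shm_unlink(), and the accounting details are, as said, undecided:

```python
import time

def collect_expired(records, unlink, now=None):
    """Scan (handle, expiry) records against the monotonic clock;
    unlink the expired areas and return the surviving records."""
    now = time.monotonic() if now is None else now
    survivors = []
    for handle, expiry in records:
        if expiry <= now:
            unlink(handle)  # area is gone; client is told via the socket
        else:
            survivors.append((handle, expiry))
    return survivors

# With an injected clock, the behaviour is easy to see:
unlinked = []
records = [("/t-msg-1", 10.0), ("/t-msg-2", 100.0)]
remaining = collect_expired(records, unlinked.append, now=50.0)
assert remaining == [("/t-msg-2", 100.0)]
assert unlinked == ["/t-msg-1"]
```

Because expiry is per-record, a sender that wants its messages collected sooner can simply write a shorter validity time.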

Performance considerations

Here are a few points I thought worth considering when trying to evaluate the possible performance of a T implementation.

Number of open file descriptors

The first concern you may have is the number of file descriptors used in the T network. First of all, we can forget about the descriptors of the shared memory areas, because they can safely be closed as soon as the memory area has been mmap-ed. Secondly, the number of file descriptors in a net consisting of N clients is directly proportional to N (because every client has a socket connection to the T-daemon), plus all the FIFOs. In the typical case, every client will just create one FIFO for its incoming channel, and other clients can write to it (though of course a client could open several different incoming channels for different match rules — but I'm not even sure I want to expose this via the client API). In the worst case, at a certain time the number of open descriptors might be O(N²), which doesn't sound that good. This can be reduced back to O(N) if clients don't keep their communication channels open after having sent a message. One reasonable compromise would be to always close the descriptor after having sent a message, except in the case where the message had a destination name set (the analogue of a D-Bus method call); this file descriptor could be kept open until we send a message to a different destination. If the client API offers a concept of a “proxy to a remote service”, similar to the glib DBusGProxy, then maybe the descriptors serving the T-address of the proxy could be kept open as long as the proxy object is alive.

Transport for IPC communication

For the IPC signaling, I've already mentioned the T-address: this could be a POSIX FIFO, message queue, UNIX domain socket or any other transport which can be used to deliver a hundred bytes to another process. Even though I didn't research this deeply, I would expect the POSIX FIFO to be the simplest to use, so it could be the default. POSIX message queues are said to be faster and provide lower latency, but on the other hand they seem to have some limits imposed on their availability (in Linux, the default number of queues is 256, and this is system-wide). One nice possibility is to use message queues only for processes which need lower latency: so, for instance, an implementation of Telepathy on a mobile phone could request the phone UI, the telepathy-mission-control and the SIM connection manager to use a message queue as their T-address, to achieve a better responsiveness on cellular call handling.

Number of match rules

Match rules are not all stored in a single table: for every service name there is a set of rules which other clients have installed for it (and there's one table for the rules without a service name, which would only be used by programs like dbus-monitor); I wouldn't expect to find many items in these tables, so they could be read linearly. Moreover, we might add to the client API some flag to allow clients to request that a message be sent to its destination only, ignoring all sniffers.

Memory usage

The data of messages exchanged in the T network reside on shared memory blocks, with one block per message. I considered the possibility of using an allocator to optimize the usage of the shared memory, so that the same block could be segmented and used for different messages, but at least for the initial implementation I'd leave this out — it would introduce some complexity, for uncertain benefits.

Another point of concern is related to the garbage collector: if messages expire only after a timeout which could be of the order of several seconds (D-Bus default timeout is 30 seconds, IIRC, and something similar would probably be expected from T), one could easily get into the situation where processes are creating lots of messages in a rapid succession, which would be deleted only after several seconds. But there are several possible solutions for this, or at least to limit the damage: all the messages delivered to only one recipient (that is, most of the messages in a typical D-Bus session) could be marked with a special flag that tells the recipient that no one else needs access to the memory handle, so that it can unlink it as soon as it has opened it. And if a message is delivered to more than one recipient, it could be delivered to the T-daemon as well, which could remove the handle after a shorter timeout than the actual expiration.

If and when an implementation of T will be ready, we'll be able to make more considerations and optimizations. For instance, for smaller messages addressed to only one recipient it might be more convenient to pipe them directly into the transport, without passing the shared memory handle.

Security

Or, I'd rather say, insecurity. Exposing all the data, as well as the routing tables, in shared memory is problematic. I'm not concerned about buggy clients: the T client library will be careful enough to open the shared memory areas in read-only mode, so that a process cannot corrupt other processes' data. But anyone could write a malicious client which could do much more harm than a malicious client in D-Bus. This can be mitigated by running the T-daemon as a different user, so that the shared memory areas containing the client list, the tags and the match rules can be trusted. But for data exchanged among clients on shared memory segments, I don't see a clear solution (unless all clients run as different users, but that's quite impractical). One thing that I miss is the possibility for a process to create a file that only this process can write, while all other processes (running under the same user) can only read. This is because the process which created a file isn't any better, in the “eyes” of the Linux kernel, than any other process running with the same user and group. Even if the creator chmods the file to read-only, other processes can chmod it back to read-write. I really hope that someone tells me that I'm wrong on this.

For secure communication, a protocol could be created to allow two clients to open a socket connection between each other, so that data exchanged through it could not be seen or modified by a third process. Ideally, this could happen in an almost transparent way to the clients, which would just need to set one flag when creating one proxy object (if the API will expose such concept), and magically their messages will be routed through it.

Conclusion

Thanks for reading through. :-) I hope I managed to explain all the concepts clearly, and I'm really sorry if I didn't: anyway, I'll follow the comments to this blog post and if necessary provide better explanations. I'm especially curious to hear opinions about the feasibility of this, or whether you think that for some reason T is broken by design. I'm sure that during the implementation one will find more problems, but I cannot think of anything that would be unsolvable or that would force me to rethink this completely. Please don't ask me for numbers or performance measurements; like all good mathematicians, I don't take interest in numbers. ;-) Seriously though, I don't have any; if someone happens to have ready data about the performance of FIFOs, sockets and message queues, that could be of help. If you ask me whether this will be faster than D-Bus, I have no doubt about that; but that's just because of the different nature of the network, and because we are trading off a few features which D-Bus has and which I don't consider so important (IPC with remote machines, security against malicious applications). For me, the biggest question mark hangs over whether this will ever be implemented. :-)


11 Comments:

Blogger Adrien Bustany said...

I wonder if SELinux could be used to control shared memory access... This is a completely wild idea though; I have no idea whether you can adapt the SELinux policy dynamically, or what the performance impact would be. But combined with systemd, which would start every process in its own cgroup, we would just need a rule saying "a process can't write to an SHM space outside its cgroup".

23 January 2011 12:06  
Blogger Adrien Bustany said...

PS. However, I don't think writing a new DBus is necessarily a good idea... Saying "DBus is slow, let's write something new" is like throwing the baby out with the bathwater. Let's profile our libs, pinpoint what is slow, and see what we can do!

23 January 2011 12:44  
Blogger Mardy said...

I had only a very quick glance at SELinux, but I didn't find a way to implement discretionary locking. But thanks for the hint, it's another thing to investigate.

About throwing the baby out with the bathwater: this is not what I'm trying to do. My claim is that, given two optimal implementations of the D-Bus protocol and of T, T can be faster (besides having a different feature set, one which matches more closely the use-cases that are important to me). This post is just a dump of my current ideas, but there's lots of room for optimization, while in D-Bus there isn't much of it without extending the protocol to let clients switch to p2p connections as transparently as possible.

The main fault of the D-Bus protocol (and client API) is that it's just too convenient: people start using it, and quickly abusing it. Surely, you can tell application writers that they are misusing D-Bus and they should switch to sockets, shared memory or other things, but D-Bus is just simpler, more reliable, and perfectly matches your OO programming style. :-)
With T, I'd like to have something which is as convenient as D-Bus, but also suitable for more use-cases.

23 January 2011 14:06  
OpenID internet said...

remind me again why we need anything like dbus? what do we gain beyond "not having to open sockets"?

24 January 2011 04:00  
Blogger Stanley Lieber said...

http://www.plan9.bell-labs.com/plan9

24 January 2011 04:35  
OpenID kamstrup said...

Regarding the question of "why rewrite DBus", I second your comment from yesterday, Mardy: DBus is just too damn convenient to use :-) The consensus among the "experts" is that DBus is meant to be a "control bus", not a bit-blasting fire hose. And there is value and sense in that mindset, I believe.

I have several places where I am abusing DBus, fully knowingly, and that makes me feel bad; but development wise I just don't have the time to do it the right way. If I had a DBus-like object oriented message passing system optimized to be this low latency fire hose then I'm pretty sure I'd use it.

Regarding the T design you outline - I think there is some good stuff in there. At least at a glance it looks like it can indeed work in a lockless manner and I like how you use a GC in the T-daemon to simplify the memory management a lot.

Where I get my doubts is in all the corner cases. In order to really make this system run fast you need a lot of special casing and non-trivial fiddling. Ideally this can be confined to the T-daemon, which makes it a bit nicer (analogous to replacing the GC in a Java VM, for example); but if you can't make it at least an order of magnitude faster than p2p DBus then I am not sure it'll be worth it.

Also, I have a gut feeling that the system will have some very unfortunate worst-case scenarios... I've seen many cases where badly written apps produce 1000+ messages on the bus within the same second. So that's a case you must cater for.

Security-wise I am not that worried - while DBus might be more secure in theory it's all about how apps use it. I'm pretty confident that any system will be utterly screwed if someone with malicious intent gets full access to the system or session bus.

All that said; I think that from an academic viewpoint this'll be a most fun and rewarding exercise. And with the academia hat on who cares about the "why are you doing this?" questions. This is bloody research baby! :-)

If you do decide to continue with this, do keep us posted. It's definitely something I can see myself diving into as well. It sounds like a lot of fun.

24 January 2011 13:10  
Blogger Xavier Claessens said...

I think kernel-dbus is solving most of your issues, and keeps compatibility with current DBus API.

I don't know much about it, but Alban Crequy is working on that.

24 January 2011 14:31  
Anonymous said...

I'd start this one by making shared memory D-Bus and later fix D-Bus to have better/improved match rules.

25 January 2011 14:41  
Anonymous said...

There is absolutely no justification to start a new system for any perceived deficiency of D-Bus. It totally *is* throwing the baby out with the bathwater.

There are definitely two problems being confused in this discussion:
a) D-Bus is "slow"/"bad"/not designed for certain use cases/whatever
b) people keep using it for those use cases.

I suspect we can never solve b) by replacing the IPC system - we have to make it easier for people to debug and determine why they are doing things wrongly. Even with the tools to do that (eg, Bustle) people still keep messing it up, so we should assume b) is some kind of universal constant.

So we should solve a) - and we can go about that in two ways:
1) Write a new system that people should use instead. This doesn't necessarily help current D-Bus users/abusers, and nor does it guarantee that it won't be abused itself.
2) Analyse why D-Bus doesn't serve the cases people use it for, and fix D-Bus to better serve those cases. Whether that's latency, scheduling, performance, etc. This has the benefit that it will also assist all of the current users of D-Bus, as well as the perceived "abusers".

Both of your innovations (a shared memory transport, a different way of handling/parsing/etc match rules) can be tested with the current D-Bus code base, and if they are beneficial, applied there to the immediate benefit of hundreds/thousands of developers, or thousands/millions of users.

The reason D-Bus sucks so much is because nobody works on it. The solution is to work on it, not work on something else. The point of open source isn't to create one thousand crappy text editors - the model only actually works if people co-operate for shared mutual benefit.

26 January 2011 18:52  
Blogger Mardy said...

Both of your innovations (a shared memory transport, a different way of handling/parsing/etc match rules) can be tested with the current D-Bus code base, and if they are beneficial, applied there to the immediate benefit of hundreds/thousands of developers, or thousands/millions of users.

I probably didn't put the due stress on those points which I regard as most important. Shared memory transport is not one of them; instead, the main improvements would come from the tagged match rules (functionality wise) and from bypassing the daemon when exchanging messages among clients (performance wise).
Implementing these would require some bigger changes in D-Bus, especially if the existing functionality needs to be preserved (I'm thinking of remote D-Bus here), though I agree there is no reason that would make these changes unfeasible. I'm anyway starting to consider this possibility -- I'll collect the missing points and try to come up with a development plan which I'll propose on the D-Bus mailing list.

26 January 2011 20:42  
Anonymous said...

The worst design issue in current D-BUS is that it's so easy for any client to DOS the daemon; all you need is a small bug in the client.

If a service stops reading its messages (imagine a client that sends messages to itself), it's the D-BUS server's memory usage that bloats, not any of the clients'.

If some client is sending too many or too large messages, it's D-BUS that collapses under this (the D-BUS implementation is CPU bound instead of IO bound like it should be), not the client(s).

A robust & scalable client-server design is one where the server process does less work & uses fewer resources than any single client process.

27 January 2011 12:43
