[MUD-Dev] I/O Event Handling Under Linux by Richard Gooch
J C Lawrence
claw at under.engr.sgi.com
Thu Jun 25 11:57:47 CEST 1998
The following explains why I've been seeing such poor performance when
I moved my multiplexed IO structure (IO processing by a thread pool
which synchonously handles all connections) to Linux, as well as
detailing several other interesting IO handling concerns. You may not
be worried about this sort of thing now. You will be when you're
attempting to cram several thousand players and bots onto the same
box.
URL:http://www.atnf.csiro.au/~rgooch/linux/docs/io-events.html
--<cut>--
I/O Event Handling Under Linux
Richard Gooch
24-JUN-1998
I/O Event handling is about how your Operating System allows you to
manage a large number of open files (file descriptors in UNIX/POSIX,
or FDs) in your application. You want the OS to notify you when FDs
become active (have data ready to be read or are ready for
writing). Ideally you want a mechanism that is scalable. This means a
large number of inactive FDs cost very little in memory and CPU time
to manage.
Traditional Unix systems provide the select(2) and/or poll(2) system
calls. With both of these you pass an array of FDs to the kernel, with
an optional timeout. When there is activity, or when the call times
out, the system call will return. The application must then scan the
array to see which FDs are active. This scheme works well with small
numbers of FDs, and is simple to use. Unfortunately, for thousands of
FDs, this does not work so well.
The kernel has to scan your array of FDs and check which ones are
active. This takes approximately 3 microseconds (3 us) per FD on a
Pentium 100 running Linux 2.1.x. Now you might think that 3 us is
quite fast, but consider if you have an array of 1000 FDs. This is now
3 milliseconds (3 ms), which is 30% of your timeslice (each timeslice
is 10 ms). If it happens that there is initially no activity and you
specified a timeout, the kernel will have to perform a second scan
after some activity occurs or the syscall times out. Ouch! If you have
an even bigger application (like a large http server), you can easily
have 10000 FDs. Scanning times will then take 30 ms, which is three
timeslices! This is just way too much.
You might say that 3 ms for 1000 FDs is not such a big deal: a user
will hardly notice that. The problem is that the entire array of FDs
is scanned each time you want to go back to your polling loop. The way
these applications work is that after checking for activity on FDs,
the application processes the activity (for example, reading data from
active FDs). When all the activity has been processed, the application
goes back to polling the OS for more FD activity. In many cases, only
a small number of FDs are active at any one time (say during each
timeslice), so it may only take a few milliseconds to process all the
activity. High performance http servers can process hundreds or
thousands of transactions per second. A server that takes 2 ms to
process each active FD can process 500 transactions per second. If you
add 3 ms for FD scanning in the kernel, you now have 5 ms per
transaction. That only gives 200 transactions per second, a massive
drop in performance.
There is another problem, and that is that the application needs to
scan the "returned" FD array that the kernel has updated to see which
FDs are active. This is yet another scan of a large array. This isn't
as costly as the kernel scan, for reasons I'll get to later, but it is
still a finite cost.
Some other operating systems provide a mechanism called I/O completion
ports and some have event queues. These are a mechanism to tell the OS
that you want to be notified when there is activity on a FD. Usually,
when a FD becomes active (i.e. becomes ready for reading or writing) ,
the OS will send a message (a "readiness event") to a "port" (perhaps
another FD such as a pipe). This message contains the FD that has
become active. The one port can have many readiness events from many
FDs sent to it. The key difference here is that you do not need to
pass the OS a massive FD array each time you want to listen for
events: you only need to tell the OS once for each FD that you want to
receive readiness events for that FD. The kernel no longer needs to
scan a massive FD array each time through your polling loop, and nor
does the application. This is an appealing approach, and scales very
well.
Unfortunately, I/O readiness queues are not POSIX. They're not even
Unix. We could add them to Linux, but it would mean that applications
that relied on this mechanism would be unportable, Linux-only. Also,
it would involve significant additions to the kernel, which sets off
my bloat alarms. I would hope that it is possible to develop an
effective alternative that uses standard POSIX/UNIX functionality.
There are improvements we can make for the massive FD scanning
problem. Firstly we can optimise the way the scanning is done inside
the kernel. Right now (2.1.106) the kernel has to call the poll()
method for each file structure. This is expensive. Back in the 2.1.5x
kernels, I coded a better implementation for the kernel which sped
things up almost 3 times. While this requires modifications to drivers
to take advantage of this, it has the advantage of not changing the
semantics we expect from UNIX. Note one other interesting feature of
this optimisation: it centralises event notification, which in turn
would make implementing I/O readiness queues simpler. I'm not sure how
closure of FDs before readiness events are read should be
handled. This could complicate their implementation.
Another solution (which would also benefit from the kernel
optimisation discussed above) is for the application to divide the FD
array into a number of smaller FD arrays, say 10. You then create 10
threads, each of which has a polling loop using its smaller FD
array. So each FD array is now 100 entries long. While this doesn't
change the total number of FDs that must be scanned, it does change
when they have to be scanned. Since most FDs are inactive, not all the
threads will be woken up. Too see how this works, consider the
example where, at any time (say during a single timeslice of 10 ms),
only 5 FDs are active. Assuming these FDs are randomly, uniformly
distributed, at most 5 threads will need to be woken up. These threads
then process the activity and go back to the start of their polling
loops. Where we win is that only 5 threads had to go back and call
select(2) or poll(2). Since they each have 100 entry FD arrays, the
kernel only has to scan 500 FDs. This has halved the amount of
scanning required. The scanning load has gone from 30% to 15% by this
simple change. If you were to instead use 100 threads, you would
still only have at most 5 threads woken up for activity, and hence the
total number of FDs scanned this timeslice would be 50. This takes
down the scanning load to 0.15%, which is negligible.
There is one thing to watch out for here: if you use select(2) in your
polling loop, be aware that the size of your FD array is equal to the
value of your largest FD. This is because select(2) uses a bitmask for
its FD array. This means one of your threads will want to poll FDs 991
to 1000. Unfortunately, your FD array is still 1000 long. What's
worse, the kernel still has to do a minimal scan for all those 1000
FDs. The solution to this is to use poll(2) instead, where you only
have to pass as many FDs as you want to poll, and the kernel scans
only those.
This solution sounds ideal: just create lots and lots of threads. At
the extreme, you create one thread per FD. There is a problem here,
however, as each thread consumes system resources. So you need to
compromise between the number of threads and the FD scanning
load. Also, the more threads you have the more cache misses you
induce, so this is something to avoid as well. Fortunately in this
case most threads will be running nearly the same code at the same
time, so cache pollution should not be a significant problem.
A more advanced solution is to have dynamic migration of FDs depending
on whether they are mostly active or inactive. In the simplest case,
you only have two threads. One which polls mostly active FDs and the
other polls mostly inactive FDs. The thread for active FDs will be
woken up very frequently, but on the other hand will have only a small
number of FDs to scan. The other thread will have to scan a large
number of FDs, but it will only be woken up occasionally. For each FD
an activity counter is kept. When a FD on the mostly inactive list is
deemed to be fairly active, it is migrated to the mostly active
list. A reverse operation occurs for fairly inactive FDs on the mostly
active list.
I favour this solution, since it can be implemented solely in
userspace and is portable to other POSIX systems. I have an existing
software library which has (amongst other things) support for managing
events on FDs. I plan on extending this library to use the above
technique. The library is distributed under the LGPL. Watch this space
for results.
My approach is to squeeze as much performance as we can out of the
existing POSIX/UNIX interface by optimising the kernel and doing
clever things in userspace. We can then evaluate how much of a
bottleneck polling is under Linux. If polling overheads (for very
large numbers of FDs) are kept to within a few percent, I firmly
believe there is no need to I/O readiness queues. If polling overheads
remain above 10%, then we should consider I/O readiness queues.
OK, now I'll get back to another point I raised earlier, and that is
the time taken for the application to scan a list of FDs for
activity. We would like to reduce this time as well, and we can
without much effort at all. Instead of using poll(2) we can implement
poll2(2), a new syscall which is very much like poll(2) but returns an
array of active FDs. So now the application only needs to search a
very small array. This new syscall is based on the principle of not
duplicating work. The kernel has just gone and scanned all these FDs
for us, why not keep a record of that work, rather than doing it again
in the application? Note that the current Linux polling implementation
is so slow that implementing poll2(2) will make little
difference. However, once the optimisations I've proposed are added,
it will make a difference (yes, I've done the benchmarks). Oh, and
before you take my comments on the Linux polling implementation too
far, note that I've benchmarked it against several other OSes, and
Linux came out far better in any case. My point is that we can still
do a lot better.
You might ask, why implement poll2(2) if we have this clever userspace
solution that avoids many of the problems? Well, the point is that it
would speed up the application processing the inactive list of FDs,
and that is always a good thing. However, it may turn out that the
gain from poll2(2) is marginal. You could also argue that since
poll2(2) is non-POSIX, non-UNIX, then why bother with it at all, and
why not simply implement I/O readiness queues? Well, I could say that
poll2(2) is only a small departure from UNIX whereas I/O readiness
queues are more radical. However, this isn't a very strong
argument. I'm waiting to see the results of the userspace solution
before considering poll2(2) further.
A note on the "thundering herd" problem: it's not really of much
interest in this discussion. The "problem" arises because people
attempt to increase concurrency of accepting new connections by having
multiple threads all blocking waiting on select(2). Because the kernel
wakes up all threads, rather than one thread per new connection, we
have a "thundering herd" of freshly woken up threads. These problems
are best solved by treating the accepting of incoming connections as
just another case of I/O management, rather than special-casing them.
Readers may find a recent paper presented at USENIX98 of interest. The
author also has a list of other research on this topic. The paper
essentially describes a mechanism for improving the speed of select(2)
by adding internal state information to the kernel about FD activity.
I invite comments or additions to this document. My purpose is to
explain the various issues involved and serve as a primer for more
debate. I hope to avoid recurring debates on the linux-kernel list
which go over the same ground and either never contribute anything new
to the arguments, or take many messages before something new is
added. If you have a strong technical argument that I've missed, I'll
be happy to add it to this document.
Original: 22-JUN-1998
--<cut>--
--
J C Lawrence Internet: claw at null.net
(Contractor) Internet: coder at ibm.net
---------(*) Internet: claw at under.engr.sgi.com
...Honourary Member of Clan McFud -- Teamer's Avenging Monolith...
More information about the mud-dev-archive
mailing list