[DGD] Changing connect() (network package)

Shentino shentino at gmail.com
Wed Jan 2 23:22:54 CET 2008


> Named pipes and named sockets are among the exceptions.  SIGTERM
> probably <should> interrupt a blocking read or write on a named pipe,
> from DGD's point of view.

OK, so the signal comes in, the call is interrupted, and DGD merely
sets a flag to respond to later (at least, that is how I read the
sources).

What now?  Does DGD simply error("File operation interrupted") if LPC
tries to read from something that happens not to be a plain file?
You'd still need to check for an EINTR to produce a meaningful error
message.  Unless of course you wanted to raise a generic error.  An
interrupted system call by its very nature is a transient event.

> Have you ever tried to interrupt a command running on a networked
> filesystem that hung?

Not really, but if the OS decides to return EINTR (Linux can be
configured this way, for instance), the command should handle that.
Simple, one-shot commands can simply quit and require a manual retry.
But something like DGD doesn't respond well to simply being rerun.
If the OS returns EINTR in this case, then DGD should just try again
instead of letting an interrupted system call throw it off.
Especially if the cause of the hang can be remedied, there's no need
for DGD to go belly up on account of an EINTR that was induced by a
temporary, fixable condition.
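
To make "just try again" concrete, here is a minimal sketch of a read
wrapper that restarts after EINTR.  The name read_retry is
hypothetical and is not taken from the DGD sources; it only shows the
general POSIX idiom, and assumes that retrying is what the caller
wants.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the DGD sources): behaves like read(),
 * but restarts the call if a signal interrupted it before any data
 * was transferred.  Real errors and EOF are passed through unchanged.
 */
ssize_t read_retry(int fd, void *buf, size_t len)
{
    ssize_t n;

    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);

    return n;
}
```

A caller that also wants to act on the signal would check its own
"signal seen" flag after the wrapper returns, rather than treating
EINTR as an I/O failure.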

Suppose that DGD is just booting up, and it's reading driver.c over
NFS, and due to horrendous latency, the OS that DGD is running on
times out (gets tired of waiting).  What if, instead of an I/O error,
the read returns EINTR?

I think that object compilation during bootup should at least be
checking for EINTR, even if nothing else is.

NFS can cause EINTR at least with Linux.  I dunno about other OSes,
but simply ignoring EINTR would have the side effect of outlawing
anything other than local file access.

Plus, anybody in the same session as DGD can send DGD a SIGSTOP,
which btw cannot be blocked.  Should I be able to hose DGD's code just
by telling it to stop?  SIGSTOP could legitimately be used to halt
DGD, and a SIGCONT can be used to start it up again.  Having DGD or MP
screw up after a SIGSTOP just because that happened to interrupt a
system call is just plain silly.

Rare is not impossible where I come from.  I've been burned too many
times by freakish coincidences to trust Murphy to keep his grubby
fingers out of my circumstances.

A practical example of this theoretical NFS tangle:

Hypothetically, I have a cluster of diskless slave servers that handle
incoming connections, and each of them is running a copy of DGD.
They're diskless so that I can have a single master server that has all
the disk storage, and furthermore if someone hacks the slave server,
the master server is protected because it exports the mud source files
as read-only.  If a slave server is compromised, I just reboot it.

All of these slaves have the mudlib mounted via NFS...and perhaps the
dumpfile is itself stored on the master server.  Ah, an interrupted
system call while writing the dumpfile could wreck things...
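
For the dumpfile case, the usual defensive idiom is a loop that keeps
writing until the whole buffer is out, resuming after both EINTR and
short writes.  This is only a sketch; write_all is a hypothetical
name, and I am not claiming this is how DGD writes its dumpfile.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the DGD sources): write the entire
 * buffer to fd.  An interrupted call is restarted, and a short write
 * continues from where it stopped.  Returns 0 on success, or -1 with
 * errno set if write() fails for a reason other than EINTR.
 */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    ssize_t n;

    while (len > 0) {
        n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;       /* transient: try the same chunk again */
            }
            return -1;          /* real error: let the caller decide */
        }
        p += n;
        len -= (size_t) n;
    }
    return 0;
}
```

With something like this, a signal in the middle of a dump costs one
extra system call instead of a truncated file.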

> This won't always work because a signal at the wrong time will not
> be seen until later.  When there is no normal network input, DGD
> will sleep until the next callout is due; when there is no callout
> and no input from connections, DGD will sleep forever.  An interrupt
> just before it starts that long wait will not be seen until whenever
> the next execution round completes.

Er, if my mud suddenly decides to go idle, and it has live state that
hasn't been state-dumped yet, wouldn't this cause data loss if
/sbin/init decides to send a SIGTERM?

Usually when /sbin/init sends out SIGTERMs, it's a warning to the tune
of "You are no longer allowed on the system.  Shut yourself down soon,
or I will forcibly terminate you", followed through with a series of
SIGKILLs to any processes that didn't pay attention to the SIGTERM.

Should DGD be in the unfortunate position of not having any callouts
scheduled, and it's a slack time of day with no network activity, this
would mean that a well-meaning SIGTERM would go blissfully ignored,
and /sbin/init, noticing that its fair warning was utterly ignored,
would throw a fit and deliver a data-nuking, unblockable SIGKILL.
This has happened to me once before on my own machine, in fact.  It
was very irritating.

TBH, this scenario sounds more like a bug to me, at least if I
understood you right.
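
For what it's worth, POSIX has a standard way to close the "signal
arrives just before the long wait" window: keep SIGTERM blocked while
running, and let pselect() unblock it atomically for the duration of
the wait.  A minimal sketch follows.  The names got_term, term_setup
and wait_for_input are hypothetical, and this is not DGD's actual
main loop; it only assumes a single descriptor to watch.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>

static volatile sig_atomic_t got_term = 0;

static void on_term(int sig)
{
    (void) sig;
    got_term = 1;               /* only set a flag; act on it later */
}

/*
 * Install the handler and block SIGTERM for normal execution.
 * *waitmask receives the mask to use while waiting (SIGTERM open).
 */
int term_setup(sigset_t *waitmask)
{
    struct sigaction sa;
    sigset_t block;

    sa.sa_handler = on_term;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;            /* no SA_RESTART: the wait must wake up */
    if (sigaction(SIGTERM, &sa, NULL) < 0) {
        return -1;
    }
    sigemptyset(&block);
    sigaddset(&block, SIGTERM);
    if (sigprocmask(SIG_BLOCK, &block, waitmask) < 0) {
        return -1;
    }
    sigdelset(waitmask, SIGTERM);
    return 0;
}

/*
 * Sleep until fd is readable or SIGTERM has been seen.
 * Returns 0 if fd is readable, 1 on SIGTERM, -1 on a real error.
 */
int wait_for_input(int fd, const sigset_t *waitmask)
{
    fd_set rd;
    int r;

    for (;;) {
        if (got_term) {
            return 1;
        }
        FD_ZERO(&rd);
        FD_SET(fd, &rd);
        /* A SIGTERM that arrived while blocked is delivered here. */
        r = pselect(fd + 1, &rd, NULL, NULL, NULL, waitmask);
        if (r > 0) {
            return 0;
        }
        if (r < 0 && errno != EINTR) {
            return -1;
        }
    }
}
```

Because the signal stays pending while blocked, a SIGTERM that lands
between the flag check and the wait is not lost: pselect() returns
with EINTR right away and the loop sees the flag.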



With due apology for any bluntness, I honestly think "trust the OS not
to signal me" is a piss poor excuse not to check for EINTR.  From what
I read in various manpages, checking for EINTR is a Good Thing to do,
even if it's a pain in the arse at times.  IIRC, "failing to check for
EINTR" is listed in a troubleshooting FAQ in my .info docs.


