[DGD] DGD/MP & OpenMosix revisited?

Mon Feb 9 21:21:39 CET 2004

--- "Felix A. Croes" <felix at dworkin.nl> wrote:
> Greg Lewis <glewis at eyesbeyond.com> wrote:
> > I'd argue that a good quality cluster from a
> > reputable company is now seeing similar MTBF numbers to similar
> > size large machines, at a fraction of the cost.
> 
> I was assuming high quality PC-level components.  I had no
> idea that they were as reliable as you say.

  For any very high-margin part, they tend to be.  J Random PC is still
nowhere near that reliability level, but they're also cheap enough to
replace much more frequently.

>  I take it that's mean time between
> failures for the cluster as a whole, not for each machine
> individually --
> that makes each component in a cluster machine several times as
> reliable
> as each component in a large machine, assuming that the latter has
> fewer components overall.  That's impressive.

  It's not always true, either.  While Greg makes the (valid) point
that the company will have to pay shipping and replacement on an RMA'd
part, as you point out, the parts would have to be several times as
reliable as the equivalent for a server.  That's true when you buy a
"clustering solution", which is absolutely *not* a fraction of the
price of a big server.  Essentially, if you're willing to pay for the
company's clustering software and carefully-selected hardware, you'll
get higher reliability.  That usually involves software failover,
though.  For that matter, large servers tend to be the same way these
days -- you need software failover because large servers tend to
integrate redundant parts to prevent catastrophic failure if, say, the
power supply goes bad.
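
  To put rough numbers on Felix's "several times as reliable" remark,
here is a minimal sketch.  It assumes node failures are independent and
happen at a constant rate, which is my simplification and not something
anyone in this thread measured.  Under that assumption the failure
rates just add up, so the mean time to the first failure anywhere in
the cluster is the per-node MTBF divided by the number of nodes.  The
function names and the sample figures are made up for illustration.

```python
def cluster_mtbf(node_mtbf_hours, nodes):
    """Mean time until the first failure anywhere in the cluster.

    Assumes independent nodes with constant failure rates, so the
    per-node rates simply add up.
    """
    if nodes < 1:
        raise ValueError("need at least one node")
    return node_mtbf_hours / nodes


def required_node_mtbf(target_cluster_mtbf_hours, nodes):
    """Per-node MTBF needed for the whole cluster to reach a target MTBF."""
    if nodes < 1:
        raise ValueError("need at least one node")
    return target_cluster_mtbf_hours * nodes


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    print(cluster_mtbf(50_000, 64))
    print(required_node_mtbf(50_000, 64))
```

  In other words, for a 64-node cluster to match a single machine's
MTBF under these assumptions, each node would need 64 times that MTBF.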

> > > Then there is the software maintenance cost.  All
> > > of these machines
> > > have an operating system which must be installed and
> > > kept up to date.
> > > Any OS problem is potentially multiplied by the number
> > > of machines.
> > > The hardware could perhaps be managed by a single
> > > person, the software can not.
> >
> > Sorry, but I have to disagree with this entirely.  In a
> > cluster it is
> > customary to manage the operating system for the nodes as
> > images.

  Have you checked into the cost of good-quality professional software
to do this for you?  You're welcome to say "use open source", but the
OSS solutions are simply not equal (yet?).

> That's installation and upgrading taken care of.  I have my
> doubts about other maintenance, but I'll just cede the point.

  You're correct that it's a nontrivial undertaking, for the record. 
Clustering is a great idea, and for simple highly-parallel problems
it's a really good idea.  However, it works best for the same things
that a SETI@home solution excels at, and for the same reasons: as you
say, problems that are not highly interconnected.

> > Designing a distributed MUD is certainly a challenge, at
> > least we agree
> > on this one :).  However, this is where a transparent
> > process migration
> > technology like OpenMosix or bproc can help simplify things.
> 
> This point I won't cede.  Some things just don't scale well using
> clusters, and in my opinion that includes MUDs.

  Yup.  I have to agree with Felix here.  Let's assume that
geographical boundaries and things like that are already taken care of
(a highly nontrivial thing).  You still have certain features that are
very difficult.  Nonlocal effects (a switch turns on a light halfway
across the MUD) have a necessary propagation time since you've got
network latency.  Certain events which need to happen simultaneously
don't necessarily happen that way -- for instance, if you have two
gates and no more than one is open at any given time...  Well, more
than one is open at certain given times, generally speaking.  It's very
hard to enforce certain constraints when there's a delay there.
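
  The two-gate case can be sketched as a toy simulation.  Everything
here is hypothetical: two nodes "A" and "B" each own one gate, time
moves in ticks, and a node learns about the other gate only from
messages that arrive `delay` ticks after they were sent.  Each node
follows the obvious rule (open only if you believe the other gate is
closed), and the function reports the ticks at which both gates were
open anyway.

```python
def simulate(delay, open_requests, ticks=20):
    """Return the ticks at which both gates were open at once.

    open_requests maps a node name to the tick at which it is asked
    to open its gate.  Gates never close again in this toy model.
    """
    gate_open = {"A": False, "B": False}      # true state of each gate
    # believed_open[n] is what node n currently believes about the OTHER gate.
    believed_open = {"A": False, "B": False}
    in_flight = []                            # (arrival_tick, receiver, other_is_open)
    violations = []

    for tick in range(ticks):
        # Deliver every message whose latency has elapsed.
        for msg in [m for m in in_flight if m[0] <= tick]:
            in_flight.remove(msg)
            believed_open[msg[1]] = msg[2]

        for node, other in (("A", "B"), ("B", "A")):
            # A node opens its gate only if it believes the other one is closed.
            if open_requests.get(node) == tick and not believed_open[node]:
                gate_open[node] = True
                in_flight.append((tick + delay, other, True))

        if gate_open["A"] and gate_open["B"]:
            violations.append(tick)
    return violations


if __name__ == "__main__":
    requests = {"A": 0, "B": 2}
    print(simulate(delay=1, open_requests=requests))  # news arrives in time
    print(simulate(delay=5, open_requests=requests))  # news arrives too late
```

  With a one-tick delay, B hears about A's gate before its own request
and stays shut.  With a five-tick delay, B opens on stale information
and the "no more than one open" rule is broken from tick 2 onward.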

  If you run your MUD over standard protocols like TCP/IP, the delay
will occasionally, and unpredictably, be *very* large, ten seconds or
more.  That may not seem like much, but when you're going for
"instantaneous", ten seconds can be an eternity.  What if your CD
player suddenly took that long to start playing a song, or respond to
your buttonpress in any way?
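
  One common way a TCP connection ends up stalled for that long is
retransmission backoff: when a segment is lost, the sender waits for a
retransmission timeout and then doubles that timeout after each
further loss.  The sketch below just adds up those waits.  The
1-second starting timeout and the 60-second cap are assumptions taken
from the minimums in RFC 2988; real stacks use different values, so
treat the numbers as illustrative rather than as a measurement.

```python
def stall_after_losses(initial_rto, consecutive_losses, max_rto=60.0):
    """Total seconds spent waiting if one segment is lost N times in a row.

    The timeout doubles after every loss (exponential backoff) and is
    capped at max_rto.
    """
    total = 0.0
    rto = initial_rto
    for _ in range(consecutive_losses):
        total += rto
        rto = min(rto * 2, max_rto)
    return total


if __name__ == "__main__":
    # Assumed 1-second starting timeout; see the caveats above.
    for losses in range(1, 6):
        print(losses, stall_after_losses(1.0, losses))
```

  Under these assumptions, four consecutive losses of the same segment
already add up to fifteen seconds of silence.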

  Yes, there are non-TCP protocols.  You could retool your MUD to use
them (speaking of "nontrivial").  But your builders would also have to
know about these constraints.  In practice, that means *lots* of subtle
bugs.

_________________________________________________________________
List config page:  http://list.imaginary.com/mailman/listinfo/dgd