[DGD] DGD/MP & OpenMosix revisited?

Noah Gibbs noah_gibbs at yahoo.com
Tue Feb 10 01:05:58 CET 2004


--- Greg Lewis <glewis at eyesbeyond.com> wrote:
> I would disagree.  A big selling point of Linux clusters in
> comparison to traditional "big iron" has been the price.

  Yes and no.  In the supercomputer space, you're 100% correct.  The
fact that the big iron costs millions is a big deal.  Just being able
to use Intel hardware there is already a nearly insurmountable price
advantage when compared to custom solutions.

  When comparing against big Suns or SGIs, the argument is similar.  So
the question is: how does it compare to big multiprocessor Intel
servers?

  Partly, I guess I'm curious what we're comparing here.  I'm assuming
DGD/MP is more aimed at Xeon servers than Reality Engines...

  And at that level, the servers are so cheap (by "big iron" standards)
that there's a limit to how much cheaper a cluster can be.  The extra
$1000 of margin for the company is a genuinely significant fraction of
the price...

  Where it *really* gets cheap is when you can use commodity hardware,
and in that case you're required to do one form or another of software
failover.

  I disagree with Felix that hardware failover is mandatory, but that's
partially because I've seen some very good, very low-latency software
failover systems.  I do agree that losing significant amounts of player
data is catastrophic, but I believe that (for instance) there is no way
for ten seconds of playtime to accumulate a significant (i.e.
catastrophic to lose) amount of player information.  So as long as you
back up every ten seconds, you're fine.
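
  To make that concrete, here's a minimal sketch of a periodic
checkpoint in DGD-style LPC.  The save_players() helper is purely
hypothetical; the point is only that the snapshot interval bounds how
much playtime you can lose.

    #define BACKUP_INTERVAL 10    /* most playtime we accept losing */

    /* Reschedules itself with call_out(); whatever save_players()
       writes is at most BACKUP_INTERVAL seconds stale when a node
       dies. */
    static void backup()
    {
        save_players();                        /* hypothetical snapshot */
        call_out("backup", BACKUP_INTERVAL);   /* next checkpoint */
    }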

  Though again, you've gotta go non-TCP/IP to guarantee things like
that.

> >   Have you checked into the cost of good-quality professional
> > software to do this for you?  [...]
> 
> Yes, I have checked the cost :).

  And you consider it insignificant?  Maybe you've got deeper pockets
than I have...

> You could use OSS software, but it's about a generation behind.
> You'll still get cloning, but it may happen in serial rather than via
> multicast.  It probably won't have a very polished GUI management
> suite associated with it either.

  Nope.  Also, its error checking will generally be questionable, as
will its performance in uncommon hardware situations.  I know enough
sysadmins and sales engineers who do this stuff that if you tell me
there's a good open-source solution that does this right for most
cases, I'll laugh.

  I believe there eventually *will* be, but that's a different issue.

> As I stated before, there should only be two types of nodes.  Your
> maintenance should be occurring on these two images, not on
> individual nodes.  Or, you should perform the maintenance on a single
> node from that group and use it to update the other nodes.

  Maybe this is the downside of dealing with all those sysadmins.  I
know too many of them who write custom software for this, or more
often lots of patches to other software.  The problem is that it's
generally easier if your machines have some amount of identity --
specific data for the individual node.  Now, if you can guarantee
absolutely uniform hardware and performance, this is less of a problem.
By the way, that means you'll need to upgrade with 100% identical
hardware for the foreseeable future.  Do you know how difficult it is
to do that with, say, hard drives for three years?  Did you know that
within a given model family of drive, there are generally two, or
eight, or twenty models, and essentially no vendor will ever
distinguish between them, nor guarantee they'll deliver one versus
another?  Sorry, we did this at my place of business recently --
telling me it's trivial will be a hard sell.

  You could also claim that minor patches and fixes to video cards,
motherboards, hard drives, et cetera will make no difference in
performance or correctness, and that mixing them will cause no trouble.
That's an equally hard sell.

> And in an MP situation you have bus latency.  Yes, that's quite a
> lot less than network latency, but you'll find that gap is closing.

  Since when?  Bus latency goes down by an order of magnitude every few
years (a decade at most).  When was the last time you saw anyone reduce
the default timeout for TCP/IP?  Not in twenty years, if memory serves.

  I expect that it's a different situation for local-only networking.
There's something with a really godawful reputation, whose name I've
forgotten, that was originally designed as a SCSI bus and is now used
for networking...  That has a much lower maximum latency.  But things
that are designed to go significant distances almost always have large
built-in timeouts.  I'm thinking of TCP/IP, ATM networking and FDDI
here.  What were you thinking of that is actually reducing its latency
*faster* than onboard buses?  Because I know of nothing in that
category.

  Bandwidth, maybe (though probably not).  Latency, no.

> What I'm getting at is that each situation carries similar problems
> and I'm wondering why the clustering situation is seen as
> insurmountably harder than the MP situation.

  Because you can block for the amount of time it takes to sync with
memory, and you can do it, say, once every ten seconds without it being
a problem.  Doing that over the network is death.

> Just think of component failure.  A CPU in a cluster node can fail,
> so can a CPU in a massive MP system.  Even with hot-swappable CPUs,
> the data on the CPU just went west in either case.  So why is the MP
> case any easier to deal with?

  Because the time it takes you to mirror the data to another node is
much smaller, and so is the time to retrieve that data.  I know, I
know, a maximum latency of 10ns is as bad as 10 seconds to you, but to
the rest of the world, the factor of a billion makes a difference.

> If the delay between nodes in a cluster is ten seconds then either
> the cluster is poorly architected or the network is saturated.  This
> should never happen in a properly set up cluster.

  The problem isn't that that'll be the typical case.  It won't be. 
The question is, what's the worst case?  Long-haul networking (as
opposed to on-the-bus networking) tends to have really bad worst-case
latencies even on very fast networks.  What protocol were you thinking
of that doesn't?

> > But your builders would also have to know about these constraints.
> > In practice, that means *lots* of subtle bugs.
> 
> No, your builders shouldn't need to know this at all.

  Remember that bit I wrote about non-local effects, high latencies and
having difficulty keeping non-local constraints working?  How do you
plan to address that if you have significant latencies?

> I doubt that they can or should care that you're using TCP/IP at the
> moment.

  No, but they *will* care that it takes time for a change they made to
propagate.  For instance, say they have a script that flips a switch in
a distant room.  That switch, when flipped, blows up everything
present.  Then the original room checks the distant room for living
creatures (anything fireproof survives being blown up) and, if any are
there, teleports them to a third location.
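
  In code, the trap looks something like this.  The room paths and
helper functions are made up for illustration; the only assumption is
that a ->() call on an object living on another node costs at least one
network round trip:

    void pull_lever(object actor)
    {
        object distant;

        distant = find_object("/world/room/vault"); /* may be on another node */
        distant->flip_switch();                     /* trigger the explosion */
        if (sizeof(distant->living_contents())) {   /* query the result */
            /* move any survivors to a third location */
            distant->teleport_survivors("/world/room/infirmary");
        }
    }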

  At this point, you have a script which must trigger a non-local
effect (count one net latency), have it trigger another (locally, no
latency) and then it must query the effect (one more net latency unless
you do it cleverly).  You can make easy examples with conditionals
where you're guaranteed three or four round trips.  If your guaranteed
maximum latency is, say, a second, you're dealing with two seconds of
latency (or more, in other examples).  You can't just block the whole
node for that long.  But if you don't block the node, the script will
be doing the wrong thing because the effect hasn't happened yet.  So
what do you do?

  You could say "package up the action, and all queries, into a single
request, send it out, have it do the action, then send back all the
information."  Do you block the acting script in the meantime?  Do you
block the acting player?  What if there *is* no acting player?  Note
that in this case it's not a matter of a *player* seeing your
inconsistency, but of your *script* seeing your inconsistency.
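
  For what it's worth, that "package it up" approach would look roughly
like the sketch below.  send_remote() is a purely hypothetical
primitive that ships the whole sequence to the node owning the target
and returns the answers in one round trip; the open question above
remains, though: what does the calling script do while it waits?

    /* Hypothetical: run the flip-and-query sequence on the remote node
       and ship back only the data the caller asked for -- one round
       trip instead of three. */
    mixed *remote_flip_and_query(string room_path)
    {
        return send_remote(room_path,
                           ({ "flip_switch", "living_contents" }));
    }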

  You can't just use shared memory for this, because that will block
your acting node.  Not good.  But you can't just let it slip, either,
because then a script like "*a = 7;  return (*a == 7);" will
potentially return incorrect results if 'a' points somewhere nonlocal.


