[DGD] DGD/MP & OpenMosix revisited?

Greg Lewis glewis at eyesbeyond.com
Tue Feb 10 18:37:29 CET 2004


I think we're getting way off topic here, so I'll just reply to a few
points.

On Mon, Feb 09, 2004 at 04:05:58PM -0800, Noah Gibbs wrote:
> --- Greg Lewis <glewis at eyesbeyond.com> wrote:
> > I would disagree.  A big selling point of Linux clusters in
> > comparison to traditional "big iron" has been the price.
> 
>   Yes and no.  In the supercomputer space, you're 100% correct.  The
> fact that the big iron costs millions is a significant deal.  Just
> being able to use Intel hardware there is already a nearly
> insurmountable price advantage when compared to custom solutions.
> 
>   When comparing against big Suns or SGIs, the argument is similar.  So
> the question is:  how does it compare to big multiprocessor Intel
> servers?
> 
>   Partly, I guess I'm curious what we're comparing here.  I'm assuming
> DGD/MP is more aimed at Xeon servers than Reality Engines...

And that, to me, is the interesting question.  I don't believe you can
buy a Xeon server larger than, say, 16-way (maybe it's only 8-way?).
What do you do when your mud outgrows that?  I'm not thinking about
hobbyist or small commercial muds here, but rather about large
commercial games (things on the scale of EverQuest, etc.).

Do you cluster?  Do you get a huge MP machine?  Even the largest Sun is
limited to roughly 100 processors.

> > >   Have you checked into the cost of good-quality professional
> > > software to do this for you?  [...]
> > 
> > Yes, I have checked the cost :).
> 
>   And you consider it insignificant?  Maybe you've got deeper pockets
> than I have...

No... I said I have checked the cost; I made no comment on its
significance ;).

> > You could use OSS software, but it's about a generation behind.
> > You'll still get cloning, but it may happen in serial rather than
> > via multicast.  It probably won't have a very polished GUI
> > management suite associated with it either.
> 
>   Nope.  Also, its error checking will generally be questionable, as
> will its performance in uncommon hardware situations.  I know enough
> sysadmins and sales engineers that do this stuff that if you tell me
> there's a good open-source solution that does this right for most
> cases, I'll laugh.

I wasn't about to tell you that at all.  I work for a clustering company
and my day job is writing commercial cluster management software.

> > As I stated before, there should only be two types of nodes.  Your
> > maintenance should be occurring on these two images, not on
> > individual nodes.  Or, you should perform the maintenance on a
> > single node from that group and use it to update the other nodes.
> 
>   Maybe this is the downside of dealing with all those sysadmins.  I
> know too many of them that write custom software for this, or more
> often lots of patches to other software.  The problem is that it's
> generally easier if your machines have some amount of identity --
> specific data for the individual node.  Now if you can guarantee
> absolutely uniform hardware and performance, this is less of a problem.
>  By the way, that means you'll need to upgrade with 100% identical
> hardware for the foreseeable future.  Do you know how difficult it is
> to do that with, say, hard drives for three years?  Did you know that
> within a given model family of drive, there are generally two, or
> eight, or twenty models, and essentially no vendor will ever
> distinguish between them, nor guarantee they'll deliver one versus
> another?  Sorry, we did this at my place of business recently --
> telling me it's trivial will be a hard sell.

I don't believe you'll have to do identical hardware upgrades at all.
How much the hardware differences matter varies quite a bit, and they
may not be a factor at all.  It's only when you start getting
completely different generations of nodes that you'll need to maintain
more images.

But take drives, for instance.  The only real difference they should
make in terms of your image is in the partitioning scheme.  Even then
it may make no difference whatsoever if your last partition is tagged
to grow to the maximum size of the drive.
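
To illustrate the idea (this is just a sketch, not how any particular
imaging tool describes it -- the mount points and sizes are made up):
one partition description whose last entry is tagged to grow covers
drives of different sizes, so a mix of drive models doesn't force you
into extra images.

    # Sketch: one partition plan applied to drives of different sizes.
    # Fixed partitions are absolute; the last one takes whatever is
    # left, so the same node image works on a 40GB or a 120GB drive.

    FIXED_PARTITIONS = [        # (mount point, size in MB)
        ("/boot", 128),
        ("swap", 2048),
        ("/", 8192),
    ]
    GROW_MOUNT = "/scratch"     # final partition grows to fill the drive

    def layout(drive_mb):
        """Concrete partition table for a drive of drive_mb megabytes."""
        table = list(FIXED_PARTITIONS)
        used = sum(size for _, size in FIXED_PARTITIONS)
        table.append((GROW_MOUNT, drive_mb - used))
        return table

    for drive_mb in (40000, 80000, 120000):
        print("%6d MB drive -> %s" % (drive_mb, layout(drive_mb)))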

>   You could also claim that minor patching and fixes to video cards,
> motherboards, hard drives et cetera will make no difference in
> performance or correctness, and mixing them will cause no trouble. 
> Ditto about the hard sell on that, too.

Most large clusters don't have video cards at all.  Motherboard changes
can be a problem, mostly because the onboard components are likely to
be different.  They can also be a problem if you're using a custom
BIOS for high speed booting (e.g. LinuxBIOS).  My personal experience
is that hard drive changes aren't as big a problem as they seem to have
been for you.

> > And in an MP situation you have bus latency.  Yes, that's quite a
> > lot less than network latency, but you'll find that gap is closing.
> 
>   Since when?  Bus latency goes down by an order of magnitude every few
> years (maybe a decade)?  When was the last time you saw them reduce the
> default timeout for TCP/IP?  Not in twenty years, if memory serves.
> 
>   I expect that it's a different situation for local-only networking. 
> There's something with a really godawful reputation that was originally
> designed as a SCSI bus and is now used for networking whose name I've
> forgotten...  That's much lower maximum latency.  But things that are
> designed to go significant distances almost always have large built-in
> timeouts.  I'm thinking TCP/IP, ATM networking and FDDI here.  What
> were you thinking of that is actually reducing its latency *faster*
> than onboard buses?  Because I know of nothing in that category.

Consider memory latency vs. network latency, since those are the
respective issues for MP machines and clusters (bandwidth is also an
issue, but we're talking specifically about latency).  My contention is
that network latency, although 100 or more times larger than memory
latency, has been falling at a more rapid rate in recent years.

I'm talking about high speed interconnects.  Have a look at Myrinet,
Quadrics or Dolphin, for instance.  For small message sizes, all of
them claim latencies in the few-microsecond range.  Compare this with
the latency of their cluster predecessors (e.g. 100Mbit Ethernet has a
latency of around 100 microseconds).  Now compare that with the drop
in memory latency over a similar time period.
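
To put some rough numbers on that: the network figures below are the
ones above, while the memory figures are ballpark assumptions on my
part, so take the exact factors as illustrative rather than measured.

    # Back-of-the-envelope comparison of latency improvements over
    # roughly the same period (mid 90s to now).  Network figures are
    # the ones quoted above; memory figures are assumed ballparks.

    network = {"then": 100e-6,  # ~100 us, 100Mbit Ethernet
               "now":    5e-6}  # "a few us", Myrinet/Quadrics/Dolphin
    memory  = {"then": 120e-9,  # ~120 ns DRAM access (assumed)
               "now":   60e-9}  # ~60 ns DRAM access (assumed)

    def improvement(lat):
        """Factor by which latency dropped between 'then' and 'now'."""
        return lat["then"] / lat["now"]

    print("Network latency dropped by ~%.0fx" % improvement(network))
    print("Memory latency dropped by ~%.0fx" % improvement(memory))
    print("Network/memory gap: ~%.0fx then, ~%.0fx now"
          % (network["then"] / memory["then"],
             network["now"] / memory["now"]))

Even if you quibble with the exact memory numbers, the ratio between
the two has shrunk by something like an order of magnitude.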

You might also want to compare bandwidth.  I believe you'll see a
similar pattern (i.e. network bandwidth is growing faster than memory
bandwidth).

-- 
Greg Lewis                          Email   : glewis at eyesbeyond.com
Eyes Beyond                         Web     : http://www.eyesbeyond.com
Information Technology              FreeBSD : glewis at FreeBSD.org



