[DGD] A good bigmap

bart at wotf.org
Fri Aug 3 13:07:10 CEST 2018


Fast, (memory) efficient and scalable: pick two, because you can't have all
three (at least not while also supporting arbitrary, unstructured data).

Anyway, a very similar question is what got me to build lpcdb, a 'mapping'-like
thing that works for an arbitrary amount of arbitrary data, with fast access
to all of that data and without excessive memory use.

I ended up with a solution very similar to what Quixadhal suggests: use a
hashing function to divide the keyspace and distribute it over a number of
'buckets' (represented by individual objects). The same approach is used by
many database systems to 'partition' tables and distribute them over multiple
servers (which is just a much larger-scale variation on the exact same problem
you are trying to address).
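
To make that concrete, here is a minimal sketch of the single-level scheme in
LPC. All names here are hypothetical (this is not lpcdb's API), the bucket
count is an arbitrary power of two, and for brevity the buckets are
sub-mappings inside one object; in a real implementation you would give each
bucket its own clone so the driver can swap buckets independently. It assumes
DGD's hash_crc16() kfun:

    /* bigmap.c: illustrative sketch of a hash-bucketed 'big mapping' */

    # define NBUCKETS 256                /* a power of 2, so the mask works */

    private mapping buckets;             /* bucket index -> key:value mapping */

    static void create()
    {
        buckets = ([ ]);                 /* sparse: buckets appear on first use */
    }

    private int bucket_of(string key)
    {
        /* mask the 16-bit crc down to a bucket index */
        return hash_crc16(key) & (NBUCKETS - 1);
    }

    void set(string key, mixed value)
    {
        int i;

        i = bucket_of(key);
        if (!buckets[i]) {
            buckets[i] = ([ ]);
        }
        buckets[i][key] = value;         /* assigning nil removes the pair */
    }

    mixed get(string key)
    {
        mapping bucket;

        bucket = buckets[bucket_of(key)];
        return bucket ? bucket[key] : nil;
    }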

This approach has one potential disadvantage: key:value pairs are stored such
that it's difficult (as in, cpu- or memory-intensive) to get an index of all
pairs sorted by key. But if that sort order doesn't matter much, this approach
can scale to very large sets of key:value pairs while still being fairly
memory- and cpu-efficient, and fast.

For lpcdb I went one step further and used two levels of 'bucketing', or
actually, a 2d grid of buckets. This is a bit slower and less memory
efficient, but in theory it scales up to 2^45 key:value pairs (more
practically, about 2^42), provided you have enough memory and swap to keep
track of that amount of data.
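
I won't reproduce lpcdb here, but the following sketch shows how such a grid
lookup can work, assuming a 64x64 grid and again using crc16, with the low 6
bits of the hash selecting the row and the next 6 bits the column:

    # define GRID 64                     /* 64 x 64 = 4096 buckets */

    private mapping grid;                /* row -> (column -> key:value mapping) */

    static void create()
    {
        grid = ([ ]);
    }

    void set(string key, mixed value)
    {
        int h, row, col;
        mapping r;

        h = hash_crc16(key);
        row = h & (GRID - 1);            /* low 6 bits */
        col = (h >> 6) & (GRID - 1);     /* next 6 bits */
        if (!grid[row]) {
            grid[row] = ([ ]);
        }
        r = grid[row];
        if (!r[col]) {
            r[col] = ([ ]);
        }
        r[col][key] = value;
    }

    mixed get(string key)
    {
        int h;
        mapping r, bucket;

        h = hash_crc16(key);
        r = grid[h & (GRID - 1)];
        bucket = r ? r[(h >> 6) & (GRID - 1)] : nil;
        return bucket ? bucket[key] : nil;
    }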

lpcdb itself is not very memory efficient here, however, because I wanted a
way to sort the pairs by key or value, and to do lookups by value instead of
by key. That requires an 'index', i.e. a value:key mapping, which doubles the
amount of memory used for a given amount of data.

While one can obsess over the quality of hashing functions, the things that
really matter in this case are how cheap the function is, and whether it gives
you a reasonably uniform distribution over a set of arbitrary data.
Collisions? You actually want those, because collisions are what send data
into the same 'bucket'. A uniform distribution will take care of filling those
buckets at approximately the same rate.

Given the hashing functions DGD supports natively, the otherwise 'weak' crc16
and crc32 hashes actually seem to provide a good enough distribution at a
pretty low cost, and their output is very easy to use for this purpose (you
will probably want a few right shifts or a bit mask to make the output range
match your chosen number of 'buckets').
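
For example, assuming hash_crc16() (whose 16-bit result is always
non-negative), either of these reduces a hash to one of 2^n buckets:

    /* hypothetical helper: map a key onto one of 2^n buckets */
    private int bucket_of(string key, int n)
    {
        return hash_crc16(key) & ((1 << n) - 1);   /* keep the low n bits */
        /* or, equivalently: return hash_crc16(key) >> (16 - n);
           which keeps the high n bits instead */
    }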

A note on my choice of a 2d grid for this: it is more expensive memory- and
cpu-wise for small data sets, but it lets you scale up much further before
running into excessive reallocation penalties. In general, keeping the
underlying mappings relatively small (say <= 1024 entries) prevents this. If
your dataset is small enough to fit in, say, 256 'buckets', this is all of
little concern, but when you need up to 16384 buckets, things tend to work a
bit more smoothly with a 128x128 grid than with a 'line' of 16384 'buckets'.
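
To make that concrete: with a flat line of 16384 buckets, the top-level
mapping itself has to grow to 16384 entries, well past that <= 1024 sweet
spot, while in a 128x128 grid neither level of the index ever holds more than
128 entries, so each individual reallocation stays small.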

Ah yes, I use mappings rather than arrays for this simply because mappings can
be 'sparse'. With pre-allocated arrays you waste a fair bit more memory until
you have a lot of data, but you won't have to deal with the cost of
reallocations.

Anyway, many if not most database systems take a very similar approach to this
problem: hash the access key to 'bucket' the data and distribute it over
multiple engines/storage systems. I'm sure there are alternatives that are
better for special cases, but for arbitrary sets of data I doubt you will find
a better solution.

Bart.

On Wed, 1 Aug 2018 21:27:21 -0700, Raymond Jennings wrote
> Ok, so, I've been grappling with this for awhile.
> 
> What's a good way to make a general purpose "big mapping"?
> 
> My bigstructs have served well as a fallback, but they have weaknesses.
> 
> skotos's code appears to be special case.
> 
> basically, I need a way to, with O(log N) complexity and O(N) space,
> store a mapping of arbitrary size.
> ____________________________________________
> https://mail.dworkin.nl/mailman/listinfo/dgd


--
https://www.bartsplace.net/
https://wotf.org/
https://www.flickr.com/photos/mrobjective/
