[DGD] A good bigmap

Raymond Jennings shentino at gmail.com
Fri Aug 3 23:30:10 CEST 2018


I actually wound up implementing a B-tree-like structure that
subdivides the key space recursively.  Each level indexes the next
lower one by the lowest key in the subnode's keyspace, with the leaf
buckets allowed to be heavier than the branch buckets, and an attempt
to insert into a bucket that's too full causes a node split
(recursively if necessary) before the insert is retried.

Can anyone see any weaknesses in such a scheme?  Branch nodes are
allowed up to 4 subnodes each and are split into nodes of 2 subnodes
each when necessary.
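
Roughly, the insert path looks like this (a simplified sketch rather
than my actual code; keys are ints here, split_child() just stands in
for the recursive split, and LEAF_MAX is only an example):

    #define LEAF_MAX   64    /* leaf buckets may be heavier... */
    #define BRANCH_MAX 4     /* ...than branch buckets */

    /* a node is ({ leaf flag, mapping }): for a leaf the mapping holds
     * key:value pairs, for a branch it maps each subnode's bottom key
     * to the subnode itself */
    private int insert(mixed *node, int key, mixed value)
    {
        if (node[0]) {
            /* leaf: refuse when full so the caller splits and retries */
            if (node[1][key] == nil && map_sizeof(node[1]) >= LEAF_MAX) {
                return 0;
            }
            node[1][key] = value;
            return 1;
        } else {
            mixed *bottoms;
            int i;

            /* pick the subnode whose bottom key is the largest one <= key
             * (map_indices() returns the keys in sorted order) */
            bottoms = map_indices(node[1]);
            for (i = sizeof(bottoms) - 1; i > 0; i--) {
                if (key >= bottoms[i]) {
                    break;
                }
            }
            if (insert(node[1][bottoms[i]], key, value)) {
                return 1;
            }
            if (map_sizeof(node[1]) >= BRANCH_MAX) {
                return 0;            /* full here too: let our caller split us */
            }
            split_child(node, bottoms[i]);   /* split the full subnode in two */
            return insert(node, key, value); /* then retry */
        }
    }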

I might consider hashing as an alternative, though.
On Fri, Aug 3, 2018 at 4:01 AM <bart at wotf.org> wrote:
>
> Fast, (memory) efficient and scalable, pick 2 because you can't have all 3 (at
> least not while also supporting arbitrary, unstructured data).
>
> Anyway, a very similar question is what got me to build lpcdb, a 'mapping'
> kinda thing which would work for an arbitrary amount of arbitrary data with
> fast access to all that data and without excessive memory use.
>
> I ended up with a solution very similar to what Quixadhal suggests, use a
> hashing function to divide the keyspace and distribute it over a number of
> 'buckets' (represented by individual objects). This same approach is also used
> by many database systems to 'partition' tables and distribute them over
> multiple servers (this is just a much larger scale variation on the exact same
> problem you are trying to address).
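>
> In skeleton form (a simplified sketch, not lpcdb's actual code; the
> bucket objects and their set()/get() functions are only illustrative),
> the front-end object just decides which bucket object owns a key and
> delegates to it. How bucket_index() turns a key into a bucket number
> is what the hashing discussed below is for.
>
>     mapping buckets;    /* bucket index -> bucket object,
>                          * cloned on demand in practice */
>
>     void set(string key, mixed value)
>     {
>         buckets[bucket_index(key)]->set(key, value);
>     }
>
>     mixed get(string key)
>     {
>         return buckets[bucket_index(key)]->get(key);
>     }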
>
> This approach has one potential disadvantage: key:value pairs will be stored
> such that it's difficult (as in, CPU or memory intensive) to get an index of
> all pairs sorted by key. But if that sort order doesn't matter much, this
> approach can scale to very large sets of key:value pairs while still being
> fairly memory and CPU efficient, and fast.
>
> For lpcdb I went one step further and am using 2 levels of 'bucketing', or
> actually, a 2D grid of buckets. This is a bit slower and less memory
> efficient, but in theory it scales up to 2 to the power 45 key:value pairs
> (more practically, approximately 2 to the power 42), provided you have enough
> memory and swap to keep track of that amount of data.
>
> lpcdb, however, is not very memory efficient here, because I wanted a way to
> sort the pairs by key or value and to do lookups by value instead of by key
> (requiring an 'index', i.e. a value:key mapping, which doubles the amount of
> memory used for a given amount of data).
>
> While one can obsess over the quality of hashing functions, in this case the
> things which really matter are how cheap the function is, and whether it gets
> you a reasonably even (uniform) distribution over a set of arbitrary data.
> Collisions? You actually want those, because they are what cause data to go
> into the same 'bucket'. An even distribution will take care of filling those
> buckets at approximately the same rate.
>
> Given the hashing functions DGD supports natively, the otherwise 'weak' crc16
> and crc32 hashes actually seem to provide a good enough distribution at a
> pretty low cost, and provide output which is very easy to use for this purpose
> (you will probably want to do a few right shifts or use a bit mask to make the
> output range match your chosen number of 'buckets').
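>
> For example (a sketch; this assumes DGD's crc16 kfun is hash_crc16()
> and returns an int, and the bucket count is only an example):
>
>     #define NBUCKETS 256    /* must be a power of two for the mask */
>
>     private int bucket_index(string key)
>     {
>         /* keep the low bits of the crc16: 0 .. NBUCKETS - 1 */
>         return hash_crc16(key) & (NBUCKETS - 1);
>     }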
>
> A note on my choice of a 2D grid for this: it's more expensive memory- and
> CPU-wise for small data sets, but it lets you scale up much further before
> running into excessive reallocation penalties. In general, keeping the
> underlying mappings relatively small (say <= 1024 entries) prevents this. If
> your dataset is small enough to fit in say 256 'buckets', this is all of
> little concern, but when you need say up to 16384 buckets, things tend to work
> a bit more smoothly when using a 64x64 grid instead of a 'line' of 16384
> 'buckets'.
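>
> For the grid variant, the same hash output can simply be cut into two
> coordinates (again a sketch with illustrative numbers; a 64x64 grid
> only needs 12 bits of hash):
>
>     #define GRID 64         /* 64 x 64 = 4096 buckets */
>
>     /* grid is assumed to be a mapping of mappings of bucket objects,
>      * with rows and cells created on demand in practice */
>     private object grid_bucket(string key)
>     {
>         int h;
>
>         h = hash_crc16(key);
>         return grid[(h >> 6) & (GRID - 1)][h & (GRID - 1)];
>     }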
>
> Ah yes, I use mappings and not arrays for this simply because mappings can be
> 'sparse'. When using pre-allocated arrays you will be wasting a fair bit more
> memory until you have a lot of data, but won't have to deal with the cost of
> reallocations.
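>
> In other words (illustrative only), the difference is between
>
>     buckets = ([ ]);            /* sparse mapping, entries appear as needed */
>
> and
>
>     buckets = allocate(256);    /* array pre-allocated up front, mostly nil */
>
> where the array costs the memory immediately but never needs to be
> reallocated as it fills up.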
>
> Anyway, many if not most database systems take a very similar approach to this
> problem: use hashing on the access key to 'bucket' the data and distribute it
> over multiple engines/storage systems. I'm sure there are alternatives which
> will be better for special cases, but for arbitrary sets of data I doubt you
> will find a better solution.
>
> Bart.
>
> On Wed, 1 Aug 2018 21:27:21 -0700, Raymond Jennings wrote
> > Ok, so, I've been grappling with this for a while.
> >
> > What's a good way to make a general purpose "big mapping"?
> >
> > My bigstructs have served well as a fallback, but they have weaknesses.
> >
> > Skotos's code appears to be a special case.
> >
> > Basically, I need a way to store a mapping of arbitrary size with
> > O(log N) complexity and O(N) space.
>
>
> --
> https://www.bartsplace.net/
> https://wotf.org/
> https://www.flickr.com/photos/mrobjective/
>
> ____________________________________________
> https://mail.dworkin.nl/mailman/listinfo/dgd


