[MUD-Dev] Naming and Directories?
Hans-Henrik Staerfeldt
hhs at cbs.dtu.dk
Wed Mar 17 11:28:42 CET 1999
On Wed, 17 Mar 1999, Jon A. Lambert wrote:
> On 16 Mar 99, Mik Clarke wrote:
> > Ummm. Yick. 26 4 byte pointers by, say 16 characters long just to ind=
ex 100-200 names?
> > Even with a sensible implementation you're going to end up wasting 90=
% of the storage you
> > allocate (and a ten letter name means you have to allocate 1K pointer=
s, each different
> > name then requires another 900 bytes or so).
>=20
> But you have changed your requirements above from speed to memory.=20
> No fair. ;)
>=20
> The alphabet trie above has some pros and cons:
>=20
> 1) avoids the overhead of string hashing=20
> 2) provides intrinsic support for abbreviations
>=20
> 1) based on English and may not support blanks, puntuation, or other=20
> characters (of course this can be altered or enforced)
> 2) distribution of names may tend to cluster in certain nodes=20
>=20
Some additional pros and cons on a search trie (for large problems) are:
pro : distribution of names may tend to cluster in certain nodes
its a pro, because storing names with the same prefix does not make the
trie grow linearly with name length, so storing "green bottle" and=20
"green dragon" would require the same space for "green ", only "bottle"
and "dragon" would be distinct.
pro : for the large _static_ case, you can easily store the trie on disk,
using file pointers instead of memory pointers. I did this for lookups=20
in a large genetic database (Genbank, size ~11 Gb) with much success,=20
the trie (of ~3 million keywords) taking up 82Mb. Note that since
the namespace is well filled (or clustered), the memory usage is about=20
30 bytes pr. word, since the trie 'reuses' the common prefixes (words=20
are approx 6-10 characters long). (the trie is a bit compressed, so=20
lookup actually takes O(m * sigma), m being the keyword length, with=20
O(m) disk reads.)
con : it _also_ takes time to insert/delete entries in the trie, so
you may get another drag on load, as you ease lookup time. You really
should only use a trie (compared to linear search), if=20
#lookups*#objects > #movements*sizeof(names) for the locality in=20
which you use the trie. Most usefull, will probably be to apply it to=20
global lookups only, since 'movement' here is limited to=20
creation / deletion.=20
Some optimizations i have heard about (but not trie-d):
memory saver: Instead of a full 26 letter alfabet, you could=20
cluster individual letters, say 'ab' 'cd' ... such that the number of=20
pointers is reduced, making a (small)hash or linear search in the end
(of a greatly reduced number of objects) to see which actually match.
As mentioned earlier (i forget who *blush*): compressing the trie (long=20
chains that does not branch) is a sure memory saver.
=20
Another compression is possible (it has some girls name): if some (any)
subtree of the trie is fully 'filled', say f.inst. that the 4 letter=20
prefixes of your used names covers all combinations, then you can=20
compress this to a simple list, doing direct lookup, instead of 4=20
lookups.
Hans Henrik St=E6rfeldt | =20
email: bombman at diku.dk | voice: +45 40383492=20
hhs at cbs.dtu.dk | voice work: +45 45252425
phone-mail: | address:
40383492 at sms.tdm.dk | Hans Henrik St=E6rfeldt,
WWW-home | Dybendalsvej 74 2. th,
http://www.cbs.dtu.dk/hhs/ | 2720 Vanl=F8se, Danmark.
|
Student of Computer Science | Scientific programmer at Center for
and Information Psychology. | Biological Sequence Analysis,
at University of Copenhagen | Technical University of Denmark.
_______________________________________________
MUD-Dev maillist - MUD-Dev at kanga.nu
http://www.kanga.nu/lists/listinfo/mud-dev
More information about the mud-dev-archive
mailing list