[DGD] LPC algorithm question

Noah Gibbs angelbob at monkeyspeak.com
Sat Feb 23 22:47:07 CET 2002


  I'm writing a function, filter, which looks like this:

private mixed* filter(mixed* entries);

  Entries is an array, where each element looks something like this:

["linkdead", <phrase>, (stuff)]

  The first element, a string, is a keyword for a help entry.  The second
element, "<phrase>", is an LWO containing, among other things, an array of
strings sorted by locale (English, Spanish, etc.).  There is a one-to-many
mapping between help entries and keywords, so for instance a particular
help entry might be listed under the keywords "policy", "policies",
"rule" and "rules".  The <phrase> structure will be the same for all four
entries for that help entry, but the leading keyword string, the first
element of each entry, will be different.  In other words:

  ["policy", <phrase>, ...]
  ["policies", <phrase>, ...]
  ["rule", <phrase>, ...]
  ["rules", <phrase>, ...]

Note <phrase> points to the same LWO in all four cases -- all of this is
stored in the same heavyweight object (/usr/common/sys/helpd), so that
should work fine.

I want to take a list of entries like this and eliminate all duplicate
<phrase> structures, so that if the Soundex algorithm cheerfully came back
with both "policy" and "policies" as matches for "help pollissee", only
one of the two would be shown, since they have the same text as their
description.

My first thought was to keep a mapping (hash table) local to the function,
hash each entry into it using the phrase structure as the key, and then
just return the map_values() array from the table.  This doesn't work,
since you can't use an LWO as an index into a mapping.  Hashing on the
object_name of an LWO doesn't help either.  In C I'd have a pointer...
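
For concreteness, that first attempt looked roughly like this (a sketch
from memory, so take the details loosely):

private mixed* filter(mixed* entries)
{
    mapping seen;
    int i;

    seen = ([ ]);
    for (i = 0; i < sizeof(entries); i++) {
        /* entries[i][1] is the <phrase> LWO; DGD won't accept an LWO
           as a mapping index, so this is exactly the line that fails. */
        seen[entries[i][1]] = entries[i];
    }
    return map_values(seen);
}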

I could just grab the English string out of each one in turn and use
*that* as the hash key, though that would take more CPU, since the help
descriptions can be a page or more in length.  It would also be more
complex and error-prone, since some entries may conceivably exist only in
non-English languages, so I'd need a number of lines of code to fish
through the locales and come up with the right one to hash on in each
case.
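
Something along these lines is what I have in mind -- note that
get_text_for_locale() is just a stand-in for whatever accessor the phrase
LWO really provides, and the fallback for entries with no English text is
left out:

private mixed* filter(mixed* entries)
{
    mapping seen;
    string key;
    int i;

    seen = ([ ]);
    for (i = 0; i < sizeof(entries); i++) {
        /* Key on the English text of the phrase rather than the LWO
           itself.  get_text_for_locale() is made up; really I'd have
           to fish through the locale array for a usable string. */
        key = entries[i][1]->get_text_for_locale("english");
        if (key && !seen[key]) {
            seen[key] = entries[i];
        }
    }
    return map_values(seen);
}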

Is there some easy way to do this without checking each one and
eliminating the duplicates with a linear search?  I'd like to avoid that,
since it would be O(n^2) time, and I'm trying to add some more permissive
Soundex matching, which would return a much larger number of entries for
sorting...
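
For reference, the quadratic version I'm trying to avoid would look
something like this (assuming == compares two references to the same
phrase LWO as equal, which should hold since they all live in the same
object):

private mixed* filter(mixed* entries)
{
    mixed *result;
    int i, j, dup;

    result = ({ });
    for (i = 0; i < sizeof(entries); i++) {
        /* Scan everything kept so far for the same phrase LWO. */
        dup = 0;
        for (j = 0; j < sizeof(result); j++) {
            if (result[j][1] == entries[i][1]) {
                dup = 1;
                break;
            }
        }
        if (!dup) {
            result += ({ entries[i] });
        }
    }
    return result;
}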

-- 
angelbob at monkeyspeak.com
See my page of DGD documentation at
"http://www.angelbob.com/projects/DGD_Page.html"
If you post to the DGD list, you may see yourself there!

_________________________________________________________________
List config page:  http://list.imaginary.com/mailman/listinfo/dgd


