[MUD-Dev] Speech to Text, etc. (was: On socialization and convenience )

Adam Martin amsm2 at cam.ac.uk
Sat Jun 23 12:46:34 CEST 2001


----- Original Message -----
From: "Eli Stevens" <listsub at wickedgrey.com>
To: <mud-dev at kanga.nu>
Sent: Thursday, June 21, 2001 8:40 AM
Subject: [MUD-Dev] Speech to Text, etc. (was: On socialization and
convenience )

> ----- Original Message -----
> From: "John Buehler" <johnbue at msn.com>
> To: <mud-dev at kanga.nu>
> Sent: Monday, June 18, 2001 8:07 PM
> Subject: RE: [MUD-Dev] On socialization and convenience

> One limitation of STTTS (heh heh) is that the intermediate form of
> communication is text.  I know that seems obvious, but I only
> point it out because it seems rather assumed, and it shouldn't be.
> What if the intermediate form was an entry into a lookup table of
> various sounds (not mapped directly to letters per se)?

> Perhaps 16 bits of index, 4 of volume, 4 of duration, and 4 to
> indicate how much this sound blends into the next (with 4 left
> over).  65k sounds should be enough to get just about every
> phonetic sound that a human can make (though recognizing it might
> be hard - I suspect that this would work better with English than
> it would with Chinese ;).  The system doesn't need to know the
> ASCII representation (unless you wanted pure STT too - like a chat
> window, but I am assuming you don't).
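To make the proposed encoding concrete, here is a rough sketch of how
such a 32-bit sound token could be packed and unpacked; the field names
and bit positions are just my own guesses at one possible layout:

  /* One sound token in 32 bits: 16 bits lookup index, 4 bits volume,
   * 4 bits duration, 4 bits blend-into-next, 4 bits spare. */
  #include <stdint.h>
  #include <stdio.h>

  typedef uint32_t sound_token;

  static sound_token pack_sound(uint16_t index, uint8_t volume,
                                uint8_t duration, uint8_t blend)
  {
      return (uint32_t)index
           | ((uint32_t)(volume   & 0xF) << 16)
           | ((uint32_t)(duration & 0xF) << 20)
           | ((uint32_t)(blend    & 0xF) << 24);
      /* bits 28..31 are the 4 left over */
  }

  static void unpack_sound(sound_token t, uint16_t *index,
                           uint8_t *volume, uint8_t *duration,
                           uint8_t *blend)
  {
      *index    = (uint16_t)(t & 0xFFFF);
      *volume   = (uint8_t)((t >> 16) & 0xF);
      *duration = (uint8_t)((t >> 20) & 0xF);
      *blend    = (uint8_t)((t >> 24) & 0xF);
  }

  int main(void)
  {
      uint16_t idx; uint8_t vol, dur, blend;
      sound_token t = pack_sound(12345, 9, 7, 3);
      unpack_sound(t, &idx, &vol, &dur, &blend);
      printf("index=%u vol=%u dur=%u blend=%u (4 bytes per sound)\n",
             (unsigned)idx, (unsigned)vol, (unsigned)dur,
             (unsigned)blend);
      return 0;
  }

At 4 bytes per sound that is a tiny fraction of the bandwidth of raw
audio, which is presumably much of the appeal of shipping table indices
rather than waveforms.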

I've been trying to find the links for a few days now and failing,
but some researchers have indeed found a very limited set of phonemes
that can represent all speech sounds - ISTR that the key to their
research was splitting syllables into two halves, a starting half and
an ending half.

The result was that (allegedly) a few dozen sounds were enough to
represent any language, and the researchers expected to be able to
build STT systems at least an order of magnitude faster than current
ones, without requiring loads of memory.  (Today's mainstream STT
systems are context based, which is the only reason they are language
dependent; it is also the reason they won't run on e.g. a low-end
Pentium, and even on modern machines they tend to be severe system
hogs.)

FYI, I very much doubt that STT for eastern languages (e.g. Mandarin)
is any harder than for western languages.  Certainly tonal inflection
has a strong effect on meaning, but the work I've seen so far suggests
that is not where the real difficulty lies anyway - the problem is
more that with sloppy enunciation multiple words become almost
indistinguishable (the system I referred to earlier claimed to be able
to decode such situations uniquely more easily than current systems
do).

Since I've not heard any more about this research, I wonder whether
the IP got gobbled up and is currently being sold only in some very
expensive markets, or whether the researchers weren't quite as
successful as they claimed.

Another possibly interesting line of research is some work on lip
reading at Carnegie Mellon:
http://www.theregister.co.uk/content/archive/6143.html

Adam M

_______________________________________________
MUD-Dev mailing list
MUD-Dev at kanga.nu
https://www.kanga.nu/lists/listinfo/mud-dev
