[MUD-Dev] [TECH] Voice in MO* - Phoneme Decomposition and Reconstruction

Tue May 21 05:33:08 CEST 2002

Raph Koster writes:

>   - Cybertown, a graphical mud with differently themed chat areas,
>   makes use of text-to-speech. It sounds like the good old
>   fashioned Talking Moose on the Mac--also familiar from countless
>   weather channels.

>   - MPath aka HearMe, which wedded realtime voice to what were
>   essentially IRC channels. Brian Moriarty has a great videotape
>   of this in action called "Whispering Pines."

> It was extremely disjoint reading text in the chat box on
> Cybertown in advance of hearing it, plus it of course did not
> handle the challenges of typical Netspeak very well. However, you
> began to start to hear the glimmerings of emotional intonation in
> the voices, perhaps simply because of the jerky intonation of the
> algorithm.

I checked out Cybertown and its voice synth.  It is quite confusing
to hear speech come out of apparently nowhere.  In a crowded room
especially, I just didn't know who was speaking, even though the
voices were slightly modulated.  I wonder, then, whether we as
humans need to tag sounds to specific spatial sources.  In movies,
we see people's lips move, which lets us tag the speech to them even
when they're in the middle of a crowd.  Inner monologues, which have
no spatial representation (aside from being in the narrator's head),
are somewhat jarring on screen.  Once that link is established,
though, the speaker can actually move off-screen for brief
intervals.  This suggests that speech in text muds might be a bad
idea, and that graphical muds and MMORPGs are a better medium to
support it.

However, radio dramas are an easy exception to that would-be rule.
But perhaps that is a special case, in that most radio shows have
three or four speakers at most, each with a very distinctive voice.
Just random musings on my part.  Any thoughts?

Back to the topic: perhaps as testament that TTS algorithms can
indeed have their own personality, I think I recognize the
Cybertown one as the same engine as that in the MS Speech SDK v4.x
(good ol' Sam).  However, while that synth has a good deal of
flexibility, the default heuristics it uses make it talk in a very
controlled manner.  Good for weather channels.  Not so good for
portraying people.

Personally, I suggest looking at the L&H speech engine you can
download through the MS Agent site.  While it still has the same
type of gruff comp-gen voice, I think it handles intonation and
fluidity much more naturally, mainly because, in addition to
processing punctuation such as commas, it seems to vary the
inter-sound spacing based on whether a gap falls within or between
words (and maybe even on sound type).  It's a subtle effect, but one
that helps break up the monotone voice.
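
As a rough illustration of the timing idea above, here is a minimal
Python sketch.  The gap values and the `schedule_gaps` function are
made up for illustration; I have no idea what the L&H engine actually
does internally.

```python
# Hypothetical gap lengths in milliseconds; the values are invented.
WITHIN_WORD_MS = 10
BETWEEN_WORD_MS = 60
PUNCT_MS = {",": 200, ".": 400, "?": 400, "!": 400}


def schedule_gaps(words):
    """Return (sound, gap_after_ms) pairs for a list of words.

    Each word is a (sounds, trailing_punctuation) tuple, where sounds
    is a list of phoneme-like symbols and the punctuation may be "".
    """
    timeline = []
    for sounds, punct in words:
        for i, sound in enumerate(sounds):
            last_in_word = i == len(sounds) - 1
            if not last_in_word:
                # Sounds inside a word run together with a short gap.
                gap = WITHIN_WORD_MS
            else:
                # End of word: longer gap, longer still after punctuation.
                gap = PUNCT_MS.get(punct, BETWEEN_WORD_MS)
            timeline.append((sound, gap))
    return timeline


if __name__ == "__main__":
    utterance = [(["HH", "AY"], ","), (["DH", "EH", "R"], ".")]
    for sound, gap in schedule_gaps(utterance):
        print(sound, gap)
```

Even three distinct gap lengths like this are enough to stop every
sound from arriving at the same mechanical rate.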

And speaking of voices (ugh), if you've got enough dev resources to
spare, you can always try creating your own speech set for a
concatenative synthesis mode instead of the usual formant synth.
Below are two of the better examples:

  SpeechWorks Demo
    http://www.speechworks.com/demos/speechify.cfm

  IBM TTS Demo
    http://www-3.ibm.com/software/speech/enterprise/dcenter/demo-tts.html
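
For anyone unfamiliar with the term, the core of concatenative
synthesis is joining short recorded units end to end.  Here is a
minimal Python sketch, assuming the units are already available as
lists of float samples.  The `crossfade_concat` function and the
linear crossfade are illustrative choices of mine, not how either
demo above is implemented.

```python
def crossfade_concat(units, overlap=4):
    """Join recorded units (lists of samples) with a linear crossfade.

    The last `overlap` samples of the output so far are blended with
    the first `overlap` samples of the next unit to soften the seam.
    """
    out = []
    for unit in units:
        n = min(overlap, len(out), len(unit))
        for i in range(n):
            w = (i + 1) / (n + 1)      # weight of the incoming unit
            j = len(out) - n + i       # position in the overlap region
            out[j] = out[j] * (1 - w) + unit[i] * w
        out.extend(unit[n:])
    return out


if __name__ == "__main__":
    a = [0.0, 0.2, 0.4, 0.6]
    b = [0.6, 0.4, 0.2, 0.0]
    print(crossfade_concat([a, b], overlap=2))
```

The hard part in practice is recording and labelling the units for
each voice, which is where the dev resources go.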

Although the practicality of that approach might be questionable
given the number of different voices you would want your players to
choose from, it does produce the most natural-sounding human voice.
Regardless, TTS is only a fallback in case the speaker doesn't have
a microphone set up or would, for some reason, prefer to use text
entry.  In the ideal scenario, speech would be decomposed into
phonemes and then reconstructed, thus preserving to some extent the
original intonation, if not the speaker's voice.
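
To make the phoneme round trip a bit more concrete, here is a minimal
Python sketch of what might travel over the wire.  Everything in it
is an assumption for illustration: the tiny `PHONEMES` inventory, the
`Phone` record, and the five-byte layout.  The hard parts
(recognizing phonemes from the microphone and resynthesizing them)
are not shown.

```python
import struct
from dataclasses import dataclass

# Hypothetical phoneme inventory; a real one would be much larger.
PHONEMES = ["sil", "HH", "AH", "L", "OW"]


@dataclass
class Phone:
    symbol: str       # phoneme label from PHONEMES
    duration_ms: int  # how long the speaker held the sound
    pitch_hz: int     # rough fundamental frequency, 0 if unvoiced


# Little-endian: 1-byte phoneme index, 2-byte duration, 2-byte pitch.
_FORMAT = "<BHH"


def encode(phones):
    """Pack a list of Phone records into bytes for other clients."""
    return b"".join(
        struct.pack(_FORMAT, PHONEMES.index(p.symbol),
                    p.duration_ms, p.pitch_hz)
        for p in phones
    )


def decode(data):
    """Unpack bytes produced by encode() back into Phone records."""
    return [
        Phone(PHONEMES[idx], dur, pitch)
        for idx, dur, pitch in struct.iter_unpack(_FORMAT, data)
    ]


if __name__ == "__main__":
    said = [Phone("HH", 80, 0), Phone("AH", 120, 110), Phone("L", 90, 105)]
    packet = encode(said)
    print(len(packet), decode(packet) == said)
```

Because duration and pitch travel with each phoneme, the receiving
client can replay the speaker's rhythm and intonation in whatever
synth voice it has on hand.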

A big problem with relying on TTS instead of PTS is the dictionary
required.  That is, certain words just aren't pronounced the way
they are spelled.  Each of these special cases needs its own entry,
added manually.  If no entry is found, the engine tries its best and
the result usually comes out mangled.  Just for kicks, I'm using the
L&H engine to read these emails.  A lot of words aren't in its
dictionary, so I've had to write a wrapper that handles them.  So
just from that experience, I'm biased against using TTS in a
client :)
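
For the curious, a wrapper of that general kind can be as simple as
respelling known problem words before the text reaches the engine.
This is a minimal Python sketch, not the actual wrapper: the
`EXCEPTIONS` entries and the `respell` function are hypothetical, and
handing the result to a speech engine is left out.

```python
import re

# Hypothetical exception entries: word -> respelling that a
# letter-to-sound engine is more likely to read correctly.
EXCEPTIONS = {
    "lol": "ell oh ell",
    "mmorpg": "em em oh ar pee gee",
    "tts": "tee tee ess",
}

_WORD = re.compile(r"[A-Za-z]+")


def respell(text, exceptions=EXCEPTIONS):
    """Replace words that have an exception entry; leave the rest alone."""
    def swap(match):
        word = match.group(0)
        # Lookup is case-insensitive; unknown words pass through as-is.
        return exceptions.get(word.lower(), word)
    return _WORD.sub(swap, text)


if __name__ == "__main__":
    print(respell("TTS in an MMORPG client, lol"))
```

The catch is the one described above: every new oddity needs another
entry added by hand.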

TLC

_______________________________________________
MUD-Dev mailing list
MUD-Dev at kanga.nu
https://www.kanga.nu/lists/listinfo/mud-dev