[MUD-Dev] [TECH] Voice in MO* - Phoneme Decomposition and Reconstruction

Thu May 16 03:45:48 CEST 2002

Ted L. Chen writes:
> John Buehler writes:
>> In response to Ted L. Chen:

> This draws a close analogy to the problems in hand-writing
> recognition.  Taking a cue from the PalmOS graffiti, perhaps users
> can be expected to utilize standardized tags.  "How are *YOU*
> today?"  Some people already type it like this for emphasis.
> Other encoding flags such as "-" or "..." could be used to denote
> pauses as well.  In essence, these tags need not be the same as
> the tags used internally by the TTS.  More likely, they would be
> meta-tags which encompass a way of speaking, rather than the
> mechanics (i.e. volume, pitch) of the phonemes.

That's the sort of thing I was thinking of.  Whatever common
conventions we use in email to denote emphasis, pauses, inflections,
etc, should be recognized by the 'speech transfer' system.  I
consider that to be the limit of what's necessary for text input.
The speed of text-based communication versus voice-based
communication is going to be very noticeable.  Text is a fallback
that I hope most players will not want to use.
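
As a rough illustration of the conventions discussed above, here is a
minimal Python sketch that splits typed text into plain text, emphasis
(*word*), and pause ("..." or a free-standing "-") segments.  The
function name parse_markup and the segment labels are hypothetical
meta-tags invented for this example; they are not the internal tags of
any real TTS engine.

```python
import re

# Hypothetical conventions: *word* marks emphasis, "..." or a lone "-"
# surrounded by whitespace marks a pause.
TOKEN_RE = re.compile(
    r"\*(?P<emph>[^*\s][^*]*?)\*"          # *YOU*
    r"|(?P<pause>\.\.\.|(?<=\s)-(?=\s))"   # ... or " - "
)

def parse_markup(text):
    """Split typed text into (kind, value) segments.

    kind is one of 'text', 'emphasis', or 'pause'.
    """
    segments = []
    pos = 0
    for m in TOKEN_RE.finditer(text):
        if m.start() > pos:
            # Plain text between the previous tag and this one.
            segments.append(("text", text[pos:m.start()]))
        if m.group("emph") is not None:
            segments.append(("emphasis", m.group("emph")))
        else:
            segments.append(("pause", m.group("pause")))
        pos = m.end()
    if pos < len(text):
        segments.append(("text", text[pos:]))
    return segments
```

A downstream stage would then map each segment kind onto whatever the
speech engine actually understands, which keeps the typed conventions
independent of the engine's own tag set.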

> As a slight tangent, the text input can also expand oft-used
> acronyms such as rotfl or lol.

Yup.  Including any number of 'emotes' that people might put into
their normal written text.  Such as: <shrug>
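
A minimal sketch of that expansion step, assuming a small lookup table
for acronyms and angle-bracket emotes such as <shrug>.  The table
contents and the function name expand are made up for illustration; a
real system would need a much larger, probably user-editable, list.

```python
import re

# Hypothetical acronym table; entries are illustrative only.
ACRONYMS = {
    "lol": "laughing out loud",
    "rotfl": "rolling on the floor laughing",
}

# Emotes written as <word>, e.g. <shrug>.
EMOTE_RE = re.compile(r"<(\w+)>")

def expand(text):
    """Return (spoken_text, emotes) for one line of typed input.

    Acronyms are matched as whole whitespace-separated words, so
    "lol!" with trailing punctuation is left untouched in this sketch.
    """
    emotes = EMOTE_RE.findall(text)
    without_emotes = EMOTE_RE.sub("", text)
    words = [ACRONYMS.get(w.lower(), w) for w in without_emotes.split()]
    return " ".join(words), emotes
```

The emotes come back as a separate list so that they can drive an
animation or a sound effect rather than being read aloud.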

> Note that the default TTS would tend to have some built-in
> heuristics that seem to be common.  For instance, the L&H TTS
> engine currently puts your standard raised inflection when it
> encounters a question mark at the end of an input stream.

Absolutely necessary.
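
The question-mark heuristic can be sketched in a few lines.  This is
only an illustration of the idea, not how the L&H engine does it; the
function name default_contour and the contour labels are assumptions.

```python
def default_contour(utterance):
    """Pick a coarse pitch contour from the final punctuation mark.

    Hypothetical labels: 'rising' for questions, 'neutral' otherwise.
    """
    if utterance.rstrip().endswith("?"):
        return "rising"
    return "neutral"
```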

>> The goal here is to have players both typing and speaking to the
>> program, with the information efficiently conveyed to those who
>> should receive it, to be output as written text or spoken word as
>> desired by the receiver.

> Ah, that's the rub.  At least with the phoneme method.  The
> difficulty that most STT systems encounter is in that final
> stretch where you determine what string of phonemes can constitute
> a word - or more precisely, which word.  What I'm prescribing is
> more like STP (speech to phoneme).  So, at least in the near
> future, where processing capability is still growing, we may need
> to restrict the output to speech only.  Like the old days of TV
> before closed captioning became available.

Well, I'll take what I can get.  I actually don't care about text
output at all, but I include it in discussions as a transitional
element.  Current games use text.

> That is of course, if players would forgo that option in exchange
> for speech capability.

You know users.  What's new doesn't always get the thumbs up.

> As for the expressive quality of the standard TTS, it will sound
> rather bland or deadpan after a while because everyone is talking
> exactly the same way.  If everyone used text as the primary method
> of input, it might seem like we walked into a bad voice actor's
> convention.  That's why I made the comment about encoding more
> tags.  It's not required for basic communication on the order of
> what we currently have with text, but it does help in breaking up
> the monotony.  And hence the suggestion that it be included in the
> speech->phoneme decomposition.

As before, I want speech in and speech out.  I keep including text
in the requirements for compatibility.

>> I believe that both original speech and manufactured speech are
>> needed.  Original speech transport is needed when players are
>> speaking to players (telephone).  Manufactured speech is needed
>> when characters are speaking to characters (acting).  I want both
>> in the same game so that I can have a clear separation of in-game
>> and out-of-game conversations available to players.  If I want to
>> talk about baseball, I can do it via my own voice.  If I want to
>> have my character discuss the balance of its weapon, I can do it
>> via my character's voice.  Note that my own voice can be sent to
>> any player in the game willing to receive it, while my
>> character's voice is limited to how far it carries in the game
>> environment.

>> I would be content with current primitive STT and TTS systems
>> such that I can speak and the characters can talk.  The
>> differentiation of which character is saying what can be worked
>> out via graphical cues and such.  I just want somebody to put the
>> thing in.
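
The split between player voice and character voice described above
amounts to two routing rules.  Here is a minimal sketch, assuming 2D
positions and a single fixed carry distance; the names Listener,
recipients, accepts_ooc, and carry_distance are all hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Listener:
    name: str
    pos: tuple          # (x, y) position of the listener's character
    accepts_ooc: bool   # player is willing to receive out-of-game voice

def recipients(speaker_pos, listeners, channel, carry_distance=20.0):
    """Return the names of listeners who should receive an utterance.

    channel == "player":    the speaker's own voice, sent to any player
                            willing to receive it, regardless of distance.
    channel == "character": the character's voice, limited to how far it
                            carries in the game environment.
    """
    if channel == "player":
        return [l.name for l in listeners if l.accepts_ooc]
    if channel == "character":
        return [
            l.name
            for l in listeners
            if math.dist(speaker_pos, l.pos) <= carry_distance
        ]
    raise ValueError(f"unknown channel: {channel!r}")
```

A real game would replace the fixed radius with whatever its sound
propagation model provides (walls, shouting versus whispering, and so
on), but the separation of the two channels stays the same.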

> Interesting.  How close to your own voice does it need to be for
> the player to player communication?  I fully understand that using
> a speech->phoneme->speech method isn't a full reproduction of your
> voice, so would it be enough that it has the same patterns and
> somewhat same tone as your voice?  It might be similar to trying
> to establish a conversation on a noisy telephone line - it says
> it's Bubba, and it sounds kinda like Bubba, but is it really
> Bubba?  You can tell at least that it's definitely not Buffy.

Ideally, it's perfect fidelity to the speaker's voice.  Eventually,
I want the capability for video phone conferencing.  Here, I'm
thinking of more than just games and starting to consider commercial
collaborative actions in a virtual space.  But in the short term, I
take what I can get.  I took the phoneme approach to be a sort of
compact textual equivalent with more information.  As bandwidth
increases, compressed audio would be used for the player-to-player
audio anyway.

I guess at this point, I have to ask myself: What is the phoneme
approach bringing to the table?

>> The issue of phonemes as the specific technology is not
>> significant to me, any more than whether the database being used
>> is relational or object-oriented, so long as it has the
>> operational characteristics that I'm after.

> Perhaps I'm too much of an engineer, but I see no value in giving
> treatment to only the initial conceptual stage of a design and
> assuming the rest as idealized black-boxes.

I assume the rest as idealized black boxes because I don't know the
technology.  I'm an engineer myself, but this is hardly my area of
expertise.  You know how marketing guys work.  They just assemble
their wish list.  It's up to us to educate them as to the real
possibilities.

> So in the case of phonemes, the limit it imposes is that it
> allows for decent generation of tag data for the speech engine,
> but at the cost of not being able to display text on the
> recipients' side.  That's an awfully strong limit if your design
> has a hard requirement to give the recipient the choice between
> the output as either text or speech.  The same type of limits can
> be derived from RDBMS and OODBMS foundations and do impact design
> downstream (and to a lesser extent, upstream).

Ah, my education begins.  It seems that the requirement of voice
output is far more tractable than a requirement of voice input.
Voice output can use a headset to ensure that the player is the only
one who hears the output.  Voice input is rather more difficult to
manage such that only the computer hears what the player is saying.
So text output is the least critical capability.

> Speaking of which, anyone have a good design structure matrix
> (DSM) for MMORPGs?  Or is this too nascent or wide a field for one
> to exist?

Heh.  DSMs typically don't even exist for mainstream applications,
let alone games.  Methodology is such a new concept for software
'engineers'.  While not exactly nascent, MMORPGs certainly have a
long ways to go and are evolving all the time.  That's a separate
debate that lots of folks weigh in on regularly.

JB

_______________________________________________
MUD-Dev mailing list
MUD-Dev at kanga.nu
https://www.kanga.nu/lists/listinfo/mud-dev