[MUD-Dev] Text-to-speech (Was Shift in time)

Mike Rozak Mike at mxac.com.au
Mon Oct 4 07:30:26 CEST 2004


Arnau Rosselló Castelló wrote:

> I'd like to ask, then, about phoneme recognition and synthesis. My
> idea is to translate a voice input into a string of phonemes (with
> pitch, speed, and volume, hopefully), without translating it to
> words. ...

> Would such a thing be feasible?

Yes and no.

If you know WHAT words the user spoke, and have a recording of what
they said, I already have a tool (at www.mxac.com.au/m3d) that will
extract the phonemes, volume, timing, and pitch. It's called
"transplantedy prosody" by speech-type people. An example of a
verbose form of transplanted prosody is:

  <TransPros>
   <OrigText>This is a sample recording whose phonemes have been
   generated using speech recognition.</OrigText>
   <Word text="This" ph="dh ih1 s" TPPitchRel=0.92,1.16,TPDurAbs=0.04,0.17,0.17/>
   <Break time=139ms/>
   <Word text="is" ph="ih1 z" TPPitchRel=0.97,0.91 TPDurAbs=0.08,0.05/>
   <Word text="a" ph="ah0" TPPitchRel=0.92 TPDurAbs=0.11/>
   <Word text="sample" ph="s ae1 m p ah0 l" TPPitchRel=,1.07,1.24,1.33,1.35,1.28 TPDurAbs=0.15,0.09,0.03,0.10,0.03,0.05/>
   <Word text="recording" ph="r iy0 k ao1 r d iy0 ng" TPPitchRel=1.22,1.34,,0.90,0.86,1.03,0.84,1.14 TPDurAbs=0.03,0.11,0.14,0.07,0.05,0.04,0.09,0.10/>
   <Break time=279ms/>
   <Word text="whose" ph="hh uw1 z" TPPitchRel=,0.88, TPDurAbs=0.05,0.10,0.11/>
   <Break time=59ms/>
   <Word text="phonemes" ph="f ow1 n iy1 m z" TPPitchRel=,1.15,0.97,0.86,0.90, TPDurAbs=0.04,0.12,0.05,0.07,0.19,0.07/>
   <Word text="have" ph="hh ae1 v" TPPitchRel=,0.90,0.84 TPDurAbs=0.07,0.11,0.05/>
   <Break time=139ms/>
   <Word text="been" ph="b iy0 n" TPPitchRel=1.02,0.91,0.90 TPDurAbs=0.02,0.05,0.11/>
   <Word text="generated" ph="jh eh1 n er0 ey1 t ih0 d" TPPitchRel=1.11,0.97,1.06,0.91,0.83,,0.92, TPDurAbs=0.06,0.08,0.05,0.11,0.08,0.06,0.05,0.04/>
   <Break time=59ms/>
   <Word text="using" ph="y uw1 z iy0 ng" TPPitchRel=0.86,0.89,0.88,0.97,1.00 TPDurAbs=0.10,0.06,0.07,0.05,0.08/>
   <Word text="speech" ph="s p iy1 ch" TPPitchRel=,,1.08, TPDurAbs=0.09,0.08,0.09,0.10/>
   <Break time=59ms/>
   <Word text="recognition" ph="r eh1 k ih0 g n ih1 sh ih0 n" TPPitchRel=0.95,0.87,,0.83,,0.85,0.85,,0.73, TPDurAbs=0.05,0.06,0.14,0.03,0.03,0.07,0.05,0.11,0.03,0.06/>
   <Punct text="." DurAbs=0.01/>
  </TransPros>

(This example uses relative pitch so I can translate the prosody
onto a female voice, if I wish. I can also produce absolute pitch,
which can be good for singing, although if you really want singing
there are specialized TTS engines just for that. This example
doesn't include volume information because it doesn't make that much
difference to the quality of the transplanted prosody.)
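To make the relative-pitch idea concrete, here is a minimal Python
sketch of how transplanted prosody might be re-applied to a different
voice: each phoneme's target pitch is the new voice's base pitch
scaled by the recorded relative factor, and durations carry over as
absolute seconds. The base pitch value and the data layout are
illustrative assumptions, not the actual internals of my tool:

  # Sketch: re-apply relative-pitch transplanted prosody to a new voice.
  # BASE_PITCH_HZ is an assumed value for a female voice, not a real
  # engine parameter.
  BASE_PITCH_HZ = 210.0

  def apply_transpros(words, base_pitch_hz=BASE_PITCH_HZ):
      """Turn parsed <Word> entries into per-phoneme synthesis targets."""
      targets = []
      for word in words:
          for i, ph in enumerate(word["ph"].split()):
              rel = word["TPPitchRel"][i]   # None = no recorded pitch
              dur = word["TPDurAbs"][i]     # absolute seconds
              pitch = base_pitch_hz * rel if rel is not None else None
              targets.append((ph, pitch, dur))
      return targets

  # The word "speech" from the example, empty pitch slots parsed as None:
  sample = [{"ph": "s p iy1 ch",
             "TPPitchRel": [None, None, 1.08, None],
             "TPDurAbs": [0.09, 0.08, 0.09, 0.10]}]
  for ph, pitch, dur in apply_transpros(sample):
      print(ph, pitch, dur)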

> How much bandwidth would the phoneme stream need? (Well, I could
> time myself saying something and then count... but maybe someone
> has more elaborate approximations.)

Wrap it up into a binary format and it will only be a few times
larger than text.
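For a rough estimate, here is a sketch of one possible packing; the
field widths (one byte each for phoneme ID, quantized pitch,
duration, and volume) are my assumptions, not a defined wire format:

  import struct

  # Straw-man packing: 1 byte phoneme ID, 1 byte relative pitch
  # (1.0 -> 128), 1 byte duration in 10 ms steps, 1 byte volume.
  def pack_phoneme(ph_id, pitch_rel, dur_s, vol):
      pitch_q = max(0, min(255, int(pitch_rel * 128)))
      dur_q = max(0, min(255, int(dur_s / 0.010)))
      return struct.pack("4B", ph_id, pitch_q, dur_q, vol)

  # Conversational speech runs around 10-15 phonemes per second:
  print(15 * 4, "bytes/sec")  # ~60 B/s, a few times the size of text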

> You could also swap phonemes in the stream to create accents
> and/or unintelligible languages. These are off the top of my head,
> I'm sure there are many more interesting things that can be done
> with it.

I have code to do the accents, as you mentioned. (My web site gives
an example. I can also change phoneme duration and pitch contours to
create different voices.) The other technique you mentioned,
generating new languages, is built into the VW I'm working on.
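As a toy illustration of the phoneme-swapping idea, a substitution
table applied to the stream before synthesis is enough to fake an
accent. The mappings below are illustrative, not a real accent model,
and my actual code also bends duration and pitch contours, which this
sketch skips:

  # Fake an accent by swapping phonemes before synthesis.
  ACCENT_MAP = {
      "dh": "d",   # "this" -> "dis"
      "th": "t",   # "think" -> "tink"
      "w":  "v",   # "what" -> "vat"
  }

  def apply_accent(phonemes, table=ACCENT_MAP):
      # Stress digits (e.g. "ih1") ride along on vowels; the consonant
      # keys above carry none, so a direct lookup suffices here.
      return [table.get(ph, ph) for ph in phonemes]

  print(apply_accent("dh ih1 s".split()))   # ['d', 'ih1', 's']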

Back to the issue of changing one's voice:

  The problem is getting a transcription of the speech. To do that,
  you need speech recognition that basically does dictation. Current
  dictation systems are only 96%-98% accurate... assuming the right
  phase of the moon and whatever other finagling marketing people
  do. The real numbers are usually 2x the error rate that marketing
  claims (92%-96% accurate). That means roughly 1 in 12-25 words is
  wrong, often in confusing ways that make for humorous anecdotes. I
  was testing the dictation system at Microsoft and dictated, "I
  gave a peanut to the squirrel." It came up with "I gave a penis to
  the squirrel." After that we made sure to get rid of potentially
  insulting/offensive words so they couldn't accidentally be
  dictated.

There are ways to improve the accuracy:

  1) Have the VW log EVERYTHING ever typed in chat, and use it to
  create a "language model"... which is speech-guy terminology for
  predicting what words come next. A good language model is produced
  from 1 billion+ words, creating a database called a tri-gram. A
  tri-gram just remembers, "If I heard word X followed by word Y,
  what is the chance of word Z being spoken?". Thus, in a dictation
  system, if you speak "new york" it'll be expecting you to say
  "city". If you say "pity" instead, it will think it heard
  incorrectly and write out "city" anyway. All speech recognition
  engine writers have internal tools for generating language models;
  I don't know how many of them allow 3rd parties to use them. (A
  toy sketch of this, and of the dynamic update in point 2, follows
  this list.)

  2) Dynamically update the language model based on the
  context... if there are orcs nearby, the user is more likely to be
  speaking about "orcs" than "forks". This requires some coding on
  the part of the VW so it knows what words will be in context. I'm
  not sure how many SR engine providers support the dynamic language
  model.
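Here is a toy Python sketch of both ideas: trigram counts built from
logged chat (1), plus a flat probability boost for words the VW
reports as in context (2). Real engines add smoothing and back-off;
the boost factor here is an arbitrary illustrative value.

  from collections import defaultdict

  tri = defaultdict(lambda: defaultdict(int))

  def train(chat_log):
      # Count trigrams from logged chat lines.
      for line in chat_log:
          words = ["<s>", "<s>"] + line.lower().split()
          for x, y, z in zip(words, words[1:], words[2:]):
              tri[(x, y)][z] += 1

  def p_next(x, y, z, in_context=(), boost=5.0):
      """P(z | x y), with in-context words made `boost` times likelier."""
      counts = tri[(x, y)]
      weight = lambda w: counts[w] * (boost if w in in_context else 1.0)
      total = sum(weight(w) for w in counts)
      return weight(z) / total if total else 0.0

  train(["pick up the fork", "pick up the orc"])
  print(p_next("up", "the", "fork"))                     # 0.5
  print(p_next("up", "the", "orc", in_context={"orc"}))  # ~0.83, orcs nearby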

> How could such a system deal with ambient noise, or tapping the
> mic?

The player has to wear a headset microphone. They have to talk into
it in a calm and measured voice (SR accuracy goes down when you
scream at it). Tapping on the microphone can easily be ignored.
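The reason a tap is easy to ignore is that it's a short transient,
while speech keeps frame energy up for many consecutive frames. A toy
sketch of that gate, with made-up thresholds:

  import numpy as np

  FRAME = 160          # 10 ms frames at 16 kHz
  ENERGY_GATE = 0.01   # assumed RMS threshold
  MIN_FRAMES = 10      # require >= 100 ms of sustained energy

  def is_speech(samples):
      n = len(samples) // FRAME
      frames = samples[:n * FRAME].reshape(n, FRAME)
      loud = np.sqrt((frames ** 2).mean(axis=1)) > ENERGY_GATE
      run = best = 0
      for v in loud:
          run = run + 1 if v else 0
          best = max(best, run)
      return best >= MIN_FRAMES   # a 20 ms tap never reaches this

  rng = np.random.default_rng(0)
  tap = np.zeros(16000); tap[100:400] = rng.normal(0, 0.5, 300)
  talk = 0.1 * np.sin(2 * np.pi * 120 * np.arange(16000) / 16000)
  print(is_speech(tap), is_speech(talk))   # False True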

> Is there any kind of public project (I know about Festival only)
> creating these kinds of programs? Papers, maybe?

Festival does TTS. I have my own TTS up on my web site. AT&T has the
best TTS I've heard recently, but it apparently takes several
hundred MB of RAM.

There are public SR engines. I think you can download the Sphinx
recognizer. Microsoft has a recognizer. IBM has one. Dragon has
another. Etc.

Papers - I think I saw some books on how to create your own TTS
engine. You're probably better off with a book than an academic
paper, since papers usually assume prior knowledge and are more
esoteric. There was one book co-authored by Dennis Klatt ("From Text
to Speech: The MITalk System", Allen, Hunnicutt, and Klatt,
Cambridge University Press, 1987) that I read about 10 years
ago. TTS is no longer synthesized that way, but it's a start in case
you can't find anything more recent.

The other approach is to use algorithms to change the voice without
knowing what was spoken. These can change from male to female, but
won't get rid of a player's Bronx accent. And by the way, there are
better algorithms than what you've probably heard, but they take
more CPU than most games want to surrender. My software will do
algorithmic voice mucking too (a non-technical term), although its
ability to go from my male voice into female is questionable. Male
to male is easy. Male to midget/giant (what you've probably heard)
is trivial. (My web site, http://www.mxac.com.au/m3d/tts.htm, has a
sample of my "female" TTS voice, derived from my voice model. My
wave editor, included in the 3d app, has tools for doing this.) If
you search the web you'll find research on voice gender conversion.
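For reference, the trivial midget/giant trick is just resampling, so
pitch and formants shift together; proper male-to-female conversion
has to move them independently, which is where the CPU goes. A
minimal sketch of the cheap version:

  import numpy as np

  def chipmunk(samples, factor=1.5):
      """factor > 1 raises pitch (midget); factor < 1 lowers it (giant).
      Resampling also shortens/stretches the audio as a side effect."""
      src = np.arange(0, len(samples) - 1, factor)   # read positions
      return np.interp(src, np.arange(len(samples)), samples)

  t = np.arange(16000) / 16000
  voice = np.sin(2 * np.pi * 120 * t)    # stand-in for a 120 Hz voice
  higher = chipmunk(voice, factor=1.5)   # now ~180 Hz, and shorter
  print(len(voice), len(higher))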

Mike Rozak
http://www.mxac.com.au
_______________________________________________
MUD-Dev mailing list
MUD-Dev at kanga.nu
https://www.kanga.nu/lists/listinfo/mud-dev


