T.E.O.'s Draft--Cascading Speech Style Sheets (txt)

JuanJo Miguez (JuanJo.Miguez@esat.kuleuven.ac.be)
Wed, 28 Feb 1996 11:12:47 +0100 (MET)

T.E.O.'s Draft--Cascading Speech Style Sheets
K.U. Leuven

Ing. to be Juan Jose Miguez Iglesias mailto:Juanjo.Miguez@KULeuven.ac.be
ir. Filip Evenepoel mailto:Filip.Evenepoel@KULeuven.ac.be
ir. Bart Bauwens mailto:Bart.Bauwens@KULeuven.ac.be
Prof.dr.ir Jan Engelen mailto:Jan.Engelen@KULeuven.ac.be
Prof.ing Antonio S. Pena from the E.T.S.I.Telecomunication of Vigo (Spain)


The T.E.O. group at the Katholieke Universiteit Leuven in Belgium
believes that the best way to include speech in CSS is to keep it
simple and general, so that it is easy to use. We agree with T.V. Raman's
initial draft:


that it is very interesting to include speech in CSS, but we do not want
to make it very complicated. Many people do not even know what decibels
are, most current speech synthesizers are mono, and it is easier to give
values to some features as abstract numbers (these values are then
mapped to the real values for each synthesizer). You can see this page
in HTML with your browser at the URL:


We have defined the set of properties for Cascading Speech Style Sheets
in the same format as the CSS1 working draft:

Value: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Initial: 0
Applies to: All elements
Example: volume: 5

The default value is 0 because normally there will be no sound,
but as soon as another value is specified the speech synthesizer
will start working. The real range of values for volume (and for
all the other properties) depends on which speech synthesizer you
use, so these abstract values will be mapped onto the real values
used by the synthesizer.
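
A minimal sketch of such a mapping, assuming a simple linear scale.
The function name map_level and the 0..255 device range are hypothetical,
chosen only for illustration; a real user agent may map differently.

```python
def map_level(level, device_min, device_max, steps=10):
    """Linearly map an abstract level in 0..steps onto a device range.

    Assumption: the synthesizer's native scale is linear. Nothing in the
    draft requires this; it is only the simplest possible mapping.
    """
    if not 0 <= level <= steps:
        raise ValueError(f"level must be between 0 and {steps}, got {level}")
    return device_min + (device_max - device_min) * level / steps


# Hypothetical synthesizer whose native volume runs from 0 to 255.
half_volume = map_level(5, 0, 255)
```

With this sketch "volume: 0" maps to the bottom of the device range
(silence) and "volume: 10" to the top.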

We think this approach is easier than Raman's, where users must
know what decibels are in order to write their own style sheets. In
fact very few people know about them (engineers, physicists and so on).
To make it easy we let people choose from a set of ten values
that will be mapped by experts to the real values in the

Value: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Initial: UA specific
Applies to: All elements
Example: speed: 6

Some users (especially blind people) prefer very fast speech
because they have very good hearing, so they can read web pages
very quickly. That is why we chose such a wide range. Of course
"speed: 0" is not allowed because you would not hear anything.
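
As an illustration of the rule above, a user agent could check a speed
declaration roughly like this. The function parse_speed is a hypothetical
helper, not part of the draft, and the "name: value" syntax is taken
from the Example line.

```python
SPEED_VALUES = range(1, 11)  # 1..10, as in the Value line; 0 is excluded


def parse_speed(declaration):
    """Parse a declaration such as "speed: 6" and return the level."""
    name, _, value = declaration.partition(":")
    if name.strip() != "speed":
        raise ValueError(f"not a speed declaration: {declaration!r}")
    try:
        level = int(value.strip())
    except ValueError:
        raise ValueError(f"speed must be an integer: {declaration!r}") from None
    if level not in SPEED_VALUES:
        raise ValueError(f"speed must be between 1 and 10, got {level}")
    return level
```

Calling parse_speed("speed: 6") returns 6, while parse_speed("speed: 0")
raises ValueError.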

Value: | child1 | child2 | male1 | male2 | female1 | female2 |
Initial: UA specific
Applies to: All elements
Example: voice-type: female1

This is the way to set the physical features of the articulating
voice. For example, the voices of a boy, a woman and a man sound
different, and this property lets you choose among them.

Value: | 1 | 2 | 3 | 4 | 5 | 6 |
Initial: UA specific
Applies to: All elements
Example: pitch: 4

This is a small range for the average fundamental frequency (F0).
The same person (the same voice type) can speak, on average, at a
lower or a higher pitch, which gives the impression of a different
voice. If we combine "pitch" and "voice-type", for example:

if voice-type=child1, pitch=1 (low voice)  --> real average frequency: 150 Hz
if voice-type=child1, pitch=6 (high voice) --> real average frequency: 350 Hz
if voice-type=male2,  pitch=1 (low voice)  --> real average frequency:  50 Hz
if voice-type=male2,  pitch=6 (high voice) --> real average frequency: 150 Hz

All these voices sound different. We get a wide range of different
voices because the pitch value (F0) is mapped to different real
frequencies depending on the voice-type. That is why 6 possible
values of pitch are enough for a simple definition with 36 different
voices (6 voice types x 6 pitch values).
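
The combination can be sketched as a lookup plus interpolation. Only the
four endpoint frequencies for child1 and male2 come from the examples
above. The linear interpolation between pitch 1 and pitch 6, and the
names F0_RANGE_HZ and average_f0, are assumptions made for illustration.

```python
VOICE_TYPES = ["child1", "child2", "male1", "male2", "female1", "female2"]
PITCH_MIN, PITCH_MAX = 1, 6

# Real average F0 in Hz at pitch 1 and pitch 6. Only these two voice types
# have values in the text; the others would be supplied per synthesizer.
F0_RANGE_HZ = {
    "child1": (150.0, 350.0),
    "male2": (50.0, 150.0),
}


def average_f0(voice_type, pitch):
    """Map (voice-type, pitch) to a real average frequency in Hz."""
    if not PITCH_MIN <= pitch <= PITCH_MAX:
        raise ValueError(f"pitch must be between 1 and 6, got {pitch}")
    low, high = F0_RANGE_HZ[voice_type]  # KeyError if no range is known
    # Assumption: intermediate pitch values are spaced linearly.
    fraction = (pitch - PITCH_MIN) / (PITCH_MAX - PITCH_MIN)
    return low + (high - low) * fraction


# 6 voice types x 6 pitch values = 36 distinct voices.
voice_count = len(VOICE_TYPES) * (PITCH_MAX - PITCH_MIN + 1)
```

Because each voice type carries its own frequency range, the same pitch
number gives a different real frequency for each voice.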

When users want to write a personal CSSS, they can try any of the
available values, and it will work because the values are mapped to
real, typical values. With Raman's specification someone could try an
average-pitch of 5 Hz, which would sound bad. We prefer to let people
choose a relative number rather than an exact, and perhaps wrong,
average pitch.

Value: | on | off |
Initial: on
Applies to: All elements
Example: prosidy: off

With prosody activated the synthesizer produces intonation (the
evolution of F0 over time), which can sound hard, soft, angry,
questioning... With "prosidy: off" the result is like the voice
of a robot (blind people prefer this kind of voice, and also
very fast speech).

Value: defined in the ISO 639 (Codes for the representation of
the names of languages)
Initial: en
Applies to: All elements
Example: language: fr

You can specify any language, because the same message is
pronounced differently in each language (e.g. fr, nl, es, en...).
For example, the Apollo II (a multilingual speech synthesizer)
supports 7 languages (Russian, English, French, Spanish...). The
default value is English because it is the most used language on
the web. Although many languages are not supported now, and
perhaps never will be, it is better to include all of them than
only a small part.
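
Since a synthesizer supports only some languages, a user agent needs a
policy for unsupported codes. The sketch below assumes it falls back to
the initial value "en"; the draft does not specify this behaviour, and
choose_language is a hypothetical helper. The example set contains only
the four languages named above.

```python
def choose_language(requested, supported, default="en"):
    """Return the language code the synthesizer should actually use.

    Assumption: an unsupported code falls back to the default language.
    """
    code = requested.strip().lower()
    if code in supported:
        return code
    if default in supported:
        return default
    raise LookupError(f"neither {code!r} nor {default!r} is supported")


# ISO 639 codes for the four languages named in the text (of the seven).
named_languages = {"ru", "en", "fr", "es"}
spoken = choose_language("nl", named_languages)  # falls back to "en"
```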

We are trying to produce understandable speech, but we think it is
difficult to make a speech synthesizer speak in all the dialects
of all the world's countries, as Raman suggests in his draft. It
might be possible, but not many people could afford it. Our aim is
simply to make things easy for the end user, with the devices that
are mostly used today, so that this can be working soon, because many
people (blind or otherwise impaired) need it as soon as possible.

This is a DRAFT. We have discussed it among ourselves, and now it is your
turn to say whether you like it as it is or would like to discuss some of
its features. I hope you will tell us what you think. Thank you!

Kath. Universiteit Leuven--Dept.Electrotechniek (ESAT), T.E.O.