multimodal style

David Seibert
Mon, 4 Mar 1996 15:06:13 -0500 (EST)

This is a proposal for a multimodal styling language, as opposed to a
linear combination of visual/audio/whatever styling languages. I suggest
a few natural multimodal attributes, and discuss why and how to encourage
their use. I also suggest allowing attribute values to be in a fixed
range of numbers, which both simplifies the styling language and
minimizes the dependence on English.

I would be very happy to hear comments from any interested reader. This
proposal is publicly available online. I am also temporarily hosting a
copy of the audio style sheet proposal of T.V. Raman, so that his
manuscript is publicly available as well.


David Seibert

Multimodal document styling

Encouraging the production of stylish and accessible documents

  1. Introduction
  2. Design goals
  3. Unimodal attributes
  4. Multimodal attributes
  5. Standardization and language independence
  6. Independent specification of attributes
  7. Encouraging authors to produce multimodal documents
  8. Precise specification of multimodal attributes
  9. Summary


Introduction

HTML (HyperText Markup Language), the standard markup language of the World Wide Web, is an SGML (Standard Generalized Markup Language) document type. Rules have been specified to transform HTML tags to a set of "canonical" elements in accordance with SDA (SGML Document Access) standards, so that HTML documents can easily be presented using Braille, large print, audio, or any other type of display. The ICADD (International Committee on Accessible Document Design) recommends that DTD (Document Type Definition) authors define mappings to a standard tag set because this practice will "minimize the burden on writers and editors of understanding the requirements of markup for Braille, large print and voice synthesized delivery" [Y. Rubinsky, "Description of the ICADD Mechanism"].

The use of HTML styling languages, such as DSSSL (Document Style Semantics and Specification Language) Online and CSS (Cascading Style Sheets), has been suggested as a simple way for Web publishers to control the presentation of HTML documents. The hope is that this mechanism will encourage publishers and software vendors to use HTML rather than creating their own DTDs using SGML, or inventing new HTML markup tags. Some advantages of continued HTML use are:

However, documents created using customized styles will not be presented well in all display modes, unless the style designer spends time creating the proper specification for each mode. This extra work would again make it less likely that web authors would create documents and styles that can be presented well in all formats.

Current styling language proposals concentrate on giving authors control of the visual presentation of text and images, for the most part ignoring the possibility of alternative formats. HTML style sheets for audio presentation have been proposed by T.V. Raman and by the TEO group at the Katholieke Universiteit Leuven, but in these schemes the audio controls are totally independent of visual controls. Little attention has been paid to the fact that, in most cases, authors use markup to express an idea. Audio and visual presentations (along with presentations in any other modes) are therefore related because they represent the same idea, just as visual presentations of markup in different languages are related by the semantics of the tags.

In this document, I propose to create a multimodal HTML styling language, in which visual, audio, and other style descriptions are integrated as much as is practical. The purpose of this unification is to make it as easy as possible to produce better web documents for people with disabilities, by reducing the work for the author or style designer to enrich a document or style for all display modes. I suggest design goals for such a styling language and discuss means to implement those goals. I give concrete examples for five multimodal presentation attributes that can be derived from visual and audio attributes. Finally, I discuss how to combine multimodal and unimodal attributes to create a styling language that not only allows but encourages authors to produce multimodal documents and styles.

Design goals

A well-designed multimodal styling language should
  1. contain a fairly complete set of attributes, so that authors can specify a wide range of properties for visual, audio, and other displays.
  2. contain multimodal attributes that allow authors to simultaneously specify properties for multiple presentation modes.
  3. standardize attribute values as much as possible, to make it easier for casual authors to use the styling language.
  4. reduce language dependence by minimizing the use of English.
  5. allow authors to specify visual and audio properties independently if they wish to do so, for maximal control over presentation.
  6. encourage authors to use multimodal attributes (that control presentation in more than one mode), which will produce documents that can be presented well in any mode, rather than unimodal attributes (that control presentation in a single mode).
  7. allow precise physical specification of presentation properties when feasible.

Unimodal attributes

The first design goal, providing a wide range of attributes, can be met by simply combining current proposals for visual and audio style attributes. In this section, I give the names of some proposed attributes and their natural language and numerical values (without actual or implied units). The definitions are usually fairly obvious; when they are not, readers should refer to the proposals for visual and audio style sheets. I do not discuss physical values for these attributes, because they cannot be translated to values of multimodal attributes as simply as the less precise (but more intuitive) natural language or numerical values can.


Audio attributes

I list here all attributes described by Raman, with the exception of speech-other, which is suggested for experimental purposes, and spatial-audio, which is suggested for possible use in the future. The attributes proposed by the TEO group are a subset of those proposed by Raman, so they are also included below. Attributes marked with an asterisk (*) are those that can be naturally combined with visual attributes. For simplicity, I use the CSS syntax, although the proposal could be written using the notation of either CSS or DSSSL Online.

Attribute name: Natural language and numerical values
*volume: soft | medium | loud | [0-10]
[left | right]-volume: <none>
voice-family: <string> (name)
speech-rate: slow | medium | fast | [1-10]
average-pitch: [1-10]
*pitch-range: [0-200]
*stress: [0-100]
richness: [0-100]
*pause-[before | after | around]: <none>
pronunciation-mode: <string>
language: <string>
country: <string>
dialect: <string> (name)
*[before | after | during]-sound: <uri>


Visual attributes

I will not list the full range of visual attributes that can be controlled by proposed HTML style sheets. Instead, I give only the attributes that are naturally linked with audio attributes. I use the nomenclature of CSS, although these attributes can be equally well expressed using the terminology of DSSSL Online.

Attribute name: Natural language and numerical values
font-size: xx-small | x-small | small | medium | large | x-large | xx-large
font-style: normal | italic | oblique | small-caps | [ italic | oblique ] small-caps
font-weight: extra-light | light | demi-light | medium | demi-bold | bold | extra-bold
padding: auto
background: transparent | <uri>

Multimodal attributes

Here I give an example of the solution to the second design goal by proposing a set of multimodal attributes designed for simultaneous control of visual and audio displays. In a number of cases, visual and audio attributes given above can be expressed by a common meaning. In these cases, the visual and audio attributes can be combined in a natural manner to produce multimodal attributes. I propose the multimodal style attributes given in the following table as they are defined below.

Multimodal attribute: Audio name, Visual name
size: volume, font-size
range: pitch-range, font-style
weight: stress, font-weight
space: pause-[before | after | around], padding
background: [before | after | during]-sound, background
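
The table above can be read as a lookup from each multimodal attribute to the unimodal attributes it drives. The following Python sketch is only an illustration of that reading: the dictionary and function names are my own, the fourth attribute is named space as in the definitions section, and the bracketed pause and sound families are expanded into individual names.

```python
# Multimodal attribute -> (audio attribute names, visual attribute name).
# Entries follow the table in the text; the identifiers MULTIMODAL and
# controlled_attributes are hypothetical and not part of any proposal.
MULTIMODAL = {
    "size": (["volume"], "font-size"),
    "range": (["pitch-range"], "font-style"),
    "weight": (["stress"], "font-weight"),
    "space": (["pause-before", "pause-after", "pause-around"], "padding"),
    "background": (["before-sound", "after-sound", "during-sound"], "background"),
}


def controlled_attributes(name, mode):
    """Return the unimodal attributes that one multimodal attribute drives."""
    audio, visual = MULTIMODAL[name]
    if mode == "audio":
        return list(audio)
    if mode == "visual":
        return [visual]
    raise ValueError(f"unknown presentation mode: {mode!r}")
```

A visual browser would ask for controlled_attributes("size", "visual") and an audio browser for controlled_attributes("size", "audio"); the author writes the size declaration only once.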

Definitions, proposed values, and mnemonics

size: 1 | 2 | 3 | 4 | 5 | 6 | 7
The relationship here is obvious - larger text, louder speech, and higher numbers will usually be associated. If they are not, authors should use a suitable combination of unimodal style attributes, but if they are, authors will minimize their work by using the multimodal forms. Possible mnemonics for the values (from musical notation): pianissimo | piano | mezzopiano | mezzo | mezzoforte | forte | fortissimo.
range: 1 | 2 | 3 | 4 | 5 | 6 | 7
Here again the relation is fairly obvious if you consider how printed words are normally spoken (e.g., "It's not really important ..."). The mapping is a bit trickier, mainly because voices are so much more expressive in this regard than print. Probably values 1-4 would map to normal type, and 5-7 would map to italics or oblique type. Possible mnemonics (could use work): dead | dull | boring | normal | happy | excited | wild.
weight: 1 | 2 | 3 | 4 | 5 | 6 | 7
Stress and font-weight are again fairly naturally related, and the mapping from numbers to current natural language values is obvious. Possible mnemonics (more or less from boxing): feather | light | midlight | middle | midheavy | heavy | superheavy.
space: 1 | 2 | 3 | 4 | 5 | 6 | 7 {above/right/below/left specified following CSS}
Here the attribute values should be tied to the visual presentation, which is richer because printed spaces are two-dimensional while audio spaces can only be one-dimensional. Space should be tied to the visual attribute of padding or margin; I picked padding, but I think that either could be chosen. Possible mnemonics: none (a bit counter-intuitive at 1) | narrower | narrow | normal | wide | wider | widest.
background: <uri>
Here you just save a little time, but again the meanings match so it makes sense to allow authors to simultaneously specify audio and visual backgrounds. The allowed values are the same, so the presentation software must interpret the URI, but that is trivial - visual backgrounds go with visual presentations, audio with audio, and so on. Maybe style sheets should also provide visual before- and after-cues, to go along with the audio cues that Raman suggests? Once one is allowed, the other would follow naturally in the same way that background can be used naturally for both audio and visual presentation without the need for any extra notation.
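
As an illustration of the visual half of these definitions, the following Python sketch maps the values 1-7 onto the CSS keywords listed earlier. It assumes that size and weight pair positionally with the seven font-size and font-weight keywords, and it uses the tentative split suggested above for range (1-4 normal, 5-7 italic or oblique). The function name and the emph parameter are hypothetical.

```python
# The seven CSS keywords for each visual attribute, in increasing order,
# as listed in the visual attribute table.
FONT_SIZE = ["xx-small", "x-small", "small", "medium", "large", "x-large", "xx-large"]
FONT_WEIGHT = ["extra-light", "light", "demi-light", "medium", "demi-bold", "bold", "extra-bold"]


def visual_value(attribute, level, emph="italic"):
    """Translate a multimodal value (1-7) into a visual keyword.

    The positional pairing and the 4/5 threshold for range are assumptions
    made for illustration, not a fixed part of the proposal.
    """
    if not isinstance(level, int) or not 1 <= level <= 7:
        raise ValueError(f"level must be an integer from 1 to 7, got {level!r}")
    if attribute == "size":
        return FONT_SIZE[level - 1]
    if attribute == "weight":
        return FONT_WEIGHT[level - 1]
    if attribute == "range":
        # Values 1-4 stay in normal type; 5-7 switch to the emphasis face.
        return "normal" if level <= 4 else emph
    raise ValueError(f"no visual keyword mapping for {attribute!r}")
```

With this pairing the default value 4 yields medium for both size and weight, which matches the convention that 4 always means normal.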

Other values

Other values can (and generally should) be allowed for most of these attributes. Physical values obviously should be allowed, to give authors detailed control over document formats; however, the allowed values were selected to be as useful and intuitive as possible, to encourage casual authors to use them rather than physical values. They should be granular enough to give good control, but not so granular as to be confusing. The mnemonics could also be allowed values, although I am not sure that I would recommend this in general.

Standardization and language independence

I simultaneously address the third and fourth design goals, standardization of allowed values and language independence, by allowing numbers, e.g. [1-7], to be used to represent the values of multimodal attributes for which this procedure seems to be intuitively reasonable, in their natural order (smallest=1, normal=4, largest=7). This is similar to the practices proposed by Raman and the TEO group; the only significant difference is the proposal to use the same range of numbers for all attributes. I allow 7 values because that seems to provide a reasonable amount of granularity for most attributes. Using the same range for most attributes makes it simpler for authors to remember the allowed values and their meanings (e.g., 4 is always the default), so I suggest that the range remain the same across attributes if possible, even if a different global range is preferred.

Using numbers gives relative language independence because numerical notation is more widespread than any language. If numbers are allowed, authors can learn the definitions of the numbers in whatever language they prefer. There will still be some difficulty for authors who are not familiar with Arabic numerals, but this could be dealt with simply by also allowing non-Arabic numerals with the same meanings; numbers are well defined, so translation is trivial.

Because it simultaneously solves two design problems, the practice of using a standard numerical range as allowed values would also be an advantageous practice for general styling language design. An additional minor benefit is also obtained: because each integer is represented by a single character, the amount of typing needed to create style descriptions is reduced.
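
To show how little machinery a shared numerical range needs, here is a minimal Python sketch of a value parser. It accepts a numeral or, optionally, a mnemonic from a table supplied by the caller; the table shown is the musical list suggested earlier for size. The function name and its signature are assumptions for illustration. The sketch relies on the fact that Python's built-in int() accepts decimal digits from any script, which is one way non-Arabic numerals could be handled.

```python
# Mnemonics for size, in order from 1 to 7, taken from the definitions above.
SIZE_MNEMONICS = ["pianissimo", "piano", "mezzopiano", "mezzo",
                  "mezzoforte", "forte", "fortissimo"]


def parse_level(text, mnemonics=()):
    """Parse an attribute value into an integer from 1 to 7.

    Numerals in any script with decimal digits are accepted, because
    int() understands them. A mnemonic is accepted only if the caller
    supplies a table for it.
    """
    token = text.strip().lower()
    try:
        level = int(token)
    except ValueError:
        names = list(mnemonics)
        if token not in names:
            raise ValueError(f"not a number or known mnemonic: {text!r}")
        level = names.index(token) + 1
    if not 1 <= level <= 7:
        raise ValueError(f"value out of range 1-7: {text!r}")
    return level
```

Both parse_level("6") and parse_level("forte", SIZE_MNEMONICS) give 6, and a mnemonic table in another language could be swapped in without touching the numeric path.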

Independent specification of attributes

The fifth design goal, allowing independent specifications for audio and visual presentations, is also easily met. Under CSS, the use of multimodal attributes would not preclude the specification of refinements to any single mode of presentation. Rather, authors should first specify the document style as accurately as possible using multimodal attributes, and then add further refinements through modifications of unimodal attributes, so that the document is presented well to all users.
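
The two-step workflow described here can be sketched as a two-pass resolution for the visual mode. This is a sketch under stated assumptions: the precedence rule (an explicit unimodal declaration overrides the value derived from a multimodal one) is my reading of "refinement", the positional keyword pairing is assumed, and all identifiers are hypothetical.

```python
FONT_SIZE = ["xx-small", "x-small", "small", "medium", "large", "x-large", "xx-large"]
FONT_WEIGHT = ["extra-light", "light", "demi-light", "medium", "demi-bold", "bold", "extra-bold"]

# Multimodal attribute -> (visual attribute it drives, keyword list).
DERIVED = {
    "size": ("font-size", FONT_SIZE),
    "weight": ("font-weight", FONT_WEIGHT),
}
UNIMODAL_VISUAL = {"font-size", "font-weight"}


def resolve_visual(declarations):
    """Compute visual properties from a mix of multimodal and unimodal rules."""
    resolved = {}
    # Pass 1: derive visual values from the multimodal declarations.
    for name, value in declarations.items():
        if name in DERIVED:
            target, keywords = DERIVED[name]
            resolved[target] = keywords[value - 1]
    # Pass 2: explicit unimodal declarations refine (override) derived values.
    for name, value in declarations.items():
        if name in UNIMODAL_VISUAL:
            resolved[name] = value
    return resolved
```

For example, a style with size 6, weight 5, and an explicit font-size of 14pt resolves to a 14pt demi-bold font for visual display, while an audio browser would still use size 6 to choose the volume.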

Encouraging authors to produce multimodal documents

My sixth design goal is to encourage authors to use the multimodal attributes provided by the styling language. Establishing and standardizing multimodal attributes is necessary to enable authors to easily produce rich web documents for multimodal display, but it is not sufficient to ensure that authors regularly produce rich documents and styles suitable for multimodal presentation. Additional steps should be taken to encourage authors to create customized multimodal style descriptions in place of unimodal descriptions. For example, in a well-designed styling language, authors should be pushed to use multimodal attributes as the first step of designing styles, in preference to unimodal attributes.

There will be some resistance to using multimodal attributes, as many authors, regardless of the level of experience, are accustomed to working primarily with unimodal (usually visual) attributes. To counteract this resistance, styling language designers should not give an overcomplete set of attributes, i.e., all unimodal and multimodal attributes. Instead, for each group of unimodal attributes that combines to form a multimodal attribute, the multimodal attribute should replace the richest of the unimodal attributes (so that authors are likely to need less refinement of unimodal attributes). For the attributes proposed above, this scheme could be implemented as follows.

size could probably replace either attribute reasonably well. It should probably replace the visual attribute, font-size, as authors are probably more likely to prescribe visual style than audio style. In this case, the name would be a bit less intuitive than before, but this could be an advantage as it would serve to remind the author that the attribute is multimodal rather than unimodal.
range would replace the audio attribute, pitch-range, which has more granularity and therefore carries more information than the visual attribute, font-style. The association with an audio property might discourage visual authors from using this attribute, however, especially because the two attributes overlap in meaning but are not equivalent, so I suggest a replacement of font-style as discussed below.
weight would replace the visual attribute, font-weight, following the case for size.
space would again replace the visual attribute, padding, as in the cases above (although padding may be a better name for this attribute).
background would replace both attributes. The use of multiple URIs of different types should be allowed, as presentation software can tell easily which to use from the context (e.g., use a visual background for visually displayed text, not an audio background).

In addition to the new multimodal attributes, two changes to the CSS scheme would be needed. I propose that small-caps be added to the set of allowed values for the CSS attribute text-transform, and that a new attribute, emph-style, be added, as given below. The old attribute, font-style, would then be expressed through combinations of range, text-transform, and emph-style.

emph-style: italic | oblique
controls whether high-range text is presented in italic or oblique type.
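
The following Python sketch illustrates how a presentation program might recover an old font-style value from the three attributes. It assumes the tentative threshold from the definition of range (values 5-7 are emphasized) and the CSS default of none for text-transform; the function name and its defaults are hypothetical.

```python
def legacy_font_style(range_level=4, text_transform="none", emph_style="italic"):
    """Rebuild an old font-style value from range, text-transform, emph-style.

    Assumes range values 5-7 select the emphasis face and 1-4 do not.
    """
    parts = []
    if range_level >= 5:
        # High-range text is set in the face chosen by emph-style.
        parts.append(emph_style)
    if text_transform == "small-caps":
        parts.append("small-caps")
    # No emphasis face and no small-caps is the old value "normal".
    return " ".join(parts) or "normal"
```

Each of the old values (normal, italic, oblique, small-caps, and italic or oblique combined with small-caps) can be reached by some combination of the three inputs, so no functionality is lost under these assumptions.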

Precise specification of multimodal attributes

One difficulty with the scheme proposed here is that many authors will want to specify physical quantities, such as the font-size, very precisely. Solving this problem simply requires a standard mapping from the interval [1-7] to the range of reasonable physical values for each attribute. Then, if the author specifies a physical value (with units) for a multimodal attribute, the units will enable the presentation software to determine the mode to which the value applies, and the mappings can be used to calculate the appropriate values for the other modes connected with the attribute.

Alternatively, if such a mapping exists, the value could be specified more precisely by allowing any number in the range [1-7], and not just integers. This is probably preferable, as it decreases the device-dependence of the style. However, physical values (with units) should be allowed in any case, as a large body of authors are accustomed to using those values.

I have not attempted to produce the required mappings here. That is left for the present as a research problem, as it will be best solved experimentally by having a wide range of subjects evaluate a large number of displays with a variety of mappings.
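
Purely to illustrate the mechanism, and not the mappings themselves, the following Python sketch shows how units could select the mode and how a value could be carried across modes through the interval [1-7]. Everything specific in it is a placeholder: the linear shape of the mapping, the physical ranges for points and decibels, and the function names are all assumptions, and the real mappings would have to come from the experiments described above.

```python
# unit -> (mode, physical value at level 1, physical value at level 7).
# These ranges are invented placeholders, NOT values from the proposal.
PHYSICAL_RANGE = {
    "pt": ("visual", 6.0, 30.0),
    "dB": ("audio", 40.0, 76.0),
}


def level_from_physical(value, unit):
    """Map a physical value onto [1, 7]; the unit identifies the mode."""
    mode, low, high = PHYSICAL_RANGE[unit]
    level = 1 + 6 * (value - low) / (high - low)
    # Clamp so out-of-range physical values still give a usable level.
    return mode, min(7.0, max(1.0, level))


def physical_from_level(level, unit):
    """Map a (possibly fractional) level in [1, 7] to a physical value."""
    _, low, high = PHYSICAL_RANGE[unit]
    return low + (level - 1) * (high - low) / 6
```

Under these placeholder ranges, a size of 18pt is recognized as a visual value at level 4.0, and physical_from_level(4.0, "dB") then gives the corresponding volume of 58.0 for audio presentation.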


Summary

Although I have used HTML style sheet proposals as examples here, the design goals and the methods used to achieve them would apply to multimodal styling languages for use with any DTD. The third and fourth goals, standardization of allowed attribute values and minimal language dependence, are also useful goals for unimodal styling language designers (as are the first and seventh, but they are so intuitive that they are almost universally followed).

This is not a complete proposal for a multimodal HTML styling language. I have, however, proposed the creation of five multimodal style attributes, and the elimination or modification of five related visual or audio attributes in order to encourage authors to use the multimodal attributes. As proposed, the creation of one new visual attribute and the slight modification of another would also be necessary to recover the full proposed functionality of CSS.

The changes proposed to current HTML styling languages are small but important. Without these changes, it will be significantly harder for web authors to produce rich documents that can be presented well in all modes, and therefore most documents will be designed for unimodal (probably visual) display. These changes could be implemented in the future, but this would result in a significant diminution of their full power, as once visual authors become accustomed to visually oriented styling languages there will be more resistance to the multimodal forms. In addition, there may be some problems with backward compatibility of attributes, as in the case of font-style, which may be easily eliminated if designers plan now for eventual conversion to multimodal styling languages. Thus, I suggest that the proposed changes be made early, before visual styling technology has a chance to become widespread and the current technology is locked in.

Last modified 1 March 1996 by David Seibert.