Darn good question. Your approach appears to have the correct
results, but I'm not sure it's practical for many implementations
(global search-and-replace operations are inconvenient for
sequential processing models), and it certainly isn't a healthy
way to think about SGML documents.
The way to think about SGML documents, IMHO, is this: the sequence
of characters comprising an SGML document is presented to an
SGML parser, which separates the markup from the data and passes
the "results" to the processing application.
[Much of this is covered in
http://info.cern.ch/hypertext/WWW/MarkUp/Connolly/921203/Text.html
and what isn't there should be, but bear with me...]
So many of the details of the syntax of SGML are invisible to
the processing application: the fact that <>'s delimit tags
instead of {}'s, for example.
The information delivered to the application by the parser
is called the Element Structure Information Set. It contains
things like tags, attributes, attribute values, and data
characters.
So there are two questions, to my mind:
1. How does the SGML parser treat newlines?
2. How does the WWW processing application treat newlines?
Question 1 is answered by the SGML standard. Question 2
is for us to decide.
SGML defines several types of content, which determine
the kinds of markup that are recognized inside an element.
The simplest is EMPTY, for example:
<!ELEMENT P - O EMPTY>
When you see a P start tag, you know there is no content,
and you assume, effectively, that a P end tag follows immediately.
The next simplest is CDATA, for example:
<!ELEMENT TITLE - - CDATA>
When you parse the content of a TITLE element, the only
thing you look for is an end tag. Everything else is
reported by the SGML parser as data characters.
Then there is RCDATA, which is just like CDATA, except
that character and entity references are also recognized.
The most common content type is MIXED, where all kinds
of markup are recognized: tags, entities, as well as data.
For example:
<!ELEMENT ADDRESS - - (#PCDATA|A|P)>
The parser should report start tags, end tags, entities,
and data inside an ADDRESS element.
Then there is ELEMENT content, where only tags are
recognized, for example:
<!ELEMENT HEAD - - (TITLE? & ISINDEX? & NEXTID?)>
The parser will report any data inside a HEAD element
as an error. But [and this is the reason I went through
this whole exercise] whitespace is ignored between
tags in element content. So the text:
<head>
<title>sample</title>
</head>
will be reported to the application by the parser as
a HEAD start tag,
a TITLE start tag,
the data string "sample",
a TITLE close tag,
and a HEAD close tag,
whereas the text:
<address>
<a HREF="#tim">Tim Berners-Lee</a>
</address>
will be reported as
an ADDRESS start tag,
the data string "\n",
an A start tag (with an HREF attribute and value),
the data string "Tim Berners-Lee",
an A close tag,
the data string "\n",
and an ADDRESS close tag.
[There's another content type called ANY, but it's just
like MIXED for our purposes.]
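To make the difference concrete, here is a minimal Python sketch of the whitespace rule. It is not how sgmls works: the event tuples, the CONTENT_TYPE table, and the function name report are all made up for illustration, and it ignores the other record-boundary rules in the standard.

```python
# Hypothetical content-type table, covering only the elements used above.
CONTENT_TYPE = {
    "HEAD": "ELEMENT",
    "TITLE": "CDATA",
    "ADDRESS": "MIXED",
    "A": "MIXED",
}

def report(events):
    """Drop whitespace-only data that sits directly inside element content."""
    stack = []   # names of the currently open elements
    out = []
    for kind, value in events:
        if kind == "start":
            stack.append(value)
            out.append((kind, value))
        elif kind == "end":
            stack.pop()
            out.append((kind, value))
        else:  # kind == "data"
            parent = stack[-1] if stack else None
            if CONTENT_TYPE.get(parent) == "ELEMENT" and value.strip() == "":
                continue   # whitespace between tags in element content
            out.append((kind, value))
    return out

head = [("start", "HEAD"), ("data", "\n"),
        ("start", "TITLE"), ("data", "sample"), ("end", "TITLE"),
        ("data", "\n"), ("end", "HEAD")]
address = [("start", "ADDRESS"), ("data", "\n"),
           ("start", "A"), ("data", "Tim Berners-Lee"), ("end", "A"),
           ("data", "\n"), ("end", "ADDRESS")]

for event in report(head) + report(address):
    print(event)
```

The HEAD events come back without the two newline strings; the ADDRESS events come back untouched, newlines and all.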
>I spent quite a bit of time thinking about what is intuitively the right
>way to do it, and I came up with this method.
>
>0. Convert all new-lines inside of tags to spaces.
Newlines inside tags are the responsibility of the
SGML parser. I suggest you use the excellent sgmls
parser to test your rules by trial and error, or
consult the standard.
I have done both, and the results of my labors
are available in libHTML. I have also done an elisp
implementation, if anybody's interested.
The one tricky case is newlines inside attribute
value literals, e.g.
<foo bar="12
3">
SGML section 7.9.3 says:
"An attribute value literal is interpreted as an attribute value by
replacing references within it, ignoring Ee and RS, and replacing RE
or SEPCHAR with SPACE."
The reference concrete syntax assigns the conventional unix newline
character, ASCII code 10, to the role of RS. So strictly speaking,
it should be ignored, and the value of the attribute is "123".
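Here is a rough Python sketch of that strict reading, assuming the reference concrete syntax (RS is ASCII 10, RE is ASCII 13, and TAB is the only SEPCHAR). The function name attribute_value is my own, and the sketch does not replace references or handle Ee.

```python
RS = "\n"        # record start, ASCII 10
RE = "\r"        # record end, ASCII 13
SEPCHAR = "\t"   # separator character, ASCII 9
SPACE = " "      # ASCII 32

def attribute_value(literal):
    """Apply the 7.9.3 rule to a literal that contains no references."""
    out = []
    for ch in literal:
        if ch == RS:
            continue              # RS is ignored
        if ch in (RE, SEPCHAR):
            out.append(SPACE)     # RE and SEPCHAR are replaced with SPACE
        else:
            out.append(ch)
    return "".join(out)

print(repr(attribute_value("12\n3")))   # the bar attribute from the example
```

Fed the raw bytes of the example, where the line break is a bare ASCII 10, this yields "123".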
On the other hand, the sgmls parser does a little behind-the-scenes
magic on newlines. From the sgmls man page:
An external entity resides in one or more files. The entity
manager component of sgmls maps a sequence of files into an
entity in three sequential stages:
1. each carriage return character is turned into a non-SGML
character;
2. each newline character is turned into a record end
character, and at the same time a record start character
is inserted at the beginning of each line;
3. the files are concatenated.
[This sort of thing _does_ still conform to the SGML standard.
You're allowed to do magic while assembling entities.]
So using sgmls, the newline in this case would be treated as
RE, and converted to SPACE, i.e. ASCII character 32, by the
parser. So the value of the bar attribute is "12 3".
It's a question of how we construct SGML entities from HTML
data streams.
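A small Python sketch of the sgmls-style entity construction, followed by the attribute value rule, shows where the space comes from. This is only my reading of the man page text quoted above, not the sgmls source: the names map_file, build_entity, and attribute_value are hypothetical, and the NONSGML placeholder is an arbitrary stand-in, since the man page does not say which non-SGML character is used.

```python
RS, RE, SPACE = "\n", "\r", " "   # ASCII 10, 13, 32 (reference concrete syntax)
NONSGML = "\x7f"                  # assumption: arbitrary stand-in character

def map_file(text):
    """Stages 1 and 2: map one file's characters into entity characters."""
    text = text.replace("\r", NONSGML)      # stage 1: carriage returns
    pieces = text.split("\n")
    out = []
    for i, line in enumerate(pieces):
        last = i == len(pieces) - 1
        if last and line == "":
            break                           # file ended with a newline
        out.append(RS + line)               # record start before each line
        if not last:
            out.append(RE)                  # the newline becomes record end
    return "".join(out)

def build_entity(files):
    """Stage 3: concatenate the mapped files."""
    return "".join(map_file(text) for text in files)

def attribute_value(literal):
    """Ignore RS; replace RE and TAB with SPACE (no references handled)."""
    return literal.replace(RS, "").replace(RE, SPACE).replace("\t", SPACE)

entity = build_entity(['<foo bar="12\n3">'])
literal = entity.split('"')[1]    # the text between the quotes
print(repr(literal), "->", repr(attribute_value(literal)))
```

The newline in the file turns into RE followed by RS inside the literal, so the value comes out as "12 3" rather than "123".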
>1. For each tag NOT in
> <PRE> </PRE> <A> </A> <PLAINTEXT>
> remove ALL surrounding new-lines.
First, let's get one thing straight: the PLAINTEXT element as
described by the original HTML documentation is not representable
in SGML. For my purposes, I consider the HTML document to
end at the <PLAINTEXT> tag, and I consider the rest of the
data stream to be an RFC-822 message body or a MIME text/plain body,
and not SGML at all.
Next, let's keep in mind that you can't do things like the following
global substitution,
s/\n+(<(H1|H2|ADDRESS...)>)/$1/g;
because it might find things that look like tags but aren't,
for example
<foo bar="
<H1>this is a little kooky, but nonetheless legal and possible.">
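The failure is easy to reproduce. Here is a minimal Python sketch of that kind of substitution; the pattern lists only the three element names spelled out above (the rest of the list stays elided), and the function name strip_newlines_before_tags is made up.

```python
import re

# Newlines followed by something that looks like one of the listed start tags.
TAG_RULE = re.compile(r"\n+(<(?:H1|H2|ADDRESS)>)")

def strip_newlines_before_tags(text):
    """Blindly delete newlines in front of anything shaped like a tag."""
    return TAG_RULE.sub(r"\1", text)

source = '<foo bar="\n<H1>this is legal and possible.">'
result = strip_newlines_before_tags(source)

print(repr(source))
print(repr(result))
```

The regular expression has no idea it is inside an attribute value literal, so it deletes a newline that belongs to the value of bar.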
But even if you're using a proper SGML parser, consider:
<H1>Here we go!
<a href="#xyz">click here</a>
There we went!
</H1>
The parser will return an H1 start tag, and then the
string "Here we go!\n". At this point, your rule doesn't
tell me what to do with the newline. I have to get
the next object before I decide.
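In code, that means holding each data string back until the next event arrives. A rough Python sketch, assuming the same made-up event tuples a parser might deliver, and covering only the newlines to the left of a tag under one reading of the quoted rule (the names EXEMPT and trim_before_tags are hypothetical):

```python
EXEMPT = {"PRE", "A", "PLAINTEXT"}   # tags the quoted rule leaves alone

def trim_before_tags(events):
    """Yield events, trimming newlines at the end of data before other tags."""
    pending = None                   # data string held back for one event
    for kind, value in events:
        if pending is not None:
            if kind in ("start", "end") and value not in EXEMPT:
                pending = pending.rstrip("\n")
            if pending:
                yield ("data", pending)
            pending = None
        if kind == "data":
            pending = value          # cannot decide yet; wait for next event
        else:
            yield (kind, value)
    if pending:
        yield ("data", pending)

events = [("start", "H1"), ("data", "Here we go!\n"),
          ("start", "A"), ("data", "click here"), ("end", "A"),
          ("data", "\nThere we went!\n"), ("end", "H1")]

for event in trim_before_tags(events):
    print(event)
```

The newline after "Here we go!" survives because the next event is an A start tag; the one after "There we went!" is dropped because the next event is the H1 end tag.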
Hmm... I guess that's reasonable. But I'd rather just pass all the
data characters on to the text formatter and let it figure all this out.
Do we want to specify rules for the text formatter? If so, we need to
go beyond just newlines. I see some data providers writing things
like:
<H1>Here are some things to consider:</H1>
<p> thing one
<p> thing two
<p> thing three
The MidasWWW browser displays this as
Here are some things to consider:
thing one
thing two
thing three
which I think is reasonable. The provider should either use
<H1>Here are some things to consider</H1>
<UL>
<li>thing one
<li>thing two
<li>thing three
</ul>
or at a minimum,
<H1>Here are some things to consider</H1>
<PRE>
thing one
thing two
thing three
</PRE>
>2. For each tag in
> <PRE> <PLAINTEXT>
> remove ALL new-lines to left, and one new-line to the right.
Why remove one new-line to the right? Just for HTML source file aesthetics?
>If XMP and LISTING sections are being used, they would be treated the
>same as PRE.
>
>Note that this converts new-lines around anchors into spaces UNLESS they
>appear immediately at the beginning or end of some other element.
There are also some new elements that act like A: EM, CODE, SAMP, etc.
>If browsers use this method, it would allow html-generators to put in
>new-lines all over the place for readability of HTML, without introducing
>lots of annoying extra spaces in the output. This is what seems like
>the most useful thing to do, although I'm not sure it is "correct".
>
>So is it correct? And are there any obvious flaws?
We have not specified the rules for typesetting elements other
than XMP, LISTING, and PRE before now, so what you suggest is
as correct as anything else.
I think it's important that we agree on how to typeset the <PRE>
element. [And I think getting rid of the first newline after a <PRE>
tag is a Bad Thing.]
It's not important to me that we agree how to typeset other elements.
I'm inclined to give formatters great leeway with how they treat whitespace.
I wouldn't mind at all if something like
<H1>testing 123</H1>
foo bar blech icky<a>wicky</a>
woo
<p> abc defghi
jhkl sldjkf sld lsdjkf
were typeset as:
TESTING 123
foo bar blech ickywicky woo
abc defghi jhkl sldjkf sld lsdjkf
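A formatter with that much leeway can be very simple. Here is a minimal Python sketch, assuming made-up event tuples from the parser; the function name collapse is hypothetical, and it deals only with whitespace inside data strings, not with line breaking, paragraph spacing, or headings.

```python
import re

def collapse(events):
    """Join data strings, squeezing whitespace runs except inside PRE."""
    out = []
    in_pre = 0                       # nesting depth of open PRE elements
    for kind, value in events:
        if kind == "start" and value == "PRE":
            in_pre += 1
        elif kind == "end" and value == "PRE":
            in_pre -= 1
        elif kind == "data":
            # Outside PRE, any run of spaces, tabs, or newlines is one space.
            out.append(value if in_pre else re.sub(r"\s+", " ", value))
    return "".join(out)

paragraph = [("data", "foo bar blech icky"),
             ("start", "A"), ("data", "wicky"), ("end", "A"),
             ("data", "\nwoo\n")]
listing = [("start", "PRE"), ("data", "thing one\n  thing two\n"), ("end", "PRE")]

print(repr(collapse(paragraph).strip()))
print(repr(collapse(listing)))
```

The paragraph comes out as one line with single spaces, while the PRE content keeps every newline and space it was given.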
My point is: don't use whitespace to represent significant
information except in the PRE element. Use the tags that
are defined to have significance.
Dan