Re: dealing with new-lines

Thomas A. Fine (fine@cis.ohio-state.edu)
Fri, 8 Jan 93 15:38:20 -0500

Next message: Thomas A. Fine: "Re: dealing with new-lines"
Previous message: Michael Leventhal: "SGML newline processing"
Maybe in reply to: Thomas A. Fine: "dealing with new-lines"
Next in thread: Thomas A. Fine: "Re: dealing with new-lines"

>Darn good question. Your approach appears to have the correct
>results, but I'm not sure it's practical for many implementations
>(global search-and-replace operations are inconvenient for
>sequential processing models), and it certainly isn't a healthy
>way to think about SGML documents.

Most browsers seem to do caching anyway, which means they can do a
global search/replace. Even so, you can still do it more or less sequentially.
Just buffer strings of new-lines until you know what follows them, and
then deal with it. There's no method you can propose which is correct
and doesn't involve storing something somewhere.
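
A minimal sketch of the buffering idea, in Python. It assumes a
hypothetical tokenizer has already split the document into ("tag", ...)
and ("text", ...) pairs, and it assumes that new-lines not adjacent to a
stripping tag are simply kept; neither assumption comes from this thread.
Only the list of exempt tags is taken from rule 1 as quoted below.

```python
from typing import Iterable, Iterator, Tuple

Token = Tuple[str, str]  # ("tag", "<H1>") or ("text", "Here we go!\n")

# Tags whose surrounding new-lines are left alone (list from rule 1).
KEEP_NEWLINES = {"PRE", "/PRE", "A", "/A", "PLAINTEXT"}


def tag_name(tag: str) -> str:
    """Upper-cased tag name, e.g. '<a href="x">' -> 'A', '</a>' -> '/A'."""
    parts = tag.strip("<>").split()
    return parts[0].upper() if parts else ""


def collapse(tokens: Iterable[Token]) -> Iterator[Token]:
    """Drop new-lines around stripping tags, holding back trailing
    new-lines until the next token shows what follows them."""
    pending = ""         # trailing new-lines not yet decided
    prev_strips = False  # did the last emitted tag strip new-lines?
    for kind, value in tokens:
        if kind == "tag":
            strips = tag_name(value) not in KEEP_NEWLINES
            if pending and not strips:
                yield ("text", pending)  # next tag is exempt: keep them
            pending = ""
            yield (kind, value)
            prev_strips = strips
        else:
            text = pending + value
            if prev_strips:
                text = text.lstrip("\n")
            body = text.rstrip("\n")
            pending = text[len(body):]   # decide these later
            if body:
                yield ("text", body)
                prev_strips = False
    if pending:
        yield ("text", pending)          # end of input: nothing strips them


def render(tokens: Iterable[Token]) -> str:
    """Join the collapsed token stream back into a string."""
    return "".join(value for _, value in collapse(tokens))
```

The only state carried between tokens is the pending run of new-lines
and one flag, which is the "storing something somewhere" part.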

>The way to think about SGML documents, IMHO, is this: the sequence
>of characters comprising an SGML document is presented to an
>SGML parser, which parses the markup from the data and passes
>the "results" to the processing application.

This is another alternative I considered. But I figured that I have to
deal with various parsing things when I read the HTML anyway. I was
just going to take each chunk of data (with anchors pre-processed out)
and remove all whitespace at the beginning and end (except for PRE sections
and such). But if someone put in whitespace, why should I muck with it?
Who knows, they might have even wanted it there.
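
For comparison, the chunk-trimming alternative described above might
look roughly like this in Python. The token format and the function
name are hypothetical, anchors are assumed to have been removed already,
and only PRE is treated as a protected section.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # ("tag", "<P>") or ("text", "  hello \n")


def tag_name(tag: str) -> str:
    """Upper-cased tag name, e.g. '<pre>' -> 'PRE', '</pre>' -> '/PRE'."""
    parts = tag.strip("<>").split()
    return parts[0].upper() if parts else ""


def trim_chunks(tokens: Iterable[Token]) -> List[Token]:
    """Strip leading/trailing whitespace from each data chunk,
    except for chunks inside a PRE section."""
    in_pre = False
    out: List[Token] = []
    for kind, value in tokens:
        if kind == "tag":
            name = tag_name(value)
            if name == "PRE":
                in_pre = True
            elif name == "/PRE":
                in_pre = False
            out.append((kind, value))
        elif in_pre:
            out.append((kind, value))      # preformatted: leave untouched
        else:
            stripped = value.strip()
            if stripped:                   # drop chunks that were only whitespace
                out.append((kind, stripped))
    return out
```

This throws away whitespace the author may have put there on purpose,
which is the objection raised above.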

>>1. For each tag NOT in
>> <PRE> </PRE> <A> </A> <PLAINTEXT>
>> remove ALL surrounding new-lines.
>
>First, let's get one thing straight: the PLAINTEXT element as
>described by the original HTML documentation is not representable
>in SGML. For my purposes, I consider the HTML document to
>end at the <PLAINTEXT> tag, and I consider the rest of the
>data stream to be an RFC-822 message body or a MIME text/plain body,
>and not SGML at all.

I hadn't meant otherwise. But you have to read it in anyway, and since
my method deals with things prior to any other parsing, you treat it
all as one clump.

>Next, let's keep in mind that you can't do things like the following
>global substitution,
>s/\n+(<(H1|H2|ADDRESS...)>)/$1/g;
>because it might find things that look like tags but aren't,
>for example
>
><foo bar="
><H1>this is a little kooky, but nonetheless legal and possible.">
>
>But even if you're using a proper SGML parser, consider:
>
><H1>Here we go!
><a href="#xyz">click here</a>
>There we went!
></H1>
>
>The parser will return an H1 start tag, and then the
>string "Here we go!\n". At this point, your rule doesn't
>tell me what to do with the newline. I have to get
>the next object before I decide.
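
To make the first quoted caveat concrete, here is a small Python sketch
of a substitution along the lines of the quoted one. The tag list is cut
down to three names for illustration, and the function name is
hypothetical. It shows the pattern rewriting text inside an attribute
value, which is not a tag at all.

```python
import re

# Naive rule: drop new-lines in front of anything that looks like one of
# these start tags. Three names only, for illustration.
NAIVE = re.compile(r"\n+(<(?:H1|H2|ADDRESS)>)")


def naive_strip(doc: str) -> str:
    """Remove new-lines preceding tag-like text, with no parsing at all."""
    return NAIVE.sub(r"\1", doc)


# The "<H1>" on the second line sits inside the quoted value of the bar
# attribute, so it is data, not markup.
doc = '<foo bar="\n<H1>not really a heading">\n<H1>A real heading</H1>'
result = naive_strip(doc)
# The new-line inside the attribute value has been removed as well,
# silently changing the attribute's content.
```

A regular expression alone cannot tell the two "<H1>" occurrences apart;
that takes at least enough parsing to track quoted attribute values.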

Like I said before, you have to do some sort of storage at some point
anyway.

>Hmm... I guess that's reasonable. But I'd rather just pass all the

Like I said before, you have to do some sort of storage at some point
anyway.

>My point is: don't use whitespace to represent significant
>information except in the PRE element. Use the tags that
>are defined to have significance.

I suppose I agree with this more or less, at least from the point of view
of generating my own code. But we have to make something clear - can
a browser keep all the whitespace if it wants to? Or in other words,
can an HTML generator assume collapsing whitespace, or just be aware
that it might happen?

tom
