Re: Accurate user-based log file analysis

Brian Behlendorf (brian@organic.com)
Mon, 17 Jul 1995 21:22:58 -0700 (PDT)


On Mon, 17 Jul 1995, Terry Myerson wrote:
> You are speaking to extremes. We have log files from over 100 organizations
> in our test suite. The data has been scrutinized, and indeed both accurate
> enough and extremely valuable.

I have no doubt the data's valuable, but the accuracy is what I'm
questioning. Could you elaborate on how the accuracy was tested?

> We are indeed talking about user sessions, and not users. My usage of the term
> users was indeed a marketing decision, I apologize. But user sessions are still
> a much better statistic to base business decisions upon than hits or unique
> hostnames.

Agreed. But PLEASE let's get the terminology right - when marketers talk
about numbers, they are *not* talking about one-time sessions. They want
to know if Bob came back 20 times in 20 days or only once, which just
looking at sessions can't tell you.

> >Could you elaborate on these DC's? What can you key off of except
> >hostnames from CLFF data?
>
> There are other DC's in there.

Well, let's walk through the CLFF, and tell me where the other
"distinguishing characteristics" are:

RFC931 identd information - only a couple sites will supply this, and it
introduces a huge latency on a server so most people turn it off

authenticated username - again, it's something the site has to enable at
their expense, which most don't do

date/time - you can perform heuristics on the date in conjunction with
hostnames by arguing that a gap in time represents the end of
one user and the beginning of a second - but that ignores the
situation where someone follows a link *out* and then comes back
much later, and the more proxies are used the less useful it would
be. Our current estimates are that 20% of our accesses are coming
from behind proxy servers, and that number has been going quite
steadily upward.
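
To make the date/time heuristic concrete, here is a minimal Python sketch
of the gap rule described above. The 30-minute cutoff, the name
count_sessions, and the example hostnames are all assumptions made for
illustration; nothing in this thread specifies them.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical cutoff: no specific gap length is given in the discussion.
SESSION_GAP = timedelta(minutes=30)


def count_sessions(hits, gap=SESSION_GAP):
    """Count gap-delimited sessions per hostname.

    hits is an iterable of (hostname, datetime) pairs. A new session is
    assumed whenever two consecutive hits from the same hostname are more
    than `gap` apart.
    """
    by_host = defaultdict(list)
    for host, when in hits:
        by_host[host].append(when)

    sessions = {}
    for host, times in by_host.items():
        times.sort()
        count = 1
        for prev, cur in zip(times, times[1:]):
            if cur - prev > gap:
                count += 1
        sessions[host] = count
    return sessions


if __name__ == "__main__":
    t0 = datetime(1995, 7, 17, 21, 0, 0)
    hits = [
        # Two different people behind one (made-up) proxy, interleaved.
        ("proxy.example.com", t0),
        ("proxy.example.com", t0 + timedelta(minutes=2)),
        ("proxy.example.com", t0 + timedelta(minutes=4)),
        ("proxy.example.com", t0 + timedelta(minutes=6)),
        # One person who follows a link out and comes back 45 minutes later.
        ("dialup7.example.com", t0),
        ("dialup7.example.com", t0 + timedelta(minutes=45)),
    ]
    print(count_sessions(hits))
    # {'proxy.example.com': 1, 'dialup7.example.com': 2}
```

The two people sharing a proxy hostname collapse into one session, while
the one person who left and came back is counted twice. Those are the two
failure cases described above, and no choice of cutoff fixes both.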

request - you can lay out paths in a web site if you have a
directed-graph model of how the pages connect. The analysis
program must be able to inspect this hierarchy, using a robot or by
being able to read the HTML files themselves. The more links on a
page, the less useful this is - and you also don't know when people
step *back*, which makes building Markov models very difficult.
In short, chaining paths is also a very weak link.

Response status - most of the time it's either a success (200) or a
"not-modified" (304). I don't see how you can determine from
this whether the request represents a new user or not - if the
object is first fetched from a given host and returns a 200,
and then 20 minutes later is fetched from the same host and gets a 304,
what does that mean? Either 1) it's the same person refreshing that
object, using a browser that implements caching, or 2) it's another user
behind a proxy server getting that object for the first time. If the
second response is a 200, then it's from 1) the same user, whose browser
doesn't implement caching, 2) the same user, whose browser implements
caching but the object wasn't in the cache for whatever reason, or 3) a
totally different user coming from behind a non-caching proxy.

File size - insignificant.
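
For reference, the fields walked through above can be pulled out of a
common log format line with a sketch like the following. The regular
expression, the name parse_clf, and the sample log line are illustrative
assumptions, not taken from any particular analysis tool.

```python
import re

# Simplified pattern for one common log format line:
#   host ident authuser [date] "request" status bytes
# Assumption: the request field contains no embedded double quotes.
CLF_LINE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_clf(line):
    """Return a dict of the seven fields, or None if the line doesn't match."""
    match = CLF_LINE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Made-up sample line; ident and authuser are "-" because most sites
    # enable neither identd lookups nor authentication.
    sample = (
        'dialup7.example.com - - [17/Jul/1995:21:22:58 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326'
    )
    for name, value in parse_clf(sample).items():
        print(f"{name:9} {value}")
```

With ident and authuser both "-", what is left to key off is the
hostname, the timestamp, the request, and the status, which is the point
of the walk-through above.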

So what's there?

I apologize if I'm making a ruckus on this issue - but it's something
we're heavily involved with as well, and I have to *constantly*
deal with clients and prospective clients who have been told
by overeager marketers from other companies what can and can't be done in
this and other technical arenas. I don't doubt that there is a large
market for a good analysis tool a la getstats or wwwstat or the other
free analysis tools out there. But I am very wary of unfulfillable
claims which have been getting way too much press. There are real
solutions to this coming down the pipe that will give marketers a better
idea of who's visiting them without having to guess or derive flaky
heuristics that work one day and not the next, while still strongly
protecting the user's right to privacy.

I've said too much on the subject... next!

Brian

--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--
brian@organic.com brian@hyperreal.com http://www.[hyperreal,organic].com/