|
Web Log Stats Lesson 5
|
|
|
To get an accurate count of "web hits", we must first know a little about the
structure of the web server log. More importantly, we must define exactly what
we mean by a "web hit". Once that is done, we will need somewhere to store the
visitors' IP addresses so we can count them.
|
||||
|
This lesson introduces a specific Collection class called a Set. It explains why
we are using a Set and how it helps solve the problem at hand.
|
||||
|
Suppose a person hits our web site at 9:00am, then again at 3:00pm and again at
9:00pm. Some people would say that's 3 hits whereas someone else would say it's
only 1 hit because it's the same person. So it depends on how you treat return
visitors. For the sake of this workshop, we will treat each "hit" by a return
visitor as NOT being a separate hit (kind of like stuffing a ballot box - it
skews the statistics). So our count will reflect how many different (unique)
people hit our site.
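As a small illustration of this definition, the sketch below records three visits from the same visitor in a Set. The IP address is the first one from our log; the variable name visitors is made up for this example. A Set keeps only one copy of equal items, so the count stays at 1.

```smalltalk
| visitors |
visitors := Set new.
"The same visitor at 9:00am, 3:00pm and 9:00pm"
visitors add: '209.67.247.201'.
visitors add: '209.67.247.201'.
visitors add: '209.67.247.201'.
"Answers 1, because a Set ignores duplicates"
visitors size.
```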
The next issue is the structure of the log file itself. If we look at the log file, we see that the first IP address (209.67.247.201) occurs seven times, because that person visited seven different pages on our site. The next visitor (IP address 63.21.189.239) occurs 29 times, and not all of those 29 entries represent web pages; some of them are GIF (graphic) files. So as far as counting goes, after reading in 36 lines we have really had only 2 different visitors.
Since this file is organized in time sequence, an IP address will have multiple entries. The number of entries depends on a lot of factors that do not concern us here. The main point is that when the IP address changes, a new or returning visitor has hit our site, and the challenge is to distinguish between the two. So, starting from the top of the file (as if you had to do this manually), our "logic" would be this: read a line, pick out its IP address, and count that address only if we have not seen it before.
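Written out by hand, that check might look like the sketch below. It is only an illustration: the names sample and seen are invented here, and the three addresses stand in for lines of the log. Each address is kept only if it has not been seen before.

```smalltalk
| sample seen |
"Stand-in for IP addresses read from the log, in time sequence"
sample := #('209.67.247.201' '63.21.189.239' '209.67.247.201').
seen := OrderedCollection new.
sample do: [:addr |
    "Keep the address only the first time it appears"
    (seen includes: addr) ifFalse: [seen add: addr]].
"Answers 2: two different visitors, one of them returning"
seen size.
```

A Set performs this "have we seen it already?" test itself every time an item is added, so the explicit includes: check is not needed when a Set is used.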
Whether you are an experienced programmer or not, it is safe to say that what we are about to do is not an easy task in most programming languages (and definitely has no place in a beginner's tutorial). In Smalltalk, however, it is insanely simple. This is a task that computer software does over and over again, so the authors of the Smalltalk language developed a "built-in" mechanism for these universally repeated types of tasks. That mechanism is called a Collection. We will use one type of collection called a Set.
It is highly recommended that you view the Sets primer before proceeding if you truly want to understand what the code is doing.
||||
|
Proceed to the sets primer.
|
||||
|
1.
Modify the text as follows, highlight all of it,
<Operate-Click> and then select Do it.
| myFile myStream myLine addrIP mySet |
mySet := Set new.
myFile := 'ws000101.log' asFilename.
myStream := myFile readStream.
[ myStream atEnd ] whileFalse: [
    myLine := myStream upTo: Character cr.
    addrIP := myLine copyUpTo: $,.
    mySet add: addrIP.].
myStream close.
mySet inspect.
Note that a new (Inspector) window will appear and
the caption of the window is a Set.
2.
In Figure 5-1, you will see the collection of IP addresses in the Set. You can see that it is indeed a collection of IP addresses. In Figure 5-2, the Basic tab is shown with the word tally selected. Note that the value of tally is 93, which is the number of unique items in the set: in our case, IP addresses, or "hits".
Figure 5-1. Our collection (set) of IP addresses
Figure 5-2. The size of (items in) the collection
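To see what the two parsing messages in step 1 do, the sketch below runs them on a string instead of a file. The sample text is invented for illustration: only the two IP addresses come from our log, and the fields after each comma are placeholders, since the exact layout of a log line is assumed here.

```smalltalk
| sampleStream line addr |
"Two made-up log lines separated by a carriage return"
sampleStream := ReadStream on:
    ('209.67.247.201, page1.html' , (String with: Character cr) , '63.21.189.239, logo.gif').
"upTo: answers everything before the next cr and skips past it"
line := sampleStream upTo: Character cr.
"copyUpTo: answers everything before the first comma"
addr := line copyUpTo: $,.
"addr is now '209.67.247.201'"
addr.
```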
||||
|
We're basically done!
It only took 9 lines of code to determine the number of unique web hits from
the log file. If you know any friends or family members who are programmers,
ask them to write this program in their favorite language and see how many
lines of code it takes them! No doubt a lot more.
Now that we have the guts of the program working, let's put a little polish on it. First, we will display the answer in something other than an Inspector window. Second, there are other log files out there. Wouldn't it be nice to be prompted for a specific log file? That way, we could determine the number of "hits" for any log file on the system.
||||
|
3.
Modify the text as follows, highlight all of it, <Operate-Click> and
then select Do it.
| myFile myStream myLine addrIP mySet |
mySet := Set new.
myFile := 'ws000101.log' asFilename.
myStream := myFile readStream.
[ myStream atEnd ] whileFalse: [
    myLine := myStream upTo: Character cr.
    addrIP := myLine copyUpTo: $,.
    mySet add: addrIP.].
myStream close.
Dialog warn: (mySet size printString).
Everything is the same except for the last line. Here, we are using our friendly neighborhood Dialog object (class) to display our answer. We send the size message to the Set object. This will return a number. The warn: message doesn't like to be passed numbers, just strings, so we send the printString message to the number returned to us by mySet size to convert it to a string. This makes the warn: message happy and we will see our dialog box.
4.
Modify the text as follows, highlight all of it, <Operate-Click> and then select Do it.
| myFile myStream myLine addrIP mySet |
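One way the prompt for a log file could be done is sketched below. It assumes the VisualWorks Dialog message request:initialAnswer:, which answers the text the user typed (and, as assumed here, an empty string if the user cancels). The variable fileName, the prompt wording and the 'Unique hits: ' label are all invented for this example.

```smalltalk
| fileName myStream myLine mySet |
"Ask for a log file name, offering the one used so far as the default"
fileName := Dialog request: 'Log file to count?' initialAnswer: 'ws000101.log'.
"Assumption: an empty answer means the user cancelled, so do nothing"
fileName isEmpty ifFalse: [
    mySet := Set new.
    myStream := fileName asFilename readStream.
    [ myStream atEnd ] whileFalse: [
        myLine := myStream upTo: Character cr.
        mySet add: (myLine copyUpTo: $,)].
    myStream close.
    "Join a label and the converted number into one string for warn:"
    Dialog warn: 'Unique hits: ' , mySet size printString].
```

The comma joins two strings, so the number still has to be converted with printString before it can be attached to the label.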
|
||||
|
Summary
In this lesson, you saw the power of using a Set to store the IP addresses. In the next workshop, now that we know how to count hits for one file, we will attempt to do that for all files in a given directory. This will provide statistics for number of hits for a given week, month or year.
You now should know how to:
|