Cincom

Web Log Stats Lesson 5
Starting a Collection


| Table of Contents | Lesson 4 | Lesson 6 |
To get an accurate count of "web hits", we must first know a little about the structure of the web server log. More importantly, we must define exactly what we mean by a "web hit". Once that is done, we will need somewhere to store our visitors' IP addresses so that we can count them.

This lesson introduces a specific Collection class called a Set. It explains why we are using a Set and how it helps solve the problem at hand.

Suppose a person hits our web site at 9:00am, then again at 3:00pm and again at 9:00pm. Some people would say that's 3 hits whereas someone else would say it's only 1 hit because it's the same person. So it depends on how you treat return visitors. For the sake of this workshop, we will treat each "hit" by a return visitor as NOT being a separate hit (kind of like stuffing a ballot box - it skews the statistics). So our count will reflect how many different (unique) people hit our site.

The next issue is with the structure of the log file itself. If we look at the log file, we will see that our first IP address (209.67.247.201) occurs seven times (because that person visited seven different pages on our site). But our next visitor (IP address 63.21.189.239) occurs 29 times, and not all of those 29 entries represent web pages. Some of them are GIF (graphic) files. So as far as counting goes, after reading in 36 lines, we really only had 2 different visitors. Since this file is organized in time sequence, each IP address will usually have multiple consecutive entries. The number of entries varies depending on a lot of factors that currently don't concern us. The main issue is that when the IP address changes, a new (or returning) visitor has hit our site.

This is a challenge. Since an IP address change indicates that a new (or returning) visitor has hit our site, we have to distinguish between the two. So, starting from the top of the file (as if you had to do this manually), our "logic" would be this:
  • Take the first IP address that comes in and store it somewhere. Increment Counter for that IP address.
  • Continue reading the file until this IP address changes.
  • Store the new (changed) IP address somewhere. Increment Counter for that IP address.
  • Continue reading the file until that IP address changes.
  • When the next IP address comes in, compare it to the IP addresses that you have stored. If it is not in the list, add it to the list and increment Counter for that IP address. If it is in the list, continue reading the file until this IP address changes.
  • Repeat the above steps until you reach the end of file.
We have 2 challenges. First, where are we going to store the IP addresses? Second, how are we going to determine if a new IP address is already in our list?
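The manual steps above can be sketched in a few lines of workspace code. This is only an illustration of the "compare against the list" logic, not the approach the lesson ends up using: the addresses variable is hypothetical sample data standing in for the first field of each log line, and an OrderedCollection plays the role of the list.

```smalltalk
| addresses seen |
"Hypothetical sample data: the IP address of five consecutive log entries."
addresses := #('209.67.247.201' '209.67.247.201' '63.21.189.239' '63.21.189.239' '209.67.247.201').
seen := OrderedCollection new.
addresses do: [:each |
    "Store an address only if it is not already in the list."
    (seen includes: each) ifFalse: [seen add: each]].
"Two different addresses were stored, so this shows 2."
Transcript show: seen size printString; cr.
```

Every new address has to be checked against the whole list by hand here, which is exactly the bookkeeping we would like the language to do for us.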

Whether you are an experienced programmer or not, it is safe to say that what we are about to do is not an easy task in most programming languages (and definitely has no place in a beginner's tutorial). However, in Smalltalk, it is insanely simple. Again, this is a task that computer software does over and over again, so the authors of the Smalltalk language developed a "built-in" mechanism for doing these universally repeated types of tasks.

The "built-in" mechanism mentioned above is called a Collection. We will use one type of collection called a Set. It is highly recommended that you view the Sets primer before proceeding if you truly want an understanding of what the code is doing.

Proceed to the sets primer.
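As a quick taste of what the primer covers, here is a minimal sketch of the one property of a Set that matters for this lesson: adding an item that is already present changes nothing. The variable name visitors is an assumption made for this example; the addresses are the two from the log file discussed above.

```smalltalk
| visitors |
visitors := Set new.
visitors add: '209.67.247.201'.   "9:00am visit"
visitors add: '209.67.247.201'.   "3:00pm visit, same address, so the Set is unchanged"
visitors add: '63.21.189.239'.    "a different visitor"
"Three adds, but only two unique addresses, so this shows 2."
Transcript show: visitors size printString; cr.
"Answers true: the Set can tell us whether it already holds an address."
visitors includes: '63.21.189.239'
```

This answers both challenges at once: the Set is where the addresses are stored, and it checks for duplicates itself.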

1. Modify the text as follows, highlight all of it, <Operate-Click> and then select Do it.

| myFile myStream myLine addrIP mySet |
mySet := Set new.
myFile := 'ws000101.log' asFilename.
myStream := myFile readStream.
[myStream atEnd] whileFalse: [
    myLine := myStream upTo: Character cr.
    addrIP := myLine copyUpTo: $,.
    mySet add: addrIP].
myStream close.
mySet inspect.

Note that a new (Inspector) window will appear and the caption of the window is a Set.
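To see what the two lines inside the loop do without touching the real log file, the sketch below runs the same messages over an in-memory string. The sample text is hypothetical (only the two leading IP addresses come from this lesson), and it assumes, as the code above implies, that the IP address is the first comma-separated field of each entry.

```smalltalk
| sample stream line ip |
"Hypothetical two-entry log; the text after each comma is a placeholder."
sample := '209.67.247.201,first entry' , (String with: Character cr) , '63.21.189.239,second entry'.
stream := ReadStream on: sample.
[stream atEnd] whileFalse: [
    "upTo: answers everything before the next carriage return and skips past it."
    line := stream upTo: Character cr.
    "copyUpTo: answers a copy of everything before the first comma."
    ip := line copyUpTo: $,.
    Transcript show: ip; cr].
```

Each pass through the loop writes one IP address to the Transcript; in the lesson code, that address is added to the Set instead.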

2. Figure 5-1 shows the collection of IP addresses in the Set. In Figure 5-2, the Basic tab is shown with the word tally selected. Note that the value of tally is 93, which is the number of unique items in the set: in our case, IP addresses, or "hits".


Figure 5-1. Our collection (set) of IP addresses


Figure 5-2. The size of (items in) the collection
We're basically done! It took only 9 lines of code (not counting the variable declaration) to determine the number of unique web hits from the log file. If you know any friends or family members who are programmers, tell them to write this program in their favorite language and see how many lines of code it takes them! No doubt a lot more.

Now that we have the guts of the program working, let's put a little polish on it. First, we will display the answer in something other than an Inspector window. Second, there are other log files out there. Wouldn't it be nice to be prompted for a specific log file? That way, we could determine the number of "hits" for any log file on the system.

3. Modify the text as follows, highlight all of it, <Operate-Click> and then select Do it.

| myFile myStream myLine addrIP mySet |
mySet := Set new.
myFile := 'ws000101.log' asFilename.
myStream := myFile readStream.
[myStream atEnd] whileFalse: [
    myLine := myStream upTo: Character cr.
    addrIP := myLine copyUpTo: $,.
    mySet add: addrIP].
myStream close.
Dialog warn: (mySet size printString).

Everything is the same except for the last line. Here, we are using our friendly neighborhood Dialog object (class) to display our answer. We send the size message to the Set object. This will return a number. The warn: message doesn't like to be passed numbers, just strings, so we send the printString message to the number returned to us by mySet size to convert it to a string. This makes the warn: message happy and we will see our dialog box.
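Because printString answers an ordinary string, it can be joined to other text with the comma message before being handed to warn:. The following is a small sketch of that idea, not part of the lesson's program; the label text and the two sample addresses are assumptions chosen for illustration.

```smalltalk
| mySet |
mySet := Set new.
mySet add: '209.67.247.201'; add: '63.21.189.239'.
"Unary messages (size, printString) are evaluated first, then the comma,
 and the keyword message warn: last, so no parentheses are needed."
Dialog warn: 'Unique visitors: ' , mySet size printString.
```

The dialog box then shows a labeled answer rather than a bare number.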

4. Modify the text as follows, highlight all of it, <Operate-Click> and then select Do it.

| myFile myStream myLine addrIP mySet |
mySet := Set new.
myFile := (Dialog request: 'Enter file' initialAnswer: 'ws000101.log') asFilename.
myStream := myFile readStream.
[myStream atEnd] whileFalse: [
    myLine := myStream upTo: Character cr.
    addrIP := myLine copyUpTo: $,.
    mySet add: addrIP].
myStream close.
Dialog warn: (mySet size printString).

We have learned a new message for the Dialog object (request:). Technically, this isn't the real message (although there really is a message called request:). The expression contains two keywords, each ending in a colon, and each keyword takes one parameter, so this message requires two parameters. The actual message we are sending the Dialog object is request:initialAnswer: (simply concatenate the keywords, colons included).
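Since request:initialAnswer: answers whatever the user typed, it can be worth checking the answer before opening the file. The sketch below is one hedged way to do that and is not part of the lesson's program. It assumes that cancelling the dialog answers an empty string and that a Filename responds to exists, as is the case in VisualWorks; the variable name answer and the message texts are made up for this example.

```smalltalk
| answer myFile |
"One message with two keywords and two parameters."
answer := Dialog request: 'Enter file' initialAnswer: 'ws000101.log'.
answer isEmpty
    ifTrue: [Dialog warn: 'No file name entered.']
    ifFalse: [
        myFile := answer asFilename.
        myFile exists
            ifTrue: [Dialog warn: 'Found ' , answer]
            ifFalse: [Dialog warn: 'Cannot find ' , answer]].
```

In the full program, the counting loop from step 4 would go in place of the 'Found' dialog.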

Figure 5-3. The prompt for our file name

Figure 5-4. The number of hits for the log file

Summary

In this lesson, you saw the power of using a Set to store the IP addresses. Now that we know how to count hits for one file, the next lesson will attempt to do the same for all files in a given directory. This will provide statistics for the number of hits in a given week, month or year.

You now should know how to:
  • Identify a Collection, in particular a Set
  • Add items to a Collection
  • Count the number of items in a Collection
