Cincom

Web Log Stats Lesson 9
Popularity Contest


| Table of Contents | Lesson 8 | Lesson 10 |


Now that we've counted the number of unique hits, it's time that we examine what pages on our web site were the most popular. Therefore, the task at hand is now to count the number of times a particular page was accessed. Counting should be fairly simple; each time a particular page appears in the log file, it was hit. The challenge here is that the log file also lists graphic files that accompany a web page, such as JPG and GIF files. Those we do not want to count.

Having the counts of each web page will not be enough. Management wants to know which pages were the most popular, say the 10 most popular. So the next challenge will be to sort these counts.

This lesson introduces 2 new Collection classes, the Bag and the SortedCollection. It explains why we are using them and how they help solve the problem at hand.

All of the web pages on our web site have an ASP extension. So we'll need to read through the log file line by line and determine if the file listed is an ASP page and count it.

Smalltalk makes this counting very easy. One of the collection objects at Smalltalk's disposal is a Bag. Just like its physical counterpart, you can throw anything into it and it keeps track of what you put in it. Moreover, if you toss in a duplicate item, it also counts how many of that particular item are in it.
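As a quick illustration of that counting behavior, here is a minimal Workspace sketch. The page names are made up for the example and do not come from the log file; occurrencesOf: asks the Bag how many times a given item was added.

```smalltalk
| bag |
bag := Bag new.
"Toss in three items; one of them is a duplicate."
bag add: '/default.asp'; add: '/products.asp'; add: '/default.asp'.
Transcript cr; show: (bag occurrencesOf: '/default.asp') printString.    "shows 2"
Transcript cr; show: (bag occurrencesOf: '/products.asp') printString.   "shows 1"
Transcript cr; show: bag size printString.                               "shows 3"
```

Note that size counts every item tossed in, duplicates included.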

Once we have our counts, we will need some way to sort them. Fortunately, Smalltalk comes with another type of collection for exactly that purpose: the SortedCollection. Smalltalk also makes it easy to move the contents of one collection into another, which makes it simple to determine our most popular web pages.
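To get a feel for a SortedCollection before using it on the log file, you could try a small sketch like the one below. The numbers are arbitrary sample values chosen for the example.

```smalltalk
| sorted |
"asSortedCollection copies the elements of a collection into a SortedCollection."
sorted := #(5 3 9 1) asSortedCollection.
sorted add: 4.    "new elements are placed in their sorted position"
Transcript cr; show: sorted asArray printString.   "shows #(1 3 4 5 9)"
```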

So, starting from the top of the file (as if you had to do this manually), your "logic" would be this:

  • Take the first entry in the log file. Determine if it is an ASP page. If so, toss it in the bag
  • Continue reading the file until you are at the end. Close the file
  • Move the counts of the bag to a SortedCollection (along with the page name)
  • Display the top 10 items of the SortedCollection

Before you start, you may want to view the primer on Bags and SortedCollections. Although they will be used in the lesson, they will not be explained as thoroughly as they are in the primers.

Proceed to the Bags primer.

Proceed to the SortedCollections primer.

Take another look at the web server log file.


Figure 9-1. The Web server log file in the VisualWorks File Editor

Note that the web page is found at the very end of each line. Each web page is preceded by the forward slash (/) character, so this gives us a way of identifying where the page name starts. However, the forward slash (/) character is not unique - it is also used in the date field. If we first skip past the date field, then the next forward slash (/) character will denote the start of a web page. This will be our strategy.

1. First, open a new Workspace. Enter the text as follows, highlight all of it, <Operate-Click> and then select Do it.

|stream line |
stream := 'ws000101.log' asFilename readStream.
line := stream upTo: Character cr.
line := line copyFrom: 50 to: line size.
line := line copyFrom: (line indexOf: $/) to: line size.
stream close.
line inspect.

By viewing the Inspector window, you will be able to verify that the code above did indeed extract the full web page from the first line of the log file.


Figure 9-1. A successful extract of the web page

Some new methods were thrown into that routine, especially that monster 5th line. Let's look at these new lines of code and make sure you understand what each one is doing.

|stream line |
stream := 'ws000101.log' asFilename readStream.
line := stream upTo: Character cr.

There is nothing new here - you have seen this before. In short, we are opening a file as a Stream and reading in the first line. The variable line will actually end up being an instance of the String class.

line := line copyFrom: 50 to: line size.

To get past the first occurrences of the forward slash (/) character, we go about 50 characters into the line. This puts us well past the forward slash (/) characters found in the date field. The method copyFrom:to: allows us to take the entire string of characters (the whole line from the log file) and chop off the first 49 characters, leaving us with a string that starts at the 50th character and runs all the way to the end of the string (line).
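If you want to convince yourself how copyFrom:to: counts, a tiny experiment with a short sample string (not taken from the log file) may help. It shows that positions start at 1 and that both the start and the end position are included in the copy.

```smalltalk
| text |
text := 'abcdefghij'.
"Indexing starts at 1 and both end points are included."
Transcript cr; show: (text copyFrom: 4 to: text size).   "shows defghij"
Transcript cr; show: (text copyFrom: 1 to: 3).           "shows abc"
```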

line := line copyFrom: (line indexOf: $/) to: line size.

At first glance, this looks like a very complicated line of code. But in reality, it is no different than the line above it. It's the same copyFrom:to: method. But instead of using copyFrom: 50, we are using the position of the forward slash (/) character which precedes our web page. The indexOf: method, when passed a character as its parameter (in our case, the forward slash (/) character), returns the position of the first occurrence of that character in a string. Remember that a character literal in Smalltalk is written with a leading dollar sign ($), so the forward slash (/) character is written $/. So again, we chopped off the line, removing everything prior to the forward slash (/) character. The rest of the line (string) remained intact.
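The combination of indexOf: and copyFrom:to: can be tried on its own with a short string. The fragment below is a hypothetical sample, not a real log line.

```smalltalk
| text start |
text := '12:00, /default.asp'.   "hypothetical fragment, not a real log line"
start := text indexOf: $/.
Transcript cr; show: start printString.                      "shows 8"
Transcript cr; show: (text copyFrom: start to: text size).   "shows /default.asp"
"indexOf: answers 0 when the character is absent."
Transcript cr; show: (text indexOf: $#) printString.         "shows 0"
```

Because indexOf: answers 0 when nothing is found, a line without any forward slash would make the copyFrom:to: call fail. The lesson's log file always contains one, but more defensive code might check for 0 first.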

stream close.
line inspect.

There is nothing new here either - you have seen this before. In short, we are closing the file (by closing the stream) and then telling Smalltalk to send the line variable to the Inspector.

We still have some "cleaning up" to do (note in the inspector window there are some leftover commas and dashes). We will clean them up when we loop through the file and gather all ASP pages.

2. Modify the lines below accordingly, highlight all of it, <Operate-Click> and then select Do it.

|stream line bag xFound|
bag := Bag new.
stream := 'ws000101.log' asFilename readStream.
[ stream atEnd ] whileFalse: [
  line := stream upTo: Character cr.
  line := line copyFrom: 50 to: line size.
  line := line copyFrom: (line indexOf: $/) to: line size.
  line := line copyUpTo: $,.
  xFound := line findString: '.asp' startingAt: 1.
  xFound > 0
  ifTrue:[ bag add: line. ]. ].
stream close.
bag inspect.

By viewing the Inspector window, you will be able to verify that the code above did indeed extract all ASP pages from the log file (click self). Also, if you click contents, you will see the counters for each page.
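The loop uses two string methods that have not been discussed yet: copyUpTo: (which does the "cleaning up" by cutting the string off at the first comma) and findString:startingAt: (which tells us whether '.asp' appears in the page name). The sketch below tries both on made-up values; the sample entry is an assumption about what the tail of a log line might look like.

```smalltalk
| entry page |
entry := '/default.asp, -, 200'.   "hypothetical remainder of a log line"
"copyUpTo: answers everything before the first comma."
page := entry copyUpTo: $,.
Transcript cr; show: page.    "shows /default.asp"
"findString:startingAt: answers the index of the match, or 0 if there is none."
Transcript cr; show: (page findString: '.asp' startingAt: 1) printString.          "shows 9"
Transcript cr; show: ('/logo.gif' findString: '.asp' startingAt: 1) printString.   "shows 0"
```

That is why the loop only adds a line to the bag when xFound is greater than 0: graphic files such as GIF and JPG files produce 0 and are skipped.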


Figure 9-2. Our collection (bag) of ASP pages


Figure 9-3. The count of each ASP page in the bag collection

We now need to sort our counts. To do this, we will need to copy the contents of our bag into a SortedCollection. However, a Bag is not such a simple structure. Note in Figure 9-3 that when you selected contents, the Inspector says that what you really have is a Dictionary. When you think about it, a Bag is more than just a single collection of things (unlike a Set) - it is actually a double collection of things. We first have a web page (that's 1) and a count that's associated with it (that's 2). We can't just move a Bag into a SortedCollection without deciding what to do with both parts.

Fortunately, the authors of Smalltalk already ran into this problem and solved it for us. They created a class called Association, which pairs a key with a value. That's exactly what we have - a count associated with a web page. But when we iterate through a collection such as a Bag, how do we extract both the count and the web page (i.e. both parts)? Again, the authors knew this was an issue, so in addition to the simple do: method they provided a more sophisticated method called valuesAndCountsDo:, which passes both parts to the block.

Enough talk. Let's get to the code.

3. Modify the lines below accordingly, highlight all of them, <Operate-Click> and then select Do it.

|stream line bag xFound sort|
bag := Bag new.
stream := 'ws000101.log' asFilename readStream.
[ stream atEnd ] whileFalse: [
  line := stream upTo: Character cr.
  line := line copyFrom: 50 to: line size.
  line := line copyFrom: (line indexOf: $/) to: line size.
  line := line copyUpTo: $,.
  xFound := line findString: '.asp' startingAt: 1.
  xFound > 0
  ifTrue:[ bag add: line. ]. ].
stream close.
sort := SortedCollection sortBlock: [:a :b| a <= b].
bag valuesAndCountsDo: [ :each :count |
sort add: (Core.Association key: count value: each)].
sort do: [ :each | Transcript cr; show: each printString.].

Use the Transcript window to verify that the code above did indeed count the web pages and display them in sorted order (lowest to highest).


Figure 9-4. Our most popular web pages

Those last four lines do quite a bit. Let's look at these lines of code and make sure you understand what each one does.

sort := SortedCollection sortBlock: [:a :b | a <= b].

Here we state the sort order explicitly with the sortBlock: method of the SortedCollection class. The block [:a :b| a <= b] sorts in ascending order (lowest to highest), which is also the order a SortedCollection uses by default; changing <= to >= would sort from highest to lowest instead. Try not to read too much into the syntax of the block. Think of it like this: "Of any 2 items in the SortedCollection (a and b), a is placed before b when a is less than or equal to b". Other than the sort block, it is just a normal initialization of a SortedCollection.
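To see the effect of a sort block for yourself, the following sketch sorts the same three sample numbers with two different blocks. The variable names ascending and descending are just for this illustration.

```smalltalk
| ascending descending |
ascending := SortedCollection sortBlock: [:a :b | a <= b].
descending := SortedCollection sortBlock: [:a :b | a >= b].
ascending addAll: #(3 1 2).
descending addAll: #(3 1 2).
Transcript cr; show: ascending asArray printString.    "shows #(1 2 3)"
Transcript cr; show: descending asArray printString.   "shows #(3 2 1)"
```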

bag valuesAndCountsDo: [ :each :count |

This is just like a typical iterative do: block with one exception: the valuesAndCountsDo: method passes two elements instead of one. That's why the first part of the block declares two parameters (:each and :count). The :each parameter contains the name of a web page while the :count parameter contains the web page’s associated count.
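Here is a small stand-alone sketch of valuesAndCountsDo: that you could run in a Workspace. The page names are invented for the example.

```smalltalk
| bag |
bag := Bag new.
bag add: '/default.asp'; add: '/products.asp'; add: '/default.asp'.
"The block is evaluated once per distinct item, together with its count."
bag valuesAndCountsDo: [:each :count |
  Transcript cr; show: each; show: ' was added '; show: count printString; show: ' time(s)'].
```

The Transcript should show one line per distinct page. Do not rely on the order of those lines; a Bag does not keep its items in any particular order, which is exactly why we need the SortedCollection.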

sort add: (Core.Association key: count value: each)].

This line is the real workhorse of the routine. We first create an association with the key:value: method. This method requires two parameters. Since we want to sort on the page counts, we pass the count as the key: parameter, while the value: parameter will be the web page. Once we have our association, we simply add it to our SortedCollection (sort). This line also ends our valuesAndCountsDo: iteration block.

Note also that the Association is prefixed with Core. The syntax of Core.Association means "use the Association class in the Core namespace." If you were to omit the Core namespace, then VisualWorks would prompt you for the correct namespace since there are three Association classes delivered with the system.
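An Association can also be explored by itself. The sketch below builds one with a made-up count and page name and then asks it for each of its two parts.

```smalltalk
| pair |
pair := Core.Association key: 12 value: '/default.asp'.   "hypothetical count and page"
Transcript cr; show: pair key printString.    "shows 12"
Transcript cr; show: pair value.              "shows /default.asp"
"printString combines both parts, as seen in the Transcript output of the listing above."
Transcript cr; show: pair printString.
```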

sort do: [ :each | Transcript cr; show: each printString.].

Finally, we simply display the contents of the SortedCollection in the Transcript. A simple do: block will suffice.
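Management asked for the 10 most popular pages, so you may want to list only the pages with the highest counts. One possible approach, sketched below under a few assumptions, is to sort from highest to lowest by comparing the keys (the counts) and then to show at most the first 10 entries. The three associations are sample data standing in for the contents of the bag.

```smalltalk
| sort |
"Compare the keys (the counts) so the highest count comes first."
sort := SortedCollection sortBlock: [:a :b | a key >= b key].
sort add: (Core.Association key: 3 value: '/a.asp');
  add: (Core.Association key: 12 value: '/b.asp');
  add: (Core.Association key: 7 value: '/c.asp').
"Show at most the first 10 entries."
1 to: (10 min: sort size) do: [:i |
  Transcript cr; show: (sort at: i) printString].
```

With the sample data, the entry with the count of 12 is shown first and the entry with the count of 3 last. The expression 10 min: sort size keeps the loop from running past the end when there are fewer than 10 pages.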

We're basically done! It took 13 lines of code to determine the page counts from the log file and sort them in ascending order. If you know any friends or family members who are programmers, tell them to write this program in their favorite language and see how many lines of code it takes them.

Now that we have the guts of the program working, let's put the same polish on it as we did for our web hits example. First, let's prompt for a specific log file. That way, we could determine the number of "popular pages" for any log file on the system. Second, let's see if we can display our results in something other than the Transcript. How about writing them out to a file?

4. Modify the lines below accordingly, highlight all of it, <Operate-Click> and then select Do it.

|stream line bag xFound sort file|
bag := Bag new.
file := (Dialog request: 'Enter file' initialAnswer: 'ws000101.log') asFilename.
stream := file readStream.

[ stream atEnd ] whileFalse: [
  line := stream upTo: Character cr.
  line := line copyFrom: 50 to: line size.
  line := line copyFrom: (line indexOf: $/) to: line size.
  line := line copyUpTo: $,.
  xFound := line findString: '.asp' startingAt: 1.
  xFound > 0
  ifTrue:[ bag add: line. ]. ].
stream close.
sort := SortedCollection sortBlock: [:a :b| a <= b].
bag valuesAndCountsDo: [ :each :count |
sort add: (Core.Association key: count value: each)].
sort do: [ :each | Transcript cr; show: each printString.].

Note that we are now prompted for the name of the log file. This solves our first enhancement. Now let's try to output our results to an external file.
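Once the file name comes from the user, it may be worth checking it before opening the stream. The sketch below is one way to do that; it assumes that Dialog request:initialAnswer: answers an empty string when the dialog is canceled, and it uses the exists message of Filename to test whether the file is there. The Transcript messages are only placeholders.

```smalltalk
| name file |
name := Dialog request: 'Enter file' initialAnswer: 'ws000101.log'.
name isEmpty
  ifTrue: [Transcript cr; show: 'No file name entered.']
  ifFalse: [
    file := name asFilename.
    file exists
      ifTrue: [Transcript cr; show: 'Found ', name]
      ifFalse: [Transcript cr; show: 'No such file: ', name]].
```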

5. Modify the lines below accordingly, highlight all of them, <Operate-Click> and then select Do it.

|stream line bag xFound sort file out|
bag := Bag new.
file := (Dialog request: 'Enter file' initialAnswer: 'ws000101.log') asFilename.
stream := file readStream.
[ stream atEnd ] whileFalse: [
  line := stream upTo: Character cr.
  line := line copyFrom: 50 to: line size.
  line := line copyFrom: (line indexOf: $/) to: line size.
  line := line copyUpTo: $,.
  xFound := line findString: '.asp' startingAt: 1.
  xFound > 0
  ifTrue:[ bag add: line. ]. ].
stream close.
sort := SortedCollection sortBlock: [:a :b| a <= b].
bag valuesAndCountsDo: [ :each :count |
sort add: (Core.Association key: count value: each)].
out := 'websitestats.txt' asFilename writeStream.
sort do: [ :each | out cr; nextPutAll: each printString.].
out close.

Those last three lines do quite a bit. Let's look at them and make sure you understand what they do.

out := 'websitestats.txt' asFilename writeStream.

Note that the syntax to open/create a file is exactly the same as when you want to read a file. The only difference is that you use a writeStream method instead of a readStream method. If the file websitestats.txt exists, it will get clobbered (overwritten). If it doesn't exist, it will get created.

sort do: [ :each | out cr; nextPutAll: each printString.].

Here we iterate through the sort collection with the do: method. The block parameter each holds each element of the collection in turn.

The very first thing we do is write out a Carriage Return (cr) to the output file. Note the semicolon. This is a cascade: the message that follows the semicolon is sent to the same receiver as the previous message, in this case our out object. The nextPutAll: method for a Stream object means just that - take the parameter that's passed to it (each printString, with each being the given element of the sort collection) and "put all" of it into the stream. This process will repeat until we have reached the end of the collection.
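You can experiment with cascades and nextPutAll: without touching a file by writing to a stream over a String instead. This is only a sketch for experimentation; the variable name ws and the sample text are made up.

```smalltalk
| ws |
ws := WriteStream on: String new.
"All three messages go to ws, thanks to the semicolons."
ws nextPutAll: 'first'; cr; nextPutAll: 'second'.
ws contents inspect.
```

The Inspector should show a single string holding both words separated by a carriage return.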

out close.

This simply terminates the stream and, because the stream is attached to an external file, closes that file.
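If something goes wrong between opening and closing the stream, the file could be left open. A common way to guard against that is the ensure: message, sketched below. The file name ensure-demo.txt is a made-up name so that the example does not overwrite websitestats.txt.

```smalltalk
| out |
out := 'ensure-demo.txt' asFilename writeStream.
[out nextPutAll: 'a line of output'; cr]
  ensure: [out close].   "runs whether or not the first block fails"
```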

6. Why not take a look at the output file? Let's use the VisualWorks File Browser for that. Open the File Browser (from the main VisualWorks Launcher window) by selecting the menu option File >> File Browser or clicking the icon with the folders and eyeglasses on the toolbar. Enter website* in the Show Files: field and press 'Enter' or 'Return'. The file websitestats.txt should appear in the top right-hand pane. By clicking (selecting) this file, you can then see the contents of this file. See Figure 9-5.


Figure 9-5. The contents of our output file

Summary

In this lesson, you saw the power of using a Bag, an Association and a SortedCollection to store the page names and their counts in a sorted order. In the next lesson, would it be asking too much to tally the page counts from all log files in a given directory? This will provide statistics for the number of page counts for a given week, month or year.

You now should know how to:

Identify a Collection. In particular, a Bag

Identify a Collection. In particular, a SortedCollection

Identify an Association (a key paired with a value)

Count the number of items in a Bag

Move elements from one collection to another

Create an external file

