Now that we've counted the number of unique hits, it's
time to examine which pages on our web site were the most popular.
The task at hand is therefore to count the number of times each
page was accessed. Counting should be fairly simple: each time a particular page
appears in the log file, it was hit. The challenge here is that the log file
also lists the graphic files that accompany a web page, such as JPG and GIF
files. Those we do not want to count.
Having the counts for each web page will not be enough. Management wants to
know which pages were the most popular, say the 10 most popular. So the next
challenge will be to sort these counts.
This lesson introduces two new
Collection classes, the Bag and the SortedCollection. It explains why we are
using them and how they help solve the problem at hand.
All of the web pages on our web site have an ASP extension. So we'll need to
read through the log file line by line and determine if the file listed is an
ASP page and count it.
Smalltalk makes this counting very easy. One of the collection objects at
Smalltalk's disposal is a Bag. Just like its physical counterpart, you
can throw anything into it and it keeps track of what you put in it.
Moreover, if you toss in a duplicate item, it also counts how many of that
particular item are in it.
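As a quick illustration of that counting behavior, here is a minimal Workspace sketch. The page names are made up for the example and do not come from the log file.

```smalltalk
"A minimal sketch; the page names below are hypothetical."
| bag |
bag := Bag new.
bag add: '/default.asp'.
bag add: '/news.asp'.
bag add: '/default.asp'.    "a duplicate: the Bag keeps a count for it"
"How many times was this page tossed in?"
Transcript cr; show: (bag occurrencesOf: '/default.asp') printString.
"How many items are in the bag in total, duplicates included?"
Transcript cr; show: bag size printString.
```

Run with Do it, this should show 2 and then 3 in the Transcript: two occurrences of the duplicated page, and three items overall.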
Once we have our counts, we will need some way to sort them. Fortunately,
that's another type of collection that comes with Smalltalk. It is called a
SortedCollection. Because both are Collections, Smalltalk makes it easy to
move the contents of one into the other, which makes it simple to
determine our most popular web pages.
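To see the sorting behavior in isolation, the following minimal sketch adds a few made-up numbers to a SortedCollection in no particular order and then shows them.

```smalltalk
"A minimal sketch with made-up numbers."
| sorted |
sorted := SortedCollection new.
sorted add: 42; add: 7; add: 19.
"The elements come back in ascending order, whatever order they were added in."
Transcript cr; show: sorted asArray printString.
"The largest element sits at the end."
Transcript cr; show: sorted last printString.
```

The collection keeps itself in order as items are added, so there is no separate "sort" step to call.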
So, starting from the top of the file (as if you had to do this manually),
your "logic" would be this:
- Take the first entry in the log file. Determine if it is an ASP page. If so, toss it in the bag
- Continue reading the file until you are at the end. Close the file
- Move the counts of the bag to a SortedCollection (along with the page name)
- Display the top 10 items of the SortedCollection
Before you start, you may want to view the primer on Bags
and SortedCollections. Although they will be used in the lesson, they will
not be explained as thoroughly as they are in the primers.
Proceed to the Bags primer.
Proceed to the SortedCollections primer.
Take another look at the web server log file.

Figure 9-1. The Web server log file in the VisualWorks File Editor
Note that the web page is found at the very end of each line. Each web page is
preceded by the forward slash (/) character, which gives us a way of identifying
where the page name starts. However, the forward slash (/) character is not
unique: it is also used in the date field. If we skip past the date
field, the next forward slash (/) character will mark the start of the
web page. This will be our strategy.
1. First, open a new Workspace.
Enter the text as follows, highlight all of it, <Operate-Click>
and then select Do it.
|stream line |
stream := 'ws000101.log' asFilename readStream.
line := stream upTo: Character cr.
line := line copyFrom: 50 to: line size.
line := line copyFrom: (line indexOf: $/) to: line size.
stream close.
line inspect.
By viewing the Inspector window, you
will be able to verify that the code above did indeed extract the full web
page from the first line of the log file.

Figure 9-1. A successful extract of the web page
Some new methods were thrown into that routine,
especially that monster 5th line. Let's look at these new lines of code and
make sure you understand what each one is doing.
|stream line |
stream := 'ws000101.log' asFilename readStream.
line := stream upTo: Character cr.
There is nothing new here - you have seen this before. In short,
we are opening a file as a Stream and reading in the first line. The
variable line will actually end up being an instance of the String
class.
line := line copyFrom: 50 to: line size.
To get past the first occurrences of the forward slash (/)
character, we go about 50 characters into the line. This puts us well past
the forward slash (/) characters found in the date field. The method copyFrom:to:
allows us to take the entire string of characters (the whole line from the
log file) and chop off the first 49 characters of the string, leaving us with
a string starting at the 50th character and going all the way out to the end
of the string (line).
line := line copyFrom: (line indexOf: $/) to: line size.
At first glance, this looks like a very complicated line of code.
But in reality, it is no different than the line above it. It's the same copyFrom:to:
method. But instead of using copyFrom: 50, we are using the position
of the forward slash (/) character which precedes our web page. The indexOf:
method, when passed a character (in our case, the forward
slash (/) character), returns the position of the first occurrence of that
character in a string. Remember that a character literal is written with a
leading dollar sign ($), so the forward slash (/) character is written $/.
So again, we chopped off the line, removing everything prior to the forward slash (/)
character. The rest of the line (string) remained intact.
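The "skip ahead, then search for the next slash" idea can be tried on a short string first. The sketch below uses a made-up string, not a real log line, and a skip of 4 instead of 50.

```smalltalk
"A minimal sketch on a made-up string, not a real log line."
| s rest |
s := 'ab/cd/ef'.
"Position of the first slash in the whole string."
Transcript cr; show: (s indexOf: $/) printString.
"Skip past the first slash by starting at position 4."
rest := s copyFrom: 4 to: s size.
Transcript cr; show: rest.
"Keep everything from the next slash onward."
rest := rest copyFrom: (rest indexOf: $/) to: rest size.
Transcript cr; show: rest.
"indexOf: answers 0 when the character is not present."
Transcript cr; show: (s indexOf: $?) printString.
```

The Transcript should show 3, cd/ef, /ef and 0. The last result matters: if a line contains no slash, indexOf: answers 0, and copyFrom: 0 would then fail.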
stream close.
line inspect.
There is nothing new here either - you have seen this before. In
short, we are closing the file (by closing the stream) and then telling
Smalltalk to send the line variable to the Inspector.
We still have some "cleaning up" to do (note in
the inspector window there are some leftover commas and dashes). We will
clean them up when we loop through the file and gather all ASP pages.
2. Modify the lines below
accordingly, highlight all of it, <Operate-Click> and then
select Do it.
|stream line bag xFound|
bag := Bag new.
stream := 'ws000101.log' asFilename readStream.
[ stream atEnd ] whileFalse: [
line := stream upTo: Character cr.
line := line copyFrom: 50 to: line size.
line := line copyFrom: (line indexOf: $/) to: line size.
line := line copyUpTo: $,.
xFound := line findString: '.asp' startingAt: 1.
xFound > 0
ifTrue:[ bag add: line. ]. ].
stream close.
bag inspect.
By viewing the Inspector window, you
will be able to verify that the code above did indeed extract all ASP pages
from the log file (click self). Also, if you click contents, you
will see the counters for each page.

Figure 9-2. Our collection (bag) of ASP pages

Figure 9-3. The count of each ASP page in the bag collection
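The loop in step 2 assumes every line is at least 50 characters long and contains a forward slash after that point. If a log file has short or blank lines, copyFrom:to: would raise an error. The following is a sketch of a more defensive variant, not part of the lesson itself; the variable start is introduced here only for illustration.

```smalltalk
"A sketch only: same logic as step 2, plus guards for lines that do not fit the pattern."
| stream line bag start |
bag := Bag new.
stream := 'ws000101.log' asFilename readStream.
[[stream atEnd] whileFalse: [
    line := stream upTo: Character cr.
    "Skip lines too short to reach past the date field."
    line size >= 50 ifTrue: [
        line := line copyFrom: 50 to: line size.
        start := line indexOf: $/.
        "indexOf: answers 0 when there is no slash; skip those lines too."
        start > 0 ifTrue: [
            line := (line copyFrom: start to: line size) copyUpTo: $,.
            (line findString: '.asp' startingAt: 1) > 0
                ifTrue: [bag add: line]]]]]
    ensure: [stream close].
bag inspect.
```

Wrapping the loop in ensure: means the file is closed even if something inside the loop fails.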
We now need to sort our counts. To do this, we will need
to copy the contents of our bag into a SortedCollection. However, a Bag
is not such a simple structure. Note in Figure 9-3 that when you selected contents,
the Inspector says that what you really have is a Dictionary.
When you think about it, a Bag is more than just a single collection
of things (unlike a Set): it is actually a double collection of
things. We first have a web page (that's 1) and a count that's associated
with it (that's 2). We can't simply move a Bag into a collection of single
items, such as a Set, without losing the counts.
Fortunately, the authors of Smalltalk already ran into this problem and
solved it for us. They created a class called Association, which pairs a key
with a value. That's exactly what we have: a count associated
with a web page. But when we iterate through a collection such as a Bag,
how do we extract both the count and the web page (i.e. both parts)?
Again, the authors knew this was an issue, so in addition to the simple do:
method they created a more sophisticated method called valuesAndCountsDo:,
which passes both parts of each Bag entry to its block.
Enough talk. Let's get to the code.
3. Modify the lines below
accordingly, highlight all of them, <Operate-Click> and then
select Do it.
|stream line bag xFound sort|
bag := Bag new.
stream := 'ws000101.log' asFilename readStream.
[ stream atEnd ] whileFalse: [
line := stream upTo: Character cr.
line := line copyFrom: 50 to: line size.
line := line copyFrom: (line indexOf: $/) to: line size.
line := line copyUpTo: $,.
xFound := line findString: '.asp' startingAt: 1.
xFound > 0
ifTrue:[ bag add: line. ]. ].
stream close.
sort := SortedCollection sortBlock: [:a :b| a <= b].
bag valuesAndCountsDo: [ :each :count |
sort add: (Core.Association key: count value: each)].
sort do: [ :each | Transcript cr; show: each printString.].
Use the Transcript window to verify
that the code above did indeed count the web pages and display them in
sorted order (lowest to highest).

Figure 9-4. Our most popular web pages
Those last four lines do quite a bit. Let's look
at these lines of code and make sure you understand what each one does.
sort := SortedCollection sortBlock: [:a :b | a <= b].
Because we want to sort in ascending order (lowest to highest), we
state that ordering with the sortBlock: method of the SortedCollection
class. (This is also the ordering a SortedCollection uses when no sort block
is given.) Try not to read too much into the syntax of the block [:a :b| a
<= b]. Think of it like this: "Of any 2 items in the SortedCollection
(a and b), a may come before b only if a is less than or
equal to b". Other than the sort block, it is just a normal
initialization of a SortedCollection.
bag valuesAndCountsDo: [ :each :count |
This is just like a typical iterative do: block with just
one exception. The valuesAndCountsDo: method passes two elements
instead of one. That's why the first part of the block declares two
block parameters (:each and :count). The :each
parameter contains the name of a web page while the :count parameter
contains the web page’s associated count.
sort add: (Core.Association key: count value: each)].
This line is the real workhorse of the routine. We first create
an association with the key:value: method. This method requires two
parameters. Since we want to sort on the page counts, we make the count
the key: parameter while the value: parameter will be the
web page. Once we have our association, we simply add it to our SortedCollection
(sort). This line also is the terminating point of our valuesAndCountsDo:
iteration block.
Note also that the Association is prefixed with Core. The
syntax of Core.Association means "use the Association Class in
the Core namespace." If you were to omit the Core namespace, then
VisualWorks would have prompted you for the correct namespace since there are
three Association classes delivered with the system.
sort do: [ :each | Transcript cr; show: each printString.].
Finally, we simply display the contents of the SortedCollection
in the Transcript. A simple do: block will suffice.
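The lesson's stated goal was the 10 most popular pages, while the code above lists every page from lowest to highest. One possible way to get a "top 10" view, sketched here with made-up page names and counts, is to reverse the comparison in the sort block and show only the first ten entries.

```smalltalk
"A sketch only: the page names and counts are made up."
| bag sort |
bag := Bag new.
bag add: '/default.asp' withOccurrences: 5.
bag add: '/news.asp' withOccurrences: 2.
bag add: '/contact.asp' withOccurrences: 9.
"Reverse the comparison so the highest count comes first."
sort := SortedCollection sortBlock: [:a :b | a key >= b key].
bag valuesAndCountsDo: [:each :count |
    sort add: (Core.Association key: count value: each)].
"Show at most the first ten entries."
1 to: (10 min: sort size) do: [:i |
    Transcript cr; show: (sort at: i) printString].
```

With these sample counts the Transcript should list the page with 9 hits first, then 5, then 2. The 10 min: sort size guard keeps the loop from running past the end when there are fewer than ten pages.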
We're basically done! It took 13 lines of code to
count the page hits in the log file and sort them in
ascending order. If you know any friends or family members who are
programmers, tell them to write this program in their favorite language and
see how many lines of code it takes them.
Now that we have the guts of the program working, let's put the same polish
on it as we did for our web hits example. First, let's prompt for a specific
log file. That way, we could determine the number of "popular
pages" for any log file on the system. Second, let's see if we can
display our results in something other than the Transcript. How about
writing them out to a file?
4. Modify the lines below
accordingly, highlight all of it, <Operate-Click> and then
select Do it.
|stream line bag xFound sort file|
bag := Bag new.
file := (Dialog request: 'Enter file' initialAnswer: 'ws000101.log') asFilename.
stream := file readStream.
[ stream atEnd ] whileFalse: [
line := stream upTo: Character cr.
line := line copyFrom: 50 to: line size.
line := line copyFrom: (line indexOf: $/) to: line size.
line := line copyUpTo: $,.
xFound := line findString: '.asp' startingAt: 1.
xFound > 0
ifTrue:[ bag add: line. ]. ].
stream close.
sort := SortedCollection sortBlock: [:a :b| a <= b].
bag valuesAndCountsDo: [ :each :count |
sort add: (Core.Association key: count value: each)].
sort do: [ :each | Transcript cr; show: each printString.].
Note that we are now prompted for the name of the log file. This
solves our first enhancement. Now let's try to output our results to an
external file.
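The prompt in step 4 trusts whatever is typed in. As a hedged sketch (not part of the lesson's code), the answer could be checked before the file is opened. It assumes that cancelling the dialog answers an empty string, which is worth confirming in your VisualWorks version; the variable name is introduced only for this example.

```smalltalk
"A sketch only: check the answer from the dialog before opening the file."
| name file |
name := Dialog request: 'Enter file' initialAnswer: 'ws000101.log'.
name isEmpty
    ifTrue: [Transcript cr; show: 'No file name entered.']
    ifFalse: [
        file := name asFilename.
        "exists answers whether the file is actually on disk."
        file exists
            ifTrue: [Transcript cr; show: 'Reading ', file asString]
            ifFalse: [Dialog warn: 'File not found: ', name]].
```

In a full version, the reading loop from step 4 would go where the 'Reading ' message is shown.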
5. Modify the lines below accordingly, highlight all of them, <Operate-Click>
and then select Do it.
|stream line bag xFound sort file out|
bag := Bag new.
file := (Dialog request: 'Enter file' initialAnswer: 'ws000101.log') asFilename.
stream := file readStream.
[ stream atEnd ] whileFalse: [
line := stream upTo: Character cr.
line := line copyFrom: 50 to: line size.
line := line copyFrom: (line indexOf: $/) to: line size.
line := line copyUpTo: $,.
xFound := line findString: '.asp' startingAt: 1.
xFound > 0
ifTrue:[ bag add: line. ]. ].
stream close.
sort := SortedCollection sortBlock: [:a :b| a <= b].
bag valuesAndCountsDo: [ :each :count |
sort add: (Core.Association key: count value: each)].
out := 'websitestats.txt' asFilename writeStream.
sort do: [ :each | out cr; nextPutAll: each printString.].
out close.
Those last three lines do quite a bit. Let's look
at them and make sure you understand what they do.
out := 'websitestats.txt' asFilename writeStream.
Note that the syntax to open/create a file is exactly the same as
when you want to read a file. The only difference is that you use a writeStream
method instead of a readStream method. If the file websitestats.txt
exists, it will get clobbered (overwritten). If it doesn't exist, it will get
created.
sort do: [ :each | out cr; nextPutAll: each printString.].
Here we are iterating through the sort collection with the do:
method and the block parameter each will be used to hold the value
of each element of the collection.
The very first thing we do is write out a Carriage Return (cr) to the output
file. Note the semicolon. This means that the next message
is sent to the same receiver, our out object. The nextPutAll: method for
a Stream object means just that: take the parameter that's passed to
it (each printString with each being the given element of the
sort collection) and "put all" of it into the stream. This process
will repeat until we have reached the end of the collection.
out close.
This simply terminates the stream and, because the stream is an
external file, closes it.
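If the printString form of each Association is not the layout you want in the file, the key and value can be written separately. The sketch below uses made-up entries and a hypothetical file name (websitestats-sketch.txt) so that it does not overwrite the lesson's output file.

```smalltalk
"A sketch only: the entries and the output file name are made up."
| sort out |
sort := SortedCollection sortBlock: [:a :b | a <= b].
sort add: (Core.Association key: 5 value: '/default.asp').
sort add: (Core.Association key: 2 value: '/news.asp').
out := 'websitestats-sketch.txt' asFilename writeStream.
"Write one line per page: the count, a tab, then the page name."
[sort do: [:each |
    out print: each key; tab; nextPutAll: each value; cr]]
    ensure: [out close].
```

Here print: writes the count's printed form, nextPutAll: writes the page name as is, and ensure: closes the file even if writing fails part way through.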
6. Why not take a look at the output file? Let's use the
VisualWorks File Browser for that. Open the File Browser (from
the main VisualWorks Launcher window) by selecting the menu option File
>> File Browser or clicking the icon with the folders and eyeglasses
on the toolbar. Enter website* in the Show Files: field and
press 'Enter' or 'Return'. The file websitestats.txt should
appear in the top right-hand pane. By clicking (selecting) this file, you
can then see its contents. See Figure 9-5.

Figure 9-5. The contents of our output file
Summary
In this lesson, you saw the power of using a Bag, an Association and a
SortedCollection to store the page names and their counts in a sorted order.
In the next lesson, would it be asking too much to tally the page counts from
all log files in a given directory? This will provide statistics for the
number of page counts for a given week, month or year.
You now should know how to:
- Identify a Collection, in particular a Bag
- Identify a Collection, in particular a SortedCollection
- Identify an Association
- Count the number of items in a Bag
- Move elements from one collection to another
- Create an external file