Cincom

Web Log Stats Lesson 4
Reading External Files and Parsing Strings


| Table of Contents | Lesson 3 | Lesson 5 |


The previous workshop verified that you could successfully locate a web server log file on your machine and that you were able to view it in the VisualWorks File Editor. This brought the file in "all at once", but in order to count the "hits", we need to be able to read in the file line by line.

In this lesson, you will learn how to perform loops, a process where you repeat something until you determine it's time to quit. Specifically, the steps we will repeat are reading in the file line by line and telling it to quit when we have reached the end of the file. Along the way, we will also extract the IP address from each line.

1. In your Workspace, modify the text as follows, highlight all of it, <Operate-Click> and then select Do it.

| myFile myStream |
myFile := 'ws000101.log' asFilename.
myStream := myFile readStream.
myStream inspect.
myStream close.

A new Inspector window will appear. The caption on the window reads "an ExternalReadStream", and self plus eleven other properties appear in the left-hand pane. The important part is that VisualWorks treats sequential files as a stream of characters. See Figure 4-1.


Figure 4-1. The log file as a Stream in the VisualWorks Inspector Window

Just a reminder. You will have to supply the "fully qualified path" for the location of this file if you did not place it in the VisualWorks "default directory". Remember, from this point forward, when you are asked to reference this file or any other file, the example code will simply use the "non-qualified" file name. It will be up to you to make sure you are correctly referencing this file. Remember, the benefit of placing this file (and all other log files) in the VisualWorks "default directory" is that you will be able to copy and paste the code snippets from this lesson into your Workspace.
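As a rough sketch of what a "fully qualified path" might look like, the snippet below uses a made-up Windows directory (C:\weblogs is purely hypothetical; substitute wherever your file really lives). The exists message is assumed to be available on Filename objects in your image and is used here only as a quick sanity check.

```smalltalk
| myFile |
"Hypothetical location: replace C:\weblogs with your own directory."
myFile := 'C:\weblogs\ws000101.log' asFilename.
"Assumed check: answers true if the file can be found at that path."
myFile exists.
```

Highlight the code and select Print it rather than Do it to see the answer of the last expression.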

2. In the left pane of the Inspector window, click (highlight) the word lineEndCharacter.

Note that in the right pane you will see the value of lineEndCharacter which is Core.Character cr. For Unix/Linux, it will display as Core.Character lf. Keep this in mind as we will use this information later on in this lesson.

3. Close the Inspector window.

A lot is going on here so let's dissect this step by step.

| myFile myStream |

On the first line, we are declaring not one but 2 temporary variables. This is done by placing the vertical bar first, listing the variable names separated with a space and then finally ending the list with the vertical bar character.

myStream := myFile readStream.

This is where things get interesting. On assignment statements (those that contain the := symbol), Smalltalk evaluates the right side first and then assigns the result to what's on the left side. So the myFile object is being sent the message readStream. This means that we wish to treat the log file as a "stream" of characters (after all, that's all a file really is to a computer) and assign that stream to the temporary variable myStream.
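A small sketch of the "right side first" rule, using a made-up variable named total (not part of the workshop code). Each assignment evaluates the expression to the right of := and only then stores the result.

```smalltalk
| total |
"The right side (3 + 4) is evaluated first, then 7 is stored in total."
total := 3 + 4.
"The right side may use the variable's current value: 7 * 2."
total := total * 2.
total    "14"
```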

myStream inspect.

We tell Smalltalk to inspect the stream. This is just like the <Operate> menu option Inspect It, but it shows how you can invoke the Inspector programmatically (Cool !!).

myStream close.

And while you were looking at the Inspector window, Smalltalk went ahead and executed the last line which physically closed the file (always a good idea to close the file when you are done with it).
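One common way to make sure the file is closed even if something goes wrong in between is to wrap the work in a block and send it ensure:. This is only a sketch of that idea, not a change to the workshop steps; the rest of the lesson keeps the simpler form.

```smalltalk
| myFile myStream |
myFile := 'ws000101.log' asFilename.
myStream := myFile readStream.
"The second block runs whether or not the first block finishes normally."
[ myStream inspect ]
    ensure: [ myStream close ].
```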

The advantage of using a stream is that by determining the lineEndCharacter, we can read through a file line by line. So let's try to read in just the first line of the file.

4. Modify the text accordingly, highlight all of it, <Operate-Click> and select Do it.
For UNIX/Linux users, you will have to change the cr to lf.

| myFile myStream myLine |
myFile := 'ws000101.log' asFilename.
myStream := myFile readStream.
myLine := myStream upTo: Character cr.
myLine inspect.

myStream close.


Note that a new (Inspector) window will appear with the caption of the window being a MSCP1252String (MSCP1252 is a Windows character set designation). Displayed in the Inspector window is the first line of our file. See Figure 4-2.


Figure 4-2. The first line of our log file

If the above code does not produce a "single line" (i.e. it returns the entire contents of the file on one line), then the operating system you are using might need a little more help in determining how the "end of line" character is interpreted.

Since we know that this file has "carriage return-line feeds" at the end of each line, then we can tell VisualWorks to expect them when we open the file. The way to do that would be as follows:

myStream := myFile readStream lineEndCRLF.


We are getting a little ahead of ourselves as far as syntax goes (note that this expression sends two messages in a row, which you have not seen yet), but this will be explained shortly when you see the link to the syntax primer. At this point, it suffices to say that the above line is two lines of code combined into one. Smalltalk expressions of this kind are evaluated left-to-right, so myFile readStream is evaluated first (returning a Stream object) and that object is then sent the message lineEndCRLF. The result is that when VisualWorks encounters "carriage return-line feed" characters, it knows that it is at the end of a single line within the file.
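To see the same left-to-right chaining on something smaller than a file, here is a sketch using numbers and strings (these expressions are illustrations only and are not part of the log file exercise). Evaluate each line with Print it.

```smalltalk
"3 factorial is evaluated first (6); that result is then sent printString."
3 factorial printString.       "'6'"

"The two-step form below gives the same answer as the one-line form above."
| step |
step := 3 factorial.
step printString.              "'6'"
```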

5. In the Inspector window, click the Basic tab.

In the left-hand pane is self followed by a sequence of numbers. The numbers run from 1 through 122. This indicates that the first line of our file contains 122 characters (one number for each character).


Figure 4-3. The first line of the log file contains 122 characters

We need to explain the 4th line from the code above. It does a lot of work.

myLine := myStream upTo: Character cr.

On the third line, when we sent the readStream message to the myFile object, Smalltalk gave us a Stream object that reads from the file named by the Filename object. The entire contents of the file are now available in the form of a stream of characters. Think of this stream of characters as stored in the temporary variable myStream.

Now on to the fourth line, where we send an upTo: message to the Stream object myStream. Note there is a colon in this message, meaning that what follows the colon is a parameter. In plain English, this line could be translated as "take all the characters of the myStream Stream up to the Character cr and assign that to myLine". It's kind of like a copy statement, except that the stream also moves past the characters it has handed over, so the next upTo: starts where this one stopped. The upTo: message just wants to know "where do I stop?". The character cr stands for Carriage Return (hence the cr), and in line-by-line files such as this one, a Carriage Return marks the end of one line and the beginning of the next (on Unix/Linux, the line feed, lf, plays this role).
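The behavior of upTo: can be tried without a file at all. The sketch below builds a small in-memory stream from a made-up two-line string (demoStream, firstLine and secondLine are illustrative names only) and reads it one line at a time.

```smalltalk
| demoStream firstLine secondLine |
"A two-line string: 'alpha', a carriage return, then 'beta'."
demoStream := ReadStream on: 'alpha', (String with: Character cr), 'beta'.
"Answers the characters before the cr and moves past the cr itself."
firstLine := demoStream upTo: Character cr.     "'alpha'"
"No further cr is found, so the rest of the stream is answered."
secondLine := demoStream upTo: Character cr.    "'beta'"
demoStream atEnd.                               "true"
```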

Note that the 4th line is a bit more complicated than 4 squared or 3 + 4. Not only is this a statement that contains a parameter, but it involves an object (myStream), a class (Character) and 2 messages (upTo: and cr). To understand how Smalltalk interprets this line (i.e. which message gets evaluated first), refer to the syntax primer, which also contains information about a widely-accepted naming convention used in Smalltalk.
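As a brief preview of what the syntax primer covers, the sketch below shows the general rule: messages without parameters (such as cr or factorial) are evaluated first, then operators such as + and *, and finally messages with a colon (such as upTo:). The name demoStream and the sample strings are made up for illustration.

```smalltalk
| demoStream |
demoStream := ReadStream on: 'one', (String with: Character cr), 'two'.

"Operators are evaluated strictly left to right: (3 + 4) * 2."
3 + 4 * 2.                          "14"

"factorial has no parameter, so it goes first: 3 + (2 factorial)."
3 + 2 factorial.                    "5"

"Character cr is evaluated first; its result becomes the parameter of upTo:."
"The parentheses only make that order visible; they can be left out."
demoStream upTo: (Character cr).    "'one'"
```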

We have read the first line of the file. We now need to extract the IP address from this line. Because this is a "comma delimited" file, all "fields" in this file are separated by a comma. Because the IP address is always the first field on each line, we simply need to read in all the characters of a line "up to" the first comma and that will be an IP address. Let's try it.

Proceed to the syntax primer.

6. Modify the code accordingly, highlight all of it, <Operate-Click> and then select Do it.

| myFile myStream myLine addrIP |
myFile := 'ws000101.log' asFilename.
myStream := myFile readStream.
myLine := myStream upTo: Character cr.
addrIP := myLine copyUpTo: $,.
addrIP inspect.
myStream close.

Note that a new (Inspector) window will appear and the caption of the window is a MSCP1252String. Here you will see the value of the MSCP1252String, which is an IP address (specifically 209.67.247.201). Clicking the Basic tab, you will see the numbers 1-14, indicating that the IP address contains 14 characters (one number for each character).


Figure 4-4. A successful extract of an IP address

7. Close the Inspector window.

We need to explain the 5th line from the code above.

addrIP := myLine copyUpTo: $,.

Now that the temporary variable myLine contains the first line of characters, we need to take all the characters up to the comma and store that in the temporary variable addrIP. So we gave the String object myLine (remember, the Inspector showed it as a MSCP1252String, which is a kind of String) the message copyUpTo: and passed it the parameter $,. A dollar sign in front of a character is how Smalltalk writes a single character, so $, means "the comma character". For Smalltalk, a bare comma has a special meaning (actually, it’s a message - surprise, surprise!), so by using the dollar sign we are asking to just look for the comma character rather than treating it as a message.
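The difference between the comma character and the comma message can be seen in a short sketch. The sample line below is made up (only the IP address comes from this lesson; the fields after it are hypothetical), and sampleLine is an illustrative name.

```smalltalk
| sampleLine |
"Hypothetical log line: only the first field matters here."
sampleLine := '209.67.247.201,field2,field3'.

"$, is the comma character: copy everything before the first comma."
sampleLine copyUpTo: $,.        "'209.67.247.201'"

"A bare comma is a message that joins two strings together."
'209.67', '.247.201'.           "'209.67.247.201'"
```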

Note that we have used 2 different messages that basically did the same thing. The upTo: message copied stuff "up to" some parameter. The copyUpTo: message copied stuff "up to" some parameter. The reason why we used 2 different messages is because we had 2 different objects. The upTo: message is used on Stream objects whereas the copyUpTo: message is used on String objects.

At this point, you might be saying to yourself, "Wouldn't it be nice if there was just one message that 'copied stuff up to some parameter' regardless of which class I was using?". The answer is yes, and this concept is referred to as polymorphism, a fundamental tenet of object-oriented principles. For now, you just have to remember to use the upTo: message on Stream objects and the copyUpTo: message on String objects.
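For a taste of what polymorphism looks like in practice, here is a sketch using the size message, which objects of different classes understand and answer in their own way. These expressions are illustrations only and are not part of the log file exercise.

```smalltalk
"A String answers its number of characters."
'209.67.247.201' size.      "14"

"An Array answers its number of elements."
#(10 20 30) size.           "3"
```

The sender does not need to know which class it is talking to; the same message name works for both.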

Proceed to the polymorphism primer.

OK. We have an IP address. But each line contains an IP address. We need a way to loop through the file line by line and collect the IP address as we go. Would you believe that just one additional line of code will do that?

8. Modify the code accordingly, highlight all of it, <Operate-Click> and then select Do it.

| myFile myStream myLine addrIP |
myFile := 'ws000101.log' asFilename.
myStream := myFile readStream.
[ myStream atEnd ] whileFalse: [
    myLine := myStream upTo: Character cr.
    addrIP := myLine copyUpTo: $,.
    Transcript show: addrIP ].
myStream close.

Note that a whole bunch of text (numbers) has appeared in the white area under the main VisualWorks window. As you recall from a previous lesson, this is called the System Transcript and it is one of the 4 basic ways of displaying output in Smalltalk (Lesson 2).


Figure 4-5. All IP addresses displayed in the Transcript
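If the addresses appear run together in your Transcript, one possible variation is to follow each address with a line break. The sketch below does this with a cascade (the semicolon), which sends a second message, cr, to the same Transcript; everything else is the code from step 8.

```smalltalk
| myFile myStream myLine addrIP |
myFile := 'ws000101.log' asFilename.
myStream := myFile readStream.
[ myStream atEnd ] whileFalse: [
    myLine := myStream upTo: Character cr.
    addrIP := myLine copyUpTo: $,.
    "show: displays the address; the cascaded cr starts a new Transcript line."
    Transcript show: addrIP; cr ].
myStream close.
```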

You guessed it! Another line of code that does quite a bit.

This is basically the same code as before. Since we already had code to read in a line and extract the IP address, we just needed a way to repeat that process from the beginning of the file to the end of the file. And aside from the Transcript statement that displays our IP address, all it took to perform this loop was just 1 line of code !!!

[ myStream atEnd ] whileFalse: [

Since just one line of code did so much for us, it will take quite a bit of explanation. So as not to distract from the flow of this workshop, you may choose to view the primer on loops.
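Before reading the primer, it may help to watch the loop run on something tiny. This sketch replaces the log file with a made-up three-line in-memory stream and counts how many times the loop body runs (demoStream and lineCount are illustrative names, not part of the workshop code).

```smalltalk
| demoStream lineCount |
"Three short lines separated by carriage returns."
demoStream := ReadStream on:
    'a,1', (String with: Character cr),
    'b,2', (String with: Character cr),
    'c,3'.
lineCount := 0.
"The first block is re-evaluated before every pass; the loop stops once it answers true."
[ demoStream atEnd ] whileFalse: [
    demoStream upTo: Character cr.
    lineCount := lineCount + 1 ].
lineCount    "3"
```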

Proceed to the loops primer.

Summary

Most of the explanation of the topics and concepts regarding Smalltalk for this exercise already took place. However, now would be a good time to recall how we are approaching this exercise. Only one more step remains before we finish the exercise - that of counting the number of hits this web site had for a particular day. We will do that in the next lesson, but let's recap how we got to this point.

We first had to know how to open a file. Then we had to know how to read it in line by line. We then had to know how to extract data from that line. Instead of doing this (writing a routine to display all IP addresses in a web server log file to the System Transcript) all at once, we did it in a "test this snippet of code - it worked - move on" fashion. In many other programming languages, you could not work this way as easily. It would involve writing all the code in some editor, compiling it and then running it. The first time you run that code, you really have no idea if it will work. Chances are it will not run successfully, so you go back and (try to) fix the problem (in your editor), compile it and run it again. You repeat this process over and over again until eventually you get the program to work. It's an "all or nothing" approach.

With Smalltalk, it is incremental. You test and play with little chunks of code, get those working and then piece them together as a single unit. This is what makes code development in Smalltalk so much more productive. Once you get those little chunks of code working, you rarely have to go back and test them again on their own.

In the next workshop, now that we know how to extract all those IP addresses from the file, we will be able to collect, sort and count them.

You now should know how to:

Read in a file line by line

Extract data from a string of characters

Perform a loop

Write code block statements

Override the special meaning of certain characters (such as the comma)

Use the Stream class for file access

Recognize Boolean expressions

 

