Smalltalk

Hairy Computing Streams

March 22, 2006 21:27:19.083

The previous post talked about computing streams, and how you could make nicely composable streams parameterized by blocks. One big reason we wanted to do this is to use it for things like character encodings. For example, do Base64Encoding, or URL encoding putting invalid URL characters into UTF8-encoded %HH strings.

That's more complicated, because we need the output to have more or fewer characters than the input. So we added a more general method transforming:, which takes a two-argument block, with an output stream and the argument.

Suppose we wanted to turn a stream of hex digits into a stream of bytes. So '0A142B' becomes #(16r0A 16r14 16r2B). To do this, we can't just stream over the input. We need to know if we're at the start of a two-digit sequence, in which case we return nothing, but remember where we are, or if we're at the end, we need to build the byte and return it. This is relatively complex to do with normal streams, but can be expressed as

| previousByte| 
previousByte := nil. 
inputStream transforming: [:output :each | 
    | result|
    previousByte isNil 
        ifTrue: [previousByte := each] 
        ifFalse: [ 
            result:= (('0123456789ABCDEF' indexOf: previousByte) - 1) bitShift: 4.
            result:= result bitOr: ('1023456789ABCDEF' indexOf: each) - 1. 
            output nextPut: result. 
            previousByte := nil] ]. 
 

That's not trivial, but it's a lot simpler than some other ways of doing it.

Another example comes from Blaine Buxton. He had an example for collecting words from an input string and playing around with blocks to do it. This got me thinking. In that case we need not only to do something when we see a character, but something at the end. But it seems like a common pattern, grouping things from a collection together. So I defined a method accumulating:into: which takes an initial value and a block. The block can either return nil, indicating that we're done with this entry, or return any other value, indicating that that's the new thing being accumulated. It's kind of a twist on inject:into:. With that, the code for creating words becomes just

 ('one two three four' readStream 
    accumulating: String new 
    into: [:sum :each | each isAlphaNumeric 
        ifTrue: [sum , (String with: each)] 
        ifFalse: [nil]]) upToEnd. 
 

I'm not sure that as a general method this is worth having around all the time, but it was an interesting experiment, and since this is just written in terms of transforming:, you can always implement it directly. Blaine's example actually created word objects out of the result. That's a little bit messier, but the easiest thing to do is just wrap it in a collecting: operation that gets the Word objects.

One other wrinkle in this is to do with contentsSpecies. If we want to convert a string to a bytearray, using either collect: or a stream, we run into a problem that the system tries to automatically figure out the result collection type for us. So in this case, if we did an operation like upToEnd, we'd get an array, which might not be what we want. For character encodings, it's almost certainly not what we want. For other types of conversions, we might get an error, e.g. if streaming over a bytearray and creating characters. But now we have these conversion operations visible as objects - the streams that provide the answers. So it's very easy to implement a contentsSpecies: method that lets us tell it what kind of collection we want, and to get back a byte array as a result.

This is really just the way general streams of values work, if you go back to something like LISP/Scheme and Structure and Interpretation of Computer Programming. Smalltalk streams, though, had too much of a history of being tied to character processing, and didn't provide these kinds of transformations. There seem to be a lot of other applications too. One of the first things I did was look at URL-encoding, which is incredibly messy right now, and has nested streams, but with code at one layer that has to poke into the internals of layers inside it. It turned the main definition into about eight lines of code. It's a little bit slower, but not that much, and it's way more maintainable.

This is still very much an experiment. Right now, these only work for readStreams, for one thing. We still have to figure out the corresponding metaphors for writing. But so far I'm very happy with it.