The "camel case" puzzle Travis posted interested me so much I figured I'd make a separate post about it, with some "value added" in the form of a few less practical but more interesting solutions. Warning: this is going to be a long one.
The "camel case" problem is finding "the tricky/elegant/terse/cool way" of splitting string made of words stuck together with word breaks indicated by case changes, like typical Smalltalk variable and method names. A complicating requirement is to do the right thing for acronyms or all-cap words, so that 'TheNextBIGThing' would be split as #('The' 'Next' 'BIG' 'Thing').
There is a convenient SequenceableCollection method in VisualWorks, #piecesCutWhere:, which evaluates the argument for all the pairs of adjacent characters in the string (1st and 2nd, 2nd and 3rd, etc) and returns a collection of substrings, broken between those pairs for which the block returns true. As an illustration, here is the simple (no acronym support) camel case splitter, as a method of CharacterArray:
splitSimpleCamelCase
^self piecesCutWhere: [:each :next | each isLowercase & next isUppercase]
Here, we split the string at each case "uptick", and that's enough to produce
'splitSimpleCamelCase' => #('split' 'Simple' 'Camel' 'Case')
To take care of acronyms, we need a more complex split condition. Besides splitting at each uptick, we also need to split before downticks: in 'FOOBar', there is a downtick between $B and $a, so we need to split before the $B (note how in simple camel case, where each capital letter is on its own, "one step before the downtick" is always the same as uptick).
This is a trickier condition to handle because we need to look one extra step ahead in the string, while there are no built-in enumerators doing that. There is also a pesky special case of the last two characters, for which there is no character two steps ahead. The simplest solution is to set up a stream to feed us those one extra step ahead characters:
splitCamelCase
| future |
future := ReadStream on: (self copyWith: $Z).
future skip: 2.
^self piecesCutWhere: [:each :next |
each isLowercase & next isUppercase
| (next isUppercase & future next isLowercase)]
Note how the extra $Z tacked onto the end of the original string takes care of the special case of the last pair of characters (any other uppercase character would do, of course).
This works, and is pretty simple, but doesn't quite feel like a one-liner. Here is an elegant twist we can do to improve it. In the first solution we needed the 'future' stream to see the character two steps to the right of the potential cutting point. If we reverse the order of processing the string and go right to left, we will already have seen that character by the time it becomes the one two steps to the right from where we are--so all we need to do is remember if it was uppercase or not.
Of course, we cannot change #piecesCutWhere: to go from right to left, but we can reverse the string before traversing it, and then reverse the results we get back. With a bit of creative rewriting of the split condition, this gives:
splitCamelCase2
| lastWasUppercase |
lastWasUppercase := true.
^(self reverse piecesCutWhere:
[:each :next |
lastWasUppercase not | next isLowercase & (lastWasUppercase := each isUppercase)])
reverse collect: [:each | each reverse]
This was the last solution I left as a comment in Travis' blog. Now the main part of this post. I wouldn't be myself if at this point I didn't write a continuation-passing style mindbender to do the job.
splitCamelCaseCPS
^(self
inject: [:next :next2 :return | return value: Array new value: String new]
into:
[:k :each |
[:next :next2 :return |
k
value: each
value: next
value:
((each isLowercase & next isUppercase or: [next isUppercase & next2 isLowercase])
ifTrue: [[:words :thisWord | return value: (words copyWith: (thisWord copyWith: each)) value: String new]]
ifFalse: [[:words :thisWord | return value: words value: (thisWord copyWith: each)]])]])
value: self last
value: self last
value: [:words :thisWord | words copyWith: thisWord]
This sure looks like write-only code, but not for the same reason lots of Perl code does. There is a very clean structure behind this one, but it's hard to see it because the essential pieces of functionality are hidden behind anonymous blocks and their interaction through the generic #value:... messages. And also because this isn't the usual style we do things in Smalltalk.
We are going to take it apart piece by piece, and then put it back together in a few possible variations to explore how it works.
But first about continuation-passing style. Continuations are not only those special language constructs that exist in Scheme for torturing computer science students, or that can be added to Smalltalk to implement cool web application frameworks. In a more general sense, continuations are found in any program. A continuation is a function that represents the future of computation. For example, if we consider the expression
3 + 4 factorial
while the factorial is being computed, the future is to take the result and add it to 3. This can be written in Smalltalk notation as a block
[:result | 3 + result]
and we can think of returning from #factorial as invoking that one-argument block with the result we are returning as the argument. This one-argument block is a continuation. At any point where we can imagine stopping the execution of any program in any language, there is one waiting for the result. At least in theory.
Theory is great, but what use is it? One thing, it presents a nice model of multiple-value returns. Returning more than one value from a function may seem a weird thing, but it is not at all if you think of that return as calling a continuation. From that viewpoint, the continuation at a multiple-value call site is a function of several arguments, and returning multiple values simply means calling that function and passing it those arguments.
If that still sounds too far from practical use, have a look at VisualWorks method Win32DialogSupplier>>parseDefault:into:ifGarbage:. The method parses a single file name string into the path and filename components, but instead of returning them as a two-element array or something of that sort, it calls a block it expects as an argument, passing them as the arguments. That block is a continuation representing a multiple-value return!
This is an example of continuation-passing style. In general, CPS is when closures are used to implement or reimplement some features of execution semantics. These features can be something not available in the base language, such as multiple-value returns. Or they can be features already available, but reimplemented in CPS to gain more control over execution, or to work around implementation limits. For example, algorithms that require deep recursion can be implemented using CPS in environments with limited stack depth--though we are not going there in this post...
Note that CPS is not directly tied to continuations in the Scheme or Seaside sense. They are certainly the same theoretical artifacts, but you don't need continuations in a language to program in CPS. Scheme continuations are reified continuations of the base language, while in CPS we essentially construct a new language by explicitly managing its continuations. Moreover, it may come as a surprise that CPS in Smalltalk is much more ubiquitous than most people think! For one example, just look at #ifTrue:ifFalse: message. What is it if not two continuations passed to the receiver, the receiver deciding which one to invoke?
With this in mind, let's get back to our string splitting. Here is the basic idea we are going to exploit. The first two solutions based on #piecesCutWhere: were essentially the same. There was a block evaluated for each potential split location. The block examined three characters: the one before the split point, the one after, and the one following that. The only difference was the way of supplying the block with that third character--and together with that, the direction of traversing the string.
In this last solution, we follow a similar approach by setting up a structure of closures such that for each position in the original string there is a closure holding onto the character at that position, and supplied with the two characters that follow from other closures. This is the first piece of the puzzle, set up by inject:into::
self
inject: terminatorBlock
into: [:previousK :each | [:next :next2 | ...previousK value: each value: next... ]]
It's not every day that we write #inject:into: with a block as the first argument, and the entire body of the second argument being a block as well, so let's look closely what happens here. For each position in the string, we create a closure with two arguments, :next and :next2, closed over a similar closure created for the previous string position (:previousK) and the current character (:each). For the very first character in the string, :previousK is the injected terminatorBlock. The result of the entire #inject:into is the closure created for the last character in the string.
We are going to call the closures on this chain the "work closures", since they are the backbone of the entire algorithm. (Perhaps calling them "vertebrae" could be a good idea too). We can get the algorithm going by calling the last closure. The closure code (so far unspecified, except that it has to do "previousK value: each value: next" at some point) would in turn invoke the closure for the second last character it holds onto as :previousK, passing :each as the first argument to it. That closure would invoke its own :previousK, and so on.
All that's left is coming up with the code inside the blocks to collect the characters into a sequence of words, grouping words according to our usual split condition.
Of the simplest solutions, there are two dual ones. We can accumulate characters in an OrderedCollection of OrderedCollections. The outer collection is the resulting sequence of words, and the inner collections--the individual words. Two possible solutions based on this idea correspond to two directions of building the result: we can add the current character to the result before calling previousK, of after. In the former case we are accumulating the result in the reverse relative to the string order, as we are going deeper and deeper into the nested invocations of closures in our closure chain (which goes from the end to of string to the beginning). In the latter we are doing it as we are returning from the nested invocations, so the characters are added in the same order they appear in the string. For the forward order, the main traversal block would be:
[:next :next2 | | words |
words := previousK value: each value: next.
words last add: each.
(each isLowercase & next isUppercase or: [next isUppercase & next2 isLowercase])
ifTrue: [words add: OrderedCollection new].
words]
It first asks the previous closure to give it the words so far (which are the words from the beginning and up to the current position in the string), appends the current character to the end of the last word, and then if the string should be split after the current character, adds a new word to the result. The terminatorBlock at the end of the main closure chain should provided the closure for the first character with the empty result:
[:next :next2 | OrderedCollection with: OrderedCollection new]
Putting it all together (and preserving the color-coding of the above pieces for clarity), we get
splitCamelCaseCPSForward
^((self
inject: [:next :next2 | OrderedCollection with: OrderedCollection new]
into:
[:k :each |
[:next :next2 | | words |
words := k value: each value: next.
words last add: each.
(each isLowercase & next isUppercase or: [next isUppercase & next2 isLowercase])
ifTrue: [words add: OrderedCollection new].
words]])
value: self last
value: self last)
collect: [:each | String withAll: each]
The last pieces that bind it all together are the invocation of the closure chain that passes the last character of the string as :next and :next2 to the closure for the last character. This is similar to passing $Z in the first two solutions. This allows us to use the simplest form of the split condition, without recognizing the last two characters as a special case. The characters we pass here don't matter since they are not included in the result. They only have to be of the same case as the last character, to not trigger a split condition--so simply using the last character works fine. Also, there is an outer #collect: to convert OrderedCollections in the result into proper Strings.
The 'reverse" solution is similar:
splitCamelCaseCPSReverse
^((self
inject: [:next :next2 :words | words]
into:
[:k :each |
[:next :next2 :words |
(each isLowercase & next isUppercase or: [next isUppercase & next2 isLowercase])
ifTrue: [words addFirst: OrderedCollection new].
words first addFirst: each.
k value: each value: next value: words]])
value: self last
value: self last
value: (OrderedCollection with: OrderedCollection new))
collect: [:each | String withAll: each]
The interesting difference here, highlighted in blue, is that all the work closures get a third argument, for the words collected so far, and the initial empty result is created when they are first invoked, while all the terminator block does is turning around and sending back the collected words it received as the final result. And, of course, characters and words are now tacked onto the result from the front, since we are traversing the string from the end.
Now we are prepared to look at the original #camelCaseCPS. It's more involved than the two solutions above, especially the "forward" one, because of a few extra restrictions I introduced to give it more of the "single breath" or "one-liner" spirit. OrderedCollection is a fine Smalltalk tool, but the solution will have a more austere functional feel if we stick to the simple Arrays and Strings, cons up new ones instead of mutating the original, and use the simple #copyWith: to build the result.
Since #copyWith: appends an element to a collection, this latter restriction immediately forces us to build the result in the "forward" style, if we want to avoid explicitly reversing things afterwards. The no-mutation restriction encourages us to split into two parts the single result that gets passed around: "complete words so far" and "current word so far," to make them easier to manipulate.
And here we have a problem. In our "forward" solution, the result so far was returned from the invocation of the previous closure. The problem now is that the result is split into two parts, and we need to return two values instead of one. Instead of going the ugly route of returning an array of values, let's remember what was said about multiple return values--that theoretically, they are nothing but multiple-argument continuations. So, instead of using the built-in single-value return, let's construct a multiple-value return using CPS. Our work closure becomes:
[:next :next2 :return |
previousK
value: each
value: next
value:
((each isLowercase & next isUppercase or: [next isUppercase & next2 isLowercase])
ifTrue: [[:words :thisWord | return value: (words copyWith: (thisWord copyWith: each)) value: String new]]
ifFalse: [[:words :thisWord | return value: words value: (thisWord copyWith: each)]])]
The new :return argument is the return continuation, passed by the caller as the third argument of the #value:value:value: message send in this same block. It takes two arguments: :words and :thisWord, containing the words and characters of the current word collected so far, appends the current character at the proper place, and passes the new words and thisWord collections to the :return block of the current closure--that it, returns these two values to the sender.
That third argument could also have been
[:words :thisWord |
(each isLowercase & next isUppercase or: [next isUppercase & next2 isLowercase])
ifTrue: [return value: (words copyWith: (thisWord copyWith: each)) value: String new]
ifFalse: [return value: words value: (thisWord copyWith: each)]]
without changing the end result. The only difference is the moment when we choose what execution branch to take. The latter version is less interesting--we always pass the same block, and choose how to build onto the result when the block is invoked. But we can make this decision before the block invocation, since all information it relies on is already available at the time of the #value:value:value: send--so the original proposed block does the test as soon as possible, and passes the chosen execution branch to previousK already pre-determined.
The terminator block becomes
[:next :next2 :return | return value: Array new value: String new]
which is the same thing we had in #camelCaseCPSForward, except that now we are using our home-brew continuation-passing multiple-value return instead of the built-in Smalltalk block return.
One last wrinkle is the very first of the continuations that build the result--the one passed as the last argument of the outermost #value:value:value: message, and invoked by the last work closure. It is a block that takes all the words collected so far and the last current word, and joins them together.
Putting it all together, we get the original #camelCaseCPS.
And one last thing (sorry, could not resist): do not try this in Java!
Inner classes notwithstanding.