A Generic Dictionary
This week, I thought I'd experiment with some new Collection classes
The Hypothesis
Sometimes, I want a dictionary that looks up on some computed attribute of some keys. With the base library, I have two choices. Equality and Identity. Identity seems pretty straightforward. But sometimes just what makes an object "equal" in its role as a key, can be a little less. For example, what if I compute values for given floating point inputs, but I want them bin them to the nearest 1/10th. Or what if I want to lookup customers by their name, and it's ok to assume that the lookup is case insensitive.
We could apply the conversion function to our computed key going in and out of the dictionary, but that gets old.
Also, I'm intrigued that the one size fits all hash function might not always be all that is needed for a given domain. And if we could use a specific hash function we might be able to improve the speed of lookup.
The Implementation
What we need is a dictionary where we can parameterize the functions used to index keys in the dictionary. There are two functions involved, first the hash function, and secondly, the comparison function.
Smalltalk.Core defineClass: #GenericDictionary
superclass: #{Core.Dictionary}
indexedType: #objects
private: false
instanceVariableNames: 'compareFunction hashFunction '
classInstanceVariableNames: ''
imports: ''
category: 'Collections-Unordered'
Now we've got our dictionary, we'll use blocks for our functions. Here's our simple setter. We just provide one dual arg version. One has to still observe the invariant that all objects which are equal MUST hash the same, so forcing the application program to set the two functions in context of each other is probably a good thing.
compare: a2ArgBlock hash: a1ArgBlock compareFunction := a2ArgBlock. hashFunction := a1ArgBlock
We want to initialize new instances to a good default state; the default use of = and hash seems good. Sets and thereby Dictionaries, don't use an initialize method, the use a setTally method at creation time to get initialized. So we override thusly:
setTally self compare: [:a :b | a = b] hash: [:obj | obj hash]. ^super setTally
With that in place, it turns there's exactly one method that needs to be tuned to use our functions:
findKeyOrNil: key "overriden to use hash and compare function blocks, rather than hardcoded #hash and #=" | location length probe pass | length := self basicSize. pass := 1. location := self initialIndexFor: (hashFunction value: key) boundedBy: length. [(probe := self basicAt: location) == nil or: [compareFunction value: probe key value: key]] whileFalse: [(location := location + 1) > length ifTrue: [location := 1. pass := pass + 1. pass > 2 ifTrue: [^self grow findKeyOrNil: key]]]. ^location
That's all there is to it.
The Results
The case insensitive keys is kind of fun. It seems to work fine:
dict := GenericDictionary new compare: [:a :b | a equivalentTo: b ignoreCase: true] hash: [:a | a asLowercase hash]. dict add: 'Travis' -> 'Griggs'. Transcript show: (dict at: 'travis'); cr. dict at: 'TRAVIS' put: '(The Kid in Old Yeller)'. Transcript show: (dict at: 'traVIS'); cr
outputs 'Griggs' and then '(The Kid in Old Yeller)' to my transcript.
I was kind of disappointed in my attempts to find a domain where I could best the default implementation for speed. Adding the extra level of block invocation will make our version a little slower than default for the default case. I used a timing test of the form:
| dict keys | dict := (GenericDictionary new: 1000). rnd := Random new. keys := (1 to: 100) collect: [:each | rnd next]. (Time millisecondsToRun: [1000 timesRepeat: [1 to: keys size do: [:i | dict at: (keys at: i) put: $y]]]) out.
It's not too bad, for this case, it was only about 10-15% slower. I then tried some floating point domains and specialized hash functions. In most cases, the default remained faster. Only after having a good look at the Float hash method was able to derive one where I could come out ahead:
| dict keys | dict := (GenericDictionary new: 1000) compare: [:a :b | a = b] hash: [:a | a truncated]. rnd := Random new. keys := (1 to: 100) collect: [:each | rnd next * 1000]. (Time millisecondsToRun: [1000 timesRepeat: [1 to: keys size do: [:i | dict at: (keys at: i) put: $y]]]) out
This particular example assumes a set of floating point keys that are sparsely spread far apart. At this point, we know that simply truncating them for a hash function is enough to get good separation, the default hash function does a bit more than that. So what we lose in block invocation, we gain by much faster hashing, for an overall speed increase of about 2.5x.
I'd have to conclude that only in rare cases is one going to come up with specialized hash/compare functions that are going to be a significant speed improvement. It's still kind of nice to be able to use it as a binning histogram. Or a case insensitive dictionary. I'll have to cogitate to see if there aren't some more interesting problems we could throw this guy at. Can you think of any?
Comments
[isomer] June 22, 2004 8:26:59.488
Travis, this is useful for more than just dictionaries. It would be great to have a pluggable comparison block for sets, and sequenceable collections too.
Strongtalk
[water] June 23, 2004 1:51:25.296
The Strongtalk collections system has this up and down the hierarchy built in to the main classes. It doesn't take a performance hit because of the cheap inlining of blocks performed pervasively. VW may be sophisticated enough now to handle this.