9 AM on Monday morning, and I'm at Travis' Efficient Smalltalk presentation. I could have gone to Alan's GLORP talk, but I've spent many, many years trying to ignore databases - no need to change that policy now :)
Travis started in Smalltalk in 1992 - he left a Fortran job to get away from a bad boss. Taught himself Smalltalk from the Digitalk tutorials "back in the day", and has been Smalltalking ever since. He moved to Key Technology in 1996, and has been there ever since. In distribution terms, the audience is pretty heavily on the VW side, which is cool.
The idea Travis has for this tutorial: dispel the "Smalltalk is slow" myth. That plays into my talk, which will come up later this week - I've yet to hit a scaling problem with Silt that wasn't a problem with my code specifically. What it's not: VM optimization, or a "Smalltalk is faster than language (x)" thing.
Travis: Smalltalk has one of the best "thought to code" ratios - you can get the code you are thinking of out very quickly. It's easy to throw a prototype together very quickly - the downside being that it's also easy to write some very inefficient code. The good news: it's just as easy to fix a bad design, and relatively easy to improve performance. That fits my experience - I've never had much trouble getting rid of the bad spots in my blog server. The motto:
Make it Work, Make it Right, Make it Fast
Early optimization is truly the root of all evil. A well designed system optimizes easily. An example: "Flogging the Logging", where Travis says he didn't actually examine the problem before diving into it.
Performance tricks aren't always reusable
"The problem is the system isn't running the Sieve of Eratosthenes" - Alan Kay. So you need to review each gain in context - does it really matter given the larger issues? You have to be willing to look at old assumptions and throw stuff away. Something too few of us do: diagnose first, code for speed second. I know that my guesses about where the problem is in my own code are frequently disproved by the profiler.
Unit tests can help - you can verify that the code still works as you optimize. I can certainly say that a lack of tests can make for deployment surprises.
Where does the "Smalltalk is slow" myth come from?
We don't make time for the "make it fast" part. We get working code so quickly that we tend to just ship it. Throwing faster hardware at the problem is often a cop-out - project managers need to allow time for optimization.
Time millisecondsToRun: 
Run it many times, and watch for simple things that throw things off - Large Integers, memory allocation. In VW specifically, dotted references (namespaces) can be an awful lot slower. Make sure to figure out the profiling overhead, because a presumed improvement may not actually exist. Make sure that you use consistent measuring methods across tests.
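A minimal workspace sketch of that measuring discipline, not from Travis' slides: the temp names, the iteration count and the "best of five" policy are all made up for illustration. It times an empty block to estimate the harness overhead and uses the same method for every measurement:

```smalltalk
| iterations timeIt overhead best |
iterations := 1000000.	"hypothetical count; raise it until the timings are stable"

"Answer the best of five timed runs, to damp out GC pauses and other noise."
timeIt := [:aBlock | | runs |
	runs := (1 to: 5) collect: [:i |
		Time millisecondsToRun: [iterations timesRepeat: aBlock]].
	runs inject: runs first into: [:a :b | a min: b]].

"An empty block approximates the cost of the loop itself."
overhead := timeIt value: [].
best := timeIt value: [3 + 4].

Transcript
	show: 'loop overhead: ', overhead printString, ' ms'; cr;
	show: 'net cost: ', (best - overhead) printString, ' ms'; cr.
```

If the net cost comes out near zero (or negative), the thing being measured is lost in the noise of the harness, which is worth knowing before claiming an improvement.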
Profilers are a good tool to get familiar with. Nice if they can be used programmatically and from a UI. Nice if they can dump to a UI and a file. Multiple ways to view results is good too. One thing I've noticed in multiple visits to customers is how few people have ever looked at the profiling tools. I've often shown up and asked "have you profiled?" and gotten the answer "we know what the problem is". I've yet to see that be the case :/
There are different tools across implementations - VW has five, measuring time and allocation, in one process or multiple - and a VM profiler. Squeak has two options. VA ships with a very complete profiler that does sampling and tracing. It also supports some nice graphical output. You can use it interactively or programmatically. Michael Lucas-Smith points out that the Instantiations goodies include another profiler. Smalltalk/X has some basic message tally tools. Only does sampling, and doesn't have a nice UI. Also some memory tools. Dolphin has a goodie from Ian Bartholomew that is interactive only. Has a nice UI with lots of detail.
Bruce Badger brings up the Code Crawler (VW and Squeak). It pulls out tons of metrics and has good reporting tools. Somewhat reminiscent of the old Arbor Probe tool.
Patterns for Performance
This is a set of examples, each illustrating a theme (or category), and the themes overlap. While few performance issues are exactly like another, they often share the same general problem. While Travis includes numbers here, he's saying to take them with a grain of salt: YMMV.
The first thing - recalibrate your expectations. Optimization tricks are often not reusable:
- Not across different VM's
- Not across different versions
- Not across different OS platforms
- Not across different contexts
All of which explains why early optimization is such a bad idea. So some "old chestnuts": "Everyone" knows that string concatenation is slower than dumping to a stream. It turns out that concatenation is actually faster now, because it's been rewritten. You can get faster with streams by properly sizing up front, but it illustrates the power of old assumptions.
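Checking that chestnut in your own image takes a few lines. This is a rough sketch with a made-up size and string; it makes no claim about which variant wins, since that depends on the dialect, the version and how many pieces are being joined:

```smalltalk
| n s ws concatMs streamMs sizedMs |
n := 20000.	"hypothetical number of pieces"

"Plain concatenation."
concatMs := Time millisecondsToRun: [
	s := ''.
	n timesRepeat: [s := s , 'abc']].

"Stream over a default-sized string that has to grow as it fills."
streamMs := Time millisecondsToRun: [
	ws := WriteStream on: String new.
	n timesRepeat: [ws nextPutAll: 'abc'].
	s := ws contents].

"Same stream, but with the buffer sized up front."
sizedMs := Time millisecondsToRun: [
	ws := WriteStream on: (String new: n * 3).
	n timesRepeat: [ws nextPutAll: 'abc'].
	s := ws contents].

Transcript
	show: 'concat: ', concatMs printString; cr;
	show: 'stream: ', streamMs printString; cr;
	show: 'presized stream: ', sizedMs printString; cr.
```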
Another example: Many people think that #at: would be faster if it offered an unsafe version. So Travis wrote a faster #at: in Smalltalk/X. In Smalltalk/X, you can inline C code - so you can do "primitives" directly. So into a demo - remove the bounds check, remove the small integer check - and then test. It turns out that making that change doesn't actually speed things up that much - which has to do with changes to the underlying hardware. Again - when doing performance testing, you have to be willing to toss old (possibly outdated) assumptions.
Use a different algorithm
If you just chisel away at the profiler results, it's difficult to see the forest for the trees. Big gains are usually found in new algorithms, not in small tweaks. Smalltalk's thought:code ratio makes that easier. If the algorithm itself is slow, then the profiler won't help much.
An example: #indexOfSubstring:at: - in VW, the library uses the simple, direct algorithm. There's a publicly available algorithm (Boyer-Moore) that's faster.
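To give a feel for the idea, here is a sketch of the Horspool simplification of Boyer-Moore, written as a workspace block. It is not the VW library code and not full Boyer-Moore (it uses only the bad-character shift); the block name and temps are hypothetical, and the Dictionary used for the shift table is chosen for brevity rather than speed:

```smalltalk
| indexOfSub |
"Answer the 1-based index of the first occurrence of pattern in text, or 0."
indexOfSub := [:pattern :text | | m n shift pos found j |
	m := pattern size.
	n := text size.
	"Bad-character table: how far the window may slide for each character."
	shift := Dictionary new.
	1 to: m - 1 do: [:i | shift at: (pattern at: i) put: m - i].
	pos := 1.	"start of the current window in text"
	found := 0.
	[found = 0 and: [m > 0 and: [pos + m - 1 <= n]]] whileTrue: [
		"Compare the window against the pattern from right to left."
		j := m.
		[j > 0 and: [(pattern at: j) = (text at: pos + j - 1)]]
			whileTrue: [j := j - 1].
		j = 0
			ifTrue: [found := pos]
			ifFalse: [
				"Slide by the shift of the text character under the pattern's last position."
				pos := pos + (shift at: (text at: pos + m - 1) ifAbsent: [m])]].
	found].

Transcript
	show: (indexOfSub value: 'needle' value: 'haystack with a needle inside') printString;
	cr.
```

The win comes from the skips: when the character under the end of the window doesn't occur in the pattern at all, the window jumps a full pattern length instead of one position.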
Similar idea - use a different (data) representation. The RB does this with a replacement for a dictionary that does linear searching - since the data sets are small, it's a win over the overhead of dictionary hashing. Sometimes, the "slower" CS answer is faster.
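A toy version of that idea, to be clear not the RB's actual class: two parallel literal arrays and a linear scan stand in for a Dictionary. The keys, values and block name are invented for the example:

```smalltalk
| keys values lookup |
keys := #(#red #green #blue).
values := #(16rFF0000 16r00FF00 16r0000FF).

"Linear search: no hashing, just a scan over a handful of keys."
lookup := [:key | | i |
	i := keys indexOf: key.
	i = 0 ifTrue: [nil] ifFalse: [values at: i]].

Transcript show: (lookup value: #green) printString; cr.
```

Whether this beats a Dictionary depends on how many keys there are and how expensive their hash is, so it is one more thing to measure rather than assume.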
Another example: OrderedCollection can grow from both ends. What if all you need is a Stack? He implemented a Stack class that only uses #lastIndex. It's 2X faster for normal use, but is slow if you do need to add at the front.
Use a real object - don't just keep a dictionary, or a collection around and start adding helper methods to the wrong classes - create a real class so that you can isolate problems. There are simple wins too:
somePerson name is slightly faster than somePerson at: 1 - and it reads easier, which is probably more relevant in the long run.
Sometimes it's a win to do the new algorithm in Smalltalk rather than trying to optimize down in C - the C code is harder to write, so production of a new algorithm is so much easier. If you can get it to be "fast enough", you can just stop.
And we're back after the coffee break - much needed, since I'd only had one cup before this all started. So back to the examples.
Transcendental math is great (I blogged about this last week). The VM and image take care of problems for you, either via double dispatch, or via failure/coercion/retry. There is a performance hit for this convenience. So in general, SmallIntegers in Smalltalk are highly optimized, and are no work for the collector. In most Smalltalks, putting the higher generality number on the left can be much faster. i.e.,
100.0 * 10000 is around 4x faster than 10000 * 100.0. So even in Smalltalk, knowing your implementation issues helps. [ed - I had to get iterations up to a million to see a difference in VW]. Fractions are very cool, but they are also very expensive. If you don't need them, don't use them. So for instance:
(a/b < c) is much slower than (a < (b * c))
simply because of Fraction math. Division is expensive in software math in general, not just in Smalltalk.
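Both claims are easy to try in a workspace. A sketch, with made-up operands and an iteration count in line with the million it took to see a difference in VW; note that the rewritten comparison is only equivalent to the original when b is positive:

```smalltalk
| n a b c floatLeft intLeft fractionMs productMs |
n := 1000000.

"Higher-generality number on the left vs. on the right."
floatLeft := Time millisecondsToRun: [n timesRepeat: [100.0 * 10000]].
intLeft := Time millisecondsToRun: [n timesRepeat: [10000 * 100.0]].

a := 7. b := 3. c := 5.
"a / b builds a Fraction before the comparison..."
fractionMs := Time millisecondsToRun: [n timesRepeat: [a / b < c]].
"...while this stays in SmallInteger arithmetic. Only equivalent when b > 0."
productMs := Time millisecondsToRun: [n timesRepeat: [a < (b * c)]].

Transcript
	show: 'float left: ', floatLeft printString; cr;
	show: 'integer left: ', intLeft printString; cr;
	show: 'fraction compare: ', fractionMs printString; cr;
	show: 'product compare: ', productMs printString; cr.
```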
In math issues, zero is "special" - nothing is slower than a UHE (unhandled exception). An #isZero check is much faster than catching the ZeroDivide exception.
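The two styles side by side, as a sketch with hypothetical values. The divisor is zero on every pass here, which is the case where the exception machinery actually runs:

```smalltalk
| n d r checkedMs caughtMs |
n := 100000.
d := 0.

"Test first, divide only when it is safe."
checkedMs := Time millisecondsToRun: [
	n timesRepeat: [r := d isZero ifTrue: [0] ifFalse: [10 / d]]].

"Divide blindly and recover in a handler."
caughtMs := Time millisecondsToRun: [
	n timesRepeat: [r := [10 / d] on: ZeroDivide do: [:ex | ex return: 0]]].

Transcript
	show: 'isZero check: ', checkedMs printString; cr;
	show: 'ZeroDivide handler: ', caughtMs printString; cr.
```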
In some cases, limiting the production of new objects (in order to not create GC need) can help. In particular, this can be a big win in graphics work, like translating rectangles. Coding a specific rectangle move method that doesn't create objects as a side effect can end up creating far, far less garbage to clean up.
This also comes up in working with collections - most of the extant methods return new collections instead of changing in place. If you need to work with large collections, creating your own versions that modify in place can be a big win. Examples:
someCollection asSortedCollection first (or #last)
instead, try: someCollection fold: [:a :b | a min: b].
You save the creation of a throwaway collection. With the test data, it probably doesn't matter. With real data, it can make a huge difference. Side note: Martin Kobetic and Michael brought up this post from Alan, which uses streaming as an alternative approach.
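Spelled out on a small literal array (the data is invented; #inject:into: is shown as a fallback on the assumption that some dialect may lack #fold:):

```smalltalk
| data viaSort viaFold viaInject |
data := #(42 7 19 3 88 3 61).

"Builds and sorts a whole throwaway collection just to read one element."
viaSort := data asSortedCollection first.

"One pass, no intermediate collection."
viaFold := data fold: [:a :b | a min: b].

"The same single pass using #inject:into:."
viaInject := data inject: data first into: [:a :b | a min: b].

Transcript show: viaFold printString; cr.
```

Both single-pass versions need at least one element to start from, so guard for empty collections if those can occur.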
Another thing: using simple file "database" access, over-use of Filename>>construct: might be expensive. Hmm - I'll have to examine that in my blog server.
There's always the old standby - caching. For large objects this can be a nice win based on the way the VM stores large objects. Along the same lines - don't do expensive things over and over again. Sometimes, this will result in ugly code that ought to be commented. For instance, in VW, RBScanner>>scanToken:
"fast-n-ugly. Don't write stuff like this. Has been found to cause cancer in laboratory rats. Basically a
case statement. Didn't use Dictionary because lookup is pretty slow."
Another interesting thing: in general, using a MessageNotUnderstood handler is faster than asking the object if it #respondsTo: the message. Although, if you are using this for backwards compatibility, you probably have other problems.
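The two approaches look like this in a workspace. The selector #frobnicate and the fallback value are made up, and #perform: is used only so the snippet compiles without complaints about an unknown selector:

```smalltalk
| obj viaCheck viaHandler |
obj := 42.

"Ask first, then send."
viaCheck := (obj respondsTo: #frobnicate)
	ifTrue: [obj perform: #frobnicate]
	ifFalse: [#unsupported].

"Just send, and recover if the receiver doesn't understand."
viaHandler := [obj perform: #frobnicate]
	on: MessageNotUnderstood
	do: [:ex | ex return: #unsupported].

Transcript show: viaHandler printString; cr.
```

One design caveat with the handler: it also catches a MessageNotUnderstood raised deeper inside the called method, which can hide real bugs.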
Collections often have a "lifecycle" - i.e., you add things once, you sort once, and then you're done. Quite frequently, the code keeps switching collections based on early needs - you'll know you have this if you see lots of #asOrderedCollection (etc) in your code.
Simple minded standbys (cache a computation in a temp variable, for instance) can often have big impacts as well. This goes back to the fact that implementing in Smalltalk is very easy, so it's easy to get an inefficient piece of code into production.
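A contrived illustration of the temp-variable standby (the collection and the threshold are invented): the first version recomputes the maximum for every element, the second computes it once.

```smalltalk
| items slow fast limit |
items := (1 to: 1000) asOrderedCollection.

"Recomputes the maximum on every iteration: 1000 passes over 1000 elements."
slow := items select: [:each |
	each > ((items inject: 0 into: [:a :b | a max: b]) // 2)].

"Compute it once, keep it in a temp, reuse it."
limit := (items inject: 0 into: [:a :b | a max: b]) // 2.
fast := items select: [:each | each > limit].

Transcript show: fast size printString; cr.
```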
Skip Framework overhead
Don't swat flies with howitzers! Use the simplest object that could possibly work. An example - ExternalRead/WriteStream is a general solution for most needs. You can use an IOAccessor directly if you write enough files that there are performance concerns - and it can be faster. Another example: ComposedText does a lot, Label does a lot less. Label is tons faster if you don't actually need any of the features of ComposedText.
Another example: The XMLParser. The parser framework does everything you would ever want, and builds up a document. If all you need is to find a tag or two, a simpler string search might suffice (iirc, the Web Toolkit does a fair bit of this).
Inlining can make code easier to read, and can also turn up chances to optimize in other ways. This brings to mind Terry Raymond's defactoring tools. It can make a difference whether you use #isNil instead of #ifNil: - small, but noticeable. Someone points out that when IBM started inlining #ifNil: (and friends) in VAST (quite a while ago now), it broke a number of proxy frameworks. Back to the howitzer comment.
Down to the "metal" in Smalltalk. There's variance amongst primitives, especially in the graphics area across platforms. So an example: say you need to draw diamonds. Obvious: #displayPolygon: Not as obvious: create a mask and draw that. The results between these two vary a lot between X11, other platforms, and VNC (and likely things like Terminal Server). You need to be aware of your platform in cases like this.
As a last resort, you can make your own primitive (noting that this makes code harder to develop, install, and share).
Make it look faster
Users don't like being left out - include them. Get feedback, and be aware that fast is in the eye of the beholder. An example: The Windows "flying folder" copy is 20% slower (Win 2k test) than XCopy. Most users think it's faster, because they are getting feedback.
Smalltalk doesn't have to be slow - it's usually misuse of tools that causes problems. Check your assumptions, and look at the algorithms you use.