How to make the GNU Smalltalk Interpreter slower

This is another post about a modern Linux-based performance measurement utility. It is called perf, it is included in the Linux kernel sources, and it entered the kernel in v2.6.31-rc1. In many ways it is making OProfile obsolete; in fact, for many architectures oprofile is just a wrapper around the perf support in the kernel. perf comes with a few nice applications: perf top provides statistics about which symbols in user and in kernel space are called, perf record records an application (or starts an application in order to record it), and perf report lets one browse the recording with a very simple CLI utility. There are also tools to bundle a recording and the application into an archive, and a diff utility.

For the last year I have been playing a lot with GNU Smalltalk, and someone posted the results of a very simplistic VM benchmark run across many different Smalltalk implementations. In one of the benchmarks GNU Smalltalk scores last among the interpreters, and I wanted to understand why it is slower. In many ways the JavaScriptCore interpreter is a lot like the GNU Smalltalk one: a simple direct-threaded bytecode interpreter that uses computed goto (and is even compiled with -fno-gcse as indicated by the GCC online help, not that it changed anything for JSC) and heavily inlines many functions.

There are also some differences: the GNU Smalltalk implementation is a lot older and written in C. The first notable difference is that it is a stack machine and not register based, with global pointers for the SP and the IP. Some magic makes sure that inside the hot loop the IP/SP live in registers and, depending on the available registers, the current argument is kept in one as well; the interpreter definition is in a special file format, but it is mostly similar to what Interpreter::privateExecute looks like. The global state mostly comes from the need to support switching processes: an event during the run may require access to the IP in order to store it and later resume the old process. But in general the implementation is already optimized, there is little low-hanging fruit, and most experiments result in a slowdown.
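
For readers who have not seen the technique, a direct-threaded dispatch loop built on GCC's computed goto looks roughly like the following minimal sketch; the opcodes and handlers are invented for illustration and are not the actual GNU Smalltalk or JSC code.

/* Minimal direct-threaded interpreter using GCC's computed goto.
 * Each handler ends with an indirect jump to the next opcode's label,
 * so every bytecode gets its own dispatch branch for the predictor. */
#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_HALT };

static int run(const unsigned char *ip)
{
    static void *dispatch[] = { &&op_push, &&op_add, &&op_halt };
    int stack[32], *sp = stack;

    goto *dispatch[*ip++];

op_push:
    *sp++ = *ip++;              /* one-byte operand follows the opcode */
    goto *dispatch[*ip++];
op_add:
    sp--;
    sp[-1] += sp[0];
    goto *dispatch[*ip++];
op_halt:
    return sp[-1];
}

int main(void)
{
    const unsigned char code[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT };
    printf("%d\n", run(code));  /* prints 5 */
    return 0;
}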

The two important things are again: having a stable benchmark and having a tool that helps you know where to look. In my case the important tools are perf stat, perf record, perf report and perf annotate. I have put a copy of the output at the end of this blog post. The stat utility provides the number of instructions executed, branches, branch misses (i.e. mispredictions), and L1/L2 cache references and misses.

The stable benchmark helps me judge whether a change is good, bad or neutral for performance within the margin of error of the test. E.g. if I attempt to reduce the code size, the number of instructions executed should decrease; if I start putting __builtin_expect hints into my code, the number of branch misses should go down as well. The other useful utility is perf report, which allows one to browse the recorded data; this helps identify the methods one wants to start optimizing, and these functions can be annotated inside the simple TUI interface, which however does not support searching.
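
For illustration, this is the kind of hint I was experimenting with; the Linux kernel wraps __builtin_expect in likely()/unlikely() macros like these, and the lookup function below is made up for the example.

#include <stdio.h>

/* The kernel's branch-hint macros: the double negation normalizes any
 * truthy value to 0 or 1 before passing it to __builtin_expect. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static int table[256];

static int slow_path(int key)
{
    (void)key;
    return -1;  /* pretend this is an expensive fallback */
}

/* Hypothetical hot function: tell the compiler the bounds check almost
 * never fails, keeping the fast path free of the rarely-taken branch. */
static int cache_lookup(int key)
{
    if (unlikely(key < 0 || key > 255))
        return slow_path(key);
    return table[key];
}

int main(void)
{
    printf("%d %d\n", cache_lookup(5), cache_lookup(1000));  /* 0 -1 */
    return 0;
}

Of course such hints only help when they match the real runtime behaviour; in my case they mostly did not, which is why removing them was a speed-up.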

Because the codebase is already highly optimized, any of my attempts should either decrease the code size (and the pressure on the i-cache) or the data size (d-cache), remove stores or loads from memory (e.g. by reordering instructions), or fix branch predictions. The sad truth is that most of my changes were either slowdowns or neutral to performance, and it is really important to undo such changes and not have false pride (unless the change was also a code cleanup or such).

So after about 14 hours of toying with it, the speed-ups I have managed to achieve come from inlining a method that unwinds a context (callframe), reordering some compares on the GC path, and disabling the __builtin_expect branch hints, as they were mostly wrong (something the kernel people found to be true in 2010 as well). I will just try harder, work on the optimizer, or attempt something more radical…

$ perf stat gst -f Bench.st
219037433 bytecodes/sec; 6025895 sends/sec

Performance counter stats for 'gst -f Bench.st':

17280.101683 task-clock-msecs # 0.969 CPUs
2076 context-switches # 0.000 M/sec
123 CPU-migrations # 0.000 M/sec
3925 page-faults # 0.000 M/sec
22215005506 cycles # 1285.583 M/sec (scaled from 70.02%)
40593277297 instructions # 1.827 IPC (scaled from 80.00%)
5063469832 branches # 293.023 M/sec (scaled from 79.98%)
70691940 branch-misses # 1.396 % (scaled from 79.98%)
27844326 cache-references # 1.611 M/sec (scaled from 20.02%)
134229 cache-misses # 0.008 M/sec (scaled from 20.03%)

17.838888599 seconds time elapsed

PS: The perf support probably works best on Intel-based platforms, and the biggest remaining problem is that perf annotate has some issues when code is included from other C files.

Nokia and Windows Phone 7

I had the opportunity to play with MeeGo for Handsets 1.1 and the upcoming 1.2 on a Nokia N900 in December of last year. I felt very much reminded of Openmoko before I joined them: there were a lot of promises and dreams, but the reality looked different. The Handsets 1.1 release was not working at all; besides being very slow, the xterm and the window manager couldn't quite agree. The pre-releases of 1.2 worked a bit better but are still far away from being usable. The SDK situation is not much better. The MADDE approach is very promising and the tools to recreate the SDK actually work, but the Qt installation differs depending on whether you are using the X86 or the ARM SDK. E.g. the WebKit headers are present in the X86 SDK but not in the ARM SDK. The reason is that the two SDKs' package descriptions live in different files and do not get synced.

Nokia announced today that it is entering a strategic partnership with one of the biggest software companies in the world, Microsoft. I am glad Nokia is not trying to turn Symbian into a real operating system, and I understand that after throwing out the GTK+ platform and restarting with Qt, MeeGo is not close to challenging Android/iPhone. Nokia missed out on purchasing WebOS from Palm, and the best alternative really is to go with Microsoft's Windows Phone 7. On the other hand, it is scary that Windows Phone 7 is the best alternative.

Wireshark dissector for TETRA

The Professional Mobile Communication Research Group of the Beijing Institute of Technology (BIT) was kind enough to send us their TETRA Wireshark dissector. They went through the specification and created ASN.1 files out of the tables; I helped with the integration and cleanup to get the code into Wireshark, and the Wireshark developers were kind enough to do a fast review. The code is now merged.

The next part is to extend the GSMTAP dissector to pass the TETRA bits on to the TETRA dissector. I have already written the code; now I need to get it merged into the Wireshark codebase too. The hand-off looks roughly like the sketch below.
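
This is a minimal sketch of that hand-off in Wireshark's C dissector API; the registered dissector name and the helper shown here are from memory and may not match the merged code exactly.

/* Sketch: hand the payload after the GSMTAP header to the TETRA
 * dissector. Names are illustrative, not the exact merged code. */
#include <epan/packet.h>

static dissector_handle_t tetra_handle;

void proto_reg_handoff_gsmtap(void)
{
    /* Look up the TETRA dissector by its registered name. */
    tetra_handle = find_dissector("tetra");
}

static void dissect_gsmtap_tetra(tvbuff_t *tvb, int offset,
                                 packet_info *pinfo, proto_tree *tree)
{
    if (tetra_handle) {
        /* Everything after the GSMTAP header belongs to TETRA. */
        tvbuff_t *payload = tvb_new_subset_remaining(tvb, offset);
        call_dissector(tetra_handle, payload, pinfo, tree);
    }
}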

My TETRA agenda includes figuring out majority-logic decoding with syndromes, which is required for the shortened Reed-Muller code used in the AACH, and creating a proper GNU Radio block for the decoder that emits soft bits we can feed into our Viterbi decoder… maybe I will manage to do this during my vacation in Taiwan… who knows.

Cleaning up the cellmgr_ng and turning it into a MGW/STP

One year ago I started on the BSC NAT application from a hotel room in Munich; I had flown from Taiwan to Munich and was happy to see the snow from my window. Shortly after that I began to implement something we called cellmgr_ng. The job was to take MTP Level 3 coming out of a library, put the SCCP payload into the IPA protocol, and send it to a MSC. As part of MTP Level 3 I had to implement the link bring-up and such as well (once again without having much time to read all the specs).

The problems were, as most of the time, not the ones one would expect. The system behind the E1/T1 link was sending wrong SCCP messages: when a SCCP connection was not closed fast enough, it would send out its own release message but was confused about the source local and destination local references. As a result I had to add connection tracking to the codebase; I am also rewriting/removing point codes to avoid crashes and to make the MSC happy.

Last month and this month I have had the pleasure of resuming work on this. My knowledge of MTP Level 3 is at least a bit bigger now, we have done the libosmocore split so I can use more features without copying code around, and there was new demand.

The first big thing is to support a link set with multiple links in it. This allowed me to do a lot of cleanups; some are still to do, but the abstraction is a lot better now, and I will do testing, force link failures, and see how the system behaves. This will also end with VTY commands to manage links via the telnet interface, more statistics collected and reported, and the ability to write PCAP files for the link sets from the VTY. The separation is roughly like the sketch below.
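
Conceptually the new abstraction looks roughly like this; the struct and field names are invented for illustration and are not the actual cellmgr_ng code.

/* Sketch of the link set / link separation; names are illustrative. */
struct mtp_link {
    struct mtp_link_set *set;   /* back pointer to the owning link set */
    int slc;                    /* signalling link code within the set */
    int available;              /* link is up, aligned and usable */
};

struct mtp_link_set {
    struct mtp_link *links[16]; /* SS7 allows up to 16 links per set */
    int n_links;
};

/* Pick a usable link, e.g. for load sharing by SLS; falling through to
 * the other links is what makes forced link failures survivable. */
static struct mtp_link *link_for_sls(struct mtp_link_set *set, int sls)
{
    int i;

    for (i = 0; i < set->n_links; ++i) {
        struct mtp_link *link = set->links[(sls + i) % set->n_links];
        if (link->available)
            return link;
    }
    return 0; /* no usable link: the whole link set is down */
}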

The second big feature is the addition of M2UA to the codebase. The code does not support fail-over scenarios yet, but this should be relatively easy to add within the structure now.

The result is that we now have something that could be used as a media gateway, converting MTPL2/MTPL3/{SCCP,ISUP} into SCTP/M2UA/MTPL3/{SCCP,ISUP} and converting the audio to and from RTP; the other part is a Signalling Transfer Point (STP) with some basic capabilities.

Using systemtap userspace tracing…

At the 27C3 we were running a GSM network, and during the preparation I noticed a strange performance problem coming from the database library we are using. I filled our database with some dummy data, created a file with the queries we normally run, and executed time cat queries | sqlite3 file as a mini benchmark. I also hacked this query code into our main routine and ran it with time as well. For some reason the code going through the database library was five times slower.

I was a bit puzzled, and I decided to use systemtap to explore this, to build a hypothesis and to have the tools to test it. I wanted to find out if the code is slow because our database library does some heavy work in its implementation, or because we execute a lot more queries behind the scenes. I created the probe below:

probe process("/usr/lib/libsqlite3.so.0.8.6").function("sqlite3_get_table")
{
    a = user_string($zSql);
    printf("sqlite3_get_table called '%s'\n", a);
}

This probe is executed whenever the sqlite3_get_table function of the mentioned library is called. $zSql is an argument passed to sqlite3_get_table and contains the query to be executed. I convert the pointer to a local string and can then print it. This simple probe helped me see which queries were executed by the database library and enabled an easy optimisation.
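
For context, this is roughly what a caller of sqlite3_get_table looks like; the probe above fires on entry and prints the zSql argument, i.e. the query string. The database file name and the query here are of course made up.

#include <sqlite3.h>
#include <stdio.h>

int main(void)
{
    sqlite3 *db;
    char **result, *err = 0;
    int rows, cols;

    if (sqlite3_open("file", &db) != SQLITE_OK)
        return 1;

    /* The second argument is what the probe sees as $zSql. */
    if (sqlite3_get_table(db, "SELECT * FROM subscriber;",
                          &result, &rows, &cols, &err) == SQLITE_OK)
        sqlite3_free_table(result);
    else
        sqlite3_free(err);

    sqlite3_close(db);
    return 0;
}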

In general it could be very useful to build a set of probes (I think one calls such a set a tapset) that check for API misuse, e.g. calling functions with certain parameters where something else might be better. E.g. in GLib, use truncate instead of assigning "" to a GString, or check for calls to QString::fromUtf16 coming from Qt code itself. On second thought, this might be better done as a GCC plugin, or both. The GString case is spelled out below.
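
To spell out the GString example: both calls below empty the string, but the assignment goes through the generic string-replacement path while truncate merely resets the length, which is exactly the kind of pattern such a probe could flag.

#include <glib.h>

int main(void)
{
    GString *s = g_string_new("hello world");

    /* Works, but wasteful: a full assignment just to empty the string. */
    g_string_assign(s, "");

    /* Better: simply cut the length back to zero. */
    g_string_truncate(s, 0);

    g_string_free(s, TRUE);
    return 0;
}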

In the name of performance

I tend to see people doing weird things and then claiming that the change improves performance. This can be re-ordering instructions to help the compiler, attempting to use multiple cores of the system, or writing a memfill in assembly. On the one hand people can be right and the change does make things faster; on the other hand they could be using assembly to make things look very complicated and to justify their pay, and you might feel awkward questioning whether it makes any sense.

In the last couple of weeks I have stumbled over some of those things. For some reason I found this bug report about GLIBC changing the memcpy routine to an SSE-optimized one and breaking the flash plugin (because it uses memcpy in a way the C standard does not allow). The breakage was justified by the claim that the new memcpy is optimized and faster. As Linus points out with his benchmark, the performance improvement is mostly wishful thinking.
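
To make the breakage concrete: memcpy is undefined for overlapping buffers, and an implementation that copies backwards (as an SSE variant may) scrambles exactly the case the plugin relied on; memmove is the correct call. A small demonstration, not the actual plugin code:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char a[] = "abcdef";
    char b[] = "abcdef";

    /* Undefined behaviour: source and destination overlap, so the
     * result depends on the direction the implementation copies in. */
    memcpy(a + 1, a, 4);

    /* Defined: memmove handles overlap and always yields "aabcdf". */
    memmove(b + 1, b, 4);

    printf("%s %s\n", a, b);
    return 0;
}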

Another case was someone providing MIPS-optimized pixman code to speed up all drawing, which turned out to be wishful thinking as well…

The conclusion is: if someone claims that things are faster with his patch, do not simply trust him. Make sure he refers to his benchmark and provides numbers from before and after, and maybe even try to run it yourself. If he cannot provide this, you should wonder how he measured the speed-up! There should be no place for wishful thinking in benchmarking. This is one of the areas where Apple's WebKit team constantly impresses me.

GSM in Smalltalk – a GSM Toolkit

I started to play with Smalltalk somewhere in February, more specifically with the GNU Smalltalk implementation. As with any new language and class library it takes a while to get productive, and it took me until somewhere last month before I finally started to do GSM handling in Smalltalk. Thanks to laf0rge the code is now in a public repository, hosted along the other Osmocom projects. You can see all the subprojects over here.

So far I can create and parse the ipaccess IPA protocol, SCCP, BSSAP, BSSMAP, and the Call Control and Mobility Management messages of GSM48. I also have a small Web-UI to connect to a MSC and then do a Location Updating Request and a Mobile Originated Call for a configurable IMSI (you will need to know your AuKey for that). It is very easy to parse and create the whole stack of messages, and adding new GSM48 messages is easy as well (I still consider writing a program to extract these directly from the GSM48 PDF).

This code is probably very well suited for load testing a MSC, e.g. finding out how many LUs it can do per second, and for easily playing with various quirks. Thanks to the nature of Smalltalk one can change the creation of messages at runtime without needing to reconnect to the MSC. The next thing would probably be to use it for fuzzing a MSC…

One note of caution: this is my first Smalltalk project and there are probably plenty of things I do wrong. I already know some of them and will work on refactoring it into a better structure. The next stop is nice ASN.1 support, which I will base on the LDAP work.

osmo-bsc in OpenBSC master

A small service announcement. The bsc_msc_ip from on-waves/bsc-master is now dead and one can use osmo-bsc from master. It has the same functionality but is implemented in a cleaner way, making it more easily extendable.

The biggest benefit of open/free BSC equipment is the flexibility. If your network is a bit different from what people thought up 30 years ago, you are now in a position to easily change that. E.g. your BSC could have different MSC connections and send different subscribers to different MSCs.

I am going to work on some new BSC features as well. With the new internal BSC API it should be possible to have different call routing based on some criteria, implement IMSI access control on the BSC, provide CBS services…

Best Future ever

I'm attending some classes at university to finish up my degree. The classes themselves are close to abuse and punishment (they are mandatory), but on the commute I manage to read one paper in each direction. I was reading Reflective Facilities in Smalltalk-80 and saw the best implementation of a Future ever. In their Smalltalk-80 dialect all message sends go through a #dispatchMessage: selector, and the Future is implemented in there.

dispatchMessage: aMessage
    semaphore wait.
    result become: self.
    ^super dispatchMessage: aMessage

This implementation waits for a signal on the semaphore and then turns itself into the result of the operation. This allows one to write code like this:
| f |
f := Future promising: [2.0 + 2.0].
f printString.
The execution of the message sent to the Future will block in #dispatchMessage: until the result of 2.0 + 2.0 is available, and then printString is sent to the result. There is only one downside: most Smalltalk dialects don't have a #dispatchMessage: like this.