Things I didn’t know about RTP and AMR

Oh my god… Looking at an AMR payload wrapped in RTP, wrapped in UDP, wrapped in IP… one will recognize that there is an overhead of more than 70% of headers vs. payload… The other thing is… the AMR payload can have a CRC or not, but this is not indicated in the RTP header and must be signaled out of band… it took someone a while to figure that out. Hurray for having clever colleagues.
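To get a feeling for the numbers, here is a small sketch of the arithmetic. It is my own back-of-the-envelope calculation under stated assumptions, not a capture from a real network: IPv4 without options, an RTP header without CSRC entries or extensions, one AMR frame per packet and the octet-aligned payload format of RFC 4867. The function names are made up for this example.

```python
# Speech bits per 20 ms frame for the AMR narrowband modes (kbit/s -> bits).
AMR_SPEECH_BITS = {
    "4.75": 95, "5.15": 103, "5.90": 118, "6.70": 134,
    "7.40": 148, "7.95": 159, "10.2": 204, "12.2": 244,
}

IPV4_HEADER = 20  # assumption: no IP options
UDP_HEADER = 8
RTP_HEADER = 12   # assumption: no CSRC list, no header extension


def amr_payload_bytes(mode, crc=False):
    """Size of an octet-aligned AMR RTP payload carrying a single frame."""
    speech = (AMR_SPEECH_BITS[mode] + 7) // 8  # speech bits padded to octets
    # 1 byte payload header (CMR) + 1 byte table-of-contents entry,
    # plus 1 byte frame CRC if that option was negotiated out of band.
    return 1 + 1 + (1 if crc else 0) + speech


def header_share(mode, crc=False):
    """Fraction of the IP packet that is IP/UDP/RTP header."""
    headers = IPV4_HEADER + UDP_HEADER + RTP_HEADER
    return headers / (headers + amr_payload_bytes(mode, crc))


if __name__ == "__main__":
    for mode in AMR_SPEECH_BITS:
        print(f"AMR {mode}: payload {amr_payload_bytes(mode)} bytes, "
              f"headers are {header_share(mode):.0%} of the packet")
```

Under these assumptions the headers make up between roughly 55% (AMR 12.2) and 74% (AMR 4.75) of every packet. Note that nothing in the 12 RTP header bytes says whether the CRC byte is present, which is why the receiver has to learn it from the signaling.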

Dealing with Performance Improvements

I hope this post is educational and helps those among us who do performance optimisations without any kind of measurement. If you do these things without a benchmark you are either a genius or, very likely, your application is going to run slower. I’m not going to talk about performance analysis right now, but tools like OProfile, callgrind, sysprof and speedprof are very handy utilities. The reason I’m writing this up is that I saw a performance regression in one of my testcase reductions, which is something I don’t appreciate. In general I see a lot of claims about performance tuning but very little in the way of measurements, and that part is very worrying.

For QtWebKit we have the performance repository with utilities, high level tests and something I labeled reductions. In detail we do have the following things:

  1. Macros for benchmarking. I started with the QBENCHMARK macros, but they didn’t really provide what I needed and changing them turned out to be a task I didn’t have time for. I created WEB_BENCHMARK macros that work the same way as the QBENCHMARK macros. One of the benefits is better statistics: they print the mean, the standard deviation and similar figures at the end of the run. They also use a different metric for measuring time. I’m using the setitimer(2) syscall to measure the CPU time spent in userspace and in kernelspace on behalf of the application. This metric is a robust way to avoid issues like CPU scheduling and such. It would be the wrong metric for measuring latency and such though, as we are not executing anything while waiting.
  2. Pick the area you want to optimize. With the QtWebKit performance repository we have a set of reductions. These reductions consist of real code, a test pattern and test data. The real code comes from WebCore and drives Qt; the test pattern comes from loading real webpages and is created by adding printf and such to the code; the test data is the data that was used when creating the test pattern. We have these reductions for all image decoding operations we do on the webpages, for our font usage and for our QTextLayout usage.
    The really awesome bit about these reductions is that they generate stable timings and are/should be fully deterministic. This allows me to really measure any change I make to, let’s say, QImageReader and the decoders.

Using the setitimer(2) syscall we will have pretty accurate CPU usage of the benchmark, using the /lib/libmemusage.so of GLIBC we should have an accurate graph of the memory usage of the application. It is simple to create a benchmark, it is simple to run the benchmark, it is simple to run the benchmark with memory profiling. By looking both at CPU and Memory usage it will become pretty clear if and where you have tradeoffs between memory and CPU.
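To illustrate the idea (not the actual WEB_BENCHMARK macros, which are C++), here is a minimal Python sketch. It swaps in time.process_time() for setitimer(2) and tracemalloc for libmemusage.so; both are stand-ins I picked for this example, and the workload function is a made-up placeholder for a real reduction.

```python
import statistics
import time
import tracemalloc


def cpu_benchmark(func, runs=10):
    """Run func several times, return (mean, stdev) of CPU seconds used."""
    samples = []
    for _ in range(runs):
        start = time.process_time()  # user + system CPU time of this process
        func()
        samples.append(time.process_time() - start)
    return statistics.mean(samples), statistics.stdev(samples)


def peak_memory(func):
    """Return the peak number of bytes Python allocated while func ran."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def workload():
    # Hypothetical stand-in for a reduction: deterministic, no I/O, no waiting.
    return sum(i * i for i in range(200_000))


if __name__ == "__main__":
    mean, stdev = cpu_benchmark(workload)
    print(f"cpu time: mean={mean:.6f}s stdev={stdev:.6f}s")
    print(f"peak memory: {peak_memory(workload)} bytes")
```

Because only CPU time is counted, a function that merely sleeps reports close to zero seconds, which is exactly why this metric is the wrong one for latency. CPU and memory are measured in separate runs here, as the memory tracing itself costs CPU time.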

And I think that is the key to a benchmark. It must be simple so people can understand what is going on, and it must be simple to execute so everyone can do their own measurements and verify your claims. And especially having a benchmark and having people verify your measurements keeps you honest.

Finally, the commit message should state that you have measured the change, it should show the result of the measurement and it should contain some interpretation. E.g. you are optimizing for memory usage and then a small CPU usage hit is acceptable…

Explorations in the field of GSM

Something like 14 months ago I had no idea about GSM protocols. 12 months ago I was implementing paging for OpenBSC. Beginning last summer I explored SS7 and SCCP, wrote a simple SCCP stack for On-Waves and started to implement the GSM A Interface for OpenBSC. In the last week I found myself learning more about MTP Level 3. With Osmocom I am starting to explore GSM Layer 1 (TDMA, bursts, syncing) and GSM Layer 2 (LAPDm), and on GSM Layer 3 we mostly see the counterpart of OpenBSC.

I feel like I am back at school (in a positive way). I have learned a lot in the past year, and looking forward I will learn more about the protocols used on the MSC side and such. I’m very excited about what the future is going to be like. Will we have a complete GSM Network (BTS, BSC, MSC, MS, SMSC, GPRS gateway(s)) with GPL software by the end of the year?

Conclusions of my QtWebKit performance work

My work on QtWebKit performance came to a surprising end late last month. It might be interesting for others to see how QtWebKit compares to the various other WebKit ports, where we have some strong points, where we have some homework left to do and where to pick up from where I had to leave it.

Memory consumption

Before I started, our ImageDecoderQt was decoding every image as soon as the data was complete. The biggest problem with that is that the ImageSource we are embedded into does not tell the WebCore::Cache about the size of the images we have already decoded.

In this case there is no need to decode the whole image as soon as the data comes in; we can instead wait for the ImageSource to request the image size and the image data. This makes a noticeable difference on memory benchmarks and allows the WebCore::Cache to control the lifetime of decoded image data.
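The idea can be sketched in a few lines. This is only an illustration of decode-on-demand with made-up class names; it is not the ImageDecoderQt code, and DecodedSizeCache is a hypothetical stand-in for the bookkeeping WebCore::Cache does.

```python
class DecodedSizeCache:
    """Hypothetical stand-in for a cache that accounts for decoded bytes."""

    def __init__(self):
        self.decoded_bytes = 0

    def add(self, size):
        self.decoded_bytes += size


class LazyImage:
    """Hypothetical image that keeps encoded data and decodes on demand."""

    def __init__(self, encoded, decode, cache):
        self._encoded = encoded
        self._decode = decode  # callable: encoded bytes -> decoded bytes
        self._cache = cache
        self._pixels = None

    def data_complete(self):
        # The eager variant would decode here. The lazy one does nothing,
        # so no decoded data exists that the cache does not know about.
        pass

    def pixels(self):
        # Decode on the first request only and report the size to the cache.
        if self._pixels is None:
            self._pixels = self._decode(self._encoded)
            self._cache.add(len(self._pixels))
        return self._pixels
```

With this shape an image that is downloaded but never painted costs only its encoded size, and everything that is decoded is visible to the cache.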

We still have one case where we have more image data allocated than the WebCore::Cache thinks. This is the case for GIF images as we are decoding every frame to figure out how many images we have there.

To fix that we should patch the ImageSource to ask the ImageDecoder for “extra” allocated data, and we should fix/verify the GIF Image Reader so we can jump to a given GIF frame and decode it. This means we should remember where certain frames begin…

Performance

Networking

Markus Götz and Peter Hartmann are busy working on the QNetworkAccessManager stack. Their work includes improving the parsing speed of HTTP headers, making sure to start HTTP connections after the first iteration of the mainloop instead of the third.

In one of my tests wget is still twice as fast as the Qt stack to download the same set of files. And wget is using one connection at a time, no pipelining… and Qt is attempting to have up to 6 connections in parallel. This means there is still some work to do in reducing latency and improving scheduling of requests. I’m pretty confident that Markus and Peter will work on this!

Images

The biggest limitation of the Qt image decoders is that in general progressive loading is not possible. On the other hand, unless I have messed up my reduction, the Qt image decoders are faster than the ones we have in WebCore.

With some of my reductions I can make some things twice as fast for the usage pattern QtWebKit has on QImageReader. Currently, when asking the QImageReader for the size, the GIF decoder will decode the full frame (size + image data). For the JPEG decoder we start the JPEG decompression separately for getting the size, the image and the image format.

A proof of concept patch for the JPEGReader to reuse the decompression handler showed that I can cut the runtime of the image_cycling reduction by 50%.

Misc

One misc. performance goal is to remove temporary allocations. E.g. remove QString::detach() calls from the paint path, to not copy data when moving from QString to WebCore::String, QByteArray to WebCore::String. Some of these include not using WebCore::String::utf8(), but have a zero cost conversion of WebCore::String to QString and use Qt’s utf8()…

Text

But the biggest problem of QtWebKit performance is text, and I started to work on this. For Qt we always have to go through the complex text path of WebCore, which means we will end up at QTextLayout, which will ask harfbuzz to shape the text.

There are two things to consider here. For QtWebKit we are using Lars’s QTextBoundaryFinder instead of ICU. I’m not sure if we have ever compared how ICU and QTextBoundaryFinder split text. We might do more work than is necessary; at least it would be good to know. Especially for Japanese and Korean we might split words too early, creating more work for our complex text layout path.

The second part is to look at our QTextLayout usage pattern and start to optimize for it… the quick solutions of asking QFont not to do kerning and not to do font merging (to not use the QFontEngineMulti) didn’t really make a noticeable difference… To get an idea of the size of the problem: on loading pages like the Wikipedia article on the Maxwell Equations we are spending as much time in WebCore::Font::floatWidthForComplexText as other ports like WebKit/GTK+ take to load the entire page. This also seems to be the case for sites like Google News.

And this is exactly where I would have loved to continue to work on it, but that is now pushed back to my spare time where it needs to compete with the other hobby projects.

On getting Free Support

Most of the things I know about Software and Hardware I have from reading books, from looking at source code and, most importantly, from people willing to answer my questions on IRC and to give me a direction in which I could look for answers.

Now it seems to be my part to give back and help others to gain knowledge, but some things have changed. Free Software has made it to the mainstream, so besides hackers that want to understand things, we do have paid programmers that don’t want to understand but still need to make it work.

So the other day I found myself in an IRC query with someone. His target board was a Freescale i.MX27 ADS and his mission was to make it boot from NAND. He was using the OpenEmbedded “mx27ads” machine and had successfully built a kernel and rootfs. Now the problem is that Freescale is not particularly liked in the Free and Open Source Software Community and that Freescale prefers to be on an isolated island. After spending about 8 hours of my time on this, I decided to get back to paid work and carry on.

If you are searching for support on mailing lists, IRC channels and IRC queries, be prepared to think. Giving Free Support means that you will be helped to understand the problem, and you have to pick a solution yourself. If you don’t like that, don’t want to think or don’t have the time to think, you should consider hiring someone from the IRC channel as a consultant.

So here is a list of things that work when paying a consultant but not when you are paid to do your job and you need someone else to do it for you:

  1. Pasting your log somewhere and then asking what is wrong with it. In case of compile failures with GNU make you have to search for ‘***’ in the error log, in case of an early failure of a bitbake run it tells you what software to install, in case of configure failures read the config.log, in case of other failures search the log for ‘Error’, ‘rror’, ‘Failure’, ‘ailure’. And most importantly, think before you post it.

    At OpenEmbedded we check for Software being installed on the system. This includes checking for GIT, CVS, SVN and other projects needed to bootstrap. If you don’t have it installed, OpenEmbedded will tell you “Error… you don’t have installed:” and the output will be finished by “You will have to install….”. If you paste such an error and ask what the problem is, you are really embarrassing yourself, your parents, your teachers and your country. The error message tells you which binaries it searched for and couldn’t find, and it points you to common package names for them. I don’t think it can be any easier for a Software Engineer.

  2. Knowing your hardware. Remember, you have the hardware in front of you, you have it connected somewhere, you have something compiled on your system. So if you are asked about the system specification you will have to point to a PDF or website that names the SoC, the flash chip used (NAND, NOR, erase size, name of it), the SDRAM used and whatever else is used on the system. If you paste a FAQ of something like Buildroot you have clearly failed. Know the stuff that is on your desk; if you don’t, nobody can help you.
  3. Pasting unrelated information. When you are asked to paste the output of something, do not just write that one line by hand. In many cases we humans apply interpretation to things. When you are asking for help, it is an indication that you were not able to interpret the result in a way that leads to a result. Paste the full log, it saves everyone a lot of time.
  4. Listen to/read what people tell you. If you have a kernel without NAND support but want to boot from NAND, you will have to enable NAND and the MXC NAND driver in your kernel. You will have to compile the kernel again, and you will need to boot that kernel. Now we are all humans and have made mistakes before.

    One of the most common mistakes is to not change the config, or not to boot the kernel you have built.

    For the config, check .config after you have built if it is the one you expected it to be, in case of OpenEmbedded copy the .config back to the defconfig (recipes/linux/${PN}-${PV}/${MACHINE}/defconfig). You should copy it back to have a backup in the case you are rebuilding the kernel or such.

    For the second thing it is rather easy. The kernel does contain the TIME it was built and it contains the number of times you have built it from the same directory. So if you rebuild the kernel the TIME and the NUMBER will go up. The kernel does print this information at bootup. If you are asked to check that, don’t paste the log, but check it for yourself. Last but not least, if the NUMBER did not go up or the TIME did not change you have either not rebuilt it, or not from the same directory… And if you didn’t rebuild from the same directory then you are likely to not use the right config…

  5. Don’t do crazy things while people spend their time to help you. Do not remove your build directories and just start over. No, even after your next rebuild the kernel will lack NAND support. That is because it is not enabled in the defconfig for your machine, and the build process is deterministic… You have someone on the other side that decided he wants to help you, if you are not focused, why should he?
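Two of the checks from the list above are easy to do mechanically before asking anyone: searching a build log for the usual failure markers (item 1) and reading the build number and build time from the kernel banner (item 4). The sketch below is only an illustration; the function names are made up, and the regular expression assumes the usual “Linux version … #N … date” banner layout.

```python
import re

# Markers from item 1. 'rror' and 'ailure' catch both the capitalised
# and the lower-case spelling of Error and Failure.
MARKERS = ("***", "rror", "ailure")


def suspicious_lines(log_text):
    """Return (line_number, line) pairs that contain one of the markers."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(marker in line for marker in MARKERS):
            hits.append((number, line))
    return hits


# The banner ends with '#<build number> <build time>'. Flags such as SMP or
# PREEMPT, if present, end up at the start of the second group.
BANNER = re.compile(r"Linux version \S+ .*#(\d+) (.*)$")


def build_number_and_time(banner_line):
    """Return (build number, build time string), or None if nothing matches."""
    match = BANNER.search(banner_line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


if __name__ == "__main__":
    # Made-up banner line for illustration, not taken from a real board.
    banner = ("Linux version 2.6.31 (user@host) (gcc version 4.3.3 (GCC) ) "
              "#12 Tue Jan 5 12:34:56 CET 2010")
    print(build_number_and_time(banner))
```

If the number after the ‘#’ is the same before and after your rebuild, you are not booting the kernel you think you built.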

And for all of you who have read until the end: if you decide to seek help in an open forum, first do your homework by having some idea about the problem, have information ready (hardware spec, build logs, whatever), be prepared to think, formulate a hypothesis and try it.

With Free Software you are in the fortunate situation that you can talk to the guys who build the stuff you are using, all you need to do is to be focused on receiving help and be prepared to think.

Reverse engineering with okteta

In the last week I was hacking on OpenBSC to make GSM 12.21 Software Load usable for the ip.access nanoBTS. The difficulty was not within GSM 12.21, as Harald had it implemented for the Siemens BS11 BTS. The difficulty was that some messages need to contain parameters and these come directly from the firmware file, which ultimately means that one needs to understand the firmware file format to extract them. Okteta came to my rescue and it was extremely good at doing this.

Okteta has not only the hex view one expects but also some useful utilities. Select a couple of bytes and the “Decoding Table” can tell you their values in the different endiannesses. So whenever I thought something was a file length, I would select the bytes, look into the “Decoding Table”, see how many bytes I had selected and check whether the value could make sense. It can also calculate various checksums over a selection.
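For those without Okteta at hand, the same kind of check can be sketched in a few lines of Python. This is not Okteta’s code and says nothing about the actual nanoBTS firmware layout; the function names and the “value must fit into the file” plausibility rule are assumptions made for this example, and CRC-32 is just one of many possible checksums.

```python
import zlib


def decode_table(selection):
    """Interpret the selected bytes as unsigned integers in both byte orders."""
    return {
        "length": len(selection),
        "little_endian": int.from_bytes(selection, "little"),
        "big_endian": int.from_bytes(selection, "big"),
    }


def plausible_lengths(selection, file_size):
    """Name the byte orders in which the selection could be a length field."""
    table = decode_table(selection)
    return [order for order in ("little_endian", "big_endian")
            if 0 < table[order] <= file_size]


def crc32_of(selection):
    """CRC-32 over the selected bytes, as an unsigned 32 bit value."""
    return zlib.crc32(selection) & 0xFFFFFFFF


if __name__ == "__main__":
    candidate = bytes([0x00, 0x00, 0x01, 0x2C])  # made-up bytes
    print(decode_table(candidate))
    print(plausible_lengths(candidate, file_size=4096))
```

In a 4096 byte file the four bytes above only make sense as a big endian length (300); read as little endian they would point far beyond the end of the file.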

Thanks a lot for Okteta, it saved my day!

Looking back to 2009

The second year as part time freelancer has passed.

Looking back the most significant things are:

  • Signing the contribution agreement for gdb and glibc with the Free Software Foundation and trying to contribute to both projects. So picking future work will always have to be compatible with this.
  • Hacked on OpenBSC. At first just simple stuff like a telnet interface, paging and later doing paid work for On Waves to add SCCP over IP, GSM 08.08 and other things for “toy” integration of OpenBSC into a real network.
  • In the middle of the year I asked Nokia if they had work for me in Asia; later I started focusing on QtWebKit performance. This allowed me to improve QtWebKit and Qt (which will benefit a lot more users), but also to look into various tools like OProfile, memprof, memusagestat and just now netfilter queues… more on this later.
  • I have done my usual things on OpenEmbedded, working on landing patches through the patchwork queue, finally redoing the Bitbake parser and working on the Qt recipes.
  • I didn’t manage to make a Linux Kernel contribution. I wanted to write an i2c driver for an FM radio chip but I fried my hardware with a broken power supply, and my MIPS patches are not yet done. So if you know of any Kernel work where stuff can be released/upstreamed please let me know!

Kafka’s Briefe an den Vater

I used to read books by Franz Kafka a lot. I loved his struggle with his father, as I could relate to it. I loved the idea of writing letters as a form of liberating myself. But in recent years, as I got older, parents became less important.

So here is the last thing I will ever say about my “father” (and in case he reads it, as even old people can use computers…). I have no memory left of you, I have no picture left from my youth that displays you, I have no gift from you left in my possessions, I will get rid of your last name, you are a lonely old man. Good bye, I don’t care for you.

(Qt)WebKit Sprint in Wiesbaden

The sprint is over for some time. You can see summaries of the different sessions and some slides in the wiki. Besides talking about QtWebKit and how to improve it (API, features, speed, make people aware that they can contribute, influence the release schedule, policies.. *hint*) the thing that has impressed me the most is unrelated to coding.

We all hear when someone from our Community is leaving the Qt department, and we always wonder how life will continue and who will fill the gap. In the last year a couple of new people got hired in Oslo and I’m really impressed how they find such capable people that are technically skilled and willing to move to Oslo! Kudos!

hawkboard.org

Apparently TI wants to continue its success with the beagleboard and apply it to the OMAP-L138 product in the form of the hawkboard, according to a presentation at FOSS.IN. The coolest thing about the hawkboard and the OMAP-L138 is the floating point DSP.