Talking about performance measurements at foss.in

It is the second time I’m at foss.in and this time I was talking about the current work I’m doing on QtWebKit. Nokia is kind enough to give me enough time to explore the performance of QtWebKit (mostly on Qt Embedded Linux and ARM) and to do fixes across the stack in WebKit, Qt or wherever else we think they are necessary.

Performance for me comes in two flavours: memory footprint and runtime speed (how long does it take?). For this I have experimented with OProfile, Memprof/Memusage and QBENCHMARK, but also wrote some WebKit-specific tools: a tool that allows me to mirror webpages to turn them into a benchmark (which still has quite some problems), a simple HTTP server to serve the content, and some test case reductions in order to look into specific areas like networking, image decoding, painting and fonts.
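
For anyone who has not used QBENCHMARK before, here is a minimal sketch of such a micro benchmark; the class name and the measured operation are made up for illustration, only the QBENCHMARK macro and the QTest harness are the real API.

#include <QtTest/QtTest>
#include <QByteArray>

// Minimal QBENCHMARK sketch: QTest runs the body repeatedly and reports
// the time per iteration.
class tst_Lower : public QObject
{
    Q_OBJECT
private slots:
    void benchToLower()
    {
        const QByteArray input(4096, 'A');
        QBENCHMARK {
            QByteArray lowered = input.toLower();
            Q_UNUSED(lowered);
        }
    }
};

QTEST_MAIN(tst_Lower)
#include "tst_lower.moc"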

The slides and links can be found here; they link back to the WebKit wiki where you can find an introduction to the (Qt)WebKit-specific tools, a set of bugs and pending patches, and a set of issues that are known but not yet handled.

The main message of the talk is to not optimise by myth, but to use a stable environment and one of the existing tools to see what is actually going on. It is really easy.

Attending foss.in

Thanks to generous sponsoring I managed to make it to Bangalore for FOSS.IN, and Girish kindly agreed to provide accommodation. It is really great to be in India again, to see the streets and the local market, and to catch up with friends from India and Europe.

Girish is currently struggling with plugin code in QtWebKit for Mac, Win and X11, in both windowed and windowless mode, in the good old world and in QGraphicsView… I’m analyzing the loading behavior of a particular website and now I need to find out why we take a whole second to do layout and sometimes just actively do nothing… It is really awesome to have some clever company here in India.

GSM AMR (Speech Version 3) with OpenBSC

This week I had to make parts of OpenBSC work with TCH/H and use AMR. This work is needed for On Waves, and when I say parts I mean the strict BSC subset of OpenBSC (as opposed to making the MSC code we have work as well).

The first part was to make TCH/H work, and that was easy as LaF0rge had already done almost everything. You have to change the OpenBSC configuration to use TCH/H instead of TCH/F for the given timeslots. The next thing was to make channel assignment work. The Mobile Station (MS) comes in on the Random Access Channel (RACH), asks for a channel and supplies a random number (so it can identify the response). Depending on a global indicator (NECI) the MS will ask for different channel types.

So the next step was to add a NECI setting to our VTY configuration code and then change the code that decodes the channel request to know about the NECI and pick the right channel. On top of that came a small hack to assign a TCH/H when an MS requests “any” channel as part of paging.

Now that TCH/H should work, one has to focus on the speech side. GSM 08.08 and GSM 04.08 have different enums for speech. GSM 08.08 differentiates speech versions 1, 2 and 3 for full and half rate, totalling six different values; GSM 04.08 has a TCH mode that includes speech versions 1, 2 and 3, various data modes and signalling (but no differentiation between full and half rate channels). After getting this right and selecting speech version 3 it still didn’t work. It turned out that one has to fill out the optional MultiRate Configuration when using speech version 3. This multi-rate configuration needs to be present in the GSM 04.08 RR Assignment Command and Modify Channel, but also in the RSL messages for Modify Request and Channel Assignment.
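
To illustrate the difference between the two enums, here is a rough sketch with hypothetical type names; these are neither the actual OpenBSC definitions nor the exact spec values, just the shape of the mapping.

// Hypothetical names, for illustration only. GSM 08.08 encodes the rate in
// the speech version itself, while GSM 04.08 only carries the version in the
// channel mode; the rate follows from the assigned channel (TCH/F or TCH/H).
enum class BssmapSpeechVersion { FullRateV1, FullRateV2, FullRateV3,
                                 HalfRateV1, HalfRateV2, HalfRateV3 };
enum class RrChannelMode { SpeechV1, SpeechV2, SpeechV3 };

static RrChannelMode toChannelMode(BssmapSpeechVersion version)
{
    switch (version) {
    case BssmapSpeechVersion::FullRateV1:
    case BssmapSpeechVersion::HalfRateV1: return RrChannelMode::SpeechV1;
    case BssmapSpeechVersion::FullRateV2:
    case BssmapSpeechVersion::HalfRateV2: return RrChannelMode::SpeechV2;
    case BssmapSpeechVersion::FullRateV3:
    case BssmapSpeechVersion::HalfRateV3: return RrChannelMode::SpeechV3;
    }
    return RrChannelMode::SpeechV1;
}

// For SpeechV3 (AMR) the optional MultiRate Configuration additionally has
// to be filled in, on both the GSM 04.08 and the RSL side.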

After this, AMR on a TCH/H should work (when the BTS supports it too). The next step, for someone else, is to make the MSC code in OpenBSC work with TCH/H and other audio codecs. This would require no longer always asking for a TCH/F and changing the channel request decoding once more…

Visiting On Waves in Iceland

Currently I’m sitting in the nice offices of On Waves and, when not trying to convince the embassy of India to give me a visa, I’m working on OpenBSC. This week I’m trying to make call handling with the MSC rock solid.
So far I have fixed some bugs, added features to OpenBSC, enabled A5/1 encryption, started using TCH/H, started using AMR and fixed bring-up after a nanoBTS coldstart, and now I’m working on the MGCP side to verify that I can hear audio on my calls.

The benefit of using GCC NEON intrinsics

I’m currently writing NEON code for the Qt PorterDuff SourceOver implementation. At the beginning one has to decide whether to use inline assembly, a separate .S file or the ARM NEON intrinsics.

I have chosen to go with the ARM NEON intrinsics embedded into C++ code for a couple of simple reasons. First, it is portable across GCC and RVCT; a .S file or inline assembly written for GCC would not work with RVCT, which is used by the Symbian people. The second reason is that I get type safety. The NEON registers can be seen as 8-bit, 16-bit, 32-bit or 64-bit signed/unsigned lanes; when writing low-level assembly you might pick the wrong operation and it is hard to spot, whereas with the intrinsics you get a compiler warning about your mistake. One downside is that with some easy things I can make my compiler abort with an internal compiler error… but this will change over time.
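
As a small illustration of the type-safety argument, here is a sketch (not the actual Qt code) of the first step of a blend, multiplying eight 8-bit channel values by an alpha value while widening them to 16 bit:

#include <arm_neon.h>
#include <stdint.h>

// Sketch only, not the Qt SourceOver implementation. vmull_u8 accepts two
// uint8x8_t operands and yields a uint16x8_t, so accidentally mixing lane
// widths is rejected by the compiler instead of silently producing garbage.
static inline uint16x8_t mulByAlpha(const uint8_t *src, uint8_t alpha)
{
    uint8x8_t pixels = vld1_u8(src);     // load eight unsigned bytes
    uint8x8_t alphas = vdup_n_u8(alpha); // replicate alpha into all lanes
    return vmull_u8(pixels, alphas);     // eight 16-bit products
}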

Next is the myth that GCC is crap and that the intrinsics are badly “scheduled”. From my looking at the assembly code it is mostly arranged like I wanted it to be. On one simple operation GCC was putting an LDR right in between NEON load and store operations; with a simple change in the code this LDR was gone and I should not see any of the described hazards.

Right now my ARM NEON code is slower than the C code that uses tricks, but that is entirely my fault and I still have some things I can try to make it faster. To be more specific, the ARM NEON code is four frames faster than the old C code that was not using tricks.

Collecting hints to increase performance in Qt (and apps)

I’m working part time on improving the performance of QtWebKit (memory usage and raw speed) and I have created some tools to make an offline copy of a number of webpages (Gmail, Yahoo Mail, Google, news sites…).

Using these sites I have created special-purpose benchmark reductions, e.g. only do the image operations we do while loading, or while loading and painting, or load all network resources. One thing I have noticed is that with a couple of small things one can achieve a stable and noticeable speedup. These include not calling QImage::scanLine from within a loop, avoiding QByteArray::toLower, and not using QByteArray::append(char) in a loop without a QByteArray::reserve first.
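
To make two of these hints concrete, here are small sketches (the surrounding functions are made up for illustration) of the reserve-before-append and hoist-scanLine-out-of-the-loop patterns:

#include <QByteArray>
#include <QImage>

// Illustration only: reserve once, then append in a loop, instead of letting
// the array reallocate repeatedly.
static QByteArray collectBytes(const char *data, int len)
{
    QByteArray out;
    out.reserve(len);
    for (int i = 0; i < len; ++i)
        out.append(data[i]);
    return out;
}

// Illustration only: call QImage::scanLine once per row, not once per pixel.
static void fillRow(QImage &image, int y, uchar value)
{
    uchar *line = image.scanLine(y);
    const int bytes = image.bytesPerLine();
    for (int x = 0; x < bytes; ++x)
        line[x] = value;
}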

I have created a small guide to Qt Performance; I will keep it updated and would like to hear more small hints that can be used to improve things. If it makes sense I can migrate it to the techbase as well.

Painting on ARM

I’m currently working on making QtWebKit faster on ARM (and hopefully later on MIPS hardware) and in my current sprint I’m focused on painting speed. Thanks to Samuel Rødal my work is easier than before. He added a new paint engine and graphics system that allow tracing the painting done with QPainter and replaying it later. Some of you might feel reminded of Carl Worth’s post that did mostly the same for cairo.

How to make painting faster? The Setup

  1. Record a paint trace of your favorite app with tst_cycler -graphicssystem trace; do the rendering and on exit the trace will be generated
  2. Use qttracereplay to replay the trace on your hardware (I had some issues on my target hardware though)
  3. Use OProfile to see where the time is spent and do something about it…
  4. Change the code and go back to qttracereplay…

What have I done so far?
Most samples are recorded in the comp_func_SourceOver routine. After some searching in the MMX-optimized routines and talking to the rasterman, I’m doing the following things to improve the const_alpha=255 path (a sketch follows the list below). In qttracereplay I go from about 17.4 fps to around 26 fps on my BeagleBoard with Qt Embedded Linux on the plain OMAP3 framebuffer, but I still need to do a more careful visual inspection of the result.

  • Handle alpha=0x00 in the source specially by not doing anything
  • Handle alpha=0xff in the source specially by simply copying it to the destination
  • Unroll the above block eight times, interleaved with preloads…
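
Roughly, the special casing looks like the sketch below; this is not the actual Qt comp_func_SourceOver, blendGeneral merely stands in for the real per-channel math, and the unrolling and preloads are left out.

#include <stdint.h>

// Stand-in for the general source-over math on one premultiplied ARGB pixel.
static uint32_t blendGeneral(uint32_t src, uint32_t dst, uint32_t alpha)
{
    uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        const uint32_t s = (src >> shift) & 0xff;
        const uint32_t d = (dst >> shift) & 0xff;
        result |= ((s + (d * (255 - alpha)) / 255) & 0xff) << shift;
    }
    return result;
}

static void sourceOverSpecialCases(uint32_t *dest, const uint32_t *src, int length)
{
    for (int i = 0; i < length; ++i) {
        const uint32_t s = src[i];
        const uint32_t alpha = s >> 24;
        if (alpha == 0x00)
            continue;        // transparent source: leave the destination alone
        else if (alpha == 0xff)
            dest[i] = s;     // opaque source: a plain copy, no blending needed
        else
            dest[i] = blendGeneral(s, dest[i], alpha);
    }
}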

I will have to clean all this up and merge it with the Symbian-optimized copies (which sometimes require ARMv6 or later)… I will probably look at BYTE_MUL now and see if I can make it faster without using an ARMv6-or-later instruction… or, honestly, first understand how the current BYTE_MUL works…
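
My current reading of the BYTE_MUL trick is roughly the sketch below (paraphrased from memory, so details may differ from the real qdrawhelper version): two of the four channels are multiplied at a time by keeping them in separate 16-bit lanes of a 32-bit word, and the division by 255 is approximated with an add-and-shift.

#include <stdint.h>

// Paraphrased sketch of the BYTE_MUL idea, not the authoritative Qt code.
static inline uint32_t byteMulSketch(uint32_t x, uint32_t a)
{
    // blue and red sit in the low bytes of the two 16-bit lanes
    uint32_t t = (x & 0x00ff00ff) * a;
    t = (t + ((t >> 8) & 0x00ff00ff) + 0x00800080) >> 8; // approx. divide by 255
    t &= 0x00ff00ff;

    // green and alpha are handled the same way in the other lanes
    x = ((x >> 8) & 0x00ff00ff) * a;
    x = (x + ((x >> 8) & 0x00ff00ff) + 0x00800080);
    x &= 0xff00ff00;

    return x | t;
}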

DTrace for GNU/Linux — SystemTap

One idea is on my mind: how can libmemusage.so, memprof’s libmemintercept and memprof itself be merged into one tool that provides more and better results? E.g. sometimes you want to trace and get the profile when heap usage peaks, or you want a histogram of allocations. It is possible to write C code for that and integrate it into one tool. On the other hand, with technologies like DTrace one can easily write the histogram generation, profiling, etc. as a trace script.

So what can we do on Linux? The thing coming closest to it is SystemTap. You write a trace script and the script gets compiled into a kernel module that is then loaded and does the probing. In theory one can even trace userspace with it.

The only problem is that SystemTap is not ready yet. To do really useful stuff one has to patch the kernel with the utrace patch, and Ubuntu/Debian do not ship a recent enough elfutils. So it will probably take another year until memprof can be a simple GUI around SystemTap and such, and probably two years until most distros come with a ready-to-use SystemTap.

memprof 0.6.2 release

Today I have released memprof 0.6.2. The most prominent changes are the merge of raster’s timegraph for memory allocations and fixes for various stability bugs introduced after 0.6.0. The code is currently located on gitorious, the release tarball is here, and the shortlog can be seen below:

Cristi Magherusan (2):
      some other minor changes, mostly guint -> gsize’s
      fixed a typo, bug #51556 in the gnome bugzilla

Holger Hans Peter Freyther (10):
      mi-perfctr.c: Remove the O_CREAT (from the openSUSE buildservice)
      memprof.glade: Open and save the file
      Provide a GtkFileChooseButton to select the executable.
      merge rasterman’s extra window
      .gitignore: Ignore generated files
      process_find_line: Clarify who is owning the returned pointer
      detailwin.c: Fix possible crash when opening the maps file fails
      process_locate_symbol: Make sure a valid string is always returned
      add_leaf_to_tree: Avoid running into a crash
      memprof release 0.6.2

Stefan Schmidt (2):
      configure.in: Use AM_SILENT_RULES if available
      stack-frame: Introduce macros for stack pointer regs and use them.

Tomasz Mon (2):
      configure.in: Search for bfd.h provided by binutils development package
      Integrate the detailwin into the main GtkNotebook

William Pitcock (1):
      use elf_demangle() in more places

memprof-0.6.2

What is the size of a QList::Data, RenderObject?

We tend to write classes without really caring about what the compiler will do to create the binary. When looking into performance, and especially memory usage, and you create certain objects thousands of times, it becomes interesting how much memory is wasted on padding for no good reason.

The Linux kernel hackers wrote a tool called pahole that will analyze the DWARF2 symbols and then spit out friendly messages like the one below.


struct Data {
        class QBasicAtomicInt      ref;                 /*     0     4 */
        int                        alloc;               /*     4     4 */
        int                        begin;               /*     8     4 */
        int                        end;                 /*    12     4 */
        uint                       sharable:1;          /*    16:31  4 */

        /* XXX 31 bits hole, try to pack */

        void *                     array[1];            /*    20     4 */

        /* size: 24, cachelines: 1, members: 6 */
        /* bit holes: 1, sum bit holes: 31 bits */
        /* last cacheline: 24 bytes */
};

In this case QList::Data could have used at least three bytes less memory, and changing the definition of sharable and array would have removed a hole in the struct. Maybe that is something for Qt5 to keep in mind.
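
As an illustration of what such a change could look like (a sketch only, not the actual Qt 4 or Qt 5 layout), packing the flag into the same 32-bit word as another member removes the reported 31-bit hole:

#include <QtCore/qatomic.h>   // QBasicAtomicInt, uint

// Sketch only: letting alloc and sharable share one 32-bit word avoids the
// 31-bit hole pahole reported, shrinking the struct from 24 to 20 bytes on a
// 32-bit target. Whether 31 bits stay sufficient for alloc is a separate question.
struct PackedData {
    QBasicAtomicInt ref;
    uint alloc : 31;
    uint sharable : 1;
    int begin;
    int end;
    void *array[1];
};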

The research question: can QtWebKit memory usage be reduced by shrinking some of the Qt structs without losing functionality?