Recipe for disaster

I really hate closed source systems; every time I have to deal with one, it turns out to be a waste of time. Right now I’m dealing with a closed source BSS (I wish I could name names). I can make it fail by sending a very simple message, a message that is crucial for everything (SMS, calls) to terminate on the MS. Of course an expensive support contract was signed, but apparently it does not buy more than the right to create a ticket in a bug tracker and let it rot there. Never ever make your business depend on a black box!

Thoughts on Fedora 12 and 13

I was a bit unhappy with the performance of Ubuntu Karmic and Ubuntu Lucid (mostly the start scripts not handling filesystem errors at all…) and decided it was time to try an RPM-based distribution again. I really don’t like waiting for a DVD download and then upgrading the whole system with new software after the installation. Coming from Debian I totally love the slick network installer. This means I have to be able to install my RPM-based distro like this, or I will stay with Ubuntu or go back to Debian. I tried doing this with OpenSUSE some time ago and it failed miserably, so now it was time for Fedora. So here is a random list of things I hate about Fedora.

  • Netinstaller needing a 200MB download. One can build a whole GNOME image with that…
  • The guided partitioner is nice, and it fails. It picks a 500MB boot partition, which is not enough if you want to use pre-upgrade later to upgrade your system (which silently fails when running out of disk space…). So I wanted to increase the boot partition’s size, but I needed to delete the LG and PV of LVM first. In the end my guided partitioning was all manual, just because I wanted to change the size of one partition… this could be handled way better.
  • YUM is awfully slow and one is not encouraged to do the equivalent of an “apt-get dist-upgrade” with it; one needs to use an installer CD to do the upgrade. This seems to be really backward.
  • I had to learn the wonders of rpm --rebuilddb during a failed upgrade… Something upgraded the Berkeley DB version or such, and the package database was corrupted.
  • Removing pulseaudio (it takes like 50% of CPU time on my netbook) removes bluez..

Besides that Fedora seems to be a robust and well maintained distribution which contains really recent versions and sometimes even the future… like systemtap with a utrace enabled kernel to trace user applications.

Got a new MSI X340 Laptop

After the breakdown of my beloved MacBook I decided it was time for a new notebook. So I went out to the various markets and searched around. I was looking for something slim but still powerful. So could it be a MacBook Air? Looking at the specs revealed that it contains NVIDIA graphics, so that is a no-go. Then I saw a Sony X-Series… very very slim… then my math failed and I thought it was cheap… well, Sonys are never cheap and the repairs are always outrageous… I stumbled across the MSI X-Series and I really liked it…

So my X-340 has an Intel chipset and is a Centrino2; it has one of these ultra low voltage CPUs… and the Linux support rocks. Wireless, Bluetooth, camera (not that I need it), sound, touchpad, graphics (Intel GMA 4500) all work out of the box. The CPU seems to be powerful enough for what I want to do.

So far I seem to have made a good pick.

GSM RACH Bursts and Paging Requests

Yesterday I had the pleasure of trying OpenBSC on a real network and the result was a disaster, but honestly, what else should one expect when trying it for the first time? It is not that OpenBSC was crashing, leaking memory, or not recovering from failure; it is just that the load of the network was different from what I had assumed, and that leads to problems.

What happens is that one sees a lot of location updating requests, which load the SDCCH. That is really fine, and we have seen such things at the Chaos Congress. What is different is the result of the location updating requests: the network floods us with paging requests… Right now we are sending up to 20 paging requests every two seconds… The first thing to notice is that this is too much for the nanoBTS; it is sending us a nice CCCH/ACCH/BCCH overload warning which we do not handle (we should start two timers and throttle the amount of messages we send). The other part is… if we are out of SDCCHs and ask 20 more phones a second to get one… we have created the RACH DoS that Dieter Spaar has done with a Mobile Station.

The Random Access Request contains the channel type and one to, IIRC, four bits of random numbers. So even if we have a free channel… it can happen that two phones both believe we have assigned the channel to them… and then we see RF failures, which in turn trigger the phone to try again (or we page it again)… and then nothing will work….
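To get a feeling for how quickly such collisions become likely, here is a small back-of-the-envelope sketch. It assumes that the colliding phones request the same channel type in the same RACH slot and pick their random bits uniformly and independently; the function name is made up for this illustration and is not part of OpenBSC.

```python
def collision_probability(phones, random_bits):
    """Probability that at least two phones pick the same random value.

    Assumption: every phone draws uniformly from 2**random_bits values,
    independently of the others (the classic birthday problem).
    """
    values = 2 ** random_bits
    if phones > values:
        # More phones than distinct values: a collision is certain.
        return 1.0
    p_all_unique = 1.0
    for already_taken in range(phones):
        p_all_unique *= (values - already_taken) / values
    return 1.0 - p_all_unique


if __name__ == "__main__":
    for bits in (1, 2, 3, 4):
        print(bits, "bits:", collision_probability(2, bits))
```

With only a handful of random bits, even two phones answering in the same slot collide with a noticeable probability, which fits the RF failures described above.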

The other observation is that if our cell is really busy we should start to assign TCHs to fulfill location updating requests….

So there are two changes I need to make. The first is to page not as much as we can physically stuff into the PCH, but only as much as we can handle in responses (pretty obvious?). The other is to allocate a “bigger” channel in case we have no smaller channel free… E.g. use the number of free channels divided by X for paging requests…
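The two ideas can be sketched in a few lines. This is only an illustration of the rule of thumb above, not OpenBSC code: the function names, the default divisor of 2 and the cap of 20 requests are assumptions picked for the example.

```python
def paging_budget(free_sdcch, free_tch, divisor=2, hard_limit=20):
    """How many paging requests to send in the next round.

    The budget follows the number of free channels divided by X
    (here: divisor) and never exceeds hard_limit.
    """
    free_channels = free_sdcch + free_tch
    return max(0, min(hard_limit, free_channels // divisor))


def pick_channel(free_sdcch, free_tch):
    """Prefer the small channel, fall back to a 'bigger' one."""
    if free_sdcch > 0:
        return "SDCCH"
    if free_tch > 0:
        return "TCH"
    return None  # cell is full, the request has to be rejected


if __name__ == "__main__":
    print(paging_budget(free_sdcch=4, free_tch=2))
    print(pick_channel(free_sdcch=0, free_tch=3))
```

The point of the divisor is to leave headroom: not every free channel should be promised to a paged phone, because location updating requests keep arriving at the same time.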

Traveling woes..

I’m just back at the flat. On this trip (in order) my luggage didn’t arrive with the same airplane, my cellphone broke during the flight (the headset speaker is dead), my laptop backlight broke and my luggage didn’t arrive at the final destination… So after doing a backup of the data and removing the disk I will bring my laptop to an Apple dealer, and over the weekend I will try to get my phone replaced… and my luggage should be home later today as well..

Besides that the traveling was fine…

Hacking on OpenBSC

I was invited to visit the On-Waves (they have a shiny new website) office in Paris this week and I was quite busy hacking away on the OpenBSC codebase. On-Waves allows me to play a bit with their MSC and learn more about GSM and in exchange OpenBSC gains a more and more complete and stable GSM A-Interface.

When developing code for OpenBSC we are mostly sitting very close to the BTS, have only one active subscriber, test one thing, restart, test another thing, restart. But as with any piece of software I’m writing, I want OpenBSC to be rock solid, run unattended for years, have no memory leaks, and deal with the nanoBTS going away and coming back and the MSC going away and coming back, all this at any point in time. So far events like Hacking at Random and the Congress have been the ideal testing ground, as they bring many different handsets, subscribers, etc.

My testing was limited to a small set of handsets connected via USB and executing AT commands for call handling and sending SMS. I’m addressing subscribers on the same cell, which means that whenever I make a call I have mobile originated and mobile terminated testing covered. This is done by funny chat scripts that work most of the time. The next thing is to simulate failure. For cases where a specific layer 3 message would have to be sent, we have to wait for a more complete OsmocomBB, so what I can easily do is cut off TCP connections. I have done this with another piece of weird shell magic: I take the output of $RANDOM, treat it as seconds, and then use kill -SIGUSR2 `pidof bsc_msc_ip` to close the MSC connection at a random time. And then I let it run and wait for failures.
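The same trick can be written down in Python. This is a rough sketch and not the shell script mentioned above: the helper names are invented, the pid has to be looked up by the caller (e.g. with pidof), and the upper bound of 32767 simply mirrors the range of bash’s $RANDOM.

```python
import os
import random
import signal
import time


def random_delay(max_seconds=32767, rng=random):
    """Pick a delay in seconds, like treating $RANDOM as seconds."""
    return rng.randint(0, max_seconds)


def signal_after(pid, delay, signum=signal.SIGUSR2,
                 send=os.kill, sleep=time.sleep):
    """Wait for delay seconds, then send signum to the process pid.

    send and sleep are parameters only so the function can be tried
    out without waiting and without a real target process.
    """
    sleep(delay)
    send(pid, signum)


if __name__ == "__main__":
    # Dry run: print what would happen instead of killing anything.
    signal_after(12345, random_delay(max_seconds=5),
                 send=lambda pid, sig: print("would send", sig, "to", pid),
                 sleep=lambda seconds: print("would sleep", seconds))
```

Running this in a loop against the real pid gives the same “close the MSC connection at a random time” behaviour, with the delay range being easy to shrink for shorter test runs.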

I have fixed a bug/issue in the way we release a channel. There are multiple things involved. First there is instructing the BTS that a given channel on a timeslot is open, or closing it (RF Channel Release of RSL). The other part is that on the channel one can have logical applications running (SAPI); this can be call control (SAPI=0) and SMS (SAPI=3). When opening a connection to a Mobile Station (MS), SAPI=0 is always established; when attempting to deliver an SMS we need to open SAPI=3 first.

Now our issue with bringing this down was that whenever we got a SAPI release confirm (we asked for the release and it was released) or a release indication (the MS closed it), we used to respond with an RF Channel Release. So when bringing down a connection where we had delivered an SMS, we would issue an RF Channel Release twice, and the nanoBTS ACKed it twice! To make matters worse, whenever we get an RF Channel Release ACK we mark the channel as free. We had this small window where we got the first RF Channel Release ACK, allocated the channel again, and then got the second RF Channel Release ACK.

I have fixed this issue in multiple ways. The first is to use the T3111 timer to wait until we issue the RF Channel Release. The second is to handle (RF) failures by “blocking” the lchan for a short while to receive multiple errors and release ACKs. The last bit is to properly bring down the channel: when we have SAPI!=0 we bring that down first, then we send the SACCH deactivate, followed by the SAPI=0 release, and then finally we send the RF Channel Release. This makes things more reliable on our side, but we need to fix some more things. There is a FIXME inside gsm_04_08_utils.c that mentions the start of a T3109 timer. In any case, when sending a SAPI release the BTS will answer with success or a timeout, and we handle both.
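The ordering and the “only one RF Channel Release” rule can be modelled in a few lines. This is a toy model for illustration and not the OpenBSC implementation: the class and function names are invented, messages are plain strings, and placing the T3111 wait directly before the RF Channel Release is my reading of the description above.

```python
def teardown_steps(active_sapis):
    """Return the ordered steps for releasing one lchan."""
    steps = []
    # Bring down every non-zero SAPI (e.g. SAPI=3 for SMS) first.
    for sapi in sorted(s for s in active_sapis if s != 0):
        steps.append(f"release SAPI={sapi}")
    steps.append("deactivate SACCH")
    if 0 in active_sapis:
        steps.append("release SAPI=0")
    steps.append("wait for T3111")
    steps.append("RF Channel Release")
    return steps


class Lchan:
    """Tracks open SAPIs and sends RF Channel Release at most once."""

    def __init__(self, sapis):
        self.sapis = set(sapis)
        self.sent = []

    def sapi_released(self, sapi):
        # Called for a release confirm as well as a release indication.
        self.sapis.discard(sapi)
        if not self.sapis and "RF Channel Release" not in self.sent:
            self.sent.append("RF Channel Release")


if __name__ == "__main__":
    print(teardown_steps({0, 3}))
```

The old behaviour corresponds to appending an RF Channel Release on every sapi_released call, which is exactly what produced the double ACK when an SMS had been delivered.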

Today I addressed losing the RSL or OML connection to the nanoBTS, making sure we reconnect and do not leak any memory. It took me most of the day to get this stable, and I found a bug or such inside osmocore/select.c when releasing a bsc_fd that is the last one in the list. The difficulty here is making sure we do not leak memory, close all file descriptors, close all channels that were running over the RSL connection, and make sure that when the BTS is up again we can use the channels that were allocated during the failure. To help with testing I added two commands to our vty interface to drop the OML or the RSL connection of a given BTS. The other thing that was helpful is to use Linux’s Netfilter to drop packets on a TCP connection and wait for a failure. Now I can simulate most of the network failures easily and could build some trust.

And my final wishlist item would be to have something like 16 GTA02 boards, use FSO on each and run a simple script to dial, send SMS and pick up phone calls; this would allow me to heavily test the network in an automated way. On top of that I would like to have an OsmocomBB-enabled Calypso or C123, and then I could even send messages that are normally not sent at all. And thanks to Free Software development I’m sure we are going to reach that goal.

GSM Fail…

Hi,

okay… I have a simple task… dial some numbers, drop some SMS. Right now I’m using a whacky mix of shell and chat script to do it but it is not as nice and reliable as it should be. So I was trying ofono and fso to use my serial device to do some calls for me… Or at least I tried to. FSO was relatively easy to install on Debian, having a nice readme, picking a GSM driver… well and then… how do I specify the config value… looking at the source, finding stuff… taking an old example from Charlie… nothing… hmm.. Okay, on to ofono…

So ofono is written to make the life of designers easier; it must be so easy that no documentation is required… after finding some examples in the test directory of the code… I send one message to enable the power of my modem and hope the VoiceCallManager comes up… instead I see a segfault in ofonod… *sigh*

I think I will write a custom application to send and parse some simple AT commands for now… how frustrating.
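The parsing half of such a tool is small. Below is a minimal sketch of how the lines coming back from a modem could be split into information text and a final result code. It only knows the result codes OK, ERROR, +CME ERROR and +CMS ERROR, the function name is made up, and talking to the serial device is left out on purpose.

```python
FINAL_ERROR_PREFIXES = ("+CME ERROR", "+CMS ERROR")


def parse_at_response(lines):
    """Split modem output into (result, information_lines).

    result is "OK", an error line, or None when no final result code
    has been seen yet (the caller should keep reading in that case).
    """
    info = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # modems pad responses with empty lines
        if line == "OK":
            return "OK", info
        if line == "ERROR" or line.startswith(FINAL_ERROR_PREFIXES):
            return line, info
        info.append(line)
    return None, info


if __name__ == "__main__":
    print(parse_at_response(["", "+CSQ: 20,99", "", "OK", ""]))
```

Unsolicited result codes such as RING would need extra handling on top of this, since they can show up between a command and its answer.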

Using oeaudit.py

In the last few days I have cleaned up my OE Audit tool and it should be usable by everyone now. The tool requires two inputs: one is the list of packages to be built by OE for the given configuration (distro, machine) and the other is the FreeBSD auditfile. The FreeBSD auditfile can be downloaded automatically.

Without much more ado, here we go.

$ bitbake -s > available
$ export PYTHONPATH=/place/bitbake/lib
$ /OE/contrib/oeaudit/oe_audit.py -f
This will fetch the auditfile from the FreeBSD project for you
$ /OE/contrib/oeaudit/oe_audit.py -a auditfile -p available
Now you will see a list of vulnerabilities in the packages OE is going to use.

Dealing with security issues in the context of OpenEmbedded

One thing that bothered me while being at Openmoko is the lack of security response by the OpenEmbedded crew. In one way a security issue is just like any other bug, and distros don’t upgrade each package for each bug fixed upstream. But it gets worse when the security issue exists in the default installation, in a daemon listening to network traffic and such, with ready-made exploits available on the net.

I think it is really unethical to go around and claim how great OpenEmbedded is while companies like Openmoko, Palm, etc. ship vulnerable software to their users. It is easy to pass the buck to the companies actually using OpenEmbedded; let me say it is too easy.

There are various things one can do to address these problems. One option is to downgrade and use the classic Buildroot, as their maintainers seem to address vulnerabilities in time. I use the word downgrade as these systems provide less functionality and flexibility than OpenEmbedded, e.g. they lack the creation of SDKs and choosing the libc (glibc, uclibc, eglibc), but then again they do their homework and provide people with security updates in time. The other option is to go to a distribution like Debian or Fedora with a proven track record of handling security issues.

But I’m going to talk about the third option, which is improving OpenEmbedded. I had the idea while being at Openmoko, but the guy who was assigned to do this was laid off shortly afterwards, so it never happened. In general, for every package we ship in OpenEmbedded there is an established distribution (e.g. FreeBSD, Debian, Fedora) that is shipping it as well. Or, in the rare cases where OpenEmbedded is the first adopter, the software is kept current and there is not much security research anyway. This means that to provide security upgrades to our users we only need to monitor the big and established guys, and that sounds like something that can be partially scripted.

I’m using FreeBSD on my servers, and in the FreeBSD world there is an application called portaudit which looks at your built/installed packages, compares the name, package version and patch release to a list of known security issues in the ports tree, and then asks you to upgrade. Gentoo has a similar XML file for each security incident, and Debian has a security feed as well.

Long story short, on a flight to Iceland I was hacking on a Python script called oe_audit.py that uses the FreeBSD auditfile and the output of “bitbake -s” (the list of providers and their versions) and then starts comparing these lists. Right now the script is inside the OE tree; it is still a gross hack, but I will improve it to be a proper Python script. In its first version it found issues with plenty of packages in OpenEmbedded, and thanks to the help of some people we are down to only a couple of issues in our tree. In general addressing security issues is not that hard: follow a couple of mailing lists, look at websites, and when a CVE is published search for the patch, apply it to our version, be done. Especially given the number of OE developers we could nominate a security sheriff each week who has to do the upgrades… It is not that we see more than three upgrades a week anyway… So this week it would have been Pango, PHP, Pulseaudio…
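To show the idea of comparing the two lists, here is a heavily simplified sketch. It is not the code of oe_audit.py. It assumes that a “bitbake -s” line looks like “name epoch:version-rN”, that an audit line looks like “name<fixed_version|url|description”, and that versions can be compared as tuples of their numbers; all function names and the sample data are invented for the example.

```python
import re


def version_key(version):
    """Naive version ordering: compare the numeric parts only."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def parse_available(text):
    """Map package name to version from 'name epoch:version-rN' lines."""
    packages = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or ":" not in fields[1]:
            continue  # header or unrelated output
        version = fields[1].split(":", 1)[1].rsplit("-r", 1)[0]
        packages[fields[0]] = version
    return packages


def parse_audit(text):
    """Keep only the simple 'name<version|url|description' entries."""
    entries = []
    for line in text.splitlines():
        if line.startswith("#") or line.count("|") < 2:
            continue
        pattern, url, description = line.split("|", 2)
        if "<" not in pattern:
            continue
        name, fixed = pattern.split("<", 1)
        if ">" in name or "=" in name or fixed.startswith("="):
            continue  # ranges and '<=' are out of scope for this sketch
        entries.append((name, fixed, url, description))
    return entries


def find_vulnerable(packages, entries):
    """List (name, our_version, description) for affected packages."""
    return [(name, packages[name], description)
            for name, fixed, url, description in entries
            if name in packages
            and version_key(packages[name]) < version_key(fixed)]


if __name__ == "__main__":
    available = "examplepkg    0:1.24.4-r0\nother    0:1.2.3-r5\n"
    audit = ("examplepkg<1.24.5|http://example.invalid/a|made-up issue\n"
             "other<1.2.3|http://example.invalid/b|another made-up issue\n")
    print(find_vulnerable(parse_available(available), parse_audit(audit)))
```

A real tool has to deal with much more: version ranges, port epochs, and package names that differ between FreeBSD ports and OE recipes.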

Using setitimer of your favorite POSIX kernel

As this was requested in a comment on a previous post, and as knowing your kernel helps to write better performing systems, here is a short note on how to use the interval timers provided by your POSIX kernel.

What is the interval timer?

The interval timer is managed and provided by your kernel. Every time the interval of the timer expires, the kernel will send a signal to your application. The kernel provides three different interval timers for every application. The different timers measure the real time passed on the system, the time your application is actually executing, and finally there is the profiling timer, which measures the time your application is executing plus the time the system is executing on behalf of your application. More information can be found in the manpage named setitimer.

Why is it useful?

In the QtWebKit Performance Measurement Utilities we are using the interval timer as the timing implementation for our benchmark macros. To be more precise, we are using ITIMER_PROF to measure the time we spend executing in the system and in the application, and we are using the smallest possible interval of this timer, one microsecond. The big benefit of using this instead of elapsed real time, e.g. with QTime::elapsed, is that we do not depend so much on system scheduling. This can be really nice, as even on a lightly loaded system we can generate stable times; the only thing influencing the timing is the MHz of the CPU.

How is it implemented?

It is a kernel timer, this means that it is implemented in your kernel. In case of Linux you should be able to find a file called kernel/itimer.c, it defines the syscall setitimer at the bottom of the file. In our case the SIGPROF seems to be generated in kernel/posix-cpu-timers.c in the check_cpu_itimer routine. Of course the timer needs to be accounted by things like kernel/sched.c when scheduling tasks to run…

How to make use of it?

We want to use ITIMER_PROF, according to the manpage this will generate the SIGPROF. This means we need to have a signal handler for that, then we need to have a way to start the timer. So let us start with the SIGPROF handling.

Elapsed time handling
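#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
static volatile sig_atomic_t sig_prof = 0; /* SIGPROF signals seen so far */
static void sig_profiling(int signum)
{
    (void)signum; /* handlers installed via sa_handler receive the signal number */
    ++sig_prof;
}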

The signal handler
    struct sigaction sa;
    sa.sa_handler = sig_profiling;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART; /* restart interrupted system calls */
    if (sigaction(SIGPROF, &sa, 0) != 0) {
        fprintf(stderr, "Failed to register signal handler.\n");
        exit(EXIT_FAILURE);
    }

Start the timer
static void startTimer()
{
    sig_prof = 0;
    struct itimerval tim;
    tim.it_interval.tv_sec = 0;
    tim.it_interval.tv_usec = 1;
    tim.it_value.tv_sec = 0;
    tim.it_value.tv_usec = 1;
    setitimer(ITIMER_PROF, &tim, 0);
}
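For quick experiments the same kernel timer can also be reached from Python through signal.setitimer. The following is a minimal sketch and not the code from the QtWebKit utilities: the helper names are invented, and the 10 ms interval is an arbitrary choice to keep the number of signals the interpreter has to handle low. It needs a Unix system and has to run in the main thread.

```python
import signal
import time

ticks = 0


def on_sigprof(signum, frame):
    """Count every SIGPROF the kernel delivers to us."""
    global ticks
    ticks += 1


def count_ticks(work, interval=0.01):
    """Run work() with ITIMER_PROF armed and return the tick count."""
    global ticks
    ticks = 0
    signal.signal(signal.SIGPROF, on_sigprof)
    # First expiry after `interval` seconds of CPU time, then periodic.
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    try:
        work()
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0, 0)  # disarm the timer
    return ticks


def burn_cpu(seconds=0.3):
    """Keep the CPU busy until `seconds` of CPU time were consumed."""
    end = time.process_time() + seconds
    while time.process_time() < end:
        pass


if __name__ == "__main__":
    print("ticks while computing:", count_ticks(burn_cpu))
    print("ticks while sleeping: ", count_ticks(lambda: time.sleep(0.3)))
```

The number of ticks times the interval approximates the CPU time spent in work(), which is the same idea as counting sig_prof in the C snippets above.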

Discussion of the implementation

What is missing? We are using the sigaction API… we should make use of the siginfo_t passed inside the signal handler.

What if we need a higher precision or need to handle overflows?
There is the POSIX.1b timer API, which provides timers in the nanosecond range and also provides information about overruns (e.g. when the signal could not be delivered in time). More information can be found by looking at the timer_create functions.

When is the interval timer not useful?

Imagine you want to measure time it takes to complete a download and someone wrote code like this:

QTimer::singleShot(300000, this, SLOT(finishDownload()));

In this case a lot of real time will pass before the download finishes, and the app might be considered very slow. But in terms of the itimer only very little is executed, as the time we just sleep is not accounted to us. This means the itimer can be the wrong thing to use when you want to measure real time, e.g. latency or the time to complete network operations.
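The difference is easy to see by measuring the same piece of work with a wall clock and with a CPU time clock. This sketch uses time.process_time, which counts the user and system CPU time of the process, as a stand-in for what ITIMER_PROF accounts; the measure helper is made up for the example and the sleep plays the role of the waiting download.

```python
import time


def measure(work):
    """Return (wall_seconds, cpu_seconds) spent in work()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


if __name__ == "__main__":
    wall, cpu = measure(lambda: time.sleep(0.2))
    print("wall clock:", wall)
    print("cpu time:  ", cpu)
```

The wall clock reports the time the user actually waited, while the CPU time stays close to zero because a sleeping process is not scheduled.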