Bitbake parser performance

There is one thing in OE that is pretty time consuming: if you are trying to get variables right that influence the entire system (SRCREV, DISTRO, MACHINE…), you will find yourself running bitbake over and over again, and each run will take several minutes of your time. There are two solutions to this problem, and both are very good.

  • Have less data. If there is less data, less data needs to be parsed. Apparently Poky is doing that with their meta module. You can also see this with GNOME's jhbuild, where you have to enable modulesets to get more data. For OE we should consider splitting up things that are orthogonal to each other, e.g. GUI and console networking tools. For distro- and autobuilders this will not make a difference; for the average Joe it might. I'm not sure we should split OE into several independent modules just for parsing speed.
  • Parse faster. This is what I will talk about now.

I have been bothered by the Bitbake parser for several years now. All these years I had a paid job/main contractor, was doing work for them, and never finished my work on the various approaches based on Marc Singer's lexer/grammar. As of now this obstacle is gone, I have plenty of time (wanna change that?), and for the last two days I was working on the Bitbake parser.

The previous approach was to build on Marc Singer's flex and lemon work (fixed to really parse everything), try to hook it into Python, and try to hook it into the rest of Bitbake. At some point this always stalled, because it is very hard to verify that the new parser is doing things properly, and it is quite frustrating as well. lemon/flex is pretty fast at figuring out the structure, but we have quite some Python code in our metadata which needs to be executed, and so far this has not been optimized. While one is able to lex and analyze the grammar in under a second, it takes quite some time to execute the Python code. Anyway, all of my previous attempts stalled at some point, mostly when trying to verify…
So this time I just ignored how horrible our current regexp-based scanner is and decided to turn it, step by step, into a parser that creates a syntax tree/list and then evaluates this list/tree into the bb.data dictionary.
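To illustrate the idea, here is a minimal sketch (with hypothetical names, not the actual Bitbake code): instead of mutating the data store while scanning, the regexp scanner emits nodes, and a separate evaluation pass replays those nodes into a bb.data-like dictionary.

```python
import re

class AssignNode:
    """Represents a plain `VAR = "value"` assignment found by the scanner."""
    def __init__(self, key, value):
        self.key = key
        self.value = value

    def eval(self, data):
        # Evaluation is the only step that touches the data store.
        data[self.key] = self.value

# A single toy pattern stands in for the scanner's many regexps.
ASSIGN_RE = re.compile(r'^\s*(\w+)\s*=\s*"([^"]*)"\s*$')

def parse(lines):
    """First pass: scan the lines and build a list of nodes (the 'tree')."""
    tree = []
    for line in lines:
        m = ASSIGN_RE.match(line)
        if m:
            tree.append(AssignNode(m.group(1), m.group(2)))
    return tree

def evaluate(tree, data):
    """Second pass: evaluate the tree into the dictionary."""
    for node in tree:
        node.eval(data)

data = {}
evaluate(parse(['PN = "busybox"', 'PV = "1.2.1"']), data)
# data now holds {'PN': 'busybox', 'PV': '1.2.1'}
```

The point of the split is that the first pass can be done once per file while the second pass stays cheap enough to run many times.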
The first commits attempted to move the actual data handling out of the line-based regexp handling into a new Python module. Afterwards I turned all these methods into methods that create a Node for the syntax tree and immediately evaluate it, to match the current behavior. Finally I was able to change that to evaluate the tree into a bb.data at the end of the parsing. So I have successfully (git-bisectably) converted the current scanner into something producing the AST and then evaluating it. When parsing the OE metadata, certain files like *.inc or *.bbclass get parsed over and over again. With the above change we can scan these files once, keep the syntax tree around, and then just evaluate it again.
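The payoff for repeatedly included files can be sketched like this (again with hypothetical names): the expensive scan runs once per file, the resulting tree is cached, and only the cheap evaluation step runs for each recipe that includes the file.

```python
import re

ASSIGN_RE = re.compile(r'^\s*(\w+)\s*=\s*"([^"]*)"\s*$')
_tree_cache = {}  # filename -> parsed tree (here: a list of (key, value))

def parse_lines(lines):
    """Expensive step: scan the text into a reusable tree."""
    return [(m.group(1), m.group(2))
            for m in (ASSIGN_RE.match(line) for line in lines) if m]

def load(filename, read_lines):
    """Parse a file once; every later include reuses the cached tree."""
    if filename not in _tree_cache:
        _tree_cache[filename] = parse_lines(read_lines(filename))
    return _tree_cache[filename]

def evaluate(tree, data):
    """Cheap step: replay the cached tree into a fresh dictionary."""
    for key, value in tree:
        data[key] = value
```

With this shape, two recipes that both include `base.inc` trigger exactly one scan of that file; only `evaluate` runs twice, which is exactly the behavior described above.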

I ended up with something like 27 patches against Bitbake, plenty of baby steps, each with high confidence that there are no regressions, and this brought the parsing time down from 3m9.573s to 2m35.994s on my rusty MacBook.

There is some more work ahead to improve this situation: move away from the regexps to PLY, attempt multithreaded parsing, attempt to write a peep-hole optimizer (???), look at the data module again… quite some time is spent in the cache, too…
