Benchmarking compositor performance

Recently Phoronix did an article about performance under different compositing and non-compositing window managers. GNOME Shell didn’t do that well, so lots of people pointed it out to me. Clearly there was a lot of work put into making measurements for the article, but what is measured is a wide range of 3D fullscreen games across different graphics drivers, graphics hardware, and environments.

Now, if what you want to do with your Linux system is play 3D games, this is very relevant information, but it really says nothing about performance in general, because the obvious technique to use when a 3D game is running is to “unredirect” the game – and let it display normally to the screen without interference from the compositor. Depending on configuration options, both Compiz and KWin will unredirect, while GNOME Shell doesn’t do that currently, so this (along with driver bugs) probably explains the bulk of the difference between GNOME Shell and other environments.

Adel Gadllah has had patches for Mutter and GNOME Shell to add unredirection for over a year, but I’ve dragged my feet on landing them, because there were some questions about when it’s appropriate to unredirect a window and when not that I wasn’t sure we had fully answered. We want to unredirect fullscreen 3D games, but not necessarily all fullscreen windows. For example, a fullscreen Firefox window is much like any other window and can have separate dialog windows floating above it that need compositing manager interaction to draw properly.

We should land some sort of unredirection soon to benefit 3D gamers, but really, I’m much more interested in compositing manager performance in situations where the compositing manager actually has to composite. So, that’s what I set out this week to do: to develop a benchmark to measure the effect of the compositing manager on application redraw performance.

Creating a benchmark

The first thing that we need to realize when creating such a benchmark is that the only drawing that matters is drawing that gets to the screen. Any frames drawn that aren’t displayed by the compositor are useless. If we have a situation where the application is drawing at 60fps, but the compositor only is drawing 1fps, that’s not a great performing compositor, that’s a really bad performing compositor. Application frame rate doesn’t matter unless it’s throttled to the compositor frame rate.

Now, this immediately gets us to a sticky problem: there are no mechanisms to throttle application frame rate to the compositor frame rate on the X desktop. Any app that is doing animations or video, or anything else, is just throwing frames out there and hoping for the best. Really, doing compositor benchmarks before we fix that problem is just pointless. Luckily, there’s a workaround that we can use to get some numbers out in the short term – the same damage extension that compositors use to find out when a window has been redrawn and has to be recomposited to the screen can also be used to monitor the changes that the compositor is making to the screen. (Screen-scraping VNC servers like Vino use this technique to find out what they need to send out over the wire.) So, our benchmark application can draw a frame, and then look for damage events on the root window to see when the drawing it has done has taken effect.

This looks something like:

In the above picture, what is shown is a back-buffer to front-buffer copy that creates damage immediately, but is done asynchronously during the vertical blanking interval. The MESA_copy_sub_buffer GL extension basically does this, with the caveat that (for the Intel and AMD drivers) it can entirely block the GPU while waiting for the blank.

I’ve done some work to develop this idea into a benchmark I’m calling xcompbench. (Source available.)
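
To make the idea concrete, here is a minimal simulation sketch of the measurement loop. It is not xcompbench's code and talks to no X server; the function names and the model (the compositor repaints at fixed multiples of a hypothetical `compositor_period`, and root-window damage is reported at that moment) are assumptions for illustration only.

```python
import math


def free_running_fps(draw_time):
    """Frame rate an application reports when it never waits for anyone."""
    return 1.0 / draw_time


def damage_throttled_fps(draw_time, compositor_period, duration=10.0):
    """Frames per second that are confirmed on screen.

    Assumed model: the application draws a frame (taking draw_time seconds),
    then blocks until it sees damage on the root window. The compositor
    repaints only at multiples of compositor_period, and that repaint is
    what produces the root-window damage.
    """
    now = 0.0
    frames = 0
    while True:
        drawn_at = now + draw_time
        # First compositor repaint at or after the moment the frame was drawn.
        # The small epsilon guards against floating-point round-up.
        shown_at = math.ceil(drawn_at / compositor_period - 1e-9) * compositor_period
        if shown_at > duration:
            break
        frames += 1
        now = shown_at  # root damage seen; start drawing the next frame
    return frames / duration


# An application that can draw at 60fps, behind a compositor repainting once a second:
app_fps = free_running_fps(1 / 60)
shown_fps = damage_throttled_fps(1 / 60, compositor_period=1.0)
```

With these inputs the application alone would claim about 60fps, while only one frame per second is ever confirmed by root-window damage, which is the number that matters.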

Initial Results

Below is a graph of some results. What is shown here is the frame rate of a benchmark that blends a bunch of surfaces together via cairo as we increase an arbitrary “load factor” which is proportional to the number of surfaces blended together. Since having only one window open isn’t normal, the results are shown for different “depths”, which are how many xterms are stacked underneath the benchmark window.

Compositor Benchmark (Cairo Blending)

So, what we see above is that if we are drawing to an offscreen pixmap, or we are running with metacity and no compositing, frame rate decreases smoothly as the load factor increases. When you add a compositor, things change: if you look at the solid blue line for mutter you see the prototypical behavior – the frame rate pins at 60fps (the vertical refresh rate) until it drops below it, then you see some “steps” where it preferentially runs locked to integral fractions of the frame rate – 40fps, 30fps, 20fps, etc. Other things seen above – kwin runs similarly to mutter with no other windows open, but drops off as more windows are added, while mutter and compiz are pretty much independent of the number of windows. And compiz is running much slower than the other compositors.
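
The “steps” can be illustrated with a deliberately simplified model, sketched below. It assumes every frame has to wait for the next vertical blank before the following frame can start, so each frame occupies a whole number of refresh intervals. The function name is hypothetical, and this is not a description of what any of the compositors actually do.

```python
import math


def vsync_locked_fps(draw_time, refresh_hz=60.0):
    """Frame rate when each frame occupies a whole number of refresh intervals."""
    interval = 1.0 / refresh_hz
    # Whole refresh intervals consumed by one frame, never fewer than one.
    # The small epsilon guards against floating-point round-up.
    intervals = max(1, math.ceil(draw_time / interval - 1e-9))
    return refresh_hz / intervals


rates = [vsync_locked_fps(ms / 1000.0) for ms in (5, 16, 17, 25, 34, 40)]
# rates == [60.0, 60.0, 30.0, 30.0, 20.0, 20.0]
```

This model only produces 60/n plateaus (60, 30, 20, 15, ...). It does not reproduce the 40fps step in the graph; an average of 40fps would correspond to frames alternating between one and two refresh intervals, which the sketch makes no attempt to capture.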

Since the effect of the compositor on performance depends on what resources the compositor and application are competing for, it clearly matters what resources the benchmark is using – is it using CPU time? is it using memory bandwidth? is it using lots of GPU shaders? So, I’ll show results for two other benchmarks as well. One draws a lot of text, and another is a simple GL benchmark that draws a lot of vertices with blending enabled.

Compositor Benchmark (Text Drawing)

Compositor Benchmark (GL Drawing)

There are some interesting quirks there that would be worth some more investigation – why is the text benchmark considerably faster drawing offscreen than running uncomposited? why is the reverse true for the GL benchmark? But the basic picture we see is the same as for the first benchmark.

So, this looks pretty good for Mutter right? Well, yes. But note:

It’s all about Timing

The reason Compiz is slow here isn’t that it has slow code, it’s that the timing of when it redraws is going wrong with this benchmark. The actual algorithm that it uses is rather hard to explain, and so are the ways it interacts with the benchmark badly, but to give a slight flavor of what might be going on, take a look at the following diagram.

If a compositor isn’t redrawing immediately when it receives damage from a client, but is waiting a bit for more damage, then it’s possible it might wait too long and miss the vertical blank entirely. Then the frame rate could drop way down, even if there was plenty of CPU and GPU available.
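
A rough sketch of that failure mode, under stated assumptions: damage arrives at some offset after a vertical blank, the compositor waits a fixed time for more damage, then spends a fixed time compositing, and the result reaches the screen at the first vertical blank after it is ready. The names and numbers are hypothetical and are not taken from the Compiz code.

```python
import math


def vblanks_until_shown(damage_offset, wait, composite_time, refresh_hz=60.0):
    """How many vertical blanks pass before the damaged frame is on screen.

    All times are in seconds, measured from the most recent vertical blank.
    """
    interval = 1.0 / refresh_hz
    ready = damage_offset + wait + composite_time
    # The small epsilon guards against floating-point round-up.
    return max(1, math.ceil(ready / interval - 1e-9))


def resulting_fps(damage_offset, wait, composite_time, refresh_hz=60.0):
    """Frame rate if the client only draws again once its frame is shown."""
    return refresh_hz / vblanks_until_shown(
        damage_offset, wait, composite_time, refresh_hz
    )


# Damage arrives 4ms after the blank and compositing takes 3ms.
immediate = resulting_fps(0.004, wait=0.0, composite_time=0.003)   # 60.0
delayed = resulting_fps(0.004, wait=0.012, composite_time=0.003)   # 30.0
```

In the second case the machine is idle for most of each 16.7ms interval, yet the 12ms wait pushes completion past the blank and the frame rate halves.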

Future directions

One thing I’d like to do is to be able to extract a more compact set of numbers. The charts above clearly represent relative performance between different compositors, but individual data points tell much less. If someone runs my benchmark and reports that on their system, kwin can do 45 fps when running at a load factor of 8 on the blend benchmark, that is most representative of hardware differences and not of compositor code. The ratio of the offscreen framerate to the composited framerate at the “shoulder” where we drop off from 60fps might be a good number. If one compositor drops off from 60fps at an offscreen framerate of 90fps, but for a different compositor we have to increase the load factor so that the offscreen framerate is only 75fps at the shoulder, then that should be a mostly hardware independent result.
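
As a sketch of how such a number might be extracted, the function below walks a series of (offscreen fps, composited fps) pairs ordered by increasing load factor and reports the ratio at the last point that still holds the refresh rate. The function name, the tolerance, and the sample data are all made up for illustration; only the 90fps and 75fps shoulder values come from the example above.

```python
def shoulder_ratio(samples, refresh_hz=60.0, tolerance=0.5):
    """Ratio of offscreen to composited frame rate at the shoulder.

    samples: (offscreen_fps, composited_fps) pairs, ordered by increasing
    load factor. The shoulder is taken to be the last sample whose composited
    frame rate is still within `tolerance` of the refresh rate.
    Returns None if the composited rate never reaches the refresh rate.
    """
    shoulder = None
    for offscreen_fps, composited_fps in samples:
        if composited_fps >= refresh_hz - tolerance:
            shoulder = (offscreen_fps, composited_fps)
        else:
            break
    if shoulder is None:
        return None
    return shoulder[0] / shoulder[1]


# Hypothetical measurements, not real benchmark output.
compositor_a = [(150, 60), (120, 60), (90, 60), (80, 40), (60, 30)]
compositor_b = [(150, 60), (120, 60), (90, 60), (75, 60), (70, 40)]

ratio_a = shoulder_ratio(compositor_a)  # 90 / 60 = 1.5
ratio_b = shoulder_ratio(compositor_b)  # 75 / 60 = 1.25
```

A lower ratio means the compositor holds 60fps with less offscreen headroom, so in this made-up data the second compositor adds less overhead.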

It is also important to look at the effect of going from a “bare” compositor to a desktop environment. The results above are with bare compiz, kwin, and mutter, and not with Unity, Plasma, or GNOME Shell. My testing indicates that results with GNOME Shell are pretty similar to those with bare mutter. Can I put numbers to that? Is the same true elsewhere?

And finally, how do we actually add proper synchronization instead of using the damage hack? I’ve done an implementation of an idea that came out of a discussion between me and Denis Dzyubenko a couple of years ago, and it looks promising. This blog post is, however, too long already to give more details at this point.

My goal here is that this is a benchmark that we can all use to figure out the right timing algorithms and get them implemented across compositors. At that point, I’d expect to see only minimal differences, because the basic work that every compositor has to do is the same: just copy the area that the application updated to the screen and let the application start drawing the next frame.

Test Configuration

Intel Core i5 laptop @2.53GHz, integrated intel Ironlake graphics
KWin 4.6.3
Compiz 0.9.4
Mutter 3.0.2

Update: The sentence “why is the text benchmark considerably faster drawing offscreen than running uncomposited” was originally reversed. Pointed out by Benjamin Otte and fixed.

This entry was written by Owen and posted on June 13, 2011 at 4:34 pm.

19 Comments

csslayer

Posted June 13, 2011 at 10:45 pm | Permalink

KWin already tries to use a custom X property, _KDE_NET_WM_BLOCK_COMPOSITING, to block compositing.

Would GNOME support this and push it to NETWM together?

http://blog.martin-graesslin.com/blog/2011/04/turning-compositing-off-in-the-right-way/
- Owen
  
  Posted June 13, 2011 at 11:22 pm | Permalink
  
  In general, a hint that a window would be better off without compositing sounds like a reasonable thing, and I 100% agree it should be done at the _NET_WM level rather than with different properties for each environment. There would obviously have to be discussion of exactly what it means. Does it mean “turn off compositing for this window” or “unredirect this window”? – It sounds like Martin added the property in conjunction with a feature to turn compositing off entirely but we’re not going to do that for GNOME Shell. And the set of windows that you want to turn off compositing for is potentially different than the set of windows that you want to unredirect.
  - mgraesslin
    
    Posted June 14, 2011 at 12:41 pm | Permalink
    
    It allows turning compositing off completely, and that’s the reason why I haven’t posted it to the NETWM spec yet, as it doesn’t make sense for GNOME Shell and Unity.
    
    Though it could be turned into a “the Compositor may decide to unredirect the window or turn compositing off completely”.
    
    Btw I didn’t implement it, I only blogged about it 😉
  - liam
    
    Posted June 14, 2011 at 3:33 pm | Permalink
    
    Hi Owen,
    
    Not sure if that was a typo but if not, what is the difference between unredirecting and not compositing a window?
    
    Best/liam
smspillaz

Posted June 14, 2011 at 4:55 am | Permalink

I think the way compiz works is that we wait for the next “redraw” time on the timer when we hit a damage event to redraw rather than redrawing the scene immediately on damage.
- smspillaz
  
  Posted June 14, 2011 at 4:56 am | Permalink
  
  Perhaps in this case it’s more appropriate to try and redraw the scene into the backbuffer first and then wait for the vblank time to copy into the front buffer. Something that’s worth looking into I guess.
  - Owen
    
    Posted June 14, 2011 at 10:54 am | Permalink
    
    Though I spent quite a bit of time reading the compiz code and putting in debugging print statements to try and figure out what was going on with my benchmark, I think all I’ll say is that there is a lot of different stuff going on there! (especially without my notes at hand.)
    
    I did a write-up on timing frame display a couple of years ago: https://blog.fishsoup.net/2009/06/02/timing-frame-display/
    (There was a promised follow-on post to that specifically about application/compositor interaction that never happened.) My basic conclusion there was that the best thing is if you have asynchronous frame completion events, failing that, there are marginal advantages to trying to time drawing to complete just before vblank. But if that goes wrong, you pay a large penalty… and in complicated situations, with multiple apps hitting the CPU and GPU, frame drawing times are not very predictable.
    
    The X and GL stack doesn’t really provide us the right primitive for asynchronous sub-buffer updates… “copy this area at the next vblank and then send an event when you do”, though that functionality is available for full-buffer page swaps. So, we need to either add an asynchronous sub-buffer copy mechanism, or we need to enhance things so that a compositor can use page flipping for sub-screen updates. (If you know the details of how page flipping is happening, you can copy between buffers as needed.)
Alexandros

Posted June 14, 2011 at 6:33 am | Permalink

Very interesting!

You might also want to take a look at glcompbench (http://launchpad.net/glcompbench). Its approach and goal are quite different from yours: it simulates compositor functionality (instead of benchmarking a live one) and provides some interesting statistics to assist in diagnosing issues in the stack or a specific emulated compositor.
default

Posted June 14, 2011 at 8:59 am | Permalink

Do you know how and if the new X sync fences (introduced in X server 1.10) could help with improving composition performance?
- Owen
  
  Posted June 14, 2011 at 11:57 am | Permalink
  
  The fences are actually about keeping the compositor from getting ahead of itself and redrawing application frames that the GPU hasn’t finished rendering yet. This is a problem that only occurs with the nvidia drivers, but not with the open source drivers, which make different design choices and inherently have more coherency between different clients rendering on the GPU.
  
  While the fences don’t obviously form a solution to any problem we have when using the open source drivers, there is an important connection here. (Thanks for the reminder to look back over that!) Without application modification, using the fences will cause a compositor slowdown because the compositor has to make a call to the server to add a fence after some amount of application damage is received. But if we start doing changes to applications to mark their own frames of drawing, we should also make applications add their own fences so the compositor doesn’t have to go back and insert them.
  
  http://lists.x.org/archives/xorg-devel/2010-December/016513.html
Anon

Posted June 14, 2011 at 9:02 am | Permalink

Maybe mesa could always do ‘it’ (undirect?) by default for fullscreen windows? Sorry if I missed some internal technicality and more 😉

But it seems like a reasonable default, if possible?
- Owen
  
  Posted June 14, 2011 at 12:04 pm | Permalink
  
  My opinion is that GL is just another drawing API. Yes, fullscreen GL windows are often 3D games sucking as much performance as they can from the GPU, but they can just as well be a normal application using GL for non-performance reasons. (Also, unredirection is a choice made by the compositor not the application, so Mesa really doesn’t have the choice.) One heuristic you can more reliably make is if the window is both fullscreen and bypassing normal window manager behavior by making itself “override redirect” it’s most likely a game.
Calvin Walton

Posted June 14, 2011 at 10:17 am | Permalink

I noticed that your test results are done on an Intel graphics card. On my laptop with an older Intel card, I’ve seen similar results: Mutter runs very quickly and smoothly in most cases, except under really high load (In particular, video playback is very smooth, even fullscreen w/compositing!).

However, on my desktop machine with a much faster Radeon card (using the Mesa r600g driver) – Mutter performance seems to be worse than my laptop! Do you think this benchmark would provide useful information to help tune Mutter performance on specific graphics drivers, or help diagnose issues in the graphics driver itself?
- Owen
  
  Posted June 14, 2011 at 12:54 pm | Permalink
  
  Yes, this tool would be quite useful for investigating such a problem. The shape of the frame rate graphs, and looking at what (if anything) is consuming CPU during benchmark runs would give strong clues. But in the end, there’s not much substitute for getting the system exhibiting the performance problems into the hands of someone with the knowledge and motivation to debug the issues. I need to take a look at xcompbench results on a different machine and driver to get an idea of what parts of the results are hardware dependent or not, so I’ll probably reassemble one of my desktop systems and do some testing with a discrete AMD card soon.
CedricBail

Posted June 16, 2011 at 8:29 am | Permalink

I just tried your tool with the Enlightenment 17 compositor. It does work with the software backend, but the opengl backend isn’t detected properly by xcompbench. Could be a driver issue. But xcompbench is basically locked on the first frame and never detects any damage on the compositor window, I guess.

Do you have any idea how to fix that issue? Or where I should look to find a solution?
- Owen
  
  Posted June 16, 2011 at 9:56 am | Permalink
  
  It’s most likely not that it isn’t detecting damage at all, but because it doesn’t damage for the first frame, it never goes ahead and draws a second frame. The thing that would be useful is inside common.c:cb_run() to add some g_printerr() debugging for Configure/Map/Damage events and see what happens on startup. If you mail me (otaylor@fishsoup.net) I can take a look and/or provide help in figuring out what debugging statements to add.
Matthias Dahl

Posted June 18, 2011 at 10:22 am | Permalink

Owen, that was an interesting read. A few days ago I switched over to GNOME 3 from KDE4 and I’m lovin’ it but the tearing during video playback is driving me nuts. I asked several times on #gnome-shell and #gnome but never received any response to the problem. I also had a look at the mentioned patches but in order to forward port them, there is a lot of adapting to do which is hard if you don’t know the mutter/gnome-shell code by heart. 🙂

So, IMHO this should fix that problem as well once it’s implemented, or am I wrong here?

Besides, is there any workaround or short term fix or even an experimental git tree available?
- Owen
  
  Posted June 22, 2011 at 6:19 pm | Permalink
  
  If your video drivers are working properly, the classic form of tearing – an update of the screen contents while the vertical retrace is in the middle of the screen – cannot happen with GNOME Shell, since it is always updating the screen contents during the vertical blanking interval. This applies to all applications. (If we unredirected full screen applications, then it would be up to them to handle synchronization and avoid tearing.) Off hand, it’s hard for me to say whether what you are observing is a video driver problem, or some other form of visual artifact that looks similar to the classic problem of tearing. In any case, the better we do application synchronization the better every sort of animating content will look, and the better we define how things work, the less excuse there will be for buggy video drivers.
  - Matthias Dahl
    
    Posted June 23, 2011 at 4:24 am | Permalink
    
    Thanks for your comment on this. I know it is not the classic form of tearing but close enough. Every video I play has a discrepancy in the upper 1/5 area of the picture meaning there is a nice tearing line on movement which shows the frame was not copied/… in time and we now have two intermixed frames which is obviously bad and mostly the definition of tearing.
    
    Now I haven’t had a closer look at the clutter/mutter/gnome-shell code (yet?), so I know next to nothing about how the vblank synchronization is handled- please keep that in mind when reading the following. 🙂
    
    Is it possible that the contents of a window are not copied in time, or that only one buffer is used for a window’s contents, which gets overwritten due to timing issues, or that once one application syncs to the vblank, no other can sync to it, so a video player gets the frames out whenever it likes? Or some combination of those?
    
    No matter what- it works flawlessly in KDE4 w/ composition on, so there’s gotta be a reason behind why we are seeing those tearing artifacts. If you google around, you’ll notice that I am not the only one with exactly the same problem. And there is even a related gnome bugzilla bug about it (#651312) even though the reporter talks about twin monitor setup but it is exactly the same with a single monitor setup as well (like mine).
    
    If you need any help in figuring this out or I could run some tests or patches for you, please simply let me know. I’d be more than happy to get my hands dirty on this. 🙂 But on my own, I’m totally lost because I am not familiar with the inner workings of gnome-shell/mutter/clutter at all and I lack the time to dive deeper into this unfortunately.