Avoiding Jitter in Composited Frame Display

When I last wrote about compositor frame timing, the basic compositor algorithm was very simple:

  • When we receive damage, schedule a redraw immediately
  • If a redraw is scheduled, and we’re still waiting for the previous swap to complete, redraw when the swap completes
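
In compositor-style C, these two rules might look roughly like the following minimal sketch (the types and names are illustrative, not Mutter’s actual code):

```c
#include <stdbool.h>

/* Illustrative compositor state for the basic policy. */
typedef struct {
  bool redraw_pending;  /* damage arrived, redraw not yet done    */
  bool swap_in_flight;  /* previous buffer swap not yet completed */
} Compositor;

static void redraw (Compositor *c);

/* Rule 1: on damage, redraw immediately - unless we are still
 * waiting for the previous swap to complete. */
static void
on_damage (Compositor *c)
{
  c->redraw_pending = true;
  if (!c->swap_in_flight)
    redraw (c);
}

/* Rule 2: when the swap completes, run any redraw held back by it. */
static void
on_swap_complete (Compositor *c)
{
  c->swap_in_flight = false;
  if (c->redraw_pending)
    redraw (c);
}

static void
redraw (Compositor *c)
{
  c->redraw_pending = false;
  /* ... render the scene and swap buffers ... */
  c->swap_in_flight = true;
}
```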

This is the algorithm that Mutter has been using for a long time, and it is also the algorithm used by Weston, the Wayland reference compositor. This algorithm has the nice property that we draw at no more than 60 frames per second, but if a client can’t keep up and draw at 60fps, we draw all the frames that the client can draw as soon as they are available. We can see this graceful degradation in the following diagram:

But what if we have a source such as a video player that provides content at a fixed frame rate less than the display’s frame rate – an application that doesn’t draw at 60fps not because it can’t, but because it doesn’t want to? I wrote a simple test case that displayed frames at 24fps or 30fps. These frames were graphically minimal – drawing them did not load the system at all – but I saw surprising behavior: whenever anything else started going on in the system – if I moved a window, if a web page updated – I would see frames displayed at the wrong time. There was jitter in the output.

To see what was happening, first take a look at how things work when the video player is drawing at 24fps and the system is otherwise idle:

Then consider what happens when another client gets involved and draws. In the following chart, the yellow shows another client rendering a frame, which is queued up for swap when the second video player frame arrives:

The video player frame is displayed a frame late. We’ve created jitter, even though the system is only lightly loaded.

The solution I came up with for this is to make the compositor wait for a fixed point in the VBlank cycle before drawing. In my current implementation, the compositor starts drawing 2ms after VBlank. So, the algorithm is:

  • When we receive damage, schedule a redraw for 2ms after the next VBlank.
  • If a redraw is scheduled for time T, and we’re still waiting for the previous swap to complete at time T, redraw immediately when the swap completes
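
Extending the sketch above with a redraw_due flag added to the Compositor struct, the damage and swap-complete handlers change to something like this – next_vblank_time_us() and add_timeout_us() are assumed helpers of mine, not real API:

```c
#include <stdint.h>

#define DRAW_OFFSET_US 2000   /* start drawing 2ms after VBlank */

extern int64_t next_vblank_time_us (void);
extern void add_timeout_us (int64_t when_us,
                            void (*func) (Compositor *), Compositor *c);

static void on_redraw_timeout (Compositor *c);

/* Damage now schedules the redraw for a fixed point in the VBlank
 * cycle; repeated damage before time T coalesces into one redraw. */
static void
on_damage (Compositor *c)
{
  if (c->redraw_pending)
    return;
  c->redraw_pending = true;
  add_timeout_us (next_vblank_time_us () + DRAW_OFFSET_US,
                  on_redraw_timeout, c);
}

/* At time T: if the previous swap is still outstanding, note that a
 * redraw is owed and run it as soon as the swap completes. */
static void
on_redraw_timeout (Compositor *c)
{
  if (c->swap_in_flight)
    {
      c->redraw_due = true;
      return;
    }
  redraw (c);
}

static void
on_swap_complete (Compositor *c)
{
  c->swap_in_flight = false;
  if (c->redraw_due)
    {
      c->redraw_due = false;
      redraw (c);  /* already past time T, so draw immediately */
    }
}
```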

This allows the application to submit a frame and know with certainty when the frame will be displayed. There’s a tradeoff here – we slightly increase the latency for responding to events, but we solve the jitter problem.

There is one notable problem with the approach of drawing at a fixed point in the VBlank cycle, which we can see if we return to the first chart, and redo it with the waits added:

What we see is that the system is now idle some of the time, and the frame rate actually achieved drops from 24fps to 20fps – we’ve locked to a sub-multiple of the 60fps frame rate. This looks worse, but it also causes another problem. A system with power saving will start in a low-power, low-performance mode. If the system is partially idle, the CPU and GPU will stay in low-power mode, because that appears to be sufficient to keep up with the demands. We will stay in low-power mode doing 20fps even though we could do 60fps if the CPU and GPU went into high-power mode.

The solution I came up with for this is a modified algorithm where, when the application submits a frame, it marks whether it is an “urgent” frame or not. The distinguishing characteristic of an urgent frame is that the application started the frame immediately after the last frame, without sleeping in between. Then we use a modified algorithm:

  • When we receive damage:
    • If it’s part of an urgent frame, schedule a redraw immediately
    • Otherwise, schedule a redraw for 2ms after the next VBlank.
  • If a redraw is scheduled for time T, and we’re still waiting for the previous swap to complete at time T, redraw immediately when the swap completes
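
Putting it together, the damage handler might branch on the urgent flag roughly like this (still a sketch; how the urgency marking travels from application to compositor is left out):

```c
/* Urgent frames keep the original draw-as-soon-as-possible path;
 * non-urgent frames wait for the fixed point in the VBlank cycle. */
static void
on_damage (Compositor *c, bool urgent)
{
  if (c->redraw_pending)
    return;
  c->redraw_pending = true;

  if (urgent)
    {
      if (!c->swap_in_flight)
        redraw (c);
      else
        c->redraw_due = true;  /* redraw the moment the swap completes */
    }
  else
    {
      add_timeout_us (next_vblank_time_us () + DRAW_OFFSET_US,
                      on_redraw_timeout, c);
    }
}
```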

I’m pretty happy with how this algorithm works out in testing, and it may be as good as we can get for X. The main downside I know of is that it solves the two problems only individually – handling clients that need all the rendering resources of the system, and handling clients that want minimum jitter for displayed frames – it doesn’t solve the combination. A client that is rendering full-out at 24fps is just as vulnerable to jitter from other clients drawing as a client that is choosing to run at 24fps. There are mitigation strategies – for example, not triggering a redraw when a client that is obscured changes – but I don’t have a full answer. Unredirecting full-screen games is definitely a good idea.

What are other approaches we could take to the overall problem of jitter? One approach would be to use triple buffering for the compositor’s output, so that it never has to block and wait for the VBlank – as soon as the previous frame completes, it could start drawing the next one. But the strong disadvantage of this is that when two clients are drawing, the compositor will be rendering at more than 60fps and throwing some frames away. We’re wasting work in a situation where we already have oversubscribed resources. We really want to coalesce damage and only draw one compositor frame per VBlank cycle.

The other approach that I know of is to submit application frames tagged with their intended frame times. If we did this, then the video player could submit frames tagged two VBlank intervals in the future, and reliably know that they would be displayed with that latency and never unexpectedly be displayed early. I think this could be an interesting thing to pursue for Wayland, but it’s basically unworkable for X, since there is no way to queue application frames. Once the application has drawn new window contents, they’ve overwritten the old window contents, and the old window contents are no longer available to the compositor.

Credit: Kristian Høgsberg made the original suggestion that waiting a few ms after the VBlank might provide a solution to the problem of unpredictable latency.

14 Comments

  1. Benjamin Otte
    Posted November 29, 2012 at 6:31 am | Permalink

    If I understand you correctly, what you actually want to do is start drawing the compositor frame
    – as late as possible
    – reliably
    to make applications happy.

    As late as possible, because it reduces the latency until what the app did is rendered on the screen. There is nothing the app can do about this – it’s something the compositor needs to solve. You can probably measure the compositing time and then make a good enough guesstimate of how much time your compositing takes, so that it’s done in time for the swap.
    With the kernel switching CPU/GPU time between applications – especially for heavy graphics users – this might be more complicated, but I’m sure someone can figure out a way to give compositing higher priority so it doesn’t get worked up over a glxgears (or tracker munching your HD).

    And it needs to be reliable because it makes sure apps can predict when the next frame’s gonna happen and react accordingly. This is nice because we get smooth animations, though I’m not sure how much it matters as long as we render our frames at the correct timestamps (I think the eye can interpolate quite well), but it’s important for video players.
    But then, video players need to sync their video not only to the refresh rate, but also to the audio. And if the frame that gets sent to the compositor is 2 frames late (that’s 33ms), it might already start getting close to you noticing A/V sync issues. (From experience I adjust A/V in mplayer at ~100ms but notice it somewhere around 50-75ms. Of course, that’s anecdotal, so research may be off by a factor of 2-3). So I think what you want to give applications in a future Waylandic world is the ability to control when a frame is displayed. I don’t think apps would complain if they could send you images in batches either. They’d save on wakeups (like wake up once per second, decode 24 frames and send them with timestamps to the compositor) and they’d get a guaranteed display time. It’d definitely be what video players want.
    Also keep in mind that the display’s idea of 60fps might not be what the soundcard thinks of 60fps due to clock skew, so the video player might actually want to display the video at 23.8 or 24.13fps.

    • Owen
      Posted November 29, 2012 at 9:54 am | Permalink

      “As late as possible while still reliable” sums things up pretty well. But you don’t want to push the “late” aspect too hard, because if the compositor can’t get the entire frame drawn in time and you miss a frame, that looks really bad. My first implementation of this drew 6ms after the VBlank, but what I found in testing was that redrawing the entire compositor scene graph on my system took something like 12ms when the system was in low-power mode – so a light test case would run smoothly in low-power mode, then Clutter would decide to redraw the entire screen for whatever reason, and I’d get a frame drop. We need to start the compositor redraw early enough to meet the deadline not in the average case, but in the worst case. In general, I think you want to give the compositor pretty close to an entire frame cycle to draw – maybe you can shave off a few milliseconds, but those few milliseconds won’t greatly affect user experience.

      There are essentially three different aspects of latency that affect the user experience. The first is instantaneity – does the response to a user action feel instant? This is the loosest criterion, and even 100ms may be OK here. The second is simultaneity – does a sound happen at the same time as you see it happen? Acceptable here is something like sound from 15ms before to 45ms after video. Finally, tracking lag – does the display on the screen match the position of the mouse pointer or the user’s finger? Here, the less latency the better – 33ms is acceptable but noticeable, 16ms is better.

      Prediction of when frames are going to display is one advantage of reliability – and important for A/V synchronization. But the main reason we want reliability is consistency. We want to avoid the situation in the third diagram where for the video player, instead of drawing in the pattern “frame x x frame x frame” we drew “frame x x x frame frame”.

      As you say, it is necessary to actively synchronize sound and video during playback – you can’t just assume the clocks are close enough, even when the video is, say, at a nominal 30fps and the display at a nominal 60fps. The three sources of mismatch are: the video might be at the NTSC rate of approximately 29.97fps, the screen refresh rate might not be exactly 60fps, and the audio clock might be off. There are two approaches: either you let the audio be the master and display each video frame at the display frame that is as close as possible to the correct time – this is good enough for consumer use, but not great – or you resample the audio and lock the video and display exactly together. tests/video-timer.c in the wip/frame-synchronization branch of GTK+ shows an implementation of both algorithms (without the actual audio resampling!). In either case, the video player can and should compensate for the latency introduced by the compositor.
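
      For the first, audio-as-master approach, mapping a frame’s ideal presentation time to the closest display refresh might look like this (my own sketch, not code from video-timer.c):

      ```c
      #include <stdint.h>

      /* Map a frame's ideal presentation time (on the audio clock) to
       * the display refresh closest to it; all times in microseconds. */
      static int64_t
      closest_display_time (int64_t ideal_us,
                            int64_t next_refresh_us,
                            int64_t refresh_interval_us)
      {
        /* Whole refresh cycles from the next refresh to the ideal
         * time, rounded to the nearest cycle. */
        int64_t n = (ideal_us - next_refresh_us + refresh_interval_us / 2)
                    / refresh_interval_us;
        if (n < 0)
          n = 0;  /* already late: show at the next possible refresh */
        return next_refresh_us + n * refresh_interval_us;
      }
      ```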

      • ed
        Posted November 30, 2012 at 4:57 pm | Permalink

        a little off topic but…
        How does Windows Aero/dwm.exe have no tracking lag at all when moving windows around (i.e. window position follows cursor position perfectly)?

      • Owen
        Posted November 30, 2012 at 5:04 pm | Permalink

        In X, the cursor overlay position is updated by the X server as soon as it receives an event, while the window is at the position of the last event processed by the wm/compositor, so the cursor leads windows during a drag. If the cursor is instead drawn in software by the compositor, then it will track the window exactly. Windows is probably doing it like this – I’ve been told that Weston also does it this way. (Note that this doesn’t help if you are dragging with a finger rather than a mouse, since your finger is always drawn in hardware 🙂

      • ed
        Posted December 1, 2012 at 5:13 am | Permalink

        Actually I think that Windows also draws the cursor with hardware (similar to Option “HWCursor” for X).
        1. Disabling compositing (which disables vsync) does not cause the cursor to tear.
        2. With compositing enabled, the cursor is just as responsive (i.e. 0ms lag when moving at the top of the screen). Except for window dragging, any interaction requiring mouse movement lags behind the cursor by at least 2 frames.

        I just find it curious how Windows makes window dragging as low-latency as the cursor itself. Redrawing the window’s decoration, which has transparency and stuff, is definitely done by the compositor. The drawing of the cursor is just done by slapping on a bitmap, and I suspect that it is done independently of the compositor.
        For Weston, there is no lag in interactions with mouse movement (except a little in XWayland applications). Unfortunately the compositor does the drawing of the cursor, and I believe that compared to X’s or Windows’ (“hardware”) cursor drawing, Weston’s cursor lags by 1 frame.

      • Owen
        Posted December 1, 2012 at 12:58 pm | Permalink

        In the end, you can’t get away from the basic logic of the situation: if you want the cursor to exactly match dragged windows, then the cursor must have the same latency as the window drag. If you want the cursor to exactly match drawing in an application, then the cursor must have the same latency as the application drawing.

        Windows could be using a hardware cursor without compositing and drawing the cursor in the compositor when that’s on. Or it could be updating the HW cursor position at the frame display VBlank with the mouse cursor position used for the window drag instead of the very latest cursor position. These aren’t distinguishable by observation.

        In general, I don’t think that 16ms or even 33ms of latency in cursor position is observable to the naked eye *unless* you have two things to compare and can turn the time into a lag distance. If you are using a Wacom Cintiq and can compare the pointer position exactly to the tip of the stylus, you are going to be very sensitive to latency in cursor position. If you can observe a 0-latency cursor vs. application drawing with latency, you are going to notice. But getting down to less than a frame of latency for application drawing is hard, because it’s just like audio – if you trim latency to a minimum, you have no margin for error, and any disruption can cause skips.

  2. Andreas Tunek
    Posted November 29, 2012 at 3:45 pm | Permalink

    When you are doing this work, please keep in mind two things.

    1: There are 120 fps monitors around. There might be even higher rates in the future.

    2: Some applications (like music production and games) want as little latency as possible.

  3. Oleg
    Posted November 30, 2012 at 3:56 am | Permalink

    Could you give examples of applications which might be submitting “urgent” frames (apart from fast-action games)?

    • Owen
      Posted November 30, 2012 at 10:29 am | Permalink

      I don’t typically see urgent frames as being defined by an application author or by the type of application – they are defined by whether the frame was rendered starting immediately after receiving a _NET_WM_FRAME_DRAWN message from the compositor indicating the completion of the previous frame. Basically, any application might produce “urgent” frames when it is animating continuous motion. And any application needs the “urgent” frame handling if it is animating continuous motion but can’t do 60fps – or can’t do 60fps when the GPU is in power saving mode. This might be as simple as a web page scrolling and redrawing in response to user input.
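
      On the client side, that definition might reduce to something as simple as this (a sketch; the helpers are hypothetical stand-ins, not GTK+ or Mutter API):

      ```c
      #include <stdbool.h>

      extern bool animation_active (void);          /* hypothetical */
      extern void mark_frame_urgent (bool urgent);  /* hypothetical */

      static bool started_from_frame_drawn;
      static void begin_frame (void);

      /* Runs when the compositor's _NET_WM_FRAME_DRAWN message arrives.
       * If we start the next frame right here, without sleeping first,
       * that frame counts as urgent. */
      static void
      on_net_wm_frame_drawn (void)
      {
        if (animation_active ())
          {
            started_from_frame_drawn = true;
            begin_frame ();
            started_from_frame_drawn = false;
          }
      }

      static void
      begin_frame (void)
      {
        mark_frame_urgent (started_from_frame_drawn);
        /* ... render the frame ... */
      }
      ```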

  4. Chris M.
    Posted November 30, 2012 at 1:49 pm | Permalink

    Same comment as Andreas. Increasing the frame rate, say to 300fps, would make the latency very small in the case of a 24fps update rate from video playback.

    • drago01
      Posted November 30, 2012 at 3:30 pm | Permalink

      This makes no sense… first you need hardware capable of drawing at that framerate, and secondly you’d be wasting power rendering frames that the user never sees.

  5. Nestor
    Posted November 30, 2012 at 2:58 pm | Permalink

    This sounds like QoS for managing network traffic, i.e. some apps need low latency and others high throughput. So maybe look at QoS management algorithms to solve the problem? Regards

  6. andrimner
    Posted December 2, 2012 at 8:22 am | Permalink

    Nice article, and a classic latency versus jitter versus throughput problem.

    As you mention, the solution does not solve the combined problem of both applications requiring minimum jitter and full rendering resources. Neither does unredirecting… (However, unredirecting does reduce latency and resource usage for full-screen apps.)

    A possible solution would be to introduce selective triple-buffering for an application requiring full rendering resources. In this situation we only need to schedule a redraw synchronously with VBlank (when there is damage) and can abandon the “urgent” concept. The maximum number of triple-buffers can be limited to one, as multiple applications requiring full rendering resources will be starved anyway.

    A less resource heavy solution would be to prioritize the needs of the foreground application and always make the foreground application the “urgent” one. This would minimize both latency and jitter of the foreground application! I would personally be fine with both jitter and ratio-locked frame rates of the background applications, if only the foreground application ran perfectly.

    A less radical compromise would be to only make an application “urgent” if it is both the foreground application and using full rendering resources. Then we can render synchronously with VBlank most of the time at the cost of slight latency of the foreground application.

    Best regards, and thank you for improving our desktops 🙂

  7. joelholdsworth
    Posted December 3, 2012 at 8:03 am | Permalink

    I always wondered – should the video playback application ideally be doing telecine-like pull-up/pull-down to recast the video at a fixed factor of the system frame rate? That way each video frame would appear for a fixed number of system frames, rather than varying in some pattern.

