gnome-battery-bench

One thing we want to do for the next versions of GNOME and Fedora is to improve battery performance. Your laptop may well be advertised by the manufacturer to have “up to 10 hours of battery life” or some such claim. You probably don’t get anywhere near this.

Let’s put out some rough numbers here to give an overall sense of scale for the problem. For a modern ultrabook:

  • The battery is 50 watt-hours (Wh) – it can power a load of 50W for an hour or a load of 5W for 10 hours.
  • The baseline idle consumption of the system – RAM refresh, the power consumption of peripherals in power-saving mode, and so forth – is 5W.
  • The screen and keyboard backlights, if both turned on to 100%, draw 5W.
  • The CPU/GPU can sustain about 15W – this is a thermal limit, so it can draw more for short bursts, but over time it will be throttled to an average.
  • All other peripherals (Wi-Fi, Bluetooth, touchpad, etc.) can use about 5W of power combined when not in power-saving mode.
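To make the arithmetic concrete, here’s a quick back-of-the-envelope sketch in Python, using the rough numbers above:

    # Back-of-the-envelope battery life, using the rough numbers above.
    BATTERY_WH = 50.0        # battery capacity in watt-hours

    def battery_hours(draw_watts):
        """Runtime in hours for a given average power draw."""
        return BATTERY_WH / draw_watts

    baseline = 5.0           # RAM refresh, peripherals in power saving
    backlights = 5.0         # screen + keyboard backlights at 100%
    cpu_gpu = 15.0           # sustained CPU/GPU thermal limit
    peripherals = 5.0        # wifi, bluetooth, etc. out of power saving

    print(battery_hours(baseline))                                       # 10.0 hours
    print(battery_hours(baseline + backlights + cpu_gpu + peripherals))  # ~1.67 hours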

So the power draw of the system can range from about 5W (the manufacturer’s 10 hours) to 30W (1 hour 40 minutes). If you have such an ultrabook, how is it doing right now? I’d guess it’s using about 15W unless you pay a lot of attention to power usage. Some of the things that might be going wrong:

  • Your keyboard/screen backlights are likely set higher than they need to be.
  • Some of the devices on your system don’t have power-saving turned on properly.
  • You likely have some background CPU activity (webpage ads, for example).

Of course, if you are running a compilation in the background, you want your CPU to be using that full 15W of power to get it done as soon as possible – in this case, your battery isn’t going to last very long. But a lot of usage is closer to idle.

Measuring power usage

I’ve made assertions above about the power used by different things. But how did I actually measure that? powertop is the state of the art for measuring and tweaking power usage on Linux. It will show you a figure for the current battery discharge rate, but that figure bounces around by several watts, partly because powertop’s own data collection loads the system. The effect of a kernel option is usually much smaller than that. One of the larger effects I discovered on my laptop was that turning on USB autosuspend for the touchscreen saves about 150mW. Without a way to measure power usage more accurately, when you tweak a tunable in powertop it’s hard to know whether any observed difference is real.

To support figuring out what is going on with power, I wrote gnome-battery-bench. What it does is pretty simple – it plays back recorded sequences of events in a loop and monitors battery charge to estimate power usage. Because battery usage is averaged over minutes, most of the noise averages out. And the user can change the event sequences to exercise different usage patterns.
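As an illustration of the basic idea – not gnome-battery-bench’s actual code – here’s a minimal sketch that estimates average power draw by sampling the battery’s reported energy over a long interval. It assumes a battery that exposes energy_now (in µWh) under /sys/class/power_supply; some batteries report charge_now (in µAh) instead:

    # Estimate average discharge rate by sampling battery energy over time.
    # Assumes the battery exposes energy_now in microwatt-hours.
    import time

    def read_energy_uwh(path="/sys/class/power_supply/BAT0/energy_now"):
        with open(path) as f:
            return int(f.read())

    e0, t0 = read_energy_uwh(), time.monotonic()
    time.sleep(300)                       # average over five minutes
    e1, t1 = read_energy_uwh(), time.monotonic()

    watts = (e0 - e1) / 1e6 / ((t1 - t0) / 3600.0)
    print(f"average discharge rate: {watts:.2f} W")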

The screenshot above shows gnome-battery-bench running a “Light Duty” benchmark that combines scrolling around in a Wikipedia page and typing in gedit. Instantaneous usage bounces around a lot from the activity and from random noise, but after a few cycles the averaged power and estimated battery lifetime converge. The corresponding idle power usage is about 5.5W, so we know that the activity adds about 2.9W.

gnome-battery-bench is designed as a graphical application because I want to encourage people to explore with it and find out interactively what is using power on their system. And graphing is also useful so that the user can see when something is going wrong with the measurement; sometimes batteries will report data that jumps around. But there’s also a command line version that can be used for automatic scripting of benchmarks.

I decided to use recorded sequences of events for a couple of reasons: first, it’s easy for anybody to create new test sequences – you just run the gnome-battery-bench command line tool in record mode and do what you want to test. Second, playing back event sequences at a low level simulates user interaction very accurately. There is little CPU overhead, and as far as the desktop is concerned it’s exactly like user input. Of course, not everything can be easily tested by simply playing back canned sequences, but our goal here isn’t to test everything, just to be able to test some things that are reasonably representative.

The gnome-battery-bench README file describes in more detail how it works and how to install it on your system.

Next steps

gnome-battery-bench is basically usable as is. The main remaining thing to do with it is to spend some time designing and recording a couple of sequences that better reflect actual usage. The current tests I checked in are basically just placeholders.

On the operating system side, we need to make sure that we are actually shipping with power saving turned on for as many peripherals as can support it. In particular, SATA link power management makes a several-watt difference.
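One way to experiment with this yourself – a sketch, not a supported tool – is to flip the policy by hand through sysfs and measure the difference (the exact policy names vary somewhat between kernel versions, and writing them requires root):

    # Switch all SATA links to their most aggressive power-saving policy.
    # Typical values: "min_power", "medium_power", "max_performance".
    import glob

    for path in glob.glob("/sys/class/scsi_host/host*/link_power_management_policy"):
        with open(path, "w") as f:        # needs root
            f.write("min_power")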

Backlight management is another place we can make improvements. Some problems are simply bad defaults. If ambient light sensors are present on the system, using them can be a big win. If not, simply using appropriate defaults is already an improvement.

Beyond that, in GNOME, we can optimize application and system code for efficiency and to not do things unnecessarily often in the background. Eventually I’d like to figure out a way to have power consumption also tracked by perf.gnome.org so we can see how code changes affect our power consumption and avoid regressions.

perf.gnome.org – introduction

My talk at GUADEC this year was titled Continuous Performance Testing on Actual Hardware, and covered a project that I’ve been spending some time on for the last 6 months or so. I tackled this project because of accumulated frustration that we weren’t making consistent progress on performance with GNOME. For one thing, the same problems seemed to recur. For another, we would get anecdotal reports of performance problems that were very hard to put a finger on. Was the problem specific to some particular piece of hardware? Was it a new problem? Was it a problem that we had already addressed? I wrote some performance tests for gnome-shell a few years ago – but running them sporadically wasn’t that useful. Running a test once doesn’t tell you how fast something should be, just how fast it is at the moment. And if you run the tests again in 6 months, even if you remember what numbers you got last time, even if you still have the same development hardware, how can you possibly figure out what change is responsible? There will have been thousands of changes to dozens of different software modules.

Continuous testing is the goal here – every time we make a change, to run the same tests on the same set of hardware, and then to make the results available with graphs so that everybody can see them. If something gets slower, we can then immediately figure out what commit is responsible.

We already have a continuous build server for GNOME, GNOME Continuous, which is hosted on build.gnome.org. GNOME Continuous is a creation of Colin Walters, and internally uses Colin’s ostree to store the results. ostree, for those not familiar with it, is a bit like Git for trees of binary files, and in particular for operating systems. Because ostree can efficiently share common files and represent the difference between two trees, it is a great way to both store lots of build results and distribute them over the network.

I wanted to start with the GNOME Continuous build server – for one thing so I wouldn’t have to babysit a separate build server. There are many ways that the build can break, and we’ll never get away from having to keep an eye on them. Colin and, more recently, Vadim Rutkovsky were already doing that for GNOME Continuous.

But actually putting performance tests into the set of tests that are run by build.gnome.org doesn’t work well. GNOME Continuous runs its tests on virtual machines, and a performance test on a virtual machine doesn’t give the numbers we want. For one thing, server hardware is different from desktop hardware – it generally has very limited graphics acceleration, it has completely different storage, and so forth. For another, a virtual machine is not an isolated environment – other processes and unpredictable caching will affect the numbers we get – and any sort of noise makes it harder to see the signal we are looking for.

Instead, what I wanted was to have a system where we could run the performance tests on standard desktop hardware – not requiring any special management features.

Another architectural requirement was that the tests would keep on running, no matter what. If a test machine locked up because of a kernel problem, I wanted to be able to continue on, update the machine to the next operating system image, and try again.

The overall architecture is shown in the following diagram:

The most interesting thing to note in the diagram is that the test machines don’t directly connect to build.gnome.org to download builds or to perf.gnome.org to upload the results. Instead, the test machines are connected over a private network to a controller machine, which supervises the process of updating to the next build and actually running the tests. The controller has two forms of control over the process: first, it controls the power to the test machines, so at any point it can power cycle a test machine and force it to reboot. Second, the test machines are set up to network boot from the controller, so that after power cycling a test machine the controller can determine what it boots – a special image to do an update, or the software being tested. The systemd journal from the test machine is exported over the network to the controller machine, so that the controller can see when the update is done and can collect test results for publishing to perf.gnome.org.
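To make the flow concrete, here’s an illustrative sketch of the controller’s supervision loop; every name is a hypothetical stand-in for the real power-control, netboot, and journal-watching machinery:

    # Hypothetical sketch of one controller cycle: update, then test.
    import time

    class TestMachine:
        """Stub standing in for power control, netboot selection, and
        watching the machine's exported systemd journal."""
        def set_netboot(self, image): print(f"will netboot: {image}")
        def power_cycle(self): print("power cycled")
        def wait_for_journal(self, marker, timeout):
            time.sleep(1)                 # stand-in for journal watching
            return {"marker": marker}

    def run_cycle(machine, build_id):
        machine.set_netboot("update-image")    # special image that updates
        machine.power_cycle()                  # force a reboot, even if hung
        machine.wait_for_journal("update-complete", timeout=1800)

        machine.set_netboot("target-os")       # the build being tested
        machine.power_cycle()
        results = machine.wait_for_journal("tests-complete", timeout=3600)
        return results                         # published to perf.gnome.org

    run_cycle(TestMachine(), "some-build-id")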

perf.gnome.org is live now, and tests have been running for the last three months. In that period, the tests have run thousands of times, and I haven’t had to intervene once. Here’s perf.gnome.org catching a regression (fix):

I’ll cover more details of how the hardware testing setup works and how performance tests are written in future posts – for now you can find some more information at https://wiki.gnome.org/Projects/HardwareTesting.

Avoiding Jitter in Composited Frame Display

When I last wrote about compositor frame timing, the basic compositor algorithm was very simple:

  • When we receive damage, schedule a redraw immediately
  • If a redraw is scheduled, and we’re still waiting for the previous swap to complete, redraw when the swap completes
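A minimal sketch of these two rules as an event-driven state machine (hypothetical names):

    # The basic compositor redraw policy, as two event handlers.
    class SimpleCompositor:
        def __init__(self):
            self.redraw_pending = False
            self.swap_in_flight = False

        def on_damage(self):
            if self.swap_in_flight:
                self.redraw_pending = True     # redraw when the swap completes
            else:
                self.redraw()                  # otherwise, redraw immediately

        def on_swap_complete(self):
            self.swap_in_flight = False
            if self.redraw_pending:
                self.redraw_pending = False
                self.redraw()

        def redraw(self):
            self.swap_in_flight = True         # draw and swap buffers here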

This is the algorithm that Mutter has been using for a long time, and it is also the algorithm used by Weston, the Wayland compositor. It has the nice property that we draw at no more than 60 frames per second, but if a client can’t keep up and draw at 60fps, we draw all the frames that the client can produce as soon as they are available. We can see this graceful degradation in the following diagram:

But what if we have a source such as a video player that provides content at a fixed frame rate less than the display’s frame rate – an application that doesn’t do 60fps not because it can’t, but because it doesn’t want to? I wrote a simple test case that displayed frames at 24fps or 30fps. These frames were graphically minimal – drawing them did not load the system at all – but I saw surprising behavior: whenever anything else was going on on the system – if I moved a window, if a web page updated – frames would be displayed at the wrong time. There was jitter in the output.

To see what was happening, first take a look at how things work when the video player is drawing at 24fps and the system is otherwise idle:

Then consider what happens when another client gets involved and draws. In the following chart, the yellow shows another client rendering a frame, which is queued up for swap when the second video player frame arrives:

The video player frame is displayed a frame late. We’ve created jitter, even though the system is only lightly loaded.

The solution I came up with for this is to make the compositor wait for a fixed point in the VBlank cycle before drawing. In my current implementation, the compositor starts drawing 2ms after the VBlank. So, the algorithm is:

  • When we receive damage, schedule a redraw for 2ms after the next VBlank.
  • If a redraw is scheduled for time T, and we’re still waiting for the previous swap to complete at time T, redraw immediately when the swap completes

This allows the application to submit a frame and know with certainty when the frame will be displayed. There’s a tradeoff here – we slightly increase the latency for responding to events, but we solve the jitter problem.

There is one notable problem with the approach of drawing at a fixed point in the VBlank cycle, which we can see if we return to the first chart, and redo it with the waits added:

What we see is that the system is now idle some of the time, and the frame rate actually achieved drops from 24fps to 20fps – we’ve locked to a sub-multiple of the 60fps frame rate: each frame now occupies a whole number of 16.7ms refresh cycles – three of them, or 50ms. This looks worse, and it has another problem. On a system with power saving, the CPU and GPU start in a low-power, low-performance mode. If the system is partially idle, they will stay in low-power mode, because that appears to be sufficient to keep up with the demands. We will stay in low-power mode doing 20fps even though we could do 60fps if the CPU and GPU went into high-power mode.

The solution I came up with for this is to have the application, when it submits a frame, mark whether or not it’s an “urgent” frame. The distinguishing characteristic of an urgent frame is that the application started it immediately after the last frame, without sleeping in between. The modified algorithm is:

  • When we receive damage:
    • If it’s part of an urgent frame, schedule a redraw immediately
    • Otherwise, schedule a redraw for 2ms after the next VBlank.
  • If a redraw is scheduled for time T, and we’re still waiting for the previous swap to complete at time T, redraw immediately when the swap completes
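Here’s a minimal sketch of that final scheduling decision (hypothetical names; the surrounding event loop is assumed to fire the redraw at the returned time):

    # Decide when to redraw for a damage event; times are in milliseconds.
    REDRAW_OFFSET_MS = 2.0                     # fixed point in the VBlank cycle

    def redraw_time_for_damage(now, next_vblank, urgent):
        if urgent:
            return now                         # draw as soon as possible
        return next_vblank + REDRAW_OFFSET_MS  # wait for the fixed point

    def should_redraw_on_swap_complete(now, scheduled_redraw_time):
        # If a redraw was scheduled for time T and T passed while we were
        # waiting for the previous swap, redraw immediately.
        return scheduled_redraw_time is not None and now >= scheduled_redraw_time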

I’m pretty happy with how this algorithm works out in testing, and it may be as good as we can get for X. The main downside I know of is that it solves the two problems only individually – handling clients that need all the rendering resources of the system, and handling clients that want minimum jitter for displayed frames – but not the combination. The client that is rendering full-out at 24fps is just as vulnerable to jitter from other clients drawing as the client that is choosing to run at 24fps. There are mitigation strategies – for example, not triggering a redraw when a client that is obscured changes – but I don’t have a full answer. Unredirecting full-screen games definitely is a good idea.

What are other approaches we could take to the overall problem of jitter? One approach would be to use triple buffering for the compositor’s output so that it never has to block and wait for the VBlank – as soon as the previous frame completes, it could start drawing the next one. But the strong disadvantage of this is that when two clients are drawing, the compositor will be rendering at more than 60fps and throwing some frames away. We’re wasting work in a situation where we already have oversubscribed resources. We really want to coalesce damage and draw only one compositor frame per VBlank cycle.

The other approach that I know of is to submit application frames tagged with their intended frame times. If we did this, then the video player could submit frames tagged two VBlank intervals in the future, and reliably know that they would be displayed with that latency and never unexpectedly be displayed early. I think this could be an interesting thing to pursue for Wayland, but it’s basically unworkable for X, since there is no way to queue application frames. Once the application has drawn new window contents, they’ve overwritten the old window contents, and the old window contents are no longer available to the compositor.

Credit: Kristian Høgsberg made the original suggestion that waiting a few ms after the VBlank might provide a solution to the problem of unpredictable latency.

Application/Compositor Synchronization

This blog entry continues an extended series of posts over the last couple of years.

What we figured out in the last post was that if you can’t animate at 60fps, then from the point of view of achieving a smooth display, a very good thing to do is to animate as fast as you can while still giving the compositor time to redraw. The process is represented visually below.

The top section shows a timeline of activity for the Application, Compositor, and X server. At the bottom, we show the contents of the application’s window pixmap, the back buffer, and the front buffer as time progresses. From this, we can get an idea of the time between the point where a user hits a key and the point where that displays on the screen: the input latency. The keystroke C almost immediately makes its way into a new application frame, and that new frame is almost immediately drawn by the compositor into the back buffer, and the back buffer is almost immediately swapped by the X server. On the other hand, the keystroke D suffers multiple delays.

What happens if we use the same algorithm when we’re unloaded – when the total drawing time is less than the interval between screen refreshes? Then it looks like:

This is basically working pretty well – but we note that even though the application is drawing quickly and the entire system is unloaded, we still have a lot of input latency. If we plot the latency versus the application drawing time, it looks like:

The shaded area shows the theoretical range of latencies, the solid line the theoretical average latency, and the points show min/max/avg latencies as measured in a simulation. (It should be mentioned that this is only the latency when we’re continually drawing frames. An isolated frame won’t have any problems with previously queued frames, so will appear on the screen with minimal latency.)

We could potentially improve this by having the application delay rendering a new frame – the compositor can use the time it took to render the last frame to guess a “deadline” – a time by which the application needs to have the frame rendered. We can again look at a timeline plot and simulated latencies for this algorithm:

There are downsides to delaying frame render – the obvious one is that if we guess wrong and the application starts the frame too late, then we can entirely miss a frame. From a smoothness point of view this looks really bad. In general, an application should only use a deadline provided by the compositor if it has reason to believe that the next frame is roughly similar to the previous one. Another disadvantage is that the delay algorithm does cause a frame-rate cliff as soon as the time to draw a frame exceeds the vblank period – there is an instant drop from 60fps to 30fps.

Which of these two algorithms is better likely depends on the application: if an application wants maximum animation smoothness and protection from glitches, drawing frames as early as possible makes sense. On the other hand, if input latency is a critical factor – for a game or a real-time music application – then delaying frame drawing as late as possible is preferable.

So, what we want at the level of application/compositor synchronization is to provide enough information to allow applications to implement different algorithms. After drawing a frame, the compositor should send the application a message containing:

  • The expected time at which the frame will be displayed on screen
  • If possible, a deadline by which the application needs to have finished drawing the next frame to get it to appear onscreen.
  • The time that the next frame will be displayed onscreen
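As a sketch, the per-frame feedback might look something like the following; the field names are illustrative, not from any specification:

    # Illustrative shape of the compositor's per-frame feedback message.
    from dataclasses import dataclass

    @dataclass
    class FrameFeedback:
        presentation_time_us: int        # when this frame reaches the screen
        next_deadline_us: int            # finish drawing by this time, if known
        next_presentation_time_us: int   # when the next frame would display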

But even without the deadline information, just having a basic response at the end of the frame already greatly improves on the current situation. I’m working on a proposal to add application/compositor synchronization to the Extended Window Manager Hints specification.

Frame Timing: the Simple Way

This is a follow-up to my post from last week; I wanted to look into why the different methods of frame timing looked different from each other. As a starting point, we can graph the position of a moving object (like the ball) as displayed to the user versus its theoretical position (gray line):

The “Frame Start” and “Frame Center” lines represent the two methods described in the last post: we choose the position of objects in the frame based on the theoretical position either at the start of the time that the frame will be displayed, or at the center of that time. The third line, “Constant Steps”, shows a method that wasn’t described in the last post, but that you would have seen if you tried the demo. It’s actually the simplest possible algorithm you can imagine: compute the positions of objects at the current time, draw the frame as fast as possible, start a new frame.
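In code, the three methods differ only in the time at which they sample the theoretical position – a sketch, with hypothetical helper names:

    # position(t) returns the theoretical position at time t;
    # frame_start/frame_end bracket the interval the frame is on screen.
    def sample_frame_start(position, frame_start, frame_end):
        return position(frame_start)                        # "Frame Start"

    def sample_frame_center(position, frame_start, frame_end):
        return position((frame_start + frame_end) / 2.0)    # "Frame Center"

    def sample_constant_steps(position, now):
        return position(now)                                # "Constant Steps"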

The initial reaction to the above graph is that the “Frame Center” method is about as good as you can get at tracking the theoretical position, and that the “Constant Steps” method is much worse than the other two. But this isn’t what you see if you try out the demo – Constant Steps is actually quite a bit better than Frame Start. Trying to understand this, I realized that the delay – the vertical offset in the graph – is completely irrelevant to how smooth things look: the user has no idea where things are supposed to be, only how they change from frame to frame. What matters for smoothness is mostly the velocity – the distance things move from frame to frame. If we plot this, we see a quite different picture:

Here we see something more like the visual impression – the “Frame Start” method has a lot more bouncing around in velocity compared to the other two. (The velocity for all three drops to zero when we miss a frame.) We can quantify this by graphing the variance of the velocity versus the time to draw a frame, looking at the region from a frame draw time of 1 frame (60fps) to a frame draw time of 2 frames (30fps).

Here we see that in terms of providing consistent motion velocity, Constant Steps is actually a bit better than Frame Center at all frame draw times.
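The metric behind that comparison is easy to state in code – a sketch:

    # Smoothness metric: variance of frame-to-frame velocity.
    from statistics import pvariance

    def velocities(positions):
        """positions[i] is where the object was drawn in frame i."""
        return [b - a for a, b in zip(positions, positions[1:])]

    def velocity_variance(positions):
        return pvariance(velocities(positions))

    print(velocity_variance([0.0, 1.0, 2.0, 3.0]))   # perfectly smooth: 0.0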

What about latency? You might think that Constant Steps is worse because it tracks the theoretical position less closely, but really, this is an artifact: to implement Frame Center, we have to predict future positions. And unless we can predict what the user is going to do, predicting future positions cannot reduce latency in responding to input. The only thing that tracking the theoretical positions more closely helps with is if we’re trying to do something like sync video to audio. And of course, to the extent that we can predict future positions or compute a delay to apply to the audio track, we can do that for Constant Steps as well: instead of drawing everything at its current position, we can choose positions based on a time shortly in the future.

If such a simple method works well, do we actually need compositor-to-application synchronization? We likely do, because we can’t really draw frames as fast as possible: we should draw frames only as fast as we can while still allowing the compositor to always be able to get in, get GPU resources, and draw a frame at the right time.
