A better thermostat

By: on June 20, 2006

On a modern PC, the motherboard controls the power to the CPU fan, and has sensors that monitor the temperature of the CPU and the speed at which the fans are spinning. The slower the fan spins, the quieter it is, so it’s desirable to monitor the CPU temperature and adjust the fan speed appropriately.

Under Linux, the lm-sensors package is responsible for hardware monitoring and control. It comes with a script, “fancontrol”, which is meant for this exact job; every ten seconds, it reads the CPU temperature, and adjusts the fan power. This way it can trade off a hotter CPU for a quieter system, while keeping the CPU temperature within acceptable limits.

Unfortunately, its algorithm for choosing the fan power doesn’t work well on my system. As a result, my CPU is much cooler than it needs to be and my PC is noticeably louder than it could be. Since I spent quite a bit on a quiet case and quiet CPU so that it could live in the living room and drive the TV using MythTV, this annoys me. So I decided to try to improve it.

In doing so, I’m dipping a toe into the water of cybernetics. The script closes a negative feedback loop: if the temperature rises, the fan power and thus the fan speed will rise, causing the temperature to fall. While the power output of the CPU is constant, the system will find some equilibrium at which the temperature and fan power stay constant.

CPU heatsinks are rated in °C/W (degrees Celsius per watt), and the lower the better. What this means is that the bigger the temperature difference between the CPU and the air in the case, the faster the heatsink gets rid of heat energy, in a linear relationship. As they work, CPUs convert the electrical energy they receive into heat energy. If the CPU puts out heat at a constant wattage, the temperature of the heatsink will rise until it finds equilibrium, when the temperature difference is big enough that it’s getting rid of heat energy at the same rate it’s receiving it. Turning up the fan reduces the °C/W of the heatsink, which means that a lower temperature difference will suffice to shed the heat energy, and so the system will find a lower temperature equilibrium.
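To make the equilibrium concrete, here is a minimal sketch of the linear relationship described above. The function name and all the numbers are illustrative assumptions of mine, not measurements from my system.

```python
def equilibrium_temp(case_temp_c, cpu_watts, heatsink_c_per_w):
    """Steady-state CPU temperature under the linear heatsink model.

    At equilibrium the rise above the case air temperature equals
    the heat output in watts times the heatsink rating in C/W.
    """
    return case_temp_c + cpu_watts * heatsink_c_per_w


# Illustrative numbers only: same CPU load, two fan settings.
slow_fan = equilibrium_temp(35.0, 60.0, 0.50)  # heatsink is worse with a slow fan
fast_fan = equilibrium_temp(35.0, 60.0, 0.25)  # faster fan lowers the C/W rating
print(slow_fan, fast_fan)
```

With these made-up figures, halving the C/W rating halves the temperature rise above the case air, which is all that "turning up the fan" achieves in this model.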

The CPU doesn’t, of course, consume a constant amount of energy. When it’s idling, which on most systems is most of the time, it consumes a great deal less energy than when it’s working hard converting your CDs to Ogg files. If the fan speed stays constant, then the result will be that the CPU temperature is lower when the system is idle and higher when it’s busy, which enables it to get rid of the extra heat energy.

However, this has two downsides. The first is that the fan speed must be set high enough that the temperature does not become dangerously high even when the CPU is working at full pelt; this is unnecessarily noisy 99% of the time. The second is that the CPU will live longer if it’s kept at a constant temperature for most of its life than if its temperature wanders from low to high and back as the workload changes.

So the purpose of a fan control script is to keep the CPU at a constant temperature as the workload changes, by adjusting the fan speed appropriately. This target temperature should be chosen in advance to be as high as possible while remaining a safe margin below the maximum temperature the CPU can bear, since the higher the temperature the less work the fan needs to do to get rid of the heat, and since it may need to rise to this temperature in any case when it’s working flat out. This is made more complicated by the fact that the relationship between fan power and fan speed is not constant and is highly nonlinear.

The “fancontrol” script that ships with lm-sensors takes a straightforward approach. Every ten seconds, it measures the temperature and plugs it into a function that returns a new fan power value, which it applies to the fans. I mean “function” here in the mathematical sense: a particular temperature will always result in a particular fan power, unless you stop the script and change the configuration file. It has no memory of what has gone before. A different heat output can only be balanced by a different fan power, and the fan power can only change if the temperature does. This means that a change in the power the CPU is putting out is guaranteed to change the temperature it operates at, which is contrary to the goals of the script.
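A memoryless controller of this kind can be sketched as follows. This is not fancontrol’s actual code: the linear ramp, the function name, the thresholds and the 0 to 255 power range are all assumptions chosen for illustration.

```python
def stateless_fan_power(temp_c, min_temp=40.0, max_temp=60.0,
                        min_pwm=80, max_pwm=255):
    """Map a temperature straight to a fan power value.

    There is no state: the same temperature always yields the same
    power, whatever happened on earlier readings.
    """
    if temp_c <= min_temp:
        return min_pwm
    if temp_c >= max_temp:
        return max_pwm
    # Assumed shape: a straight-line ramp between the two thresholds.
    fraction = (temp_c - min_temp) / (max_temp - min_temp)
    return round(min_pwm + fraction * (max_pwm - min_pwm))
```

Because the output depends only on the current reading, the only way such a controller can run the fan harder under load is for the CPU to sit at a higher temperature.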

On my system the script found an equilibrium in which the fan worked very hard to keep the CPU unnecessarily cool, contrary to the principle above that the CPU should be kept hot to keep the fan quiet. But because it had no memory, it never noticed that it could afford to slow the fan.

So I’ve written a smarter script, one that knows what temperature the CPU is supposed to be and actively tries to maintain it. If the temperature rises above the target, it increases the fan power until it drops back to the target temperature. It maintains this new, higher fan power until the temperature drops below the target, at which point it reduces the speed to return to the target temperature again.

The strategy the script employs is pretty simple: every few seconds, it multiplies the difference between the measured temperature and the target temperature by a constant, and adds it to the fan speed.
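In control terms this is an integral-only controller. A minimal sketch of one step, assuming a fan speed expressed on a 0 to 100 scale; the function name, the gain and the clamping limits are my own illustrative choices rather than the values in my script.

```python
def next_fan_speed(fan_speed, temp_c, target_c, gain, lo=0.0, hi=100.0):
    """One control step: add gain * (measured - target) to the fan speed.

    The result is clamped so the accumulated speed cannot run away
    beyond what the fan can actually do.
    """
    adjusted = fan_speed + gain * (temp_c - target_c)
    return max(lo, min(hi, adjusted))


# A few successive readings against a 55 degree target.
speed = 50.0
for reading in (58.0, 57.0, 55.0, 54.0):
    speed = next_fan_speed(speed, reading, target_c=55.0, gain=0.5)
```

Note that when the reading equals the target the speed is left alone, so the fan keeps whatever setting was needed to get there. That retained value is the memory the stock script lacks.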

In theory, this could cause problems. When the fan speed is changed, it takes time for the system to find a new equilibrium. During this time the system will be sampling the temperature and adjusting the fan speed further. This latency could cause the fan speed to oscillate, with the temperature forever overshooting its target. However, choosing a sufficiently slow fan adjustment seems to prevent this problem occurring.
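The effect can be seen in a toy simulation. Everything here is an assumption made for illustration: the equilibrium temperature falls linearly with fan speed, the real temperature closes only a fifth of the gap to equilibrium per step, and the two gains are picked to show the contrast. It is not a model of my actual hardware.

```python
def simulate(gain, steps=400, target_c=55.0, lag=0.2):
    """Run the toy loop and return the temperature swing over the last 50 steps."""
    temp_c, speed = 72.0, 20.0  # start hot with a slow fan
    history = []
    for _ in range(steps):
        # Controller: accumulate the error into the fan speed, clamped to 0..100.
        speed = max(0.0, min(100.0, speed + gain * (temp_c - target_c)))
        # Toy plant: equilibrium temperature falls linearly with fan speed,
        # and the temperature moves only part of the way there each step.
        equilibrium_c = 80.0 - 0.4 * speed
        temp_c += lag * (equilibrium_c - temp_c)
        history.append(temp_c)
    tail = history[-50:]
    return max(tail) - min(tail)


gentle_swing = simulate(gain=0.5)       # slow adjustment
aggressive_swing = simulate(gain=50.0)  # far too eager
```

In this toy model the gentle gain rings a little at first and then settles, while the aggressive gain keeps slamming the fan between its limits and the temperature never stops swinging.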

The larger problem was the nonlinearity of the relationship between fan power and fan speed. I’d like to do something clever about this, but for the moment I’ve done something fairly dull. I measured the fan speed at each power stepping and drew a graph, to which I then fitted a curve. My script adjusts a “virtual fan speed” and uses this curve to choose the power setting, paying no attention to the real fan speed.
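One simple way to go from a virtual fan speed back to a power setting is sketched below. It swaps my fitted curve for straight-line interpolation between measured points, and the table itself is invented for the example, not my real measurements.

```python
from bisect import bisect_left

# Hypothetical (power, rpm) measurements, sorted by rpm. In this made-up
# table the fan does not start turning until the power reaches 60.
MEASURED = [(60, 0), (80, 900), (120, 1800), (180, 2400), (255, 2700)]


def power_for_speed(virtual_rpm):
    """Return the power setting expected to give roughly virtual_rpm."""
    rpms = [rpm for _, rpm in MEASURED]
    if virtual_rpm <= rpms[0]:
        return MEASURED[0][0]
    if virtual_rpm >= rpms[-1]:
        return MEASURED[-1][0]
    # Find the two measured points that bracket the requested speed.
    i = bisect_left(rpms, virtual_rpm)
    (p0, r0), (p1, r1) = MEASURED[i - 1], MEASURED[i]
    # Interpolate linearly between them.
    return round(p0 + (virtual_rpm - r0) * (p1 - p0) / (r1 - r0))
```

The controller then works entirely in the roughly linear “speed” domain, and the lookup absorbs the nonlinearity when the value is written out as a power setting.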

I’m sure a much smarter script could be written, modelling the fan/heatsink assembly more closely, and paying attention to the real fan speed that results from the power adjustments. But this script seems to do the job very nicely of choosing a sensible fan speed based on the CPU workload.

One worry I had with this was that if the script died unexpectedly or misbehaved, the fan could be left stuck on a low setting as the temperature rises, damaging the CPU. So I resolved to write a wrapper script in C which would detect and respond to these conditions. This turned out to be three times as much code as the script itself, so I’m not sure it was worth it, but I’m using it for now. The wrapper switches on programmatic fan control before launching the script, and turns it off if the script dies, or if the temperature rises above a certain threshold. It also catches all the signals it can and ensures that the script is killed and programmatic fan control switched off before exiting.
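The shape of such a wrapper can be sketched briefly. My wrapper is written in C; the version below uses Python to match the other examples here and is only an outline. The `read_temp_c` and `set_manual_control` callables are hypothetical stand-ins for the hardware-specific parts, and the threshold and polling interval are arbitrary.

```python
import signal
import subprocess
import time


def _raise_exit(signum, frame):
    # Turn the signal into an exception so the finally block below runs.
    raise SystemExit(128 + signum)


def exit_cleanly_on_sigterm():
    """Optional: call once from the main thread before supervise()."""
    signal.signal(signal.SIGTERM, _raise_exit)


def supervise(command, read_temp_c, set_manual_control,
              panic_temp_c=70.0, poll_s=1.0):
    """Run the control script, handing the fan back if anything goes wrong."""
    set_manual_control(True)
    child = subprocess.Popen(command)
    try:
        while child.poll() is None:
            if read_temp_c() >= panic_temp_c:
                return "overheat"
            time.sleep(poll_s)
        return "child exited"
    finally:
        # Whatever happened, kill the script if it is still running and
        # switch programmatic fan control off again.
        if child.poll() is None:
            child.terminate()
            child.wait()
        set_manual_control(False)
```

Putting the cleanup in a single `finally` block means the same path runs whether the script dies, the temperature limit is hit, or the wrapper itself is interrupted.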

There’s still some cleaning up to do. At the moment some code is very specific to my system, but I hope to submit this to the lm-sensors people in the next few days. The biggest issue remaining to sort out is the “virtual fan speed” kludge to handle the strong nonlinearity of the relationship between fan power and fan speed. I’d be much happier monitoring the real fan speed and making use of that somehow, but I’m not sure how to factor it in. The system only updates the fan speed reading every second, so you can’t simply step the power until the desired fan speed is reached, but I’m sure there’s a robust way to make use of the fan speed readings to make the right adjustments.

(updated – see Part Two)

