Real-time Linux Kernel drivers – revisited for new CPU-card

Two years ago I was looking at writing a Linux kernel driver with a real-time 10ms loop for a new x86-based CPU card at work. After a bit of back and forth (and some much-needed stress-testing) I ended up with an implementation that worked pretty well, and more or less guaranteed that our 10ms loop would also run with the specified period.

Now I’ve spent the last months working on a new CPU card, and thought that I’d better run the old stress tests on it as well, to make sure that we still have the real-time performance that we’re expecting.

The new card has the following specs:

Texas Instruments Sitara AM3356 CPU (ARM Cortex-A8) running at 800 MHz
512 MB DDR3 RAM
2x Microchip LAN8710A 100 Mbps Ethernet
U-Boot, kernel and root-filesystem on SD-card
Running Linux 3.18.9 with Preempt-RT patch (-rt5)

(…and yes, the CPU card is slightly inspired by the various BeagleBones…;) )

The test

The test was done similarly to last time. The 10ms loop polls various hardware on a legacy backplane bus, which on the new card is accessed through one of the AM3356 PRU processors (which are pretty nice!). Luckily, one of the hardware cards we’re using on this backplane has its interface implemented in a Xilinx Spartan6 FPGA, so it was quite easy to modify its gateware to gather statistics on how often it is polled and to show them in ChipScope. To make sure that initial latency spikes from loading the driver didn’t affect the measurements, the first 1000 cycles are excluded from the statistics.
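The actual counters live in the FPGA gateware, which is not shown here. As a rough model of what they record, the bookkeeping could be sketched in C as below. The struct and function names and the microsecond resolution are assumptions made for illustration; only the 8ms/12ms limits and the 1000-cycle warm-up come from the test description.

```c
#include <stdint.h>

/* Hypothetical software model of the polling statistics.
 * Names and microsecond units are assumptions, not the gateware itself. */
#define WARMUP_CYCLES 1000u   /* first cycles excluded from the statistics */
#define LOW_LIMIT_US  8000u   /* "Below 8ms" threshold */
#define HIGH_LIMIT_US 12000u  /* "Above 12ms" threshold */

struct poll_stats {
    uint32_t seen;    /* warm-up intervals skipped so far */
    uint32_t cycles;  /* intervals counted after the warm-up */
    uint32_t below;   /* intervals shorter than LOW_LIMIT_US */
    uint32_t above;   /* intervals longer than HIGH_LIMIT_US */
    uint32_t min_us;  /* shortest counted interval */
    uint32_t max_us;  /* longest counted interval */
};

void poll_stats_init(struct poll_stats *s)
{
    s->seen = 0;
    s->cycles = 0;
    s->below = 0;
    s->above = 0;
    s->min_us = UINT32_MAX;
    s->max_us = 0;
}

/* Record the time between two consecutive polls. */
void poll_stats_update(struct poll_stats *s, uint32_t interval_us)
{
    if (s->seen < WARMUP_CYCLES) {
        s->seen++;            /* still warming up: ignore this interval */
        return;
    }

    s->cycles++;
    if (interval_us < LOW_LIMIT_US)
        s->below++;
    if (interval_us > HIGH_LIMIT_US)
        s->above++;
    if (interval_us < s->min_us)
        s->min_us = interval_us;
    if (interval_us > s->max_us)
        s->max_us = interval_us;
}
```

Feeding every measured poll interval into poll_stats_update() yields the same five figures that are listed for each test run in the results.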

The driver implementing the real-time loop could be re-used directly from last time (thank you Linux!), and I only had to modify the most low-level part that actually does the access to our hardware bus.

The stress load is the same as last time (although I had to slightly modify rtc_wakeup to work on the ARM processor – I’ll admit that I’m not sure if it still does exactly what it’s supposed to, but it does create a significant load on the CPU, so it at least fulfills the main intent).

Results

First of all, I wanted some baseline performance numbers, so I commented out the call to sched_setscheduler() (line 34 in the final code listing) to have the thread run with the standard, non-real-time scheduler and priority:

Non-real-time scheduler, no stress load

Cycles: 11229
Below 8ms: 0
Above 12ms: 0
Maximum cycle length: 10.26ms
Minimum cycle length: 9.74ms

Not too bad actually, but last time the non-real-time version also worked quite well, until stress was applied:

Non-real-time scheduler, with stress load (but without external ping flood)

Cycles: 11011
Below 8ms: 2948
Above 12ms: 1396
Maximum cycle length: 33.45ms
Minimum cycle length: 0.02ms

With the stress load applied, the nice performance is gone: 4344 of the 11011 cycles (roughly 39%) fell outside the 8ms to 12ms window. Notice that this is without the external ping flood, as the 10ms loop just stopped completely(!) once the ping flood was started.

So, we still need the real-time scheduler to make sure that we actually get to poll the hardware with the desired period. Re-enabling the sched_setscheduler() call and rerunning the tests gives:

Real-time scheduler, with stress load (without external ping flood)

Cycles: 11019
Below 8ms: 0
Above 12ms: 0
Maximum cycle length: 10.02ms
Minimum cycle length: 9.98ms

The real-time scheduler does its magic once again. Perfectly acceptable performance, even under heavy load. Notice that the min/max times for the real-time version with stress are actually closer to the desired 10ms than those of the non-real-time version without stress! To complete the test, I also added the external ping flood:

Real-time scheduler, with stress load (including external ping flood)

Cycles: 11111
Below 8ms: 0
Above 12ms: 0
Maximum cycle length: 10.01ms
Minimum cycle length: 9.99ms

Even though the external ping flood seems to make everything stop on the CPU board (even the kernel heart-beat LED more or less gives up), the real-time loop happily keeps running with the desired period.
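The driver itself is kernel code and is not reproduced in this post. For anyone who wants to experiment with the effect of the scheduler switch without building a kernel module, a rough userspace analogue could look like the sketch below. It is not the driver's implementation: the function names and the way the priority is passed in are assumptions, and it uses the POSIX sched_setscheduler() and clock_nanosleep() calls rather than the in-kernel API. Switching to SCHED_FIFO normally requires root or CAP_SYS_NICE.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sched.h>
#include <stddef.h>
#include <time.h>

#define PERIOD_NS    10000000L    /* 10 ms period, as in the tests above */
#define NSEC_PER_SEC 1000000000L

/* Advance an absolute deadline by ns nanoseconds (0 <= ns < 1 second). */
void timespec_add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    if (t->tv_nsec >= NSEC_PER_SEC) {
        t->tv_nsec -= NSEC_PER_SEC;
        t->tv_sec += 1;
    }
}

/* Try to move the calling process to the SCHED_FIFO real-time class.
 * Returns 0 on success, -1 on failure (for example missing privileges). */
int try_enable_fifo(int priority)
{
    struct sched_param param = { .sched_priority = priority };

    return sched_setscheduler(0, SCHED_FIFO, &param);
}

/* Call work() once per period for the given number of cycles.
 * Sleeping until an absolute deadline means that a late wake-up in one
 * cycle does not push all the following deadlines back. */
int run_periodic(unsigned int cycles, void (*work)(void))
{
    struct timespec next;
    int rc;

    if (clock_gettime(CLOCK_MONOTONIC, &next) != 0)
        return -1;

    for (unsigned int i = 0; i < cycles; i++) {
        timespec_add_ns(&next, PERIOD_NS);
        do {
            rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        } while (rc == EINTR);   /* interrupted by a signal: sleep again */
        if (rc != 0)
            return -1;
        if (work != NULL)
            work();
    }
    return 0;
}
```

Running run_periodic() once as-is and once after a successful try_enable_fifo() call, with a stress load in the background, gives a userspace counterpart to the comparison above, although the absolute numbers will differ from those of a kernel thread.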

Conclusion

Judging from the performed tests, the real-time driver is still necessary and, importantly, it still works. It was nice to see that the driver could be re-used directly between the two platforms. By comparison, the very first version on our old ARM board accessed a hardware timer in the ARM CPU directly, so only a bit of its framework/API could be carried over to a different platform. This reconfirms that the solution I went with seems to be decent, regarding both performance and portability.