Max Asböck works at the Linux Technology Center at IBM. In the past he has worked on systems management software and has written a Linux device driver for a service processor. He currently works on Linux RAS (Reliability, Availability, Serviceability) for x86-based systems. He also supports customers and occasionally answers questions about the health of DIMMs.
Linux reports memory errors through the machine check handler and the EDAC drivers. Easy access to memory error information is generally beneficial, since it gives better insight into the health of the hardware. However, memory error reporting in Linux in its current state still has a few issues. The main one is that memory errors are reported without relation to normal, expected DIMM error rates. Without knowledge of the hardware and its error thresholds, it is hard to judge from a count of reported corrected ECC errors whether a DIMM is faulty. Users are likely to ask "Are my DIMMs bad?" after seeing a number of memory errors when in reality the DIMMs are fine and the error rate is normal. In the author's experience, system administrators do indeed ask this very question.
Therefore it is useful to describe the current state of memory error reporting in Linux, to explain the problems that remain, and to point to possible enhancements. This talk intends to do this by describing in detail the current infrastructure:
A number of issues and potential improvements shall be discussed as well:
The talk aims to show that, with some improvements, the current Linux memory error reporting mechanisms can be turned into a reliable instrument for DIMM failure prediction.
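The point about judging corrected-error counts against an expected rate can be illustrated with a minimal sketch. It assumes the EDAC drivers expose per-controller corrected-error counters as `ce_count` files under `/sys/devices/system/edac/mc/`. The helper names and the threshold parameter are hypothetical and chosen only for illustration. A real threshold would have to come from the DIMM or platform vendor.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller entries in sysfs.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ce_counts(root=EDAC_ROOT):
    """Return {controller name: corrected-error count} for each mc*/ce_count file."""
    counts = {}
    for ce_file in sorted(Path(root).glob("mc*/ce_count")):
        try:
            counts[ce_file.parent.name] = int(ce_file.read_text().strip())
        except (OSError, ValueError):
            # Skip counters that cannot be read or parsed.
            continue
    return counts


def exceeds_threshold(ce_before, ce_after, hours, max_ce_per_hour):
    """Hypothetical check: is the corrected-error rate above the expected rate?

    max_ce_per_hour is an illustrative parameter, not a value that Linux provides.
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    rate = (ce_after - ce_before) / hours
    return rate > max_ce_per_hour
```

Taking two samples with `read_ce_counts()` some hours apart and passing the per-controller values to `exceeds_threshold()` gives an error rate instead of a raw count, which is the kind of context the abstract argues is missing today. Note that the counters typically start again from zero after a reboot or driver reload, so both samples must come from the same uptime.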