dmesg buffer: limitations of the original kernel tracing tool

This proposal has been rejected.

*

One Line Summary

Discuss limitations of dmesg buffers for triaging ChromeOS kernel crashes.

Abstract

When introduced in 2012, ARM Chromebooks were crashing at a roughly 3x higher rate than x86 Chromebooks. I can talk about what types of crashes we saw, and what types of crashes the tombstone (dmesg contents) is useful for. Due to this work, ARM crash rates today are about 3x or 4x lower.

Many types of faults are “precise” (technical term) or good enough for tracking crash statistics: NULL pointer dereferences, page faults, divide by zero, IOMMU faults. The corresponding tombstone usually has enough info to track down the bug and/or uniquely identify its “fingerprint”.
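
As a rough illustration of what a “fingerprint” could look like, here is a minimal Python sketch that classifies a tombstone by well-known kernel message fragments and hashes the crash type together with the faulting symbol. The pattern list, the `classify` and `fingerprint` helpers, and the symbol `example_driver_probe` are all hypothetical and chosen for illustration; this is not the tooling used for ChromeOS crash reports, and real kernel messages vary by architecture and kernel version.

```python
import hashlib
import re

# Illustrative patterns only (an assumption, not an exhaustive list).
# Order matters: the NULL pointer check runs before the generic page fault check.
CRASH_PATTERNS = [
    ("null_pointer", re.compile(r"NULL pointer dereference")),
    ("divide_by_zero", re.compile(r"divide error|Division by zero in kernel")),
    ("hung_task", re.compile(r"blocked for more than \d+ seconds")),
    ("page_fault", re.compile(r"[Uu]nable to handle kernel paging request")),
]

# Faulting symbol, e.g. "PC is at foo+0x1c/0x40" (ARM) or
# "RIP: 0010:foo+0x1c/0x40" (x86). Other layouts are not handled here.
FAULT_SITE = re.compile(
    r"(?:PC is at|RIP:)\s+(?:[0-9a-f]{4}:)?(?:\[<[0-9a-f]+>\]\s+)?"
    r"([A-Za-z_][\w.]*)\+0x"
)


def classify(tombstone):
    """Return the first crash category whose pattern appears in the text."""
    for name, pattern in CRASH_PATTERNS:
        if pattern.search(tombstone):
            return name
    return "unknown"


def fingerprint(tombstone):
    """Return (category, faulting symbol, short hash) for grouping reports."""
    kind = classify(tombstone)
    match = FAULT_SITE.search(tombstone)
    site = match.group(1) if match else "unknown"
    digest = hashlib.sha1(f"{kind}:{site}".encode()).hexdigest()[:12]
    return kind, site, digest


if __name__ == "__main__":
    sample = (
        "Unable to handle kernel NULL pointer dereference at virtual address 00000000\n"
        "PC is at example_driver_probe+0x1c/0x40\n"
    )
    print(fingerprint(sample))
```

For a precise fault, the category plus the faulting symbol is often enough to group duplicate reports. For a hung task report, the same approach only names the blocked task, not the code that failed to let it proceed.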

But for some panics, hung task panics in particular, the tombstone is completely useless. The identified task is almost always the victim of some other part of the kernel not doing its part to allow the task to proceed.

  • How can we identify that other part of the kernel reliably?*

Tags

debugging, kernel

Speaker

  • Biography

    Grant has been working on IO subsystems since the mid 1980s. This includes new chipset support and debugging many device drivers. Grant got involved in parisc-linux by writing about 5K LOCC for two different IOMMUs, PCI host controllers, the core interrupt handler, and SMP bringup.

    The past 4 years (it’s been that long?!!) have been focused on Chromebook storage, networking (USB ethernet), and misc drivers (IOMMU, thermal sensors, CPU watchdog). To understand why ARM Chromebooks were crashing more frequently than x86 machines, he started sorting kernel crashes reported (opt-in) by users and reporting statistics about those crashes.

    Grant also teaches an introductory class on kitchen knives at Google, volunteers regularly at bikex.org, plays outdoors when he can (MTB, fishing, body boarding, snow skiing), and makes/repairs both wood and LED projects.