
Navigating the Oom Killer: Lessons from Product Development

Introduction

In the world of Linux, the OOM (Out of Memory) killer is an essential mechanism designed to protect the system when it runs out of memory. While it's a safeguard, encountering it during the product development phase can lead to system instability and crashes—something no developer wants, especially when dealing with embedded systems or resource-constrained environments.

In this blog, I will take you through our experience with the OOM killer during the testing phase of our product development. I'll explain what the OOM killer is, why it gets triggered, how it impacted our system, and what steps we took to diagnose and mitigate the issue. If you work in embedded systems or any environment with limited memory, understanding the OOM killer is critical for building robust, stable products.

The OOM Killer Explained

The OOM killer is a feature of the Linux kernel that is triggered when the system exhausts its available memory. Its main job is to prevent the system from crashing by selectively terminating processes to free up memory. This can seem drastic, but it's necessary for keeping the system operational in low-memory situations. Linux, like most modern operating systems, uses a virtual memory system to manage RAM. When physical memory is fully utilized and there is no more room to allocate, the kernel must decide which processes to kill to free up resources. That's when the OOM killer steps in.

The kernel considers several factors when deciding which process to terminate, including the amount of memory each process is consuming and its importance to the system. Processes with a high memory footprint, those with lower priority, or those running in the background are often the first to get killed. For developers, particularly in embedded environments where memory resources are often limited, encountering the OOM killer can be frustrating. The system's automatic killing of processes can lead to instability, requiring investigation into memory usage patterns and system behavior. Understanding what triggers the OOM killer, and how it selects processes for termination, is key to diagnosing memory-related issues and preventing them from recurring.
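
When the OOM killer does act, it records the event in the kernel log, so that is usually the first place to confirm which process was killed and why. A minimal check looks roughly like this (the exact message wording varies between kernel versions):

    # Look for OOM killer activity in the kernel ring buffer
    dmesg -T | grep -iE "out of memory|killed process"

    # On systemd-based systems, the same kernel messages are available via journalctl
    journalctl -k | grep -i "oom"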

How We Encountered the OOM Killer

During the development and testing of one of our key products, we encountered frequent OOM kills. The system was terminating a vital process, which resulted in restarts and crashes of the system objects. To diagnose the root cause, we used Heaptrack, a memory profiling tool, which pointed to memory leaks inside some of the crucial shared object files. It allowed us to track memory allocations and deallocations and pinpoint the exact functions responsible for the issue.
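
For reference, a typical Heaptrack session looks roughly like the sketch below; the application name is a placeholder, and the exact name of the recorded profile file depends on the Heaptrack version (Heaptrack prints it when recording finishes):

    # Record allocations while the application runs (placeholder binary name)
    heaptrack ./my_application

    # Analyse the recorded profile reported at the end of the run
    heaptrack --analyze heaptrack.my_application.12345.zst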

We used CLI commands like free and top to observe memory usage in real-time. These tools helped us visualize the memory growth and correlate it with our system's workload.
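
A lightweight way to watch for the kind of steady memory growth we saw is to poll these tools periodically, for example:

    # Snapshot of overall memory usage in human-readable units
    free -h

    # Refresh the snapshot every two seconds to watch for steady growth
    watch -n 2 free -m

    # Sort processes by resident memory usage (procps-ng top)
    top -o %MEM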

Strategies for Prevention and Mitigation

The strategies we adopted while navigating the issue were as follows:

Regular Memory Profiling: I would recommend integrating memory profiling tools such as Heaptrack or Valgrind into the regular development workflow to continuously monitor memory allocations and detect potential leaks early (see the sketch after this list).

Software Upgrades: Upgrading any memory-leaking software packages to their latest versions.

Tuning Kernel Parameters: We could tune the kernel parameters related to the OOM killer, such as vm.overcommit_memory and oom_score_adj, to ensure that the OOM killer was less likely to terminate critical processes. This should be done with caution.
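
As mentioned under Regular Memory Profiling, a minimal leak-checking run with Valgrind might look like this (the binary name is a placeholder):

    # Run the application under Valgrind's memcheck tool and report leaks in detail
    valgrind --leak-check=full --show-leak-kinds=all ./my_application

    # Heaptrack can be used the same way when lower runtime overhead is needed
    heaptrack ./my_application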

Is it Normal to Tune Kernel Parameters for OOM Killer?

Yes, it’s quite common to tune kernel parameters to influence how the OOM killer behaves, especially in systems with limited resources. Commonly tuned parameters include:

oom_score_adj: Adjusts the priority of processes for the OOM killer. Lowering the score reduces the chance of a process being terminated, which can be helpful for protecting critical applications.

  • To check the current oom_score_adj of a process using its PID: “cat /proc/<PID>/oom_score_adj”
  • The score value ranges from -1000 (never kill) to 1000 (high likelihood of being killed).
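
For example, to make a critical process much less attractive to the OOM killer, its oom_score_adj can be lowered at runtime. The PID and value below are placeholders, and writing a negative value typically requires root privileges:

    # Write a negative adjustment for the process with PID 1234 (hypothetical)
    echo -500 | sudo tee /proc/1234/oom_score_adj

    # Verify the new value
    cat /proc/1234/oom_score_adj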

vm.overcommit_memory: Controls whether the kernel allows processes to allocate more memory than is physically available.

  • To check the current over-commit setting: “sysctl vm.overcommit_memory”
  • To change it: “sysctl -w vm.overcommit_memory=n”
  • n is 0, 1, or 2. “0” (the default) allows moderate over-commit: the kernel applies heuristic checks, and obviously unreasonable memory allocations fail. “1” always over-commits. “2” disallows over-commit: a process usually won’t be terminated by the OOM killer, but memory allocation attempts may return an error instead.
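
Changes made with sysctl -w do not survive a reboot; to keep a chosen value, it is typically added to a sysctl configuration file. The file name below is only an example, and the value should match whichever policy fits your system:

    # Persist the chosen over-commit policy (path and file name may vary by distribution)
    echo "vm.overcommit_memory = 2" | sudo tee /etc/sysctl.d/90-overcommit.conf

    # Reload sysctl settings from the configuration files
    sudo sysctl --system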

What Cautions Are Required?

Tuning kernel parameters can have a significant impact on system stability and performance. Here are some key cautions:

  • Disabling memory overcommitment (by setting vm.overcommit_memory = 2) can lead to failed memory allocations for processes if there’s insufficient memory available. This might prevent important applications from starting or handling high workloads.
  • Reducing the likelihood of the OOM killer targeting critical processes (e.g., setting oom_score_adj to a lower value) is helpful, but it could lead to non-critical processes being killed instead, which might disrupt the system’s overall behavior in unexpected ways.
  • Over-prioritizing key processes could lead to the starvation of other tasks or daemons that are important for system health, like logging or monitoring services.

When to Avoid Tuning?

  • If the root cause is a memory leak or inefficient memory management, tuning the kernel parameters might just delay the problem without fixing the underlying issue.
  • If the system is extremely resource-constrained, aggressive tuning may not prevent the OOM killer and can make the system unresponsive instead.

Conclusion

Addressing OOM killer incidents requires a combination of problem-solving, proactive management, and system optimization. Identifying the root cause, often through tools like memory profilers and system monitoring, can resolve immediate issues, but long-term stability demands more. By fine-tuning kernel parameters, optimizing resource usage, and implementing real-time monitoring, you can prevent future memory-related problems. While kernel tuning can be powerful, it must be approached with caution and constant oversight. Ultimately, a strategy of regular updates, careful system configuration, and continuous monitoring is essential for maintaining a stable and efficient system.

Author

Linganiveth Gunavel

Embedded Software Engineer