this post was submitted on 17 Nov 2025
18 points (100.0% liked)

Hi! I am aware of tools like top, htop, atop, and sar that can be used to monitor usage. The *top programs only seem to report in real time, while sar provides only historical usage data (as usage percentages per CPU).

The problem I am trying to get information on is which processes are running, and their stats, at the times when the system is unresponsive (which makes the *top programs impossible to use).

What is the best way to log process stats continuously, so that when the system becomes unresponsive and requires a reboot, we can look back at the state it was in and hopefully troubleshoot what is causing it to become unresponsive?

Thank you!

3 comments
[–] suicidaleggroll@lemmy.world 6 points 3 weeks ago* (last edited 3 weeks ago)

I use node_exporter + VictoriaMetrics + Grafana for network-wide system monitoring. node_exporter can also pick up text files placed in a directory you specify (its textfile collector), as long as they're written in the right format. I use that capability on my systems to add some custom metrics, including CPU and memory usage of the top 5 processes on the system, for exactly this reason.

The resulting file looks like:

# HELP cpu_usage CPU usage for top processes in %
# TYPE cpu_usage gauge
cpu_usage{process="/usr/bin/dockerd",pid="187613"} 1.8
cpu_usage{process="/usr/local/bin/python3",pid="190047"} 1.4
cpu_usage{process="/usr/bin/cadvisor",pid="188999"} 1.0
cpu_usage{process="/opt/mealie/bin/python3",pid="190114"} 0.9
cpu_usage{process="/opt/java/openjdk/bin/java",pid="190080"} 0.9

# HELP mem_usage Memory usage for top processes in %
# TYPE mem_usage gauge
mem_usage{process="/usr/local/bin/python3",pid="190047"} 3.0
mem_usage{process="/usr/bin/Xvfb",pid="196573"} 2.4
mem_usage{process="/usr/bin/Xvfb",pid="193606"} 2.4
mem_usage{process="next-server",pid="194634"} 1.2
mem_usage{process="/opt/mealie/bin/python3",pid="190114"} 1.2

It gets scraped every 15 seconds on all of my systems, and the resulting CPU and memory graphs are pretty boring most of the time, but they can be very valuable for seeing what the active processes were doing in the moments leading up to a problem.
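For illustration, here is a minimal sketch of the kind of script that could generate a file like the one above. It is not the exact script described in the comment; the textfile directory path and the use of ps are assumptions.

#!/usr/bin/env python3
# Hypothetical sketch: write top-5 process CPU/memory metrics in the
# Prometheus text exposition format for node_exporter's textfile collector.
import os
import subprocess
import tempfile

# Assumption: node_exporter runs with --collector.textfile.directory pointing here.
TEXTFILE_DIR = "/var/lib/node_exporter/textfile"
OUT_PATH = os.path.join(TEXTFILE_DIR, "top_processes.prom")

def list_processes():
    # Ask ps for pid, CPU %, memory %, and the full command of every process.
    out = subprocess.run(
        ["ps", "-eo", "pid=,pcpu=,pmem=,args="],
        capture_output=True, text=True, check=True,
    ).stdout
    procs = []
    for line in out.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, cpu, mem, cmd = parts
        # Keep only the executable, drop its arguments.
        procs.append((pid, float(cpu), float(mem), cmd.split()[0]))
    return procs

def render(procs):
    lines = ["# HELP cpu_usage CPU usage for top processes in %",
             "# TYPE cpu_usage gauge"]
    for pid, cpu, _, cmd in sorted(procs, key=lambda p: p[1], reverse=True)[:5]:
        lines.append(f'cpu_usage{{process="{cmd}",pid="{pid}"}} {cpu}')
    lines += ["# HELP mem_usage Memory usage for top processes in %",
              "# TYPE mem_usage gauge"]
    for pid, _, mem, cmd in sorted(procs, key=lambda p: p[2], reverse=True)[:5]:
        lines.append(f'mem_usage{{process="{cmd}",pid="{pid}"}} {mem}')
    return "\n".join(lines) + "\n"

def main():
    content = render(list_processes())
    # Write to a temp file and rename so node_exporter never scrapes a partial file.
    fd, tmp = tempfile.mkstemp(dir=TEXTFILE_DIR, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(content)
    os.replace(tmp, OUT_PATH)

if __name__ == "__main__":
    main()

A script like this could run from cron or a systemd timer at an interval matching the scrape interval, so the .prom file is always fresh when node_exporter reads it.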

[–] frongt@lemmy.zip 1 points 3 weeks ago

Kernel dumps? I doubt that any monitoring agent would be any more responsive than what you've already listed.

[–] artyom@piefed.social -1 points 3 weeks ago

Mission Control