A Grafana dashboard containing graphs of job resource usage over time is available at:

https://job-profiling.hyak.uw.edu

The dashboard can be reached from the campus network or via the Husky OnNet VPN. Login is via UW NetID, and the dashboard is available to all Hyak users.

The dashboard data is collected from Slurm using Slurm's InfluxDB Profile Accounting Plugin. The plugin stores profiling data in a buffer on each node and sends it to the profiling database only when the buffer fills or a task ends. As a result, dashboard data arrives in chunks and can lag as much as 10 minutes behind real time. For multi-node jobs, data from different nodes may arrive at different times.




Long-running checkpoint job: you can see it bouncing between nodes. The odd diagonal lines appear when the job lands on a node it previously ran on, so the graph connects the two runs.
https://job-profiling.hyak.uw.edu/d/U3WlCDRZz/job-profiling?orgId=1&from=1555830000000&to=1556311168417&var-job=717716&var-host=All&var-step=All&var-task=All

Long-running multi-node job that appears to be wasting a lot of resources (only one node is well utilized; the others are hardly doing anything):
https://job-profiling.hyak.uw.edu/d/U3WlCDRZz/job-profiling?orgId=1&from=1555688820227&to=1556311917176&var-job=760793&var-host=All&var-step=All&var-task=All

A job writing at 3 MB/s for a short time:
https://job-profiling.hyak.uw.edu/d/U3WlCDRZz/job-profiling?orgId=1&from=1556168110138&to=1556168468615&var-job=636230&var-host=All&var-step=All&var-task=All
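The example links above share one pattern: a fixed dashboard path, a `from` and `to` time range in Unix epoch milliseconds, and a `var-job` job ID. The following is a minimal sketch of building such a link for your own job. The helper names `to_epoch_ms` and `job_profile_url` are hypothetical, and the dashboard path is copied from the examples above, so it may change if the dashboard is republished.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Dashboard path copied from the example links above (assumed to stay stable).
DASHBOARD_URL = "https://job-profiling.hyak.uw.edu/d/U3WlCDRZz/job-profiling"


def to_epoch_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    return round(moment.timestamp() * 1000)


def job_profile_url(job_id: int, start: datetime, end: datetime) -> str:
    """Build a dashboard link showing all hosts, steps, and tasks of one job."""
    params = {
        "orgId": 1,
        "from": to_epoch_ms(start),
        "to": to_epoch_ms(end),
        "var-job": job_id,
        "var-host": "All",
        "var-step": "All",
        "var-task": "All",
    }
    return f"{DASHBOARD_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Job ID taken from the third example above; the time window is illustrative.
    start = datetime(2019, 4, 25, 0, 0, tzinfo=timezone.utc)
    end = datetime(2019, 4, 25, 1, 0, tzinfo=timezone.utc)
    print(job_profile_url(636230, start, end))
```

Pass timezone-aware datetimes; a naive datetime would be interpreted in the local time zone of the machine running the script.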



Job Profiling Graphs

CPU Frequency

The average CPU frequency of the CPUs allocated to the task. Note that Intel processors can scale up core frequencies when some cores are idle and there is thermal and power headroom to do so. Also note that certain operations, such as waiting for I/O, can cause a core to run at a lower frequency even though it is utilized.

CPU Time Used per second

The average CPU time consumed per second by the task during the 30-second profile sample. A fully utilized CPU core consumes 1 second of CPU time per second. A step fully utilizing all cores on a 28-core node would consume very close to 28 CPU seconds per second.

CPU Utilization

The total CPU utilization of the task. A value of 1.0 represents one fully utilized core. A step fully utilizing 28 cores would show utilization very close to 28.

Memory RSS

The Resident Set Size, which in practice is the amount of physical memory consumed by the task. This is a stacked graph: the values of the individual steps are graphed atop one another, so the height of the top line represents the total physical memory consumed by the job tasks displayed in the graph.

VMSize

The virtual memory usage of the task, which represents all memory allocated to a task, including memory that has been written out to swap and memory that has been allocated but not consumed. This is a stacked graph: the values of the individual steps are graphed atop one another, so the height of the top line represents the total virtual memory consumed by the job tasks displayed in the graph.

Pages

The number of pages of memory being used by a task. A memory page is a fixed-length contiguous block of virtual memory, described by a single entry in the memory page table.
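As a worked example of the CPU figures above, the sketch below converts the CPU time consumed during one 30-second sample into CPU seconds per second, then compares it with the number of allocated cores. The function names and the sample values are hypothetical and chosen only for illustration; they are not read from the dashboard.

```python
SAMPLE_INTERVAL_S = 30  # length of one profile sample, as stated above


def cpu_time_per_second(cpu_seconds_in_sample: float,
                        interval_s: float = SAMPLE_INTERVAL_S) -> float:
    """Average CPU seconds consumed per wall-clock second during one sample."""
    return cpu_seconds_in_sample / interval_s


def core_efficiency(cpu_rate: float, allocated_cores: int) -> float:
    """Fraction of the allocated cores that were busy (1.0 = fully utilized)."""
    return cpu_rate / allocated_cores


if __name__ == "__main__":
    # A step that keeps all 28 cores busy uses 28 * 30 = 840 CPU seconds per sample.
    full = cpu_time_per_second(840)
    print(full, core_efficiency(full, 28))

    # A step that uses only 420 CPU seconds in the same sample is half idle.
    half = cpu_time_per_second(420)
    print(half, core_efficiency(half, 28))
```

A value well below the allocated core count, sustained over the life of the job, suggests that the job requested more cores than it can use.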

Data Written and Data Read to/from Filesystem per second

The average amount of data written to or read from mounted filesystems per second over the 30-second profile sample.
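Because each point is an average over the whole sample, a short burst and a steady stream can look identical on the graph. The sketch below illustrates this with made-up per-second write volumes; the function name `average_rate_mb_s` and the data are hypothetical.

```python
SAMPLE_INTERVAL_S = 30  # length of one profile sample, as stated above


def average_rate_mb_s(per_second_mb: list[float],
                      interval_s: int = SAMPLE_INTERVAL_S) -> float:
    """Average MB/s over one sample, given the MB moved in each second."""
    if len(per_second_mb) != interval_s:
        raise ValueError("expected one value per second of the sample")
    return sum(per_second_mb) / interval_s


if __name__ == "__main__":
    steady = [3.0] * 30            # 3 MB written every second
    burst = [90.0] + [0.0] * 29    # 90 MB written in the first second only

    # Both patterns move 90 MB in 30 seconds, so both average 3 MB/s.
    print(average_rate_mb_s(steady))
    print(average_rate_mb_s(burst))
```

The graphs therefore show sustained throughput well, but they cannot reveal peaks shorter than one sample.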

Tips