...

Dashboard data is collected from Slurm through its InfluxDB Profile Accounting Plugin. The plugin buffers profiling data on each node and sends it to the profiling database only when the buffer fills or a task ends. As a result, dashboard data arrives in chunks and can lag as much as 10 minutes behind real time. For multi-node jobs, data from different nodes may arrive at different times.
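
The chunked arrival described above can be illustrated with a toy model. This is only a sketch of the buffering idea, not Slurm's actual implementation: the class name `NodeProfileBuffer`, the `capacity` parameter, and the use of a plain list as the "database" are all hypothetical.

```python
class NodeProfileBuffer:
    """Toy model of a per-node profiling buffer (hypothetical, not Slurm code)."""

    def __init__(self, capacity, sink):
        self.capacity = capacity  # assumed: samples held before a forced flush
        self.sink = sink          # a list standing in for the profiling database
        self.pending = []

    def add_sample(self, sample):
        # Samples accumulate on the node and are only sent once the buffer is full.
        self.pending.append(sample)
        if len(self.pending) >= self.capacity:
            self.flush()

    def end_task(self):
        # A task ending also pushes out whatever is still buffered.
        self.flush()

    def flush(self):
        if self.pending:
            self.sink.append(list(self.pending))
            self.pending.clear()


database = []
node = NodeProfileBuffer(capacity=3, sink=database)
for sample in range(7):
    node.add_sample(sample)
# At this point two full chunks have been sent; the last sample is still on the node.
node.end_task()
print(database)
```

With several nodes each holding its own buffer, the chunks reach the database at different moments, which is why a multi-node job's graphs can fill in unevenly.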


Example Jobs



"Interesting" examples for Pramod for possible screenshots, explanation:
Long-running checkpoint job: you can see it bouncing between nodes. The odd diagonal lines appear when the job lands on a node it previously ran on, so the graph connects that node's earlier and later data points.
https://job-profiling.hyak.uw.edu/d/U3WlCDRZz/job-profiling?orgId=1&from=1555830000000&to=1556311168417&var-job=717716&var-host=All&var-step=All&var-task=All

Long-running multi-node job that appears to be wasting a lot of resources (only one node is well utilized; the others are doing hardly anything):
https://job-profiling.hyak.uw.edu/d/U3WlCDRZz/job-profiling?orgId=1&from=1555688820227&to=1556311917176&var-job=760793&var-host=All&var-step=All&var-task=All

A job writing at about 3 MB/s for a short period:
https://job-profiling.hyak.uw.edu/d/U3WlCDRZz/job-profiling?orgId=1&from=1556168110138&to=1556168468615&var-job=636230&var-host=All&var-step=All&var-task=All
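
The links above all share one pattern: `from` and `to` are Unix timestamps in milliseconds, and `var-job` selects the Slurm job ID. A minimal sketch for building or decoding such links follows. The helper names (`to_epoch_ms`, `job_dashboard_url`, `time_range_from_url`) are hypothetical, and the query parameters are simply those seen in the example URLs; the dashboard may accept others.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlencode, urlsplit

# Base path copied from the example links above.
BASE_URL = "https://job-profiling.hyak.uw.edu/d/U3WlCDRZz/job-profiling"


def to_epoch_ms(moment):
    """Convert a datetime to milliseconds since the Unix epoch.

    Pass timezone-aware datetimes; naive ones are interpreted as local time.
    """
    return round(moment.timestamp() * 1000)


def job_dashboard_url(job_id, start, end, host="All", step="All", task="All"):
    """Build a dashboard link for one job over the window [start, end]."""
    params = [
        ("orgId", 1),
        ("from", to_epoch_ms(start)),
        ("to", to_epoch_ms(end)),
        ("var-job", job_id),
        ("var-host", host),
        ("var-step", step),
        ("var-task", task),
    ]
    return f"{BASE_URL}?{urlencode(params)}"


def time_range_from_url(url):
    """Return the (from, to) window of a dashboard link as UTC datetimes."""
    query = parse_qs(urlsplit(url).query)
    return tuple(
        datetime.fromtimestamp(int(query[key][0]) / 1000, tz=timezone.utc)
        for key in ("from", "to")
    )


start = datetime(2019, 4, 21, 7, 0, tzinfo=timezone.utc)
end = datetime(2019, 4, 22, 7, 0, tzinfo=timezone.utc)
example_url = job_dashboard_url(717716, start, end)
print(example_url)
print(time_range_from_url(example_url))
```

Because dashboard data can lag behind real time, a window that ends at the current moment may show incomplete graphs for a running job.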


...