Power Monitoring on mini-clusters

Introduction

You can enable the power monitoring tool to extract power traces of your jobs. This gives you power metrics for your programs and a better picture of how they behave. To enable power monitoring, simply submit your job to a SLURM partition whose name carries the -power suffix. This starts the power monitoring tool for your job and yields a CSV file with the data. It also produces a PRV file that can be analyzed in Paraver, along with the accompanying PCF and ROW files.

The power monitoring tool is available in the following partitions:

  • jetson-tx-power
  • merlin-power
  • thunder-power
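
For example, assuming you already have a job script called jobscript.sh, submitting it to one of these partitions is all that is needed to enable the monitoring:

> sbatch --partition=merlin-power jobscript.sh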

Warning: The power monitoring tool extracts a power trace for each node allocated in your job, except in the thunder-power partition, which is monitored at cluster level. This means that, in order to get a meaningful power trace on the thunder cluster, you must allocate all the nodes in the partition.
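
If you are unsure how many nodes the thunder partition has, you can check with sinfo -p thunder-power. The following directives are a sketch of a full-partition request; the node count shown is hypothetical:

#SBATCH --partition=thunder-power
#SBATCH --nodes=8   # hypothetical: must equal the total number of nodes in the partition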

Collected Metrics

Each power monitoring partition extracts a different set of metrics. This section will list the metrics available on each partition.

Jetson-tx

  • Voltage (V)
  • Current (mA)
  • Power Consumption (W)

Each of these metrics is reported for the CPU, for the GPU, and as a total value.

Merlin

  • CPU power consumption (µW)
  • I/O power consumption (µW)

Example

This section includes a step-by-step guide to get a power trace of a small MPI program that computes prime numbers. The source code can be downloaded from here.

Once you have downloaded the source file, you need to load the proper modules and compile the application.

> module load gcc
> module load openmpi
> mpicc -O main.c -lm -o main.x

This will generate an executable called main.x. Next, write a job script that specifies a power monitoring partition. This example uses jetson-tx-power, but feel free to use any of the partitions listed above; just keep in mind that thunder-power requires allocating the whole machine.

#!/bin/bash
#SBATCH --ntasks=8              # 8 MPI ranks in total
#SBATCH --cpus-per-task=1       # one CPU per rank
#SBATCH --ntasks-per-node=4     # 4 ranks per node, so 2 nodes are allocated
#SBATCH --time=00:05:00
#SBATCH --partition=jetson-tx-power
#SBATCH --output=output-%j.out  # %j expands to the job id
 
source /etc/profile.d/modules.sh
module purge
module load gcc/5.3.0
module load openmpi/1.10.2
 
srun ./main.x

The jobscript shown above will allocate 2 nodes with 4 MPI ranks each. To submit the job, type the following in your terminal:

> sbatch jobscript.sh
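
While the job is queued or running, you can check its state with the standard SLURM commands, for example:

> squeue -u $USER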

Once the job has finished, you will find the following files:

  • output-<job_id>.out: the output of the application
  • <job_id>-jetson-tx-power-<timestamp>.csv: the power monitoring data
  • main.x.prv: the power trace, ready to be analyzed in Paraver
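
For example, a listing of the submission directory might look like this (the job id 12345 is hypothetical and <timestamp> stands in for the actual timestamp):

> ls
jobscript.sh  main.c  main.x  main.x.pcf  main.x.prv  main.x.row
output-12345.out  12345-jetson-tx-power-<timestamp>.csv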

Congratulations! You now know how to extract power traces on an HCA cluster.

Other Features

Power trace per jobstep

If your jobscript contains more than one jobstep, i.e., more than one srun command, the power monitoring tool will yield a separate power trace for each jobstep.
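
For instance, ending the example jobscript with two srun calls (a sketch, reusing the binary from above) yields two separate power traces:

srun ./main.x    # jobstep 0: produces its own power trace
srun ./main.x    # jobstep 1: produces a second, independent power trace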

Energy to Solution

An Energy-to-Solution overview is produced at the end of the execution; you can find it at the bottom of the output file. This summary includes the energy used by each node during the jobstep as well as the total energy consumption. As always, the thunder-power partition reports at cluster level rather than per node.

The following snippet is an example of the Energy-to-Solution overview that you will find in the output file of the job.

Energy to Solution Analysis:
jetson-tx03 25.2929808729 J
jetson-tx02 26.0449999422 J
Total 51.3379808151 J
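
On a long output file you can jump straight to this summary with a standard grep, for example:

> grep -A 3 "Energy to Solution" output-<job_id>.out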

Final Note

If you want to analyze the PRV power trace in Paraver you will need to create a custom configuration. You can find more information about the Paraver tool on its official website.
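
Assuming Paraver is installed on your machine (the GUI binary is typically called wxparaver), you can open the trace directly from the command line:

> wxparaver main.x.prv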
