Power Monitoring on mini-clusters

Jetson-TX1

General Information

On the Jetson-TX1 cluster, the Texas Instruments INA3221 power monitor featured on the board is used to gather the power data.

In this case, more metrics can be collected than with the Yokogawa. The following metrics are available per node:

  • Voltage
    • CPU
    • GPU
    • Total
  • Current
    • CPU
    • GPU
    • Total
  • Power Consumption
    • CPU
    • GPU
    • Total

Also, in this case, measurements are made per node and not per cluster. The scripts therefore have to be run separately on every node where power data is to be collected.

How to obtain the power data

You will need the following scripts:

Please note that both files must be in the same folder in order for them to work.

Once they are downloaded, as in the ThunderX section, you will need to modify your job script to start and stop the power monitor script on each node allocated to the job.

In the end, your job script should look something like this:

#!/bin/bash
 
#SBATCH -t 30:00
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH -o out/powerTraceJob-%j.out
#SBATCH -e err/powerTraceJob-%j.err
#SBATCH -J powerTraceJobName
#SBATCH --partition=jetson-tx
 
# Start the power monitoring
/path/to/getJTX1measurements.py /path/to/out/file.csv &
 
# Collect some seconds before starting the application
sleep 10s
 
# Run the application
# For example, here with 4 MPI ranks and 1 CPU per rank (matching the #SBATCH settings above)
srun --cpus-per-task=1 --ntasks=4 ./path/to/your/binary
 
# Collect some seconds after the application finishes
sleep 10s
 
# Stop the power monitoring
pkill -f -INT getJTX1measurements.py

Merlin

General Information

On the Merlin cluster, power data is provided by the BMC of the motherboard via sysfs. The specific files are the following:

  • Power data
    • /sys/devices/platform/APMC0D29:00/hwmon/hwmon0/power1_input
    • /sys/devices/platform/APMC0D29:00/hwmon/hwmon0/power2_input
  • Sensor label
    • /sys/devices/platform/APMC0D29:00/hwmon/hwmon0/power1_label
    • /sys/devices/platform/APMC0D29:00/hwmon/hwmon0/power2_label

All power values are reported in uW (microwatts). To find out what a given powerN_input sensor measures, read the contents of the corresponding powerN_label file.
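
The pairing of powerN_input and powerN_label files described above can be sketched in Python as follows. This is a minimal illustration, not part of the original tooling: the function name read_power_sensors is hypothetical, and it assumes every powerN_input file holds a single integer value in uW.

```python
from pathlib import Path


def read_power_sensors(hwmon_dir):
    """Return {label: watts} for every powerN_input file found in hwmon_dir."""
    readings = {}
    for input_file in sorted(Path(hwmon_dir).glob("power*_input")):
        label_file = input_file.with_name(
            input_file.name.replace("_input", "_label")
        )
        # Fall back to the input file name if no matching label file exists
        if label_file.exists():
            label = label_file.read_text().strip()
        else:
            label = input_file.name
        # Assumption: the file contains one integer, expressed in uW
        microwatts = int(input_file.read_text().strip())
        readings[label] = microwatts / 1_000_000  # uW -> W
    return readings
```

On a Merlin node, calling read_power_sensors("/sys/devices/platform/APMC0D29:00/hwmon/hwmon0") would return one entry per sensor, keyed by whatever text the label files contain.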

In summary, the following can be measured:

  • CPU power consumption in uW (power1_input)
  • I/O power consumption in uW (power2_input)

How to obtain the power data

In this case, a simple script that periodically reads the contents of these files is enough. Here is the script:

getMerlinMeasurements.sh

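#!/bin/bash

# Sample CPU and I/O power from sysfs and append the readings to a CSV file.
if [ $# -ne 1 ]
then
    echo "USAGE: $(basename "$0") OUTPUT_FILE"
    exit 1
fi

myOutFile=$1
cpuFile=/sys/devices/platform/APMC0D29:00/hwmon/hwmon0/power1_input
ioFile=/sys/devices/platform/APMC0D29:00/hwmon/hwmon0/power2_input

# Sampling rate; the delay between samples is 1/measuresPerSecond seconds
measuresPerSecond=4
delay=$(echo "1 / $measuresPerSecond" | bc -l)

# Write the CSV header
echo "timestamp, cpu_power_uW, io_power_uW" > "$myOutFile"

# Keep sampling until the script is killed from the job script
while true
do
    echo "$(date +%s.%N), $(cat "$cpuFile"), $(cat "$ioFile")" >> "$myOutFile"
    sleep "${delay}s"
done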

Then, your job script needs to be modified as follows:

#!/bin/bash
 
#SBATCH -t 30:00
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH -o out/powerTraceJob-%j.out
#SBATCH -e err/powerTraceJob-%j.err
#SBATCH -J powerTraceJobName
#SBATCH --partition=merlin
 
# Start the power monitoring
/path/to/getMerlinMeasurements.sh /path/to/out/file.csv &
myPid=$!
 
# Collect some seconds before starting the application
sleep 10s
 
# Run the application
# For example, here with 4 MPI ranks and 1 CPU per rank (matching the #SBATCH settings above)
srun --cpus-per-task=1 --ntasks=4 ./path/to/your/binary
 
# Collect some seconds after the application finishes
sleep 10s
 
# Stop the power monitoring
kill $myPid
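
Once the job has finished, the CSV written by getMerlinMeasurements.sh can be summarized offline. The following Python sketch is one possible way to do it and is not part of the original scripts: the function name summarize_power_trace is hypothetical, and it assumes the file has the "timestamp, cpu_power_uW, io_power_uW" header shown above, with timestamps in seconds and power in uW. Energy is estimated by trapezoidal integration of the power samples over time.

```python
import csv


def summarize_power_trace(lines):
    """Return {column: {"energy_J", "avg_power_W"}} for a Merlin power trace.

    `lines` is any iterable of CSV text lines, e.g. an open file.
    """
    # The script writes ", " as separator, so skip the space after each comma
    reader = csv.reader(lines, skipinitialspace=True)
    header = next(reader)
    rows = [[float(value) for value in row] for row in reader if row]
    if len(rows) < 2:
        raise ValueError("need at least two samples")
    duration = rows[-1][0] - rows[0][0]
    if duration <= 0:
        raise ValueError("timestamps must be increasing")

    summary = {}
    for col, name in enumerate(header[1:], start=1):
        # Trapezoidal rule: power (uW) integrated over time (s) gives uJ
        energy_uj = sum(
            (b[0] - a[0]) * (a[col] + b[col]) / 2
            for a, b in zip(rows, rows[1:])
        )
        energy_j = energy_uj / 1e6
        summary[name] = {
            "energy_J": energy_j,
            "avg_power_W": energy_j / duration,
        }
    return summary
```

A typical use would be to open the output file and pass the file object directly, for example `with open("/path/to/out/file.csv") as f: print(summarize_power_trace(f))`. Note that the summary also covers the idle seconds recorded before and after the application, so those samples should be trimmed first if only the application's consumption is of interest.
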
wiki/prototype/power_monitor.txt · Last modified: 2019/03/13 16:49 by kpeiro