wiki:prototype:power_monitor

Differences

This shows you the differences between two versions of the page.

wiki:prototype:power_monitor [2017/08/31 17:21]
127.0.0.1 external edit
wiki:prototype:power_monitor [2019/03/13 16:49] (current)
kpeiro
Line 1: Line 1:
 ====== Power Monitoring on mini-clusters ======
- 
-===== ThunderX ===== 
- 
-==== Setup Information ==== 
- 
-    * Yokogawa WT230 Digital Power Meter 
-      * [[http://www.yokogawa.co.jp/ftp/dist/ks/eusers/wt/ked5s/im/IM760401-01E_041.pdf|Manual]] and [[http://tmi.yokogawa.com/files/uploaded/WT210.pdf|White Paper]]
-      * Model number: 760503 
-      * It measures power data (V, W, A) for the 4 Thunder nodes together, because they all share a single power supply.
-      * Its LED panel shows the measurements in real time.
-      * It provides a serial interface: when you send it the appropriate query, it returns the instantaneous power data measured at the moment of the query.
-    * HCA server 
-      * It is connected to the Yokogawa Power Meter via serial cable. 
-    * Thunder nodes 
-      * Their power supply is connected to the Yokogawa so that it can collect power measurements.
-      * Connected to HCA through a 2x10GbE connection
- 
-==== How to obtain the power data ==== 
- 
-Modify the job script so that the HCA server starts collecting power data from the Yokogawa before your application starts running on the nodes and stops collecting after the execution ends. Please note that:
- 
-    * Since the power data for all the nodes is gathered together, **allocating all 4 nodes for the job is mandatory**, even if only one of them is going to be used.
-    * It is strongly recommended to gather power data for about 10 seconds before your application starts and about 10 seconds after it finishes, in order to compare the static power against the dynamic power.
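
As a rough illustration of that comparison, the sketch below averages the samples taken outside and inside the application window. It is not part of the original page: the function name ''power_summary'' is hypothetical, and the CSV layout (one header line, then ''epoch_seconds,watts'') is an assumption, since the output format of the measurement script is not described here. Adapt the column numbers to the real file.

```shell
#!/bin/bash
# Hypothetical helper: average power outside and inside the application window.
# ASSUMED CSV layout (not confirmed by this page): one header line, then
#   epoch_seconds,watts
# app_start and app_end are epoch seconds recorded around the srun call.
power_summary() {
    local csv="$1" app_start="$2" app_end="$3"
    awk -F, -v s="$app_start" -v e="$app_end" '
        NR == 1 { next }                              # skip the header line
        $1 >= s && $1 <= e { run += $2; nr++; next }  # samples while the application runs
        { idle += $2; ni++ }                          # samples before and after it
        END {
            if (ni > 0) printf "idle_avg_w=%.2f\n", idle / ni
            if (nr > 0) printf "run_avg_w=%.2f\n", run / nr
        }' "$csv"
}
```

Under these assumptions, the difference between ''run_avg_w'' and ''idle_avg_w'' gives a first estimate of the power added by the application on top of the idle draw of the four nodes.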
- 
-You will need the following script: 
- 
-    * [[https://oc.hca.bsc.es/owncloud/index.php/s/1RWwuPEO3EO8PZn|getYokogawaMeasurements.py]]
- 
-Then prepare your job script, making sure that:
- 
-    * You allocate all four Thunder nodes.
-    * You execute only on the nodes you want to use.
-    * The power gathering starts N seconds before your application does.
-    * The power gathering ends N seconds after your application does.
- 
-At the end, your job script should look similar to this:
- 
-<code bash>#!/bin/bash
- 
-#SBATCH -t 30:00
-#SBATCH --ntasks=8
-#SBATCH --cpus-per-task=48
-#SBATCH -o out/powerTraceJob-%j.out
-#SBATCH -e err/powerTraceJob-%j.err
-#SBATCH -J powerTraceJobName
-#SBATCH --partition=thunder
- 
-export OMP_NUM_THREADS=48
- 
-# Start the power monitoring
-ssh -f hca "/path/to/getYokogawaMeasurements.py /dev/ttyUSB0 /path/to/output/file.csv"
- 
-# Collect some seconds of data before starting the application
-sleep 10s
- 
-# Run the application
-# For example, to run on one node only, with 2 MPI ranks and 48 OpenMP threads per rank
-srun --cpus-per-task=48 --ntasks=2 ./path/to/your/binary
- 
-# To run on two nodes, with a total of 4 MPI ranks and 48 OpenMP threads per rank
-srun --cpus-per-task=48 --ntasks=4 ./path/to/your/binary
- 
-# Other configurations are possible, but they are not covered here.
-# In case of doubt, contact hca.sysadmin@bsc.es
- 
-# Collect some seconds of data after the application finishes
-sleep 10s
- 
-# Stop the power monitoring
-ssh hca "pkill -f -INT getYokogawaMeasurements.py"
-</code>
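
The job script above stops the monitoring only if it reaches its last line. A possible variation, given here as a sketch and not as part of the original page, registers the stop command with a bash ''trap'' so that it also runs when the script exits early. The names ''MONITOR_HOST'', ''REMOTE_RUNNER'', ''stop_monitor'' and ''guard_monitoring'' are hypothetical; the ''pkill'' command is the one from the job script.

```shell
#!/bin/bash
# Sketch only: stop the power logger whenever the batch script exits.
# MONITOR_HOST and REMOTE_RUNNER are hypothetical variables added for
# illustration; the pkill command is the one used in the job script.
MONITOR_HOST="${MONITOR_HOST:-hca}"
REMOTE_RUNNER="${REMOTE_RUNNER:-ssh}"

stop_monitor() {
    "$REMOTE_RUNNER" "$MONITOR_HOST" "pkill -f -INT getYokogawaMeasurements.py"
}

# Call this once, right after starting the power monitoring.
# It makes bash run stop_monitor when the script exits, whether it
# reaches the end or stops early through an explicit exit.
guard_monitoring() {
    trap stop_monitor EXIT
}
```

With this in place, the final ''ssh hca "pkill ..."'' line of the job script would no longer be needed, because the trap sends the same signal on exit. Setting ''REMOTE_RUNNER=echo'' gives a dry run that prints the command without contacting the HCA server.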
- 
 ===== Jetson-TX1 =====
  
wiki/prototype/power_monitor.txt · Last modified: 2019/03/13 16:49 by kpeiro