Mont-Blanc Prototype

We currently have two versions of the cluster deployed: one for production and another for testing.

For information about the Mont-Blanc mini-clusters, please refer to this other wiki.

FAQ

How do I submit a job to the prototype?

You can find more information about how to use the job scheduler (SLURM) here.

Which software is installed at the Mont-Blanc prototype?

The easiest way to find out which software is available on the Mont-Blanc prototype is to run the command module avail or to check the Mont-Blanc prototype information page. You can also check our News section for newly installed software and for any other changes we make.
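On many installations the module command writes its listing to standard error rather than standard output, so piping it straight into grep shows nothing. The sketch below redirects the output first. The function name find_module is only illustrative and is not part of the prototype's tooling.

```shell
# Search the list of available modules for a name (case-insensitive).
# `module avail` often prints to stderr, so redirect it before filtering.
# find_module is a hypothetical helper name.
find_module() {
    module avail 2>&1 | grep -i -- "$1"
}
```

For example, find_module mpi would list every available module whose name contains "mpi".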

How can I get power data from my jobs?

:!: The Mont-Blanc Power Accounting plugin for SLURM is not working properly at the moment. The energy reported by the sacct command could be incorrect :!:

It depends on the kind of data you want. If you only care about the energy-to-solution of your job, simply run:

sacct -j ${JOBID} -o jobid,jobname,partition,alloccpus,state,exitcode,consumedenergy

If you need a power profile, get the start and end times of your job and query the Power Monitor as explained here.
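As a sketch of the first step, the start and end times can be read from sacct itself. The -X, -n and -P options and the Start and End fields are standard sacct features. The helper name job_window is made up here, and GNU date is assumed. How the resulting timestamps are passed to the Power Monitor depends on the linked instructions, so that part is left out.

```shell
# Read one `JobID|Start|End` line on stdin (the format produced by `sacct -n -P`)
# and print "start_epoch end_epoch duration_seconds".
# job_window is a hypothetical helper; timestamps are interpreted in the local time zone.
job_window() {
    IFS='|' read -r jobid start end || return 1
    # sacct prints ISO timestamps such as 2015-03-01T10:00:00; swap the T for a space
    start_s=$(date -d "$(printf '%s' "$start" | tr 'T' ' ')" +%s) || return 1
    end_s=$(date -d "$(printf '%s' "$end" | tr 'T' ' ')" +%s) || return 1
    echo "$start_s $end_s $((end_s - start_s))"
}

# Intended use on the cluster (requires SLURM):
#   sacct -j "${JOBID}" -X -n -P -o jobid,start,end | job_window
```

If the job is still running, sacct reports its end time as Unknown, date fails to parse it, and the function returns a non-zero status instead of printing a window.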

Why is the ''sacct'' command reporting 4.29M as consumed energy for my jobs?

When the consumed energy field shows 4.29M, the computation of the energy-to-solution failed for some reason. The most probable cause is that the power traces from some nodes are missing because of an issue with the power database.
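To spot affected steps across many jobs, the marker value can be filtered out of sacct's machine-readable output. This is a minimal sketch: failed_energy_steps is a hypothetical helper name, and it assumes the failure always appears as the literal string 4.29M described above.

```shell
# Read `JobID|ConsumedEnergy` lines on stdin (as produced by `sacct -n -P`)
# and print the JobID of every step whose energy field is the 4.29M marker.
# failed_energy_steps is a hypothetical helper name.
failed_energy_steps() {
    awk -F'|' '$2 == "4.29M" { print $1 }'
}

# Intended use on the cluster (requires SLURM):
#   sacct -j "${JOBID}" -n -P -o jobid,consumedenergy | failed_energy_steps
```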

Why is the ''sacct'' command reporting 0 as consumed energy for my jobs?

It depends on where the 0 J appears in the output. For example:

       JobID    JobName  Partition  AllocCPUS   NNodes      State ExitCode ConsumedEnergy
------------ ---------- ---------- ---------- -------- ---------- -------- --------------
219999       power_mon+ mb-priori+          2        1  COMPLETED      0:0
219999.batch      batch                     1        1  COMPLETED      0:0              0
219999.0     dense-mat+                     2        1  COMPLETED      0:0             28
220025       power_mon+ mb-priori+          2        1  COMPLETED      0:0
220025.batch      batch                     1        1  COMPLETED      0:0              0
220025.0     dense-mat+                     2        1  COMPLETED      0:0             71
220025.1     dense-mat+                     2        1  COMPLETED      0:0             51
220030       power_mon+ mb-priori+          2        1  COMPLETED      0:0
220030.batch      batch                     1        1  COMPLETED      0:0              0

Job 219999 reports 28 because its job script ran the binary with the srun command. As you can see, only job step 0 (JobID 219999.0) reports the data, while the batch job step reports 0.

Next, a job with two job steps was executed (JobID 220025). The consumed energy is reported for each job step.

Finally, the last job (JobID 220030) does not report any consumed energy because the srun command was not used to run the application inside the job script.

More information about the cases in which the consumed energy is computed and reported can be found here.
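Put together, a job script that gets its energy reported launches the application through srun rather than calling the binary directly. The sketch below writes such a script to a file. The file name energy_job.sh, the job name, the task count and the binary ./my_app are placeholders, not values taken from the prototype's documentation.

```shell
# Write a minimal batch script whose application runs as a job step via srun.
# All names below (energy_job.sh, energy_test, ./my_app) are placeholders.
cat > energy_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=energy_test
#SBATCH --ntasks=2

# Launching through srun creates job step 0, which is the row that
# carries ConsumedEnergy in the sacct output.
srun ./my_app
EOF

# Submit on the cluster (requires SLURM):
#   sbatch energy_job.sh
```

Each additional srun line in the script becomes its own job step (.1, .2, and so on), as job 220025 in the listing above shows.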

Is there any user-level configuration for MPI jobs?

Yes, both MPICH and OpenMPI offer the user a set of parameters that can be modified to change the behavior of the MPI implementation. More information about this can be found at this link.

How can I get a Paraver trace of my application?

There's no easy answer to this question, since it depends on your application. See our manual for further details.

How can I run my benchmark while ensuring that no other job is running on the cluster at the same time?

Good question. When benchmarking, the best results are obtained by ensuring that no other application is using the network, that is, by having the whole machine to yourself. The way to do this is described in this guide.

:!: Special user permissions are required to create reservations in SLURM. Before following the guide above, please contact hca.sysadmin@bsc.es and specify the application you want to benchmark, the total number of nodes, and the time you will need to reserve. :!:

I am facing some weird behavior and errors with the filesystem. Is there anything I can do to improve the overall performance of Lustre?

On the Mont-Blanc prototype, the distributed filesystem Lustre is used to provide /home and /apps. Lustre is known to have issues when dealing with small files, and not much can be done to solve this. Nevertheless, please check this page for some hints that can improve performance and may correct the errors you are facing.

Who should I contact if I have a question or a technical issue?

In those cases, please contact the System Administrators of the Mont-Blanc prototype.
