Arm Instruction Emulator

General Information

The Arm Instruction Emulator emulates all SVE instructions when running on Armv8-A compatible hardware. Note that it does not emulate instructions from the Armv8.x extensions, namely Armv8.1 and Armv8.2.

How to use it

NOTE: This tutorial uses the Merlin platform, but the steps are the same for any of the Armv8-A clusters available at BSC. It also uses Lulesh as the example application, but the procedure works for any other application that can be compiled with the Arm HPC Compiler (i.e., an LLVM-based compiler).

Prepare your binary

First of all, you need to compile your application with the Arm HPC Compiler. To do so, load the required modules:

# Make sure you don't have any other module loaded
druiz@merlin-1:~/armIE_tuto/LULESH$ module purge
remove ARMIE/1.2.1 (PATH, LD_LIBRARY_PATH, CPATH)
remove HPC_COMPILER/1.4 (PATH, LD_LIBRARY_PATH, CPATH)
 
# Load the Arm HPC Compiler and the Arm Instruction Emulator modules
druiz@merlin-1:~/armIE_tuto/LULESH$ module load HPC_COMPILER ARMIE
load HPC_COMPILER/1.4 (PATH, LD_LIBRARY_PATH, CPATH)
load ARMIE/1.2.1 (PATH, LD_LIBRARY_PATH, CPATH)

Next, edit your build configuration to use armclang, armclang++ or armflang, the compilers shipped with the Arm HPC Compiler. Lulesh is a C++ application, so we use armclang++. We also need to tell the compiler to emit SVE instructions. After these changes, the compiler declaration and its flags should look something like this:

...
 
CXX = armclang++
CXXFLAGS = -O3 -mcpu=native -march=armv8-a+sve -ffp-contract=fast
 
...

Then compile your binary:

make -j8

At this point, you should have a binary that uses SVE instructions.
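One optional way to inspect the result is to look for SVE register names in the disassembly. The following is a rough sketch, assuming an objdump build that can decode SVE (for example a recent GNU binutils). The helper name `count_sve_lines` is hypothetical, and the pattern is only a heuristic: it counts lines that mention `z` or `p` registers with an element-size suffix.

```shell
# Hypothetical helper: count disassembly lines (read from stdin) that mention
# SVE vector (z0-z31) or predicate (p0-p15) registers with an element suffix,
# e.g. "z0.d" or "p1.s". This is a heuristic lower bound, not an exact count.
count_sve_lines() {
    # grep exits with status 1 when the count is 0, so force a zero status.
    grep -E -c '(^|[^[:alnum:]_])[zp][0-9]+\.[bhsdq]' || true
}

# Example (assumes objdump can decode SVE):
#   objdump -d ./lulesh2.0 | count_sve_lines
```

A result of 0 suggests that the compiler did not emit SVE code, for example because the -march flag was not picked up by the build.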

MPI Applications

If your code uses MPI, you will need to compile it with the Arm HPC Compiler version of your MPI library. To display which MPI flavors are available, use the module avail command (see Environment Modules Usage for more information).

# Load the Arm HPC Compiler, the MPI library and the Arm Instruction Emulator modules
druiz@merlin-1:~/armIE_tuto/LULESH$ module load HPC_COMPILER/1.3 openmpi/2.1.1_arm_compiler ARMIE
load HPC_COMPILER/1.3 (PATH, LD_LIBRARY_PATH, CPATH)
load openmpi/2.1.1_arm_compiler (PATH, MANPATH, LD_LIBRARY_PATH)
load ARMIE/1.2.1 (PATH, LD_LIBRARY_PATH, CPATH)

The compiler might fail with an error such as version `GLIBCXX_3.4.21' not found. If this happens, also load the GCC 7 module:

# Load the Arm HPC Compiler, the MPI library, the Arm Instruction Emulator and the GCC 7 modules
druiz@merlin-1:~/armIE_tuto/LULESH$ module load HPC_COMPILER/1.3 openmpi/2.1.1_arm_compiler ARMIE gcc/7.1.0
load HPC_COMPILER/1.3 (PATH, LD_LIBRARY_PATH, CPATH)
load openmpi/2.1.1_arm_compiler (PATH, MANPATH, LD_LIBRARY_PATH)
load ARMIE/1.2.1 (PATH, LD_LIBRARY_PATH, CPATH)
load gcc/7.1.0 (PATH, MANPATH, LD_LIBRARY_PATH)

Running your binary

The first thing to do is to double-check that your binary actually includes SVE instructions. The fastest and easiest way is simply to execute it. Since no Armv8-A SoC implements the SVE extension at the time of writing, the run should abort like this:

druiz@merlin-1:~/armIE_tuto/LULESH$ ./lulesh2.0 -i 10 -q
Illegal instruction
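This check can also be scripted. Below is a minimal sketch, assuming a Linux shell where a process killed by a signal reports exit status 128 plus the signal number (SIGILL is signal 4, hence 132). The function name `died_with_sigill` is hypothetical.

```shell
# Hypothetical helper: run a command and report whether it died with SIGILL.
# Assumes the shell convention of exit status 128 + signal number (SIGILL = 4).
died_with_sigill() {
    status=0
    "$@" >/dev/null 2>&1 || status=$?
    if [ "$status" -eq 132 ]; then
        echo "SIGILL: the binary probably contains instructions this CPU lacks"
        return 0
    fi
    echo "no SIGILL (exit status $status)"
    return 1
}

# Example:
#   died_with_sigill ./lulesh2.0 -i 10 -q
```

An exit status of 132 only shows that some unsupported instruction was reached. It does not prove that the instruction belongs to SVE.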

Once we know for sure that our code contains SVE instructions, we can execute it with the Arm Instruction Emulator. It emulates each SVE instruction the program executes, so the execution time will be longer than on hardware with native support.

The Arm Instruction Emulator accepts different options:

druiz@merlin-1:~/armIE_tuto/LULESH$ armie --help
Execute binaries containing SVE instructions on ARMv8-A hardware
 
Usage:
  armie [flags] -- <command to execute>
 
Examples:
  armie -msve-vector-bits=256 -- ./sve_program
  armie -msve-vector-bits=2048 --debug -- ./sve_program <flags for sve_program>
 
Flags:
  -m<string>                    Architecture specific options. Supported options:
    -msve-vector-bits=<uint>    Vector length to use. Must be a multiple of 128 bits up to 2048 bits
    -mlist-vector-lengths       List all valid vector lengths
  -d, --debug                   Enables assertion checks in the emulator to help isolate and diagnose bugs
                                Cannot be used with -p, --profile-period
  -s, --stats                   Enables statistics about the emulated SVE instructions
  -o, --output <file name>      Redirects all messages generated by armie to a file
  -p, --profile-period <uint>   Enables the performance profiler and sets the sampling period (in microseconds)
                                Cannot be used with -d, --debug
  -h, --help                    Prints this help message
  -V, --version                 Prints the version
 
Experimental flags:
Note: these features are experimental and may be removed or changed significantly in future versions
  -m<string>                    Architecture specific options. Supported options:
    -msve-memtrace=<string>     Sets the output format for memory access tracing for SVE instructions
                                Must be one of "none" (default), "text" or "binary"

The most important ones are:

  • -msve-vector-bits=<uint>
    • To specify the width of the SVE vector registers
  • -mlist-vector-lengths
    • To list the vector lengths accepted by the option -msve-vector-bits
  • -s, --stats
    • To print statistics about the emulated SVE instructions
  • -o, --output <file name>
    • To write the output generated by armie to a file
  • -p, --profile-period <uint>
    • To enable the performance profiler and set the sampling period in microseconds

Now, it is time to execute our application with the Arm Instruction Emulator.

druiz@merlin-1:~/armIE_tuto/LULESH$ armie -msve-vector-bits=1024 -s -o armie_lulesh_1024b.txt -p 100 -- ./lulesh2.0 -i 10
Running problem size 30^3 per domain until completion
Num processors: 1
Num threads: 8
Total number of elements: 27000
 
To run other sizes, use -s <integer>.
To run a fixed number of iterations, use -i <integer>.
To run a more or less balanced region set, use -b <integer>.
To change the relative costs of regions, use -c <integer>.
To print out progress, use -p
To write an output file for VisIt, use -v
See help (-h) for more options
 
Run completed:
   Problem size        =  30
   MPI tasks           =  1
   Iteration count     =  10
   Final Origin Energy = 7.011263e+06
   Testing Plane 0 of Energy Array on rank 0:
        MaxAbsDiff   = 4.365575e-11
        TotalAbsDiff = 4.375256e-11
        MaxRelDiff   = 3.128945e-11
 
Elapsed time         =      12.23 (s)
Grind time (us/z/c)  =  45.300948 (per dom)  ( 45.300948 overall)
FOM                  =  22.074593 (z/s)

In this case, the output generated by armie is written to armie_lulesh_1024b.txt. The performance profiler output is written to a file named ${binaryName}_${idNumber}.samples.
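To compare several vector lengths, the same command can be generated in a loop. This is a minimal sketch, not part of armie itself: the helper name `armie_sweep_cmds` and the output file naming are assumptions. The lengths used are a subset of the valid ones (multiples of 128 bits up to 2048, according to the help text above). The helper only prints the commands, so nothing runs until the output is piped to a shell.

```shell
# Hypothetical helper: print one armie command per SVE vector length.
# Arguments: the program to emulate, followed by its own flags.
# The commands are only printed; pipe the output to `sh` to actually run them.
armie_sweep_cmds() {
    for bits in 128 256 512 1024 2048; do
        echo "armie -msve-vector-bits=$bits -s -o armie_${bits}b.txt -- $*"
    done
}

# Example:
#   armie_sweep_cmds ./lulesh2.0 -i 10        # review the commands
#   armie_sweep_cmds ./lulesh2.0 -i 10 | sh   # run them one after another
```

Writing each run to its own file keeps the statistics for different vector lengths separate.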

Understanding armie output

Stats

Earlier, we generated a file with the output of armie. Its contents look like this:

0x403b2c: 0x25ed1fe0 80
0x403b30: 0x25f8c000 80
0x403b38: 0x04f0e3ec 18640
0x403b3c: 0xe5ee4120 18640
0x403b40: 0xe5ee4140 18640
...
0x40d2b8: 0xe540e500 12751
0x40d2bc: 0x25584022 12751
0x40d2c0: 0x04b0e3e9 12751
0x40d2c4: 0x04285028 12751
0x40d2c8: 0x25824841 12751
armie exiting after executing 12875317 SVE instructions at a rate of 13994.91 per fault.

The first column indicates the address of the instruction, the second one is the instruction encoding and the third is the number of times the specified instruction was executed. On the last line we can see a summary of how many SVE instructions were executed in total as well as the rate of instructions executed per fault.
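Given this three-column layout, the statistics file can be summarized with standard tools. The sketch below assumes exactly the format shown above ("address: encoding count" lines followed by a summary line). The helper names `top_sve_instructions` and `total_sve_count` are hypothetical.

```shell
# Hypothetical helper: print the N most executed instructions of a stats file.
# Usage: top_sve_instructions <stats file> [N]   (N defaults to 5)
top_sve_instructions() {
    file=$1
    n=${2:-5}
    # Keep only "address: encoding count" lines, then sort by the count
    # (third field) in descending numeric order.
    grep -E '^0x[0-9a-f]+: 0x[0-9a-f]+ [0-9]+$' "$file" | sort -k3,3nr | head -n "$n"
}

# Hypothetical helper: sum the execution counts of all listed instructions.
total_sve_count() {
    awk '/^0x/ { total += $3 } END { print total + 0 }' "$1"
}

# Example:
#   top_sve_instructions armie_lulesh_1024b.txt 10
#   total_sve_count armie_lulesh_1024b.txt
```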

Performance Profiler

The contents of the performance profiler output look like this:

0x404bcc 38590
0x4092f0 37356
0x4067dc 37120
0x404028 31272
0x40401c 13254
...
0x401500 1
0x4014f0 1
0x4014b8 1
0x401230 1
0x401160 1

In this case, the first column indicates the address of the instruction, while the second is the number of times the instruction was executed.

This format lets us feed the addresses to the addr2line and addr2func commands to find where those instructions are located in our code and therefore identify hotspots. Grouped by function, the result looks like this:

VerifyAndWriteFinalOutput: 1
CalcElemFBHourglassForce: 92
CollectDomainNodesToElemNodes: 56
Domain::delv_xi: 1
Domain: 124
...
std::__fill_n_a<int*, unsigned long, int>: 218
Domain::x: 8
EvalEOSForElems: 2
SumElemStressesToNodeForces: 42
$x: 2
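The address lookup step can be sketched with standard tools. The example below assumes the two-column samples format shown above. The helper name `hottest_addresses` and the samples file name in the usage comment are hypothetical. It relies on the addr2line options -f (print function names), -C (demangle C++ names) and -e (select the executable).

```shell
# Hypothetical helper: print the N most-sampled addresses of a .samples file.
# Usage: hottest_addresses <samples file> [N]   (N defaults to 10)
hottest_addresses() {
    file=$1
    n=${2:-10}
    # Each line is "<address> <count>": sort by count in descending numeric
    # order and keep only the address column.
    sort -k2,2nr "$file" | head -n "$n" | awk '{ print $1 }'
}

# Example (the samples file name is made up):
#   hottest_addresses lulesh2.0_1234.samples 5 | xargs addr2line -f -C -e ./lulesh2.0
```

Function names come from the symbol table, but source file and line numbers are only meaningful if the binary was built with debug information (-g).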
wiki/prototype/sve_emulation.txt · Last modified: 2017/12/19 08:30 (external edit)