Configure your environment on the Oculus cluster

To use the Oculus cluster, you need to configure the environment for the CCS workload scheduler that is used to submit jobs to the cluster. This needs to be done only once.

# log into Oculus frontend server

ssh fe.pc2.uni-paderborn.de

ccsgenrcfiles hpc-lco-plessl

# Please verify that you are also a member of the 'hpc-lco-plessl' Unix group
# Check whether the output of the following command contains 'hpc-lco-plessl'

groups

Note: This setup is for participants of the High-Performance Computing Lecture by Prof. Christian Plessl.

Logging into the Oculus cluster, compiling and running a basic MPI program

We will be using Intel MPI and the Intel compiler here. OpenMPI and gcc work similarly.

 

# log into Oculus frontend server
ssh fe.pc2.uni-paderborn.de

# load desired MPI version and compiler
module load intel-mpi
module load intel/compiler/15.0.1

# compile application
mpiicc -g -Wall -std=c99 -o global-sum-simple global-sum-simple.c

# submit interactive job to scheduler, run with 8 MPI ranks
ccsalloc -I --res=rset=8:ncpus=1 impi -- ./global-sum-simple
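If you do not have the lecture's source file at hand, a minimal global-sum program along the following lines can serve as a stand-in for global-sum-simple.c. This is only a sketch using MPI_Reduce; the program handed out in the lecture (e.g. a hand-written tree reduction) may look different.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each rank contributes its rank number */
    int local_value = rank;
    int global_sum = 0;

    /* collective reduction: rank 0 receives the sum over all ranks */
    MPI_Reduce(&local_value, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum over %d ranks: %d\n", size, global_sum);

    MPI_Finalize();
    return 0;
}

Compile and run it with the mpiicc and ccsalloc commands shown above.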

 

Copying files to Oculus home directory

You can use scp (secure copy) to transfer files from your computer to Oculus.

 

# copy file my-program.c to Oculus home directory
scp my-program.c yourusername@fe.pc2.uni-paderborn.de:

 

Mounting the Oculus home directory on your computer

You can directly mount your home directory on Oculus on your local computer via CIFS/SMB. If your username is alberto, the path for mounting is:

 

smb://fs-cifs.uni-paderborn.de/upb/departments/pc2/users/a/alberto

 

Note that the last component of the path is your username and the component before it is the first letter of your username.

On a Mac you can mount this server in Finder using the "Go → Connect to Server" command. On Windows there is also a similar command. On Linux you can mount the remote filesystem using smbmount.

Use vTune for performance profiling

Intel vTune allows you to identify performance-critical parts of applications. For example, vTune can determine how well the application is making use of multi-threading, vectorization, caches, etc.

vTune can run on any node of the Oculus cluster, but for detailed analysis a special configuration of the server is required. The OpenCCS workload manager offers a special resource class named 'vtune' for this purpose, which can be used to allocate a machine with vTune support. vTune can be run both interactively and in batch mode, where the application characteristics are collected for later, interactive analysis. For first steps, it is recommended to use the interactive mode.

 

# allocate an interactive session on a node that supports Intel vTune
# reserve the node for 1 hour, get exclusive access to the node
ssh fe.pc2.uni-paderborn.de
ccsalloc -t 1h --res=rset=vtune=t,place=:excl -I

# after a while you will be logged into a node with a hostname like washington005
# this interactive shell does however not support X forwarding
# hence, you need an additional login to the frontend and an ssh connection to the host
# with the interactive session
ssh -X fe.pc2.uni-paderborn.de
ssh -X washington005

# load Intel Parallel Studio
# compile program (use option -g to include debug information to allow vTune to
# associate hotspots with code)
module load ps_xe_2017
icc -std=c99 -g -O2 -fopenmp -o my_program my_program.c

# run vTune
# create a new project and start a new analysis
# configure name and command line parameters of the application
amplxe-gui &

# investigate the different analysis types of vTune
# -> basic hotspots
# -> advanced hotspot analysis with OpenMP regions
# -> concurrency analysis
# -> HPC analysis (floating point, parallel regions, etc.)
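If you just want to try out the vTune workflow, any compute-bound code will do as my_program.c. The sketch below is a hypothetical stand-in (not part of the course material) whose OpenMP-parallel loop gives vTune a clear hotspot to attribute to source lines.

#include <stdio.h>
#include <math.h>
#include <omp.h>

#define N 50000000

int main(void) {
    double sum = 0.0;

    /* hotspot: OpenMP-parallel loop over a transcendental function */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += sin(i * 1.0e-6);

    printf("result = %f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}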

 

Analyze communication patterns of MPI applications with the built-in profiling and statistics methods of the Intel MPI library (low overhead)

 

# load Intel Parallel Studio module and compile program
ssh fe.pc2.uni-paderborn.de
module load ps_xe_2017
mpiicc -g -O2 -fopenmp -o mpi_application mpi_application.c

# run job on 8 nodes, each having 1 MPI rank with 4 OpenMP threads
# we need a workaround here, because CCS does not yet know about the updated
# Intel MPI provided with Intel Parallel Studio (module ps_xe_2017)
ccsalloc -I --res=rset=8:ncpus=8:ompthreads=4,place=scatter:shared
module load ps_xe_2017
mpirun -r ssh -machinefile $CCS_NODEFILE -genv I_MPI_STATS 3 -genv I_MPI_STATS_SCOPE all ./mpi_application

 

After the application has terminated, you will find an analysis of the communication behavior in the generated file 'stats.txt'.
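If you need a test program for the statistics run (and for the trace run described below), the following sketch can serve as a hypothetical mpi_application.c; it is not part of the course material. It alternates an OpenMP compute phase with an MPI_Allreduce, so both the statistics output and the trace contain visible communication.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

#define ITERATIONS 100
#define N 1000000

int main(int argc, char *argv[]) {
    int provided, rank, size, it, i;
    double local, global = 0.0;

    /* request FUNNELED support: MPI is only called outside OpenMP regions */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (it = 0; it < ITERATIONS; it++) {
        /* local OpenMP-parallel compute phase */
        local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (i = 0; i < N; i++)
            local += (double)(i % (rank + 1));

        /* communication phase: shows up in stats.txt and in the trace */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }

    if (rank == 0)
        printf("final value: %f (%d ranks, %d threads per rank)\n",
               global, size, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}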

MPI communication analysis with Intel Trace Collector and Analyzer (high overhead but more detailed results)

A detailed analysis of the MPI communication behavior of an application can be obtained using the Intel Trace Collector and Trace Analyzer. To this end, the MPI launcher is started with the option '-trace', which generates a trace file for the application that can be analyzed after the execution of the application.

 

ssh fe.pc2.uni-paderborn.de
module load ps_xe_2017
mpiicc -g -O2 -fopenmp -o mpi_application mpi_application.c

# run job on 4 nodes, each having 1 MPI rank that spawns 4 OpenMP threads
# we need a workaround here, because CCS does not yet know about the updated
# Intel MPI provided with Intel Parallel Studio (module ps_xe_2017)
ccsalloc -I --res=rset=4:ncpus=4:ompthreads=4,place=scatter:shared
module load ps_xe_2017
mpirun -r ssh -machinefile $CCS_NODEFILE -trace ./mpi_application

# after the application has terminated we can analyze the results
traceanalyzer mpi_application.stf &