Performance-Monitoring for Supercomputers

Systementwurfs-Teamprojekt

You can find general information on the Systementwurfs-Teamprojekt (SET) at https://cs.uni-paderborn.de/ceg/teaching/courses/ws-202021/systementwurf-teamprojekt/.

Large supercomputers and HPC clusters cost millions of euros and are used by many different researchers running a wide variety of programs. For an HPC center such as the Paderborn Center for Parallel Computing (PC²), which installs, operates, and manages such systems for research, it is therefore very important to support researchers in using these systems efficiently.

One very important aspect of this support is monitoring the usage of the available resources to identify inefficiently compiled or inefficiently executed programs. This monitoring is done on a per-job basis, so that individual problematic compute jobs can be identified. PC² is currently running such a job-specific monitoring system in a closed beta test and is actively developing it. The current state is documented at https://wikis.uni-paderborn.de/pc2doc/Job-Specific-Monitoring.

The goal of this project is to advance this system in various directions.

For example:

The data is currently being collected in a time-series database (InfluxDB). Although this was rather simple to set up, it doesn't scale well to thousands of compute nodes. Hence, a replacement needs to be developed.
The automated analysis of the performance data is currently limited to averages over the whole job. However, a compute job is not homogeneous over time: it goes through different phases that may have different performance characteristics. Thus, the time-series data is to be split automatically into characteristic sections so that these sections can be analysed individually.
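
To make the idea of characteristic sections more concrete, here is a minimal sketch in Python. It is not the method used by the PC² monitoring system; it assumes a simple recursive mean-shift split (binary segmentation on squared error) as one possible approach. The function names, the parameters min_len and min_gain, and the sample values are all hypothetical.

```python
from statistics import fmean


def sse(values):
    """Sum of squared deviations of `values` from their mean."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def split_into_sections(samples, min_len=3, min_gain=1.0):
    """Recursively cut `samples` where a cut reduces the squared error most.

    Returns a list of (start, end) index pairs with `end` exclusive.
    `min_len` and `min_gain` are illustrative tuning knobs (assumptions):
    the shortest section allowed and the smallest error reduction that
    still justifies a cut.
    """
    def recurse(start, end):
        total = sse(samples[start:end])
        best_gain, best_cut = 0.0, None
        # Try every cut that leaves at least `min_len` samples on each side.
        for cut in range(start + min_len, end - min_len + 1):
            gain = total - sse(samples[start:cut]) - sse(samples[cut:end])
            if gain > best_gain:
                best_gain, best_cut = gain, cut
        if best_cut is None or best_gain < min_gain:
            return [(start, end)]
        return recurse(start, best_cut) + recurse(best_cut, end)

    return recurse(0, len(samples))


# Hypothetical CPU-load samples of one job (fraction of peak), made up
# for illustration: a start-up phase, a compute phase, and an I/O phase.
cpu_load = [0.10, 0.12, 0.11, 0.10,
            0.90, 0.92, 0.91, 0.89, 0.90,
            0.30, 0.31, 0.29, 0.30]

sections = split_into_sections(cpu_load, min_len=3, min_gain=0.05)
for start, end in sections:
    print(f"samples {start}..{end - 1}: mean load {fmean(cpu_load[start:end]):.2f}")
```

The average over the whole sample series would hide the fact that this job spends part of its runtime almost idle. Per-section averages expose each phase separately, which is what a per-section analysis would build on.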

The project language can be either German or English, depending on the participants' preferences. Due to

Interests:

databases
data analysis
web programming
high-performance computing

Goals/Challenges:

setup of a virtualized test system
development of a replacement for the time-series database that scales well
automated identification of characteristic sections
adaptation of the web frontend and rule-based analysis to characteristic subsections of compute jobs

Applications:

Python (mainly for data analysis)
PHP, Symfony (for extensions of the web frontend or API)
SQL (for extensions/modifications of the database containing the job meta data)
https://github.com/ClusterCockpit/ClusterCockpit
some language for the new database