Number Crunching with GPU-Tensor Cores

Systementwurfs-Teamprojekt

You can find general information on the Systementwurfs-Teamprojekt (SET) at https://cs.uni-paderborn.de/ceg/teaching/courses/ws-201920/systementwurf-teamprojekt/.

Algorithms in the area of machine learning (ML), especially deep learning, require many linear-algebra operations on multidimensional matrices and vectors. To accelerate these computations and increase their efficiency, NVidia introduced so-called tensor cores with its Volta architecture (NVidia YouTube video about tensor cores). This new kind of compute unit is tailored specifically to matrix-matrix multiplications. A GPU with tensor cores, for example an NVidia RTX 2080 Ti, can perform up to 100 TFlops (10¹⁴ floating-point operations per second in single/half precision) with its tensor cores while consuming only 250 watts of power.

At a comparable power usage, state-of-the-art CPU-based compute nodes can only perform around 6 TFlops (6·10¹² floating-point operations per second in single precision).
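
To put the two throughput figures side by side, the following short Python sketch converts them into energy efficiency (Gflop/s per watt). The 250 W assumed for the CPU node is only one reading of "comparable power usage" and is not a measured value; the function and variable names are illustrative.

```python
# Figures quoted in the text. The CPU power draw is an ASSUMPTION:
# "comparable power usage" is read here as the same 250 W as the GPU.
GPU_FLOPS = 100e12   # RTX 2080 Ti with tensor cores, flop/s
GPU_WATTS = 250.0
CPU_FLOPS = 6e12     # CPU-based compute node, flop/s
CPU_WATTS = 250.0    # assumed, not stated exactly in the text


def gflops_per_watt(flops: float, watts: float) -> float:
    """Return the energy efficiency in Gflop/s per watt."""
    return flops / watts / 1e9


gpu_eff = gflops_per_watt(GPU_FLOPS, GPU_WATTS)
cpu_eff = gflops_per_watt(CPU_FLOPS, CPU_WATTS)

print(f"GPU: {gpu_eff:.0f} Gflop/s per W")
print(f"CPU: {cpu_eff:.0f} Gflop/s per W")
print(f"ratio: {gpu_eff / cpu_eff:.1f}x")
```

Under these assumptions the GPU reaches about 400 Gflop/s per watt and the CPU node about 24 Gflop/s per watt, a factor of roughly 17. These are peak figures, and the GPU number refers to reduced precision, so real applications will typically see a smaller gap.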

This project will try to harness the computational power and efficiency of tensor cores for scientific programs from computational chemistry. Two computational hot-spots in the computational-chemistry code CP2K (https://www.cp2k.org/) have been identified and are expected to be suitable for acceleration with tensor cores.

The project will be advised by an expert from computational chemistry who will handle all chemistry-related details and the integration into the scientific code.

No knowledge of chemistry is required for this project. Several NVidia GPUs (RTX 2080 Ti) are already available.

The language for this project can be either German or English, depending on the participants.

Interests:

high-performance computing
GPU programming
programming in general (C or C++ experience is helpful)
acceleration of scientific applications
linear algebra

Goals:

Accelerate the quantum chemistry code CP2K with NVidia tensor cores on GPUs.
Two hot-spots:
- matrix-matrix multiplications for small matrices in the underlying library DBCSR (https://www.cp2k.org/dbcsr)
- computation of the matrix-sign-function in the submatrix method (https://arxiv.org/abs/1710.10899)
Performance modelling, implementation, optimization and testing of the optimized linear algebra methods
Publication of the results as open-source code
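
To illustrate why the matrix sign function is a plausible target for hardware built around matrix-matrix multiplication, here is a minimal Python sketch of the Newton-Schulz iteration X ← ½·X·(3I − X²). This is one well-known iterative scheme for sign(A) that needs only matrix products. It is not necessarily the algorithm used in CP2K or in the submatrix-method paper, and the helper names are illustrative.

```python
import math


def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def frobenius_norm(a):
    """Frobenius norm, an upper bound on the spectral radius."""
    return math.sqrt(sum(x * x for row in a for x in row))


def sign_newton_schulz(a, iterations=30):
    """Approximate sign(A) for a nonsingular symmetric matrix A.

    A is first scaled by its Frobenius norm so that all eigenvalues lie
    in [-1, 1], where the iteration is known to converge. Each step
    consists of two matrix-matrix multiplications.
    """
    n = len(a)
    norm = frobenius_norm(a)  # must be nonzero
    x = [[v / norm for v in row] for row in a]
    for _ in range(iterations):
        x2 = matmul(x, x)
        # y = 3*I - X^2
        y = [[(3.0 if i == j else 0.0) - x2[i][j] for j in range(n)]
             for i in range(n)]
        xy = matmul(x, y)
        x = [[0.5 * v for v in row] for row in xy]
    return x


# Small symmetric example with one positive and one negative eigenvalue.
A = [[2.0, 1.0], [1.0, -3.0]]
S = sign_newton_schulz(A)
print("sign(A)   =", S)
print("sign(A)^2 =", matmul(S, S))  # should be close to the identity
```

Because sign(A) has only the eigenvalues +1 and −1, its square should be close to the identity matrix, which gives a simple check. On a GPU, the two `matmul` calls per step would be the part handed to the tensor cores.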

Challenges:

programming tensor cores with cuBLAS, CUTLASS, or CUDA
understanding some basic concepts from linear algebra (eigenvalue problems, matrix-sign-function, submatrix method)
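
One aspect of the programming challenge is numerical: tensor cores are commonly described as computing D = A·B + C on small tiles with half-precision inputs and single-precision accumulation. The following Python sketch emulates that arithmetic on the CPU using the standard-library `struct` module, so the rounding effects can be studied without a GPU. It is an emulation under simplifying assumptions (the hardware's exact accumulation order and rounding may differ), not tensor-core code, and the function names are illustrative.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_single(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mma_mixed(a, b, c):
    """Emulate D = A*B + C with half inputs and single accumulation.

    Inputs of A and B are rounded to half precision; every partial sum
    is rounded to single precision (an assumed accumulation order).
    """
    n, k, m = len(a), len(b), len(b[0])
    d = []
    for i in range(n):
        row = []
        for j in range(m):
            acc = to_single(c[i][j])
            for p in range(k):
                prod = to_half(a[i][p]) * to_half(b[p][j])
                acc = to_single(acc + prod)
            row.append(acc)
        d.append(row)
    return d


# 0.1 is not representable in half precision, so the result deviates
# from the double-precision reference.
A = [[0.1, 0.2], [0.3, 0.4]]
B = [[1.0, 0.0], [0.0, 1.0]]
C = [[0.0, 0.0], [0.0, 0.0]]
D = mma_mixed(A, B, C)
for i in range(2):
    for j in range(2):
        print(f"exact {A[i][j]:.10f}  mixed {D[i][j]:.10f}")
```

Comparing such an emulation against a double-precision reference is one way to estimate whether a given computation tolerates the reduced input precision before porting it to cuBLAS, CUTLASS, or CUDA.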

Applications:

NVidia CUDA (https://developer.nvidia.com/cuda-zone)
NVidia CUTLASS (https://github.com/NVIDIA/cutlass)
CP2K (https://www.cp2k.org/)
compiler