Seminar: High-Performance Computing (SS 2021) - High-Performance Computing

Seminar: High-Performance Computing

Course number	L.079.08006
Term	Summer term 2021
Program	Master's program Computer Science and Computer Engineering
Lecturers	Prof. Dr. Christian Plessl

News

2021-02-10 Public course page launched. Internal PANDA page will be created soon.

Organization

In this seminar, we will study influential and classic research papers in high-performance computing covering the topics of Models and Foundations, Applications and Algorithms, Computer Architecture, Optimization Techniques and Programming models.

You will be assigned a topic and 1-3 papers that you should read, understand, and summarize as a scientific report and present to your fellow students. The final results (report and presentation) are however just one aspect of the seminar. Equally important is the process and the intermediate steps how you arrive at these results. These skills shall prepare you for successfully writing and defending your Master's thesis.

Hence, at the end of this seminar you:

understand the scientific publishing process
will be familiar with the canonical structure of a scientific paper
can write a report following the scientific and technical conventions of the field
have practical experience in using the LaTeX typesetting environment
have acquired in-depth knowledge of the assigned topic of high-performance computing
learns how to write a peer review evaluation report for papers from fellow students

A preliminary list of topics is shown below.

Due to the SARS-CoV-2 epidemic the seminar will be held as a remote course that uses self-study, personal feedback and mutual feedback as methods.

Preconditions

The seminar builds on knowledge from Bachelor and Master-level courses in the area of computer engineering, such as Computer Architecture, Digital Design, Reconfigurable Computing, or High-Performance Computing. Having attended these courses is not a strict precondition. But you should have at least a good general understanding in one of these areas.

The seminar will be taught in English.

Expectations and Examples

This seminar is a course with 5 ECTS credit points. Each ECTS point corresponds to a time effort fo 25-30 hours. Hence, it is understood and expected that you invest about 125-150 hours, i.e., about four weeks full-time, into the preparation of your seminar. From this effort alone, it should be evident that a good seminar report is not just a technically accurate summary of an existing paper. Instead, it is expected that you read several papers on your topic to develop a good understanding of the subject and develop it into a paper whose structure and contents are tailored for the purpose and audience.

Preliminary List of Topics

Topics
Below you find the list of topics for the seminar papers. For each topic one or multiple references to papers are given in brackets (like this [Bai97]) that should form the basis for the seminar report. Where applicable, I have also added a few suggestions for specific questions that could be addressed in the report. The suggestions are meant for inspiration and are not exhaustive but should be expanded by you.
A. Models and Foundations
- A1: Amdahl's law in the multicore area [HM08][SC10]
  - What are the implications of Amdahl's law in the age of multicore CPUs? Analyze the different conclusions of the rebuttal paper [SC10] to [HM08]
- A2: LogP model of parallel computation [CKP+93]
  - In which way is the LogP model more realistic than previous models?
  - Is the model still valid today or does it need to be adapted?
- A3: Berkeley view on parallel computing landscape [ABC06]
  - What are the main findings and contributions of this paper?
  - Where has this work been influential?
  - Are the findings still relevant?
- A4: Scientific benchmarking of HPC systems [Bai91][Bai11][HB15]
  - What guidelines should be used for benchmarking HPC systems?
  - How have the possibilities to make mistakes or cheat changed between 1991 [Bai91] and today [HB15]?
B. Applications and Algorithms
- B1: PageRank algorithm for citation ranking [Gl15][PBMW99]
  - What techniques exist for computing the PageRank for very large graphs?
- B2: Tree algorithm for efficient force calculation in N-body problems (Barnes-Hut) [BH86][BAAD+97]
  - What scientific software uses the Barnes-Hut algorithm today?
  - What techniques exist to further reduce the algorithmic and practical complexity of force computation?
C. Computer Architecture
- C1: Disk system architecture for high-performance computing (RAID) [KGP89]
  - What implications does the availability of SSDs have for RAID?
  - What are the shortcomings of RAID and how are they addressed with new architectures?
- C2: Highly scalable network architecture (DragonFly) [KDSA08]
  - How does the DragonFly architecture compare to the upcoming Cray SlingShot architecture?
  - How does DragonFly compare to other network architectures that are widely used in HPC, such as fat-tree and hypercube?
- C3: Computer architecture organization and their effectiveness (Flynn's taxonomy) [Fly72]
  - Is Flynn's famous taxonomy still valid today?
  - Have fundamentally new ways of architecture organization emerged since [Fly72]? How do GPU and FPGAs fit into this taxonomy?
D. Optimization Techniques
- D1: Auto-tuned linear algebra library (ATLAS) [WD98][BACD97]
  - What problem is solved by auto-tuning?
  - What are the implications for performance portability of binary code?
- D2: Recursively blocked and tiled data structures for dense linear algebra [EGJK04]
  - What is the advantage of the recursive approach over blocking and auto-tuning for linear algebra?
- D3: Cache-oblivious algorithms [FLPR12]
  - What assumptions about the cache architecture are made in cache-oblivious algorithms?
- D4: Loop parallelization with iteration space tiling and skewing [Wol86][Wol89]
  - What is the purpose of iteration space tiling and skewing?
  - How these methods translate from vector processors to modern CPUs?
- D5: Spatial and temporal tiling for stencil computation [DKW+09]
  - What are the trade-offs between spatial and temporal tiling?
  - How does the technique translate to modern processors (multicores, SIMD, GPU)?
E. Programming Models
- E1: Scalable data processing on large clusters (MapReduce) [DG04][ZKJ+08]
  - Where is the MapReduce approach used today?
  - What happened after the big Hadoop hype?
- E2: Multi-threaded programming with work stealing (CILK) [BJK+95][BL99]
  - How does multi-threading with CILK relate and compare to OpenMP tasking?

Literature

[ABC+06] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, December 2006.
[AMD67] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings Spring Joint Computer Conference (SJCC), pages 483–485, New York, 1967. ACM.
[BAAD+97] U Becciani, R Ansaloni, V Antonuccio-Delogu, G Erbacci, M Gambera, and A Pagliaro. A parallel tree code for large N-body simulation: dynamic load balance and data distribution on a CRAY T3D system. Computer Physics Communications, 106(1):105–113, 1997
[BACD97] Jeff Bilmes, Krste Asanovic, Chee-Whye Chin, and Jim Demmel. Optimizing matrix multiply using PHiPAC: A portable, high-performance, ANSI C methodology. In Proc. ACM Int. Conference on Supercomputing (ICS), pages 340--347, New York, 1997. ACM.
[Bai11] David H. Bailey. 12 ways to fool the masses: Fast forward to 2011 (talk). www.davidhbailey.com/dhbtalks/dhb-12ways.pdf, 2011.
[Bai91] David H. Bailey. Twelve ways to fool the masses when giving performance results on parallel computers. In Supercomputing Review, pages 54--55, June 1991.
[BJK+95] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. SIGPLAN Not., 30(8):207–216, August 1995.
[BL99] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM, 46(5):720–748, September 1999.
[CKP+93] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: Towards a realistic model of parallel computation. In Proc. ACM SIGPLAN Symp. on Principles and practice of parallel programming (PPoPP), pages 1--12, New York, 1993. ACM.
[DG04] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. Symp. on Operating System Design and Implementation (OSDI), pages 137--149. USENIX Association, 2004.
[DKW+09] Kaushik Datta, Shoaib Kamil, Samuel Williams, Leonid Oliker, John Shalf, and Katherine Yelick. Optimization and performance modeling of stencil computations on modern microprocessors. SIAM Review, 51(1):129--159, 2009.
[EGJK04] Erik Elmroth, Fred Gustavson, Isak Jonsson, and Bo Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3--45, 2004.
[FLPR12] Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-oblivious algorithms. ACM Trans. Algorithms, 8(1), January 2012.
[Fly72] Michael J. Flynn. Some computer organizations and their effectiveness. IEEE Trans. on Computers, 21(9):948–960, September 1972.
[Gle15] David F. Gleich. PageRank beyond the web. SIAM Review, 57(3):321--363, 2015.
[HB15] Torsten Hoefler and Roberto Belli. Scientific benchmarking of parallel computing systems: Twelve ways to tell the masses when reporting performance results. In Proc. Int. Conf. on Supercomputing (SC), New York, 2015. ACM.
[HM08] M. D. Hill and M. R. Marty. Amdahl’s law in the multicore era. IEEE Computer, 41(7):33--38, 2008.
[KDSA08] John Kim, Wiliam J. Dally, Steve Scott, and Dennis Abts. Technology-driven, highly-scalable Dragonfly topology. In Proc. Int. Symp. on Computer Architecture (ISCA), pages 77--88. IEEE Computer Society, 2008.
[KGP89] R. H. Katz, G. A. Gibson, and D. A. Patterson. Disk system architectures for high performance computing. Proceedings of the IEEE, 77(12):1842--1858, 1989.
[PBMW99] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999.
[SC10] Xian-He Sun and Yong Chen. Reevaluating Amdahl’s law in the multicore era. Journal of Parallel and Distributed Computing, 70(2):183--188, 2010.
[WD98] R. Clint Whaley and Jack J. Dongarra. Automatically tuned linear algebra software. In Proc. Int. Conf. on Supercomputing (SC), page 1–27. IEEE Computer Society, 1998.
[Wol86] Michael Wolfe. Loops skewing: The wavefront method revisited. Int. Journal of Parallel Programming (IJPP), 15(4):279--293, 1986.
[Wol89] Michael Wolfe. More iteration space tiling. In Proc. Int. Conf. on Supercomputing (SC), page 655–664, New York, 1989. ACM.
[ZKJ+08] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. Improving MapReduce performance in heterogeneous environments. In Proc. Symp. on Operating System Design and Implementation (OSDI), page 29–42. USENIX Association, 2008.