From Zero to Multi-Node GPU Programming
This weekly workshop series is jointly organized by NHR@FAU, NHR@TUD and NVIDIA DLI. It covers the following DLI courses:
- Fundamentals of Accelerated Computing with CUDA C/C++
- Accelerating CUDA C++ Applications with Multiple GPUs
- Scaling CUDA C++ Applications to Multiple Nodes
Please indicate which parts you want to attend when registering.
Date and Time
The courses will be held online on September 18th, September 25th and October 2nd, from 9 am to 5 pm.
Prerequisites
A free NVIDIA developer account is required to access the course material. Please register before the training at https://learn.nvidia.com/join.
Part 1
- Basic C/C++ competency, including familiarity with variable types, loops, conditional statements, functions, and array manipulations
- No previous knowledge of CUDA programming is assumed
Parts 2 and 3
- Successful attendance of Part 1 (Fundamentals of Accelerated Computing with CUDA C/C++) or equivalent experience implementing CUDA C/C++ applications, including
  - memory allocation, host-to-device and device-to-host memory transfers,
  - kernel launches, grid-stride loops, and
  - CUDA error handling.
- Familiarity with the Linux command line.
- Experience using Makefiles to compile C/C++ code.
Learning Objectives
Day 1
At the conclusion of the workshop, participants will have an understanding of the fundamental tools and techniques for GPU-accelerating C/C++ applications with CUDA and be able to:
- Write code to be executed by a GPU accelerator
- Expose and express data and instruction-level parallelism in C/C++ applications using CUDA
- Utilize CUDA-managed memory and optimize memory migration using asynchronous prefetching (see the sketch after this list)
- Leverage command-line and visual profilers to guide your work
- Utilize concurrent streams for instruction-level parallelism
- Write GPU-accelerated CUDA C/C++ applications, or refactor existing CPU-only applications, using a profile-driven approach
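To give a flavor of the Day 1 material, here is a minimal sketch, not taken from the course material, that combines a grid-stride loop kernel, CUDA-managed memory, asynchronous prefetching, and basic error handling; all names and sizes are illustrative.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride loop: each thread processes multiple elements, so the
// kernel is correct for any array size and launch configuration.
__global__ void scale(float *x, float a, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x;

    // Managed (unified) memory is accessible from host and device.
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    // Prefetch the data to the GPU before the kernel runs to avoid
    // on-demand page migration.
    int device;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(x, n * sizeof(float), device);

    scale<<<256, 256>>>(x, 2.0f, n);

    // Basic CUDA error handling.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("%s\n", cudaGetErrorString(err));
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);  // expect 2.0
    cudaFree(x);
    return 0;
}
```

Such a program compiles with, e.g., `nvcc -o scale scale.cu`.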
Day 2
At the conclusion of the workshop, you will be able to:
- Use concurrent CUDA streams to overlap memory transfers with GPU computation (see the sketch after this list),
- Scale workloads across all available GPUs on a single node,
- Combine copy/compute overlap with multiple GPUs, and
- Rely on the NVIDIA Nsight Systems timeline to identify improvement opportunities and observe the impact of the techniques covered in the workshop.
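As an illustration of the copy/compute overlap pattern, here is a minimal sketch (names and sizes are illustrative, not from the course material): the input is split into chunks, and each chunk's host-to-device copy, kernel launch, and device-to-host copy are issued into their own stream, so transfers and computation from different chunks can overlap.

```cpp
#include <cuda_runtime.h>

__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 24, numStreams = 4, chunk = n / numStreams;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));  // pinned host memory is
                                            // required for async copies
    cudaMalloc(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t streams[numStreams];
    for (int s = 0; s < numStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk in, processes it, and copies it
    // back; operations in different streams may overlap on the GPU.
    for (int s = 0; s < numStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        work<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < numStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```

Running such a program under `nsys profile` makes the per-stream overlap visible on the timeline.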
Day 3
At the conclusion of the workshop, you will be able to:
- Use several methods for writing multi-GPU CUDA C++ applications,
- Use a variety of multi-GPU communication patterns and understand their tradeoffs,
- Write portable, scalable CUDA code with the single-program multiple-data (SPMD) paradigm using CUDA-aware MPI and NVSHMEM (a CUDA-aware MPI sketch follows this list),
- Improve multi-GPU SPMD code with NVSHMEM’s symmetric memory model and its ability to perform GPU-initiated data transfers, and
- Get practice with common multi-GPU coding paradigms like domain decomposition and halo exchanges.
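The SPMD style with CUDA-aware MPI looks roughly like the following sketch, a ring exchange between ranks. It is illustrative only and assumes an MPI build with CUDA support, so that device pointers can be passed directly to MPI calls.

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // SPMD: every rank runs the same program, each driving one GPU.
    int numDevices;
    cudaGetDeviceCount(&numDevices);
    cudaSetDevice(rank % numDevices);

    const int n = 1 << 20;
    float *d_send, *d_recv;
    cudaMalloc(&d_send, n * sizeof(float));
    cudaMalloc(&d_recv, n * sizeof(float));
    cudaMemset(d_send, 0, n * sizeof(float));

    // With CUDA-aware MPI, device pointers go directly into MPI calls;
    // the library chooses the GPU-to-GPU transport (e.g., GPUDirect).
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    MPI_Sendrecv(d_send, n, MPI_FLOAT, right, 0,
                 d_recv, n, MPI_FLOAT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```

Such a program is launched as usual, e.g. `mpirun -np 4 ./ring`.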
Certification
Upon successful completion of the assessment at the end of each day, participants will receive an NVIDIA DLI certificate to recognize their subject matter competency and support professional career growth.
Structure
Day 1
Accelerating Applications with CUDA C/C++
- Writing, compiling, and running GPU code
- Controlling the parallel thread hierarchy
- Allocating and freeing memory for the GPU
Managing Accelerated Application Memory with CUDA C/C++
- Profiling CUDA code with the command-line profiler
- Details on unified memory
- Optimizing unified memory management
Asynchronous Streaming and Visual Profiling for Accelerated Applications with CUDA C/C++
- Profiling CUDA code with NVIDIA Nsight Systems
- Using concurrent CUDA streams
Day 2
Introduction to CUDA Streams
- Get familiar with your GPU-accelerated interactive JupyterLab environment.
- Orient yourself with a single-GPU CUDA C++ application that will be the starting point for the course.
- Observe the current performance of the single-GPU CUDA C++ application using Nsight Systems.
- Learn the rules that govern concurrent CUDA stream behavior.
- Use multiple CUDA streams to perform concurrent host-to-device and device-to-host memory transfers.
- Utilize multiple CUDA streams for launching GPU kernels.
- Observe multiple streams in the Nsight Systems timeline view.
Multiple GPUs with CUDA C++
- Learn the key concepts for effectively using multiple GPUs on a single node with CUDA C++.
- Explore robust indexing strategies for the flexible use of multiple GPUs in applications.
- Refactor the single-GPU CUDA C++ application to utilize multiple GPUs (a minimal sketch follows this list).
- See multiple-GPU utilization in the Nsight Systems timeline.
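A minimal sketch of the single-node multi-GPU pattern (illustrative names and sizes, not course code): one contiguous chunk of the problem per device, with asynchronous kernel launches so all GPUs compute concurrently.

```cpp
#include <vector>
#include <cuda_runtime.h>

__global__ void work(float *d, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        d[i] *= 2.0f;
}

int main() {
    int numGpus;
    cudaGetDeviceCount(&numGpus);

    const int n = 1 << 24;
    const int chunk = n / numGpus;  // assumes numGpus divides n evenly
    std::vector<float *> d(numGpus);

    // Kernel launches are asynchronous, so this loop starts work on
    // every GPU before any of them has finished.
    for (int g = 0; g < numGpus; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&d[g], chunk * sizeof(float));
        work<<<(chunk + 255) / 256, 256>>>(d[g], chunk);
    }

    // Synchronize each device before using its results.
    for (int g = 0; g < numGpus; ++g) {
        cudaSetDevice(g);
        cudaDeviceSynchronize();
        cudaFree(d[g]);
    }
    return 0;
}
```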
Copy/Compute Overlap with CUDA Streams
- Learn the key concepts for effectively performing copy/compute overlap.
- Explore robust indexing strategies for the flexible use of copy/compute overlap in applications.
- Refactor the single-GPU CUDA C++ application to perform copy/compute overlap.
- See copy/compute overlap in the Nsight Systems timeline.
Copy/Compute Overlap with Multiple GPUs
- Learn the key concepts for effectively performing copy/compute overlap on multiple GPUs.
- Explore robust indexing strategies for the flexible use of copy/compute overlap on multiple GPUs.
- Refactor the single-GPU CUDA C++ application to perform copy/compute overlap on multiple GPUs.
- Observe the performance benefits of copy/compute overlap on multiple GPUs.
- See copy/compute overlap on multiple GPUs in the Nsight Systems timeline.
Day 3
Multi-GPU Programming Paradigms
- Survey multiple techniques for programming CUDA C++ applications for multiple GPUs, using a Monte Carlo approximation of Pi as the running example.
- Use CUDA to utilize multiple GPUs.
- Learn how to enable and use direct peer-to-peer memory communication (see the sketch after this list).
- Write an SPMD version with CUDA-aware MPI.
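For instance, enabling peer-to-peer access and copying directly between two GPUs might look like this sketch (illustrative, with error handling omitted):

```cpp
#include <cuda_runtime.h>

int main() {
    int count;
    cudaGetDeviceCount(&count);
    if (count < 2) return 0;  // needs at least two GPUs

    // Peer access is directional: enable it in both directions.
    int can01, can10;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (can01 && can10) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // second argument (flags) must be 0
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
    }

    const size_t bytes = 1 << 20;
    float *d0, *d1;
    cudaSetDevice(0); cudaMalloc(&d0, bytes);
    cudaSetDevice(1); cudaMalloc(&d1, bytes);

    // With peer access enabled, this copy can go directly over
    // NVLink/PCIe without staging through host memory.
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);

    cudaSetDevice(0); cudaFree(d0);
    cudaSetDevice(1); cudaFree(d1);
    return 0;
}
```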
Introduction to NVSHMEM
- Learn how to write code with NVSHMEM and understand its symmetric memory model.
- Use NVSHMEM to write SPMD code for multiple GPUs.
- Utilize symmetric memory to let all GPUs access data on other GPUs.
- Make GPU-initiated memory transfers (a minimal sketch follows this list).
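A minimal NVSHMEM sketch of these ideas (illustrative only; it assumes an NVSHMEM installation, compilation with `nvcc -rdc=true` linked against NVSHMEM, and launch with `nvshmrun`): each processing element (PE) writes into the symmetric buffer of its neighbor directly from device code.

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>

// GPU-initiated transfer: each PE puts its ID into the symmetric
// buffer of the next PE, directly from a kernel.
__global__ void put_to_neighbor(int *dest, int mype, int npes) {
    if (threadIdx.x == 0)
        nvshmem_int_p(dest, mype, (mype + 1) % npes);
}

int main() {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();

    // Symmetric allocation: every PE allocates the same buffer, and
    // each buffer is remotely accessible by all other PEs.
    int *dest = (int *)nvshmem_malloc(sizeof(int));

    put_to_neighbor<<<1, 1>>>(dest, mype, npes);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();  // all puts are complete past this point

    int received;
    cudaMemcpy(&received, dest, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d\n", mype, received);

    nvshmem_free(dest);
    nvshmem_finalize();
    return 0;
}
```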
Halo Exchanges with NVSHMEM
- Practice common coding motifs like halo exchanges and domain decomposition using NVSHMEM (a sketch follows this list), and work on the assessment.
- Write an NVSHMEM implementation of a Laplace equation Jacobi solver.
- Refactor a single-GPU 1D wave equation solver to use NVSHMEM.
- Complete the assessment and earn a certificate.
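To make the halo exchange motif concrete, here is a rough 1D sketch with NVSHMEM (illustrative sizes and names; same build and launch assumptions as above): each PE owns a slab of a periodic domain plus one halo cell on each side, pushes its boundary values into its neighbors' halos, and then runs a Jacobi sweep.

```cpp
#include <cuda_runtime.h>
#include <nvshmem.h>

const int N_LOCAL = 1024;  // interior points per PE (illustrative)

// Each PE owns u[1..N_LOCAL]; u[0] and u[N_LOCAL + 1] are halo cells
// mirroring the neighbors' boundary values (periodic domain).
__global__ void exchange_halos(float *u, int mype, int npes) {
    if (threadIdx.x == 0) {
        int left  = (mype - 1 + npes) % npes;
        int right = (mype + 1) % npes;
        nvshmem_float_p(&u[N_LOCAL + 1], u[1], left);  // left PE's right halo
        nvshmem_float_p(&u[0], u[N_LOCAL], right);     // right PE's left halo
    }
}

// One Jacobi sweep over the interior, reading the freshly filled halos.
__global__ void jacobi_step(const float *u, float *u_new) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    if (i <= N_LOCAL) u_new[i] = 0.5f * (u[i - 1] + u[i + 1]);
}

int main() {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();

    // Symmetric allocations so neighbors can write into the halos.
    float *u     = (float *)nvshmem_malloc((N_LOCAL + 2) * sizeof(float));
    float *u_new = (float *)nvshmem_malloc((N_LOCAL + 2) * sizeof(float));
    cudaMemset(u, 0, (N_LOCAL + 2) * sizeof(float));

    for (int iter = 0; iter < 100; ++iter) {
        exchange_halos<<<1, 1>>>(u, mype, npes);
        cudaDeviceSynchronize();
        nvshmem_barrier_all();  // halos are in place on every PE
        jacobi_step<<<(N_LOCAL + 255) / 256, 256>>>(u, u_new);
        cudaDeviceSynchronize();
        float *tmp = u; u = u_new; u_new = tmp;  // swap buffers
    }

    nvshmem_free(u);
    nvshmem_free(u_new);
    nvshmem_finalize();
    return 0;
}
```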
Language
The courses will be held in English.
Instructors
Dr. Sebastian Kuckuk and Markus Velten, both certified NVIDIA DLI Ambassadors.
The course is co-organized by NHR@FAU, NHR@TUD and the NVIDIA Deep Learning Institute (DLI).
Prices and Eligibility
The course is open and free of charge for participants from academic institutions in European Union (EU) member states and countries associated under Horizon 2020.
Withdrawal Policy
Please register for the course only if you actually intend to attend. No-shows will be blacklisted and excluded from future events. If you want to withdraw your registration, please send an e-mail to sebastian.kuckuk@fau.de.
Wait List
To be added to the wait list after the course has reached its maximum number of registrations, please send an e-mail to sebastian.kuckuk@fau.de with your name, university affiliation, and the days you want to attend.