These materials are from my online course CISC372: Parallel Computing, offered by the University of Delaware in the Fall of 2020.
Day | Topic | Link |
--- | ----- | ---- |
0 | Introducing myself. | Instructor’s Introduction |
1 | Introduction to parallel computing. What is it and why do we need it? Hardware trends. Parallel programming models. Message-passing vs. shared variables. Applications. | Introduction to Parallel Computing |
2 | Systems. A practical introduction to Unix and related tools. History and philosophy of Unix. Shell commands, file system, text editors, bash, environment variables. Make, Subversion. | Systems |
3 | The C programming language, part 1. Quick introduction to sequential C programming, focusing on the particular tools we will use. History, translation phases, preprocessor, types. | C1 |
4 | The C programming language, part 2. Pointer arithmetic, void*, allocation, array-pointer pun, structs. | C2 |
5 | MPI introduction. The message-passing model, introduction to MPI, hello world, barriers and reductions. | MPI Intro |
6 | Performance measurement and analysis. Weak and strong scaling, speedup, efficiency, Amdahl’s law. Graphing performance experiments. | Performance |
7 | MPI point-to-point communication. Send and receive, getting the status and count, tags, message ordering, deadlock, MPI_Sendrecv, data exchange patterns. | P2P |
8 | Data distribution. Cyclic and block distributions. The Standard Block Distribution Scheme. Nearest neighbor communication and ghost cell exchanges. 1d and 2d diffusion, striped and checkerboard decompositions. | Distribution |
9 | MPI collectives. The collective model of communication, MPI’s collective operations: MPI_Bcast, MPI_Scatter, MPI_Gather, …. | Collectives |
10 | MPI wildcards and nondeterminism. MPI_ANY_SOURCE, when is it needed and when not? Semantics of point-to-point, revisited: matching, ordering. Nondeterministic behavior. Manager-worker pattern. Floating-point nondeterminism. Example: numerical integration. | Wildcards |
11 | Review of part 1 of course. | Review 1 |
12 | Exam 1. | |
13 | Introduction to multithreaded programming. Processes vs. threads, programming language support. POSIX threads (Pthreads), thread creation and join. | Threads 1 |
14 | Data races, mutexes, and critical sections. What is a data race and why do you want to avoid one? The meaning of “undefined behavior.” Pthread mutexes: API and semantics. The critical section problem. | Threads 2 |
15 | Condition variables and concurrency flags. The need to wait. Condition variables: API, semantics, usage patterns. Concurrency flags and a Pthreads implementation. A 2-thread barrier. | Threads 3 |
16 | Multithreaded barriers and reductions. Properties of barrier design. Counter barrier, coordinator barrier, combining tree barrier, butterfly & dissemination barrier. Reductions. | Threads 4 |
17 | OpenMP, part 1. Introduction. Fork-join parallelism, directive-based programming, syntax, compiling & running, the parallel directive, getting the number of threads and thread ID. | OpenMP 1 |
18 | OpenMP, part 2. Private vs. shared variables. Initialization of private variables. The for loop directive. When can a loop be safely parallelized? The collapse clause. | OpenMP 2 |
19 | OpenMP, part 3. OpenMP reductions. Loop schedules. To wait or not to wait? The sections and single constructs. | OpenMP 3 |
20 | OpenMP, part 4. Synchronization, threadprivate, hybrid MPI+OpenMP programs. | OpenMP 4 |
21 | CUDA, part 1. Why use GPUs for scientific computing? The CUDA programming model, CUDA-C, and hello world. Compiling and running a CUDA program. The CUDA memory hierarchy. | CUDA 1 |
22 | Review of part 2 of course. | Review 2 |
23 | Exam 2. | |
24 | CUDA, part 2. Memory management functions: allocation, copying. Examples: vector addition, 1d diffusion. 2d and 3d grids. Example: matrix addition. | CUDA 2 |
25 | CUDA, part 3. Synchronization. Warps and a restriction on __syncthreads. Shared variables: why they are needed and how to create them. Reduction over threads in a block. Example: dot product. | CUDA 3 |
26 | CUDA, part 4. Matrix multiplication: naive version, and using shared memory. Performance experiment on 8000×8000 matrices of doubles. Hybrid programs using CUDA, MPI, and threads. | CUDA 4 |
27 | Future trends. | |
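
A few short, illustrative code sketches follow for some of the units above; they are representative of the kind of code written in the course, not excerpts from the lecture materials. For the MPI unit (Days 5 and 7), a minimal hello-world program with a single point-to-point exchange might look like this (the message value and tag are arbitrary):

```c
/* Illustrative sketch for Days 5 and 7 (not from the lecture notes):
 * every process says hello, then rank 0 sends one int to rank 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, nprocs;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  printf("Hello from process %d of %d\n", rank, nprocs);
  if (nprocs >= 2) {
    int msg = 42;                                        /* arbitrary payload */
    if (rank == 0) {
      MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* dest 1, tag 0 */
    } else if (rank == 1) {
      MPI_Status status;
      MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
      printf("Process 1 got %d from process %d\n", msg, status.MPI_SOURCE);
    }
  }
  MPI_Finalize();
  return 0;
}
```

Compile with mpicc and launch with, for example, `mpiexec -n 4 ./a.out`.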
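
The performance measures of Day 6 are the standard ones (the lecture slides may phrase them slightly differently): with $T_1$ the one-process running time and $T_p$ the running time on $p$ processes,

$$
S(p) = \frac{T_1}{T_p}, \qquad
E(p) = \frac{S(p)}{p}, \qquad
S(p) \le \frac{1}{(1-f) + f/p} \quad \text{(Amdahl's law)},
$$

where $f$ is the fraction of the work that can be parallelized.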
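
For the Pthreads material of Days 13–15, the sketch below shows thread creation and join together with a mutex guarding a shared counter (the thread and iteration counts are arbitrary):

```c
/* Illustrative sketch for Days 13-15 (not from the lecture notes):
 * NTHREADS threads each increment a shared counter NITERS times,
 * using a mutex so the increments do not race. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
  (void)arg;
  for (int i = 0; i < NITERS; i++) {
    pthread_mutex_lock(&lock);    /* critical section: one thread at a time */
    counter++;
    pthread_mutex_unlock(&lock);
  }
  return NULL;
}

int main(void) {
  pthread_t threads[NTHREADS];
  for (int i = 0; i < NTHREADS; i++)
    pthread_create(&threads[i], NULL, worker, NULL);
  for (int i = 0; i < NTHREADS; i++)
    pthread_join(threads[i], NULL);
  printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITERS);
  return 0;
}
```

Build with `cc -pthread`; deleting the lock/unlock pair turns the increments into a data race, and the final count becomes unpredictable.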
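
For the OpenMP loop and reduction material of Days 17–19, a minimal sketch (again illustrative; the array size is arbitrary):

```c
/* Illustrative sketch for Days 17-19 (not from the lecture notes):
 * parallel initialization of an array, then a parallel sum reduction. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
  static double a[N];
  double sum = 0.0;

  /* the loop index is private to each thread by default */
  #pragma omp parallel for
  for (int i = 0; i < N; i++)
    a[i] = 1.0 / (i + 1);

  /* each thread accumulates a private partial sum; OpenMP combines them */
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < N; i++)
    sum += a[i];

  printf("sum = %f (up to %d threads)\n", sum, omp_get_max_threads());
  return 0;
}
```

Build with an OpenMP-capable compiler, e.g. `gcc -fopenmp`; the environment variable OMP_NUM_THREADS controls how many threads are used.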
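
For the CUDA unit (Days 21 and 24–26), a minimal CUDA-C vector-addition sketch showing device allocation, host-device copies, and a kernel launch (the vector length and block size are arbitrary; error checking is omitted):

```c
/* Illustrative CUDA-C sketch for Days 21 and 24 (not from the lecture notes):
 * z = x + y computed element-wise on the GPU, one element per thread. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)

__global__ void vecadd(const double *x, const double *y, double *z, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    z[i] = x[i] + y[i];
}

int main(void) {
  size_t bytes = N * sizeof(double);
  double *hx = (double *)malloc(bytes);
  double *hy = (double *)malloc(bytes);
  double *hz = (double *)malloc(bytes);
  for (int i = 0; i < N; i++) { hx[i] = i; hy[i] = 2.0 * i; }

  double *dx, *dy, *dz;
  cudaMalloc((void **)&dx, bytes);
  cudaMalloc((void **)&dy, bytes);
  cudaMalloc((void **)&dz, bytes);
  cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

  int threadsPerBlock = 256;
  int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
  vecadd<<<blocks, threadsPerBlock>>>(dx, dy, dz, N);
  cudaMemcpy(hz, dz, bytes, cudaMemcpyDeviceToHost);   /* also synchronizes */

  printf("hz[12] = %f (expected %f)\n", hz[12], 3.0 * 12);
  cudaFree(dx); cudaFree(dy); cudaFree(dz);
  free(hx); free(hy); free(hz);
  return 0;
}
```

Compile with nvcc and run on a machine with an NVIDIA GPU.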