CISC372: Parallel Computing

These materials are from my online course CISC372: Parallel Computing, offered by the University of Delaware in the Fall of 2020.

0. Instructor’s Introduction. Introducing myself.
1. Introduction to parallel computing. What is it and why do we need it? Hardware trends. Parallel programming models. Message-passing vs. shared variables. Applications.
2. Systems. A practical introduction to Unix and related tools. History and philosophy of Unix. Shell commands, the file system, text editors, bash, environment variables. Make, Subversion.
3. The C programming language, part 1. A quick introduction to sequential C programming, focusing on the particular tools we will use. History, translation phases, the preprocessor, types.
4. The C programming language, part 2. Pointer arithmetic, void*, allocation, the array-pointer pun, structs.
5. MPI introduction. The message-passing model, introduction to MPI, hello world, barriers and reductions.
6. Performance measurement and analysis. Weak and strong scaling, speedup, efficiency, Amdahl’s law. Graphing performance experiments.
7. MPI point-to-point communication. Send and receive, getting the status and count, tags, message ordering, deadlock, MPI_Sendrecv, data exchange patterns.
8. Data distribution. Cyclic and block distributions. The standard block distribution scheme. Nearest-neighbor communication and ghost cell exchanges. 1d and 2d diffusion, striped and checkerboard decompositions.
9. MPI collectives. The collective model of communication and MPI’s collective operations: MPI_Bcast, MPI_Scatter, MPI_Gather, ….
10. MPI wildcards and nondeterminism. MPI_ANY_SOURCE: when is it needed and when is it not? Semantics of point-to-point communication, revisited: matching, ordering. Nondeterministic behavior. The manager-worker pattern. Floating-point nondeterminism. Example: numerical integration.
11. Review of part 1 of the course.
12. Exam 1.
13. Introduction to multithreaded programming. Processes vs. threads, programming language support. POSIX threads (Pthreads), thread creation and join.
14. Data races, mutexes, and critical sections. What is a data race and why do you want to avoid one? The meaning of “undefined behavior.” Pthread mutexes: API and semantics. The critical section problem.
15. Condition variables and concurrency flags. The need to wait. Condition variables: API, semantics, usage patterns. Concurrency flags and a Pthreads implementation. A 2-thread barrier.
16. Multithreaded barriers and reductions. Properties of barrier design. Counter barrier, coordinator barrier, combining-tree barrier, butterfly and dissemination barriers. Reductions.
17. OpenMP, part 1. Introduction. Fork-join parallelism, directive-based programming, syntax, compiling and running, the parallel directive, getting the number of threads and the thread ID.
18. OpenMP, part 2. Private vs. shared variables. Initialization of private variables. The for loop directive. When can a loop be safely parallelized? The collapse clause.
19. OpenMP, part 3. OpenMP reductions. Loop schedules. To wait or not to wait? The sections and single constructs.
20. OpenMP, part 4. Synchronization, threadprivate, hybrid MPI+OpenMP programs.
21. CUDA, part 1. Why use GPUs for scientific computing? The CUDA programming model, CUDA-C, and hello world. Compiling and running a CUDA program. The CUDA memory hierarchy.
22. Review of part 2 of the course.
23. Exam 2.
24. CUDA, part 2. Memory management functions: allocation, copying. Examples: vector addition, 1d diffusion. 2d and 3d grids. Example: matrix addition.
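A minimal CUDA-C sketch of the vector-addition example, showing device allocation, host-device copies, and a kernel launch; the block and grid sizes are illustrative, and the code requires an NVIDIA GPU and `nvcc` to run:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Each thread adds one element; blockIdx/threadIdx give it its index. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)                          /* guard: the grid may overshoot n */
    c[i] = a[i] + b[i];
}

int main(void) {
  const int n = 1 << 20;
  size_t bytes = n * sizeof(float);
  float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes),
        *hc = (float *)malloc(bytes);
  for (int i = 0; i < n; i++) { ha[i] = i; hb[i] = 2.0f * i; }

  float *da, *db, *dc;                /* device (GPU) allocations */
  cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
  cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

  int threadsPerBlock = 256;
  int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
  vecAdd<<<blocks, threadsPerBlock>>>(da, db, dc, n);
  cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);  /* implicit sync */

  printf("c[10] = %f\n", hc[10]);     /* 10 + 2*10 = 30 */
  cudaFree(da); cudaFree(db); cudaFree(dc);
  free(ha); free(hb); free(hc);
  return 0;
}
```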
25. CUDA, part 3. Synchronization. Warps and a restriction on __syncthreads. Shared variables: why they are needed and how to create them. Reduction over the threads in a block. Example: dot product.
26. CUDA, part 4. Matrix multiplication: a naive version and a version using shared memory. A performance experiment on 8000×8000 matrices of doubles. Hybrid programs using CUDA, MPI, and threads.
27. Future trends.