These materials are from my online course CISC372: Parallel Computing, offered by the University of Delaware in the Fall of 2020.
Day | Topic | Link |
--- | ----- | ---- |
0 | Introducing myself. | Instructor’s Introduction |
1 | Introduction to parallel computing. What is it and why do we need it? Hardware trends. Parallel programming models. Message-passing vs. shared variables. Applications. | Introduction to Parallel Computing |
2 | Systems. A practical introduction to Unix and related tools. History and philosophy of Unix. Shell commands, file system, text editors, bash, environment variables. Make, Subversion. | Systems |
3 | The C programming language, part 1. Quick introduction to sequential C programming, focusing on the particular tools we will use. History, translation phases, preprocessor, types. | C1 |
4 | The C programming language, part 2. Pointer arithmetic, void*, allocation, array-pointer pun, structs. | C2 |
5 | MPI introduction. The message-passing model, introduction to MPI, hello world, barriers and reductions. | MPI Intro |
6 | Performance measurement and analysis. Weak and strong scaling, speedup, efficiency, Amdahl’s law. Graphing performance experiments. | Performance |
7 | MPI point-to-point communication. Send and receive, getting the status and count, tags, message ordering, deadlock, MPI_Sendrecv, data exchange patterns. | P2P |
8 | Data distribution. Cyclic and block distributions. The Standard Block Distribution Scheme. Nearest neighbor communication and ghost cell exchanges. 1d and 2d diffusion, striped and checkerboard decompositions. | Distribution |
9 | MPI collectives. The collective model of communication, MPI’s collective operations: MPI_Bcast, MPI_Scatter, MPI_Gather, …. | Collectives |
10 | MPI wildcards and nondeterminism. MPI_ANY_SOURCE, when is it needed and when not? Semantics of point-to-point, revisited: matching, ordering. Nondeterministic behavior. Manager-worker pattern. Floating-point nondeterminism. Example: numerical integration. | Wildcards |
11 | Review of part 1 of course. | Review 1 |
12 | Exam 1. | |
13 | Introduction to multithreaded programming. Processes vs. threads, programming language support. POSIX threads (Pthreads), thread creation and join. | Threads 1 |
14 | Data races, mutexes, and critical sections. What is a data race and why do you want to avoid one? The meaning of “undefined behavior.” Pthread mutexes: API and semantics. The critical section problem. | Threads 2 |
15 | Condition variables and concurrency flags. The need to wait. Condition variables: API, semantics, usage patterns. Concurrency flags and a Pthreads implementation. A 2-thread barrier. | Threads 3 |
16 | Multithreaded barriers and reductions. Properties of barrier design. Counter barrier, coordinator barrier, combining tree barrier, butterfly & dissemination barrier. Reductions. | Threads 4 |
17 | OpenMP, part 1. Introduction. Fork-join parallelism, directive-based programming, syntax, compiling & running, the parallel directive, getting the number of threads and thread ID. | OpenMP 1 |
18 | OpenMP, part 2. Private vs. shared variables. Initialization of private variables. The for loop directive. When can a loop be safely parallelized? The collapse clause. | OpenMP 2 |
19 | OpenMP, part 3. OpenMP reductions. Loop schedules. To wait or not to wait? The sections and single constructs. | OpenMP 3 |
20 | OpenMP, part 4. Synchronization, threadprivate, hybrid MPI+OpenMP programs. | OpenMP 4 |
21 | CUDA, part 1. Why use GPUs for scientific computing? The CUDA programming model, CUDA-C, and hello world. Compiling and running a CUDA program. The CUDA memory hierarchy. | CUDA 1 |
22 | Review of part 2 of course. | Review 2 |
23 | Exam 2. | |
24 | CUDA, part 2. Memory management functions: allocation, copying. Examples: vector addition, 1d diffusion. 2d and 3d grids. Example: matrix addition. | CUDA 2 |
25 | CUDA, part 3. Synchronization. Warps and a restriction on __syncthreads. Shared variables: why they are needed and how to create them. Reduction over threads in a block. Example: dot product. | CUDA 3 |
26 | CUDA, part 4. Matrix multiplication: naive version, and using shared memory. Performance experiment on 8000×8000 matrices of doubles. Hybrid programs using CUDA, MPI, and threads. | CUDA 4 |
27 | Future trends. | |
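
A few short, illustrative code sketches follow for some of the units above; they are representative of the kind of code written in the course, not excerpts from the lecture materials. For the MPI unit (Days 5 and 7), a minimal hello-world program with a single point-to-point exchange might look like this (the message value and tag are arbitrary):

```c
/* Illustrative sketch for Days 5 and 7 (not from the lecture notes):
 * every process says hello, then rank 0 sends one int to rank 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, nprocs;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  printf("Hello from process %d of %d\n", rank, nprocs);
  if (nprocs >= 2) {
    int msg = 42;                                        /* arbitrary payload */
    if (rank == 0) {
      MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* dest 1, tag 0 */
    } else if (rank == 1) {
      MPI_Status status;
      MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
      printf("Process 1 got %d from process %d\n", msg, status.MPI_SOURCE);
    }
  }
  MPI_Finalize();
  return 0;
}
```

Compile with mpicc and launch with, for example, `mpiexec -n 4 ./a.out`.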
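
The performance measures of Day 6 are the standard ones (the lecture slides may phrase them slightly differently): with $T_1$ the one-process running time and $T_p$ the running time on $p$ processes,

$$
S(p) = \frac{T_1}{T_p}, \qquad
E(p) = \frac{S(p)}{p}, \qquad
S(p) \le \frac{1}{(1-f) + f/p} \quad \text{(Amdahl's law)},
$$

where $f$ is the fraction of the work that can be parallelized.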
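
For the Pthreads material of Days 13–15, the sketch below shows thread creation and join together with a mutex guarding a shared counter (the thread and iteration counts are arbitrary):

```c
/* Illustrative sketch for Days 13-15 (not from the lecture notes):
 * NTHREADS threads each increment a shared counter NITERS times,
 * using a mutex so the increments do not race. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
  (void)arg;
  for (int i = 0; i < NITERS; i++) {
    pthread_mutex_lock(&lock);    /* critical section: one thread at a time */
    counter++;
    pthread_mutex_unlock(&lock);
  }
  return NULL;
}

int main(void) {
  pthread_t threads[NTHREADS];
  for (int i = 0; i < NTHREADS; i++)
    pthread_create(&threads[i], NULL, worker, NULL);
  for (int i = 0; i < NTHREADS; i++)
    pthread_join(threads[i], NULL);
  printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITERS);
  return 0;
}
```

Build with `cc -pthread`; deleting the lock/unlock pair turns the increments into a data race, and the final count becomes unpredictable.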
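
For the OpenMP loop and reduction material of Days 17–19, a minimal sketch (again illustrative; the array size is arbitrary):

```c
/* Illustrative sketch for Days 17-19 (not from the lecture notes):
 * parallel initialization of an array, then a parallel sum reduction. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
  static double a[N];
  double sum = 0.0;

  /* the loop index is private to each thread by default */
  #pragma omp parallel for
  for (int i = 0; i < N; i++)
    a[i] = 1.0 / (i + 1);

  /* each thread accumulates a private partial sum; OpenMP combines them */
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < N; i++)
    sum += a[i];

  printf("sum = %f (up to %d threads)\n", sum, omp_get_max_threads());
  return 0;
}
```

Build with an OpenMP-capable compiler, e.g. `gcc -fopenmp`; the environment variable OMP_NUM_THREADS controls how many threads are used.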
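
For the CUDA unit (Days 21 and 24–26), a minimal CUDA-C vector-addition sketch showing device allocation, host-device copies, and a kernel launch (the vector length and block size are arbitrary; error checking is omitted):

```c
/* Illustrative CUDA-C sketch for Days 21 and 24 (not from the lecture notes):
 * z = x + y computed element-wise on the GPU, one element per thread. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)

__global__ void vecadd(const double *x, const double *y, double *z, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    z[i] = x[i] + y[i];
}

int main(void) {
  size_t bytes = N * sizeof(double);
  double *hx = (double *)malloc(bytes);
  double *hy = (double *)malloc(bytes);
  double *hz = (double *)malloc(bytes);
  for (int i = 0; i < N; i++) { hx[i] = i; hy[i] = 2.0 * i; }

  double *dx, *dy, *dz;
  cudaMalloc((void **)&dx, bytes);
  cudaMalloc((void **)&dy, bytes);
  cudaMalloc((void **)&dz, bytes);
  cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

  int threadsPerBlock = 256;
  int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
  vecadd<<<blocks, threadsPerBlock>>>(dx, dy, dz, N);
  cudaMemcpy(hz, dz, bytes, cudaMemcpyDeviceToHost);   /* also synchronizes */

  printf("hz[12] = %f (expected %f)\n", hz[12], 3.0 * 12);
  cudaFree(dx); cudaFree(dy); cudaFree(dz);
  free(hx); free(hy); free(hz);
  return 0;
}
```

Compile with nvcc and run on a machine with an NVIDIA GPU.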