Synchronization between host and device, for kernel calls, and for threads in a block. Warps and a restriction on __syncthreads. Shared variables: why they are needed and how to create them. Reduction over threads in a block. Example: dot product.
Slides: 25_cuda3.pdf
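
The dot product example can be sketched roughly as follows. This is a minimal illustration and is not taken from the slides: the names dot_partial, dot, cache, THREADS, and BLOCKS are hypothetical, the block size is assumed to be a power of two, and error checking of the CUDA calls is left out for brevity. Each block keeps per-thread partial sums in a __shared__ array, reduces them with a halving loop, and the host adds up one value per block.

```cuda
#include <cuda_runtime.h>

constexpr int THREADS = 256;  // threads per block (assumed power of two)
constexpr int BLOCKS  = 32;   // blocks in the grid (illustrative choice)

// Each block writes the sum of its share of a[i]*b[i] to partial[blockIdx.x].
// Assumes the kernel is launched with blockDim.x == THREADS.
__global__ void dot_partial(const float* a, const float* b, float* partial, int n) {
    // Shared variable: one copy per block, visible to every thread in that block.
    __shared__ float cache[THREADS];

    // Grid-stride loop: each thread accumulates a private sum in a register.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        sum += a[i] * b[i];
    }
    cache[threadIdx.x] = sum;
    __syncthreads();  // all writes to cache must finish before any thread reads it

    // Reduction over the threads in the block: halve the active range each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        }
        // Kept outside the if: every thread in the block must reach the same
        // __syncthreads, otherwise the behavior is undefined (it may hang).
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        partial[blockIdx.x] = cache[0];
    }
}

// Host wrapper (hypothetical helper): copies inputs, launches, sums block results.
float dot(const float* h_a, const float* h_b, int n) {
    float *d_a, *d_b, *d_partial;
    float h_partial[BLOCKS];
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_partial, BLOCKS * sizeof(float));
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // The launch returns to the host immediately; the kernel runs asynchronously.
    dot_partial<<<BLOCKS, THREADS>>>(d_a, d_b, d_partial, n);

    // This cudaMemcpy is issued in the same (default) stream, so it waits for
    // the kernel to finish and blocks the host until the copy is complete.
    cudaMemcpy(h_partial, d_partial, BLOCKS * sizeof(float), cudaMemcpyDeviceToHost);

    float result = 0.0f;
    for (int i = 0; i < BLOCKS; ++i) {
        result += h_partial[i];
    }

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_partial);
    return result;
}
```

The final sum over blocks is done on the host because __syncthreads only synchronizes threads within one block; there is no equivalent barrier across blocks inside an ordinary kernel launch. If the host needed to wait for the kernel without copying data back, a call to cudaDeviceSynchronize() after the launch would serve that purpose.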