CISC372: CUDA, part 4: Matrix Multiplication, CPU-GPU Hybrid Programs

Matrix multiplication. Naive CUDA version. Version 1, no shared memory, and memory analysis. Version 2, streaming sub-matrices through the shared memory of the block. Performance experiment on 8000×8000 matrices of doubles. Hybrid programs using CUDA, MPI, and threads.
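
As an illustration of the Version 2 idea (streaming sub-matrices through the shared memory of the block), here is a minimal sketch. It is not necessarily the kernel used in the lecture. The names `TILE`, `matmul_tiled`, and `max_error`, the tile width of 16, and the row-major square layout are all assumptions made for this example. Each block computes one TILE×TILE patch of C and loads one tile of A and one tile of B into shared memory per step, so each global element is read once per block rather than once per thread.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

#define TILE 16  // assumed tile width; each block has TILE x TILE threads

// C = A * B for n x n row-major matrices of doubles (hypothetical kernel name).
__global__ void matmul_tiled(const double *A, const double *B, double *C, int n) {
    __shared__ double As[TILE][TILE];  // current tile of A
    __shared__ double Bs[TILE][TILE];  // current tile of B
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    double sum = 0.0;
    for (int t = 0; t < (n + TILE - 1) / TILE; t++) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Each thread loads one element of each tile; pad with 0 past the edge.
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0;
        __syncthreads();  // tiles fully loaded before anyone reads them
        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load overwrites
    }
    if (row < n && col < n) C[row * n + col] = sum;
}

// Runs the kernel on small integer-valued test data and returns the largest
// absolute difference from a sequential CPU result, or -1.0 on a CUDA error.
double max_error(int n) {
    size_t bytes = (size_t)n * n * sizeof(double);
    double *A = (double *)malloc(bytes);
    double *B = (double *)malloc(bytes);
    double *C = (double *)malloc(bytes);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            A[i * n + j] = (i + j) % 7;
            B[i * n + j] = (i * j) % 5;
        }
    double *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
    double err = ok ? 0.0 : -1.0;
    for (int i = 0; ok && i < n; i++)
        for (int j = 0; j < n; j++) {
            double ref = 0.0;
            for (int k = 0; k < n; k++) ref += A[i * n + k] * B[k * n + j];
            err = fmax(err, fabs(ref - C[i * n + j]));
        }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(A); free(B); free(C);
    return err;
}

int main() {
    // n = 100 is deliberately not a multiple of TILE, to exercise the edge padding.
    printf("max error: %g\n", max_error(100));
    return 0;
}
```

Assuming the file is saved as `matmul.cu` (a hypothetical name), it should build with `nvcc -o matmul matmul.cu`. With tile width T, a block loads each needed element of A and B once and reuses it T times from shared memory, so global-memory reads drop by roughly a factor of T compared with a kernel that uses no shared memory.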