Matrix multiplication. Naive CUDA version. Version 1: no shared memory, with a memory analysis. Version 2: streaming sub-matrices through the shared memory of the block. Performance experiment on 8000×8000 matrices of doubles. Hybrid programs using CUDA, MPI, and threads.
Slides: 26_cuda4.pdf
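
As an illustration of the second version listed above (streaming sub-matrices through shared memory), here is a minimal sketch of a tiled kernel. It is not the code from the slides. The kernel name matmulTiled, the tile width TILE = 16, the row-major layout, and the small test size n = 100 are all assumptions made for this example.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // assumed tile width, not taken from the slides

// Computes C = A * B for n-by-n row-major matrices of doubles.
// Each block streams TILE-by-TILE sub-matrices of A and B through shared memory.
__global__ void matmulTiled(const double *A, const double *B, double *C, int n)
{
    __shared__ double As[TILE][TILE];
    __shared__ double Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    double sum = 0.0;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Each thread loads one element of each tile; pad with zeros at the edges.
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0;
        __syncthreads();  // both tiles fully loaded before anyone reads them

        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // all reads done before the tiles are overwritten
    }
    if (row < n && col < n) C[row * n + col] = sum;
}

int main()
{
    const int n = 100;  // small test size, deliberately not a multiple of TILE
    const size_t bytes = (size_t)n * n * sizeof(double);
    std::vector<double> A(n * n), B(n * n), C(n * n);
    for (int i = 0; i < n * n; ++i) { A[i] = (i % 7) * 0.5; B[i] = (i % 5) - 2.0; }

    double *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmulTiled<<<grid, block>>>(dA, dB, dC, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);

    // Compare against a straightforward CPU triple loop.
    double maxErr = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double ref = 0.0;
            for (int k = 0; k < n; ++k) ref += A[i * n + k] * B[k * n + j];
            maxErr = fmax(maxErr, fabs(ref - C[i * n + j]));
        }
    printf("max abs error: %g\n", maxErr);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return maxErr < 1e-9 ? 0 : 1;
}
```

The sketch should compile with nvcc (for example, nvcc matmul_tiled.cu, where the file name is hypothetical). In a kernel without shared memory, each thread reads 2n values from global memory to compute one element of C; with tiling, each thread loads two values per tile step, so the count drops to 2n/TILE. For the 8000×8000 case, each matrix of doubles occupies 8000 · 8000 · 8 = 512,000,000 bytes, and with TILE = 16 the grid would be 500×500 blocks, each using 2 · 16 · 16 · 8 = 4096 bytes of shared memory.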