Matrix Multiplication


I am a newbie to CUDA and right now I am going through the NVIDIA Programming Guide 2.3. I am stuck on one concept that seems to be important.

In chapter 3, a matrix multiplication code is shown that does not take advantage of shared memory. The Guide states that each thread reads one row of A and one column of B and computes the corresponding element of C. Therefore A is read B.width times from global memory.
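For context, the kernel in question looks roughly like the sketch below (names such as `MatMulKernel` and the `Matrix` struct follow the Guide's style; details may differ in your edition). Each thread loops over a full row of A and a full column of B straight from global memory, which is why A ends up being fetched B.width times in total:

```
// Matrices stored in row-major order: element(row, col) = elements[row * width + col]
typedef struct {
    int width;
    int height;
    float* elements;  // pointer into device (global) memory
} Matrix;

// Naive kernel: one thread computes one element of C,
// reading its row of A and column of B directly from global memory.
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    float Cvalue = 0.0f;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e] * B.elements[e * B.width + col];

    C.elements[row * C.width + col] = Cvalue;
}
```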

However, isn’t A copied to the device memory first via cudaMemcpy? Therefore, when the thread reads the row of A, isn’t it reading from the device memory and not the global memory?

Appreciate the help in advance.

The documentation uses “device memory” and “global memory” interchangeably to refer to the memory attached directly to the GPU. You could probably draw a precise usage distinction between the two terms if you really wanted to, but in practice people switch between them freely on the forum.

The memory connected to the CPU is usually called “system memory” or “host memory” in CUDA usage.
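So the copy in the question is exactly what puts A into global memory. A minimal host-side sketch (variable names like `d_A` are illustrative, not from the Guide):

```
// cudaMalloc allocates in device (global) memory on the GPU;
// cudaMemcpyHostToDevice copies A's elements there from host memory.
size_t size = A.width * A.height * sizeof(float);
float* d_A;
cudaMalloc((void**)&d_A, size);
cudaMemcpy(d_A, A.elements, size, cudaMemcpyHostToDevice);
```

After this, every read of `d_A` from inside a kernel is a read of device memory, i.e. global memory: the two phrases describe the same physical storage.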

Thank you. I figured this out as well and was just about to post that I didn’t need any help!