I'm new to CUDA and am currently working through the NVIDIA CUDA Programming Guide 2.3. I'm stuck on one concept that seems important.
Chapter 3 shows a matrix multiplication example that does not take advantage of shared memory. The Guide states that each thread reads one row of A and one column of B and computes the corresponding element of C, so A is read B.width times from global memory.
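For context, the kernel in question looks roughly like this (paraphrased from the Guide's non-shared-memory example, so details may differ slightly):

```cuda
// Matrices are stored in row-major order:
// M(row, col) = M.elements[row * M.width + col]
typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

// Each thread computes one element of C,
// accumulating the dot product into Cvalue
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e]
                * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}
```

Every access to A.elements and B.elements inside the loop is what the Guide counts as a read "from global memory."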
However, isn't A first copied to device memory via cudaMemcpy? So when a thread reads a row of A, isn't it reading from device memory rather than global memory?
Thanks in advance for any help.