Hi, I am new to GPU programming and I am currently reading a simple CUDA example that copies one matrix to another:
//a simple copy
__global__ void copy(float *odata, const float *idata){
    int x = blockIdx.x*TILE_DIM + threadIdx.x;
    int y = blockIdx.y*TILE_DIM + threadIdx.y;
    int width = gridDim.x*TILE_DIM;
    for(int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y+j)*width + x] = idata[(y+j)*width + x];
}
The test results show that the second one (the shared-memory version) is about 7% faster, but I don't understand why. It seems to me that the second one involves one extra copy through shared memory. Could someone give me some hints? Thanks!
Thanks very much for creating this test example! I just re-checked the code I was looking at and found one detail in the original code that helps to accelerate copySharedMem(): instead of using BLOCK_ROWS = TILE_DIM = 32, it uses BLOCK_ROWS = 8.
const int TILE_DIM=32;
const int BLOCK_ROWS=8;
//a simple copy
__global__ void copy(float *odata, const float *idata){
    int x = blockIdx.x*TILE_DIM + threadIdx.x;
    int y = blockIdx.y*TILE_DIM + threadIdx.y;
    int width = gridDim.x*TILE_DIM;
    // each thread copies TILE_DIM/BLOCK_ROWS elements of its column
    for(int j = 0; j < TILE_DIM; j += BLOCK_ROWS) //? why blocking?
        odata[(y+j)*width + x] = idata[(y+j)*width + x];
}
//copy using shared memory
__global__ void copySharedMem(float *odata, const float *idata){
    __shared__ float tile[TILE_DIM*TILE_DIM];
    int x = blockIdx.x*TILE_DIM + threadIdx.x;
    int y = blockIdx.y*TILE_DIM + threadIdx.y;
    int width = gridDim.x*TILE_DIM;
    // stage the tile in shared memory...
    for(int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x] = idata[(y+j)*width + x];
    __syncthreads();
    // ...then write it back out to global memory
    for(int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y+j)*width + x] = tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x];
}
int main(){
    int num_reps = 200;
    float *o, *i;
    const int ds = 1024;                       // 1024 x 1024 matrix of floats
    const int dsize = ds*ds*sizeof(o[0]);
    cudaMalloc(&o, dsize);
    cudaMalloc(&i, dsize);
    dim3 block(TILE_DIM, BLOCK_ROWS);
    dim3 grid(ds/TILE_DIM, ds/TILE_DIM);
    // launch each kernel num_reps times; timings come from nvprof
    for(int n = 0; n < num_reps; n++)
        copySharedMem<<<grid,block>>>(o, i);
    for(int n = 0; n < num_reps; n++)
        copy<<<grid,block>>>(o, i);
    cudaDeviceSynchronize();
}
And now copySharedMem() consistently takes less time than copy(). However, I haven't figured out why this blocking trick helps to accelerate the copy through shared memory. I am running on a Tesla P100 (one of the GPUs in an IBM Power8 system).
Case 1: TILE_DIM=32, BLOCK_ROWS=8
$ mpirun -np 1 nvprof ./main
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   51.18%  3.5459ms    200  17.729us  17.378us  18.595us  copy(float*, float const *)
                   48.82%  3.3828ms    200  16.914us  16.226us  17.986us  copySharedMem(float*, float const *)
Case 2: TILE_DIM=32, BLOCK_ROWS=32
$ mpirun -np 1 nvprof ./main
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   53.69%  4.2012ms    200  21.006us  20.387us  21.955us  copySharedMem(float*, float const *)
                   46.31%  3.6243ms    200  18.121us  17.347us  18.947us  copy(float*, float const *)
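For reference, each launch reads and writes one 1024×1024 float matrix (2 × 4 MiB ≈ 8.4 MB), so the Case 1 averages work out to roughly 473 GB/s for copy() and 496 GB/s for copySharedMem(), about a 5% gap, and both are well below the P100's theoretical peak of roughly 730 GB/s.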
If you pull a bunch of stuff into shared memory and then blast it out, you can get some small benefit from the fact that you are doing a bunch of reads followed by a bunch of writes.
The copy code, on the other hand, has its reads and writes possibly closer together in time, requiring turnaround on the DRAM bus.
This is just a guess, but I do know that on a P100 there can be about a 10% difference in measured bandwidth between a copy kernel and a kernel that simply reads memory in bulk or writes memory in bulk. If you only do reads, or only do writes, you can witness higher bandwidth than if you do a mix of both. It stands to reason that if you introduce bursty behavior, that may also be better than just doing read/write/read/write…
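To make that concrete, here is a minimal sketch of the kind of bulk-read / bulk-write / copy comparison I mean (the grid-stride form, the kernel names, and the sink trick that stops the compiler from dropping the reads are just my own way of writing it, not code from the example above):

// read-only: accumulate the loads and conditionally store, so the reads survive optimization
__global__ void read_only(const float *idata, float *sink, size_t n){
    float acc = 0.0f;
    for(size_t i = blockIdx.x*(size_t)blockDim.x + threadIdx.x; i < n; i += (size_t)gridDim.x*blockDim.x)
        acc += idata[i];
    if(acc == -1.0f) *sink = acc;   // practically never taken; defeats dead-code elimination
}

// write-only: stores only
__global__ void write_only(float *odata, size_t n){
    for(size_t i = blockIdx.x*(size_t)blockDim.x + threadIdx.x; i < n; i += (size_t)gridDim.x*blockDim.x)
        odata[i] = 1.0f;
}

// copy: alternating load/store traffic on the DRAM bus
__global__ void copy_bulk(float *odata, const float *idata, size_t n){
    for(size_t i = blockIdx.x*(size_t)blockDim.x + threadIdx.x; i < n; i += (size_t)gridDim.x*blockDim.x)
        odata[i] = idata[i];
}

Launching each of these over the same-size buffer and timing them with nvprof should show the read-only and write-only cases coming in above the copy case.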