But sorry… how do we copy data via a kernel? I used to think that before calling a kernel we need to copy data using cudaMemcpy(). Could you please explain how we can copy data via a kernel?
I very often opt for such solutions, especially where the problem is reasonably well defined and I can use some templates.
It is then often possible to place several data elements in registers, which brings an additional advantage over using shared memory. Shared memory can often be quite slow compared to just using registers (the programming guide will tell you the latency is the same, but it isn't).
One example is an algorithm I did recently where one had to read in data and construct a vector V (length(V) = 64, for example), use it to construct the matrix M = VV', and then in the next iteration read in and construct a new V and do M += VV'. At first I stored the matrix M in shared memory, which gave me an overall performance of 80-100 GFLOPs.
But later I realized that all the columns of M could be stored in the registers of 64 threads, since there is no need for communication between the threads in this particular implementation. This got my overall performance up to about 240 GFLOPs.
You can give this problem a try yourself; just make sure you don't spill over into lmem!
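For concreteness, here is a minimal sketch of how I read the approach described above — not the poster's actual code, and the kernel name and layout are made up for illustration. Each of 64 threads keeps one column of M in a register array, while V sits in shared memory so every thread can read all of it (a read-only broadcast, not inter-thread communication of M):

```cuda
// Hypothetical sketch: accumulate M += V*V' over several iterations,
// with each of LEN threads holding one column of M in registers.
// Launch with blockDim.x == LEN (e.g. LEN = 64).
template<int LEN>
__global__ void rank1_accumulate(const float* __restrict__ g_v, // iters vectors of length LEN
                                 float* g_m,                    // LEN x LEN output, column-major
                                 int iters)
{
    __shared__ float s_v[LEN];   // current V, visible to all threads
    float m_col[LEN];            // thread t owns column t of M, kept in registers
    const int t = threadIdx.x;

#pragma unroll
    for (int i = 0; i < LEN; ++i)
        m_col[i] = 0.0f;

    for (int it = 0; it < iters; ++it) {
        s_v[t] = g_v[it * LEN + t];   // read in the next V cooperatively
        __syncthreads();
        const float vt = s_v[t];
#pragma unroll
        for (int i = 0; i < LEN; ++i) // M(i,t) += V(i) * V(t)
            m_col[i] += s_v[i] * vt;
        __syncthreads();
    }

#pragma unroll
    for (int i = 0; i < LEN; ++i)     // write column t back to global memory
        g_m[i * LEN + t] = m_col[i];
}
```

Because LEN is a compile-time constant and every index into m_col comes from a fully unrolled loop, the compiler can keep the whole column in registers instead of spilling it to local memory.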
Thank you very much. This makes sense to me. :rolleyes:
But being somewhat new to CUDA, I am unable to understand how exactly you used registers to store your data. Your data must be in global memory, and you then copied it into GPU registers. I know how to bring data from global memory into shared memory, but I don't understand how you transfer data from global memory into registers. A two-line code example would help me understand this.
// normally maybe you'd write something like:
__global__ void kernel(float* globalArray, int n)
{
	// a normal declaration like this loads the global value into 'register_val', which will be placed in a register
	float register_val = globalArray[threadIdx.x + blockIdx.x*blockDim.x];
	// But doing this will cause issues:
	float register_array[64]; // note: the size must be a compile-time constant in device code
	for(int i = 0; i < n; i++)
		register_array[i] = globalArray[threadIdx.x + blockIdx.x*blockDim.x + i*n]; // <------ runtime loop bound, can't be unrolled: not in registers, will be placed in local memory
	// this however is OK:
	register_array[0] = globalArray[threadIdx.x + blockIdx.x*blockDim.x];
	register_array[1] = globalArray[threadIdx.x + blockIdx.x*blockDim.x + 1*n];
	register_array[2] = globalArray[threadIdx.x + blockIdx.x*blockDim.x + 2*n];
	register_array[3] = globalArray[threadIdx.x + blockIdx.x*blockDim.x + 3*n];
	// Accessing with constant indices is doable, but not very practical => use template parameters and unrolling
}
// this will be easier to handle:
template<int N>
__global__ void kernel(float* globalArray)
{
	float register_array[N];
	#pragma unroll
	for(int i = 0; i < N; i++)
		register_array[i] = globalArray[threadIdx.x + blockIdx.x*blockDim.x + i*N];
	// do something with the values on-chip
	// ...
}
But as I said, you need to know your N at compile time; otherwise your performance will be severely affected by this.
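Since N must be a compile-time constant but the problem size is often only known at runtime, one common pattern is to dispatch from the host to a fixed set of template instantiations. A minimal sketch, assuming the templated kernel from above (the launcher name and the supported sizes here are made up for illustration):

```cuda
// Hypothetical host-side dispatch: map a runtime n onto one of a
// small set of compile-time instantiations of the templated kernel.
void launch_kernel(float* d_data, int n, int blocks, int threads)
{
    switch (n) {
    case 4:  kernel<4><<<blocks, threads>>>(d_data);  break;
    case 8:  kernel<8><<<blocks, threads>>>(d_data);  break;
    case 16: kernel<16><<<blocks, threads>>>(d_data); break;
    case 64: kernel<64><<<blocks, threads>>>(d_data); break;
    default:
        // fall back to a non-templated path, or report an unsupported size
        break;
    }
}
```

The trade-off is binary size and compile time: every case instantiates a separate kernel, so you only list the sizes your application actually uses.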