I have the following kernel:

__global__ void my_kernel(cuFloatComplex* data_d, float* x_d, float* y_d)
{
    __shared__ float x[BLOCK_SIZE], y[BLOCK_SIZE];
    int ind_col = (blockIdx.x * BLOCK_SIZE) + threadIdx.x;
    int ind_row = (blockIdx.y * BLOCK_SIZE) + threadIdx.y;
    ...
}
The grid is a 3D grid of (N2/BLOCK_SIZE, N1/BLOCK_SIZE, M/BLOCK_SIZE) and the block is BLOCK_SIZE x BLOCK_SIZE.
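For reference, I launch it roughly like this (assuming N1, N2 and M are exact multiples of BLOCK_SIZE, and that data_d, x_d, y_d are already allocated on the device):

```cuda
dim3 block(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(N2 / BLOCK_SIZE, N1 / BLOCK_SIZE, M / BLOCK_SIZE);
my_kernel<<<grid, block>>>(data_d, x_d, y_d);
```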
My first aim is to copy a segment of BLOCK_SIZE elements of the input vectors x_d and y_d into the shared arrays x and y. I have some doubts about this.
First, can I effectively use cudaMemcpy? Or does it not work inside a kernel?
If it works, then according to my code each single thread would copy BLOCK_SIZE floats into the shared arrays, and hence I would have BLOCK_SIZE x BLOCK_SIZE threads each writing BLOCK_SIZE floats. So each thread would uselessly overwrite the shared arrays with the same data. Is that correct?
I need all BLOCK_SIZE elements of x and y in each thread of the block to perform a sum: each thread of the block performs a weighted sum over BLOCK_SIZE elements of x_d and y_d.
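To illustrate what I mean, here is my rough guess at a cooperative load, where one row of threads fills the shared arrays once and every thread then reads them (just a sketch based on my kernel above, with a placeholder weighting; I am not sure this is the right pattern):

```cuda
// Only the threads with threadIdx.y == 0 load, so the block as a
// whole copies each of the BLOCK_SIZE floats exactly once.
if (threadIdx.y == 0) {
    x[threadIdx.x] = x_d[ind_col];
    y[threadIdx.x] = y_d[ind_col];
}
__syncthreads();  // make the shared data visible to all threads of the block

// Now every thread can read all BLOCK_SIZE shared elements.
float sum = 0.0f;
for (int k = 0; k < BLOCK_SIZE; ++k)
    sum += x[k] * y[k];  // placeholder: the real weighted sum goes here
```

Is something like this the efficient way to do it, or is there a better scheme?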
So my question is: how can I get an efficient copy of the two vectors from global memory into shared memory?
Does anybody have a suggestion?
Thanks in advance,