Hello,

I have the following kernel:

```
__global__ void my_kernel(cuFloatComplex* data_d, float* x_d, float* y_d)
{
    __shared__ float x[BLOCK_SIZE], y[BLOCK_SIZE];

    int ind_col = (blockIdx.x * BLOCK_SIZE) + threadIdx.x;
    int ind_row = (blockIdx.y * BLOCK_SIZE) + threadIdx.y;

    int ind_m = blockIdx.z;

    cudaMemcpy(x, (x_d + ind_m * BLOCK_SIZE), BLOCK_SIZE, cudaMemcpyDeviceToHost);
    cudaMemcpy(y, (y_d + ind_m * BLOCK_SIZE), BLOCK_SIZE, cudaMemcpyDeviceToHost);

    __syncthreads();
}
```

The grid is a 3D grid of (N2/BLOCK_SIZE, N1/BLOCK_SIZE, M/BLOCK_SIZE) and the block is BLOCK_SIZE x BLOCK_SIZE sized.

My first aim is to copy a segment of BLOCK_SIZE elements of the input vectors x_d and y_d into the shared variables x and y. I have some doubts about this.

First, can I actually use cudaMemcpy here, or does it not work inside a kernel?

If it works, then according to my code each single thread will copy BLOCK_SIZE floats into the shared variables, so I will have BLOCK_SIZE x BLOCK_SIZE threads each writing the same BLOCK_SIZE floats. Hence each thread uselessly overwrites the shared variables with the same data. Is that correct?

I need to use all BLOCK_SIZE x and y values in every thread of the block to perform a sum: each thread of the block performs a weighted sum based on the BLOCK_SIZE elements of x_d and y_d.

Then, my question is: “How can I get an efficient copy from global memory into shared memory of the two vectors?”.
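For reference, the pattern I imagine (but am not sure is correct) is a cooperative load, where each element of the segment is read from global memory exactly once instead of every thread copying the whole segment. Something like this sketch, where only the first row of threads does the copy:

```cuda
__global__ void my_kernel(cuFloatComplex* data_d, float* x_d, float* y_d)
{
    __shared__ float x[BLOCK_SIZE], y[BLOCK_SIZE];

    int ind_m = blockIdx.z;

    // Only the first row of the block (threadIdx.y == 0) loads, one float
    // per thread, so each element is fetched from global memory once.
    if (threadIdx.y == 0) {
        x[threadIdx.x] = x_d[ind_m * BLOCK_SIZE + threadIdx.x];
        y[threadIdx.x] = y_d[ind_m * BLOCK_SIZE + threadIdx.x];
    }

    // Make the shared data visible to all threads of the block before use.
    __syncthreads();

    // ... weighted sum over x[0..BLOCK_SIZE-1] and y[0..BLOCK_SIZE-1] here ...
}
```

Is this the right approach, or is there a more efficient idiom?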

Does anybody have a suggestion?

Thanks in advance,

Beetro