cudaMemcpy into shared variables


I have the following single kernel:

__global__ void my_kernel(cuFloatComplex* data_d, float* x_d, float* y_d)
{
    __shared__ float x[BLOCK_SIZE], y[BLOCK_SIZE];
    int ind_col = (blockIdx.x * BLOCK_SIZE) + threadIdx.x;
    int ind_row = (blockIdx.y * BLOCK_SIZE) + threadIdx.y;
    int ind_m   = blockIdx.z;
    ...
}


The grid is a 3D grid of (N2/BLOCK_SIZE, N1/BLOCK_SIZE, M/BLOCK_SIZE) blocks, and each block is BLOCK_SIZE x BLOCK_SIZE threads.
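Written out, the launch configuration is something like this (just a sketch):

    // Sketch of the launch described above (note: a 3D grid requires
    // compute capability >= 2.0 and CUDA >= 4.0; on older toolkits such
    // as CUDA 2.3 the grid can only be 2D).
    dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(N2 / BLOCK_SIZE, N1 / BLOCK_SIZE, M / BLOCK_SIZE);
    my_kernel<<<grid, block>>>(data_d, x_d, y_d);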

My first aim is to copy a segment of BLOCK_SIZE elements of the input vectors x_d and y_d into the shared variables x and y. I have some doubts about this.

First, can I actually use cudaMemcpy, or does it not work inside a kernel?
If it works, then according to my code each single thread will copy BLOCK_SIZE floats into the shared variables, and hence I will have BLOCK_SIZE x BLOCK_SIZE threads each writing BLOCK_SIZE floats. So each thread uselessly overwrites the shared variables with the same data. Is that correct?

I need to use all BLOCK_SIZE elements of x and y in each thread of the block to perform a sum: each thread of the block performs a weighted sum based on BLOCK_SIZE elements of x_d and y_d.
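For concreteness, the per-thread computation I have in mind is roughly this (a sketch: weight() is only a placeholder for my actual weighting, and x and y are assumed to be already staged in shared memory):

    // Every thread reads all BLOCK_SIZE shared elements for its own sum.
    float sum = 0.0f;
    for (int k = 0; k < BLOCK_SIZE; ++k)
        sum += weight(ind_row, ind_col, k) * (x[k] + y[k]);  // placeholder weighting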

My question, then, is: how can I get an efficient copy of the two vectors from global memory into shared memory?

Does anybody have a suggestion?

Thanks in advance,


You cannot call cudaMemcpy inside a kernel…

What about using coalescing? Each thread copies one float into shared memory, something like

x[threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y] = x_d[same index]

I don’t remember whether floats cause any trouble with coalescing, but if it works, I think it’s the best way to copy data from global to shared memory…
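Adapted to your BLOCK_SIZE x BLOCK_SIZE block, a sketch could be (which segment of x_d and y_d each block needs is my assumption):

    __global__ void my_kernel(cuFloatComplex* data_d, float* x_d, float* y_d)
    {
        __shared__ float x[BLOCK_SIZE], y[BLOCK_SIZE];

        // Linearize the 2D thread index: the first BLOCK_SIZE threads each
        // load one float, so consecutive threads touch consecutive global
        // addresses and the loads coalesce.
        int tid = threadIdx.y * blockDim.x + threadIdx.x;
        if (tid < BLOCK_SIZE) {
            x[tid] = x_d[blockIdx.x * BLOCK_SIZE + tid];
            y[tid] = y_d[blockIdx.y * BLOCK_SIZE + tid];
        }
        __syncthreads();  // make the loads visible to the whole block

        // ... each thread now uses all of x[] and y[] for its weighted sum ...
    }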

EDIT: look at page 82 of the CUDA Programming Guide, version 2.3, for coalescing.

Ok, many thanks. I guessed it was not allowed. By the way, the code compiles even if there is a cudaMemcpy call in a kernel.

I think it is translated into a standard memcpy, but I don’t know how it’s executed in the kernel.

Yes, I want coalesced memory access, but my task is this: I need to multiply a 3D matrix (M x N1 x N2), say A, by a vector C (1 x M). That is, I want to compute R = C*A. But before evaluating R, I need to generate A.

Solution I) I could use a 3D grid of 3D blocks of threads to fill A with coalesced memory accesses, and then perform the multiplication in another kernel, exploiting shared memory in a fashion similar to the CUDA Programming Guide 2.3, section 3.2.2.

Solution II) As some operations in generating A and in the C*A multiplication are similar (memory accesses), I would like to join the two operations in one kernel, minimizing accesses to global memory. In this case, I should size the grid and the blocks so that the total number of threads is not M x N1 x N2 (the overall size of the 3D matrix A) but N1 x N2 (the size of R). Hence I have fewer threads available, each managing a piece of A.
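For Solution II, the fused kernel I am thinking of is roughly the following (genA() is just a placeholder for my actual generation formula, and the layouts are assumptions):

    #include <cuComplex.h>

    // One thread per element of R (N1 x N2). Each A element is generated
    // on the fly and consumed immediately, so A never touches global memory.
    __global__ void fused_CxA(cuFloatComplex* R, const cuFloatComplex* C,
                              int M, int N1, int N2)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col >= N2 || row >= N1) return;

        cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
        for (int m = 0; m < M; ++m) {
            cuFloatComplex a = genA(m, row, col);   // placeholder generator
            acc = cuCaddf(acc, cuCmulf(C[m], a));   // accumulate C * A
        }
        R[row * N2 + col] = acc;
    }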

I do not know which is the best choice. Should I implement both and find the most effective one by means of experimental trials?

What do you think?

Thank you,


Pay attention to the maximum z-dimension of 3D blocks and grids!

In my experience, what you absolutely have to avoid is a large number of global memory accesses!

Nice to see some guys from Italian Universities! :-)



Ok, I think Solution II is better.

What about something like this:

for (unsigned int i = 0; i < M; i++) {
    // load A[i][threadIdx.y * blockDim.x][threadIdx.x] into shared memory
    // (one element per thread), __syncthreads(), then use the slice
}

Maybe I haven’t understood your problem well…

But this way you have M coalesced reads from global memory, don’t you?


Thanks Luca for your advice.

Indeed, it seems that 3D grids are not allowed, unlike blocks of threads, which can be 3D-indexed. So I’m wondering about the reason for that choice. I mean, given a 3D structure, I would probably exploit both 3D blocks and 3D grids of blocks to cover the whole structure, don’t you think?



Hi Pietro,

yes, a grid can only be 2D. I think that 3D blocks are enough to describe a 3D problem… What field are you working in? Mathematics?
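By the way, if you really need a third grid dimension, a common workaround (just a sketch, with hypothetical names; M_slices would be passed to the kernel) is to fold it into gridDim.y and decode it inside the kernel:

    // Host side: stack M_slices "virtual" z-layers along the y dimension.
    dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(N2 / BLOCK_SIZE, (N1 / BLOCK_SIZE) * M_slices);

    // Device side: recover the virtual 3D block index.
    int blocksPerCol = gridDim.y / M_slices;       // = N1 / BLOCK_SIZE
    int ind_m        = blockIdx.y / blocksPerCol;  // the "z" (depth) index
    int blockRow     = blockIdx.y % blocksPerCol;  // the real row block
    int ind_row      = blockRow * BLOCK_SIZE + threadIdx.y;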



I guess that with your example I can get coalesced memory access. Unfortunately, the values of M involved in my typical problem can be very big, and I do not think I can load all M values into the shared memory of a thread block. Indeed, I handle M values on the order of 10000, and since the type is cuFloatComplex (8 bytes per element), I would need 80000 bytes = ~80 KB, which is more than the 16 KB of shared memory available. So this strategy should be very effective (theoretically) up to about 2048 elements, but over that threshold it should show a significant drop in performance.

I’m not completely sure of my reasoning, as I’m no expert: I have only been using CUDA for a few weeks. So, please, your and anybody else’s comments will be really appreciated.



Electromagnetics… and in particular antenna synthesis and analysis.

By the way, if I want to exploit shared memory without a 3D grid, when the dimensions of the data structures are large, I cannot segment the 3rd dimension (the depth, for instance). This is exactly the problem I have. I want to generate a 3D matrix of values of a function F(u,v,w) at the points of the Cartesian product of three vectors u (N2 elements), v (N1 elements), and w (M elements). M, N1, and N2 can be very large, and I cannot load the whole corresponding vector into shared memory if I use a 2D grid of blocks. If I could use a 3D grid, I would load only a segment of the vector, so that it fits in shared memory.

Do you have any suggestion for a workaround?


If you’re dealing with huge numbers, have a look at this link:

There, the data is staged through shared memory in steps.
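Roughly, the idea is something like this (a sketch with a hypothetical TILE size and w_d vector, following the F(u,v,w) setup above):

    // Stage the too-large w vector (M elements) through shared memory in
    // TILE-sized steps; the whole block consumes each tile before the
    // next one is loaded.
    __shared__ float w_tile[TILE];
    int tid = threadIdx.y * blockDim.x + threadIdx.x;

    for (int base = 0; base < M; base += TILE) {
        if (tid < TILE && base + tid < M)
            w_tile[tid] = w_d[base + tid];
        __syncthreads();                   // tile fully loaded

        int chunk = min(TILE, M - base);
        for (int j = 0; j < chunk; ++j) {
            // ... generate and consume F(u[col], v[row], w_tile[j]) ...
        }
        __syncthreads();                   // done reading before reuse
    }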