cudaMemcpy into shared variables

Hello,

i have the following single kernel

global void my_kernel (cuFloatComplex* data_d, float* x_d, float* y_d)
{
shared float x[BLOCK_SIZE],y[BLOCK_SIZE];
int ind_col =(blockIdx.xBLOCK_SIZE)+threadIdx.x;
int ind_row =(blockIdx.y
BLOCK_SIZE)+threadIdx.y;
int ind_m=blockIdx.z;
cudaMemcpy(x,(x_d+ind_mBLOCK_SIZE),BLOCK_SIZE,cudaMemcpyDev
iceToHost);
cudaMemcpy(y,(y_d+ind_m
BLOCK_SIZE),BLOCK_SIZE,cudaMemcpyDev
iceToHost);
__syncthreads();

}

the grid is a 3D grid of (N2/BLOCK_SIZE,N1_BLOCK_SIZE, M/BLOCK_SIZE) and the Block is BLOCK_SIZExBLOCK_SIZE sized.

My first aim is to copy a segment of BLOCK_SIZE dimension of input vector x_d and y_d into the Shared variables x and y. I have some doubts about.

First, can I use effectively cudaMemcpy? Or it doesn’t work inside a kernel?
If it works, accoridng to my code, each single thread will copy a BLOCK_SIZE amout of float data into the shared variables and hence I will have BLOCK_SIZExBLOCK_size threads writing BLOCK_SIZE float data. So each kernel overwrites useless the shared variables with the same data. Is it correct?

I need to use the all the BLOCK_SIZE x and y data in each of the threads of the block to perform a sum. So each thread of the block performs a weigthed sum based on BLOCK_SIZE element of x_d and y_d.

Then, my question is: “How can I get an efficient copy from global memory into shared memory of the two vectors?”.

Does anybody have a suggestion?

Thanks in advance,

Beetro

You cannot call CudaMemcpy inside a kernel…

What about using coalescing? Each thread copy one float in shared memory, something like

x[threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y] = x_d[same index]

I don’t remember if floats makes any trouble in coalescing, but if it works I think it’s the better way to copy some data from global to shared memory…

EDIT: look at page 82 of CUDA programming guide version 2.3 for coalescence

Ok, many thanks. I guessed it was not allowed. By the way, the code is compiled even if there a cudaMemcpy call in a kernel.

I think that is translated in a standard memcpy, but I don’t know how it’s executed in the kernel

yes, I want to perform coalescent access to memory, but my task is this: I need to multiply a 3D matrix (MxN1xN2), say A, with a vector C (1xM). Hence I want to do R=C*A. But, before evaluating R I need to generate A.

Solution I) I could use 3D grid of 3D blocks of threads that accomplish A filling with coalescent access to memory and then I can perform the multiplication with another kernel by exploiting shared memory in a fashion similar to the CUDA programming guide 2.3 section 3.2.2.

Solution II) As some operation in A generation and C*A multiplicaiton are similar (memory accesses) I would like to join togheter the two operations in one kernel, minimizing the memory access to global memory. In this case, I should size grid of blocks and blocks such that the number of total threads is not equal to MxN1xN2 (the 3D A matrix overall size), but is equal to N1xN2 (R size). Hence, i have available a less number of threads each managing a piece of A.

I do not know what is the best choiche. Should I have to to implement both and understand the most effective one by means of experimental trials?

What do you think?

Thank you,

Pietro

Keep attention of max z-dimension of 3D blocks and grids!

In my experience what you have absolutely to avoid is a large amount of global memory accesses!

Nice to see some guys from Italian Universities! :-)

Ciao,

luca

Ok, I think solution II is better.

What about something like this:

for (unsigned int i = 0; i < M; i++) {

   // load in shared memory A[i][threadIdx.y * blockDim.x][threadIdx.x]

}

Maybe I’ve not well understood your problem…

But this way you have M coalesced read from global memory, isn’t it?

Michele

Grazie Luca for your advices.

Indeed, it seems that 3d grid are not allowed, differntly from blocks of threads that can be 3d-indexed. So I’m wondering about the reason of that choice. I mean, if I have a 3D structure I probably would exploit 3d-blocks and 3d-grids of blocks to cover all the structure, don’t you think?

Ciao,

Pietro

Ciao Pietro,

yes, a grid can only be 2D. I think that 3D blocks are enough to describe a 3D problem… What’s the field you’re working in? Mathematics?

Ciao,

luca

I guess that with your example I can get coalescent acces to memory. Unfortunately, the values of M involved i my typical problem can be very big and i do not think that i can load all the M-values in the shared memory of a thread block. Indeed, I manage M value of even 10000 of order of magnitude, and as the type is a cuFLoatComplex, I need 40000 bytes =40 KB, that is > 16 KB available with shared. So this strategy should be very effective (theoretically) up to 4096 elements, but over this threshold it should show a significative drop of performance.

I not completely sure of my reasoning as I’m not expert as I have been using cuda for few weeks. So, please, your and somebodyelse’s comments are will be really appreciated.

Thanks,

Pietro

Electromagnetics…and particularly Antenna synthesis and analysis.

By the way, in I want to exploit shared memory whitout a 3D grid, when dimensions of data structures are large, you cannot segment the 3rd dim (the depth for instance). This is exactly the problem I have. I want to generate a 3D matrix of values of a function F(u,v,w) on points corresponding to the cartesian product of three vector u (N2 elements) v(N1 elements) and w(M elements)…M, N1 and N2 present very large values and I cannot load oin Shared mem all the corresponding vector if I use a 2d grid of blocks…if I could have used 3D-grid, I would have loaded only a segment of the vector fitting in the shared mem.

Do you have any suggestion to workaround this?

Thanks.

If you’re dealing with huge numbers, have a look at this link:
http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html

Here data is stored on shared memory in steps.