cudaMemcpy into shared variables

Beetro · September 22, 2009, 9:29am

Hello,

i have the following single kernel

global void my_kernel (cuFloatComplex* data_d, float* x_d, float* y_d)
{
shared float x[BLOCK_SIZE],y[BLOCK_SIZE];
int ind_col =(blockIdx.xBLOCK_SIZE)+threadIdx.x;
int ind_row =(blockIdx.yBLOCK_SIZE)+threadIdx.y;
int ind_m=blockIdx.z;
cudaMemcpy(x,(x_d+ind_mBLOCK_SIZE),BLOCK_SIZE,cudaMemcpyDev
iceToHost);
cudaMemcpy(y,(y_d+ind_mBLOCK_SIZE),BLOCK_SIZE,cudaMemcpyDev
iceToHost);
__syncthreads();

}

the grid is a 3D grid of (N2/BLOCK_SIZE,N1_BLOCK_SIZE, M/BLOCK_SIZE) and the Block is BLOCK_SIZExBLOCK_SIZE sized.

My first aim is to copy a segment of BLOCK_SIZE dimension of input vector x_d and y_d into the Shared variables x and y. I have some doubts about.

First, can I use effectively cudaMemcpy? Or it doesn’t work inside a kernel?
If it works, accoridng to my code, each single thread will copy a BLOCK_SIZE amout of float data into the shared variables and hence I will have BLOCK_SIZExBLOCK_size threads writing BLOCK_SIZE float data. So each kernel overwrites useless the shared variables with the same data. Is it correct?

I need to use the all the BLOCK_SIZE x and y data in each of the threads of the block to perform a sum. So each thread of the block performs a weigthed sum based on BLOCK_SIZE element of x_d and y_d.

Then, my question is: “How can I get an efficient copy from global memory into shared memory of the two vectors?”.

Does anybody have a suggestion?

Thanks in advance,

Beetro

CaLu · September 22, 2009, 9:40am

You cannot call CudaMemcpy inside a kernel…

mickey_mouse · September 22, 2009, 10:00am

What about using coalescing? Each thread copy one float in shared memory, something like

x[threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y] = x_d[same index]

I don’t remember if floats makes any trouble in coalescing, but if it works I think it’s the better way to copy some data from global to shared memory…

EDIT: look at page 82 of CUDA programming guide version 2.3 for coalescence

Beetro · September 22, 2009, 10:42am

Ok, many thanks. I guessed it was not allowed. By the way, the code is compiled even if there a cudaMemcpy call in a kernel.

mickey_mouse · September 22, 2009, 10:47am

I think that is translated in a standard memcpy, but I don’t know how it’s executed in the kernel

Beetro · September 22, 2009, 10:56am

What about using coalescing? Each thread copy one float in shared memory, something like
x[threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y] = x_d[same index]
I don’t remember if floats makes any trouble in coalescing, but if it works I think it’s the better way to copy some data from global to shared memory…

EDIT: look at page 82 of CUDA programming guide version 2.3 for coalescence

yes, I want to perform coalescent access to memory, but my task is this: I need to multiply a 3D matrix (MxN1xN2), say A, with a vector C (1xM). Hence I want to do R=C*A. But, before evaluating R I need to generate A.

Solution I) I could use 3D grid of 3D blocks of threads that accomplish A filling with coalescent access to memory and then I can perform the multiplication with another kernel by exploiting shared memory in a fashion similar to the CUDA programming guide 2.3 section 3.2.2.

Solution II) As some operation in A generation and C*A multiplicaiton are similar (memory accesses) I would like to join togheter the two operations in one kernel, minimizing the memory access to global memory. In this case, I should size grid of blocks and blocks such that the number of total threads is not equal to MxN1xN2 (the 3D A matrix overall size), but is equal to N1xN2 (R size). Hence, i have available a less number of threads each managing a piece of A.

I do not know what is the best choiche. Should I have to to implement both and understand the most effective one by means of experimental trials?

What do you think?

Thank you,

Pietro

CaLu · September 22, 2009, 11:27am

Keep attention of max z-dimension of 3D blocks and grids!

In my experience what you have absolutely to avoid is a large amount of global memory accesses!

Nice to see some guys from Italian Universities! :-)

Ciao,

luca

mickey_mouse · September 22, 2009, 11:31am

yes, I want to perform coalescent access to memory, but my task is this: I need to multiply a 3D matrix (MxN1xN2), say A, with a vector C (1xM). Hence I want to do R=C*A. But, before evaluating R I need to generate A.

Solution I) I could use 3D grid of 3D blocks of threads that accomplish A filling with coalescent access to memory and then I can perform the multiplication with another kernel by exploiting shared memory in a fashion similar to the CUDA programming guide 2.3 section 3.2.2.

Solution II) As some operation in A generation and C*A multiplicaiton are similar (memory accesses) I would like to join togheter the two operations in one kernel, minimizing the memory access to global memory. In this case, I should size grid of blocks and blocks such that the number of total threads is not equal to MxN1xN2 (the 3D A matrix overall size), but is equal to N1xN2 (R size). Hence, i have available a less number of threads each managing a piece of A.

Ok, I think solution II is better.

What about something like this:

for (unsigned int i = 0; i < M; i++) {

   // load in shared memory A[i][threadIdx.y * blockDim.x][threadIdx.x]

}

Maybe I’ve not well understood your problem…

But this way you have M coalesced read from global memory, isn’t it?

Michele

Beetro · September 23, 2009, 8:51am

Grazie Luca for your advices.

Indeed, it seems that 3d grid are not allowed, differntly from blocks of threads that can be 3d-indexed. So I’m wondering about the reason of that choice. I mean, if I have a 3D structure I probably would exploit 3d-blocks and 3d-grids of blocks to cover all the structure, don’t you think?

Ciao,

Pietro

CaLu · September 23, 2009, 9:00am

Ciao Pietro,

yes, a grid can only be 2D. I think that 3D blocks are enough to describe a 3D problem… What’s the field you’re working in? Mathematics?

Ciao,

luca

Beetro · September 23, 2009, 9:01am

Ok, I think solution II is better.

What about something like this:
for (unsigned int i = 0; i < M; i++) {

   // load in shared memory A[i][threadIdx.y * blockDim.x][threadIdx.x]

}
Maybe I’ve not well understood your problem…

But this way you have M coalesced read from global memory, isn’t it?

Michele

I guess that with your example I can get coalescent acces to memory. Unfortunately, the values of M involved i my typical problem can be very big and i do not think that i can load all the M-values in the shared memory of a thread block. Indeed, I manage M value of even 10000 of order of magnitude, and as the type is a cuFLoatComplex, I need 40000 bytes =40 KB, that is > 16 KB available with shared. So this strategy should be very effective (theoretically) up to 4096 elements, but over this threshold it should show a significative drop of performance.

I not completely sure of my reasoning as I’m not expert as I have been using cuda for few weeks. So, please, your and somebodyelse’s comments are will be really appreciated.

Thanks,

Pietro

Beetro · September 23, 2009, 9:08am

Electromagnetics…and particularly Antenna synthesis and analysis.

By the way, in I want to exploit shared memory whitout a 3D grid, when dimensions of data structures are large, you cannot segment the 3rd dim (the depth for instance). This is exactly the problem I have. I want to generate a 3D matrix of values of a function F(u,v,w) on points corresponding to the cartesian product of three vector u (N2 elements) v(N1 elements) and w(M elements)…M, N1 and N2 present very large values and I cannot load oin Shared mem all the corresponding vector if I use a 2d grid of blocks…if I could have used 3D-grid, I would have loaded only a segment of the vector fitting in the shared mem.

Do you have any suggestion to workaround this?

Thanks.

CaLu · September 23, 2009, 9:11am

If you’re dealing with huge numbers, have a look at this link:
[url=“http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html”]http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html[/url]

Here data is stored on shared memory in steps.

Topic		Replies	Views
Please help with __shared__ memory different usage than in samples CUDA Programming and Performance	30	3309	January 10, 2010
How would you do this? CUDA Programming and Performance	12	4466	August 5, 2008
Memory Coalescing CUDA Programming and Performance	7	2873	July 29, 2009
Cuda Latency problems Slow Cuda CUDA Programming and Performance	15	13931	September 5, 2008
floyd on cuda--why so slow? CUDA Programming and Performance	15	5463	May 2, 2009
Cuda good practices for image processing CUDA Programming and Performance	8	15491	February 12, 2009
Using Shared Memory in CUDA C/C++ Technical Blog	36	1969	October 8, 2020
Why is the performance more? Refering to Dr Dobbs article CUDA Programming and Performance	10	2638	April 23, 2010
Reduction CUDA Programming and Performance	19	3463	May 16, 2012
CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops Technical Blog	18	1501	September 12, 2019

cudaMemcpy into shared variables

Related topics