__global__ void my_kernel(cuFloatComplex* data_d, float* x_d, float* y_d)
{
    __shared__ float x[BLOCK_SIZE], y[BLOCK_SIZE];
    int ind_col = (blockIdx.x * BLOCK_SIZE) + threadIdx.x;
    int ind_row = (blockIdx.y * BLOCK_SIZE) + threadIdx.y;
    int ind_m = blockIdx.z;
    cudaMemcpy(x, (x_d + ind_m * BLOCK_SIZE), BLOCK_SIZE, cudaMemcpyDeviceToHost);
    cudaMemcpy(y, (y_d + ind_m * BLOCK_SIZE), BLOCK_SIZE, cudaMemcpyDeviceToHost);
    __syncthreads();
}
The grid is a 3D grid of (N2/BLOCK_SIZE, N1/BLOCK_SIZE, M/BLOCK_SIZE) blocks, and each block is BLOCK_SIZE x BLOCK_SIZE threads.
My first aim is to copy a segment of BLOCK_SIZE elements of the input vectors x_d and y_d into the shared variables x and y. I have some doubts about this.
First, can I actually use cudaMemcpy here, or does it not work inside a kernel?
If it works, then according to my code each single thread will copy a BLOCK_SIZE amount of float data into the shared variables, so I will have BLOCK_SIZE x BLOCK_SIZE threads each writing BLOCK_SIZE floats. Each thread therefore uselessly overwrites the shared variables with the same data. Is that correct?
I need to use all the BLOCK_SIZE x and y data in each thread of the block to perform a sum: each thread of the block performs a weighted sum based on BLOCK_SIZE elements of x_d and y_d.
So my question is: how can I get an efficient copy of the two vectors from global memory into shared memory?
I don’t remember whether floats cause any trouble for coalescing, but if they don’t, I think coalesced loads are the better way to copy data from global to shared memory…
EDIT: see page 82 of the CUDA Programming Guide version 2.3 for the coalescing rules.
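For reference, here is a minimal sketch of the idiomatic cooperative load that would replace the cudaMemcpy calls above (cudaMemcpy is a host-side function and cannot be called from device code, so the usual pattern is to let the threads do the copy themselves). It assumes a (BLOCK_SIZE, BLOCK_SIZE) block as in the launch described above; consecutive threadIdx.x values touch consecutive addresses, so the loads coalesce.

#include <cuComplex.h>
#define BLOCK_SIZE 16   // example value; any block width works

__global__ void my_kernel(cuFloatComplex* data_d, float* x_d, float* y_d)
{
    __shared__ float x[BLOCK_SIZE], y[BLOCK_SIZE];
    int ind_m = blockIdx.z;

    // Only one row of the block loads, one element per thread,
    // so nothing is written twice.
    if (threadIdx.y == 0) {
        x[threadIdx.x] = x_d[ind_m * BLOCK_SIZE + threadIdx.x];
        y[threadIdx.x] = y_d[ind_m * BLOCK_SIZE + threadIdx.x];
    }
    __syncthreads();  // now every thread of the block can read x[] and y[]

    // ... weighted sum over the BLOCK_SIZE elements goes here ...
}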
Yes, I want coalesced access to memory, but my task is this: I need to multiply a 3D matrix (MxN1xN2), say A, by a vector C (1xM). Hence I want to compute R = C*A. But before evaluating R I need to generate A.
Solution I) I could use a 3D grid of blocks of threads that fills A with coalesced memory access, and then perform the multiplication with a second kernel that exploits shared memory in a fashion similar to the CUDA Programming Guide 2.3, Section 3.2.2.
Solution II) As some operations in generating A and in the C*A multiplication are similar (memory accesses), I would like to join the two operations together in one kernel, minimizing accesses to global memory. In this case, I should size the grid of blocks and the blocks such that the total number of threads is not MxN1xN2 (the overall size of the 3D matrix A) but N1xN2 (the size of R). Hence I have fewer threads available, each managing a slice of A, as in the sketch below.
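For what it's worth, a minimal sketch of Solution II: one thread per element of R, looping over M and generating the elements of A on the fly. gen_A is a hypothetical placeholder for whatever formula actually produces A(m, row, col); cuCaddf/cuCmulf/make_cuFloatComplex come from cuComplex.h.

#include <cuComplex.h>

// Hypothetical device function standing in for the real generator of A.
__device__ cuFloatComplex gen_A(int m, int row, int col);

__global__ void fused_CA(cuFloatComplex* R, const cuFloatComplex* C,
                         int M, int N1, int N2)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // index along N2
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // index along N1
    if (row >= N1 || col >= N2) return;

    // Each thread accumulates R(row,col) = sum over m of C[m] * A(m,row,col),
    // so A is never stored in global memory.
    cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
    for (int m = 0; m < M; ++m)
        acc = cuCaddf(acc, cuCmulf(C[m], gen_A(m, row, col)));
    R[row * N2 + col] = acc;
}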
I do not know which is the best choice. Should I implement both and find the most effective one by experimental trials?
Indeed, it seems that 3D grids are not allowed, unlike thread blocks, which can be 3D-indexed. So I'm wondering about the reason for that choice. I mean, if I have a 3D structure, I would probably exploit 3D blocks and 3D grids of blocks to cover the whole structure, don't you think?
I guess that with your example I can get coalesced access to memory. Unfortunately, the values of M involved in my typical problem can be very big, and I do not think I can load all M values into the shared memory of a thread block. Indeed, I deal with M values on the order of 10000, and as the type is cuFloatComplex (8 bytes per element), I need 80000 bytes = 80 KB, which is more than the 16 KB of available shared memory. So this strategy should be very effective (theoretically) up to about 2048 complex elements, but over this threshold it should show a significant drop in performance.
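One way around the 16 KB limit is not to load all of C at once but to stage it in BLOCK_SIZE-sized tiles, so only BLOCK_SIZE*sizeof(cuFloatComplex) bytes of shared memory are needed regardless of M. A sketch, reusing the hypothetical gen_A from above and assuming blockDim.x == BLOCK_SIZE:

__global__ void fused_CA_tiled(cuFloatComplex* R, const cuFloatComplex* C,
                               int M, int N1, int N2)
{
    __shared__ cuFloatComplex Cs[BLOCK_SIZE];  // one tile of C at a time

    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);

    for (int m0 = 0; m0 < M; m0 += BLOCK_SIZE) {
        // One row of the block cooperatively loads the next tile of C.
        if (threadIdx.y == 0 && m0 + threadIdx.x < M)
            Cs[threadIdx.x] = C[m0 + threadIdx.x];
        __syncthreads();

        if (row < N1 && col < N2) {
            int tile = min(BLOCK_SIZE, M - m0);
            for (int m = 0; m < tile; ++m)
                acc = cuCaddf(acc, cuCmulf(Cs[m], gen_A(m0 + m, row, col)));
        }
        __syncthreads();  // keep Cs intact until every thread is done with it
    }
    if (row < N1 && col < N2)
        R[row * N2 + col] = acc;
}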
I'm not completely sure of my reasoning, as I'm no expert; I have been using CUDA for only a few weeks. So your comments, and anybody else's, will be really appreciated.
Electromagnetics…and particularly Antenna synthesis and analysis.
By the way, if I want to exploit shared memory without a 3D grid, then when the dimensions of the data structures are large I cannot segment the 3rd dimension (the depth, for instance). This is exactly the problem I have. I want to generate a 3D matrix of values of a function F(u,v,w) at the points corresponding to the Cartesian product of three vectors: u (N2 elements), v (N1 elements) and w (M elements). M, N1 and N2 take very large values, and I cannot load the corresponding vectors entirely into shared memory if I use a 2D grid of blocks… If I could use a 3D grid, I would load only a segment of each vector, which would fit in shared memory.
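For what it's worth, the usual workaround when only 2D grids are available is to fold the depth dimension into gridDim.y and unfold it inside the kernel. A sketch, assuming for brevity that N1 and M are multiples of BLOCK_SIZE; the body only hints at the F(u,v,w) evaluation:

// Launch with grid = dim3(N2/BLOCK_SIZE, (N1/BLOCK_SIZE) * (M/BLOCK_SIZE))
// and block = dim3(BLOCK_SIZE, BLOCK_SIZE).
__global__ void eval_F(float* u_d, float* v_d, float* w_d,
                       cuFloatComplex* A, int N1, int N2, int M)
{
    int blocksPerCol = N1 / BLOCK_SIZE;
    int block_row = blockIdx.y % blocksPerCol;  // unfolded N1 block index
    int block_m   = blockIdx.y / blocksPerCol;  // unfolded depth block index

    // Only a BLOCK_SIZE segment of w is needed, so it fits in shared memory.
    __shared__ float w_seg[BLOCK_SIZE];
    if (threadIdx.y == 0)
        w_seg[threadIdx.x] = w_d[block_m * BLOCK_SIZE + threadIdx.x];
    __syncthreads();

    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;  // index into u
    int row = block_row * BLOCK_SIZE + threadIdx.y;   // index into v
    // ... evaluate F(u_d[col], v_d[row], w_seg[k]) for the BLOCK_SIZE values
    //     of w handled by this block and store the results into A ...
}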