Hi,
I have 512×512 = 262144 pixels and I want to apply the same process to all of them, so I run a kernel for this. The process needs the coordinates of the pixels, so I have to transfer X_Coordinates (1×512) and Y_Coordinates (1×512) of the pixels to the device. If I do it the normal way, they go to global memory and all the threads have to read from there. However, if I move them into shared memory, I guess my kernel will be much faster.
I know how to store data in dynamic shared memory (the snippet below fills in the rest of my truncated example):
"
if (TID < 512)
{
    s_coords[TID] = X_Coordinates[TID];   // and likewise for Y_Coordinates
}
__syncthreads();
"
This is the normal way, but it does not make sense to me: I am stalling 262144 threads to load only 512 numbers! That adds a lot of latency before the work I actually want to do after the __syncthreads(). So the question is: how can I store X_Coordinates and Y_Coordinates in shared memory efficiently?
Buffering data in shared memory can make sense if there is data re-use. Your description suggests that for your use case the data processing is of a streaming nature without any data re-use, in which case shuffling data through shared memory will at best have no performance impact but likely will slow down the processing.
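To illustrate, in a streaming pattern each thread simply reads the coordinates it needs directly from global memory once, and there is nothing for shared memory to buffer. A minimal sketch (function and pointer names are hypothetical, and the per-pixel computation is a placeholder):

```cuda
// Streaming access: each thread reads its pixel's coordinates straight
// from global memory. For a 512x512 image, the row and column are
// recovered from the linear thread index.
__global__ void process_pixels(const float *d_X_coord,   // 512 entries
                               const float *d_Y_coord,   // 512 entries
                               float *d_out,
                               int width, int height)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= width * height) return;

    float x = d_X_coord[idx % width];   // column coordinate
    float y = d_Y_coord[idx / width];   // row coordinate

    d_out[idx] = x * x + y * y;         // placeholder per-pixel work
}
```

Note that many threads in a warp read the same coordinate entry here, so these reads are served efficiently by the cache hierarchy anyway.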
Hi,
Well, I did not put all of the code of my kernel here, but I have multiple kernels in my program and I’ll be using Y_Coordinates and X_Coordinates in all of them. So, the memory indicated by these two pointers will be used repeatedly across my kernel(s).
Does it make sense to run a kernel with 1024 threads that transfers Y_Coordinates and X_Coordinates to shared memory at the beginning of my program, and then use them in the other kernels?
Not sure I fully understand the requirements, but on the face of it, it sounds like constant memory would be the place to store them. Note that shared memory only exists for the lifetime of a single kernel launch, so it cannot be filled once and then reused by later kernels.
Later: To elaborate:
__constant__ float c_X_coord[512];
__constant__ float c_Y_coord[512];
__global__ void kernel_1(...)
{
    // do things with c_X_coord
    // do things with c_Y_coord
}

__global__ void kernel_2(...)
{
    // do things with c_X_coord
    // do things with c_Y_coord
}
cudaMemcpyToSymbol(c_X_coord, hostDataX, 512*sizeof(float));
cudaMemcpyToSymbol(c_Y_coord, hostDataY, 512*sizeof(float));
kernel_1<<<...>>>(...);
kernel_2<<<...>>>(...);