Hi. In a CUDA kernel I have a shared buffer of 1025 floats and 512 threads per block. What is the fastest way to fill that buffer from global memory?
__global__ void processAL1(float* samples, unsigned char* out, int samplesNum)
{
    int bidx = blockIdx.x;
    int tidx = threadIdx.x; // 512 threads per block
    __shared__ float sampleBlock[1025];
    // need to copy 1025 floats from global memory, samples[bidx] .. samples[bidx + 1024], into sampleBlock
Each thread could read 2 floats from global memory, which covers 512 * 2 = 1024 elements, but I need 1025, so what do I do about the last one?
P.S.: OpenCL has a handy built-in for this, async_work_group_copy. What is the CUDA equivalent?
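For reference, the pattern I am currently considering is a block-stride copy loop, where the odd element falls out naturally because thread 0 simply makes one extra trip. This is just a sketch (the processing part of the kernel is omitted, and I am not sure it is the fastest option):

    __global__ void processAL1(float* samples, unsigned char* out, int samplesNum)
    {
        int bidx = blockIdx.x;
        __shared__ float sampleBlock[1025];

        // Block-stride copy: each thread copies every blockDim.x-th element.
        // With 512 threads, thread t handles i = t and i = t + 512, and
        // thread 0 additionally handles i = 1024, covering all 1025 floats.
        for (int i = threadIdx.x; i < 1025; i += blockDim.x)
            sampleBlock[i] = samples[bidx + i];

        __syncthreads(); // make the filled shared buffer visible to all threads

        // ... process sampleBlock ...
    }

The loads stay coalesced within each stride pass, but I do not know whether this beats a two-floats-per-thread copy plus a special case for the last element.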