Hi, I’m new to CUDA and I have a question (I think quite a simple one). I need to transfer a one-dimensional float array, which holds an image stored row by row, from global memory to shared memory. So far I have written the code below, but I don't think it is efficient because the copy is done by only one thread per block. How can I have all the threads in the block take part in the copy? The blocks are two-dimensional (32, 32) and the grid is made up of N blocks (with N depending on the size of the image).
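__shared__ float s_template[4800];
// With a (32, 32) block, threadIdx.x == 0 alone matches 32 threads (one per row),
// so threadIdx.y is tested as well to leave exactly one thread doing the copy.
if (threadIdx.x == 0 && threadIdx.y == 0) {
    for (int j = 0; j < Th; j++) {        // template rows
        for (int t = 0; t < Tw; t++) {    // template columns
            s_template[t + j * Tw] = T[t + j * Tw];
        }
    }
}
__syncthreads();  // wait until the copy is visible to the whole block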
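
For comparison, here is a minimal sketch of how the same copy could be shared by all 1024 threads of the block. It flattens the 2D thread index into a linear one and lets each thread copy every 1024th element (a block-stride loop), so neighbouring threads read neighbouring floats. The kernel name loadTemplate, the out buffer, the helper runCopyCheck and the sizes Tw = 80, Th = 60 are assumptions made only so the example can be compiled and checked; T, Tw, Th and s_template are as in the code above.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define MAX_TEMPLATE 4800  // same capacity as s_template in the post

// Hypothetical kernel: `out` exists only so the host can verify the copy.
__global__ void loadTemplate(const float *T, int Tw, int Th, float *out)
{
    __shared__ float s_template[MAX_TEMPLATE];

    const int n        = Tw * Th;                               // assumed <= MAX_TEMPLATE
    const int tid      = threadIdx.y * blockDim.x + threadIdx.x; // linear index in the block
    const int nThreads = blockDim.x * blockDim.y;                // 1024 for a (32, 32) block

    // Block-stride loop: thread tid copies elements tid, tid + nThreads, ...
    for (int i = tid; i < n; i += nThreads)
        s_template[i] = T[i];

    __syncthreads();  // the whole template is now visible to every thread

    // Demo only: block 0 writes the shared copy back to global memory.
    if (blockIdx.x == 0)
        for (int i = tid; i < n; i += nThreads)
            out[i] = s_template[i];
}

// Hypothetical host helper: returns the number of mismatching elements,
// or -1 if a CUDA call failed.
int runCopyCheck()
{
    const int Tw = 80, Th = 60, n = Tw * Th;  // assumed sizes (80 * 60 = 4800)
    std::vector<float> h_T(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_T[i] = 0.5f * i;

    float *d_T = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_T, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) { cudaFree(d_T); return -1; }
    cudaMemcpy(d_T, h_T.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(32, 32);  // block shape from the post
    dim3 grid(4);        // arbitrary N for the demo
    loadTemplate<<<grid, block>>>(d_T, Tw, Th, d_out);
    cudaError_t err = cudaDeviceSynchronize();

    if (err == cudaSuccess)
        err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_T);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_T[i]) ++mismatches;
    return mismatches;
}

int main()
{
    std::printf("mismatches: %d\n", runCopyCheck());
    return 0;
}
```

The loop bound uses Tw * Th rather than the fixed 4800, so the sketch relies on the template actually fitting in s_template; a larger template would need a bigger (or dynamically sized) shared array.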