I’ll try to look at that. Thanks.
Is the following correctly understood:
If I launch a kernel with a 2D grid and 2D blocks, will every block do all its work before being rescheduled? (I’m not doing anything explicit to halt execution - i.e. waiting for something to happen. I’m simply doing a 5-point stencil in each block.)
It’s important because each block moves a chunk of data between global and shared memory.
For each block, this is what I plan to do (let’s assume each threadIdx.x spans 4 floats/16 bytes, each threadIdx.y spans 1 float, and that the kernel is launched with a blockDim and gridDim that cover all data in the input matrix).
//Move data from global to shared memory - each thread copies its own four bytes - this should add up to all the data for the current block
A for-loop that copies 4 bytes from global to shared memory, indexed by blockIdx and threadIdx information.
//Do the operation on the data in shared memory
A for-loop doing a red or black iteration across the data in shared memory - reading values from global memory for the border cases.
//Move data back to global memory - again, each thread moves four bytes, which should add up to the whole block’s data once the code is done executing for the current block.
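To make the plan concrete, here is a minimal sketch of such a kernel. The tile sizes, the averaging stencil coefficients, and treating the “four bytes” as one float are my own assumptions, and border handling is left out for brevity - it’s a sketch of the structure, not a finished implementation:

```cuda
#define TILE_X 16  // assumed tile width; blockDim is assumed to be (TILE_X, TILE_Y)
#define TILE_Y 16  // assumed tile height

__global__ void stencil_kernel(const float *in, float *out, int width, int height)
{
    // One float of shared memory per thread - the whole tile for this block.
    __shared__ float tile[TILE_Y][TILE_X];

    int gx = blockIdx.x * blockDim.x + threadIdx.x;
    int gy = blockIdx.y * blockDim.y + threadIdx.y;

    // 1) Each thread copies its own element from global to shared memory.
    if (gx < width && gy < height)
        tile[threadIdx.y][threadIdx.x] = in[gy * width + gx];

    // Barrier: without this there is no guarantee the whole tile is loaded
    // before any thread starts reading its neighbours from shared memory.
    __syncthreads();

    // 2) 5-point stencil on the interior of the tile; threads on the tile
    //    border would read their missing neighbours from global memory
    //    (omitted here for brevity).
    if (threadIdx.x > 0 && threadIdx.x < TILE_X - 1 &&
        threadIdx.y > 0 && threadIdx.y < TILE_Y - 1 &&
        gx < width - 1 && gy < height - 1)
    {
        float v = 0.25f * (tile[threadIdx.y][threadIdx.x - 1] +
                           tile[threadIdx.y][threadIdx.x + 1] +
                           tile[threadIdx.y - 1][threadIdx.x] +
                           tile[threadIdx.y + 1][threadIdx.x]);

        // 3) Each thread writes its result back to global memory.
        out[gy * width + gx] = v;
    }
}
```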
I guess my questions can be broken down to these:
Can I be sure that all threads (in 2D) belonging to the same block are always synchronized regarding instructions (so that I know all data is present in shared memory when I’m doing the 5-point stencil operations on it in the middle part of the pseudocode)? Just making sure… :)
Can I be sure that blocks are not swapped out in the middle of execution, which would potentially cause incorrect data to be present in the shared memory?
Is it a good idea to let each thread of the block copy its own four bytes of data, or is it better to let fewer threads copy more data each? I guess the latter would hold if each memory transfer has some constant time factor attached to it.
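For reference, one common way to let each thread move more data per transaction is a vectorized copy through float4, so every thread moves 16 bytes per iteration instead of 4. This is a sketch under the assumption that the buffers are 16-byte aligned and the element count is a multiple of 4:

```cuda
// Hypothetical copy kernel: each thread moves one float4 (16 bytes) per
// iteration. Assumes `in` and `out` are 16-byte aligned and `n4` is the
// number of float4 elements (i.e. float count / 4).
__global__ void copy_vec4(const float4 *in, float4 *out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Grid-stride loop: fewer threads can still cover all the data.
    for (; i < n4; i += stride)
        out[i] = in[i];
}
```

Whether this actually wins depends less on bytes-per-thread than on whether the accesses are coalesced; both the one-float-per-thread and the float4 variant can saturate memory bandwidth when neighbouring threads touch neighbouring addresses.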