For example, consider the following kernel, launched so that three warps run on one SM.
[codebox]__global__ void test(int *a, int *c)
{
    __shared__ int sdata[96];
    int tid = threadIdx.x;
    // memory load from global mem to shared mem
    sdata[tid] = a[tid];
    __syncthreads();
    // computation
    for (int i = 0; i < 10; i++)
        c[tid] += sdata[tid] * i;
    ......
}[/codebox]
I have two assumptions that I would like to verify.
1 Within a single warp, the computation cannot start until that warp's own memory load has completed, whether or not there is a __syncthreads() between the load and the computation. The __syncthreads() is there to ensure that no warp starts its computation until the memory loads of all warps have finished.
2 The memory transfer hardware and the ALUs work concurrently whenever possible. That is, while one warp's computation is executing on the ALUs and the memory channel is free, another warp's memory load can be issued and carried out.
Are these correct? I think so, but I am not sure.
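To make point 1 concrete, here is a minimal sketch of the case where I believe the barrier matters. It is a hypothetical variant of the kernel above, not the original: the names `test_cross_warp` and `run_demo` are made up for illustration, and the launch configuration (one block of 96 threads, i.e. three warps) is an assumption. Each thread reads a shared-memory element that was written by a thread in a different warp, so the read is only safe after __syncthreads().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 96  // three warps of 32 threads, as in the question

// Hypothetical variant of the kernel: each thread reads a shared-memory
// element written by a thread that belongs to a different warp.
__global__ void test_cross_warp(const int *a, int *c)
{
    __shared__ int sdata[N];
    int tid = threadIdx.x;

    // memory load from global mem to shared mem
    sdata[tid] = a[tid];

    // Block-wide barrier: without it, the other warp may not have
    // written sdata[neighbor] yet when this thread reads it.
    __syncthreads();

    // computation on an element owned by the next warp (wraps around)
    int neighbor = (tid + 32) % N;
    int sum = 0;
    for (int i = 0; i < 10; i++)
        sum += sdata[neighbor] * i;
    c[tid] = sum;
}

// Hypothetical host helper: runs the kernel once and returns the number
// of mismatching results, or -1 if a CUDA call fails.
int run_demo()
{
    int h_a[N], h_c[N];
    for (int i = 0; i < N; i++)
        h_a[i] = i;

    int *d_a = nullptr, *d_c = nullptr;
    if (cudaMalloc(&d_a, N * sizeof(int)) != cudaSuccess)
        return -1;
    if (cudaMalloc(&d_c, N * sizeof(int)) != cudaSuccess) {
        cudaFree(d_a);
        return -1;
    }

    cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);
    test_cross_warp<<<1, N>>>(d_a, d_c);  // one block, three warps
    // This blocking copy waits for the kernel and reports any error from it.
    cudaError_t err = cudaMemcpy(h_c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_c);
    if (err != cudaSuccess)
        return -1;

    int mismatches = 0;
    for (int t = 0; t < N; t++) {
        // 0 + 1 + ... + 9 = 45, so each result is 45 times the neighbor's value
        int expected = h_a[(t + 32) % N] * 45;
        if (h_c[t] != expected)
            mismatches++;
    }
    return mismatches;
}

int main()
{
    int result = run_demo();
    printf("mismatches: %d\n", result);
    return result == 0 ? 0 : 1;
}
```

In the original kernel each thread reads back only the sdata[tid] that it wrote itself, so this cross-warp dependency does not arise there.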