Running on a GT 440 (Fermi).
I have 384 threads per block (dim3 threads(384, 1, 1)); I batch my data into 8 groups of 48 values (to speed up other aspects of the routine). Essentially I am performing a matrix (48x48) times vector (48x1) multiply. The code snippet WITHOUT the __syncthreads calls does not deliver correct results; it requires every one of the __syncthreads calls to deliver correct results.
The code assumes that each group of 48 threads, as indexed by tx, computes one matrix row times the column vector. I break the additions up into 3 groups of 16 and add the partial sums of the three groups at the end. Each group of 16 threads should fall within a single warp: (0 to 15), (16 to 31), (32 to 47). So I should not need to __syncthreads for these steps, since each line of code executes simultaneously in every thread of the warp, right?
__shared__ float facc[384];          // one slot per thread (was declared as a scalar by mistake)
int tx = threadIdx.x;

facc[tx] = x0[tx] * Cij[ind];        // each thread: one row element times one vector element
__syncthreads();
facc[tx] += facc[tx + 1];            // 0+1; 2+3; ...
__syncthreads();
facc[tx] += facc[tx + 2];            // 0+2; 4+6; ...
__syncthreads();
facc[tx] += facc[tx + 4];            // 0+4; 8+12; ...
__syncthreads();
facc[tx] += facc[tx + 8];            // 0+8; ...
__syncthreads();                     // wait here for all half-warps to finish this calc

if (tx % 48 == 0) {
    float answer = facc[tx] + facc[tx + 16] + facc[tx + 32];
    // store result
}
As I noted above, without the sync calls the answer is wrong, but the __syncthreads calls slow the module down by a factor of 2, so I want to eliminate them.
Can someone explain why the sync is needed when the computation is entirely internal to a single warp?