Hi,
We work with a dynamic programming algorithm with a huge number of data. Kernel functions contains several loops, and data are stored into array to be consecutive :
Address 1 : data for thread 1
Address 2 : data for thread 2…
Data are always float or float2.
Let consider blocksize to be 64 in only one dimension. The following access to global memory are non-coalesced (except for some of them, I guess for j=0). There isn’t warp serialize according to the cuda profiler.
number_reference is a multiple of 16.
[codebox]
int num_thread = blockIdx.x*BLOCKSIZE + threadIdx.x;
for(int j = reference_size-1; j >= 0; j–) {
int base_address = j*number_reference;
float reference_vertex = reference_vector[base_address+ num_thread];
[/codebox]
According to the Cuda profiler, I have on this part of code gld_incoherent=[ 25904 ] (over more 17000 threads). This code is simplify (I have 2 more load and 1 write in other part of code in loops), so the non-coalesced access slow the execution.
My question is why the load are coalesced assuming that data of a half warp are in same segment of words of 32bits (or 64bits) and reference_vector begins to a multiple of 32 (all my arrays in globlal memory are float or float2 array, so I don’t make mistake reference_vector has to begin at address%32 = 0). And apparently the first loads are coalesced. Only thing I see is that the device is incapable to predicate that “base address” will be the same for all thread at same time at same j.
Thank you