I am doing some tests on single precision atomic (reduction) transactions using the P100 and I am getting random unexpected results. I am hoping that someone has an idea of the cause
Following is a test program I am profiling
__global__ void atomic_test(float * out)
{
int x=threadIdx.x+blockIdx.x*blockDim.x;
//Create a pattern
int myP=x/8;
int myU=x%8;
int myNewX=myP*myP*32+myU;
atomicAdd(out+myNewX,1.0f);
}
main()
{
int blocks=1;
float * out;
cudaMalloc(&out,sizeof(float)*16*32);
dim3 threadDim;
threadDim.x=32;
threadDim.y=1;
threadDim.z=1;
for (int x=0;x<5;x++)
atomic_test<<<blocks,threadDim>>>(out);
cudaDeviceSynchronize();
cudaFree(out);
exit(0);
}
atomic_test is run with just 1 warp and all it does is atomic adds. The warp is somehow split in 4 and every group of 8 threads will execute atomic add on a properly aligned 32Byte word.
My understanding of the P100 is any memory related transactions work on 32-byte aligned words, so there should be 4 atomic transactions, generated by the Warp.
The weird thing is that many time the profiler gives 4 transactions as shown below
atomic_transactions Atomic Transactions 4 4 4
5 atomic_transactions_per_request Atomic Transactions Per Request 4.000000 4.000000 4.000000
5 l2_atomic_throughput L2 Throughput (Atomic requests) 104.20MB/s 105.28MB/s 104.87MB/s
5 l2_atomic_transactions L2 Transactions (Atomic requests) 16 16 16
But some times it reports 6 transactions:
Kernel: atomic_test(float*)
5 atomic_transactions Atomic Transactions 6 6 6
5 atomic_transactions_per_request Atomic Transactions Per Request 6.000000 6.000000 6.000000
5 l2_atomic_throughput L2 Throughput (Atomic requests) 104.47MB/s 105.28MB/s 105.00MB/s
5 l2_atomic_transactions L2 Transactions (Atomic requests) 16 16 16
Does anybody have an idea why this happens? Some alignment problems maybe? Also, does anybody know why 4/6 atomic transactions create 16 read transactions on L2?
Thanks in advance for the help
Daniel