On atomic transactions on the P100

I am doing some tests on single precision atomic (reduction) transactions using the P100 and I am getting random unexpected results. I am hoping that someone has an idea of the cause

Following is a test program I am profiling

__global__ void atomic_test(float * out)
{
        int x=threadIdx.x+blockIdx.x*blockDim.x;
        //Create a pattern
        int myP=x/8;
        int myU=x%8;
        int myNewX=myP*myP*32+myU;
        atomicAdd(out+myNewX,1.0f);
}
main()
{
        int blocks=1;
        float * out;
        cudaMalloc(&out,sizeof(float)*16*32);
        dim3 threadDim;
        threadDim.x=32;
        threadDim.y=1;
        threadDim.z=1;
        for (int x=0;x<5;x++)
        atomic_test<<<blocks,threadDim>>>(out);
        cudaDeviceSynchronize();
        cudaFree(out);
        exit(0);
}

atomic_test is run with just 1 warp and all it does is atomic adds. The warp is somehow split in 4 and every group of 8 threads will execute atomic add on a properly aligned 32Byte word.

My understanding of the P100 is any memory related transactions work on 32-byte aligned words, so there should be 4 atomic transactions, generated by the Warp.

The weird thing is that many time the profiler gives 4 transactions as shown below

atomic_transactions                                                   Atomic Transactions           4           4           4
          5           atomic_transactions_per_request                                       Atomic Transactions Per Request    4.000000    4.000000    4.000000
          5                      l2_atomic_throughput                                       L2 Throughput (Atomic requests)  104.20MB/s  105.28MB/s  104.87MB/s
          5                    l2_atomic_transactions                                     L2 Transactions (Atomic requests)          16          16          16

But some times it reports 6 transactions:

Kernel: atomic_test(float*)
          5                       atomic_transactions                                                   Atomic Transactions           6           6           6
          5           atomic_transactions_per_request                                       Atomic Transactions Per Request    6.000000    6.000000    6.000000
          5                      l2_atomic_throughput                                       L2 Throughput (Atomic requests)  104.47MB/s  105.28MB/s  105.00MB/s
          5                    l2_atomic_transactions                                     L2 Transactions (Atomic requests)          16          16          16

Does anybody have an idea why this happens? Some alignment problems maybe? Also, does anybody know why 4/6 atomic transactions create 16 read transactions on L2?

Thanks in advance for the help
Daniel