Why do we observe 2X DRAM write transactions on Pascal TITAN X?


When I run a very simple divergent kernel like the one shown below with 2621440 threads on a Pascal TITAN X, I get roughly 2X the DRAM writes, which I did not see on older architectures such as Fermi, Kepler, and Maxwell. The table below shows the L2 cache read and write transactions and the DRAM read and write transactions reported by nvprof 8.0 on each hardware architecture. Can you please explain where these extra writes are coming from on the Pascal TITAN X?

__global__ void irreguler(const float* A, float* C, int N)
{
    // Each thread copies one float; consecutive threads are 32 floats (128 bytes) apart.
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    C[i*32] = A[i*32];
}
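
For anyone who wants to reproduce the measurement, here is a minimal host-side sketch around the kernel. The block size of 256, the bounds check, and the buffer sizing are assumptions for illustration; the post does not state the launch configuration that was actually used.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same access pattern as in the question: thread i touches element i*32,
// so consecutive threads are 128 bytes apart.
__global__ void irreguler(const float* A, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)  // guard added for this sketch; not part of the original kernel
        C[i * 32] = A[i * 32];
}

int main()
{
    const int N = 2621440;            // thread count from the question
    const int threadsPerBlock = 256;  // assumed; not stated in the question
    const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;

    // One float per thread at a stride of 32 floats: 320 MiB per buffer.
    const size_t bytes = (size_t)N * 32 * sizeof(float);

    float* A = nullptr;
    float* C = nullptr;
    if (cudaMalloc(&A, bytes) != cudaSuccess ||
        cudaMalloc(&C, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // The values copied do not matter for the transaction counts,
    // so the buffers are left uninitialized here.
    irreguler<<<blocks, threadsPerBlock>>>(A, C, N);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(A);
    cudaFree(C);
    return 0;
}
```

Built with nvcc, this could be profiled with something like `nvprof --metrics dram_read_transactions,dram_write_transactions ./a.out`. The L2 metric names from the table can be added to the list where the architecture exposes them.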


GPU                           l2_l1_read_transactions   l2_l1_write_transactions   dram_read_transactions   dram_write_transactions
Fermi GTX 480                 2621440                   2621440                    2621459                  2621440
Kepler GeForce TITAN          2621652                   2621465                    2621448                  2621441
Maxwell GeForce GTX Titan X   2621440                   2621440                    2621448                  2655629
Pascal GTX TITANX             2621440                   2621440                    2622244                  4968328
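
To put the numbers in proportion, the ratio of DRAM write transactions to L2 write transactions can be worked out directly from the table. This uses only the figures reported above, rounded to three decimals:

```latex
\[
\frac{\texttt{dram\_write\_transactions}}{\texttt{l2\_l1\_write\_transactions}}:
\qquad
\begin{aligned}
\text{Fermi}   &: \frac{2621440}{2621440} = 1.000 \\
\text{Kepler}  &: \frac{2621441}{2621465} \approx 1.000 \\
\text{Maxwell} &: \frac{2655629}{2621440} \approx 1.013 \\
\text{Pascal}  &: \frac{4968328}{2621440} \approx 1.895
\end{aligned}
\]
```

So on Pascal the kernel issues 4968328 - 2621440 = 2346888 more DRAM write transactions than it has threads, while the DRAM read count stays within about 0.03% of the thread count.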