When I run a very simple divergent kernel like the one shown below with 2621440 threads on a Pascal TITANX, I get roughly 2x the DRAM writes, which I did not see on older architectures such as Fermi, Kepler, and Maxwell. The table below shows the L2 cache reads and writes and the DRAM reads and writes reported by nvprof 8.0 on each hardware architecture. Can you please explain where these extra writes are coming from on the Pascal TITANX?
__global__ void irreguler(const float* A, float* C, int N)
{
    // Each thread touches an element 32 floats (128 bytes) away from its neighbor's.
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    C[i*32] = A[i*32];
}
GPU                           l2_l1_read_transactions   l2_l1_write_transactions   dram_read_transactions   dram_write_transactions
Fermi GTX 480                 2621440                   2621440                    2621459                  2621440
Kepler GeForce TITAN          2621652                   2621465                    2621448                  2621441
Maxwell GeForce GTX Titan X   2621440                   2621440                    2621448                  2655629
Pascal GTX TITANX             2621440                   2621440                    2622244                  4968328
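
In case it helps anyone reproduce the numbers, here is a minimal sketch of a host program that launches the kernel above. The block size of 256, the buffer sizing, the `CHECK` macro, and the `requiredElements` helper are assumptions made for illustration; they are not taken from the original measurement setup.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Same strided copy as in the question: each thread reads and writes
// one float located 32 floats (128 bytes) after its neighbor's.
__global__ void irreguler(const float* A, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    C[i * 32] = A[i * 32];
}

// Hypothetical helper: number of floats each buffer must hold so that
// the highest-indexed thread's access (threads - 1) * stride stays in bounds.
static size_t requiredElements(size_t threads, size_t stride)
{
    return threads * stride;
}

// Hypothetical error-checking macro for CUDA runtime calls.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

int main()
{
    const int N = 2621440;        // thread count from the question
    const int block = 256;        // assumed block size; N is a multiple of it
    const int grid = (N + block - 1) / block;
    const size_t bytes = requiredElements(N, 32) * sizeof(float);  // 320 MiB per buffer

    float* A = nullptr;
    float* C = nullptr;
    CHECK(cudaMalloc(&A, bytes));
    CHECK(cudaMalloc(&C, bytes));
    CHECK(cudaMemset(A, 0, bytes));  // give the input defined contents

    irreguler<<<grid, block>>>(A, C, N);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaFree(A));
    CHECK(cudaFree(C));
    return 0;
}
```

Assuming the file is saved as irregular.cu (a hypothetical name), it could be built with `nvcc -o irregular irregular.cu` and profiled with `nvprof --metrics dram_read_transactions,dram_write_transactions ./irregular`. Because N is an exact multiple of the assumed block size, the launch creates exactly 2621440 threads and the kernel needs no bounds check.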