Hi, I wrote a CUDA kernel that accesses device memory, and I run it in a loop, incrementing each value in the process.
__global__ void dram_load(float *a)
for (int i=0;i<1200;++i)
When I profile it, I get the expected numbers of L2 reads and writes, but the device memory reads are completely off. The profiler shows somewhere between 0 and 50 reads every time I rerun the analysis, and the write counts are double the expected amount.
The results of one such analysis are:
Could anyone explain why this is happening? Why are there no reads from device memory? I understand that there would be none once the data is in cache, but initially the data has to be fetched, right? And why are the write counts doubled as well?
Any insight would be appreciated.