Here is the situation, described as accurately as I can put it into words.
I have a structure definition, let's call it A. sizeof(A) = 60 bytes.
Inside the structure definition I have member functions which use the members of the struct plus some parameters passed to the function.
I have an array of A in shared memory.
I pass this array to a device function which has the signature __device__ myfunc(…, A* shA)
Inside said device function I will loop over the elements of shA and call a member function of the struct on every element.
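To make this concrete, here is a minimal sketch of the setup, not my actual code. The float array `v`, the member function `eval`, the constants `N_MEMBERS` and `COUNT`, and the kernel named `kernel` are all made up for illustration; only `A`, `myfunc`, `shA` and the 60-byte size come from the description above. The `#ifndef __CUDACC__` guard is only there so the struct and the loop can also be compiled and checked with a plain host C++ compiler.

```cpp
// Allow the struct and myfunc to build with a plain C++ compiler as well as nvcc.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

const int N_MEMBERS = 15;  // hypothetical: 15 floats * 4 bytes = 60 bytes
const int COUNT = 32;      // hypothetical number of elements in the shared array

struct A {
    float v[N_MEMBERS];

    // Member function that uses the struct's members plus a passed-in parameter.
    __host__ __device__ float eval(float x) const {
        float s = 0.0f;
        for (int i = 0; i < N_MEMBERS; ++i) s += v[i] * x;
        return s;
    }
};

// Loops over the elements of shA and calls a member function on each one.
__host__ __device__ float myfunc(float x, A* shA) {
    float acc = 0.0f;
    for (int i = 0; i < COUNT; ++i) acc += shA[i].eval(x);
    return acc;
}

#ifdef __CUDACC__
__global__ void kernel(const A* gA, float* out) {
    __shared__ A shA[COUNT];  // COUNT * sizeof(A) bytes of shared memory

    // Cooperative copy from global to shared memory.
    for (int i = threadIdx.x; i < COUNT; i += blockDim.x) shA[i] = gA[i];
    __syncthreads();

    // I would expect only shared memory to be read from here on.
    out[blockIdx.x * blockDim.x + threadIdx.x] = myfunc(1.0f, shA);
}
#endif
```

In this sketch the only global memory reads I would expect are the ones in the copy loop; everything inside `myfunc` works on `shA`.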
Now here is the problem. An insane number of L2 read misses and DRAM reads are coming out of that device function, which should only be reading shared memory. The way I isolated the source of the DRAM reads was by calling the device function twice: when I do that, the number of DRAM reads and L2 read misses doubles.
Does anyone have any idea why L2 would be involved at all? I had two hypotheses. Hypothesis 1 is that the struct is too big and the compiler doesn't like putting it in shared memory. However, the profiler does tell me that I am using count*sizeof(A) bytes of shared memory.
Hypothesis 2 is that using member functions of a structure somehow copies everything to global memory, for some reason.
Now, these are completely groundless hypotheses, so if anyone has any idea where all these DRAM reads are coming from, I'm all ears!
Fine print: Windows 7 64-bit, CUDA 3.2 32-bit, GTX 480