Where are these DRAM reads coming from?

So here is the situation, as accurately as I can describe it in words (hopefully).

I have a structure definition, let's call it A. sizeof(A) = 60 bytes.
Inside the structure definition I have function definitions which use the members of the struct plus some parameters passed to the function.

I have an array of A in shared memory.

I pass this array to a device function with the signature __device__ myfunc(…, A* shA).

Inside said device function I will loop over the elements of shA and call a member function of the struct on every element.

forall
shA[i].doSomething(extraparameters)
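
For anyone who wants to reproduce this, here is a minimal sketch of the setup described above. It is not the actual code: the struct layout (15 floats, to get 60 bytes), the body of doSomething, the extra parameters of myfunc, and the names COUNT, N_FLOATS and kernel are all assumptions made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

const int N_FLOATS = 15;  // assumption: 15 floats * 4 bytes = 60 bytes
const int COUNT = 32;     // assumption: number of elements in shared memory

// Hypothetical stand-in for the 60-byte struct A.
struct A {
    float v[N_FLOATS];

    // Hypothetical member function: uses the members plus one parameter.
    __host__ __device__ float doSomething(float scale) const {
        float sum = 0.0f;
        for (int k = 0; k < N_FLOATS; ++k) sum += v[k] * scale;
        return sum;
    }
};

// Device function that receives a pointer to the shared-memory array and
// calls the member function on every element.
__device__ float myfunc(float scale, int count, A* shA) {
    float acc = 0.0f;
    for (int i = 0; i < count; ++i) acc += shA[i].doSomething(scale);
    return acc;
}

// Launch with exactly COUNT threads in one block: thread i fills element i.
__global__ void kernel(float* out) {
    __shared__ A shA[COUNT];
    for (int k = 0; k < N_FLOATS; ++k) shA[threadIdx.x].v[k] = 1.0f;
    __syncthreads();
    out[threadIdx.x] = myfunc(2.0f, COUNT, shA);
}

int main() {
    float* dOut = 0;
    cudaMalloc((void**)&dOut, COUNT * sizeof(float));
    kernel<<<1, COUNT>>>(dOut);

    float hOut[COUNT];
    cudaError_t err = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("sizeof(A) = %d, out[0] = %f\n", (int)sizeof(A), hOut[0]);
    return 0;
}
```

Calling myfunc twice inside kernel and comparing profiler counters between the two builds would mirror the isolation test described in this post.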

Now here is the problem. An insane number of L2 read misses and DRAM reads are coming out of that device function, which should only be touching shared memory. I isolated the source of the DRAM reads by calling the device function twice: when I do, the number of DRAM reads and L2 read misses doubles.

Does anyone have any idea why L2 would be involved at all? I had two hypotheses. Hypothesis 1 is that the struct is too big and the compiler doesn't like putting it in shared memory. However, the profiler does tell me that I have count*sizeof(A) bytes of shared memory use.
Hypothesis 2 is that using member functions of a structure somehow copies everything to global memory, for some reason.

Now, these are completely groundless hypotheses, so if anyone has any idea where all these DRAM reads are coming from, I'm all ears!

Fine print: Windows 7 64-bit, CUDA 3.2 32-bit, GTX 480

Are you sure the function calls get inlined? Otherwise you might have stack accesses, and the stack lives in local memory, which goes through L1 and L2. Note that stack accesses are not necessarily abysmal for performance, as they get cached.
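
One way to check this is to compare a version that is forced to be a real call against one that is forced inline. The sketch below is only an illustration: the helper names scaledSumCall and scaledSumInline, the kernel names, and the sizes are made up, and it assumes a compute capability 2.0 or newer device and a toolkit that supports both __noinline__ and __forceinline__. Whether the non-inlined version actually touches local memory depends on what the compiler does with it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

const int N = 64;  // assumption: one block of 64 threads

// Hypothetical helper, deliberately NOT inlined: this becomes a real call,
// and any stack it needs is placed in local memory.
__device__ __noinline__ float scaledSumCall(const float* v, int n, float s) {
    float sum = 0.0f;
    for (int k = 0; k < n; ++k) sum += v[k] * s;
    return sum;
}

// Same body, but the compiler is asked to always inline it.
__device__ __forceinline__ float scaledSumInline(const float* v, int n, float s) {
    float sum = 0.0f;
    for (int k = 0; k < n; ++k) sum += v[k] * s;
    return sum;
}

__global__ void withCall(float* out) {
    __shared__ float sh[N];
    sh[threadIdx.x] = 1.0f;
    __syncthreads();
    out[threadIdx.x] = scaledSumCall(sh, N, 2.0f);
}

__global__ void withInline(float* out) {
    __shared__ float sh[N];
    sh[threadIdx.x] = 1.0f;
    __syncthreads();
    out[threadIdx.x] = scaledSumInline(sh, N, 2.0f);
}

int main() {
    float* dOut = 0;
    float a[N], b[N];
    cudaMalloc((void**)&dOut, N * sizeof(float));

    withCall<<<1, N>>>(dOut);
    cudaMemcpy(a, dOut, sizeof(a), cudaMemcpyDeviceToHost);

    withInline<<<1, N>>>(dOut);
    cudaError_t err = cudaMemcpy(b, dOut, sizeof(b), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // Both kernels compute the same value; only the generated code differs.
    printf("call: %f  inline: %f\n", a[0], b[0]);
    return 0;
}
```

Building with nvcc and the option -Xptxas -v makes ptxas print per-kernel resource usage, including local memory, so the two kernels can be compared directly and then profiled separately.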

I see no function calls in the .ptx, either to myfunc() or to doSomething().

I have, however, added __forceinline__ qualifiers to all related device functions, and it did not change anything. But thanks for the suggestion!

Edit: to add to the fun, between the single and the double call of the function, the number of “L2 read requests” stays roughly the same, while the number of L2 read misses doubles.