Large local memory overhead but no spills or local memory instructions

I have a complex kernel written in PTX that shows a lot of local memory transactions when profiled. These seem to grow in proportion to the amount of work the kernel does. It is a long-running persistent kernel with a processing loop, and each iteration adds about 10,000 local memory transactions. The cache hit rate for these is reported at 50%. The kernel makes use of the device runtime.

What puzzles me is that ptxas reports no spills, stack, or local memory use at all. The register count limit is not reached either (I set it to 255 for trials, but the kernel uses 128 at most). I can't find where this local memory access is happening in the disassembly view in nvvp either.

Is it possible that the profiler is wrong and there is no such heavy local memory traffic? I'm at my wit's end here, having spent the whole day just trying to figure this out.

There is also no local memory access in the cuobjdump output.

In case anyone's wondering, the culprit was a CUDA device runtime API call.
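
For anyone who wants to try reproducing this, here is a minimal sketch of the pattern, not my actual kernel. It assumes the device runtime call in question is a child kernel launch, and the names child, persistent, and launch_persistent are made up for illustration. My guess is that the local memory traffic comes from inside the device runtime library code rather than from the kernel's own instructions, which would explain why ptxas and the disassembly of the kernel show nothing, but I have not confirmed that.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical child kernel: writes the iteration index so the launch has a visible effect.
__global__ void child(int *out, int iter) {
    out[threadIdx.x] = iter;
}

// Hypothetical stand-in for a persistent kernel with a processing loop.
// Each iteration makes one device runtime call (a child launch).
__global__ void persistent(int *out, int iterations) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        for (int i = 0; i < iterations; ++i) {
            child<<<1, 32>>>(out, i);  // device-side launch goes through the device runtime
        }
    }
}

// Host helper: runs the parent kernel and returns the final status.
cudaError_t launch_persistent(int iterations) {
    int *out = nullptr;
    cudaError_t err = cudaMalloc(&out, 32 * sizeof(int));
    if (err != cudaSuccess) return err;

    persistent<<<1, 32>>>(out, iterations);
    err = cudaDeviceSynchronize();  // also waits for all child launches to finish

    cudaFree(out);
    return err;
}

int main() {
    cudaError_t err = launch_persistent(100);
    printf("status: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

This needs relocatable device code and the device runtime library, e.g. nvcc -rdc=true -arch=sm_70 demo.cu -lcudadevrt (the architecture is just an example). Profiling it with and without the child launch in the loop should show whether the local memory transactions follow the device runtime call.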