I’ve analyzed a CUDA kernel via Nsight and found something strange:
I do not know where these loads come from.
My kernel only reads from two texture references:
texture<uint4, cudaTextureType3D, cudaReadModeElementType> tex1;
tex3D(tex1, (float)i, (float)index.y, (float)index.x);
texture<float, cudaTextureType3D, cudaReadModeElementType> tex2;
tex3D(tex2, cellCoord.x, cellCoord.y, cellCoord.z);
Is it possible that the uint4 reads don’t go through the texture unit?
I have no other global memory references in the kernel that it could read from.
The item circled in red with the question mark is local memory, not global memory (although it appears to me there is some global traffic as well). Are you using any local memory? (i, index.y, index.x, cellCoord.x, cellCoord.y, and cellCoord.z could all be local memory references.) Local variables that do not fit in registers may get stored in L1/L2/devicemem. Another possibility is register spilling.
Thank you for your response.
The way I read this graph is that [local] <- [L1 cache] <- [L2 cache] <- [Device memory] represents reads from device memory loading into registers (local memory). I thought register spillage goes through [Global RO].
The kernel shows 65 registers used on Kepler GK110, which should be fine, no?
Those blocks in the diagram immediately to the right of the “kernel” block refer to memory spaces and/or transaction types. You’ll note that there is no “register” block. Registers are not a memory space.
If the kernel reads from global memory space, it will show up as a transaction flowing through the “Global” box. If the kernel reads from the local memory space, it will show up as a transaction flowing through the “local” box, which is connected to the link you have circled. Both Global and Local traffic can flow through L1/L2/Devicemem.
You haven’t really answered my question about local memory usage, except to indicate what the register usage is, which does not answer the question.
One possible example of local memory usage could be something like this:
in kernel code. Such a construct is in the “local” memory space, but will not get stored in registers, nor have a direct impact on register usage. It will be stored in device memory, and accesses to it will flow through L1/L2 as appropriate, and they are distinct from “global” memory transactions.
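To make the local-array case above concrete, here is a minimal sketch. It is an illustration under assumptions, not code from this thread: the kernel name `local_array_demo`, the array name `scratch`, the size `SCRATCH_SIZE`, and the host helpers are all hypothetical. The idea is that a per-thread array indexed with a value known only at run time generally cannot be kept in registers, so the compiler tends to place it in the local memory space. Whether it actually does depends on the compiler version and optimization level.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define SCRATCH_SIZE 64  // hypothetical size, a power of two so we can mask

// Host reference for what the kernel should write for a given index.
float expected_value(int i) { return 0.5f * (i & (SCRATCH_SIZE - 1)); }

// Hypothetical kernel: 'scratch' is a per-thread array that is read with a
// runtime index, which usually forces it into the local memory space.
__global__ void local_array_demo(const int *idx, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float scratch[SCRATCH_SIZE];            // candidate for local memory
    for (int k = 0; k < SCRATCH_SIZE; ++k)
        scratch[k] = 0.5f * k;

    out[tid] = scratch[idx[tid] & (SCRATCH_SIZE - 1)];  // dynamic index
}

// Runs the kernel on n elements and returns the number of mismatches
// against the host reference, or -1 if a CUDA call fails.
int run_local_array_demo(int n)
{
    std::vector<int> h_idx(n);
    for (int i = 0; i < n; ++i) h_idx[i] = (7 * i) % 1000;

    int *d_idx = nullptr;
    float *d_out = nullptr;
    if (cudaMalloc(&d_idx, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_idx);
        return -1;
    }
    cudaMemcpy(d_idx, h_idx.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    local_array_demo<<<(n + 127) / 128, 128>>>(d_idx, d_out, n);

    std::vector<float> h_out(n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_idx);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != expected_value(h_idx[i])) ++mismatches;
    return mismatches;
}
```

Calling `run_local_array_demo(1024)` from a `main` and profiling the result should show traffic through the "local" box if the array was placed in local memory. The accesses to `idx` and `out` are ordinary global transactions, so the two kinds of traffic can be compared side by side.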
This presentation may be of interest:
It may be that register spilling on GK110 flows through Global RO instead of ordinary LMEM space, but if that’s the case I wasn’t aware of it.
The -Xptxas=-v option will give you information about local memory usage as well as register spilling.
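As a complement to the compile-time ptxas report, similar numbers can be queried at run time with `cudaFuncGetAttributes`. The sketch below is only an illustration: `my_kernel` is a hypothetical stand-in for the real kernel, and `uses_local_memory` and `report_kernel_resources` are helper names made up here. As far as I know, `localSizeBytes` covers per-thread local memory in general, so both explicit local arrays and register spills would show up in it, but treat that as an assumption to verify.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the one being profiled.
__global__ void my_kernel(float *out)
{
    out[threadIdx.x] = 1.0f;
}

// True if the compiled kernel reserves any per-thread local memory.
bool uses_local_memory(const cudaFuncAttributes &attr)
{
    return attr.localSizeBytes > 0;
}

// Queries and prints the per-thread resource usage of my_kernel.
// Returns the CUDA error code from the query.
cudaError_t report_kernel_resources()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, my_kernel);
    if (err != cudaSuccess) return err;

    printf("registers per thread  : %d\n", attr.numRegs);
    printf("local bytes per thread: %zu\n", attr.localSizeBytes);
    printf("uses local memory     : %s\n",
           uses_local_memory(attr) ? "yes" : "no");
    return cudaSuccess;
}
```

Running this once for a normal build and once for a build with -G makes it easy to see whether the debug build is what introduces the local memory usage.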
No local memory used. But I see lots of spillage in that output.
ptxas info : 11 bytes gmem, 24 bytes cmem
It must be spillage, but the volume is unbelievably high. I’ll have to do a quick calculation to reality check that.
Okay, I got it. The spillage comes from compiling with “-G” (device debug info). Got Heisenberg’ed.