Interpreting output from cuobjdump --dump-resource-usage

I’m chasing down a weird bug in my CUDA Fortran code that triggers an out-of-bounds memory access in places that access local memory that should be in registers. While hunting this down I’ve been using cuobjdump --dump-resource-usage to look for overused registers, but the output has a lot of other information and I’m not sure what is reasonable or not for those other values.

The output from cuobjdump --dump-resource-usage is below. The shared memory usage is what I expect, but the register usage is lower than I would expect, and I have no idea how to interpret the rest. I’ve looked but can’t find any documentation on what the other values mean, what reasonable values are, etc. Any advice?


Fatbin elf code:
================
arch = sm_90
code version = [1,7]
host = linux
compile_size = 64bit
has debug info
compressed
identifier = ../gpu/interface_states.cuf

Resource usage:
 Common:
  GLOBAL:43
 Function gpu_interface_states_magnitude_squared_:
  REG:0 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function gpu_interface_states_compute_pressure_:
  REG:0 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function __cuda_sm20_div_rn_f64_full:
  REG:0 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function gpu_interface_states_compute_energy_:
  REG:0 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function gpu_interface_states_sound_speed_:
  REG:0 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function __cuda_sm20_dsqrt_rn_f64_mediumpath_v1:
  REG:0 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function gpu_interface_states_slope_minmod_:
  REG:0 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function gpu_interface_states_conserved_2_primitive_:
  REG:0 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function gpu_interface_states_primitive_2_conserved_:
  REG:0 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function gpu_interface_states_index_1dto3d_:
  REG:0 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function gpu_interface_states_subgrid_conserved_2_primitive_:
  REG:0 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function gpu_interface_states_trace_3d_:
  REG:0 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function gpu_interface_states_hll_flux_:
  REG:0 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function gpu_interface_states_hll_fluxes_:
  REG:0 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function gpu_interface_states_riemann_driver_:
  REG:0 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function gpu_interface_states_conservative_update_:
  REG:0 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function gpu_interface_states_compute_interface_states_:
  REG:64 STACK:1032 SHARED:40760 LOCAL:0 CONSTANT[0]:568 TEXTURE:0 SURFACE:0 SAMPLER:0

Fatbin ptx code:
================
arch = sm_90
code version = [8,4]
host = linux
compile_size = 64bit
has debug info
compressed
identifier = ../gpu/interface_states.cuf
ptxasOptions =  -g --dont-merge-basicblocks --return-at-end

Generally speaking, the different resources correspond to memory spaces. In many cases there are clearer definitions in the PTX manual.

GLOBAL is global space memory.
REG is register usage.
STACK is stack usage, a kind of local memory.
SHARED is shared memory usage.
LOCAL is local memory usage.
CONSTANT is constant memory (__constant__, etc.).
TEXTURE is texture memory (old-style texture reference/bound texture).
SURFACE is surface memory, similar to texture memory.
SAMPLER is texture sampler usage, associated with textures.

It’s quite reasonable for many of these to be zero.

Not using textures in CUDA C++? Texture resource usage will probably be zero.

Not using surfaces in CUDA C++? Surface resource usage will probably be zero.

I haven’t tried to pin down exactly what GLOBAL resource usage refers to, but it would probably include, for example, __device__ variable declarations in CUDA C++.
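For instance (a minimal CUDA C++ sketch; the names here are hypothetical), a module-scope __device__ variable is statically allocated in the global space:

// Hypothetical sketch: a module-scope __device__ variable is statically
// allocated in the global memory space, so it should be counted in the
// GLOBAL number under "Common".
__device__ double gain;   // 8 bytes of statically allocated global memory

__global__ void scale(double *out, const double *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = gain * in[i];
}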

SHARED will correspond to __shared__ usage in CUDA C++.
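For example (again a hypothetical CUDA C++ sketch), a static __shared__ declaration shows up in the declaring kernel’s SHARED number:

// Hypothetical sketch: 256 floats of statically declared shared memory,
// which should be reported as SHARED:1024 for this kernel.
__global__ void block_sum(const float *in, float *out)
{
    __shared__ float tile[256];          // assumes 256 threads per block
    tile[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int i = 0; i < 256; ++i)
            s += tile[i];
        out[blockIdx.x] = s;
    }
}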

Registers don’t explicitly appear in CUDA C++ for the most part, so this number tracks something that happens at the back end of the compilation process. Truly zero register usage is not plausible, but many of the outputs you have should be interpreted as “what is incrementally being used by this entry point/function”. In that context, zero register usage just means “no additional register usage beyond what was already in use at function entry”.
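You can’t write to registers directly in CUDA C++, but you can influence the budget, for example with __launch_bounds__ (a hedged sketch, not from your code; compiling with nvcc -Xptxas -v also prints per-kernel register counts):

// Hypothetical sketch: __launch_bounds__ gives ptxas a register budget to
// honor; "nvcc -Xptxas -v" prints the resulting per-kernel REG count.
__global__ void __launch_bounds__(256)
saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}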

Stack usage is also typically not evident/fully explicit at the C++ level. You could make inferences of course, but it is mostly a function of the compiler. Zero stack usage might be reasonable for code that is fully inlined.
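In CUDA C++ you can see the effect of inlining directly, e.g. (hypothetical sketch) by comparing a forcibly inlined callee with a __noinline__ one:

// Hypothetical sketch: the inlined callee needs no call frame, while the
// __noinline__ callee goes through the ABI and may contribute to STACK.
__device__ __forceinline__ double sq_inlined(double x) { return x * x; }
__device__ __noinline__ double sq_called(double x) { return x * x; }

__global__ void demo(double *out, const double *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sq_inlined(in[i]) + sq_called(in[i]);
}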

Local usage will typically involve thread-local (automatic) variables, e.g.:

int a;         // scalar automatic variable
float b[4];    // small per-thread array; may live in registers or local memory

in your kernel thread code. It may also include local usage associated with register “spilling”, i.e. values that are sometimes in registers but sometimes not, and need a backing store. There are many forum questions on this particular topic.
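One common way to see a non-zero LOCAL number is dynamic indexing of a per-thread array, e.g. (hypothetical sketch):

// Hypothetical sketch: indexing a per-thread array with a value only known
// at run time usually prevents the compiler from keeping the array in
// registers, so it lands in local memory (a non-zero LOCAL number).
__global__ void dynamic_index(float *out, const int *idx, int n)
{
    float b[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = b[idx[i] & 3];   // runtime index defeats register promotion
}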

Thanks, that mostly answers my question.

I’m working in CUDA Fortran right now, and mostly I’m trying to track down an out-of-bounds memory access that appears to happen on a device function call itself. If I manually inline the function it works fine, but if I call it I get the access issue. Could this be related to the stack usage? What are reasonable values for the stack memory usage?

I’m not an expert on Fortran, but in C++ when a function calls itself we usually refer to that as recursion. Recursion has implications for stack usage: the compiler doesn’t (usually) know the recursion depth, so it does not know how much stack space to allocate for call frames. As a result, if you “recurse” too far, you will run out of stack space. That could manifest as an out-of-bounds memory access, because there is no inherent stack checking on calls that I am aware of.
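A minimal illustration (hypothetical CUDA C++ sketch, not your Fortran code):

// Hypothetical sketch: the recursion depth depends on runtime data, so the
// compiler cannot size the per-thread stack statically; deep inputs will
// eventually overflow whatever stack limit is in effect.
__device__ long long sum_to(int n)
{
    if (n <= 0)
        return 0;
    return n + sum_to(n - 1);   // each level consumes another call frame
}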

This would be the “usual” hazard with recursion. Diagnosing it could involve manually increasing the stack size and re-running your code under compute-sanitizer, increasing further until the problem disappears. If it worked out that way, it would tend to confirm the idea of a stack overflow due to recursion.

If you know the maximum recursion depth a priori, you can handle this situation deterministically with a bit of effort. If not, you have to judge how valuable the recursion is to your methodology and what risks you can afford.

I’m going to mention a few things now relevant to CUDA C++ and nvcc. If you are using CUDA Fortran, the compiler you are probably using is nvfortran, which has a separate forum where you could ask whether and how to do these things.

  • you can enable stack-pointer bounds checking, I believe. This evidently has some cost, but it is automatically enabled when device-debug code is generated (-G). This could be another diagnostic avenue.
  • you can increase the stack size, up to a limit. To increase the stack size you would use cudaDeviceSetLimit, and to query it, cudaDeviceGetLimit; a sketch of these calls follows this list. There are various ramifications of the stack size for GPU memory usage/requirements.
  • nvcc, at least, warns you about recursion with a message that “stack size cannot be statically determined”. I don’t know for sure about nvfortran.
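As a sketch of the second bullet (CUDA C++ runtime API; I believe CUDA Fortran’s cudafor module exposes equivalent interfaces, but check the nvfortran documentation):

// Hypothetical sketch (CUDA C++): query the per-thread stack size, then
// raise it. Every thread that can be resident reserves this much memory,
// so large values increase the GPU memory footprint.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t stack_bytes = 0;
    cudaDeviceGetLimit(&stack_bytes, cudaLimitStackSize);
    printf("current per-thread stack: %zu bytes\n", stack_bytes);

    cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, 2 * stack_bytes);
    if (err != cudaSuccess)
        printf("cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));
    return 0;
}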
