How can I determine what launch bounds a kernel was built with?

Hello again, 11:30PM here, so one more before last call, eh?

If I compile a kernel with __launch_bounds__(max_threads_per_block, min_blocks_per_sm), is there a way, at run time, to determine what those launch bounds actually were? The reason I ask is that I am using lots of pre-processor directives to specify the optimal configurations of various kernels for different architectures. I know that people are saying __launch_bounds__ is becoming more of a specialty thing, but this is, erm, my specialty. Besides, there are clear cases where the perf difference is just huge. The danger, which makes me understand why new programmers are being told DON'T LOOK IN THE BOX, is that if the launch parameters at run time exceed the launch bounds baked in at compile time, poof: your kernel does nothing, or perhaps the program crashes, who knows. I get that.
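To make the pre-processor approach concrete, the kind of thing I am doing looks roughly like this (the macro names, thread counts, and kernel are placeholders, not my real settings):

  // Choose launch bounds per architecture at compile time.  __CUDA_ARCH__ is
  // only defined during the device compilation passes; the host pass takes the
  // default branch, where __launch_bounds__ is ignored anyway.
  #if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 700)
  #  define NB_MAX_THREADS 256
  #  define NB_MIN_BLOCKS    4
  #else
  #  define NB_MAX_THREADS 512
  #  define NB_MIN_BLOCKS    2
  #endif

  __global__ void __launch_bounds__(NB_MAX_THREADS, NB_MIN_BLOCKS)
  placeholderNonbondedKernel(const float* crd, float* frc, int natom) {
    // ... kernel body elided ...
  }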

In order to guard against all of this, and to collect all of the launch information into a coherent plan, I am designing a C++ object to store each kernel’s selected launch parameters given various conditions at run time as well as different branches of the kernel itself. (Will it compute energy, forces on particles, both? Want single- or double-precision numbers? What about the way it will accumulate results?) All of these details can have minor effects on the register pressure and thus the ideal number of threads to use, and of course the GPU itself has a lot to do with it. I seek to set the launch bounds with pre-processor directives, but then have details of the GPU and other conditions (MiG active? Size and configuration of the workload?) handled at run time to further tune the launch parameters. To guard against the aforementioned rookie mistakes, I would like to have methods in the C++ object to check that these launch parameters fit with the launch bounds.
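The check itself would be something simple. A minimal sketch of the idea (KernelLaunchRecord and its members are illustrative names, not the real class):

  #include <cstddef>

  // Hypothetical record of one kernel variant's compiled limits, filled in at
  // run time, plus a sanity check applied before every launch.
  struct KernelLaunchRecord {
    int max_threads_per_block;    // the __launch_bounds__ thread limit
    int regs_per_thread;          // registers per thread, if obtainable
    size_t static_shared_bytes;   // statically declared __shared__ usage

    bool fitsLaunch(int block_size, size_t dynamic_shared_bytes,
                    size_t shared_per_block_limit) const {
      return block_size <= max_threads_per_block &&
             static_shared_bytes + dynamic_shared_bytes <= shared_per_block_limit;
    }
  };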

I know that there are methods for setting things like __shared__ memory configuration for specific kernels at run time. I hope there is a way to query the launch bounds configuration of that kernel, too?

I believe cudaFuncGetAttributes, specifically the maxThreadsPerBlock member of the cudaFuncAttributes struct it fills in, should do it.
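A minimal sketch of the query, assuming a kernel compiled with explicit launch bounds (the kernel itself is just a stand-in):

  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void __launch_bounds__(256, 4) boundedKernel(float* out) {
    out[threadIdx.x] = 0.0f;  // trivial body, just so there is code to compile
  }

  int main() {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, boundedKernel) == cudaSuccess) {
      // For a kernel built with __launch_bounds__(256, 4) this should print 256.
      printf("maxThreadsPerBlock = %d\n", attr.maxThreadsPerBlock);
    }
    return 0;
  }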

What a wealth of information. I will continue to compile with --ptxas-options="-v", but the ability to pull out the number of registers per thread at run time could be a very nice asset. This and related attributes, handled through my GPU launch controls object, would give me a way to offer users a CLI method for printing critical information about their kernels on whatever architecture they have; I wouldn't need their specific GPU in hand to at least get some indication of what their compilation looks like.
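Roughly what I have in mind for that CLI printout (a sketch only; printKernelReport and the formatting are illustrative, not the actual code in my launch controls object):

  #include <cstdio>
  #include <cuda_runtime.h>

  // Hypothetical helper: dump one kernel's attributes so users can report what
  // their local compilation produced on whatever GPU they have.
  void printKernelReport(const char* name, const void* kernel_symbol) {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, kernel_symbol) != cudaSuccess) {
      printf("%-32s : attribute query failed\n", name);
      return;
    }
    printf("%-32s : %3d regs / thread, %6zu bytes static __shared__, "
           "%4d max threads / block\n",
           name, attr.numRegs, attr.sharedSizeBytes, attr.maxThreadsPerBlock);
  }

The idea is to call it once per kernel variant, passing the kernel symbol cast to const void*.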

Now where’s the :thanks: emoji?

Reviving this discussion, as I am not seeing the expected results from cudaFuncGetAttributes.

When I compile with --ptxas-options="-v" I am seeing 55, 56, or 61 registers per thread in single-precision variants of my kernels and upwards of 100 registers per thread in the double-precision variants. When I query those properties at run time, I often find register usage of 28 to 32 for the single-precision variants, and consistently just 16 for the double-precision variants. The single-precision non-bonded kernel I wrote fares a little better under this analysis: it compiles with 40 or 48 registers according to the --ptxas-options="-v" output and returns accurate results when queried with cudaFuncGetAttributes. But again, the double-precision variants of even that function report much lower register usage (again, 26 to 32) when they should be in the 70s or 80s.

Any idea what I might be doing wrong? I am calling cudaFuncGetAttributes like this:

  cudaFuncAttributes attr;
  if (cudaFuncGetAttributes(&attr, ktgfNonbondedForceEnergy) != cudaSuccess) {
    rtErr("Error obtaining attributes for kernel kfsValenceForceAccumulation.",
          "queryValenceKernelRequirements");
  }
  wisdom->catalogNonbondedKernel(PrecisionModel::SINGLE, NbwuKind::TILE_GROUPS, EvaluateForce::YES,
                                 EvaluateEnergy::YES, ForceAccumulationMethod::WHOLE,
                                 attr.maxThreadsPerBlock, attr.numRegs, attr.sharedSizeBytes);

I see no runtime exceptions thrown by these calls; rtErr is there to create such things in the event of anything I flag as a problem. The second call there is one of my own objects for cataloging all this stuff, but unless cudaFuncAttributes.numRegs is not the field that stores the register count (and why would it agree with --ptxas some of the time otherwise?), I'm lost here.

Cheers!

Just a complete guess here. It may have something to do with relocatable device code and device code linking. You haven't indicated whether that is involved, nor have you given me an example I can work with. The five minutes I spent trying to quickly put together an example didn't pan out.

And I’m assuming this inquiry is completely separate from the maxThreadsPerBlock discussion.

A good thought, but I am not using --rdc in my compilation (at least, I hope relocatable device code is OFF unless one asks for it). The question is related to the maxThreadsPerBlock discussion. Whereas I can infer from maxThreadsPerBlock that the register usage is less than or equal to 64K / maxThreadsPerBlock, I would have liked to know what the actual register usage was. If I compiled the kernel for maxThreadsPerBlock > 768, I can be reasonably assured that the compilation has only budgeted for a single block per multiprocessor, but for blocks of, say, 256 threads, the number of registers is a valuable thing to know so that I understand how many such blocks can fit on a single SM. The __shared__ usage is accurate, from what I can tell.
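In the meantime, the occupancy API at least gives me the end result of that arithmetic, even if the register count itself looks fishy. A sketch, with a stand-in kernel in place of my real variants:

  #include <cstdio>
  #include <cuda_runtime.h>

  // Stand-in kernel; imagine one of the real non-bonded variants here.
  __global__ void __launch_bounds__(256) nbKernelStandIn(float* out) {
    out[threadIdx.x] = 0.0f;
  }

  int main() {
    // Ask the runtime how many 256-thread blocks fit on one SM.  This folds in
    // both the register count and the __shared__ footprint without exposing
    // either number directly.
    int blocks_per_sm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                                      nbKernelStandIn, 256,
                                                      0) == cudaSuccess) {
      printf("256-thread blocks per SM: %d\n", blocks_per_sm);
    }
    return 0;
  }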

If you use --rdc you are presumably also using link-time optimization? Register counts presumably are not finalized until after the link step in this case.

Have you tried dumping all the ELF sections of the binary (e.g. cuobjdump --dump-elf) to see whether the register count of kernels is stored in some vendor-specific section? The information on per-kernel register usage is presumably stored somewhere in order to configure the hardware on kernel launch …

The fact that the original __launch_bounds__() data is no longer available in the compiler backend makes sense to me, though: as far as the compiler is concerned, that information can be discarded after it has been used to control the compilation process.

But this is precisely the opposite of what I was saying: I DO get the launch bounds data out of cudaFuncAttributes correctly, and I am NOT using --rdc, so far as I know. It's the funny register counts that have me scratching my head.

Sorry, I got confused. I guess my brain is fried after working on a partial translation of a German publication from 1770 for most of the afternoon. I promise to work on my reading comprehension :-)


I have my head deep in a mess of kernel launch parameters. This object I've made to track all my kernels is the way to go, I think, but it's forcing me to re-think everything up to this point about how I decide on the block and thread counts, plus how I design objects to "get ready" for the calculations. I think I've got all the fires under control, however, and with a little more experience I'll be able to encapsulate more and make the code look cleaner.

But when you've got many variants of one kernel, and its thread blocks take one work unit at a time with the goal of incrementing a counter in GMEM to reach the finish line (asynchronous work unit scheduling), you have to be very careful about how you set the counters initially. The kernel management object gets fed into the initialization now, and the whole principle of collecting all the nitty-gritty details of different cards and kernel variants into one place should start paying off. The __launch_bounds__ business that we talked about a few days ago is just one facet of it. Thanks for trying to help, though! I'll post again if I find anything new.
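For the record, the scheduling pattern I am describing is roughly the following (a bare-bones sketch, not one of the actual kernels; the real ones carry far more state, and the counter handling is exactly what has to be set up correctly before each launch):

  // Each block grabs work units off a global queue by bumping a counter in
  // GMEM.  The counter must be reset (e.g. with cudaMemset) before every
  // launch, which is what makes the initialization step so delicate across
  // kernel variants.
  __global__ void workUnitLoop(int* gmem_counter, int n_work_units) {
    __shared__ int work_idx;
    while (true) {
      if (threadIdx.x == 0) {
        work_idx = atomicAdd(gmem_counter, 1);
      }
      __syncthreads();
      if (work_idx >= n_work_units) {
        break;
      }
      // ... process work unit work_idx ...
      __syncthreads();
    }
  }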