Performance-Tuning PTX Directives in CUDA

Is there a nice trick to specify performance-tuning PTX directives such as .reqntid manually from within CUDA kernel code? In OpenCL this is possible using __attribute__((reqd_work_group_size(x, y, z))).

A workaround exists with NVRTC: one can use the CU_JIT_THREADS_PER_BLOCK option. However, this applies to all kernels in the same source file, and I need to set it individually for each kernel. It also covers only the .reqntid directive.

Inline assembly will not work because of scope: these directives belong to the kernel entry itself, not to statements inside the function body.

Use __launch_bounds__.
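A minimal sketch of what that looks like per kernel, in case it helps. The kernel name scale_kernel, the helper run_scale, and the values 256 and 2 are made up for illustration; only the __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) qualifier itself is the CUDA feature being shown.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Illustrative values only; pick them per kernel.
constexpr int kThreadsPerBlock = 256;  // upper bound on threads per block
constexpr int kMinBlocksPerSM  = 2;    // desired minimum resident blocks per SM

// The qualifier sits between the return type and the kernel name,
// so every kernel in the file can carry its own bounds.
__global__ void __launch_bounds__(kThreadsPerBlock, kMinBlocksPerSM)
scale_kernel(const float* in, float* out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * factor;
    }
}

// Hypothetical host helper: runs the kernel and checks the result.
bool run_scale(int n, float factor)
{
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float* d_in = nullptr;
    float* d_out = nullptr;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }

    bool ok = cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        // The block size at launch must not exceed the declared bound.
        const int blocks = (n + kThreadsPerBlock - 1) / kThreadsPerBlock;
        scale_kernel<<<blocks, kThreadsPerBlock>>>(d_in, d_out, factor, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaDeviceSynchronize() == cudaSuccess &&
             cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    if (!ok) return false;

    for (int i = 0; i < n; ++i) {
        if (h_out[i] != h_in[i] * factor) return false;
    }
    return true;
}

int main()
{
    std::printf("%s\n", run_scale(1000, 2.0f) ? "ok" : "failed");
    return 0;
}
```

Compiling with nvcc -ptx and looking at the .entry for scale_kernel should show the bounds as .maxntid and .minnctapersm directives. Note that a launch with more threads per block than the declared maximum fails rather than running slowly.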

Many thanks for the trick! Using __launch_bounds__ compiles to .maxntid/.minnctapersm instead of .reqntid, but the effect is the same.

A big advantage of __launch_bounds__() over setting an explicit register limit is that it depends less on the properties of a particular architecture, so your code will scale better to future GPU generations.