Is there a way to specify performance-tuning PTX directives such as .reqntid manually, per kernel, from within CUDA kernel code? In OpenCL this is possible with __attribute__((reqd_work_group_size(X, Y, Z))).
A workaround exists when compiling at run time with NVRTC: one can pass the CU_JIT_THREADS_PER_BLOCK option. However, this applies to all kernels in the same source code, and I need to set the value individually for each kernel. It also covers only the .reqntid directive.
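Inline assembly will not work either, because of scope: asm() statements are emitted inside the function body, whereas these directives belong to the kernel's .entry declaration.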