cudaFuncAttributes.maxThreadsPerBlock cudaDeviceProp.maxThreadsPerBlock mismatch problem

It would have been beneficial if you had mentioned that from the start. The name of the attribute is self-explanatory, I would claim: you are asking the compiler to impose a bound (i.e. a hard limit), which it duly did.
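To see the mismatch concretely, here is a minimal sketch (the kernel name, body, and the bound of 256 are placeholders) that queries the device-wide limit and the per-kernel limit side by side:

```cuda
#include <cstdio>

// Hypothetical kernel compiled with a launch bound of 256 threads per block.
__global__ void __launch_bounds__(256) bounded_kernel(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}

int main()
{
    // Device-wide limit (typically 1024 on current hardware).
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Per-kernel limit, which reflects the __launch_bounds__ annotation.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, bounded_kernel);

    // These are two different numbers; for launching this kernel,
    // the per-kernel one is the binding one.
    printf("device maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("kernel maxThreadsPerBlock: %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```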

The approach of squeezing the compiler on register usage in order to increase occupancy was sometimes necessary on older GPUs, say, pre-Kepler. Can you actually demonstrate performance gains from using it? I have not found it useful in almost a decade, and as a consequence have not used either -maxrregcount or __launch_bounds__.
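One way to check whether register pressure is actually limiting occupancy, before reaching for -maxrregcount at all, is to query the compiled register count and the predicted occupancy at runtime. A sketch, assuming a placeholder kernel and an arbitrary block size of 256:

```cuda
#include <cstdio>

// Placeholder kernel; in practice you would query your real kernel.
__global__ void my_kernel(float *p)
{
    p[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

int main()
{
    // Registers per thread, as the compiler actually allocated them.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, my_kernel);

    // Predicted resident blocks per SM at the chosen block size.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel,
                                                  256, /*dynamicSmem=*/0);

    printf("registers per thread: %d\n", attr.numRegs);
    printf("active blocks per SM at 256 threads: %d\n", blocksPerSM);
    return 0;
}
```

If the occupancy number stays the same with and without the register cap, the cap is costing you spills for nothing.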

You can do all that with __launch_bounds__.

Have you read the documentation? It explains how to accomplish what you are asking about. The suggestion there is not a per-kernel tweak but a per-architecture tweak. You can do that with __launch_bounds__. It also points out the following:

A kernel will fail to launch if it is executed with more threads per block than its launch bound maxThreadsPerBlock.

I don’t remember every line of the documentation exactly, but now, after being bitten by this, I can clearly see that line.
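For anyone else who gets bitten: the failure is easy to reproduce and to check for at runtime. A sketch (the kernel and the 256/512 sizes are arbitrary; the exact error code is not asserted here):

```cuda
#include <cstdio>

// Bounded to 256 threads per block for illustration.
__global__ void __launch_bounds__(256) bounded_kernel(float *out)
{
    out[threadIdx.x] = 0.0f;
}

int main()
{
    float *d = nullptr;
    cudaMalloc(&d, 512 * sizeof(float));

    // 512 > 256: this launch violates the kernel's launch bound and fails.
    bounded_kernel<<<1, 512>>>(d);
    cudaError_t err = cudaGetLastError();
    printf("512-thread launch: %s\n", cudaGetErrorString(err));

    // 256 threads is within the bound and launches normally.
    bounded_kernel<<<1, 256>>>(d);
    err = cudaGetLastError();
    printf("256-thread launch: %s\n", cudaGetErrorString(err));

    cudaFree(d);
    return 0;
}
```

Checking cudaGetLastError() after each launch is what surfaces the problem; the failed launch is otherwise silent.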

I think I had a wrong assumption about what __launch_bounds__ actually does. At some point, with the arrival of new hardware, I started getting warnings that certain __launch_bounds__ configurations are not achievable (for example, __launch_bounds__(512, 4) produces such a warning on sm_86). In addition, I wrongly assumed the compiler doesn’t know the maximum possible number of resident warps; I thought it was a parameter that might change from board to board (between an RTX 3060 and an RTX 3090, for example). That led me down the ‘maxrregcount’ way of thinking.
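For the record, the per-architecture form I had missed looks roughly like this. Since the launch-bounds values are compile-time constants, they can be keyed off __CUDA_ARCH__; the macro name and the specific numbers below are illustrative, and the sm_86 case reflects its 1536-resident-threads-per-SM limit (which is why (512, 4) = 2048 triggers the warning there):

```cuda
// Illustrative per-architecture launch bounds.
// On sm_86, 4 blocks x 512 threads = 2048 resident threads exceeds the
// 1536-per-SM limit, hence the warning; (512, 3) = 1536 still fits.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 860
#define MIN_BLOCKS_PER_SM 3
#else
#define MIN_BLOCKS_PER_SM 4
#endif

__global__ void __launch_bounds__(512, MIN_BLOCKS_PER_SM)
my_kernel(float *out)   // placeholder kernel
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}
```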

That’s something I happen to see less often in my life than I would like :)

In my experience it helps in cases where the compiler just tips over a register-count threshold and occupancy drops noticeably.

Anyway, my problem is resolved now, thanks to everyone for your input! Best Regards!