I have an RTX 4060 and want to know: what is the best combination of dim3 Block and dim3 Grid to make my program run fast?
That depends on the program. I would suggest testing different block sizes and selecting the fastest one.
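To make the "test and pick the fastest" suggestion concrete, here is a minimal sketch of such a benchmark. It assumes a 1D kernel; `saxpy` is only a hypothetical stand-in for your own kernel, and the problem size, candidate block sizes, and repetition count are arbitrary choices for illustration. Timing uses CUDA events.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel (SAXPY); replace it with the kernel you want to tune.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Number of blocks needed to cover n elements (ceiling division).
static int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int n = 1 << 24;   // assumed problem size
    const int reps = 20;     // launches averaged per candidate
    float *x = nullptr, *y = nullptr;
    if (cudaMalloc(&x, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&y, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Candidate block sizes: multiples of the warp size, up to the 1024-thread limit.
    const int candidates[] = {32, 64, 128, 256, 512, 1024};
    for (int threads : candidates) {
        dim3 block(threads);
        dim3 grid(blocksFor(n, threads));

        saxpy<<<grid, block>>>(n, 2.0f, x, y);   // warm-up launch, not timed

        cudaEventRecord(start);
        for (int r = 0; r < reps; ++r)
            saxpy<<<grid, block>>>(n, 2.0f, x, y);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block=%4d  grid=%7u  avg=%.4f ms\n", threads, grid.x, ms / reps);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Compile with something like `nvcc -O3 bench.cu -o bench` and run it on the target GPU. The timings only mean something for the kernel and problem size you actually measure, so rerun the loop with your real kernel.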
This is one of the most commonly asked questions about CUDA. With a bit of searching you will find many suggested guidelines. Here is one example; there are many others.
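One guideline of this kind (not necessarily the one linked above) is to let the runtime suggest a starting block size with `cudaOccupancyMaxPotentialBlockSize`, which picks the block size that maximizes theoretical occupancy for a given kernel. The sketch below assumes a 1D kernel; `scale` and `ceilDiv` are hypothetical names used only for illustration. Maximum occupancy does not always mean the shortest run time, so treat the result as a starting point for measurement.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example kernel; substitute your own.
__global__ void scale(int n, float a, float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= a;
}

// Ceiling division: blocks needed to cover n elements.
static int ceilDiv(int n, int d)
{
    return (n + d - 1) / d;
}

int main()
{
    int minGridSize = 0;   // smallest grid that can reach full occupancy
    int blockSize = 0;     // suggested threads per block
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blockSize, scale, 0 /* dynamic shared memory */, 0 /* no block size limit */);
    if (err != cudaSuccess) {
        fprintf(stderr, "occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    const int n = 1 << 20;                      // assumed problem size
    const int gridSize = ceilDiv(n, blockSize); // enough blocks to cover all n elements
    printf("suggested block=%d, min grid=%d, grid used=%d\n",
           blockSize, minGridSize, gridSize);

    float *data = nullptr;
    if (cudaMalloc(&data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(data, 0, n * sizeof(float));

    scale<<<gridSize, blockSize>>>(n, 2.0f, data);
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(data);
    return err == cudaSuccess ? 0 : 1;
}
```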
Thank you all