Personally, I would use a strided for-loop in the kernel, where each thread processes multiple elements instead of just one (a sketch follows below). There is also the possibility of launching multiple kernels instead of one.
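For illustration, a minimal sketch of such a grid-stride loop (the kernel name `scale`, the element type, and the launch configuration are placeholders, not anything mandated by CUDA):

```cpp
// Grid-stride loop: each thread starts at its global index and then
// advances by the total number of threads in the grid, so a fixed-size
// grid can cover an arbitrarily large number of elements.
__global__ void scale(float *data, size_t n, float factor)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= factor;
}

// Launch with a grid sized to the device, not to the problem size, e.g.:
// scale<<<1024, 256>>>(d_data, n, 2.0f);
```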
For the grid dimension: You can spread a large number of blocks across the three dimensions: up to 2^31 - 1 blocks (31 bits) in the x dimension and up to 65535 (16 bits) each in the y and z dimensions.
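A rough host-side sketch of that decomposition (the helper name `makeGrid` is made up; it assumes numBlocks >= 1 and the current limits of 2^31 - 1 and 65535):

```cpp
#include <algorithm>
#include <cstddef>
#include <cuda_runtime.h> // for dim3

// Fill x first (up to 2^31 - 1 blocks), then y, then z (up to 65535 each).
// Rounding may over-provision blocks, so the kernel must compare its
// flattened block index against the true count.
dim3 makeGrid(std::size_t numBlocks)
{
    const std::size_t maxX = 2147483647, maxYZ = 65535;
    dim3 grid(1, 1, 1);
    grid.x = static_cast<unsigned>(std::min(numBlocks, maxX));
    std::size_t rest = (numBlocks + grid.x - 1) / grid.x; // blocks still needed
    grid.y = static_cast<unsigned>(std::min(rest, maxYZ));
    grid.z = static_cast<unsigned>((rest + grid.y - 1) / grid.y);
    return grid;
}
```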
For the block dimension: The number of threads per block is far below anything that would need 64 bits on current SM architectures. Not sure how fast Nvidia will catch up, but to give a hint: the maximum number of threads per block increased from 512 (early CUDA versions) to 1024. So it doubled (exponential increase), but slowly.
I would not expect any future expansions to the maximum grid dimensions supported right now. The runtime of a kernel launched with a maximally-dimensioned grid using current limits will exceed the physical life of the GPU, and with a tiny bit of math a multi-dimensional grid can be effectively re-shaped into a virtual single-dimensional one with almost 2^63 blocks (as Curefab already pointed out).
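For reference, a sketch of what that re-shaping looks like inside the kernel (the kernel name, the payload, and the `totalBlocks` parameter are placeholders; the host passes the true block count because the launched grid may be slightly over-provisioned):

```cpp
__global__ void kernel1D(float *data, size_t n, size_t totalBlocks)
{
    // Flatten the 3D block index into one virtual 1D block index,
    // good for almost 2^63 blocks at the current limits.
    size_t block = blockIdx.x
                 + (size_t)gridDim.x * blockIdx.y
                 + (size_t)gridDim.x * gridDim.y * blockIdx.z;
    if (block >= totalBlocks) return; // drop over-provisioned blocks

    size_t i = block * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f; // placeholder work
}
```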
That is a bit puzzling. Use of int64_t seems to compile fine.
CUDA does have hardware-imposed limits on grid and block dimensions. These are documented and can be retrieved directly with the deviceQuery sample code.
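The same limits can also be queried programmatically through the runtime API; this is essentially what deviceQuery prints (error checking omitted for brevity):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // properties of device 0

    std::printf("max grid size:         %d x %d x %d\n",
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    std::printf("max block dims:        %d x %d x %d\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    std::printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```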
Yes, and I would say the godbolt link I provided proves it.
None of those limits exceeds the range of positive values representable in a C++ int. If you have a number larger than that range, it is currently illegal to use it as a grid or block dimension in CUDA.