Launching 2**41 threads?


According to the 2.0 programming manual, the max values for grid.x and grid.y are 65535 and 65535 (call it 2**16).
Furthermore, the maximum number of threads that can be launched per block is 512 (2**9).

The manual talks about writing scalable code. Only the runtime knows the exact number of SMs
that are available. The manual also talks about writing kernels where each thread processes a different
data value.

Assuming that you actually had a problem that had, say, 2**41 data elements, could you actually
launch 2**41 threads (where 2**41 = 2**16 * 2**16 * 2**9)?
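As a rough sanity check on that product, here is a small Python sketch using only the limits quoted above (65535 for grid.x and grid.y, 512 threads per block). The constant names are made up for illustration, and the limits on any particular device may differ. Since 65535 is one short of 2**16, the real ceiling comes out slightly under 2**41.

```python
# Limits as quoted from the 2.0 programming manual in this post
# (assumed here; check the values for your own device).
MAX_GRID_X = 65535
MAX_GRID_Y = 65535
MAX_THREADS_PER_BLOCK = 512

# Largest number of threads a single launch could cover under these limits.
max_threads = MAX_GRID_X * MAX_GRID_Y * MAX_THREADS_PER_BLOCK

target = 2**41
shortfall = target - max_threads

print("max threads per launch:", max_threads)
print("2**41:                 ", target)
print("shortfall:             ", shortfall)
```

So even ignoring memory, a single launch under these limits would fall a little short of 2**41 threads, and each thread would have to handle more than one element.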

CUDA can be used in 3D computer graphics where you might have 200M vertices in a 3D model.
If your 3D application is such that each vertex can be transformed independently, is it reasonable
to launch 200M threads?
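To put numbers on the 200M case, here is a minimal Python sketch of how a one-thread-per-vertex launch might be sized under the limits mentioned earlier. The function name `launch_config` and its parameters are hypothetical, and it only does the host-side arithmetic, not an actual launch.

```python
def launch_config(n_elements, threads_per_block=512, max_grid_dim=65535):
    """Return (grid_x, grid_y, threads_per_block) covering n_elements.

    Assumes one thread per element and a 2D grid whose dimensions are
    each capped at max_grid_dim (the limits quoted in this thread).
    """
    if n_elements <= 0:
        raise ValueError("n_elements must be positive")

    # Ceiling division: number of blocks needed to cover every element.
    blocks = -(-n_elements // threads_per_block)

    # Fill grid.x first, then spill the remainder into grid.y.
    grid_x = min(blocks, max_grid_dim)
    grid_y = -(-blocks // grid_x)
    if grid_y > max_grid_dim:
        raise ValueError("problem too large for a single 2D grid")

    return grid_x, grid_y, threads_per_block


grid_x, grid_y, block = launch_config(200_000_000)
launched = grid_x * grid_y * block
print(grid_x, grid_y, block, launched)
```

Under these assumptions 200M vertices need more blocks than fit in grid.x alone, so the grid becomes two-dimensional. It also launches somewhat more threads than there are vertices, so the kernel would need a bounds check on its computed index.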

The CUDA hardware is amazing. We’ve done some really cool things with it already. The SIMT approach
makes it very easy to reason about kernels. I was just curious about some of these upper bounds.


I think the more immediate problem here is that no CUDA device can store 2**41 data elements of any size in the device memory. Even if each element is a bit, that’s 256 GB, and if each element is a byte, then the storage requirement is 2 TB.
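The storage figures can be checked with a few lines of Python. This is just the arithmetic, assuming binary units (1 GB = 2**30 bytes, 1 TB = 2**40 bytes); the variable names are for illustration only.

```python
N_ELEMENTS = 2**41

GIB = 2**30  # bytes per gigabyte (binary)
TIB = 2**40  # bytes per terabyte (binary)

# One bit per element: 8 elements pack into each byte.
bytes_if_bits = N_ELEMENTS // 8
gib_if_bits = bytes_if_bits // GIB

# One byte per element.
bytes_if_bytes = N_ELEMENTS
tib_if_bytes = bytes_if_bytes // TIB

print("1 bit per element: ", gib_if_bits, "GB")
print("1 byte per element:", tib_if_bytes, "TB")
```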