According to the 2.0 programming manual, the maximum values for grid.x and grid.y are each 65535 (call it 2^16).
Furthermore, the maximum number of threads that can be launched per block is 512 (2^9).
The manual talks about writing scalable code; only the runtime knows the exact number of SMs
that are available. The manual also talks about writing kernels where each thread processes a different data element.
Assuming that you actually had a problem with, say, 2^41 data elements, could you actually
launch 2^41 threads (where 2^41 = 2^16 * 2^16 * 2^9)?
CUDA can be used in 3D computer graphics, where you might have 200M vertices in a 3D model.
If your 3D application is such that each vertex can be transformed independently, is it reasonable
to launch 200M threads?
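For what it's worth, here is the kind of kernel I'm imagining. This is only a sketch (the names, the float4 vertex layout, and the transform placeholder are my own inventions, not from the manual): a grid-stride loop, so the launched grid size is decoupled from the 200M element count rather than needing one thread per vertex.

```cuda
// Sketch only: transform_verts, n, and the float4 layout are assumptions.
// Each thread strides through the array, so a capped grid can still
// cover an arbitrarily large n.
__global__ void transform_verts(float4 *verts, unsigned int n)
{
    unsigned int stride = gridDim.x * blockDim.x;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        float4 v = verts[i];
        // ... apply the per-vertex transform here ...
        verts[i] = v;
    }
}

// Host side: cap the grid at the 65535 limit instead of asking for
// 200M / 512 blocks outright.
//   unsigned int blocks = min((n + 511u) / 512u, 65535u);
//   transform_verts<<<blocks, 512>>>(d_verts, n);
```

With that pattern the question becomes less "can I launch 200M threads" and more "how many threads does it make sense to launch", which I suspect is what the scalability advice in the manual is getting at.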
The CUDA hardware is amazing. We’ve done some really cool things with it already. The SIMT approach
makes it very easy to reason about kernels. I was just curious about some of these upper bounds.