Does the CUDA driver API cuLaunchKernel have a limit on gridDimX? The sample vectorAddDrv fails when N = 70000000


I ran into a problem with the CUDA sample vectorAddDrv, which uses the driver API call cuLaunchKernel to launch the GPU kernel. With threadsPerBlock = 1024 and N = 50000000 elements, gridDimX = (50000000 + 1023) / 1024 = 48829 and cuLaunchKernel succeeds. If I change N to 70000000, gridDimX becomes 68360 and cuLaunchKernel returns CUDA_ERROR_INVALID_VALUE. I have read that there is a limit of 65536 on this parameter. If I need a larger number of elements, how can I work around this limit? Thank you.

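Devices below compute capability 3.0 allow at most 65535 blocks in each grid dimension, which is why gridDimX = 68360 is rejected. From compute capability 3.0 onward the limit on the x dimension is 2^31 - 1. So compile for sm_30 or higher and run on an sm_30 or higher device. Otherwise, restructure the launch to use a 2D or 3D grid and compute the linear element index inside the kernel.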
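
As a rough sketch of the 2D-grid option: the helper below splits the required block count across x and y so that neither dimension exceeds 65535, and the kernel flattens the 2D block index back into a linear element index. `makeGrid`, `Grid2D`, and `kMaxGridDim` are hypothetical names made up for this illustration, and the kernel signature is assumed to match the one in the sample, so adjust it to your copy of vectorAddDrv.

```cpp
#include <cassert>
#include <cstddef>

// Per-dimension grid limit on devices below compute capability 3.0.
constexpr unsigned int kMaxGridDim = 65535;

// Hypothetical helper type: grid dimensions to pass to cuLaunchKernel.
struct Grid2D {
    unsigned int x;
    unsigned int y;
};

// Split ceil(n / threadsPerBlock) blocks over a 2D grid so that
// neither x nor y exceeds kMaxGridDim.
inline Grid2D makeGrid(std::size_t n, unsigned int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    std::size_t blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    if (blocks == 0) blocks = 1;  // keep the launch configuration valid
    Grid2D g;
    g.y = static_cast<unsigned int>((blocks + kMaxGridDim - 1) / kMaxGridDim);
    g.x = static_cast<unsigned int>((blocks + g.y - 1) / g.y);
    return g;
}

#ifdef __CUDACC__
// Device side (only compiled by nvcc). Signature assumed to match the sample.
extern "C" __global__ void VecAdd_kernel(const float *A, const float *B,
                                         float *C, int N) {
    // Flatten the 2D block index into a linear block index.
    std::size_t block = static_cast<std::size_t>(blockIdx.y) * gridDim.x + blockIdx.x;
    std::size_t i = block * blockDim.x + threadIdx.x;
    // x * y may slightly exceed the needed block count, so the bounds
    // check is still required.
    if (i < static_cast<std::size_t>(N)) C[i] = A[i] + B[i];
}
#endif

// Host side launch, assuming `f` and `args` are set up as in the sample:
//   Grid2D g = makeGrid(N, threadsPerBlock);
//   cuLaunchKernel(f, g.x, g.y, 1, threadsPerBlock, 1, 1, 0, NULL, args, NULL);
```

For N = 70000000 and 1024 threads per block this gives a 34180 x 2 grid, which covers the 68360 blocks needed while staying under the per-dimension limit.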