Did you modify the project settings to compile for a compute capability 3.0 (cc3.0) device? (project…properties…CUDA…device)
I don’t happen to have that code in front of me, but if memory serves, the default project is set up to compile for a cc2.0 target (limited to 65535 blocks in the x grid dimension), and I suspect that code launches blocks of 128 threads. 128 threads per block * 65535 blocks = 8388480 threads.
If you compile for a cc3.0 device, the limit on the x grid dimension becomes much larger: 2^31-1 (2147483647) blocks.
Again, I don’t have the code in front of me. This is just conjecture.
You’ll see something like compute_20,sm_20 in the project properties under the device setting. Change that to compute_30,sm_30 and rebuild.