Tesla K80 limit of number of blocks in one dimension

Hello all,
I recently started writing CUDA code. I wrote a kernel and launched it as follows:

converted_sum<<<491520,512>>>(power_sum_dev, powers_gbest_dev);

Running this gives me the following error: invalid argument

But when I reduce the number of blocks to, say, 60000, the error goes away:

converted_sum<<<60000,512>>>(power_sum_dev, powers_gbest_dev);

So the problem is clearly that I am launching too many blocks. However, when I run deviceQuery, it shows that my K80 should support far more than 491520 blocks in the x dimension:

Max grid dimensions:  (2147483647, 65535, 65535)

So what’s the problem here? Any help would be appreciated.

Are you compiling for the correct architecture (compute capability 3.7)?

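Probably you are not. If you are using CUDA 8 or earlier and compiling for the default architecture, or for any cc2.x architecture, your code will fail exactly as you describe, even on a K80. For cc2.x targets the x dimension of the grid is limited to 65535 blocks, which is why a launch with 60000 blocks works and one with 491520 does not.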
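
To check this on your own machine, a minimal sketch along the following lines may help. It is not your code: probe_kernel is a hypothetical empty stand-in for converted_sum, the file name probe.cu is made up, and device 0 is assumed to be the K80. It prints the grid limit the device reports, launches with the same configuration as in the question, and checks the launch for errors.

```cuda
// probe.cu: hypothetical minimal reproduction, not the original poster's code.
// Build for the K80 with:  nvcc -arch=sm_37 -o probe probe.cu
#include <cstdio>
#include <cuda_runtime.h>

// Empty stand-in kernel; the real converted_sum body is not shown in the thread.
__global__ void probe_kernel() {}

int main()
{
    // Query what the hardware itself supports (assumes the K80 is device 0).
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Device 0: %s, cc %d.%d, max grid x = %d\n",
           prop.name, prop.major, prop.minor, prop.maxGridSize[0]);

    // Same launch configuration as in the question.
    probe_kernel<<<491520, 512>>>();

    // Errors in the launch configuration are reported by cudaGetLastError().
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Errors that occur while the kernel runs surface at synchronization.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("launch succeeded\n");
    return 0;
}
```

Built with -arch=sm_37, the launch should go through on a K80. Built for a cc2.x target (for example the CUDA 8 default), I would expect it to report the same "invalid argument" error described above, even though the device still reports the larger grid limit.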

Oh, I see. I didn’t know that. That must be the problem. As I mentioned, I only recently started working with CUDA. Thank you very much!