Hello all,
I recently started programming in CUDA. I wrote a kernel and launched it as follows:
converted_sum<<<491520,512>>>(power_sum_dev, powers_gbest_dev);
Running this gives me the following error: invalid argument
However, when I reduce the number of blocks to, say, 60000, the error goes away:
converted_sum<<<60000,512>>>(power_sum_dev, powers_gbest_dev);
So the problem clearly seems to be that I am launching too many blocks. However, deviceQuery shows that my K80 should support far more than 491520 blocks in the x dimension:
Max grid dimensions: (2147483647, 65535, 65535)
So what’s the problem here? Any help would be appreciated.