I have the following kernel:
__global__ void kernel(int *d_array) {
    // Global 3D index of this thread
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned int z = blockIdx.z*blockDim.z + threadIdx.z;
    // Skip threads that fall outside the problem size
    if (x < 2000000 && y < 12500 && z < 100000) {
        /* do stuff */
    }
}
It compiles fine, but at run time the kernel launch fails with: invalid configuration argument.
In the end I found that the launch only works when I use a dim3 ThreadsPerBlocks like the following:
dim3 ThreadsPerBlocks(1,32,32);
The CUDA C Programming Guide says: “A thread block size of 16x16 (256 threads), although arbitrary in this case, is a common choice.” That does not help at all, and I cannot find any information on how dim3 ThreadsPerBlocks works when all three dimensions are used.
So why am I not able to use something like dim3 ThreadsPerBlocks(1024,1024,64) together with dim3 numBlocks(1954,13,1563)?
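To make the two configurations easier to compare, here is a minimal host-side sketch of the arithmetic involved. It is plain C++ and does not call the CUDA runtime. The limits are assumptions taken from the documented values for compute capability 2.x and later (at most 1024 threads per block in total, block dimensions up to 1024 x 1024 x 64) and 3.0 and later (grid x up to 2^31-1, grid y and z up to 65535); the real values for a given device should be read with cudaGetDeviceProperties. The names Dim, blockIsValid, gridIsValid and blocksNeeded are hypothetical helpers, not part of CUDA.

```cpp
// Host-side sanity check of a launch configuration (plain C++, also valid in a .cu file).
// ASSUMPTION: the limits below are the documented ones for compute capability
// 2.x and later (block) and 3.0 and later (grid x); query the device to be sure.

struct Dim { unsigned long long x, y, z; };  // hypothetical stand-in for dim3

constexpr unsigned long long kMaxThreadsPerBlock = 1024;   // x * y * z of one block
constexpr Dim kMaxBlockDim = {1024, 1024, 64};             // per-dimension block limits
constexpr Dim kMaxGridDim  = {2147483647ULL, 65535, 65535}; // per-dimension grid limits

// True if every component is at least 1 and no larger than the matching limit.
constexpr bool withinLimits(Dim d, Dim max) {
    return d.x >= 1 && d.y >= 1 && d.z >= 1 &&
           d.x <= max.x && d.y <= max.y && d.z <= max.z;
}

// A block must respect the per-dimension limits AND the total thread limit.
constexpr bool blockIsValid(Dim b) {
    return withinLimits(b, kMaxBlockDim) &&
           b.x * b.y * b.z <= kMaxThreadsPerBlock;
}

// A grid only has per-dimension limits.
constexpr bool gridIsValid(Dim g) {
    return withinLimits(g, kMaxGridDim);
}

// Ceiling division: number of blocks needed to cover n elements.
constexpr unsigned long long blocksNeeded(unsigned long long n,
                                          unsigned long long perBlock) {
    return (n + perBlock - 1) / perBlock;
}

// The configurations from the question:
static_assert(blockIsValid(Dim{1, 32, 32}), "1*32*32 = 1024 threads fits");
static_assert(!blockIsValid(Dim{1024, 1024, 64}), "1024*1024*64 threads is too many");
static_assert(gridIsValid(Dim{1954, 13, 1563}), "grid is within the assumed limits");
```

Under these assumed limits, (1024,1024,64) respects each per-dimension maximum but asks for 1024*1024*64 = 67,108,864 threads in a single block, far above the total of 1024, while (1,32,32) lands exactly on 1024. The grid (1954,13,1563) passes the check on its own, which suggests the block size rather than the grid size is what the launch rejects.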