In some specific situations, a large number of threads, say 4G (2^32), is required.

So defining:

dim3 grd(65536,256,1);

dim3 blck(256,1,1);

should be legal (assuming enough resources are available), but unfortunately it doesn’t work. Other combinations that give 4G threads work perfectly. For example:

dim3 grd(4096,4096,1); // 2^24 blocks

dim3 blck(256,1,1); // x256 = 4G threads

or

dim3 grd(32768,512,1); // also 2^24 blocks

dim3 blck(256,1,1);

work fine, but they require a more complicated calculation of the unique thread ID.
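To make the ID calculation concrete, here is a minimal host-side sketch of one way to linearize a 2D grid of 1D blocks. It is plain C++ so the arithmetic can be checked without a GPU. The `LaunchPoint` struct and the `linearThreadId` function are hypothetical names introduced only for this illustration, and treating `blockIdx.x` as the most significant part is an assumption chosen to match the kernels further down.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-in for CUDA's built-in index variables,
// used only so the arithmetic can be checked on the CPU.
struct LaunchPoint {
    std::uint32_t blockIdxX;
    std::uint32_t blockIdxY;
    std::uint32_t threadIdxX;
};

// Linear thread ID for a 2D grid of 1D blocks, with blockIdx.x as the
// most significant part. The result is 64-bit so that the total thread
// count (2^32) cannot overflow the intermediate products.
std::uint64_t linearThreadId(const LaunchPoint& p,
                             std::uint32_t gridDimY,
                             std::uint32_t blockDimX)
{
    assert(p.blockIdxY < gridDimY && p.threadIdxX < blockDimX);
    std::uint64_t block =
        static_cast<std::uint64_t>(p.blockIdxX) * gridDimY + p.blockIdxY;
    return block * blockDimX + p.threadIdxX;
}
```

With `gridDimY = 4096` and `blockDimX = 256` the multiplications reduce to shifts by 20 and 8 bits, and with `gridDimY = 512` the last thread of a 32768 x 512 grid also lands on 2^32 - 1.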

An example with an obvious need for this: say you want to find all 32-bit numbers that are the product of exactly two 16-bit prime numbers. The idea is that each thread represents one 32-bit number and, instead of using slow operations like mod or division, the factorization is done by simply splitting the index. An array of, say, 65536 prime numbers is precalculated and passed to the kernel function. With the grid dimensions that currently work, the code looks like this:

```
dim3 grd(4096,4096,1);
dim3 blck(256,1,1);
__global__ void FindAll(unsigned int* Primes)
{
    // 12 bits of blockIdx.x, 12 bits of blockIdx.y, 8 bits of threadIdx.x
    unsigned int UniqId = (blockIdx.x << 20) + (blockIdx.y << 8) + threadIdx.x;
    unsigned int index1 = UniqId >> 16;         // upper 16 bits
    unsigned int index2 = UniqId & 0x0000ffff;  // lower 16 bits
    if (Primes[index1] * Primes[index2] == UniqId) {
        // ... number is found
        // ... store routine
    }
}
```
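The shifting and masking in this kernel can be checked on the host. The sketch below repeats the same packing and split in plain C++; `packId`, `highIndex` and `lowIndex` are hypothetical helper names used only here, and the range check assumes the 4096 x 4096 grid with 256 threads per block.

```cpp
#include <cassert>
#include <cstdint>

// Same packing as the kernel: 12 bits of blockIdx.x, 12 bits of
// blockIdx.y and 8 bits of threadIdx.x fill one 32-bit ID.
std::uint32_t packId(std::uint32_t bx, std::uint32_t by, std::uint32_t tx)
{
    assert(bx < 4096 && by < 4096 && tx < 256);
    return (bx << 20) + (by << 8) + tx;
}

// The upper 16 bits select the first table entry (index1 in the kernel).
std::uint32_t highIndex(std::uint32_t id) { return id >> 16; }

// The lower 16 bits select the second table entry (index2 in the kernel).
std::uint32_t lowIndex(std::uint32_t id) { return id & 0x0000ffffu; }
```

Note that `index1` ends up mixing all of `blockIdx.x` with the top 4 bits of `blockIdx.y`, which is why neither index maps directly onto a single built-in variable in this layout.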

The code would be faster and more readable, without the unnecessary bit shifting and masking, if the grid could be defined as

```
dim3 grd(65536,256,1);
dim3 blck(256,1,1);
__global__ void FindAll(unsigned int* Primes)
{
    unsigned int UniqId = (blockIdx.x << 16);              // upper 16 bits
    unsigned int index = (blockIdx.y << 8) + threadIdx.x;  // lower 16 bits
    UniqId += index;
    if (Primes[blockIdx.x] * Primes[index] == UniqId) {
        // ... number is found
        // ... store routine
    }
}
```
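While the 65536 x 256 grid does not launch, one possible workaround is to keep the 4096 x 4096 grid and derive "virtual" block coordinates for the wider layout. The host-side sketch below only checks that such a mapping yields the same IDs; `VirtualBlock`, `toVirtualBlock` and `idFromVirtual` are hypothetical names, not part of CUDA.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: block coordinates in the desired 65536 x 256 grid.
struct VirtualBlock {
    std::uint32_t x;  // 0..65535, plays the role of blockIdx.x
    std::uint32_t y;  // 0..255,   plays the role of blockIdx.y
};

// Map a block of the 4096 x 4096 grid onto the 65536 x 256 layout:
// the top 4 bits of by move into the low bits of the virtual x.
VirtualBlock toVirtualBlock(std::uint32_t bx, std::uint32_t by)
{
    assert(bx < 4096 && by < 4096);
    return VirtualBlock{ (bx << 4) | (by >> 8), by & 0xffu };
}

// ID computed the way the second kernel does it, from virtual coordinates.
std::uint32_t idFromVirtual(VirtualBlock vb, std::uint32_t tx)
{
    return (vb.x << 16) + (vb.y << 8) + tx;
}
```

Inside a kernel the arguments would be `blockIdx.x`, `blockIdx.y` and `threadIdx.x`. This brings back some shifting, so it recovers the readability of the second version rather than its speed.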

According to the CUDA 2.0 documentation, the maximum grid dimension should be 65536, but it doesn’t work. Why? Is this a bug?