BUG in CUDA 2.0 Beta 2: max grid dimension

In some specific situations a large number of threads is required, let's say 4G (2^32).

So defining:

dim3 grd(65536,256,1);
dim3 blck(256,1,1);

should be legal (assuming you have enough resources available), but unfortunately it doesn't work. Other combinations that give 4G threads work perfectly. For example:

dim3 grd(4096,4096,1); // 2^24 blocks
dim3 blck(256,1,1);    // x 256 = 4G threads

or

dim3 grd(32768,512,1); // also 2^24 blocks
dim3 blck(256,1,1);

Both work fine, but they require a more complicated unique thread ID calculation and usage.

An example with an obvious need for that: let's say you need to find all 32-bit numbers which are the product of exactly two 16-bit prime numbers. The idea is based on the fact that each thread can represent one 32-bit number, and instead of using slow operations like mod or division, the factorization is done just by splitting the index. An array of, let's say, 65536 prime numbers is also precalculated and passed to the kernel function. With the currently available grid dimensions the code looks like:

dim3 grd(4096,4096,1);
dim3 blck(256,1,1);

__global__ void FindAll(unsigned int* Prime)
{
    register unsigned int UniqId = (blockIdx.x << 20) + (blockIdx.y << 8) + threadIdx.x;
    register unsigned int index1 = UniqId >> 16;
    register unsigned int index2 = UniqId & 0x0000ffff;

    if (Prime[index1] * Prime[index2] == UniqId) {
        ... // number is found
        ... // store routine
    }
}
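For completeness, the host side under the current limits might look roughly like this (an untested sketch; the h_Prime/d_Prime names are my own, and the prime-table setup and error checking are left out):

unsigned int* d_Prime;
size_t size = 65536 * sizeof(unsigned int);
cudaMalloc((void**)&d_Prime, size);
cudaMemcpy(d_Prime, h_Prime, size, cudaMemcpyHostToDevice); // h_Prime filled on the CPU beforehand
dim3 grd(4096,4096,1);
dim3 blck(256,1,1);
FindAll<<<grd, blck>>>(d_Prime);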

The code would be faster and more readable, without the unnecessary bit shifting and masking, if the grid could be defined as:

dim3 grd(65536,256,1);
dim3 blck(256,1,1);

__global__ void FindAll(unsigned int* Prime)
{
    register unsigned int UniqId = (blockIdx.x << 16);
    register unsigned int index  = (blockIdx.y << 8) + threadIdx.x;
    UniqId += index;

    if (Prime[blockIdx.x] * Prime[index] == UniqId) {
        ... // number is found
        ... // store routine
    }
}

According to the CUDA 2.0 documentation the maximal grid dimension should be 65536, but it doesn't work. Why? Bug?

If you read that part of the documentation again, I think you will find the limit is 65535.
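Something like this quick (untested) check will show what the runtime itself reports:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("max grid size: %d x %d x %d\n",
       prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
// current hardware should report 65535 x 65535 x 1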

Yes, you are right about the documentation.

But it looks like a software limitation. Declaring a dimension of 65535 elements means its index counts from 0 to 65534 (one value missing), while the special hardware registers (blockIdx.x and blockIdx.y) are 16 bits in size and could handle the full range of 65536 elements (0 to 65535).

Another thing about that: the dim3 type, defined in software for the grid and block dimensions, can hold the value 0. The doc says that any omitted component will be initialized to 1, but if a value is given explicitly it can be 0. The only problem is in the part that calculates how many blocks to create, because it takes that 0 as-is and decides that 0 blocks should be created, which results in no kernel execution. A dimension of size 0 has no meaning anyway, so interpreting it as 65536 would make sense. I think the existing hardware could work without any changes; the only thing that would have to change is the software, and only the part that calculates how many blocks to launch. If it could launch 65536 blocks, then everything else should work as expected; the only difference would be in the special registers gridDim.x and gridDim.y (which are also 16 bits in size): they would read as 0.

Oh, I see your point. Extending the limit by 1 could be useful in some cases.

But in your place I would consider reducing the grid to much smaller dimensions.

Yes, it can speed up small kernels written in a maximally optimized way, where using one extra register or one extra assembly instruction is considered a very costly operation.

I don't think this concrete problem can be solved in a faster way than this. This way, in a single kernel execution, without mod and div operations and without any iteration, you find all 32-bit numbers which are the product of exactly two primes, so reducing the grid size in this example makes no sense.

I doubt it. E.g. you are using

Prime[blockIdx.x]

Since this is used by all threads, you should probably cache it in a shared variable.

This is even more effective if you combine multiple blocks into one, doing something like:

__shared__ unsigned int cache[16];

if (threadIdx.x < 16) cache[threadIdx.x] = Prime[16 * blockIdx.x + threadIdx.x];
__syncthreads();
...
// UniqId now has to be rebuilt per cached prime, i.e. ((16 * blockIdx.x + k) << 16) + index
if (cache[0] * Prime[index] == ((16 * blockIdx.x + 0) << 16) + index) {
...
}
if (cache[1] * Prime[index] == ((16 * blockIdx.x + 1) << 16) + index) {
...
}
...

It takes a bit more calculation, but it reduces the memory bandwidth a lot and reduces the number of blocks.
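If I read your kernel right, folding 16 of the x-blocks into one would also bring the launch back within the current limits, roughly:

dim3 grd(4096,256,1); // 65536 / 16 blocks in x
dim3 blck(256,1,1);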

Disclaimer: as always when I write untested code, it is likely to have bugs. Nevertheless, I think you can greatly reduce the memory bandwidth of your code, and even more so when you combine blocks.

I get your point, but I am not sure about that, since half the threads diverge, and there is also the time for storing to and reading from shared memory. Only tests with time measurements (which will be posted soon) can show the outcome of that trade-off versus accessing texture or global memory where all threads of a warp read the same memory location (it should be only one memory read, since the hardware cache should handle it).
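For the measurements, the event API is probably the easiest way; a rough sketch, reusing the names from above:

cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
FindAll<<<grd, blck>>>(d_Prime);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop); // kernel time in milliseconds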

There is no cache for global memory unless you use tex1Dfetch, and even then the cache is not that fast. And the way I understand page 54 of the CUDA 2.0 beta 2 programming guide, unless you have a GTX260/280, N threads accessing the same global memory location result in N memory transfers.

Even on a GTX260/280 you only avoid the major penalty; you still do not get the speedup of coalesced reads.
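For reference, the tex1Dfetch path would look roughly like this (untested, and the texture name is my own):

texture<unsigned int, 1, cudaReadModeElementType> texPrime; // file scope

// host side, once, after copying the table to d_Prime:
cudaBindTexture(0, texPrime, d_Prime, 65536 * sizeof(unsigned int));

// in the kernel, instead of Prime[index]:
unsigned int p = tex1Dfetch(texPrime, index);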

There is also the issue of reading Prime[index] multiple times in different blocks. Even if there were a cache, since you do not know in which order the blocks are executed, it is not at all certain it would still be in the cache.

I am also unsure what you mean by "half the threads diverge": if you mean the == UniqId comparisons, they diverge just as much in your original code.

But you are of course right: without benchmarks it is all only guessing, though I would be quite surprised if your current implementation is close to optimal :tongue:

Also you are optimizing the amount of calculation, which in this case is probably not needed at all, since you will be memory bandwidth bound.