Kernel Dimensions


I know the exact number of threads I will need. The problem is, I am having trouble figuring out how to map that number into Grid and Block dimensions.

The total combinations for a simulation will be: 133,784,560.
Let’s assume anything prefixed with ‘g’ is grid and ‘b’ is block
we basically have this equation: gDimx * gDimy * gDimz * bDimx * bDimy * bDimz = 133,784,560.
all we know about the variables are limitations given by the hardware. ie, gDimz = 1; bDimxbDimybDimz < 512; gDimx || gDimy <= 65535
There are plenty of combinations that will suffice for example:
[indent] gDimx: 611
gDimy: 595
gDimz: 1
bDimx: 368
bDimy: 1
bDimz: 1[/indent]
Now, I’m not very proud of the way I am figuring that out at the moment as I am basically doing a brute force. I’m sure there is a much more elegant algebraic way to solve this getting a range for each of the variables since we have the hardware constraints. But, I am finding my algebra skills are far from good. Can anyone provide some insight?


Well, do you need to map it in a 2 dimensional way? If not, I’d stick with 1D grid dimensions and 1D block dimensions.
The second thing I’d do is to stop worrying about getting the number 133,784,560 exactly. So I’d fix the number of threads per block bDimx=512 (max for compute 1.x) and calculate the number of blocks I would need. So suppose the problem size is N. to calculate the number of blocks ceil(((float)N)/bDimx) will be sufficient.

Then in my kernel I would simply execute conditionally if the calculated thread id is less then N. if(tid<N) {}.
I hope I understood what you need to do correctly.

Yes, you did. I was just making it more complicated than it needed to be.

Although, I am having difficulty seeing how the problem will map to the framework, given your solution. Here is some more detail:

I have 52 unique values from which I am choosing 7 without replacement. 52 nCr 7 = 133784560. I was hoping to build the kernel dimensions in such a way that I could easily have the ‘id’ map to one of the 52 unique values.

maybe some pseudo code from the old code I am trying to map into CUDA will help.

pseudo code:

for i in 1..52

  for j=i+1 in 1..52

    for k=j+1 in 1..52

      for l=k+1 in 1..52

        for m=l+1 in 1..52

          for n=m+1 in 1..52

            for o=n+1 in 1..52

               values = [i, j, k, l, m, n, o]

      something with values..

So, in terms of a kernel I am having trouble seeing how I would calculate ‘distinct’ id’s for the ‘values’ array referenced in the code above.

Hopefully I am making sense here, please let me know if I need to clarify something. Thanks so much for the response.

I see what you are trying to do.
I don’t think you can pre-map the thread ID’s in such a way to get all the combinations without replacement and without repetition. I think the problem is too complex for that. I see 3 ways you can approach this.

  1. Precalculate the indices and pass them to the GPU. This should be a good approach if you reuse these values and if the work you do with them is more intensive than actually generating them. It’s even better if you can reuse these indices.
  2. There are parallel algorithms for enumerating the combinations.
  3. You can brute force through the combinations and check for duplicates.

Perhaps someone else knows a better way.

If I precalculate and pass to the GPU I think that would take up more resources than I would want. Roughly, 7*133 = 931MB? I could be super off here, or just have the wrong perception of what you mean by number 1. I am assuming you me precalculate a 2D array of indicies such as:




and pass this to device, suppose I could chunk it and do it in multiple kernels…

What do you think of this (below)?!?!?

for the sake of index simplicity just setting up simple dimensions GRID(52,52,1) and BLOCK(52,1,1)

and inside the kernel do something simple like:

kernel<<GRID, BLOCK>>()


  for(i = 0; i < 52; i++)

    for(j = i+1; j < 52; j++)

      for(k = j+1; k < 52; k++)

        for(l = k+1; l < 52; l++)

          // check for matching ids

          // if none match

          values[i, j, k, l, gridx, gridy, blockx]

          // do some calculations with values


Removes some of the magnitude, but probably not the best use of the hardware etc…very easy to code though.