Grid 4x4 but only runs 10 blocks?


I’m trying to run a simulation and I have to vary the number of blocks and the thread sizing. For threads I think there is a maximum of 512x512, and blocks have a 65535x65535 maximum? Anyway, I set up my grid to be 4,4,1 and my threads to be 2,2,1, but it only runs 10 of the blocks, both in simulation and on the card.

Anyone know why this occurs?

There is a maximum of 512 threads per block in total, not 512 in each dimension.
You can find this info with deviceQuery (or in the manual).

I also thought that the grid is 2D and the thread block can be 3D, but I’m not sure. The maximum number of thread blocks in a grid is 65535 in each dimension. You can find all this information in the programming guide (in one of the appendices).
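For anyone who wants to check these limits on their own card rather than rely on the numbers quoted above, here is a minimal sketch that queries them through the runtime API. It assumes device 0 is the card of interest, and the values printed will differ between GPU generations.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Assumption: device 0 is the GPU you are running on.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // Total threads allowed in one block (x * y * z must not exceed this).
    std::printf("max threads per block: %d\n", prop.maxThreadsPerBlock);

    // Per-dimension limits on the block shape.
    std::printf("max block dims: %d x %d x %d\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);

    // Per-dimension limits on the grid shape.
    std::printf("max grid dims: %d x %d x %d\n",
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```

A 4,4,1 grid of 2,2,1 blocks is far below any of these limits, so the limits themselves should not be what stops blocks from running here.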

How are you certain that only 10 blocks are executed?

When I run the program, it only outputs the data for the first 10 blocks.

My kernel runs in every thread and writes out data. Currently the output only covers 10 blocks.

I’m still not sure what the problem could be.

As for grid and thread sizing, if you do not specify a third dimension value, I think it just defaults to 1 anyway.
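A quick way to confirm the default is to build the two `dim3` values with only two arguments and print all three components. This is a small host-only sketch; the variable names `grid` and `block` are just for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Only x and y are given; dim3 fills any unspecified component with 1.
    dim3 grid(4, 4);
    dim3 block(2, 2);

    std::printf("grid  = %u,%u,%u\n", grid.x, grid.y, grid.z);     // prints "grid  = 4,4,1"
    std::printf("block = %u,%u,%u\n", block.x, block.y, block.z);  // prints "block = 2,2,1"
    return 0;
}
```

So writing 4,4 and 4,4,1 gives the same launch configuration.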

I can try without it, but I don’t think that helps.

Could you post code? It is kind of hard to see what might be wrong.

The code is quite long and convoluted but I figured out the problem.

It was due to me trying to copy all the memory I was working with to the output in every thread, instead of just the part that thread should touch.
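Since the original code was not posted, the following is only a sketch of the pattern described, not the actual fix. It assumes a hypothetical kernel `recordBlock` in which each thread computes its own global index and writes exactly one output element. The host then counts how many distinct blocks left a mark, which is one way to check how many blocks really ran.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes only the one element it owns,
// rather than copying the whole working buffer to the output.
__global__ void recordBlock(int *out, int width) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;      // global column
    int y = blockIdx.y * blockDim.y + threadIdx.y;      // global row
    int blockId = blockIdx.y * gridDim.x + blockIdx.x;  // linear block index
    out[y * width + x] = blockId;
}

int main() {
    dim3 grid(4, 4), block(2, 2);           // same sizes as in the question
    const int width  = grid.x * block.x;    // 8 threads across
    const int height = grid.y * block.y;    // 8 threads down
    const int n = width * height;           // 64 threads, one element each
    const int numBlocks = grid.x * grid.y;  // 16 blocks

    int *dOut = nullptr;
    if (cudaMalloc(&dOut, n * sizeof(int)) != cudaSuccess) return 1;
    cudaMemset(dOut, 0xFF, n * sizeof(int));  // every int becomes -1 (untouched)

    recordBlock<<<grid, block>>>(dOut, width);
    cudaError_t err = cudaGetLastError();            // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(dOut);
        return 1;
    }

    int hOut[64];
    cudaMemcpy(hOut, dOut, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Count the distinct block indices that appear in the output.
    bool seen[16] = {false};
    int blocksSeen = 0;
    for (int i = 0; i < n; ++i) {
        if (hOut[i] >= 0 && !seen[hOut[i]]) {
            seen[hOut[i]] = true;
            ++blocksSeen;
        }
    }
    std::printf("blocks that wrote output: %d of %d\n", blocksSeen, numBlocks);

    cudaFree(dOut);
    return 0;
}
```

Checking the error code after the launch is worth doing in any case, because a kernel that writes outside its own region can fail without printing anything and leave the output looking as if only some blocks ran.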

So this is an FYI in case this happens to anyone.