Mapping each element to a thread block

Dear All,

I’m working on finite element methods. Previously, I assigned one “Element” to each “Thread” on the GPU, so the computations for each element had to be serialized. Now I want to map each element to its own “Block of Threads” in order to parallelize the element computations. I did this in the following way:

thread_per_block = number of Degree of Freedom
block_per_grid = number of elements

__global__ void kernel(element* e)
{
.
.
.
e[blockIdx.x].R[threadIdx.x] = …
.
.
.
}

This gives the same results as the previous version, but it takes more computation time, which is not what I expected.

How can I correctly map each element to a “ThreadBlock”?

Thanks a lot,
Behzad

My advice, without having actually programmed anything yet and based only on a lot of reading:

ThreadsPerBlock = ceil( TotalThreads / NumberOfMultiProcessors );

BlocksPerGrid = ceil(TotalThreads / ThreadsPerBlock);

However, I haven’t figured out yet what to do if the number of threads per block, or the number of blocks, exceeds the limit.

But perhaps your app won’t exceed the limit.
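To make the two formulas above concrete, here is a minimal host-side sketch in C++ (it would also compile inside a .cu file). The names `ceilDiv`, `LaunchConfig` and `chooseLaunchConfig` are hypothetical, and capping the block size at `maxThreadsPerBlock` is only one possible answer to the "exceeds the limit" question, not an established rule. The limit itself is passed in as a parameter because it depends on the device.

```cpp
#include <algorithm>

// Hypothetical container for the two launch parameters.
struct LaunchConfig {
    int threadsPerBlock;
    int blocksPerGrid;
};

// Integer ceiling division: smallest q such that q * b >= a (for a >= 0, b > 0).
constexpr int ceilDiv(int a, int b) { return (a + b - 1) / b; }

// Applies the formulas from the post, then caps the block size at the
// device limit (an assumption, supplied by the caller) and recomputes
// the number of blocks from the capped block size.
constexpr LaunchConfig chooseLaunchConfig(int totalThreads,
                                          int numMultiProcessors,
                                          int maxThreadsPerBlock) {
    int threadsPerBlock = ceilDiv(totalThreads, numMultiProcessors);
    threadsPerBlock = std::min(threadsPerBlock, maxThreadsPerBlock);
    int blocksPerGrid = ceilDiv(totalThreads, threadsPerBlock);
    return LaunchConfig{threadsPerBlock, blocksPerGrid};
}
```

For example, `chooseLaunchConfig(100000, 16, 1024)` first gets 6250 threads per block from the formula, caps that at 1024, and ends up with 98 blocks. Since 98 * 1024 is more than 100000, the kernel would need a bounds check on its global index so the extra threads do nothing.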

ThreadsPerBlock should be at least 32, and it’s better if it’s a multiple of 32 (the warp size).

So try to keep ThreadsPerBlock at 32, 64, 96, 128, 160 and so forth, on up through 256, 288, etc.
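As a rough sketch of that rounding rule, the helper below rounds a requested thread count up to the next multiple of the warp size. The names `kWarpSize` and `roundUpToWarp` are hypothetical, and the value 32 is taken from the post rather than queried from the device.

```cpp
// Warp size as stated in the post; an assumption here, not read from the device.
constexpr int kWarpSize = 32;

// Rounds a thread count up to the next multiple of kWarpSize,
// e.g. for one thread per degree of freedom of an element.
constexpr int roundUpToWarp(int threads) {
    return ((threads + kWarpSize - 1) / kWarpSize) * kWarpSize;
}
```

For the one-element-per-block mapping in the original question, an element with 24 degrees of freedom would then get a block of `roundUpToWarp(24)` = 32 threads, and the kernel would need a guard such as `if (threadIdx.x < numDof)` (with `numDof` a hypothetical parameter) so the 8 padding threads stay idle.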

But if you have more than one multiprocessor, the work must also be distributed across the multiprocessors, hence the division formulas above: they spread the threads out over warps and multiprocessors.

Also see the CUDA occupancy calculator (an Excel spreadsheet) for working out kernel launch parameters, or something like that…

You usually want more than one block per SM, and the blocksize a multiple of 64 (two warps).
I’ve tried to summarize the main rules for choosing the blocksize in another thread.