Dear All,

I’m working on finite element methods. Previously, I assigned one “Element” to each “Thread” on the GPU, so the computations required for each element had to be serialized within that thread. Now I want to map each element to its own “Block of Threads” in order to parallelize the element computations. I did this in the format below:

thread_per_block = number of degrees of freedom

block_per_grid = number of elements

__global__ void kernel ( element* e )

{

.

.

.

e[blockIdx.x].R[threadIdx.x] = …

.

.

.

}
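To make the mapping concrete, here is a minimal, self-contained sketch of the layout described above: one block per element and one thread per degree of freedom. The names `NUM_DOF` and `run_demo`, the reduced `element` struct, the extra `num_elements` parameter, and the placeholder value written to `R` are assumptions for illustration only; my real kernel does the actual element computation where the placeholder is.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

#define NUM_DOF 8  // hypothetical: degrees of freedom per element

// Hypothetical, reduced element layout (the real struct has more fields).
struct element {
    float R[NUM_DOF];
};

// One block per element (blockIdx.x), one thread per degree of freedom (threadIdx.x).
__global__ void kernel(element* e, int num_elements)
{
    int elem = blockIdx.x;
    int dof  = threadIdx.x;
    if (elem < num_elements && dof < NUM_DOF) {
        // Placeholder for the real per-DOF computation.
        e[elem].R[dof] = (float)(elem * NUM_DOF + dof);
    }
}

// Launches the kernel and checks the result on the host.
// Returns the number of wrong entries (0 on success) or -1 on a CUDA error.
int run_demo(int num_elements)
{
    size_t bytes = (size_t)num_elements * sizeof(element);
    element* d_e = nullptr;
    if (cudaMalloc(&d_e, bytes) != cudaSuccess) return -1;

    // block_per_grid = number of elements, thread_per_block = number of DOF
    kernel<<<num_elements, NUM_DOF>>>(d_e, num_elements);
    if (cudaGetLastError() != cudaSuccess ||
        cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(d_e);
        return -1;
    }

    element* h_e = (element*)malloc(bytes);
    if (h_e == nullptr) { cudaFree(d_e); return -1; }
    if (cudaMemcpy(h_e, d_e, bytes, cudaMemcpyDeviceToHost) != cudaSuccess) {
        free(h_e);
        cudaFree(d_e);
        return -1;
    }

    // Compare every entry against the placeholder formula used in the kernel.
    int wrong = 0;
    for (int i = 0; i < num_elements; ++i)
        for (int j = 0; j < NUM_DOF; ++j)
            if (h_e[i].R[j] != (float)(i * NUM_DOF + j)) ++wrong;

    free(h_e);
    cudaFree(d_e);
    return wrong;
}
```

Calling `run_demo(1000)` from `main` launches 1000 blocks of `NUM_DOF` threads each. One thing I notice in this sketch: with `NUM_DOF` smaller than the warp size of 32, each block fills only part of a warp, so some lanes sit idle. I am not sure whether that explains the timing I see.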

This produces the same results as the previous version, but it takes longer to run, which is not what I expected.

How can I correctly map each element to its own “ThreadBlock”?

Thanks a lot,

Behzad