Hi,

I wrote a kernel that does not use shared memory and works on a problem that splits very well into independent parallel parts.

The number of parts is only known at runtime. In one example it is approx. 15000 parts.

In the kernel I compute “tid” and loop over it like this (max_tid is the number of parts):

    for (tid = threadIdx.x + blockIdx.x * blockDim.x;
         tid < max_tid;
         tid += blockDim.x * gridDim.x)
    {
        if (tid < max_tid) {
            // … do the work
        }
    }
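For reference, here is a minimal self-contained sketch of how the loop sits in a complete program. The kernel name process_parts, the helper time_launch, the output array, and the placeholder work (writing 2 * tid) are all made up for illustration; the real kernel does something else in place of that line. The sketch times two launch configurations (one thread per block and 50 threads per block) with CUDA events, assuming a single GPU and the runtime API.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: each part tid writes 2 * tid.
__global__ void process_parts(int *out, int max_tid)
{
    for (int tid = threadIdx.x + blockIdx.x * blockDim.x;
         tid < max_tid;
         tid += blockDim.x * gridDim.x)
    {
        out[tid] = 2 * tid;  // placeholder for "do the work"
    }
}

// Launches the kernel with the given configuration and returns the
// elapsed GPU time in milliseconds, measured with CUDA events.
static float time_launch(int blocks, int threads, int *d_out, int max_tid)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    process_parts<<<blocks, threads>>>(d_out, max_tid);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int max_tid = 15000;  // number of parts, as in the example above

    int *d_out = nullptr;
    if (cudaMalloc(&d_out, max_tid * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // Warm-up launch so one-time startup cost does not skew the first timing.
    time_launch(max_tid, 1, d_out, max_tid);

    float ms_1  = time_launch(max_tid, 1,  d_out, max_tid);
    float ms_50 = time_launch(max_tid, 50, d_out, max_tid);
    printf("1 thread per block:   %f ms\n", ms_1);
    printf("50 threads per block: %f ms\n", ms_50);

    // Copy the result back and check it against the placeholder work.
    std::vector<int> h_out(max_tid);
    cudaMemcpy(h_out.data(), d_out, max_tid * sizeof(int),
               cudaMemcpyDeviceToHost);
    int mismatches = 0;
    for (int i = 0; i < max_tid; ++i) {
        if (h_out[i] != 2 * i) ++mismatches;
    }
    printf("%d mismatches\n", mismatches);

    cudaFree(d_out);
    return mismatches == 0 ? 0 : 1;
}
```

Note that with 50 threads per block and max_tid blocks, only the first max_tid threads pass the loop condition; all others exit immediately.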

Starting the kernel as <<<max_tid, 1>>> (max_tid blocks of one thread each) seemed reasonable to me, but performs poorly.

Starting the kernel as <<<max_tid, 50>>> performs much better, but I can’t explain why.

Is there a rule or formula for choosing the number of blocks and threads per block to get the best performance?

Thanks for any hints,

Torsten.