kernel optimization

I had a kernel where I was doing this:

__global__ void myKernel( /* some params */, unsigned N) {
    unsigned k = threadIdx.x % N;   // need k to access an array

    // …
}

The kernel was being called as follows:

myKernel<<<NB, N*NH>>>(/* params */, N);

Since I realized that the % operator takes a lot of time, I’m trying to remove all uses of it from my kernels. Then I realized that if I change the way I call the kernel:

dim3 dimBlock(N, NH);
myKernel<<<NB, dimBlock>>>(/* params */, N);

I could change my kernel to:

__global__ void myKernel( /* some params */, unsigned N) {
    unsigned k = threadIdx.x;   // notice I removed the % N

    // of course more changes are needed (but that’s not the point)
    // …
}

So my newbie question is:
With this change I hope my kernel will run faster, but I don’t know about launch times. Will the time needed to launch my kernel increase, or is this always a win-win situation? Or am I wrong?
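To make it concrete, here is roughly what I mean, simplified and with the real parameters left out (my actual kernel also needs a second index, which I call h here just for illustration):

// Old version: 1D block of N*NH threads, both indices recovered arithmetically.
__global__ void myKernelOld(unsigned N) {
    unsigned k = threadIdx.x % N;   // index into the array
    unsigned h = threadIdx.x / N;   // second index
    // ... real work omitted ...
}

// New version: 2D block of (N, NH) threads, indices read directly.
__global__ void myKernelNew(unsigned N) {
    unsigned k = threadIdx.x;       // no % needed
    unsigned h = threadIdx.y;       // no / needed
    // ... real work omitted ...
}

// Corresponding launches:
//   myKernelOld<<<NB, N * NH>>>(N);
//   dim3 dimBlock(N, NH);
//   myKernelNew<<<NB, dimBlock>>>(N);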

I don’t really understand your question.

It depends on your algorithm.

Keep in mind that N*NH must be <= 512 (the maximum number of threads per block).

Can you post a piece of your code? :)

A single modulus operator per thread will probably not make a measurable difference.

Launching a 2-dimensional block will probably not be any slower than a one-dimensional block, since they are always launched as 3-dimensional blocks even if some of the dimensions are 1.
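For example (using an arbitrary block size of 256 and assuming the kernel takes just N), these two launches configure exactly the same block shape, because the scalar form is shorthand for a dim3 with y and z set to 1:

myKernel<<<NB, 256>>>(N);               // 1D form: really dim3(256, 1, 1)
myKernel<<<NB, dim3(256, 1, 1)>>>(N);   // explicit form, same block on the hardware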

Another consideration: if N is a compile-time constant and a power of two, then % optimizes to a bitwise AND (&), which is fast, and * and / by N turn into << and >> shifts. I usually define the block dimensions as compile-time constants so that I can do these kinds of index calculations without worrying about their computational cost.
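A small sketch of what that looks like (the names and the value 32 are just examples, not your actual dimensions):

#define BLOCK_X 32          // power of two, known at compile time
#define BLOCK_Y 8

__global__ void myKernel2D() {
    // Because BLOCK_X is a power-of-two compile-time constant, the compiler
    // reduces these to cheap bitwise operations:
    unsigned flat = threadIdx.y * BLOCK_X + threadIdx.x;  // * 32  ->  << 5
    unsigned k    = flat % BLOCK_X;                       // % 32  ->  & 31
    unsigned row  = flat / BLOCK_X;                       // / 32  ->  >> 5
    // ...
}

// Launched with a block shape that matches the constants:
//   myKernel2D<<<NB, dim3(BLOCK_X, BLOCK_Y)>>>();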

Thanks, that helped