__global__ void myKernel( /* some params */, unsigned N) {
unsigned k = threadIdx.x % N; // need k to access an array
// …
}
The kernel was being called as follows:
myKernel<<<NB, NNH>>>(/* params */, N);
Since I realized that the % operator takes a lot of time, I’m trying to remove every use of it from my kernels. Then I realized that, if I change the way I call the kernel, I can write it like this:
__global__ void myKernel( /* some params */, unsigned N) {
unsigned k = threadIdx.x; // notice I remove the % N
// of course more changes needed (but that’s not the point)
//…
}
So my newbie question is:
By making this change I hope my kernel will run faster, but I don’t know about launch times. Will the time needed to launch my kernel increase, or is this always a win/win situation? Or am I wrong?
A single modulus operator per thread will probably not make a measurable difference.
Launching a 2-dimensional block will probably not be any slower than a one-dimensional block, since they are always launched as 3-dimensional blocks even if some of the dimensions are 1.
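For what it's worth, here is a minimal sketch of what the two launch styles might look like side by side. The original post does not show the new launch line, so the `dim3 block(N, threadsPerBlock / N)` shape, the `out` buffer, and the helper `sameIndices` are all assumptions made for illustration. It also assumes the block size is a multiple of N, which is what makes `threadIdx.x` in the 2D block equal to `threadIdx.x % N` in the 1D block.

```cuda
#include <cuda_runtime.h>
#include <vector>

// 1D block: the index k is recovered with a modulus, as in the question.
__global__ void kernelWithModulus(unsigned *out, unsigned N) {
    unsigned k = threadIdx.x % N;
    out[blockIdx.x * blockDim.x + threadIdx.x] = k;
}

// 2D block of shape (N, rows): threadIdx.x already lies in [0, N).
__global__ void kernelWith2DBlock(unsigned *out) {
    unsigned k = threadIdx.x;
    unsigned linear = threadIdx.y * blockDim.x + threadIdx.x;
    out[blockIdx.x * blockDim.x * blockDim.y + linear] = k;
}

// Hypothetical host helper: launches both kernels and reports whether
// every thread computed the same k. Returns false on any CUDA error or
// if threadsPerBlock is not a multiple of N.
bool sameIndices(unsigned numBlocks, unsigned threadsPerBlock, unsigned N) {
    if (N == 0 || threadsPerBlock % N != 0) return false;
    size_t count = static_cast<size_t>(numBlocks) * threadsPerBlock;
    size_t bytes = count * sizeof(unsigned);

    unsigned *dA = nullptr, *dB = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dB, bytes) != cudaSuccess) { cudaFree(dA); return false; }

    kernelWithModulus<<<numBlocks, threadsPerBlock>>>(dA, N);
    dim3 block(N, threadsPerBlock / N);  // same thread count, 2D shape
    kernelWith2DBlock<<<numBlocks, block>>>(dB);

    // cudaMemcpy waits for the kernels and surfaces any launch errors.
    std::vector<unsigned> a(count), b(count);
    cudaError_t e1 = cudaMemcpy(a.data(), dA, bytes, cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(b.data(), dB, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    return e1 == cudaSuccess && e2 == cudaSuccess && a == b;
}
```

Calling something like `sameIndices(4, 256, 32)` from `main` should confirm that both kernels write the same values. To compare launch and run times, you would still need to time both versions on your own hardware.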
Another consideration: if N is a compile-time constant and a power of two, then for unsigned operands % optimizes to an & operation, which is fast, and * and / translate into << and >>. I usually define the block dimensions as compile-time constants so that I can do these types of calculations without worrying about the computation cost of evaluating dimensions.
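A rough sketch of the compile-time-constant idea, assuming unsigned indices. The constant `kTile` and the helper names are made up for illustration. The helpers are written out by hand only to show which cheap operation each expression is equivalent to; in practice you would write `%` and `/` and let the compiler do the rewrite.

```cuda
#include <cuda_runtime.h>

// Hypothetical compile-time tile width; must be a power of two.
constexpr unsigned kTile = 32;
constexpr unsigned kTileShift = 5;  // log2(kTile)
static_assert(kTile != 0 && (kTile & (kTile - 1)) == 0,
              "kTile must be a power of two");
static_assert((1u << kTileShift) == kTile, "kTileShift must be log2(kTile)");

// What you write: modulus and division by a compile-time constant.
__host__ __device__ inline unsigned colOf(unsigned i) { return i % kTile; }
__host__ __device__ inline unsigned rowOf(unsigned i) { return i / kTile; }

// The equivalent mask and shift forms for unsigned operands.
__host__ __device__ inline unsigned colOfMask(unsigned i) { return i & (kTile - 1u); }
__host__ __device__ inline unsigned rowOfShift(unsigned i) { return i >> kTileShift; }

// Example kernel using the plain % and / forms.
__global__ void splitIndex(unsigned *cols, unsigned *rows, unsigned count) {
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        cols[i] = colOf(threadIdx.x);
        rows[i] = rowOf(threadIdx.x);
    }
}
```

If N is only known at run time, as in the kernel in the question, the compiler cannot make this substitution; passing the size as a template parameter is one common way to make it a compile-time constant.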