I'm not quite sure I understand how you could do anything useful without using the thread indexes. You would be doing the same work, on the same variables, yielding the same result, BLOCKSIZE times for nothing?
I know that what you said is exactly what happens: every thread does the same work on the same variables. So my question is how to handle the case where a kernel has to compute some data that does not depend on the thread index. Imagine, for example, that you want to load some data from shared memory and compute a constant that every thread will use afterwards.
What I noticed after experimenting with kernel timing is that if I let every thread do the same work on the same variables and produce the same result, the kernel is much slower than putting an if statement at the beginning so that only one thread does the work (though I don't know whether that is good practice):
if (threadIdx.x == 0) { /* one thread does the work and stores the result to shared memory */ }
__syncthreads(); // then all threads read the result from shared memory
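To be concrete, the pattern I mean looks something like this (a minimal sketch; the kernel name, the variable names, and the placeholder calculation are all made up):

```cuda
__global__ void scaleKernel(const float *in, float *out, int n)
{
    __shared__ float scale; // block-wide constant, computed once per block

    if (threadIdx.x == 0) {
        // only thread 0 of each block does the common work
        scale = 1.0f / (float)n; // placeholder for the real calculation
    }
    __syncthreads(); // every thread waits until 'scale' is ready

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * scale; // each thread uses the shared constant
}
```

Note the __syncthreads() barrier: without it, threads other than thread 0 could read the shared variable before it has been written.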
Could this slowdown happen because all threads would eventually write the final result to the same shared memory address, so they all end up waiting on each other while overwriting the same value?