I am writing a kernel that has the following structure:
1. Data are read from global GPU memory and stored in shared memory. Each thread fetches one element.
2. Some calculations are performed which are COMMON to ALL threads, fetching data from constant memory; the results are stored in shared memory.
3. Each thread uses the above common results to perform further processing.
4. Final data are written back to global memory.
Since in step 2 all data are common to all threads, I don't use any thread index. Instead, I declare some variables in shared memory and use them for the calculations.
I would like to ask if step 2 could be a bottleneck for the kernel.
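To make the question concrete, here is a minimal sketch of the structure described above. All names, array sizes, and the actual computations are hypothetical placeholders, not the poster's real code:

```cuda
// Sketch of the four-step kernel structure (placeholder computations).
__constant__ float c_params[16];            // data read in step 2

__global__ void kernel(const float *in, float *out, int n)
{
    __shared__ float s_data[256];           // step 1: one element per thread
    __shared__ float s_common[16];          // step 2: results common to the block

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    if (gid < n)
        s_data[tid] = in[gid];              // step 1: global -> shared
    __syncthreads();

    // Step 2: block-common calculation from constant memory.
    // As posted, every thread repeats this identical work.
    for (int i = 0; i < 16; ++i)
        s_common[i] = c_params[i] * 2.0f;
    __syncthreads();

    // Steps 3 and 4: per-thread processing, then write back to global memory.
    if (gid < n)
        out[gid] = s_data[tid] * s_common[tid % 16];
}
```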
I'm not quite sure I understand how you could do anything useful without using the thread indices. You would be doing the same work, on the same variables, yielding the same result, BLOCKSIZE times for nothing?
I know that what you said is happening: doing the same work, on the same variables. So my question is how one deals with a kernel where some data have to be calculated that do not depend on the thread index. Imagine, for example, that one wants to load some data from shared memory and calculate a constant that each thread will afterwards use.
What I realized after experimenting with some kernel timing is that if I force each thread to do the same work on the same variables and produce the same results, the kernel is much slower than putting an if statement at the beginning so that only one thread does all the work (though I don't know whether that is good practice):
if (threadIdx.x == 0)   // note: == rather than =, and thread 0 by convention
{
    // do work
    // store results to shared memory
}
__syncthreads();        // barrier so the results are visible to the whole block
// threads use results from shared memory
Could this degradation be happening because all threads eventually write the final result to the same shared memory address, so the writes are serialized and every thread has to wait while the same value is overwritten again and again?
Doing threadIdx.x == whatever isn't bad practice; sometimes it needs to be done. If a good portion of the code sits inside that if statement, though, you will lose the parallel aspect.
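As a sketch of how this pattern is usually written (the computation and names here are made up for illustration): the key detail is the __syncthreads() barrier after the if block, so no thread reads the shared result before the designated thread has written it. Note also that when all threads of a warp read the same shared memory address, the hardware broadcasts the value, so there is no bank conflict on the read side:

```cuda
// Hypothetical kernel: one thread computes a block-common value,
// all threads then use it after a barrier.
__global__ void scale_by_common(const float *in, float *out, int n)
{
    __shared__ float s_scale;                // block-common result

    if (threadIdx.x == 0)
        s_scale = 1.0f / (float)blockDim.x;  // placeholder computation

    __syncthreads();                         // result now visible to all threads

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        out[gid] = in[gid] * s_scale;        // broadcast read, no bank conflict
}
```

Only the single-thread section is serialized; the per-element work before and after it stays fully parallel, which is why this is normally faster than having every thread redo the identical computation.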