Let me give a simple example:
I want to do string searching. I have an array of 65536 (256*256) unique values copied to constant GPU memory (searchNums), and I would like to search this array for the value X. I have an unsigned int (isFound) allocated in GPU memory (initialized to -1) to store the index of X if it is found.
So I launch the kernel with grid size 256 and 256 threads in each block.
search<<<256, 256>>>(isFound);
The kernel is simply this:
__global__ void search(unsigned int* isFound)
{
    unsigned int tid = blockIdx.x * 256 + threadIdx.x;
    if (searchNums[tid] == X)
        *isFound = tid;
}
This works fine. What I am interested in is whether it is efficient to return the result this way, even though at most one thread will hit the line “*isFound = tid;”. Will this cause any kind of bottlenecking or bank conflicts?
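In case it helps, here is a self-contained sketch of the setup described above, so the write pattern can be reproduced. Several things in it are assumptions for illustration only: the table contents (3*i + 7), the helper name searchOnGpu, the CUDA_CHECK macro, and passing X as a kernel argument x. The sketch also keeps searchNums in an ordinary global-memory `__device__` array rather than constant memory, because 65536 four-byte values (256 KB) would not fit within the 64 KB constant memory limit.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

constexpr unsigned int kNumValues = 256 * 256;   // 65536 table entries
constexpr unsigned int kNotFound  = 0xFFFFFFFFu; // what -1 becomes in an unsigned int

// Assumption: the table lives in global memory here, not constant memory.
__device__ unsigned int searchNums[kNumValues];

// Hypothetical error-checking helper, not part of the original post.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s\n",                       \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// One thread per table entry. Because the values are unique,
// at most one thread performs the write to *isFound.
__global__ void search(unsigned int x, unsigned int* isFound)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (searchNums[tid] == x)
        *isFound = tid;
}

// Hypothetical host wrapper: fills the table, runs the search,
// and returns the index of x, or kNotFound if x is absent.
unsigned int searchOnGpu(unsigned int x)
{
    // Illustrative unique values: entry i holds 3*i + 7.
    std::vector<unsigned int> host(kNumValues);
    for (unsigned int i = 0; i < kNumValues; ++i)
        host[i] = 3u * i + 7u;
    CUDA_CHECK(cudaMemcpyToSymbol(searchNums, host.data(),
                                  kNumValues * sizeof(unsigned int)));

    // Result slot, initialized to "not found" before the launch.
    unsigned int* dFound = nullptr;
    unsigned int result = kNotFound;
    CUDA_CHECK(cudaMalloc(&dFound, sizeof(unsigned int)));
    CUDA_CHECK(cudaMemcpy(dFound, &result, sizeof(result),
                          cudaMemcpyHostToDevice));

    search<<<256, 256>>>(x, dFound);
    CUDA_CHECK(cudaGetLastError());

    // This blocking copy also waits for the kernel to finish.
    CUDA_CHECK(cudaMemcpy(&result, dFound, sizeof(result),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dFound));
    return result;
}

int main()
{
    // 3*1000 + 7 is stored at index 1000 in the illustrative table.
    unsigned int idx = searchOnGpu(3u * 1000u + 7u);
    if (idx == kNotFound)
        printf("value not found\n");
    else
        printf("found at index %u\n", idx);
    return 0;
}
```

The result is read back with a single 4-byte cudaMemcpy after the kernel, which is the part I am asking about.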
Thanks for the help!