Newbie: Will this work?

Let me give a simple example:

I want to do searching. I have an array of 65536 (256*256) unique values copied to GPU constant memory (searchNums), and I would like to search this array for the value X. I have an unsigned int (isFound) allocated in GPU memory (initialized to -1, i.e. 0xFFFFFFFF) to store the index of X if found.

So I launch the kernel with grid size 256 and 256 threads in each block.

search<<<256, 256>>>(isFound);

The kernel is simply this:

__global__ void search(unsigned int* isFound)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (searchNums[tid] == X)
        *isFound = tid;
}
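For context, here is a minimal host-side sketch of the setup described above. The names searchNums, X, and the 0xFFFFFFFF "not found" sentinel are taken from the post; the dummy data in main() is an assumption just to make the sketch self-contained. Note that 65536 unsigned ints is 256 KB, which exceeds the 64 KB __constant__ memory limit, so this sketch puts the array in ordinary device memory instead:

// Hedged sketch, not the poster's actual code.
#include <cstdio>
#include <cuda_runtime.h>

#define N (256 * 256)
#define X 12345u  // value to search for (assumed compile-time constant here)

// 64K unsigned ints = 256 KB, too large for __constant__ (64 KB limit),
// so use plain device global memory.
__device__ unsigned int searchNums[N];

__global__ void search(unsigned int* isFound)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (searchNums[tid] == X)
        *isFound = tid;  // at most one thread writes, since values are unique
}

int main()
{
    unsigned int host[N];
    for (unsigned int i = 0; i < N; ++i)
        host[i] = i;  // dummy unique data (assumption)

    cudaMemcpyToSymbol(searchNums, host, sizeof(host));

    unsigned int* isFound;
    unsigned int notFound = 0xFFFFFFFFu;  // "-1" as unsigned
    cudaMalloc(&isFound, sizeof(unsigned int));
    cudaMemcpy(isFound, &notFound, sizeof(unsigned int),
               cudaMemcpyHostToDevice);

    search<<<256, 256>>>(isFound);

    unsigned int result;
    cudaMemcpy(&result, isFound, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    printf("index = %u\n", result);  // 0xFFFFFFFF means not found

    cudaFree(isFound);
    return 0;
}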

This works fine. What I am interested in is whether it is efficient to return the result this way, given that at most one thread will hit the line “*isFound = tid;”. Will this cause any kind of bottleneck or bank conflicts?

Thanks for the help!

There is no such thing as too little work in a CUDA thread. There is no context-switching penalty, so typically the best strategy is to do as little work as possible in each thread to leverage all the multiprocessors to their fullest. Having each thread work on a single element of an array is standard practice.

And your method of handling isFound is fine. As you say, only one thread can possibly make the write, so there is no problem with any sort of race condition. I’ve been using this type of write in one of my kernels for a while now.

As long as the values in the array are unique you should be fine. Also, have you tried playing around with the block size? You might be able to get slightly better results, though that’s not guaranteed. Try running the CUDA Visual Profiler if complete optimization is a necessity.
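If the values could ever contain duplicates, one common variation is to replace the plain store with an atomic: atomicMin makes the result deterministic by always keeping the lowest matching index, and it still works with the 0xFFFFFFFF sentinel. A sketch (searchNums and X as above; the fixed block width of 256 matches the launch configuration in the post):

// Hedged sketch: deterministic result even when X appears more than once.
__device__ unsigned int searchNums[256 * 256];

__global__ void searchFirst(unsigned int* isFound)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (searchNums[tid] == 12345u /* X */)
        atomicMin(isFound, tid);  // lowest matching index wins;
                                  // isFound starts at 0xFFFFFFFF
}

With unique values the atomic costs essentially nothing extra, since at most one thread executes it anyway.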