Simple Question: Counting Image Pixels

In_Kyu_Park · March 17, 2007, 11:54pm

Dear Experts

Simple example: I want to count the number of threads.

I did this.

dim3 dimBlock(16, 16, 1);

dim3 dimGrid(Width/dimBlock.x, Height/dimBlock.y, 1);

kernel<<< dimGrid, dimBlock>>>(pSum_g);

Here, pSum_g is an array (its contents are initialized to 0) in global memory returned by cudaMalloc.

In the kernel, I simply did this:

pSum[0]++;

__syncthreads();

I expected pSum_g[0] to end up holding the number of threads, i.e. Width*Height, but it did not increase as I expected (the value was 1).

Hope to hear anything about this.

Thanks.

JaredHoberock · March 18, 2007, 12:04am

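It seems that all your threads read 0 from pSum[0] and write back 1. pSum[0]++ is a separate read, add, and write, and the __syncthreads() call only makes the threads of one block wait for each other after the increment has already happened, so it’s not enough to remove this race condition.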

seibert · March 18, 2007, 1:06am

CUDA does not provide the normal thread synchronization tools that people are used to from most operating systems (mutexes, semaphores, atomic operations, etc.). General code strategies include:

1. Design your algorithm such that read/write communication between threads is not needed.
2. Break your algorithm into phases such that threads do not communicate during a phase, but can use the results computed by other threads in the previous phases. Separate these phases with __syncthreads(). (This is the technique used by the matrix multiplication example. It has a “load” phase followed by a “multiplication” phase.)
3. If you can do most of the work without communication, but need to do some merging at the end, you can use an if statement to force a special thread, like thread 0, to do the final combining:

__syncthreads();

if (threadIdx.x == 0) {

  // merge results from other threads here

}
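Applied to the thread-counting question, the fragment above might be filled in roughly as follows. This is only a sketch under some assumptions: the names countPerBlock, blockCounts and countThreads are made up for illustration, the block size is fixed at 16x16 as in the original launch, width and height are assumed to be multiples of 16, and error checking is left out. Each thread writes to its own shared-memory slot, thread 0 of each block merges them, and the host adds up the per-block results, e.g. countThreads(64, 32).

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_W 16
#define BLOCK_H 16

// Hypothetical kernel: every thread marks its own shared-memory slot,
// so no two threads ever write to the same address.
__global__ void countPerBlock(int *blockCounts)
{
    __shared__ int flags[BLOCK_W * BLOCK_H];

    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    flags[tid] = 1;              // work phase: one private slot per thread

    __syncthreads();             // all slots are written before any is read

    if (tid == 0) {
        // merge phase: a single thread sums this block's slots serially
        int sum = 0;
        for (int i = 0; i < blockDim.x * blockDim.y; ++i)
            sum += flags[i];
        blockCounts[blockIdx.y * gridDim.x + blockIdx.x] = sum;
    }
}

// Hypothetical host helper: launches one thread per pixel and returns
// the total number of threads that ran.
int countThreads(int width, int height)
{
    // Assumption: the image divides evenly into 16x16 blocks.
    assert(width % BLOCK_W == 0 && height % BLOCK_H == 0);

    dim3 dimBlock(BLOCK_W, BLOCK_H, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
    int numBlocks = dimGrid.x * dimGrid.y;

    int *counts_d = NULL;
    cudaMalloc((void **)&counts_d, numBlocks * sizeof(int));

    countPerBlock<<<dimGrid, dimBlock>>>(counts_d);

    // This copy waits for the kernel to finish before reading the results.
    std::vector<int> counts_h(numBlocks);
    cudaMemcpy(counts_h.data(), counts_d, numBlocks * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(counts_d);

    // Final merge across blocks is done on the host.
    int total = 0;
    for (int i = 0; i < numBlocks; ++i)
        total += counts_h[i];
    return total;
}
```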

This is not very efficient, since one thread does all the merging serially, but it is good enough if you have only a small amount of data to merge together. There are more efficient merging techniques, such as a tree-style parallel reduction, which take log(n) time.
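One way the log(n) merging could look is a tree reduction in shared memory, where the number of active threads is halved at each step and the steps are separated by __syncthreads(). The sketch below is an illustration, not code from this thread: countPerBlockTree, countThreadsTree and THREADS_PER_BLOCK are hypothetical names, the block is flattened to 256 threads in one dimension (the same size as 16x16), the block size is assumed to be a power of two, and error checking is omitted.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256    // assumption: must be a power of two

// Hypothetical kernel: each block counts its own threads with a
// tree-style reduction that takes log2(256) = 8 steps.
__global__ void countPerBlockTree(int *blockCounts)
{
    __shared__ int partial[THREADS_PER_BLOCK];

    int tid = threadIdx.x;
    partial[tid] = 1;            // each thread contributes a count of one
    __syncthreads();

    // At every step the lower half of the active range adds in the
    // upper half, so the active range shrinks by a factor of two.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();         // kept outside the if so all threads reach it
    }

    if (tid == 0)
        blockCounts[blockIdx.x] = partial[0];
}

// Hypothetical host helper: returns the total thread count for a launch
// of numBlocks blocks of THREADS_PER_BLOCK threads each.
int countThreadsTree(int numBlocks)
{
    assert(numBlocks > 0);

    int *counts_d = NULL;
    cudaMalloc((void **)&counts_d, numBlocks * sizeof(int));

    countPerBlockTree<<<numBlocks, THREADS_PER_BLOCK>>>(counts_d);

    // This copy waits for the kernel to finish before reading the results.
    std::vector<int> counts_h(numBlocks);
    cudaMemcpy(counts_h.data(), counts_d, numBlocks * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(counts_d);

    // The per-block results are summed on the host.
    int total = 0;
    for (int i = 0; i < numBlocks; ++i)
        total += counts_h[i];
    return total;
}
```

Within a block no two threads write the same element during a step, so the only synchronization needed is the barrier between steps. The per-block results are still combined on the host here; for a large grid that last step could itself be a second reduction pass.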