Simple Question: Counting Image Pixels

Dear Experts

Simple example: I want to count the number of threads.

I did this.

dim3 dimBlock(16, 16, 1);

dim3 dimGrid(Width/dimBlock.x, Height/dimBlock.y, 1);

kernel<<< dimGrid, dimBlock>>>(pSum_g);

Here, pSum_g is an array (its contents are initialized to 0) in global memory returned by cudaMalloc.

In the kernel, I simply did this:



What I expected was for pSum_g to hold the number of threads, i.e. Width*Height, but it did not increase as I expected (the value was 1).

Hope to hear anything about this.


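It seems that all your threads read 0 from pSum[0] and write back 1. Even though you’ve got the __syncthreads() call in there, that’s not enough to fix this race condition: __syncthreads() is only a barrier for the threads within one block, and it does not make the read-increment-write sequence atomic.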

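CUDA does not provide the normal thread synchronization tools that people are used to from most operating systems (mutexes, semaphores, etc.), and atomic operations on global memory are not available on every device (they require compute capability 1.1 or higher). General code strategies include: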

  1. Design your algorithm such that read/write communication between threads is not needed.

  2. Break your algorithm into phases such that threads do not communicate during a phase, but can use the results computed by other threads in the previous phases. Separate these phases with __syncthreads(). (This is the technique used by the matrix multiplication example. It has a “load” phase followed by a “multiplication” phase.)

  3. If you can do most of the work without communication, but need to do some merging at the end, you can use an if statement to force a special thread, like thread 0, to do the final combining:


if (threadIdx.x == 0) {
  // merge results from other threads here
}

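This is not very efficient, because one thread does all the merging while the others sit idle, but it is good enough if you have only a small amount of data to merge together. There are more efficient merging techniques, such as a tree-style parallel reduction, which take on the order of log(n) steps.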
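To make strategies 2 and 3 concrete for the thread-counting question, here is a minimal sketch rather than a definitive implementation. It assumes Width and Height are multiples of 16 (as the original launch code implicitly does) and a block size that is a power of two. The names countThreadsKernel, countThreads, sCount and pPartial_g are made up for this example and are not from the original post. Each block counts its own threads in shared memory with a log(n) tree reduction, thread 0 of each block writes one partial sum, and the host adds up the partial sums, so no two threads ever write to the same location at the same time.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_W 16
#define BLOCK_H 16
#define THREADS_PER_BLOCK (BLOCK_W * BLOCK_H)  // 256; must be a power of two here

// Hypothetical kernel: each block counts its own threads and writes one partial sum.
__global__ void countThreadsKernel(int *pPartial_g)
{
    __shared__ int sCount[THREADS_PER_BLOCK];

    // Linear index of this thread within its block, and of this block within the grid.
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int bid = blockIdx.y * gridDim.x + blockIdx.x;

    // Phase 1: every thread writes only its own slot, so there is no race.
    sCount[tid] = 1;
    __syncthreads();

    // Phase 2: tree reduction; the number of active threads halves each step.
    for (int stride = THREADS_PER_BLOCK / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            sCount[tid] += sCount[tid + stride];
        }
        __syncthreads();  // outside the if, so every thread reaches the barrier
    }

    // Phase 3: one special thread per block publishes the block's result.
    if (tid == 0) {
        pPartial_g[bid] = sCount[0];
    }
}

// Hypothetical host helper: returns the number of threads launched,
// or -1 if a CUDA call fails. Assumes Width and Height are multiples of 16.
int countThreads(int Width, int Height)
{
    dim3 dimBlock(BLOCK_W, BLOCK_H, 1);
    dim3 dimGrid(Width / dimBlock.x, Height / dimBlock.y, 1);
    int numBlocks = dimGrid.x * dimGrid.y;
    if (numBlocks == 0) return 0;  // nothing to launch

    int *pPartial_g = NULL;
    if (cudaMalloc((void **)&pPartial_g, numBlocks * sizeof(int)) != cudaSuccess) {
        return -1;
    }

    countThreadsKernel<<<dimGrid, dimBlock>>>(pPartial_g);

    // cudaMemcpy waits for the kernel to finish before copying.
    std::vector<int> partial(numBlocks);
    cudaError_t err = cudaMemcpy(partial.data(), pPartial_g,
                                 numBlocks * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(pPartial_g);
    if (err != cudaSuccess) return -1;

    // Final merge on the host: one partial sum per block.
    int total = 0;
    for (int i = 0; i < numBlocks; ++i) {
        total += partial[i];
    }
    return total;
}
```

Under those assumptions, calling countThreads(Width, Height) should give Width*Height. The final merge is done on the host because __syncthreads() cannot synchronize threads in different blocks; for a large number of blocks, a second kernel launch could reduce the partial sums instead.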