# How to count elements in a block

Suppose I have elements like this.

Call the matrix A.

``````
0.0  1.0  2.0  3.0  4.0  5.0  6.0  7.0
1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0
2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0
3.0  4.0  5.0  6.0  7.0  8.0  9.0 10.0
4.0  5.0  6.0  7.0  8.0  9.0 10.0 11.0
5.0  6.0  7.0  8.0  9.0 10.0 11.0 12.0
6.0  7.0  8.0  9.0 10.0 11.0 12.0 13.0
7.0  8.0  9.0 10.0 11.0 12.0 13.0 14.0
``````

The matrix is then divided into blocks; each block contains 16 threads (4×4). The fives are marked with brackets below:

``````
 0.0   1.0   2.0   3.0  |  4.0  [5.0]  6.0   7.0
 1.0   2.0   3.0   4.0  | [5.0]  6.0   7.0   8.0
 2.0   3.0   4.0  [5.0] |  6.0   7.0   8.0   9.0
 3.0   4.0  [5.0]  6.0  |  7.0   8.0   9.0  10.0
------------------------+------------------------
 4.0  [5.0]  6.0   7.0  |  8.0   9.0  10.0  11.0
[5.0]  6.0   7.0   8.0  |  9.0  10.0  11.0  12.0
 6.0   7.0   8.0   9.0  | 10.0  11.0  12.0  13.0
 7.0   8.0   9.0  10.0  | 11.0  12.0  13.0  14.0
``````

After that, I would like the GPU to do a count. For instance, counting the number five, my result must be:

Block(0,0) has 2 members

Block(0,1) has 2 members

Block(1,0) has 2 members

Block(1,1) has 0 members

I tried building a 2×2 matrix a (lowercase a) to hold that result.

``````
__global__ void nupspin(double **A, double **a)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (A[x][y] == 5.0) {
        a[blockIdx.x][blockIdx.y] += 1.0;   // unsynchronized read-modify-write across threads
    }
}
``````
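For this decomposition the launch would be a 2×2 grid of 4×4 blocks, roughly like this (device setup of A and a is omitted):

``````
// Sketch of the matching launch: a 2x2 grid of 4x4 blocks covers the
// 8x8 matrix. Device setup of A and a is assumed to happen elsewhere.
dim3 threads(4, 4);
dim3 blocks(2, 2);
nupspin<<<blocks, threads>>>(A, a);
``````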

This is the result from CUDA 4.0 on a GTX 570:

``````
1.00    1.00
1.00    0.00
``````

This is the result from CUDA 4.0 on a GT 240:

``````
0.00    0.00
0.00    0.00
``````

I am thinking about how to avoid the race condition. In the real program I cannot use blocks with power-of-two dimensions.

My questions are:

1. How do I queue (serialize) the threads?

2. Is there any other, more sophisticated way to count the elements while still exploiting the GPU?

I tried the "Programming Guide" and "CUDA by Example"; they usually skip to graphics issues rather than general computing.

Well, you do have a race condition in your code. And I don't know why you're interested in counting on a per-block basis, because blocks and threads are simply a way to map the GPU hardware resources to a programming model.

Regardless, you could use an atomicAdd operation instead of the plain += 1. This effectively queues, or serializes, the conflicting threads within a warp. This is simple but slow.
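A minimal sketch of that approach, assuming one int counter per block in a flattened result array (the kernel name and layout are illustrative, not your exact code; an int counter is used because atomicAdd on double is not supported on your hardware):

``````
// Per-block counting with atomicAdd: conflicting threads are serialized,
// so no increment is lost. Assumes `a` has gridDim.x * gridDim.y slots,
// zeroed before launch.
__global__ void nupspin_atomic(double **A, int *a)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (A[x][y] == 5.0)
        atomicAdd(&a[blockIdx.x * gridDim.y + blockIdx.y], 1);
}
``````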

A better approach is to use a parallel reduction. This is good if you intend to do this for large matrices. Flatten your block into an array and perform a "conditional" parallel reduction: use the standard parallel reduction code that is available, but instead of adding the values, add 1 if element == 5 and 0 otherwise.
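A sketch of that idea for the 4×4 blocks in your example (the kernel name and the flattened per-block result layout are my assumptions):

``````
// "Conditional" parallel reduction within one 4x4 block: each thread
// contributes 1 if its element matches, 0 otherwise, then a tree
// reduction in shared memory sums the 16 flags.
__global__ void nupspin_reduce(double **A, int *a)
{
    __shared__ int count[16];                  // one slot per thread (4x4 block)
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    count[tid] = (A[x][y] == 5.0) ? 1 : 0;     // conditional load: 1 on match
    __syncthreads();

    // Halve the number of active threads each step, adding pairs.
    for (int s = 8; s > 0; s >>= 1) {
        if (tid < s)
            count[tid] += count[tid + s];
        __syncthreads();
    }

    if (tid == 0)                              // thread 0 holds the block total
        a[blockIdx.x * gridDim.y + blockIdx.y] = count[0];
}
``````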

My model is a square lattice for simulating the spins in a molecule.

Unfortunately, in my prior work I used a 100×100 lattice (a layman's number). It cannot be divided into power-of-two blocks for the parallel reduction code, so I prefer counting on a per-block basis.

I promise my later work will use power-of-two dimensions for better performance.

Thank you for the atomicAdd tip; I will try my best.

:]

It works!

There is no need to divide into 4 blocks.

``````
__global__ void nupspin(double **A, int *a)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (A[x][y] == 5.0) {
        atomicAdd(a, 1);   // one global counter; atomics serialize the updates
    }
}
``````
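On the host side only a single zero-initialized counter is needed; a minimal sketch (d_A setup is omitted and the names are illustrative):

``````
// Hypothetical host-side driver for the kernel above: one global int
// counter, zeroed on the device and read back after the launch.
double **d_A = NULL;   // device matrix, assumed to be set up elsewhere
int h_count = 0, *d_count;
cudaMalloc(&d_count, sizeof(int));
cudaMemcpy(d_count, &h_count, sizeof(int), cudaMemcpyHostToDevice);

dim3 threads(4, 4), blocks(2, 2);
nupspin<<<blocks, threads>>>(d_A, d_count);

cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
printf("We have 5, %d elements.\n", h_count);
``````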

Result:

``````
============GPU===============
0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0
2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0
5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0
6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0
7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0
We have 5, 6 elements.
``````

:]

If the number of possible states of the molecules is small, then a histogram approach might help.

There is a histogram example in the CUDA SDK.
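A hedged sketch of the idea, assuming the spin states are small non-negative integers (NUM_BINS, the kernel name, and the flattened layout are illustrative; shared-memory atomics need compute capability 1.2 or higher, which both of your cards have):

``````
// Per-block shared-memory histogram, in the spirit of the SDK sample.
// Assumes the states are non-negative integers below NUM_BINS and that
// blockDim.x >= NUM_BINS so every bin gets zeroed.
#define NUM_BINS 16

__global__ void histo(const double *data, int n, unsigned int *bins)
{
    __shared__ unsigned int local[NUM_BINS];
    int tid = threadIdx.x;
    if (tid < NUM_BINS) local[tid] = 0;        // zero the block-local bins
    __syncthreads();

    // Grid-stride loop over the flattened lattice.
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        atomicAdd(&local[(int)data[i]], 1u);   // cheap shared-memory atomic
    __syncthreads();

    if (tid < NUM_BINS)
        atomicAdd(&bins[tid], local[tid]);     // merge into the global bins
}
``````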

Thank you.

:]