# How to count elements in a block

Suppose I have elements like this.

Call the matrix A.

``````
0.0  1.0  2.0  3.0  4.0  5.0  6.0  7.0
1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0
2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0
3.0  4.0  5.0  6.0  7.0  8.0  9.0 10.0
4.0  5.0  6.0  7.0  8.0  9.0 10.0 11.0
5.0  6.0  7.0  8.0  9.0 10.0 11.0 12.0
6.0  7.0  8.0  9.0 10.0 11.0 12.0 13.0
7.0  8.0  9.0 10.0 11.0 12.0 13.0 14.0
``````

The matrix is then divided into blocks; each block contains 16 threads (4×4). The fives are marked with brackets below:

``````
 0.0   1.0   2.0   3.0  |  4.0  [5.0]  6.0   7.0
 1.0   2.0   3.0   4.0  | [5.0]  6.0   7.0   8.0
 2.0   3.0   4.0  [5.0] |  6.0   7.0   8.0   9.0
 3.0   4.0  [5.0]  6.0  |  7.0   8.0   9.0  10.0
------------------------+------------------------
 4.0  [5.0]  6.0   7.0  |  8.0   9.0  10.0  11.0
[5.0]  6.0   7.0   8.0  |  9.0  10.0  11.0  12.0
 6.0   7.0   8.0   9.0  | 10.0  11.0  12.0  13.0
 7.0   8.0   9.0  10.0  | 11.0  12.0  13.0  14.0
``````

After that, I would like the GPU to do a count. For instance, counting the number five, my result must be:

Block(0,0) has 2 members

Block(0,1) has 2 members

Block(1,0) has 2 members

Block(1,1) has 0 members

I tried building a 2×2 matrix a (lowercase a) to hold that result.

``````
__global__ void nupspin(double **A, double **a)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (A[x][y] == 5.0) {
        a[blockIdx.x][blockIdx.y] += 1.0;   // unsynchronized read-modify-write across threads
    }
}
``````
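For this decomposition the launch would be a 2×2 grid of 4×4 blocks, roughly like this (device setup of A and a is omitted):

``````
// Sketch of the matching launch: a 2x2 grid of 4x4 blocks covers the
// 8x8 matrix. Device setup of A and a is assumed to happen elsewhere.
dim3 threads(4, 4);
dim3 blocks(2, 2);
nupspin<<<blocks, threads>>>(A, a);
``````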

This is the result from CUDA 4.0 on a GTX 570:

``````
1.00    1.00
1.00    0.00
``````

This is the result from CUDA 4.0 on a GT 240:

``````
0.00    0.00
0.00    0.00
``````

I am thinking about how to avoid the race condition. In the real program I cannot use blocks with power-of-two dimensions.

My questions are:

1. How do I queue (serialize) the threads?

2. Is there any other, more sophisticated way to count the elements while still exploiting the GPU?

I tried the "Programming Guide" and "CUDA by Example"; they usually skip to graphics issues rather than general computing.

Well, you do have a race condition in your code. And I don't know why you're interested in counting on a per-block basis, because blocks and threads are simply a way to map the GPU hardware resources to a programming model.

Regardless, you could use an atomicAdd operation instead of the plain += 1. This effectively queues, or serializes, the conflicting threads within a warp. This is simple but slow.
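A minimal sketch of that approach, assuming one int counter per block in a flattened result array (the kernel name and layout are illustrative, not your exact code; an int counter is used because atomicAdd on double is not supported on your hardware):

``````
// Per-block counting with atomicAdd: conflicting threads are serialized,
// so no increment is lost. Assumes `a` has gridDim.x * gridDim.y slots,
// zeroed before launch.
__global__ void nupspin_atomic(double **A, int *a)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (A[x][y] == 5.0)
        atomicAdd(&a[blockIdx.x * gridDim.y + blockIdx.y], 1);
}
``````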

A better approach is to use a parallel reduction. This is good if you intend to do this for large matrices. Flatten your block into an array and perform a "conditional" parallel reduction: use the standard parallel reduction code that is available, but instead of adding the values, add 1 if element == 5 and 0 otherwise.
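A sketch of that idea for the 4×4 blocks in your example (the kernel name and the flattened per-block result layout are my assumptions):

``````
// "Conditional" parallel reduction within one 4x4 block: each thread
// contributes 1 if its element matches, 0 otherwise, then a tree
// reduction in shared memory sums the 16 flags.
__global__ void nupspin_reduce(double **A, int *a)
{
    __shared__ int count[16];                  // one slot per thread (4x4 block)
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    count[tid] = (A[x][y] == 5.0) ? 1 : 0;     // conditional load: 1 on match
    __syncthreads();

    // Halve the number of active threads each step, adding pairs.
    for (int s = 8; s > 0; s >>= 1) {
        if (tid < s)
            count[tid] += count[tid + s];
        __syncthreads();
    }

    if (tid == 0)                              // thread 0 holds the block total
        a[blockIdx.x * gridDim.y + blockIdx.y] = count[0];
}
``````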

My model is a square lattice for simulating the spins in a molecule.

Unfortunately, in my prior work I used a 100×100 lattice (a layman's number). It cannot be divided into power-of-two blocks for the parallel reduction code, so I prefer counting on a per-block basis.

I promise my later work will use power-of-two dimensions for better performance.

Thank you for the atomicAdd tip; I will try my best.

:]

It works!

There is no need to divide into 4 blocks.

``````
__global__ void nupspin(double **A, int *a)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (A[x][y] == 5.0) {
        atomicAdd(a, 1);   // one global counter; atomics serialize the updates
    }
}
``````
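On the host side only a single zero-initialized counter is needed; a minimal sketch (d_A setup is omitted and the names are illustrative):

``````
// Hypothetical host-side driver for the kernel above: one global int
// counter, zeroed on the device and read back after the launch.
double **d_A = NULL;   // device matrix, assumed to be set up elsewhere
int h_count = 0, *d_count;
cudaMalloc(&d_count, sizeof(int));
cudaMemcpy(d_count, &h_count, sizeof(int), cudaMemcpyHostToDevice);

dim3 threads(4, 4), blocks(2, 2);
nupspin<<<blocks, threads>>>(d_A, d_count);

cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
printf("We have 5, %d elements.\n", h_count);
``````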

Result:

``````
============GPU===============
0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0
2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0
5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0
6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0
7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0
We have 5, 6 elements.
``````

:]

If the number of possible states of the molecules is small, then a histogram approach might help.

There is a histogram example in the CUDA SDK.
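A hedged sketch of the idea, assuming the spin states are small non-negative integers (NUM_BINS, the kernel name, and the flattened layout are illustrative; shared-memory atomics need compute capability 1.2 or higher, which both of your cards have):

``````
// Per-block shared-memory histogram, in the spirit of the SDK sample.
// Assumes the states are non-negative integers below NUM_BINS and that
// blockDim.x >= NUM_BINS so every bin gets zeroed.
#define NUM_BINS 16

__global__ void histo(const double *data, int n, unsigned int *bins)
{
    __shared__ unsigned int local[NUM_BINS];
    int tid = threadIdx.x;
    if (tid < NUM_BINS) local[tid] = 0;        // zero the block-local bins
    __syncthreads();

    // Grid-stride loop over the flattened lattice.
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        atomicAdd(&local[(int)data[i]], 1u);   // cheap shared-memory atomic
    __syncthreads();

    if (tid < NUM_BINS)
        atomicAdd(&bins[tid], local[tid]);     // merge into the global bins
}
``````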

Thank you.

:]