# How to count elements in a block

Suppose I have elements like this.

Call the matrix A.

``````
0.0  1.0  2.0  3.0  4.0  5.0  6.0  7.0
1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0
2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0
3.0  4.0  5.0  6.0  7.0  8.0  9.0 10.0
4.0  5.0  6.0  7.0  8.0  9.0 10.0 11.0
5.0  6.0  7.0  8.0  9.0 10.0 11.0 12.0
6.0  7.0  8.0  9.0 10.0 11.0 12.0 13.0
7.0  8.0  9.0 10.0 11.0 12.0 13.0 14.0
``````

The matrix is then divided into blocks; each block contains 16 threads (4×4). The fives are marked with brackets below:

``````
 0.0   1.0   2.0   3.0  |  4.0  [5.0]  6.0   7.0
 1.0   2.0   3.0   4.0  | [5.0]  6.0   7.0   8.0
 2.0   3.0   4.0  [5.0] |  6.0   7.0   8.0   9.0
 3.0   4.0  [5.0]  6.0  |  7.0   8.0   9.0  10.0
------------------------+------------------------
 4.0  [5.0]  6.0   7.0  |  8.0   9.0  10.0  11.0
[5.0]  6.0   7.0   8.0  |  9.0  10.0  11.0  12.0
 6.0   7.0   8.0   9.0  | 10.0  11.0  12.0  13.0
 7.0   8.0   9.0  10.0  | 11.0  12.0  13.0  14.0
``````

After that, I would like the GPU to do a count. For instance, counting the number five, my result must be:

Block(0,0) has 2 members

Block(0,1) has 2 members

Block(1,0) has 2 members

Block(1,1) has 0 members

I tried building a 2×2 matrix a (lowercase a) to hold that result.

``````
__global__ void nupspin(double **A, double **a)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (A[x][y] == 5.0) {
        a[blockIdx.x][blockIdx.y] += 1.0;   // unsynchronized read-modify-write across threads
    }
}
``````
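For this decomposition the launch would be a 2×2 grid of 4×4 blocks, roughly like this (device setup of A and a is omitted):

``````
// Sketch of the matching launch: a 2x2 grid of 4x4 blocks covers the
// 8x8 matrix. Device setup of A and a is assumed to happen elsewhere.
dim3 threads(4, 4);
dim3 blocks(2, 2);
nupspin<<<blocks, threads>>>(A, a);
``````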

This is the result from CUDA 4.0 on a GTX 570:

``````
1.00    1.00
1.00    0.00
``````

This is the result from CUDA 4.0 on a GT 240:

``````
0.00    0.00
0.00    0.00
``````

I am thinking about how to avoid the race condition. In the real program I cannot use blocks with power-of-two dimensions.

My questions are:

1. How do I queue (serialize) the threads?

2. Is there any other, more sophisticated way to count the elements while still exploiting the GPU?

I tried the "Programming Guide" and "CUDA by Example"; they usually skip to graphics issues rather than general computing.

Well, you do have a race condition in your code. And I don't know why you're interested in counting on a per-block basis, because blocks and threads are simply a way to map the GPU hardware resources to a programming model.

Regardless, you could use an atomicAdd operation instead of the plain += 1. This effectively queues, or serializes, the conflicting threads within a warp. This is simple but slow.
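A minimal sketch of that approach, assuming one int counter per block in a flattened result array (the kernel name and layout are illustrative, not your exact code; an int counter is used because atomicAdd on double is not supported on your hardware):

``````
// Per-block counting with atomicAdd: conflicting threads are serialized,
// so no increment is lost. Assumes `a` has gridDim.x * gridDim.y slots,
// zeroed before launch.
__global__ void nupspin_atomic(double **A, int *a)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (A[x][y] == 5.0)
        atomicAdd(&a[blockIdx.x * gridDim.y + blockIdx.y], 1);
}
``````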

A better approach is to use a parallel reduction. This is good if you intend to do this for large matrices. Flatten your block into an array and perform a "conditional" parallel reduction: use the standard parallel reduction code that is available, but instead of adding the values, add 1 if element == 5 and 0 otherwise.
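A sketch of that idea for the 4×4 blocks in your example (the kernel name and the flattened per-block result layout are my assumptions):

``````
// "Conditional" parallel reduction within one 4x4 block: each thread
// contributes 1 if its element matches, 0 otherwise, then a tree
// reduction in shared memory sums the 16 flags.
__global__ void nupspin_reduce(double **A, int *a)
{
    __shared__ int count[16];                  // one slot per thread (4x4 block)
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    count[tid] = (A[x][y] == 5.0) ? 1 : 0;     // conditional load: 1 on match
    __syncthreads();

    // Halve the number of active threads each step, adding pairs.
    for (int s = 8; s > 0; s >>= 1) {
        if (tid < s)
            count[tid] += count[tid + s];
        __syncthreads();
    }

    if (tid == 0)                              // thread 0 holds the block total
        a[blockIdx.x * gridDim.y + blockIdx.y] = count[0];
}
``````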

My model is a square lattice for simulating the spins in a molecule.

Unfortunately, in my prior work I used a 100×100 lattice (a layman's number). It cannot be divided into power-of-two blocks for the parallel reduction code, so I prefer counting on a per-block basis.

I promise my later work will use power-of-two dimensions for better performance.

Thank you for the atomicAdd tip; I will try my best.

:]

It works!

There is no need to divide into 4 blocks.

``````
__global__ void nupspin(double **A, int *a)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (A[x][y] == 5.0) {
        atomicAdd(a, 1);   // one global counter; atomics serialize the updates
    }
}
``````
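On the host side only a single zero-initialized counter is needed; a minimal sketch (d_A setup is omitted and the names are illustrative):

``````
// Hypothetical host-side driver for the kernel above: one global int
// counter, zeroed on the device and read back after the launch.
double **d_A = NULL;   // device matrix, assumed to be set up elsewhere
int h_count = 0, *d_count;
cudaMalloc(&d_count, sizeof(int));
cudaMemcpy(d_count, &h_count, sizeof(int), cudaMemcpyHostToDevice);

dim3 threads(4, 4), blocks(2, 2);
nupspin<<<blocks, threads>>>(d_A, d_count);

cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
printf("We have 5, %d elements.\n", h_count);
``````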

Result:

``````
============GPU===============
0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0
2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0
5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0
6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0
7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0
We have 5, 6 elements.
``````

:]

If the number of possible states of the molecules is small, then a histogram approach might help.

There is a histogram example in the CUDA SDK.
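A hedged sketch of the idea, assuming the spin states are small non-negative integers (NUM_BINS, the kernel name, and the flattened layout are illustrative; shared-memory atomics need compute capability 1.2 or higher, which both of your cards have):

``````
// Per-block shared-memory histogram, in the spirit of the SDK sample.
// Assumes the states are non-negative integers below NUM_BINS and that
// blockDim.x >= NUM_BINS so every bin gets zeroed.
#define NUM_BINS 16

__global__ void histo(const double *data, int n, unsigned int *bins)
{
    __shared__ unsigned int local[NUM_BINS];
    int tid = threadIdx.x;
    if (tid < NUM_BINS) local[tid] = 0;        // zero the block-local bins
    __syncthreads();

    // Grid-stride loop over the flattened lattice.
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        atomicAdd(&local[(int)data[i]], 1u);   // cheap shared-memory atomic
    __syncthreads();

    if (tid < NUM_BINS)
        atomicAdd(&bins[tid], local[tid]);     // merge into the global bins
}
``````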

Thank you.

:]