Suppose I have elements like this; call the matrix A.
0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0
2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0
5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0
6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0
7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0
It is then divided into blocks; each block contains 16 threads (a 4x4 tile), so there is one thread per element. The elements equal to 5.0 are marked in bold:
0.0 1.0 2.0 3.0 4.0 <b>5.0</b> 6.0 7.0
1.0 2.0 3.0 4.0 <b>5.0</b> 6.0 7.0 8.0
2.0 3.0 4.0 <b>5.0</b> 6.0 7.0 8.0 9.0
3.0 4.0 <b>5.0</b> 6.0 7.0 8.0 9.0 10.0
4.0 <b>5.0</b> 6.0 7.0 8.0 9.0 10.0 11.0
<b>5.0</b> 6.0 7.0 8.0 9.0 10.0 11.0 12.0
6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0
7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0
After that, I would like my GPU to do the counting. For instance,
count the number of fives. The result should be:
Block(0,0) has 2 members
Block(0,1) has 2 members
Block(1,0) has 2 members
Block(1,1) has 0 members
I tried building a 2x2 matrix a (lowercase a) to hold that result.
    __global__ void nupspin(double **A, double **a)
    {
        int x = threadIdx.x + blockIdx.x * blockDim.x;
        int y = threadIdx.y + blockIdx.y * blockDim.y;
        if (A[x][y] == 5.0) {
            a[blockIdx.x][blockIdx.y] += 1.0;
            __syncthreads();
        }
    }
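For completeness, the launch configuration I am describing would look like this (a sketch only; it assumes `A` and `a` are already allocated as device pointers):

```cuda
// Launch configuration matching the 8x8 matrix split into 4x4 tiles:
dim3 block(4, 4);   // 16 threads per block, one per element
dim3 grid(2, 2);    // 2x2 grid of blocks covers the 8x8 matrix
nupspin<<<grid, block>>>(A, a);
```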
This is the result from CUDA 4.0 on a GTX 570:
1.00 1.00
1.00 0.00
and this is the result from CUDA 4.0 on a GT 240:
0.00 0.00
0.00 0.00
I am thinking about how to avoid the race condition.
In the real program I cannot use blocks with power-of-two dimensions.
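To show what I mean by avoiding the race, here is a sketch of the direction I have considered (untested, and it makes assumptions: arrays flattened to 1-D instead of `double **`, and the per-block counts stored as `int` rather than `double`, since `atomicAdd` on `double` is not supported on these cards):

```cuda
// Sketch: same indexing as my kernel above, but the per-block counter
// is a flattened int array of size gridDim.x * gridDim.y, updated with
// atomicAdd so concurrent increments within a block are not lost.
__global__ void nupspin_atomic(const double *A, int *count, int width)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (A[y * width + x] == 5.0)
        atomicAdd(&count[blockIdx.y * gridDim.x + blockIdx.x], 1);
}
```

I am unsure whether this is idiomatic or whether the atomics serialize too much.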
My questions are:
-
How do I queue (serialize) the threads?
-
Or, if there are other, more sophisticated ways to count the elements while exploiting the GPU,
please give me a hint.
I have tried the "Programming Guide" and "CUDA by Example". They tend to skip to graphics
issues rather than general computing.
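For the second question, one pattern I have read about but not tested is a per-block shared-memory reduction: each thread writes a 0/1 flag to shared memory, the block reduces the flags in a tree, and one thread writes the block's total, so no atomics are needed. A sketch under my assumptions (flattened 1-D arrays, and a power-of-two block size, which I know my real program cannot always use):

```cuda
__global__ void nupspin_reduce(const double *A, int *count, int width)
{
    // 16 threads per 4x4 block, one flag per thread.
    __shared__ int flag[16];
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int t = threadIdx.y * blockDim.x + threadIdx.x;   // 0..15 within the block

    flag[t] = (A[y * width + x] == 5.0) ? 1 : 0;
    __syncthreads();

    // Tree reduction over the 16 flags (relies on the power-of-two size).
    for (int s = 8; s > 0; s >>= 1) {
        if (t < s)
            flag[t] += flag[t + s];
        __syncthreads();
    }

    // Exactly one thread per block writes the result, so there is no race.
    if (t == 0)
        count[blockIdx.y * gridDim.x + blockIdx.x] = flag[0];
}
```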