Hi Everyone,
I need to do a quick comparison of two large arrays, basically incrementing a counter for each mismatch.
So far my logic runs as follows: I allocate two arrays of the same size in global memory on the device, plus a third array with one element per block.
I then copy the contents of the two input arrays over to the device.
I also have one shared memory variable.
In the kernel, I compare the arrays and increment the block's shared memory variable if the elements match.
Lastly, I copy the shared value into the per-block global array, so each block now has its own counter.
Afterwards, on the host, I copy these values back and add them together.
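To make the intended result concrete, here is a minimal CPU-only sketch of the same scheme in plain C++: one match counter per block of elements, then a final sum over the block counters. The function names countMatchesPerBlock and sumCounts are hypothetical and only used for illustration. It assumes the array length is a multiple of the block size, as in my test case.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: out[block] holds the number of positions in
// that block where a and b are equal. Assumes a.size() is a multiple of
// blockSize; any remainder is ignored, like the integer division on the host.
std::vector<int> countMatchesPerBlock(const std::vector<int>& a,
                                      const std::vector<int>& b,
                                      int blockSize)
{
    assert(a.size() == b.size());
    assert(blockSize > 0);
    const int totalBlocks = static_cast<int>(a.size()) / blockSize;
    std::vector<int> out(totalBlocks, 0);
    for (int block = 0; block < totalBlocks; ++block) {
        for (int t = 0; t < blockSize; ++t) {
            const std::size_t i = static_cast<std::size_t>(block) * blockSize + t;
            if (a[i] == b[i]) ++out[block];   // same test as in the kernel
        }
    }
    return out;
}

// Hypothetical helper: the final host-side reduction over the block counters.
int sumCounts(const std::vector<int>& counts)
{
    int sum = 0;
    for (int c : counts) sum += c;
    return sum;
}
```

With two identical arrays, every block counter should equal the block size and the total should equal the array length, which is what I expect from the GPU version as well.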
The implementation, however, does not seem to work, which brings me to my questions.
1) Is it possible to have a single shared int value and increment it as needed?
The sum returned is always wrong, except when compiling in emu mode.
2) Is there some special function to call to ensure that the sum is updated correctly, or
does CUDA automatically resolve the read-write problem?
3) How does one initialize the shared int value?
4) Why does the code work with emu=1 but not normally?
5) Is this the best approach for array comparison with CUDA?
Here is the implementation code:
#include <stdio.h>
#include <cuda.h>
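__global__ void vecCompare(int *a, int *b, int *out)
{
__shared__ volatile int sum; // one counter per block
sum = 0; // initialize value (written by every thread in the block)
__syncthreads();
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (a[i] == b[i]) sum++; // every matching thread updates the same shared int
__syncthreads();
out[blockIdx.x] = sum; // store this block's count in global memory
}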
int main()
{
int totalSize=307712;
int blockSize=512;
int totalBlocks=totalSize/blockSize;
dim3 dimBlock(blockSize);
dim3 dimGrid(totalBlocks);
int * ahost= new int [totalSize];
int * bhost= new int [totalSize];
int * adev;
int * bdev;
int* outdev;
int* outhost= new int [totalBlocks];
for (int i=0; i<totalSize ;i++) {ahost[i]=255; bhost[i]=255;}
cudaMalloc( (void **) &adev, sizeof(int)*totalSize );
cudaMalloc( (void **) &bdev, sizeof(int)*totalSize );
cudaMalloc( (void **) &outdev, sizeof(int)*totalBlocks);
cudaMemcpy(adev, ahost, sizeof(int)*totalSize, cudaMemcpyHostToDevice);
cudaMemcpy(bdev, bhost, sizeof(int)*totalSize, cudaMemcpyHostToDevice);
vecCompare<<<dimGrid, dimBlock>>>(adev,bdev,outdev);
cudaThreadSynchronize();
cudaMemcpy(outhost, outdev, sizeof(int)*totalBlocks, cudaMemcpyDeviceToHost);
int sum=0;
for (int i=0; i<totalBlocks; i++)
{ sum=sum+outhost[i];
printf("%i: %i : %i \n",i,outhost[i],sum);
}
printf("%i \n",sum);
cudaFree(adev);
cudaFree(bdev);
cudaFree(outdev);
delete[] ahost; // arrays allocated with new[] must be released with delete[]
delete[] bhost;
delete[] outhost;
}