Array Comparison

Hi everyone,
I need to do a quick comparison of two large arrays, basically increasing a counter with each mismatch.

So far my logic runs as follows: I allocate two arrays of the same size in global memory on the device, plus another array with one element per block.
I then copy the contents of the input arrays over to the device.
I also have a single shared memory variable.

In the kernel,
I compare the arrays and increment the block’s shared memory variable when the elements match.
Lastly, I copy the shared value into the global per-block array, so each block ends up with its own counter.

Afterwards, on the host, I copy these per-block values back and add them together.

The implementation, however, does not seem to work, which brings me to my questions:
1) Is it possible to have a single shared int value and have every thread in the block increment it?
The sum returned is always wrong, except when compiling in emu mode.
2) Is there some special function to call to ensure the sum is accumulated correctly, or
does CUDA automatically resolve the read-write conflict?
3) How does one initialize the shared int value?
4) Why does the code work with emu=1 but not normally?
5) Is this the best approach for array comparison with CUDA?

Here is the implementation code:
#include <stdio.h>
#include <cuda.h>

__global__ void vecCompare(int *a, int *b, int *out)
{
    volatile __shared__ int sum;
    sum = 0; // initialize value
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (a[i] == b[i]) sum++;
    out[blockIdx.x] = sum;
}

int main()
{
    int totalSize = 307712;
    int blockSize = 512;
    int totalBlocks = totalSize / blockSize;
    dim3 dimBlock(blockSize);
    dim3 dimGrid(totalBlocks);

    int *ahost = new int[totalSize];
    int *bhost = new int[totalSize];
    int *adev;
    int *bdev;
    int *outdev;
    int *outhost = new int[totalBlocks];

    for (int i = 0; i < totalSize; i++) { ahost[i] = 255; bhost[i] = 255; }

    cudaMalloc((void **) &adev, sizeof(int) * totalSize);
    cudaMalloc((void **) &bdev, sizeof(int) * totalSize);
    cudaMalloc((void **) &outdev, sizeof(int) * totalBlocks);

    cudaMemcpy(adev, ahost, sizeof(int) * totalSize, cudaMemcpyHostToDevice);
    cudaMemcpy(bdev, bhost, sizeof(int) * totalSize, cudaMemcpyHostToDevice);

    vecCompare<<<dimGrid, dimBlock>>>(adev, bdev, outdev);

    cudaMemcpy(outhost, outdev, sizeof(int) * totalBlocks, cudaMemcpyDeviceToHost);

    int sum = 0;
    for (int i = 0; i < totalBlocks; i++) {
        sum = sum + outhost[i];
        printf("%i: %i : %i \n", i, outhost[i], sum);
    }
    printf("%i \n", sum);

    cudaFree(adev);
    cudaFree(bdev);
    cudaFree(outdev);
    delete[] ahost;
    delete[] bhost;
    delete[] outhost;
}

  1. Yes, it is possible.

  2. Yes: check out the atomic functions in the Programming Guide. In your case you could try something like this:

atomicAdd( &sum , 1);

however, you need at least Compute Capability 1.2, otherwise your hardware won’t be able to handle atomic operations on shared memory.

  3. You cannot assign it where it is declared, but you can do so immediately afterwards. I see you already did that in your code :)

  4. Since sum++ is not atomic, it works as follows:

  • Load sum into a register X. Each thread holds a separate register.

  • Increment X.

  • Store X back to sum.

Now, if you have 32 threads performing those operations truly in parallel, each one increments a separate copy, and your sum is incorrect.

In deviceemu the threads run one at a time, so you don’t encounter that problem.

  5. No, it is not. You should try to avoid atomic instructions. Of course sometimes you need them, but not in this case. Consider this algorithm (SIZE is the size of your block, known at compile time):
__shared__ int isTheSame[SIZE];

int i = blockIdx.x * blockDim.x + threadIdx.x;
isTheSame[threadIdx.x] = (a[i] == b[i]) ? 1 : 0;
__syncthreads();

/* now perform reduction */

for (int r = SIZE/2; r > 0; r /= 2) {
    if (threadIdx.x < r)
        isTheSame[threadIdx.x] += isTheSame[threadIdx.x + r];
    __syncthreads();
}

if (threadIdx.x == 0)
    out[blockIdx.x] = isTheSame[0];
It is still not the best you can get, but it should give you a general idea of how to approach this and similar problems.

On Compute Capability 1.2+ devices you could use shared memory atomics. But this is not the best approach (see below).


if (threadIdx.x == 0) // (assuming a 1D block)
    sum = 0;
__syncthreads();      // make sure every thread sees the initialized value

Emu mode gives the correct answer because the threads in a block run serially. On the device, the threads in a block run in parallel, so you have a race condition.

No. Using atomic operations with all threads trying to increment the same variable at once will basically turn your parallel threads into serial operations. The best way would be to perform a parallel reduction. See the reduction sample in the SDK, and in particular read the whitepaper that comes with it, as it explains the basis of the algorithm. For your application, you would create a shared array with each value equal to 1 or 0 depending on the comparison, and then perform a sum reduction on that.

Thanks for the swift replies.
I’ll implement this tomorrow and post the results.

Rather than keep a running sum in a shared variable, I would recommend having each thread loop over multiple elements and keep a running sum in a register. Then just before the threads finish, they can coordinate with each other using shared memory (using say an intra-block reduction sum), and blocks can coordinate using global memory like you described. Rather than 6000 blocks of 512, perhaps use 60 blocks of 128, with each thread comparing 400 elements. Experiment with different combinations.

Registers are faster than shared memory and never have bank conflicts and never have contention issues. Though in your case it might not make much difference in performance since you are going to be I/O bound transferring to global memory.