noob question about ATOMIC operations...

Is there a way to create a customized atomic operation in CUDA?

I am asking because I have a problem where I need to check this kernel code:

if (img[index] != clone[index])
atomicExch(cmp, index);

The main problem is that img, clone and cmp are all in device memory. HOWEVER, cmp is a SINGLE memory location, so every thread can see it, almost like a shared variable. With the atomicExch I guarantee that once some thread triggers it, no other thread can interrupt that write, so far so good. The remaining problem is the if, which can trigger race conditions… I would like both the if and the atomicExch to be executed atomically. Does anyone have an idea how to do that, or whether it is even possible?

If you are trying to do what I think you are, then how about making a flag in shared memory:

if (img[index] != clone[index])
    set flag

__syncthreads()
if (flag is set)

NB it is OK for several or even all threads in the block to set the flag, as the hardware will serialize the writes.
E.g. if you set the flag to tid, its final value will be one of the tids, but which one is undefined.
(NB tid itself is not a good value to use, because one thread has a tid of zero, which looks the same as the flag never being set.)
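
Something like this is what I mean (just a rough sketch, untested; the kernel name, the +1 trick and the one-write-per-block at the end are my own choices):

__global__ void compare_blocks(const unsigned char *img, const unsigned char *clone,
                               unsigned int *cmp, int N)
{
	// one flag per block, visible to every thread in the block
	__shared__ unsigned int flag;

	unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;

	if (threadIdx.x == 0)
		flag = 0;                  // one thread clears the flag
	__syncthreads();                   // everyone sees the cleared flag

	if (index < N && img[index] != clone[index])
		flag = index + 1;          // +1 so that index 0 still reads as "set"

	__syncthreads();                   // wait until all threads had a chance to set it

	if (threadIdx.x == 0 && flag != 0)
		atomicExch(cmp, flag - 1); // only one global write per block
}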

Hope that this helps

I wrote a long reply here about avoiding such complex atomic operations, and how to use an inefficient lock token as a last-resort answer to rare atomic issues anyway.

But I replaced that now because as I was writing up the code, I noticed that your example doesn’t have a problem that atomics or locks would solve anyway.
Since you don’t use the value stored in cmp, the atomicExch is effectively just an assignment.
Your test condition does not test cmp itself, so there’s no need to lock or exchange anything.
So just write “if (img[index] != clone[index]) *cmp=index;”

Now the final value of *cmp is subject to a race condition… if multiple threads have set it, ONE of them will have succeeded at the end, but it is undefined which one was the winner.
If you really need a consistent winner, then just use atomicMin(cmp, index) or atomicMax(cmp, index), and at the end of processing, cmp will hold the smallest or largest index respectively.
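
For example, a minimal sketch of the atomicMin variant (assuming cmp is a single unsigned int that the host initializes to 0xFFFFFFFF, e.g. with cudaMemset(cmp, 0xFF, sizeof(unsigned int)), so an untouched cmp is distinguishable from index 0):

__global__ void first_difference(const unsigned char *img, const unsigned char *clone,
                                 unsigned int *cmp, int N)
{
	unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;

	// every differing thread competes, but the winner is deterministic:
	// at the end *cmp holds the smallest differing index (or 0xFFFFFFFF if none differ)
	if (index < N && img[index] != clone[index])
		atomicMin(cmp, index);
}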

It seems odd, but these are the two options I am using:

kernel code here...
__global__ void comparacao_paralela(unsigned char *img, unsigned char *clone, unsigned int *cmp, int N, int width, int widthStep, int nChannels)
{
	unsigned int indice, index;
.
.
.
#ifdef no_race
if (img[indice] != clone[indice])
	atomicExch(&cmp[indice], indice);
#endif
		
#ifdef race
if (img[indice] != clone[indice])
	atomicExch(&cmp[0], 1); // I don't care about the indice value, I just need cmp[0] to be != 0
#endif
.
.
.

end of kernel code

host code:
.
.
.
cudaMemset(CUDAcmp, 0, sizeof(unsigned int) * N);
comparacao_paralelaCPP(nBlocks, blockSize, CUDAClone_data, CUDAClone2_data, CUDAcmp, N, width, widthStep, nChannels);  <== the kernel call
cudaMemcpy(j, CUDAcmp, sizeof(unsigned int) * N, cudaMemcpyDeviceToHost);
#ifdef no_race
		for(k=0 ; k < N ; k++)
			if (j[k] != 0)
			{
				printf("-%d- ",j[k]);
				break;  //breaks this _for_
			}

		if (k == N)
		{
			printf("finalizing!!!!!! \n");
			break;
		}
		else
			printf("still run!!!!!! \n");
#endif

#ifdef race
		if (j[0] != 0)
			printf("-%d- ",j[0]);
		else
		{
			printf("finalizing \n",j[0]);
			break;
		}
#endif

If I set the RACE define, the code SOMETIMES runs and SOMETIMES enters an infinite loop.
If I set the NO_RACE define, the code ALWAYS runs!

I REALLY don't know how the race is crushing me here :(