memory access problem

Hi, :rolleyes:

I have a problem with a kernel function. I have 1024 threads which have to increase elements of a matrix Cmat.

There’s my kernel (it calculates a cooccurrence matrix) :

[codebox]__global__ void kernel_cooc(unsigned char* Image, int M, int N, float* Cmat, unsigned char nbNiveaux, int Tx, int Ty, float increment)
{
    // for a 32*32 image: 2 * 2 blocks of 16 * 16 threads
    unsigned int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    unsigned int j = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;

    if ((i < (M - Ty)) && (j < (N - Tx)))
    {
        unsigned int index_a = i + __umul24(j, M);
        unsigned int index_b = i + Ty + __umul24((j + Tx), M);
        unsigned int index_cmat = Image[index_a] - 1 + __umul24(Image[index_b] - 1, nbNiveaux);

        Cmat[index_cmat] += increment;  // several threads can hit the same bin here
    }
}[/codebox]




The problem is that several threads may need to access the same element of the matrix Cmat. When I test it, elements of Cmat are only incremented by 1 or 2 when they should be incremented by 10, for example. :blink:

How can I manage access to my matrix?


You should probably use atomics like atomicInc if multiple threads are accessing the same location in memory.
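For what it's worth, here is a minimal sketch of that approach, assuming Cmat is switched to unsigned int so atomicAdd can be used (32-bit integer atomics on global memory need compute capability 1.1+); the indexing follows the kernel above, and kernel_cooc_atomic is just a name I made up:

```cuda
// Sketch: same indexing as kernel_cooc, but Cmat as unsigned int so
// concurrent increments can go through atomicAdd instead of a plain +=.
__global__ void kernel_cooc_atomic(unsigned char* Image, int M, int N,
                                   unsigned int* Cmat, unsigned char nbNiveaux,
                                   int Tx, int Ty)
{
    unsigned int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    unsigned int j = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;

    if ((i < (M - Ty)) && (j < (N - Tx)))
    {
        unsigned int index_a = i + __umul24(j, M);
        unsigned int index_b = i + Ty + __umul24((j + Tx), M);
        unsigned int index_cmat = Image[index_a] - 1
                                + __umul24(Image[index_b] - 1, nbNiveaux);

        atomicAdd(&Cmat[index_cmat], 1u);  // safe under concurrent access
    }
}
```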


Of course the problem here is that this is a floating-point array, and there is no floating-point atomic add function.

You’re right, I thought that atomic add was also available for fp32, but it looks like fp32 support is limited to atomicExch.

Thanks for the correction. (I usually stay as far away as possible from using atomics)
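If you really need the float add, it can be emulated with a compare-and-swap retry loop on hardware that has 32-bit atomicCAS on global memory (compute capability 1.1+). A hedged sketch; atomicAddFloat is just a name I picked:

```cuda
// Sketch: emulate an atomic add on float by reinterpreting the word as an
// int and retrying a compare-and-swap until no other thread intervened.
__device__ float atomicAddFloat(float* address, float val)
{
    int* address_as_int = (int*)address;
    int old = *address_as_int;
    int assumed;
    do {
        assumed = old;
        // Try to swap in (old value + val); if another thread changed the
        // word in the meantime, atomicCAS fails and we retry with the new value.
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);  // value before this thread's add
}
```

Be aware that under heavy contention this loop spins, so it only makes the correctness problem go away, not the performance one.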


Thanks for pointing out the atomic functions, which I didn’t know about.

So is there no other way to avoid memory access conflicts?
I could work with a matrix Cmat of int type and convert it to float afterwards, but that risks costing a lot of time …
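The conversion itself should be cheap compared to the atomics: one trivial pass can scale the integer counts into the float matrix. A sketch, where CmatInt and counts_to_float are assumed names and size would be nbNiveaux * nbNiveaux:

```cuda
// Sketch: convert integer co-occurrence counts to weighted floats in one
// pass. Each thread handles one bin, so there is no contention at all.
__global__ void counts_to_float(const unsigned int* CmatInt, float* Cmat,
                                float increment, int size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size)
        Cmat[idx] = CmatInt[idx] * increment;  // each count weighted once
}
```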

You really don’t want to use global memory atomics unless you absolutely have to. They have an enormous impact on performance, effectively serializing your kernel execution and global memory access.

I would suggest rethinking the algorithm, if you can. The access patterns it seems to use are going to perform very poorly when reading from global memory.
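One common way to rethink it is a privatized histogram: each block accumulates its own copy of the co-occurrence matrix in shared memory and merges it into global memory once at the end, so most of the contention stays on-chip. A sketch, assuming nbNiveaux*nbNiveaux bins fit in shared memory (e.g. 16 gray levels gives 256 bins) and shared-memory atomics are available (compute capability 1.2+); kernel_cooc_shared is an invented name:

```cuda
// Sketch: per-block shared-memory histogram, merged into the global
// co-occurrence matrix with one atomicAdd per non-empty bin per block.
__global__ void kernel_cooc_shared(unsigned char* Image, int M, int N,
                                   unsigned int* Cmat, int nbNiveaux,
                                   int Tx, int Ty)
{
    extern __shared__ unsigned int s_hist[];  // nbNiveaux*nbNiveaux entries
    int bins = nbNiveaux * nbNiveaux;
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int nthreads = blockDim.x * blockDim.y;

    for (int b = tid; b < bins; b += nthreads)  // zero the block-local copy
        s_hist[b] = 0;
    __syncthreads();

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;
    if ((i < (M - Ty)) && (j < (N - Tx)))
    {
        unsigned int ga = Image[i + j * M] - 1;
        unsigned int gb = Image[i + Ty + (j + Tx) * M] - 1;
        atomicAdd(&s_hist[ga + gb * nbNiveaux], 1u);  // contention stays on-chip
    }
    __syncthreads();

    for (int b = tid; b < bins; b += nthreads)  // merge once per bin
        if (s_hist[b] > 0)
            atomicAdd(&Cmat[b], s_hist[b]);
}
```

The kernel would be launched with bins * sizeof(unsigned int) bytes of dynamic shared memory as the third launch parameter.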