Updating the same global memory location concurrently during sparse matrix multiplication causes write errors

Hello,
I have the following problem:
When computing an inner product, each multiplication of the inner product is handled by a separate thread.

result = ax + by
ax and by are calculated by different threads, and the products are meant to be added up by the individual threads after calculation.

In my sparse matrix kernel, however, I only get
result = by; only the last product written is stored.

What might be the reason for this, and what can I do?

Mustafa

The Kernel code:
__global__ void matrixMulCOO( float * values, int * rowIndices, int * colIndices, int numberOfElements,
                              float* B,
                              float* C)
{
    // Global thread index: one thread per nonzero element
    //int idx = /*blockIdx.x * blockDim.x +*/ threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if(idx < numberOfElements )
        for(int j = 0; j < WB; j++)
        {
            //A.values[idx];
            C[ rowIndices[idx]*WC + j ]
                = values[idx] * B[WB*colIndices[idx] +  j];
            __syncthreads();
        }

    //if(idx < HC )
    //C[ idx] = values[idx];

}

Results:

Device 0: “GeForce GTX 460” with Compute 2.1 capability
array2d <3, 4>
0 1 2 0
0 0 0 0
0 0 10 11
Row: 0 Col: 1 Value: 1.000000
Row: 0 Col: 2 Value: 2.000000
Row: 2 Col: 2 Value: 10.000000
Row: 2 Col: 3 Value: 11.000000

host B Mat 4 x 3
0.00 1.00 2.00
3.00 4.00 5.00
6.00 7.00 8.00
9.00 10.00 11.00

Device C Mat 3 x 3
12.00 14.00 16.00
0.00 0.00 0.00
99.00 110.00 121.00

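__syncthreads() only synchronizes the threads within a single block; it is not a global barrier across the whole kernel.
Even in the code you posted, you're calling __syncthreads() inside a branch. Unless the condition evaluates the same way for every thread in the block, that gives undefined behavior (usually the kernel hangs until it times out and is aborted).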
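
To make the point concrete: in the posted results, row 0 of C equals 2 * B[2] and row 2 equals 11 * B[3], so for each row only the last nonzero's product survived, because every thread stores with "=" into the same C elements. Working the example by hand, the full product would be 15 18 21 for row 0 and 159 180 201 for row 2. One possible fix, sketched here as an assumption rather than as the only approach, is to drop __syncthreads() and accumulate with atomicAdd(). The kernel name matrixMulCOOAtomic, the fixed WB/WC/HC values, and the host code are made up for illustration; float atomicAdd() on global memory needs compute capability 2.0 or higher, which the GTX 460 has.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sizes from the example in the question: A is 3x4 (sparse), B is 4x3, C is 3x3.
#define WB 3  // width of B
#define WC 3  // width of C
#define HC 3  // height of C

// One thread per nonzero of A (COO format). Several nonzeros can share a row,
// so their contributions to the same C element are accumulated atomically.
__global__ void matrixMulCOOAtomic(const float* values, const int* rowIndices,
                                   const int* colIndices, int numberOfElements,
                                   const float* B, float* C)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= numberOfElements) return;  // no barrier below, so an early exit is safe

    float a = values[idx];
    int row = rowIndices[idx];
    int col = colIndices[idx];
    for (int j = 0; j < WB; j++) {
        // Read-modify-write as one indivisible operation on C[row][j].
        atomicAdd(&C[row * WC + j], a * B[col * WB + j]);
    }
}

int main()
{
    const int nnz = 4;
    float hValues[nnz] = {1.0f, 2.0f, 10.0f, 11.0f};
    int   hRows[nnz]   = {0, 0, 2, 2};
    int   hCols[nnz]   = {1, 2, 2, 3};
    float hB[4 * WB]   = {0, 1, 2,  3, 4, 5,  6, 7, 8,  9, 10, 11};
    float hC[HC * WC];

    float *dValues, *dB, *dC;
    int *dRows, *dCols;
    cudaMalloc((void**)&dValues, sizeof(hValues));
    cudaMalloc((void**)&dRows, sizeof(hRows));
    cudaMalloc((void**)&dCols, sizeof(hCols));
    cudaMalloc((void**)&dB, sizeof(hB));
    cudaMalloc((void**)&dC, sizeof(hC));

    cudaMemcpy(dValues, hValues, sizeof(hValues), cudaMemcpyHostToDevice);
    cudaMemcpy(dRows, hRows, sizeof(hRows), cudaMemcpyHostToDevice);
    cudaMemcpy(dCols, hCols, sizeof(hCols), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);
    // C must start at zero because the kernel only adds to it.
    cudaMemset(dC, 0, sizeof(hC));

    matrixMulCOOAtomic<<<1, 32>>>(dValues, dRows, dCols, nnz, dB, dC);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
    for (int i = 0; i < HC; i++) {
        for (int j = 0; j < WC; j++) printf("%.2f ", hC[i * WC + j]);
        printf("\n");
    }

    cudaFree(dValues); cudaFree(dRows); cudaFree(dCols);
    cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Atomics serialize threads that hit the same element, so this is probably fine for a few nonzeros per row but may get slow for dense rows. An alternative design is to assign one thread (or one warp) per row, for example with a CSR layout, so that each C element has a single writer and no atomics are needed.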
