Hello,
I have the following problem:
While computing an inner product, each multiplication of the inner product is handled by a separate thread:
result = ax + by
ax and by are calculated by different threads and are added up after the multiplications are done.
In my sparse-matrix kernel, however, I only get
result = by — only the final elements are stored.
What may be the reason for this, and what can I do?
Mustafa
The kernel code:
__global__ void matrixMulCOO(float *values, int *rowIndices, int *colIndices,
                             int numberOfElements,
                             float *B,
                             float *C)
{
    // Global thread index: one thread per nonzero element of A
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < numberOfElements)
        for (int j = 0; j < WB; j++)
        {
            C[rowIndices[idx] * WC + j]
                = values[idx] * B[WB * colIndices[idx] + j];
            __syncthreads();
        }
}
Results:
Device 0: "GeForce GTX 460" with Compute 2.1 capability
array2d <3, 4>
0 1 2 0
0 0 0 0
0 0 10 11
Row: 0 Col: 1 Value: 1.000000
Row: 0 Col: 2 Value: 2.000000
Row: 2 Col: 2 Value: 10.000000
Row: 2 Col: 3 Value: 11.000000
host B Mat 4 x 3
0.00 1.00 2.00
3.00 4.00 5.00
6.00 7.00 8.00
9.00 10.00 11.00
Device C Mat 3 x 3
12.00 14.00 16.00
0.00 0.00 0.00
99.00 110.00 121.00
__syncthreads() only synchronizes the threads within one block, not globally across your whole kernel, so it cannot order writes coming from threads in different blocks.
Also, even in the code you posted, you’re calling __syncthreads() inside a conditional branch (the if/for), which is undefined behavior when not all threads of a block reach it (usually a timeout and an aborted kernel).
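Beyond the synchronization issue, the reason you see result = by is that every nonzero in the same row writes to the same element of C with a plain `=`, so whichever thread writes last wins and the earlier products are lost. The partial products have to be accumulated. Here is a minimal sequential reference for the intended COO SpMM (my own sketch, not code from the post; `cooSpMM` is a hypothetical helper, and it uses WB for the width of both B and C, which are equal in your example) — note the `+=`:

```cpp
#include <cstddef>
#include <vector>

// Sequential reference for C = A * B, with A given in COO format.
// The essential difference from the posted kernel is the `+=`:
// each nonzero of A contributes a partial product that must be
// accumulated into C, not assigned over the previous one.
std::vector<float> cooSpMM(const std::vector<float>& values,
                           const std::vector<int>& rowIndices,
                           const std::vector<int>& colIndices,
                           const std::vector<float>& B,
                           int rowsC, int WB)
{
    std::vector<float> C(rowsC * WB, 0.0f);            // C starts zeroed
    for (std::size_t k = 0; k < values.size(); ++k)    // one nonzero of A
        for (int j = 0; j < WB; ++j)                   // one column of B
            C[rowIndices[k] * WB + j] += values[k] * B[colIndices[k] * WB + j];
    return C;
}
```

With the data from your post (values {1, 2, 10, 11}, rows {0, 0, 2, 2}, cols {1, 2, 2, 3}), row 0 of the correct C is 1*B[1,:] + 2*B[2,:] = {15, 18, 21}; with `=` instead of `+=`, only the last product, 2*B[2,:] = {12, 14, 16}, survives, and row 2 similarly collapses to 11*B[3,:] = {99, 110, 121} — exactly the wrong output you show. On the GPU the analogous fix is to zero C first and accumulate with atomicAdd(&C[rowIndices[idx]*WC + j], values[idx] * B[WB*colIndices[idx] + j]); atomicAdd on float requires compute capability 2.0 or higher, which your GTX 460 (2.1) has.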