Hello,
I have the following problem:
While computing an inner product, each multiplication of the inner product is handled by a separate thread:
result = ax + by
ax and by are calculated by different threads and are added up after the multiplications are done.
In my sparse-matrix kernel, however, I only get
result = by — only the final elements are stored.
What may be the reason for this, and what can I do?
Mustafa
The kernel code:
__global__ void matrixMulCOO(float *values, int *rowIndices, int *colIndices,
                             int numberOfElements,
                             float *B,
                             float *C)
{
    // Global thread index: one thread per nonzero element of A
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < numberOfElements)
        for (int j = 0; j < WB; j++)
        {
            C[rowIndices[idx] * WC + j]
                = values[idx] * B[WB * colIndices[idx] + j];
            __syncthreads();
        }
}
Results:
Device 0: "GeForce GTX 460" with Compute 2.1 capability
array2d <3, 4>
0 1 2 0
0 0 0 0
0 0 10 11
Row: 0 Col: 1 Value: 1.000000
Row: 0 Col: 2 Value: 2.000000
Row: 2 Col: 2 Value: 10.000000
Row: 2 Col: 3 Value: 11.000000
host B Mat 4 x 3
0.00 1.00 2.00
3.00 4.00 5.00
6.00 7.00 8.00
9.00 10.00 11.00
Device C Mat 3 x 3
12.00 14.00 16.00
0.00 0.00 0.00
99.00 110.00 121.00
__syncthreads() only synchronizes the threads within one block, not globally across your whole kernel, so it cannot order writes coming from threads in different blocks.
Also, even in the code you posted, you’re calling __syncthreads() inside a conditional branch (the if/for), which is undefined behavior when not all threads of a block reach it (usually a timeout and an aborted kernel).
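Beyond the synchronization issue, the reason you see result = by is that every nonzero in the same row writes to the same element of C with a plain `=`, so whichever thread writes last wins and the earlier products are lost. The partial products have to be accumulated. Here is a minimal sequential reference for the intended COO SpMM (my own sketch, not code from the post; `cooSpMM` is a hypothetical helper, and it uses WB for the width of both B and C, which are equal in your example) — note the `+=`:

```cpp
#include <cstddef>
#include <vector>

// Sequential reference for C = A * B, with A given in COO format.
// The essential difference from the posted kernel is the `+=`:
// each nonzero of A contributes a partial product that must be
// accumulated into C, not assigned over the previous one.
std::vector<float> cooSpMM(const std::vector<float>& values,
                           const std::vector<int>& rowIndices,
                           const std::vector<int>& colIndices,
                           const std::vector<float>& B,
                           int rowsC, int WB)
{
    std::vector<float> C(rowsC * WB, 0.0f);            // C starts zeroed
    for (std::size_t k = 0; k < values.size(); ++k)    // one nonzero of A
        for (int j = 0; j < WB; ++j)                   // one column of B
            C[rowIndices[k] * WB + j] += values[k] * B[colIndices[k] * WB + j];
    return C;
}
```

With the data from your post (values {1, 2, 10, 11}, rows {0, 0, 2, 2}, cols {1, 2, 2, 3}), row 0 of the correct C is 1*B[1,:] + 2*B[2,:] = {15, 18, 21}; with `=` instead of `+=`, only the last product, 2*B[2,:] = {12, 14, 16}, survives, and row 2 similarly collapses to 11*B[3,:] = {99, 110, 121} — exactly the wrong output you show. On the GPU the analogous fix is to zero C first and accumulate with atomicAdd(&C[rowIndices[idx]*WC + j], values[idx] * B[WB*colIndices[idx] + j]); atomicAdd on float requires compute capability 2.0 or higher, which your GTX 460 (2.1) has.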