Hi guys,
I am currently working on a simple kernel that multiplies a huge sparse matrix by a vector. The matrix is stored as three corresponding arrays in COO form: row index (vox), column index (beam), and cell value (depos).
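To make the layout concrete, here is a tiny made-up example of the three arrays (the real matrix is of course much larger):

int   vox[]   = { 0, 0, 1 };          // row (voxel) index of each entry
int   beam[]  = { 1, 2, 0 };          // column (beam) index of each entry
float depos[] = { 5.0f, 2.0f, 3.0f }; // cell (deposition) value of each entry
// i.e. M[0][1] = 5, M[0][2] = 2, M[1][0] = 3; everything else is zero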
The code is shown below:
__global__ void AtomicKernel(float *resultVector, const int *vox, const int *beam,
                             const float *depos, const float *settings,
                             const int *chunksize, const int *lastVoxel)
{
    int i = threadIdx.x;
    int bx = blockIdx.x;
    int startindex = bx * 1024 + i;

    // Each thread handles a contiguous chunk of *chunksize matrix entries.
    long kStart = (long)startindex * (*chunksize);
    long kUpperLimit = (long)(startindex + 1) * (*chunksize);
    if (kUpperLimit > *lastVoxel)
    {
        kUpperLimit = *lastVoxel;
    }

    if (kStart < *lastVoxel) // was "< lastVoxel" (a pointer compare), which did not compile
    {
        for (long k = kStart; k < kUpperLimit; k++)
        {
            // resultVector[vox[k]] += depos[k] * settings[beam[k]]
            atomicAdd(resultVector + vox[k], depos[k] * settings[beam[k]]);
        }
    }
    // No __syncthreads() needed: threads never exchange data here, and
    // calling it inside a divergent branch is undefined behavior anyway.
}
Does anyone have any idea why this is so slow? (6 ms for 20k blocks of 1024 threads each, compared to 0.1 ms on a single CPU thread.)
Everything lives in global memory; could that alone explain such a big difference?
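For context, this is roughly how I launch and time it (the d_* buffer names are just placeholders for my device allocations):

int blocks  = 20000; // "20k blocks"
int threads = 1024;  // "1024 threads each"
AtomicKernel<<<blocks, threads>>>(d_result, d_vox, d_beam, d_depos,
                                  d_settings, d_chunksize, d_lastVoxel);
cudaDeviceSynchronize(); // make sure the 6 ms figure covers the whole kernel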
I have also heard of the fmaf function from the CUDA math API, but my attempt to use it broke the code.
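For reference, fmaf(a, b, c) computes a*b + c in one instruction, so it cannot replace the atomic add itself; the only place I can see it fitting is a local accumulator that is flushed with one atomicAdd per run of equal row indices. A rough, untested sketch of that loop (it assumes entries with the same vox index sit next to each other, which may not hold for my data):

if (kStart < kUpperLimit)
{
    float acc = 0.0f;
    int   row = vox[kStart];
    for (long k = kStart; k < kUpperLimit; k++)
    {
        if (vox[k] != row) // row changed: flush the local accumulator
        {
            atomicAdd(resultVector + row, acc);
            row = vox[k];
            acc = 0.0f;
        }
        acc = fmaf(depos[k], settings[beam[k]], acc); // acc += depos*settings
    }
    atomicAdd(resultVector + row, acc); // flush the final run
}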
Please Halp :-)