GTX 1080 60x slower than i7 6700k

Hi guys,
I am currently working on a simple kernel that multiplies a huge matrix by a vector. The matrix is stored as 3 corresponding vectors: row (vox) index, column (beam) index, and cell (depos) value.
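For example, a matrix with three non-zero cells would be stored as:

	vox   = { 0, 0, 1 };            // row index of each cell
	beam  = { 0, 2, 1 };            // column index of each cell
	depos = { 1.5f, 2.0f, 3.0f };   // the cell values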
The code is shown below:

__global__ void AtomicKernel(float *resultVector, const int *vox, const int *beam, const float *depos, const float *settings, const int *chunksize, int *lastVoxel)
{
	int i = threadIdx.x;
	int bx = blockIdx.x;
	int startindex = bx * 1024 + i;

	// each thread handles one contiguous chunk of the COO entries
	long kLowerLimit = (long)startindex * (*chunksize);
	long kUpperLimit = (long)(startindex + 1) * (*chunksize);
	if (kUpperLimit > *lastVoxel)
		kUpperLimit = *lastVoxel;

	if (kLowerLimit < *lastVoxel)   // dereference lastVoxel (the original compared against the pointer)
		for (long k = kLowerLimit; k < kUpperLimit; k++)
			// scatter each cell into its output row; atomicAdd because
			// several threads can hit the same row index concurrently
			atomicAdd(&resultVector[vox[k]], depos[k] * settings[beam[k]]);
}

Does anyone have any idea why this is so slow? (6 ms for 20k blocks of 1024 threads each, versus 0.1 ms on the CPU on a single thread.)
Everything is in global memory; can that be the reason for such a big difference?
I have also heard of the fmaf function from CUDA math, but trying to use it broke the code.
Please Halp :-)

I would try using a sparse matrix-vector multiply (SpMV) routine from cuSPARSE.
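Roughly like this with the generic SpMV API (a minimal sketch, not tested; numRows/numCols/nnz and the d_* pointers are placeholder names for your data, and toolkits older than CUDA 10.1 expose cusparseScsrmv instead):

	#include <cusparse.h>

	// Assumes the matrix is already in CSR form on the device:
	// d_csrRowPtr (numRows+1 ints), d_csrColInd (nnz ints), d_csrVal (nnz floats).
	// d_x is the dense input vector (your settings), d_y receives the result.
	cusparseHandle_t handle;
	cusparseCreate(&handle);

	cusparseSpMatDescr_t matA;
	cusparseCreateCsr(&matA, numRows, numCols, nnz,
	                  d_csrRowPtr, d_csrColInd, d_csrVal,
	                  CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
	                  CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);

	cusparseDnVecDescr_t vecX, vecY;
	cusparseCreateDnVec(&vecX, numCols, d_x, CUDA_R_32F);
	cusparseCreateDnVec(&vecY, numRows, d_y, CUDA_R_32F);

	float alpha = 1.0f, beta = 0.0f;   // computes y = alpha*A*x + beta*y
	size_t bufferSize = 0;
	void *dBuffer = NULL;
	// query scratch size, allocate, then run the multiply
	cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
	                        &alpha, matA, vecX, &beta, vecY, CUDA_R_32F,
	                        CUSPARSE_SPMV_ALG_DEFAULT, &bufferSize);
	cudaMalloc(&dBuffer, bufferSize);
	cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
	             &alpha, matA, vecX, &beta, vecY, CUDA_R_32F,
	             CUSPARSE_SPMV_ALG_DEFAULT, dBuffer);

(On older toolkits the algorithm enum is CUSPARSE_MV_ALG_DEFAULT.)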

or the one from CUB.
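Something along these lines, going by the cub::DeviceSpmv::CsrMV interface (again a sketch, with the same placeholder CSR arrays as above):

	#include <cub/cub.cuh>

	void  *d_temp_storage = NULL;
	size_t temp_storage_bytes = 0;
	// first call with a NULL workspace just reports the scratch size
	cub::DeviceSpmv::CsrMV(d_temp_storage, temp_storage_bytes,
	                       d_csrVal, d_csrRowPtr, d_csrColInd,
	                       d_x, d_y, numRows, numCols, nnz);
	cudaMalloc(&d_temp_storage, temp_storage_bytes);
	// second call performs y = A*x
	cub::DeviceSpmv::CsrMV(d_temp_storage, temp_storage_bytes,
	                       d_csrVal, d_csrRowPtr, d_csrColInd,
	                       d_x, d_y, numRows, numCols, nnz);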

If your matrix is in COO format, you may need to convert it to CSR format first. cuSPARSE has functions to help with that.
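The conversion itself is a single call; a sketch, assuming d_vox holds your device-side COO row indices already sorted by row (cusparseXcoosortByRow can sort them if not):

	int *d_csrRowPtr;
	cudaMalloc(&d_csrRowPtr, (numRows + 1) * sizeof(int));
	// compress the sorted COO row indices into CSR row offsets
	cusparseXcoo2csr(handle, d_vox, nnz, numRows,
	                 d_csrRowPtr, CUSPARSE_INDEX_BASE_ZERO);

Your beam and depos arrays can then serve directly as d_csrColInd and d_csrVal, provided they were sorted together with vox.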