GTX 1080 60x slower than i7 6700k

Hi guys,
I am currently working on a simple kernel that multiplies a huge matrix by a vector. The matrix is stored as 3 corresponding vectors: row (vox) index, column (beam) index, and cell (depos) value.
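For example, a matrix with three non-zero cells would be stored as:

	vox   = { 0, 0, 1 };            // row index of each cell
	beam  = { 0, 2, 1 };            // column index of each cell
	depos = { 1.5f, 2.0f, 3.0f };   // the cell values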
The code is shown below:

__global__ void AtomicKernel(float *resultVector, const int *vox, const int *beam, const float *depos, const float *settings, const int *chunksize, int *lastVoxel)
{
	int i = threadIdx.x;
	int bx = blockIdx.x;
	int startindex = bx * 1024 + i;

	// each thread handles one contiguous chunk of the COO entries
	long kLowerLimit = (long)startindex * (*chunksize);
	long kUpperLimit = (long)(startindex + 1) * (*chunksize);
	if (kUpperLimit > *lastVoxel)
		kUpperLimit = *lastVoxel;

	if (kLowerLimit < *lastVoxel)   // dereference lastVoxel (the original compared against the pointer)
		for (long k = kLowerLimit; k < kUpperLimit; k++)
			// scatter each cell into its output row; atomicAdd because
			// several threads can hit the same row index concurrently
			atomicAdd(&resultVector[vox[k]], depos[k] * settings[beam[k]]);
}

Does anyone have any idea why this is so slow? (6 ms for 20k blocks of 1024 threads each, versus 0.1 ms on the CPU on a single thread.)
Everything is in global memory; can that be the reason for such a big difference?
I have also heard of the fmaf function from CUDA math, but trying to use it broke the code.
Please Halp :-)

I would try using a sparse matrix-vector multiply (SpMV) routine from cuSPARSE.
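Roughly like this with the generic SpMV API (a minimal sketch, not tested; numRows/numCols/nnz and the d_* pointers are placeholder names for your data, and toolkits older than CUDA 10.1 expose cusparseScsrmv instead):

	#include <cusparse.h>

	// Assumes the matrix is already in CSR form on the device:
	// d_csrRowPtr (numRows+1 ints), d_csrColInd (nnz ints), d_csrVal (nnz floats).
	// d_x is the dense input vector (your settings), d_y receives the result.
	cusparseHandle_t handle;
	cusparseCreate(&handle);

	cusparseSpMatDescr_t matA;
	cusparseCreateCsr(&matA, numRows, numCols, nnz,
	                  d_csrRowPtr, d_csrColInd, d_csrVal,
	                  CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
	                  CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);

	cusparseDnVecDescr_t vecX, vecY;
	cusparseCreateDnVec(&vecX, numCols, d_x, CUDA_R_32F);
	cusparseCreateDnVec(&vecY, numRows, d_y, CUDA_R_32F);

	float alpha = 1.0f, beta = 0.0f;   // computes y = alpha*A*x + beta*y
	size_t bufferSize = 0;
	void *dBuffer = NULL;
	// query scratch size, allocate, then run the multiply
	cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
	                        &alpha, matA, vecX, &beta, vecY, CUDA_R_32F,
	                        CUSPARSE_SPMV_ALG_DEFAULT, &bufferSize);
	cudaMalloc(&dBuffer, bufferSize);
	cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
	             &alpha, matA, vecX, &beta, vecY, CUDA_R_32F,
	             CUSPARSE_SPMV_ALG_DEFAULT, dBuffer);

(On older toolkits the algorithm enum is CUSPARSE_MV_ALG_DEFAULT.)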

or the one from CUB.
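Something along these lines, going by the cub::DeviceSpmv::CsrMV interface (again a sketch, with the same placeholder CSR arrays as above):

	#include <cub/cub.cuh>

	void  *d_temp_storage = NULL;
	size_t temp_storage_bytes = 0;
	// first call with a NULL workspace just reports the scratch size
	cub::DeviceSpmv::CsrMV(d_temp_storage, temp_storage_bytes,
	                       d_csrVal, d_csrRowPtr, d_csrColInd,
	                       d_x, d_y, numRows, numCols, nnz);
	cudaMalloc(&d_temp_storage, temp_storage_bytes);
	// second call performs y = A*x
	cub::DeviceSpmv::CsrMV(d_temp_storage, temp_storage_bytes,
	                       d_csrVal, d_csrRowPtr, d_csrColInd,
	                       d_x, d_y, numRows, numCols, nnz);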

If your matrix is in COO format, you may need to convert it to CSR format first. cuSPARSE has functions to help with that.
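The conversion itself is a single call; a sketch, assuming d_vox holds your device-side COO row indices already sorted by row (cusparseXcoosortByRow can sort them if not):

	int *d_csrRowPtr;
	cudaMalloc(&d_csrRowPtr, (numRows + 1) * sizeof(int));
	// compress the sorted COO row indices into CSR row offsets
	cusparseXcoo2csr(handle, d_vox, nnz, numRows,
	                 d_csrRowPtr, CUSPARSE_INDEX_BASE_ZERO);

Your beam and depos arrays can then serve directly as d_csrColInd and d_csrVal, provided they were sorted together with vox.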