hi guys,

I am currently working on a simple kernel which multiplies a huge sparse matrix by a vector. The matrix is stored as 3 corresponding vectors (row (vox) index, col (beam) index, and cell (depos) value).

The code is shown below:

```
__global__ void AtomicKernel(float *resultVector, const int *vox, const int *beam,
                             const float *depos, const float *settings,
                             const int *chunksize, const int *lastVoxel)
{
    int i = threadIdx.x;
    int bx = blockIdx.x;
    long startindex = (long)bx * blockDim.x + i;  // use blockDim.x instead of hardcoding 1024

    // each thread handles a contiguous chunk of [kLowerLimit, kUpperLimit) triplets
    long kLowerLimit = startindex * (*chunksize);
    long kUpperLimit = (startindex + 1) * (*chunksize);
    if (kUpperLimit > *lastVoxel)
    {
        kUpperLimit = *lastVoxel;
    }

    // bug fix: the original compared against the pointer lastVoxel, not the value *lastVoxel
    if (kLowerLimit < *lastVoxel)
    {
        for (long k = kLowerLimit; k < kUpperLimit; k++)
        {
            // scatter this triplet's contribution into the result vector
            atomicAdd(&resultVector[vox[k]], depos[k] * settings[beam[k]]);
        }
    }
    // removed the unused variable and the __syncthreads() calls: the kernel uses no
    // shared memory, and a barrier inside divergent control flow can hang the block
}
```

Does anyone have any idea why this is so slow? (6 ms for 20k blocks of 1024 threads each, compared to 0.1 ms on the CPU on a single thread)

I have everything in global memory; could that be the reason for such a big difference?

I have also heard of the fmaf function from the CUDA math API, but trying to use it broke the code.

Please Halp :-)