Hi all,

A colleague and I have implemented a naive CUDA kernel that calculates the covariance of a sparse matrix on the GPU.

We aim to do this mostly as an effort to do some statistical inference. The company I work for operates in the web domain, and part of its operation is data mining large quantities of keywords, tokens and n-grams for relevance against each other. With this system we aim to find relations between tokens.

During tuning we were surprised to find that the kernel is very slow, and the time seemingly goes into writing the result back to global memory.

The following (also available here http://github.com/GregBowyer/cuda-test/blo…_kernel.cu#L37) is the implementation of the kernel; it is called from a wrapper program.

```
// Count how many keyword IDs tokens t1 and t2 have in common.
// intr holds wI keyword IDs per token; unused slots are -1.
__device__ float get_intersections(int* intr, int t1, int t2, int wI) {
    int n = 0;
    for (int i = 0; i < wI; i++) {
        int x1 = (t1 * wI) + i;
        if (intr[x1] == -1)
            break;
        for (int j = 0; j < wI; j++) {
            int x2 = (t2 * wI) + j;
            if (intr[x2] == -1)
                break;
            if (intr[x1] == intr[x2])
                n++;
        }
    }
    return (float) n;
}

// result: wT x wT covariance matrix, tokens: per-token occurrence counts,
// wK: number of keywords, wI: row width of the intersection table
__global__ void calc(float* result, int* tokens, int* intr, int wT, int wK, int wI) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    float t1 = (float) tokens[i];
    float t2 = (float) tokens[j];
    float v = 0;
    // Only the lower triangle is computed; each thread writes both halves.
    if (i >= j) {
        float t00 = -t1 / wK;
        float t01 = 1 - t1 / wK;
        if (i == j) {
            // calculate diagonal (variance)
            v = ((t01 * t01 * t1) + (t00 * t00 * (wK - t1))) / wK;
        } else {
            float nn = get_intersections(intr, i, j, wI);
            float t10 = -t2 / wK;
            float t11 = 1 - t2 / wK;
            v = ((nn * t01 * t11) + ((t1 - nn) * t01 * t10)
               + ((t2 - nn) * t00 * t11) + ((wK - (t2 + t1 - nn)) * t00 * t10)) / wK;
        }
        result[i + (j * wT)] = v;
        result[j + (i * wT)] = v;
    }
}
```

If we run the kernel as is, we find that it is slower than a CPU implementation (about 5 seconds on our example datasets).

If, however, the following lines are commented out, the kernel is extremely fast, offering a very good speedup over the CPU:

```
result[i + (j * wT)] = v;
result[j + (i * wT)] = v;
```

From background reading, I am guessing that the writes to global memory are not aligned or coalesced, but I cannot see how we would be able to coalesce them; this is more a reflection of my inexperience with CUDA than anything else.

Does anyone have any ideas on what we are doing wrong, and how we can improve this kernel?

I can naturally provide more details as needed.

Many thanks in advance