I am looking for fresh ideas on how to make this code faster. It currently takes 5–8 seconds, which is about 100 times faster than doing it in MATLAB on a quad-core, so I should already be happy, but I don’t like kernel calls of more than 5 seconds. I would like people to be able to run this on a normal desktop with a GTX 280 in it; right now it runs on a C1060 sample.

normal C:

```
for (int index = 0; index < 20402; index++) {
  for (int ix = 0; ix < 1608; ix++) {
    for (int iy = 0; iy < 1608; iy++) {
      out[ix][iy] += input1[index][ix] * input2[index][iy] * function(A[index], B[index]);
    }
  }
}
```

I currently have the following CUDA code, which takes 5–8 seconds:

```
#define NUM_THREADS 512

calc_relatively_slow<<<dim3(1608, 1608, 1), NUM_THREADS>>>(20402, input1, input2, A, B, out);

////////////////////////////

__global__ void calc_relatively_slow(int num_index, const complex *input1, const complex *input2,
                                     const float *A, const float *B, complex *out)
{
    __shared__ complex s_out[NUM_THREADS];
    s_out[threadIdx.x] = 0.0f;

    // each thread strides over the index dimension
    unsigned index = threadIdx.x;
    while (index < num_index) {
        s_out[threadIdx.x] += input1[index + blockIdx.x * num_index]
                            * input2[index + blockIdx.y * num_index]
                            * function(A[index], B[index]);
        index += NUM_THREADS;
    }
    __syncthreads();

    // here comes a standard reduction into s_out[0]

    if (threadIdx.x == 0)
        out[blockIdx.x + blockIdx.y * gridDim.x] = s_out[0];
}
```

So each block calculates one output value.

I thought of letting each block calculate, say, a 16x16 tile of the output array, so that two half-warps read in input1 and input2, but I am afraid that will lead to impossible shared memory usage (16*16*256*8 > 16K), and I still have the feeling I can make better use of shared memory.
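For what it’s worth, the shared memory blow-up in the 16x16 idea only happens if the 256 partial sums per output element are kept in shared memory. If each thread keeps its own accumulator in a register and the block only stages small chunks of input1, input2 and the per-index weights, the footprint drops to roughly 2 KB per block. A rough, untested sketch of that layout (`float` instead of the complex type; `TILE`, `CHUNK`, the `width` parameter and the `function` stub are assumptions, launched with a 16x16 thread block):

```cuda
#include <cuda_runtime.h>

#define TILE  16   // output tile edge (assumption)
#define CHUNK 16   // indices staged per pass (assumption)

// stand-in for the real function(A, B)
__device__ float function(float a, float b) { return a * b; }

// launch: calc_tiled<<<dim3((width+TILE-1)/TILE, (width+TILE-1)/TILE), dim3(TILE, TILE)>>>(...)
__global__ void calc_tiled(int num_index, int width,
                           const float *input1,   // laid out as [width][num_index]
                           const float *input2,   // laid out as [width][num_index]
                           const float *A, const float *B,
                           float *out)            // laid out as [width][width]
{
    __shared__ float s1[CHUNK][TILE];
    __shared__ float s2[CHUNK][TILE];
    __shared__ float w[CHUNK];

    const int ix = blockIdx.x * TILE + threadIdx.x;   // output column of this thread
    const int iy = blockIdx.y * TILE + threadIdx.y;   // output row of this thread
    float acc = 0.0f;   // per-thread partial sum lives in a register, not shared memory

    for (int base = 0; base < num_index; base += CHUNK) {
        // cooperative load: threadIdx.y picks the index within the chunk,
        // threadIdx.x picks the column within the tile
        const int k  = base + threadIdx.y;
        const int cx = blockIdx.x * TILE + threadIdx.x;
        const int cy = blockIdx.y * TILE + threadIdx.x;
        s1[threadIdx.y][threadIdx.x] =
            (k < num_index && cx < width) ? input1[k + cx * num_index] : 0.0f;
        s2[threadIdx.y][threadIdx.x] =
            (k < num_index && cy < width) ? input2[k + cy * num_index] : 0.0f;
        if (threadIdx.x == 0)
            w[threadIdx.y] = (k < num_index) ? function(A[k], B[k]) : 0.0f;
        __syncthreads();

        // each staged element is reused by 16 threads from shared memory
        for (int kk = 0; kk < CHUNK; kk++)
            acc += s1[kk][threadIdx.x] * s2[kk][threadIdx.y] * w[kk];
        __syncthreads();
    }

    if (ix < width && iy < width)
        out[ix + iy * width] = acc;
}
```

Shared memory here is 2 * CHUNK * TILE * 4 + CHUNK * 4 ≈ 2.1 KB per block, and each staged element of input1/input2 is read once from global memory and reused 16 times, instead of being re-read per output element as in the one-block-per-output version. The bounds checks matter because 1608 is not a multiple of 16.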

Anybody have any ideas?