I am having trouble with a big kernel called from Matlab (unspecified launch error), so I have made a smaller, simpler version and put it in the SDK directory to check that I am doing things correctly. The plan is to expand it slowly until I have what I need.

The simple code is :

```
for (i = 1; i < array_size; i++) {
    /* index is i here; in the kernel, index = blockIdx.x */
    for (k = 1; k < vec_size; k++)
        out_array[index] += in_array[index] * in_vec[k];
}
```

In my kernel I take index = blockIdx.x, so each block calculates one value in the output array (and it only needs one value from the input array).
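To make the mapping concrete, here is a minimal serial sketch in plain C of the "one block per output value" idea. It is not the attached code: the function names compute_one_output and compute_all_outputs are made up for illustration, and it assumes float data and 0-based loops over all elements (the snippet above starts at 1).

```c
#include <stddef.h>

/* Work done by one hypothetical "block": a single output element.
   Here index plays the role of blockIdx.x (assumption for illustration). */
static void compute_one_output(size_t index, const float *in_array,
                               const float *in_vec, size_t vec_size,
                               float *out_array)
{
    float acc = 0.0f;
    for (size_t k = 0; k < vec_size; k++)
        acc += in_array[index] * in_vec[k];
    out_array[index] += acc;   /* accumulate, as in the original loop */
}

/* Serial stand-in for the grid: one call per block index. */
static void compute_all_outputs(const float *in_array, const float *in_vec,
                                size_t array_size, size_t vec_size,
                                float *out_array)
{
    for (size_t index = 0; index < array_size; index++)
        compute_one_output(index, in_array, in_vec, vec_size, out_array);
}
```

Written this way, each output element depends only on its own input element and the whole vector, which is why one block per element looked like a natural split.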

Now, when I run my code with blockDim.x = 256, everything runs fine and the results match a C version. But with blockDim.x = 1536 the results no longer match.
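For anyone who wants to reproduce the comparison, a check along these lines should do. This is only a sketch, not the code in the attachment: first_mismatch and abs_f are hypothetical helper names, and the relative tolerance is an assumption, since GPU and CPU float results need not agree bit for bit.

```c
#include <stddef.h>

/* Absolute value without pulling in the math library. */
static float abs_f(float x) { return x < 0.0f ? -x : x; }

/* Returns the index of the first element where gpu_out and cpu_out differ
   by more than rel_tol (relative to max(|cpu_out[i]|, 1)), or -1 if every
   element agrees. A NaN in either array counts as a mismatch. */
static long first_mismatch(const float *gpu_out, const float *cpu_out,
                           size_t n, float rel_tol)
{
    for (size_t i = 0; i < n; i++) {
        float diff  = abs_f(gpu_out[i] - cpu_out[i]);
        float mag   = abs_f(cpu_out[i]);
        float scale = mag > 1.0f ? mag : 1.0f;
        if (!(diff <= rel_tol * scale))   /* negated test also catches NaN */
            return (long)i;
    }
    return -1;
}
```

Knowing the first index that disagrees shows whether the wrong values start at a particular position or are spread over the whole output.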

I have attached my code. I am completely baffled, since the maximum grid size is 65535, so I am still far from that limit… Does anyone have a clue what I am doing wrong?

Thanks,

Denis

test.tar.gz (6.99 KB)