I am attempting to create a project that solves deconvolution problems using CUDA. As part of the solution to these problems, I need to convolve multiple real functions together. Currently, I am having problems with the convergence of my project when given artificial data with an exact solution. I believe the source of this error is coming from the CUDA convolution function I have implemented, however I cannot seem to work out how it is incorrect. The kernel function I have written is based on the serial loop “full” convolution shown here in C++ taken from StackOverflow (c++ - How to perform 1-dimensional "valid" convolution? - Stack Overflow):

```
template<typename T>
std::vector<T>
conv(std::vector<T> const &f, std::vector<T> const &g) {
int const nf = f.size();
int const ng = g.size();
int const n = nf + ng - 1;
std::vector<T> out(n, T());
for(auto i(0); i < n; ++i) {
int const jmn = (i >= ng - 1)? i - (ng - 1) : 0;
int const jmx = (i < nf - 1)? i : nf - 1;
for(auto j(jmn); j <= jmx; ++j) {
out[i] += (f[j] * g[i - j]);
}
}
return out;
}
```

My kernel is written thus:

```
__global__ void convolve(double* vec1, double* vec2, double* res, int vec1_len, int conv_len)
{
const int idx = blockDim.x * blockIdx.x + threadIdx.x;
if (idx >= conv_len)
{
return;
}
double temp = 0;
const int jmn = (idx >= X_h - 1) ? (idx - (X_h - 1)) : 0;
const int jmx = (idx < (vec1_len - 1)) ? (idx) : (vec1_len - 1);
for (int j = jmn; j <= jmx; j++)
{
temp += (vec1[j] * vec2[idx - j]);
}
res[idx] = temp;
}
```

In this example `X_h`

is the length of the kernel (`vec2`

) and `conv_len`

is the length of the output array (`X_h + vec1_len - 1`

). However, when I compare the output data from the output of the MATLAB or NumPy convolution functions, my output does not match. Can anyone help with why this is?

I have considered that I could take the FFT of both `vec1`

and `vec2`

, multiply them together then take the inverse FFT of the output; however I am not familiar with the implementation of FFTs in CUDA and wonder whether it would be faster for a simple 1D convolution.

Thanks in advance.