1D Convolution in CUDA not the same as output in NumPy or MATLAB

I am attempting to create a project that solves deconvolution problems using CUDA. As part of the solution to these problems, I need to convolve multiple real functions together. Currently, I am having problems with the convergence of my project when given artificial data with an exact solution. I believe the source of this error is coming from the CUDA convolution function I have implemented, however I cannot seem to work out how it is incorrect. The kernel function I have written is based on the serial loop “full” convolution shown here in C++ taken from StackOverflow (c++ - How to perform 1-dimensional "valid" convolution? - Stack Overflow):

template<typename T>
conv(std::vector<T> const &f, std::vector<T> const &g) {
  int const nf = f.size();
  int const ng = g.size();
  int const n  = nf + ng - 1;
  std::vector<T> out(n, T());
  for(auto i(0); i < n; ++i) {
    int const jmn = (i >= ng - 1)? i - (ng - 1) : 0;
    int const jmx = (i <  nf - 1)? i            : nf - 1;
    for(auto j(jmn); j <= jmx; ++j) {
      out[i] += (f[j] * g[i - j]);
  return out; 

My kernel is written thus:

__global__ void convolve(double* vec1, double* vec2, double* res, int vec1_len, int conv_len)
	const int idx = blockDim.x * blockIdx.x + threadIdx.x;
	if (idx >= conv_len)

	double temp = 0;

	const int jmn = (idx >= X_h - 1) ? (idx - (X_h - 1)) : 0;
	const int jmx = (idx < (vec1_len - 1)) ? (idx) : (vec1_len - 1);

	for (int j = jmn; j <= jmx; j++)
		temp += (vec1[j] * vec2[idx - j]);

	res[idx] = temp;

In this example X_h is the length of the kernel (vec2) and conv_len is the length of the output array (X_h + vec1_len - 1). However, when I compare the output data from the output of the MATLAB or NumPy convolution functions, my output does not match. Can anyone help with why this is?

I have considered that I could take the FFT of both vec1 and vec2, multiply them together then take the inverse FFT of the output; however I am not familiar with the implementation of FFTs in CUDA and wonder whether it would be faster for a simple 1D convolution.

Thanks in advance.