Hello everyone,

To begin with, I am very new to CUDA and to parallel programming in general, so forgive me if my question is imprecise.

Recently, I have been modifying an existing program (mumax3) to allow computations in double precision. Mumax3 is written in Go, and I have modified all the Go and CUDA files accordingly. However, I have just discovered that mumax3 also has a reduction kernel, based on this reference (https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf). I have modified the corresponding file, reduce.h, as follows.

```
#ifndef _REDUCE_H_
#define _REDUCE_H_

// Block size for reduce kernels.
#define REDUCE_BLOCKSIZE 1024

// This macro expands to a reduce kernel with arbitrary reduce operation.
// Ugly, perhaps, but arguably nicer than some 1000+ line C++ template.
// load(i): loads element i, possibly pre-processing the data
// op(a, b): reduce operation. e.g. sum
// atomicOp(a, b): atomic reduce operation in global mem.
#define reduce(load, op, atomicOp)                          \
    __shared__ double sdata[REDUCE_BLOCKSIZE];              \
    int tid = threadIdx.x;                                  \
    int i = blockIdx.x * blockDim.x + threadIdx.x;          \
                                                            \
    double mine = initVal;                                  \
    int stride = gridDim.x * blockDim.x;                    \
    while (i < n) {                                         \
        mine = op(mine, load(i));                           \
        i += stride;                                        \
    }                                                       \
    sdata[tid] = mine;                                      \
    __syncthreads();                                        \
                                                            \
    for (unsigned int s = blockDim.x / 2; s > 64; s >>= 1) { \
        if (tid < s) {                                      \
            sdata[tid] = op(sdata[tid], sdata[tid + s]);    \
        }                                                   \
        __syncthreads();                                    \
    }                                                       \
                                                            \
    if (tid < 64) {                                         \
        volatile double* smem = sdata;                      \
        smem[tid] = op(smem[tid], smem[tid + 64]);          \
        smem[tid] = op(smem[tid], smem[tid + 32]);          \
        smem[tid] = op(smem[tid], smem[tid + 16]);          \
        smem[tid] = op(smem[tid], smem[tid + 8]);           \
        smem[tid] = op(smem[tid], smem[tid + 4]);           \
        smem[tid] = op(smem[tid], smem[tid + 2]);           \
        smem[tid] = op(smem[tid], smem[tid + 1]);           \
    }                                                       \
                                                            \
    if (tid == 0) { atomicOp(dst, sdata[0]); }

// Based on "Optimizing parallel reduction in CUDA" by Mark Harris.
#endif
```
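
For context, the macro is meant to be expanded inside a kernel body. Below is a minimal sketch of what such a kernel could look like in double precision. The names `reducesum_double`, `load` and `add` are hypothetical placeholders, not the actual mumax3 identifiers, and the sketch assumes the reduce.h above is on the include path. As far as I know, the built-in `atomicAdd` overload for `double` only exists on devices of compute capability 6.0 and higher, so a kernel like this would have to be compiled for such a target.

```cuda
// Hypothetical usage sketch: the identifiers below are illustrative
// placeholders, not names taken from mumax3.
#include "reduce.h"

// load(i): read element i of the input array without pre-processing.
#define load(i) src[i]
// add(a, b): the reduce operation, here a plain sum.
#define add(a, b) ((a) + (b))

// Sums the n elements of src; thread 0 of each block atomically adds the
// block's partial result to *dst.
// Assumptions: launched with blockDim.x == REDUCE_BLOCKSIZE, *dst set by the
// caller before launch, and a target of compute capability >= 6.0 so that
// atomicAdd(double*, double) is available.
extern "C" __global__ void
reducesum_double(const double* __restrict__ src, double* __restrict__ dst,
                 double initVal, int n) {
    reduce(load, add, atomicAdd)
}
```

The parameter names `dst`, `initVal` and `n` are not arbitrary here: the macro body refers to them directly, so any kernel that expands it has to declare them.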

When I try to create and load a module using cuModuleLoadData, I get the following error.

```
panic: CUDA_ERROR_INVALID_IMAGE
```

So I understand that I am doing something wrong here. If you need more information on the context, please ask me.

Thank you for your help.