Kernel reduction in 64 bits

Hello everyone,

To begin with, I am very new to CUDA and parallel programming in general, so forgive me if my question is imprecise.

Recently, I have been working on modifying an existing piece of software (mumax3) to allow computations in double precision. Mumax3 is written in Go, and I have modified all the Go and CUDA files accordingly. However, I have just discovered that mumax3 has a kernel reduction function, inspired by this reference (https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf). I have modified the corresponding file, reduce.h, as follows.

#ifndef _REDUCE_H_
#define _REDUCE_H_

// Block size for reduce kernels.
#define REDUCE_BLOCKSIZE 1024

// This macro expands to a reduce kernel with arbitrary reduce operation.
// Ugly, perhaps, but arguably nicer than some 1000+ line C++ template.
// load(i): loads element i, possibly pre-processing the data
// op(a, b): reduce operation. e.g. sum
// atomicOp(a, b): atomic reduce operation in global mem.
#define reduce(load, op, atomicOp)                      \
    __shared__ double sdata[REDUCE_BLOCKSIZE];           \
    int tid = threadIdx.x;                              \
    int i =  blockIdx.x * blockDim.x + threadIdx.x;     \
                                                        \
    double mine = initVal;                               \
    int stride = gridDim.x * blockDim.x;                \
    while (i < n) {                                     \
        mine = op(mine, load(i));                       \
        i += stride;                                    \
    }                                                   \
    sdata[tid] = mine;                                  \
    __syncthreads();                                    \
                                                        \
    for (unsigned int s=blockDim.x/2; s>64; s>>=1) {    \
        if (tid < s){                                   \
            sdata[tid] = op(sdata[tid], sdata[tid + s]);\
        }                                               \
        __syncthreads();                                \
    }                                                   \
                                                        \
    if (tid < 64) {                                     \
        volatile double* smem = sdata;                   \
        smem[tid] = op(smem[tid], smem[tid + 64]);      \
        smem[tid] = op(smem[tid], smem[tid + 32]);      \
        smem[tid] = op(smem[tid], smem[tid + 16]);      \
        smem[tid] = op(smem[tid], smem[tid +  8]);      \
        smem[tid] = op(smem[tid], smem[tid +  4]);      \
        smem[tid] = op(smem[tid], smem[tid +  2]);      \
        smem[tid] = op(smem[tid], smem[tid +  1]);      \
    }                                                   \
                                                        \
    if (tid == 0) { atomicOp(dst, sdata[0]); }          \
// Based on "Optimizing parallel reduction in CUDA" by Mark Harris.
#endif

When I try to create and load a module using cuModuleLoadData, I get the following error.

panic: CUDA_ERROR_INVALID_IMAGE

So I understand that I am doing something wrong here. If you need more information about the context, please ask.

Thank you for your help.

Could it be that you missed an opening brace for the function?

Thank you for your reply. Unfortunately, this is not the solution to my problem.

I don’t think there is any problem in the code you have shown that would give rise to this error. There is a problem, though, in the section that is supposed to be warp-synchronous: you are entering that section with two warps active (if (tid < 64)), and warp-synchronous behavior only ever held within a single 32-thread warp, so that won’t work. The solution is straightforward (compare the s>32 loop bound and the if (tid < 32) guard in the test case below). However, this alone would not give rise to the error you report.

If I had to guess, you are trying to run this on a device that does not natively support atomicAdd for double (native double-precision atomicAdd requires compute capability 6.0 or higher), and your attempt gives rise to this error.

Really just a guess.
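If that does turn out to be the case, the CUDA C++ Programming Guide shows how to emulate double-precision atomicAdd with atomicCAS on pre-Pascal devices. A sketch (adapted from the guide, and only relevant if my guess above is right):

#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ < 600
// Software double-precision atomicAdd via atomicCAS, for devices below
// compute capability 6.0 (which lack the native double overload).
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // try to swap in (assumed + val); retry if another thread got there first
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
#endif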

Here’s a sample test case that works correctly for me:

$ cat reduce.h
#ifndef _REDUCE_H_
#define _REDUCE_H_

// Block size for reduce kernels.
#define REDUCE_BLOCKSIZE 1024

// This macro expands to a reduce kernel with arbitrary reduce operation.
// Ugly, perhaps, but arguably nicer than some 1000+ line C++ template.
// load(i): loads element i, possibly pre-processing the data
// op(a, b): reduce operation. e.g. sum
// atomicOp(a, b): atomic reduce operation in global mem.
#define reduce(load, op, atomicOp)                      \
    __shared__ double sdata[REDUCE_BLOCKSIZE];           \
    int tid = threadIdx.x;                              \
    int i =  blockIdx.x * blockDim.x + threadIdx.x;     \
                                                        \
    double mine = initVal;                               \
    int stride = gridDim.x * blockDim.x;                \
    while (i < n) {                                     \
        mine = op(mine, load(i));                       \
        i += stride;                                    \
    }                                                   \
    sdata[tid] = mine;                                  \
    __syncthreads();                                    \
                                                        \
    for (unsigned int s=blockDim.x/2; s>32; s>>=1) {    \
        if (tid < s){                                   \
            sdata[tid] = op(sdata[tid], sdata[tid + s]);\
        }                                               \
        __syncthreads();                                \
    }                                                   \
                                                        \
    if (tid < 32) {                                     \
        volatile double* smem = sdata;                   \
        smem[tid] = op(smem[tid], smem[tid + 32]);      \
        smem[tid] = op(smem[tid], smem[tid + 16]);      \
        smem[tid] = op(smem[tid], smem[tid +  8]);      \
        smem[tid] = op(smem[tid], smem[tid +  4]);      \
        smem[tid] = op(smem[tid], smem[tid +  2]);      \
        smem[tid] = op(smem[tid], smem[tid +  1]);      \
    }                                                   \
                                                        \
    if (tid == 0) { atomicOp(dst, sdata[0]); }          \
// Based on "Optimizing parallel reduction in CUDA" by Mark Harris.
#endif
$ cat test.cu
#include "reduce.h"
#include <iostream>
#define load(x) dA[x]
#define op(x,y) (x+y)
#define atomicOp(x,y) atomicAdd(x, y)

__global__ void rk(double initVal, double *dA, double *dst, size_t n){

  reduce(load, op, atomicOp);
}

int main(){

  size_t n = 1048576;
  double *dR, *dA, *hA = new double[n];
  for (size_t i = 0; i < n; i++) hA[i] = 1.0;  // fill input with ones
  cudaMalloc(&dA, n*sizeof(dA[0]));
  cudaMalloc(&dR, sizeof(dR[0]));
  cudaMemset(dR, 0, sizeof(dR[0]));
  cudaMemcpy(dA, hA, n*sizeof(dA[0]), cudaMemcpyHostToDevice);
  rk<<<160, REDUCE_BLOCKSIZE>>>(0.0, dA, dR, n);
  double R;
  cudaMemcpy(&R, dR, sizeof(dR[0]), cudaMemcpyDeviceToHost);
  std::cout << "result = " << R << std::endl;
}
$ nvcc -o test test.cu -arch=sm_70
$ compute-sanitizer ./test
========= COMPUTE-SANITIZER
result = 1.04858e+06
========= ERROR SUMMARY: 0 errors
$

Note that I am compiling for an architecture and running on a device that supports double atomicAdd.
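If you are not sure what your device supports, a quick runtime check (a minimal sketch, assuming device 0) would be:

#include <cuda_runtime.h>
#include <iostream>

int main(){
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // query device 0
  // native double-precision atomicAdd requires compute capability >= 6.0
  std::cout << "compute capability " << prop.major << "." << prop.minor << std::endl;
}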

Note that even my “fixing” of the warp-synchronous section is no longer a proper CUDA programming technique; implicit warp-synchronous programming is deprecated on current architectures. This blog outlines a correct methodology (e.g. listing 8).
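For illustration, the warp-level building block of that approach looks roughly like this (a sketch; the name warpReduceSum is mine, and __shfl_down_sync supports double on CUDA 9 and later):

// Warp-level sum using shuffle intrinsics: each step halves the number
// of participating lanes; after the loop, lane 0 holds the full warp sum.
__inline__ __device__ double warpReduceSum(double val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}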