Kernel reduction in 64 bits

Hello everyone,

To begin with, I am very new to CUDA and parallel programming in general, so forgive me if my question is imprecise.

Recently, I have been working on modifying an existing piece of software (mumax3) to allow computations in double precision. Mumax3 is written in Go, and I have modified all the Go and CUDA files accordingly. However, I have just discovered that mumax3 has a kernel reduction function, inspired by this reference (https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf). I have modified the corresponding file, reduce.h, as follows.

#ifndef _REDUCE_H_
#define _REDUCE_H_

// Block size for reduce kernels.
#define REDUCE_BLOCKSIZE 1024

// This macro expands to a reduce kernel with arbitrary reduce operation.
// Ugly, perhaps, but arguably nicer than some 1000+ line C++ template.
// load(i): loads element i, possibly pre-processing the data
// op(a, b): reduce operation. e.g. sum
// atomicOp(a, b): atomic reduce operation in global mem.
#define reduce(load, op, atomicOp)                      \
    __shared__ double sdata[REDUCE_BLOCKSIZE];           \
    int tid = threadIdx.x;                              \
    int i =  blockIdx.x * blockDim.x + threadIdx.x;     \
                                                        \
    double mine = initVal;                               \
    int stride = gridDim.x * blockDim.x;                \
    while (i < n) {                                     \
        mine = op(mine, load(i));                       \
        i += stride;                                    \
    }                                                   \
    sdata[tid] = mine;                                  \
    __syncthreads();                                    \
                                                        \
    for (unsigned int s=blockDim.x/2; s>64; s>>=1) {    \
        if (tid < s){                                   \
            sdata[tid] = op(sdata[tid], sdata[tid + s]);\
        }                                               \
        __syncthreads();                                \
    }                                                   \
                                                        \
    if (tid < 64) {                                     \
        volatile double* smem = sdata;                   \
        smem[tid] = op(smem[tid], smem[tid + 64]);      \
        smem[tid] = op(smem[tid], smem[tid + 32]);      \
        smem[tid] = op(smem[tid], smem[tid + 16]);      \
        smem[tid] = op(smem[tid], smem[tid +  8]);      \
        smem[tid] = op(smem[tid], smem[tid +  4]);      \
        smem[tid] = op(smem[tid], smem[tid +  2]);      \
        smem[tid] = op(smem[tid], smem[tid +  1]);      \
    }                                                   \
                                                        \
    if (tid == 0) { atomicOp(dst, sdata[0]); }          \
// Based on "Optimizing parallel reduction in CUDA" by Mark Harris.
#endif

When I try to create and load a module using cuModuleLoadData, I get the following error.

panic: CUDA_ERROR_INVALID_IMAGE

So I understand that I am doing something wrong here. If you need more information about the context, please ask.

Thank you for your help.

Could it be that you missed an opening brace for the function?

Thank you for your reply. Unfortunately, this is not the solution to my problem.

I don’t think there is any problem in the code you have shown that would give rise to this error. There is a problem, though, in the section that is supposed to be warp-synchronous: you are entering that section with two warps active (if (tid < 64)), and warp-synchronous behavior only ever held within a single 32-thread warp, so that won’t work. The solution is straightforward (compare the s>32 loop bound and the if (tid < 32) guard in the test case below). However, this alone would not give rise to the error you report.

If I had to guess, you are trying to run this on a device that does not natively support atomicAdd for double (native double-precision atomicAdd requires compute capability 6.0 or higher), and your attempt gives rise to this error.

Really just a guess.
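If that does turn out to be the case, the CUDA C++ Programming Guide shows how to emulate double-precision atomicAdd with atomicCAS on pre-Pascal devices. A sketch (adapted from the guide, and only relevant if my guess above is right):

#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ < 600
// Software double-precision atomicAdd via atomicCAS, for devices below
// compute capability 6.0 (which lack the native double overload).
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // try to swap in (assumed + val); retry if another thread got there first
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
#endif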

Here’s a sample test case that works correctly for me:

$ cat reduce.h
#ifndef _REDUCE_H_
#define _REDUCE_H_

// Block size for reduce kernels.
#define REDUCE_BLOCKSIZE 1024

// This macro expands to a reduce kernel with arbitrary reduce operation.
// Ugly, perhaps, but arguably nicer than some 1000+ line C++ template.
// load(i): loads element i, possibly pre-processing the data
// op(a, b): reduce operation. e.g. sum
// atomicOp(a, b): atomic reduce operation in global mem.
#define reduce(load, op, atomicOp)                      \
    __shared__ double sdata[REDUCE_BLOCKSIZE];           \
    int tid = threadIdx.x;                              \
    int i =  blockIdx.x * blockDim.x + threadIdx.x;     \
                                                        \
    double mine = initVal;                               \
    int stride = gridDim.x * blockDim.x;                \
    while (i < n) {                                     \
        mine = op(mine, load(i));                       \
        i += stride;                                    \
    }                                                   \
    sdata[tid] = mine;                                  \
    __syncthreads();                                    \
                                                        \
    for (unsigned int s=blockDim.x/2; s>32; s>>=1) {    \
        if (tid < s){                                   \
            sdata[tid] = op(sdata[tid], sdata[tid + s]);\
        }                                               \
        __syncthreads();                                \
    }                                                   \
                                                        \
    if (tid < 32) {                                     \
        volatile double* smem = sdata;                   \
        smem[tid] = op(smem[tid], smem[tid + 32]);      \
        smem[tid] = op(smem[tid], smem[tid + 16]);      \
        smem[tid] = op(smem[tid], smem[tid +  8]);      \
        smem[tid] = op(smem[tid], smem[tid +  4]);      \
        smem[tid] = op(smem[tid], smem[tid +  2]);      \
        smem[tid] = op(smem[tid], smem[tid +  1]);      \
    }                                                   \
                                                        \
    if (tid == 0) { atomicOp(dst, sdata[0]); }          \
// Based on "Optimizing parallel reduction in CUDA" by Mark Harris.
#endif
$ cat test.cu
#include "reduce.h"
#include <iostream>
#define load(x) dA[x]
#define op(x,y) (x+y)
#define atomicOp(x,y) atomicAdd(x, y)

__global__ void rk(double initVal, double *dA, double *dst, size_t n){

  reduce(load, op, atomicOp);
}

int main(){

  size_t n = 1048576;
  double *dR, *dA, *hA = new double[n];
  for (size_t i = 0; i < n; i++) hA[i] = 1.0;  // fill input with ones
  cudaMalloc(&dA, n*sizeof(dA[0]));
  cudaMalloc(&dR, sizeof(dR[0]));
  cudaMemset(dR, 0, sizeof(dR[0]));
  cudaMemcpy(dA, hA, n*sizeof(dA[0]), cudaMemcpyHostToDevice);
  rk<<<160, REDUCE_BLOCKSIZE>>>(0.0, dA, dR, n);
  double R;
  cudaMemcpy(&R, dR, sizeof(dR[0]), cudaMemcpyDeviceToHost);
  std::cout << "result = " << R << std::endl;
}
$ nvcc -o test test.cu -arch=sm_70
$ compute-sanitizer ./test
========= COMPUTE-SANITIZER
result = 1.04858e+06
========= ERROR SUMMARY: 0 errors
$

Note that I am compiling for an architecture and running on a device that supports double atomicAdd.
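If you are not sure what your device supports, a quick runtime check (a minimal sketch, assuming device 0) would be:

#include <cuda_runtime.h>
#include <iostream>

int main(){
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // query device 0
  // native double-precision atomicAdd requires compute capability >= 6.0
  std::cout << "compute capability " << prop.major << "." << prop.minor << std::endl;
}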

Note that even my “fixing” of the warp-synchronous section is no longer a proper CUDA programming technique; implicit warp-synchronous programming is deprecated on current architectures. This blog outlines a correct methodology (e.g. listing 8).
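For illustration, the warp-level building block of that approach looks roughly like this (a sketch; the name warpReduceSum is mine, and __shfl_down_sync supports double on CUDA 9 and later):

// Warp-level sum using shuffle intrinsics: each step halves the number
// of participating lanes; after the loop, lane 0 holds the full warp sum.
__inline__ __device__ double warpReduceSum(double val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}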