Cuda : Reduce (max/min) function on matrix implementation

ahmed.gueddiche2 · August 22, 2019, 10:38am

So I am learning cuda c++ and now i’m implementing a reduce function on a matrix : find the max value in a matrix. I have implemented a function to get the max on an array and transforming it to a matrix version should be straightforward but I can’t get it to work (I don’t get the correct result). I am wondering if this is the right approach. You can find the code for the two versions below :

For an array :

__global__
void reduce_kernal_shared_mem(float *d_in, float *d_out){
    int indx = blockDim.x * blockIdx.x + threadIdx.x;
    int tindx = threadIdx.x;

    extern __shared__ float sh_in[];

    sh_in[tindx] = -99999.0f;

    sh_in[tindx] = d_in[indx];
    __syncthreads();

    for(int i = blockDim.x / 2; i > 0; i >>= 1){
        if(tindx < i){
            sh_in[tindx] = fmax(sh_in[tindx], sh_in[tindx + i]);
        }
        __syncthreads();
    }

    if(tindx == 0){
        d_out[blockIdx.x] = sh_in[0];
    }
}

void reduce(float *d_in, float *d_int, float *d_out, const int ARRAY_SIZE){
     reduce_kernal_shared_mem<<<1024, 1024, 1024 * sizeof(float)>>>(d_in, d_int);
     reduce_kernal_shared_mem<<<1, 1024, 1024 * sizeof(float)>>>(d_int, d_out);
}

For a matrix :

__global__
void get_max(const float* d_logLuminance, float *d_out, int numRows, int numCols){
   int col = blockIdx.x * blockDim.x + threadIdx.x;
   int row = blockIdx.y * blockDim.y + threadIdx.y;
   int c_t = threadIdx.x;
   int r_t = threadIdx.y;
   int pos_1D = row * numCols + col;
   int pos_1D_t = r_t * blockDim.x + c_t;

   extern __shared__ float sh_mem[];

   sh_mem[pos_1D_t] = -999999.0f;

   if(pos_1D > numCols * numRows)
      return;

   sh_mem[pos_1D_t] = d_logLuminance[pos_1D];
   __syncthreads();

   for(int s = (blockDim.x * blockDim.y) / 2; s > 0; s >>= 1){
      if(pos_1D_t < s)
         sh_mem[pos_1D_t] = fmax(sh_mem[pos_1D_t], sh_mem[pos_1D_t + s]);
      __syncthreads();
   }

   if(r_t == 0 && c_t == 0)
      d_out[blockIdx.y * gridDim.x + blockIdx.x] = sh_mem[0];
}

void max(const float *d_logLuminance, int numRows, int numCols, float &max_logLum){

   int THREADS_PER_BLOCK = 32;
   dim3 blockSize(THREADS_PER_BLOCK, THREADS_PER_BLOCK);
   dim3 gridSize((THREADS_PER_BLOCK + numCols - 1) / THREADS_PER_BLOCK, 
                  (THREADS_PER_BLOCK + numRows - 1) / THREADS_PER_BLOCK);

   float *d_out, *d_int;
   cudaMalloc(&d_out, sizeof(float) * numRows * numCols);
   cudaMalloc(&d_int, sizeof(float) * numRows * numCols);


   get_max<<<gridSize, blockSize, THREADS_PER_BLOCK * THREADS_PER_BLOCK * sizeof(float)>>>(d_logLuminance, d_int, numRows, numCols);
   get_max<<<1, blockSize, THREADS_PER_BLOCK * THREADS_PER_BLOCK * sizeof(float)>>>(d_int, d_out, numRows, numCols);

   cudaDeviceSynchronize();

   cudaMemcpy(&max_logLum, d_out, sizeof(float), cudaMemcpyDeviceToHost);

   printf("max : %f\n", max_logLum);

   cudaFree(d_out);
   cudaFree(d_int);

}

I could just launch a kernel with 1-dimensional blocksize and gridsize with number of threads = numRows * numCols and work on the matrix as an array (without the indexing tricks) and it works but I would like to know why my approach won’t.

Thank you.

Robert_Crovella · August 22, 2019, 4:42pm

I answered in your cross-posting:

[url]Cuda : Reduce (max/min) function on matrix implementation - Stack Overflow

If you want people to respond to posts like this, in my experience you’re more likely to get attention if you provide a complete code. That is something that I or someone could copy, paste, compile, and run, without having to add anything or change anything. Don’t make someone else write a main routine, or guess at the sizes of the data set you are actually using.

Topic		Replies	Views
Finding max in array CUDA Programming and Performance	15	41505	November 26, 2017
Cumpute Max of Vector or Matrix CUDA Programming and Performance	7	3765	June 6, 2011
about finding a max number from a big array CUDA Programming and Performance	10	4647	January 7, 2020
One question regarding shared memory CUDA Programming and Performance	5	1234	April 24, 2013
Find maximum value from threads CUDA Programming and Performance	6	352	December 16, 2023
Why when I used this two global function, I got different results? CUDA Programming and Performance cuda	8	342	October 4, 2023
Shared Memory and Matrix CUDA Programming and Performance	1	746	February 22, 2016
How to do Reduction in column for a matrix CUDA Programming and Performance	2	1497	May 3, 2019
copy from 1D array to shared memory matrix in cuda CUDA Programming and Performance	7	2052	June 9, 2015
CUDA reduction CUDA Programming and Performance	10	51390	June 7, 2009

Cuda : Reduce (max/min) function on matrix implementation

Related topics