CUDA : Reduce (max/min) function on matrix implementation

So I am learning CUDA C++ and I'm now implementing a reduce function on a matrix : finding the max value in a matrix. I have implemented a function to get the max of an array, and transforming it into a matrix version should be straightforward, but I can't get it to work (I don't get the correct result). I am wondering if this is the right approach. You can find the code for the two versions below :

For an array :

__global__ void reduce_kernel_shared_mem(float *d_in, float *d_out){
    int indx = blockDim.x * blockIdx.x + threadIdx.x;
    int tindx = threadIdx.x;

    extern __shared__ float sh_in[];

    sh_in[tindx] = -99999.0f;    // sentinel (redundant here: every slot is overwritten below)

    sh_in[tindx] = d_in[indx];
    __syncthreads();             // all loads must finish before the tree starts

    for(int i = blockDim.x / 2; i > 0; i >>= 1){
        if(tindx < i){
            sh_in[tindx] = fmax(sh_in[tindx], sh_in[tindx + i]);
        }
        __syncthreads();         // barrier at every level of the tree
    }

    if(tindx == 0){
        d_out[blockIdx.x] = sh_in[0];
    }
}

void reduce(float *d_in, float *d_int, float *d_out, const int ARRAY_SIZE){
     reduce_kernel_shared_mem<<<1024, 1024, 1024 * sizeof(float)>>>(d_in, d_int);
     reduce_kernel_shared_mem<<<1, 1024, 1024 * sizeof(float)>>>(d_int, d_out);
}

For a matrix :

__global__ void get_max(const float* d_logLuminance, float *d_out, int numRows, int numCols){
   int col = blockIdx.x * blockDim.x + threadIdx.x;
   int row = blockIdx.y * blockDim.y + threadIdx.y;
   int c_t = threadIdx.x;
   int r_t = threadIdx.y;
   int pos_1D = row * numCols + col;
   int pos_1D_t = r_t * blockDim.x + c_t;

   extern __shared__ float sh_mem[];

   sh_mem[pos_1D_t] = -999999.0f;   // sentinel for threads that fall outside the matrix

   if(col < numCols && row < numRows)
      sh_mem[pos_1D_t] = d_logLuminance[pos_1D];
   __syncthreads();                 // loads must complete before the tree starts

   for(int s = (blockDim.x * blockDim.y) / 2; s > 0; s >>= 1){
      if(pos_1D_t < s)
         sh_mem[pos_1D_t] = fmax(sh_mem[pos_1D_t], sh_mem[pos_1D_t + s]);
      __syncthreads();              // barrier at every level of the tree
   }

   if(r_t == 0 && c_t == 0)
      d_out[blockIdx.y * gridDim.x + blockIdx.x] = sh_mem[0];
}

void max(const float *d_logLuminance, int numRows, int numCols, float &max_logLum){

   int THREADS_PER_BLOCK = 32;
   dim3 blockSize(THREADS_PER_BLOCK, THREADS_PER_BLOCK);
   dim3 gridSize((numCols + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK,
                 (numRows + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK);

   float *d_out, *d_int;
   cudaMalloc(&d_out, sizeof(float) * numRows * numCols);
   cudaMalloc(&d_int, sizeof(float) * numRows * numCols);

   get_max<<<gridSize, blockSize, THREADS_PER_BLOCK * THREADS_PER_BLOCK * sizeof(float)>>>(d_logLuminance, d_int, numRows, numCols);
   get_max<<<1, blockSize, THREADS_PER_BLOCK * THREADS_PER_BLOCK * sizeof(float)>>>(d_int, d_out, numRows, numCols);

   cudaMemcpy(&max_logLum, d_out, sizeof(float), cudaMemcpyDeviceToHost);

   printf("max : %f\n", max_logLum);
}



I could just launch a kernel with a 1-dimensional block size and grid size, with the number of threads = numRows * numCols, and work on the matrix as an array (without the indexing tricks), and that works, but I would like to know why my approach won't.

Thank you.

I answered in your cross-posting:

If you want people to respond to posts like this, in my experience you're more likely to get attention if you provide complete code: something that I or someone else could copy, paste, compile, and run without having to add or change anything. Don't make someone else write a main routine, or guess at the sizes of the data set you are actually using.