# Search max/min value in different parts of an array

Hello,

I need to find the min/max value of different parts of an array. Assume I have an array with 4 elements (in reality it has 150 elements, but let's use 4 for this example). I want to calculate the min value of each sub-array of the array and save the results in a matrix, see the picture.

So for the CPU, the code would look like this:

``````
int length = 4;
int out_matrix[4][4];
int in_array[4] = {4, 1, 3, 8};

//Loop 1: start index of the sub-array
for (int start = 0; start < length; ++start)
{
    //Loop 2: end index of the sub-array
    for (int end = 0; end < length; ++end)
    {
        int nMin = INT_MAX;  // INT_MAX / INT_MIN come from <climits>
        int nMax = INT_MIN;  // max is computed the same way, omitted here
        //Loop 3: scan the sub-array [start, end]
        for (int n = start; n <= end; ++n)
        {
            if (in_array[n] < nMin)
            {
                nMin = in_array[n];
            }
            out_matrix[start][end] = nMin;
        }
    }
}
``````
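
To make the expected output concrete, here is a self-contained version of the same loops for the 4-element example; the matrix it should print is shown in the comments (entries below the diagonal are simply left at zero):

``````
#include <climits>
#include <cstdio>

int main()
{
    const int length = 4;
    int in_array[length] = {4, 1, 3, 8};
    int out_matrix[length][length] = {};          // zero-initialize

    for (int start = 0; start < length; ++start)
        for (int end = start; end < length; ++end)
        {
            int nMin = INT_MAX;
            for (int n = start; n <= end; ++n)
                if (in_array[n] < nMin)
                    nMin = in_array[n];
            out_matrix[start][end] = nMin;        // min of in_array[start..end]
        }

    // Expected output:
    // 4 1 1 1
    // 0 1 1 1
    // 0 0 3 3
    // 0 0 0 8
    for (int i = 0; i < length; ++i)
    {
        for (int j = 0; j < length; ++j)
            printf("%d ", out_matrix[i][j]);
        printf("\n");
    }
    return 0;
}
``````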

I read the documentation about parallel reduction, but the solution there only covers the calculation in loop 3, and each individual sub-array (at most 150 elements) is not big enough to take advantage of the GPU. Currently I put the calculation of loop 3 into a thread and give different sub-arrays to different threads.

Currently I have implemented it as follows:

``````
#define LOGLEN_MAX 150
__constant__ int constant_nLogLen;
__constant__ int constant_inArray[LOGLEN_MAX];
__constant__ int constant_nCalls;

__global__ void calculateMin(int* dev_outMatrix)
{
    int startIndex = blockIdx.x;   // start of the sub-array
    int length = threadIdx.x;      // offset of the sub-array end from startIndex
    int logLen = constant_nLogLen;
    int nCalls = constant_nCalls;  // unused in this kernel

    if (startIndex + length >= logLen || startIndex >= logLen || length >= logLen)
    {
        return;
    }

    int min = INT_MAX;
    for (int j = startIndex; j <= startIndex + length; j++)
    {
        if (constant_inArray[j] < min)
        {
            min = constant_inArray[j];
        }
    }

    dev_outMatrix[LOGLEN_MAX*startIndex + startIndex + length] = min;
}
``````

Do you have any idea how I can restructure my calculation so that I can avoid loop 3 in the kernel function and still get efficient processing?

• I believe it's possible to come up with a more work-efficient approach than what you have now. Whether or not it is faster than what you have now is hard to say without a comparison test.
• I'd be very surprised if any sensible realization didn't involve a loop of some sort in the kernel code.
• My suggestion would be that you provide a complete test case for your GPU implementation, so that others can easily test ideas if they wish.

Do as you wish, of course.

For an input length of 150, the kernel you have shown takes less than 10us on my GTX960. It's curious to me that you are trying to optimize it.

Here I want to post my program so that you can test it on your PC easily. With my Quadro M2000M (compute capability 5.0) I get the following runtime log when testing with 50 elements in the array:
nCalls:100
init time =2.848us
memcpy_HostToDevice time =121.216us
Calculation time =289.376us
memcpy_DeviceToHost time =69.12us.
kernel.cu (7.9 KB)

@Robert_Crovella
In my program, I also test the calculation on the CPU, which takes only 21 us. It seems that my threads do not run in parallel at all :(. Can you give me some suggestions? My goal is to make my application at least run faster than the CPU version.

I believe it's possible to come up with a more work-efficient approach. This is based on the following observations:

• The diagonal of the result matrix is your input data
• The results below the diagonal are all zero
• For calculation of the results above the diagonal (the superdiagonals), we can observe that each item is simply the min of the value to the left of that item and the value below that item (see the CPU sketch just after this list)
• If we arrange our threads to process diagonally from the main diagonal “towards” the upper right, we can at each step have the array of threads compute a diagonal of the result.
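
To make the superdiagonal recurrence concrete, here is a minimal CPU sketch (the function name `min_by_diagonals` is just for illustration; this is not the kernel below) that fills the matrix one diagonal at a time using out[start][end] = min(out[start][end-1], out[start+1][end]). On the GPU, each pass of the inner loop becomes one step performed by the array of threads:

``````
#include <algorithm>
#include <vector>

// out[start*n + end] = min(in[start..end]), computed one superdiagonal at a time.
std::vector<int> min_by_diagonals(const std::vector<int>& in)
{
    const int n = static_cast<int>(in.size());
    std::vector<int> out(n * n, 0);              // entries below the diagonal stay 0
    for (int i = 0; i < n; ++i)
        out[i * n + i] = in[i];                  // main diagonal = input data
    for (int d = 1; d < n; ++d)                  // d-th superdiagonal
        for (int start = 0; start + d < n; ++start)
        {
            const int end = start + d;
            out[start * n + end] = std::min(out[start * n + end - 1],
                                            out[(start + 1) * n + end]);
        }
    return out;
}
``````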

This method will require synchronization and communication, so ideally we would use shared memory for this. Furthermore, since we are not calculating results a row at a time, but rather one superdiagonal at a time, we must give some consideration to a method to write out results.

What follows is my attempt to compare what you originally posted, both CPU and GPU versions, with my approach (`gpu_test`). We can make some conclusions:

• for a single instance of this problem, your approach is faster.
• if we have multiple instances of this problem to compute at once, (16 is my example number) then my approach appears faster on my GTX960.
``````
$ cat t165.cu
#include <iostream>
#include <cstdlib>
#include <cstdio>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
const int MY_MAX = 1000;
//const int MY_MIN = 0;
#define LOGLEN_MAX 150
__constant__ int constant_nLogLen;
__constant__ int constant_inArray[LOGLEN_MAX];

__global__ void calculateMin(int* dev_outMatrix)
{
    int startIndex = blockIdx.x;   // start of the sub-array
    int length = threadIdx.x;      // offset of the sub-array end from startIndex
    int logLen = constant_nLogLen;

    if (startIndex + length >= logLen || startIndex >= logLen || length >= logLen)
    {
        return;
    }

    int min = MY_MAX;
    for (int j = startIndex; j <= startIndex + length; j++)
    {
        if (constant_inArray[j] < min)
        {
            min = constant_inArray[j];
        }
    }

    dev_outMatrix[startIndex*logLen + length + startIndex] = min;
}

template <typename T>
void cpu_test(T *in_array, T *out_matrix, T *out_max, int length){

    //Loop 1
    for (int start = 0; start < length; ++start)
    {
        //Loop 2
        for (int end = 0; end < length; ++end)
        {
            T nMin = MY_MAX;
            //          T nMax = MY_MIN;
            //Loop 3
            for (int n = start; n <= end; ++n)
            {
                if (in_array[n] < nMin)
                {
                    nMin = in_array[n];
                }
                out_matrix[start*length+end] = nMin;
            }
        }
    }
}
// assumes block consists of an even number of threads
template <typename T>
__global__ void gpu_test(const T * __restrict__ in, T * __restrict__ out_min, T * __restrict__ out_max){

    // "folded" shared storage: half of the full result matrix
    __shared__ T s[LOGLEN_MAX*LOGLEN_MAX/2];
    const int t = threadIdx.x;                        // this thread's column in the result matrix
    T my_val = in[t];
    const int my_l = blockDim.x;
    const int my_half = my_l>>1;
    const int my_l_m1 = my_l-1;
    int my_oidx = (blockIdx.x*my_l+my_l_m1)*my_l + t; // output index, starting at the last row
    int my_sidx = ((t - (t >= my_half)*my_half)*my_l)+t; // folded shared-memory index
    s[my_sidx] = my_val;
    out_min[my_oidx] = (t == (my_l_m1))*my_val;       // last row: only the diagonal element is nonzero
    for (int i = 1; i < my_l; i++){
        my_oidx -= my_l;                              // move up one row in the output
        my_sidx -= my_l;                              // move up one (folded) shared row
        if (my_sidx < 0) my_sidx += my_half*my_l;
        int my_osidx = t+(my_l_m1-i-((i < my_half)*my_half))*my_l;
        __syncthreads();                              // make the previous step's shared writes visible
        if (t >= i)
            s[my_sidx] = my_val = min(my_val,s[my_sidx-1]);   // extend this thread's running min by one element
        out_min[my_oidx] = (t > (my_l_m1-i-1))*s[my_osidx];   // write out the completed row (zeros below the diagonal)
    }
}

typedef int mt;
int main(){
    const int nblk = 16;
    const int length=150;
    mt out_matrix[length][length];
    mt in_array[length];
    mt *out = &(out_matrix[0][0]);
    for (int i = 0; i < length; i++){
        in_array[i] = (rand()%(MY_MAX-1))+1;
        for (int j = 0; j < length; j++) out_matrix[i][j] = 0;}

    unsigned long long dt = dtime_usec(0);
    for (int i = 0; i < nblk; i++) cpu_test(in_array, out, out, length);
    dt = dtime_usec(dt);
    std::cout << "cpu time: " << dt << "us" << std::endl;
    cudaMemcpyToSymbol(constant_inArray, in_array, length*sizeof(mt));
    cudaMemcpyToSymbol(constant_nLogLen, &length, sizeof(mt));
    mt *d_out, *h_out;
    cudaMalloc(&d_out, nblk*length*length*sizeof(mt));
    h_out = new mt[length*length];
    cudaMemset(d_out, 0, length*length*sizeof(mt));
    for (int i = 0; i < nblk; i++)
        calculateMin<<<length, length>>>(d_out);
    cudaMemcpy(h_out, d_out, length*length*sizeof(mt), cudaMemcpyDeviceToHost);
    for (int i = 0; i < length; i++){
        for (int j = 0; j < length; j++)
            if (h_out[i*length+j] != out[i*length+j]) {std::cout << "mismatch0 at: " << i << "," << j << " was: " << h_out[i*length+j] << " should be: " << out[i*length+j] <<  std::endl; return 0;}
    }
    mt *d_in;
    cudaMalloc(&d_in, length*sizeof(mt));
    cudaMemcpy(d_in, in_array, length*sizeof(mt), cudaMemcpyHostToDevice);
    gpu_test<<<nblk, length>>>(d_in, d_out, d_out);
    cudaMemcpy(h_out, d_out, length*length*sizeof(mt), cudaMemcpyDeviceToHost);
    for (int i = 0; i < length; i++)
        for (int j = 0; j < length; j++)
            if (h_out[i*length+j] != out[i*length+j]) {std::cout << "mismatch1 at: " << i << "," << j << " was: " << h_out[i*length+j] << " should be: " << out[i*length+j] <<  std::endl; return 0;}
}
$ nvcc -O3 -o t165 t165.cu -lineinfo
$ nvprof ./t165
cpu time: 4282us
==2990== NVPROF is profiling process 2990, command: ./t165
==2990== Profiling application: ./t165
==2990== Profiling result:
Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities:   68.71%  144.26us        16  9.0160us  8.4480us  9.3120us  calculateMin(int*)
13.96%  29.312us         2  14.656us  14.624us  14.688us  [CUDA memcpy DtoH]
13.87%  29.120us         1  29.120us  29.120us  29.120us  void gpu_test<int>(int const *, int*, int*)
1.74%  3.6480us         1  3.6480us  3.6480us  3.6480us  [CUDA memset]
1.72%  3.6170us         3  1.2050us  1.0240us  1.4090us  [CUDA memcpy HtoD]
API calls:   99.04%  141.83ms         2  70.915ms  6.7020us  141.82ms  cudaMemcpyToSymbol
0.40%  566.04us       202  2.8020us     170ns  159.97us  cuDeviceGetAttribute
0.17%  243.11us         2  121.55us  120.67us  122.44us  cudaMalloc
0.14%  204.36us         3  68.118us  11.442us  110.78us  cudaMemcpy
0.10%  145.35us         2  72.675us  51.758us  93.593us  cuDeviceTotalMem
0.08%  117.60us        17  6.9170us  5.6480us  13.981us  cudaLaunchKernel
0.04%  58.518us         2  29.259us  26.122us  32.396us  cuDeviceGetName
0.02%  27.103us         1  27.103us  27.103us  27.103us  cudaMemset
0.01%  10.098us         2  5.0490us  1.7830us  8.3150us  cuDeviceGetPCIBusId
0.00%  1.2910us         3     430ns     274ns     721ns  cuDeviceGetCount
0.00%  1.2760us         4     319ns     192ns     689ns  cuDeviceGet
0.00%     586ns         2     293ns     256ns     330ns  cuDeviceGetUuid
$
``````

In order to reduce the shared memory footprint of my code, I “folded” the shared usage. This causes the “lower” rows, which will be written out to global memory first, to use the “right” half of the shared storage that is allocated for the upper rows. By the time the upper rows get around to using this region, the lower row results have already been written out to global memory.
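
As a small illustration of the folding (my own sketch, not part of the kernel above): with a block of `len` threads and only `(len/2)*len` shared elements, result rows `r` and `r + len/2` reuse the same physical shared row, and the lower row's results are flushed to global memory before the upper row's wavefront reaches that storage. The helper name `folded_row` and the example length of 6 are just for illustration.

``````
#include <cstdio>

// Maps a logical result-matrix row onto the half-sized shared storage:
// rows r and r + len/2 share the same physical row (len assumed even).
int folded_row(int logical_row, int len)
{
    return logical_row % (len / 2);
}

int main()
{
    const int len = 6;    // small example block size
    for (int r = 0; r < len; ++r)
        std::printf("logical row %d -> physical shared row %d\n", r, folded_row(r, len));
    return 0;
}
``````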

Just to summarize, the timings I see for 16 iterations are:
cpu_test: ~4ms (this could probably be improved e.g. with OpenMP)
calculateMin: ~144us
gpu_test: ~29us