about finding a max number from a big array

Hi, all:

  I am trying to code a program that is used for finding a max number and its associated index in a 2D array.

 I set the values to a 2D array, then, copy to a 1D array in device. I coded a global function named find_max to find the max number in the 1D array and its associated index in that 1D array. Outputs are 1D array with initialized "0" in all elements. The "0" in one element of the array is replaced by max number located at the associated index.


  After I executed program, outputs of 1D array and index of that max number in the input 1D array is incorrect.


  I tested the same implementation by for loop in host (CPU), outputs are correct. So, I think there should be some special considerations in this kind of implementation in GPU.  


   So, I will greatly appreciate you if you tell me the specials.

Dawn

My complete program is below:


#include “cuda_runtime.h”

#include “cuda.h”

#include “helper_functions.h”
#include “helper_cuda.h”

#include <stdio.h>
#include <stdlib.h>

//3 - linear
#define xsize 101
#define ysize 50

#define size xsize*ysize

// define blocks and threads
#define threds 1024 // # of threads per block
#if size%threds == 0
#define blocks (size/threds) // total # of blocks = total # of threads/threads (per block) you want to send to
#else
#define blocks (size/threds +1)
#endif

global void find_max(float *d_data, float *d_peak, int *d_max) {

int i_max = 0;
float tmp;

tmp = d_data[0];	

int idx = blockIdx.x*blockDim.x + threadIdx.x;

// find peak value idx by idx
while (idx < size) {
	if (tmp < d_data[idx]) {
		i_max = idx;
		tmp = d_data[i_max];
	}		
	idx += blockDim.x*gridDim.x;
}	

// write peak value and associated index for output
d_peak[i_max] = tmp;	
*d_max = i_max;

}

int main()
{

int *d_max;
float *d_x, *d_peak;	

float x[xsize][ysize], h_data[xsize][ysize];

// alloc memory for device variables and copy variables from host to device	
checkCudaErrors(cudaMalloc((void **)&d_x, sizeof(float)*size));
checkCudaErrors(cudaMalloc((void **)&d_peak, sizeof(float)*size));	

checkCudaErrors(cudaMalloc((void **)&d_max, sizeof(int)));

checkCudaErrors(cudaMemset(d_max, 0, sizeof(int)));
checkCudaErrors(cudaMemset(d_peak, 0.0f, sizeof(float)*size));

int num = 0;
// generate float data in x[][]
for (int i = 0; i < xsize; i++) {  		
	for (int j = 0; j < ysize; j++) { 
	    	num = rand() % 100;
	    	x[i][j] = (float)num;
			if (i == 50 && j == 25) {
			     x[i][j] = 1000.00f;
				 printf("peak number is set at index: \n");
				 printf("\n i=%d  j=%d \n", i, j);
			}
			printf("%f ", x[i][j]);
	}
	printf("\n");
}


// copy x from host to d_x in device 
checkCudaErrors(cudaMemcpy(d_x, x, sizeof(float)*size, cudaMemcpyHostToDevice));		

int h_max;
find_max <<< blocks, threds >>> (d_x, d_peak, d_max);

checkCudaErrors(cudaMemcpy(&h_max, d_max, sizeof(int), cudaMemcpyDeviceToHost));
printf("\n\n");
    
    // write index of the max number in 1D array
printf("max number index=%d \n", h_max);
printf("\n\n");

checkCudaErrors(cudaMemcpy(h_data, d_peak, sizeof(float)*size, cudaMemcpyDeviceToHost));

cudaDeviceSynchronize();

 
    // write output of 1D array having the max number 
for (int i = 0; i < xsize; i++) {	
	for (int j = 0; j < ysize; j++) {	
		if (i == 50 && j == 25) {				
			printf("\n i=%d  j=%d \n", i, j);
		}
		printf("%f ", h_data[i][j]);			
	}	
	printf("\n");
}


// clean buffer
// free 
//free(x);
//free(h_data);

cudaFree(d_x);
cudaFree(d_peak);	
cudaFree(d_max);


return 0;

}


i suggest you to use CODE tag for the program (it’s the last button in the bar in the top of edit box)

overall, algorithms for searching a minimum very well-known - did you read this topic in CUDA textbooks? if you just need result - you can use Thrust (part of CUDA) for that job

--------------------------------------------------------------------------------------------------


#include "cuda_runtime.h"

#include "cuda.h"

#include "helper_functions.h"
#include "helper_cuda.h"

#include <stdio.h>
#include <stdlib.h>


//3 - linear
#define xsize  101
#define ysize  50

#define size xsize*ysize

// define blocks and threads
#define threds 1024                    // # of threads per block
#if size%threds == 0
#define blocks  (size/threds)  // total # of blocks = total # of threads/threads (per block) you want to send to
#else
#define blocks (size/threds +1)
#endif

__global__ void find_max(float *d_data, float *d_peak, int *d_max) {

	int i_max = 0;
	float tmp;

	tmp = d_data[0];	

	int idx = blockIdx.x*blockDim.x + threadIdx.x;

	// find peak value idx by idx
	while (idx < size) {
		if (tmp < d_data[idx]) {
			i_max = idx;
			tmp = d_data[i_max];
		}		
		idx += blockDim.x*gridDim.x;
	}	

	// write peak value and associated index for output
	d_peak[i_max] = tmp;	
	*d_max = i_max;

}

int main()
{
	
	int *d_max;
	float *d_x, *d_peak;	

	float x[xsize][ysize], h_data[xsize][ysize];

	// alloc memory for device variables and copy variables from host to device	
	checkCudaErrors(cudaMalloc((void **)&d_x, sizeof(float)*size));
	checkCudaErrors(cudaMalloc((void **)&d_peak, sizeof(float)*size));	
	
	checkCudaErrors(cudaMalloc((void **)&d_max, sizeof(int)));

	checkCudaErrors(cudaMemset(d_max, 0, sizeof(int)));
	checkCudaErrors(cudaMemset(d_peak, 0.0f, sizeof(float)*size));
	
	int num = 0;
	// generate float data in x[][]
	for (int i = 0; i < xsize; i++) {  		
		for (int j = 0; j < ysize; j++) { 
		    	num = rand() % 100;
		    	x[i][j] = (float)num;
				if (i == 50 && j == 25) {
				     x[i][j] = 1000.00f;
					 printf("peak number is set at index: \n");
					 printf("\n i=%d  j=%d \n", i, j);
				}
				printf("%f ", x[i][j]);
		}
		printf("\n");
	}
	

	// copy x from host to d_x in device 
	checkCudaErrors(cudaMemcpy(d_x, x, sizeof(float)*size, cudaMemcpyHostToDevice));		
	
	int h_max;
	find_max <<< blocks, threds >>> (d_x, d_peak, d_max);

	checkCudaErrors(cudaMemcpy(&h_max, d_max, sizeof(int), cudaMemcpyDeviceToHost));
	printf("\n\n");
        
        // write index of the max number in 1D array
	printf("max number index=%d \n", h_max);
	printf("\n\n");

	checkCudaErrors(cudaMemcpy(h_data, d_peak, sizeof(float)*size, cudaMemcpyDeviceToHost));
	
	cudaDeviceSynchronize();

     
        // write output of 1D array having the max number 
	for (int i = 0; i < xsize; i++) {	
		for (int j = 0; j < ysize; j++) {	
			if (i == 50 && j == 25) {				
				printf("\n i=%d  j=%d \n", i, j);
			}
			printf("%f ", h_data[i][j]);			
		}	
		printf("\n");
	}

	
	// clean buffer
	// free 
	//free(x);
	//free(h_data);

	cudaFree(d_x);
	cudaFree(d_peak);	
	cudaFree(d_max);


	return 0;
}


---------------------------------------------------------------------------------------------

your algorithm of finding max number is incorrect. you have only one d_max value and all threads put their result to the same cell. this results that d_max contains result obtained by one of threads, rather than maximum for the entire array

if you can’t catch such sort of errors, you probably don’t have enough understanding of CUDA execution model, so i suggest you to reread chapters about SIMT parallelism model

Hi,

 Thank you for your helps. In the global find_max function, d_data is 1D array input data, d_peak is output 1D array containing max number and d_max is output single constant that is the index of max number in input 1D array of d_data. d_peak is initialized with "0" for each element before passing into find_max. After finding the max number from d_data, only element in the index of max number in d_data is replaced by the max number while rest element of d_peak remain "0"s. 

  So, please re-run my program to find the problems for me.  I will read the chapters of  SIMT parallelism model.


  Thank you very much in advance!

Dawn

d_max is output single constant that is the index of max number in input 1D array of d_data.

when you run “find_max <<< blocks, threds >>>”, it creates blocks*threds threads on GPU. each thread in your code scans a part of the input array and finds index of maxelement in this part. then each thread puts its result into *d_max, overriding results of each other, resulting in that *d_max contains index of some local maximum

first part of my recommendation was to read how to find maximum using GPU - it’s a well-known 2-step algo. you have implemented only first part of this algo, with a lot of mistakes

You may also want to use Thrust as someone point out, or CUB:

You likely won’t beat their performance on your own.

The answer to this question provides 3 different approaches to finding max value plus index:

https://stackoverflow.com/questions/27925979/thrustmax-element-slow-in-comparison-cublasisamax-more-efficient-implementat

Hi, txbob:

  Thank you very much! Your recommendation is really helpful. I made it. Thank you for your instructions and guidance!

Dawn

Hi @Robert, for the custom kernel method, could your explain a little bit about the block-level reduction? I dont understand why the condition is if (!threadIdx.x) there.

In a reduction it’s common to choose a single thread to do the last operation, such as storing the final reduced value into global memory. In this case he’s choosing the first thread to store the value.