Different output every time I run my code, probably an error in finding the max value


I have written this code but it has different output every time I run it. I believe something is wrong with the way I find the max value.

Can anybody please help me by pointing out an efficient method for finding the max and, more generally, what would make my code faster?

Thanks so much!

__device__ float floor_exp(float x) {
  return (x < -708.3f) ? 0.0 : exp(x);
}

__device__ const int nCom=50;
__device__ float maxVal=-0.5E10;
__device__ float tmp[60][50];

__device__ void findMax (int n){
	for (int i=0;i<n;i++){
		for (int j=0;j<nCom;j++)
			if (tmp[i][j]> maxVal)  maxVal=tmp[i][j];
	}
}

__global__ void lowerBound(float* ref_GPU,float* test_GPU, uttSeg* result_GPU, int refSize){
	int x= threadIdx.x;
	int y=threadIdx.y;

	tmp[x][y]=ref_GPU[x*nCom+y] + test_GPU[(x+blockIdx.x)*nCom+y];

	result_GPU[blockIdx.x].lb+= log(floor_exp(tmp[x][y]))+maxVal;
}


You have a race condition since multiple threads are reading and writing maxVal in parallel. You need to either use atomic operations (not straightforward, as atomicMax() has no float overload), or a reduction scheme (where shared values are written and read in a defined order).
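To illustrate the atomic route, here is a minimal sketch, assuming a float max is emulated with a compare-and-swap loop on the value's bit pattern. The names atomicMaxFloat, maxKernel and gpuMaxAtomic are hypothetical and not part of the posted code; error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cfloat>
#include <cuda_runtime.h>

// Hypothetical helper: atomic max for float, emulated with atomicCAS on the
// 32-bit pattern of the stored value.
__device__ void atomicMaxFloat(float* addr, float value) {
    int* addrAsInt = reinterpret_cast<int*>(addr);
    int old = *addrAsInt;
    // Retry while our value is still larger than what is currently stored.
    while (value > __int_as_float(old)) {
        int assumed = old;
        old = atomicCAS(addrAsInt, assumed, __float_as_int(value));
        if (old == assumed) break;  // the swap succeeded
    }
}

// Each thread folds one element into the single global result.
__global__ void maxKernel(const float* data, int n, float* result) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicMaxFloat(result, data[i]);
}

// Hypothetical host wrapper: copies the data over, runs the kernel,
// and returns the maximum.
float gpuMaxAtomic(const float* hostData, int n) {
    assert(n > 0);
    float *dData, *dResult;
    float init = -FLT_MAX;  // start below every finite float
    cudaMalloc(&dData, n * sizeof(float));
    cudaMalloc(&dResult, sizeof(float));
    cudaMemcpy(dData, hostData, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dResult, &init, sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    maxKernel<<<blocks, threads>>>(dData, n, dResult);

    float out;
    cudaMemcpy(&out, dResult, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dResult);
    return out;
}
```

Every thread contends for the same address here, so this is correct but likely slower than a reduction when many threads participate.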

Apart from this problem: Are you aware that in your findMax implementation all threads are executing the same code, on the same data? You should distribute the work between threads, instead of repeating the same work all over.

Thanks for your response.

I’m aware that they are all finding the max again. I thought that way I would get a consistent result, but it did not work, and now I know why!

Where can I read about reduction schemes? Apparently that’s my solution.

Check out the reduction example in the CUDA SDK.
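For orientation, a minimal sketch of the idea follows. It is not the SDK code itself, and it assumes a power-of-two block size of 256 and at most 64 blocks in the first pass. The names blockMax, gpuMaxReduce and kThreads are hypothetical, and error checking is omitted.

```cuda
#include <cassert>
#include <cfloat>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

// Each block writes the maximum of its share of the input to blockResults.
__global__ void blockMax(const float* data, int n, float* blockResults) {
    __shared__ float cache[kThreads];
    int tid = threadIdx.x;

    // Grid-stride loop: each thread takes the max over its own elements.
    float local = -FLT_MAX;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        local = fmaxf(local, data[i]);
    cache[tid] = local;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            cache[tid] = fmaxf(cache[tid], cache[tid + stride]);
        __syncthreads();  // finish this step before starting the next
    }
    if (tid == 0) blockResults[blockIdx.x] = cache[0];
}

// Hypothetical host wrapper: two passes, per-block maxima first,
// then one block reduces those maxima to a single value.
float gpuMaxReduce(const float* hostData, int n) {
    assert(n > 0);
    int blocks = (n + kThreads - 1) / kThreads;
    if (blocks > 64) blocks = 64;

    float *dData, *dPartial, *dResult;
    cudaMalloc(&dData, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMalloc(&dResult, sizeof(float));
    cudaMemcpy(dData, hostData, n * sizeof(float), cudaMemcpyHostToDevice);

    blockMax<<<blocks, kThreads>>>(dData, n, dPartial);
    blockMax<<<1, kThreads>>>(dPartial, blocks, dResult);

    float out;
    cudaMemcpy(&out, dResult, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dPartial);
    cudaFree(dResult);
    return out;
}
```

No thread writes a location that another thread reads within the same step, and __syncthreads() separates the steps, so the result is the same on every run.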

Thanks! I got it!