Why do these two global functions give different results?

The first version of the global function is below. With this version I get the right results.

__global__ void FindMin(float* InputData, float* MinList, int Vnum, int Len)
{
	int i = threadIdx.x + blockDim.x * blockIdx.x;
	int j = threadIdx.y + blockDim.y * blockIdx.y;

	float MinValue = 0.0;

	if (j < Vnum)
	{
		for (int k = 0; k < Len; k++)
		{
			if (InputData[k + j * Len] < MinValue)
			{
				MinValue = InputData[k + j * Len];
			}
		}
		MinList[j] = MinValue;
		MinValue = 0.0;
	}
}

And the second version is:

__global__ void FindMin(float* InputData, float* MinList, int Vnum, int Len)
{
	int i = threadIdx.x + blockDim.x * blockIdx.x;
	int j = threadIdx.y + blockDim.y * blockIdx.y;

	float MinValue = 0.0;

	if (j < Vnum && i < Len)
	{
		if (InputData[i + j * Len] < MinValue)
		{
			MinValue = InputData[i + j * Len];
		}
		MinList[j] = MinValue;
		MinValue = 0.0;
	}
}

It’s weird: I used the same grid size, block size, and parameters for both, but the second version does not produce the right results. I don’t know why this happens. Can you give me a hand?
The second version is obviously faster, so I would prefer to use it.

What I find really strange is that when I wrote global functions the same way to get the maximum of an array, both versions below gave the right results:
1.

__global__ void FindMax(float* InputData, float* MaxList, int Vnum, int Len)
{
	int i = threadIdx.x + blockDim.x * blockIdx.x;
	int j = threadIdx.y + blockDim.y * blockIdx.y;

	float MaxValue = 0.0;

	if (j < Vnum)
	{
		for (int k = 0; k < Len; k++)
		{
			if (InputData[k + j * Len] > MaxValue)
			{
				MaxValue = InputData[k + j * Len];
			}
		}
		MaxList[j] = MaxValue;
		MaxValue = 0.0;
	}
}

2.

__global__ void FindMax(float* InputData, float* MaxList, int Vnum, int Len)
{
	int i = threadIdx.x + blockDim.x * blockIdx.x;
	int j = threadIdx.y + blockDim.y * blockIdx.y;

	float MaxValue = 0.0;

	if (j < Vnum && i < Len)
	{
		if (InputData[i + j * Len] > MaxValue)
		{
			MaxValue = InputData[i + j * Len];
		}
		MaxList[j] = MaxValue;
		MaxValue = 0.0;
	}
}

I assume you want to compute the minimum value in each row of a matrix.

Kernel 2 has several issues.
First, say the input data is 1.0, 2.0, 3.0, 4.0 and you use 4 threads. Each thread will see that its value is greater than MinValue (which starts at 0.0) and will not update it. MinValue needs to be initialized with the largest float value (FLT_MAX from <cfloat>) instead.

Second, assume MinValue were updated correctly, so that it is 1.0, 2.0, 3.0, 4.0 for the four threads, respectively. Each thread would then write its own value to the same output element, MinList[j]. But you need to compute the minimum of all four values and output only that value.
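The first point can be sketched as follows, assuming a row-major matrix with Vnum rows of Len elements as in the question. This keeps the one-thread-per-row loop of the first kernel but starts from FLT_MAX, so rows that contain only positive values are handled too. The names FindMinPerRow and rowMinsSimple are made up for this illustration, and error checking is left out.

```cuda
#include <cassert>
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

// One thread per row: thread j scans row j sequentially.
__global__ void FindMinPerRow(const float* InputData, float* MinList, int Vnum, int Len)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;  // row index
    if (j < Vnum)
    {
        float MinValue = FLT_MAX;  // identity element of min, instead of 0.0
        for (int k = 0; k < Len; k++)
        {
            MinValue = fminf(MinValue, InputData[k + j * Len]);
        }
        MinList[j] = MinValue;  // exactly one thread writes each output element
    }
}

// Hypothetical host wrapper (error checking omitted for brevity).
std::vector<float> rowMinsSimple(const std::vector<float>& h_in, int Vnum, int Len)
{
    assert(h_in.size() == static_cast<size_t>(Vnum) * Len);

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, Vnum * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (Vnum + threads - 1) / threads;  // enough threads for all rows
    FindMinPerRow<<<blocks, threads>>>(d_in, d_out, Vnum, Len);

    std::vector<float> h_out(Vnum);
    cudaMemcpy(h_out.data(), d_out, Vnum * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

For the single row 1.0, 2.0, 3.0, 4.0 from the example, rowMinsSimple({1, 2, 3, 4}, 1, 4) yields 1.0, whereas the 0.0 initialization would report 0.0.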

If you want to implement the operation on your own, there are many resources on how to do a parallel reduction. Otherwise, you could just use a library like CUB or Thrust.
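As a rough illustration of what such a hand-written parallel reduction could look like, here is a minimal sketch that assigns one thread block to each row and combines per-thread minima in shared memory. It is only one of several possible designs; the names RowMinReduce, rowMinsReduce and kThreads are hypothetical, the block size is assumed to be a power of two, and error checking is omitted.

```cuda
#include <cassert>
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // threads per block; assumed to be a power of two

// One block per row: launch with gridDim.x == Vnum and blockDim.x == kThreads.
__global__ void RowMinReduce(const float* InputData, float* MinList, int Len)
{
    __shared__ float partial[kThreads];
    const int row = blockIdx.x;

    // Step 1: each thread folds a strided slice of the row into a private minimum.
    float local = FLT_MAX;
    for (int k = threadIdx.x; k < Len; k += blockDim.x)
    {
        local = fminf(local, InputData[k + row * Len]);
    }
    partial[threadIdx.x] = local;
    __syncthreads();

    // Step 2: tree reduction in shared memory, halving the active threads each round.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
        {
            partial[threadIdx.x] = fminf(partial[threadIdx.x], partial[threadIdx.x + stride]);
        }
        __syncthreads();
    }

    // Step 3: a single thread writes the row result.
    if (threadIdx.x == 0)
    {
        MinList[row] = partial[0];
    }
}

// Hypothetical host wrapper (error checking omitted for brevity).
std::vector<float> rowMinsReduce(const std::vector<float>& h_in, int Vnum, int Len)
{
    assert(h_in.size() == static_cast<size_t>(Vnum) * Len);

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, Vnum * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);

    RowMinReduce<<<Vnum, kThreads>>>(d_in, d_out, Len);

    std::vector<float> h_out(Vnum);
    cudaMemcpy(h_out.data(), d_out, Vnum * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Unlike the second kernel in the question, only thread 0 of each block writes MinList[row], and it does so after all threads of the block have contributed their values.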


Thank you for your help. You said "there are many resources on how to do a parallel reduction". Can you give me some references?
Also, why didn’t the kernel "FindMax" show the same problem as the kernel "FindMin"? The logic is the same.

FindMax does not work correctly either. You do not compute the max of all values. It is just a coincidence that the result looks right for your specific input and number of threads.


OK, I see. Thank you very much!

I read your reply once again. It seems that there are standard ways to find the minimum and maximum of a matrix in the official documentation. Are there any such solutions? If there are, could you point me to some? If not, just ignore my question.

You are looking for a segmented reduction.

It could be implemented via thrust::reduce_by_key (see "Reductions" in the Thrust documentation). You would need to generate the row number (the key) for each entry in the matrix.
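A minimal sketch of the thrust::reduce_by_key approach, assuming a row-major matrix with Vnum rows of Len elements. Instead of storing the keys, it computes the row number on the fly from the flat index with a transform iterator. RowOfIndex, MinOp and rowMinsThrust are hypothetical names introduced only for this example.

```cuda
#include <cassert>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>

// Maps a flat element index to its row number (the key).
struct RowOfIndex
{
    int Len;
    __host__ __device__ int operator()(int idx) const { return idx / Len; }
};

// Binary minimum used as the reduction operator.
struct MinOp
{
    __host__ __device__ float operator()(float a, float b) const { return a < b ? a : b; }
};

// Hypothetical helper: returns one minimum per row.
std::vector<float> rowMinsThrust(const std::vector<float>& h_in, int Vnum, int Len)
{
    assert(h_in.size() == static_cast<size_t>(Vnum) * Len);

    thrust::device_vector<float> data(h_in.begin(), h_in.end());
    thrust::device_vector<float> mins(Vnum);

    // Keys 0,0,...,0,1,1,...,1,... generated lazily, one per matrix entry.
    auto keys = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                                RowOfIndex{Len});

    // Consecutive entries with equal keys (same row) are reduced with MinOp.
    // The output keys are not needed here, so they are discarded.
    thrust::reduce_by_key(keys, keys + Vnum * Len,
                          data.begin(),
                          thrust::make_discard_iterator(),
                          mins.begin(),
                          thrust::equal_to<int>(),
                          MinOp());

    std::vector<float> h_out(Vnum);
    thrust::copy(mins.begin(), mins.end(), h_out.begin());
    return h_out;
}
```

Swapping MinOp for a maximum functor would give the per-row maximum with the same structure.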

It could also be implemented via cub::DeviceSegmentedReduce::Min (see the cub::DeviceSegmentedReduce struct reference in the CUB documentation). Here you need to pass the begin and end offsets of each segment.
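A minimal sketch of the CUB variant, again assuming a row-major matrix with Vnum rows of Len elements, so that row r occupies the index range [r * Len, (r + 1) * Len). A single offsets array of Vnum + 1 entries serves as both the begin and the end offsets. The helper name rowMinsCub is made up for this example, and error checking is omitted.

```cuda
#include <cassert>
#include <vector>
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Hypothetical helper: returns one minimum per row (error checking omitted).
std::vector<float> rowMinsCub(const std::vector<float>& h_in, int Vnum, int Len)
{
    assert(h_in.size() == static_cast<size_t>(Vnum) * Len);

    // Segment r starts at offsets[r] and ends (exclusive) at offsets[r + 1].
    std::vector<int> h_offsets(Vnum + 1);
    for (int r = 0; r <= Vnum; r++)
    {
        h_offsets[r] = r * Len;
    }

    float* d_in = nullptr;
    float* d_out = nullptr;
    int* d_offsets = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, Vnum * sizeof(float));
    cudaMalloc(&d_offsets, h_offsets.size() * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_offsets, h_offsets.data(), h_offsets.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    // First call with a null workspace only queries the required temp storage size.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceSegmentedReduce::Min(d_temp, temp_bytes, d_in, d_out,
                                    Vnum, d_offsets, d_offsets + 1);
    cudaMalloc(&d_temp, temp_bytes);

    // Second call performs the per-segment minimum.
    cub::DeviceSegmentedReduce::Min(d_temp, temp_bytes, d_in, d_out,
                                    Vnum, d_offsets, d_offsets + 1);

    std::vector<float> h_out(Vnum);
    cudaMemcpy(h_out.data(), d_out, Vnum * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_temp);
    cudaFree(d_offsets);
    cudaFree(d_out);
    cudaFree(d_in);
    return h_out;
}
```

cub::DeviceSegmentedReduce::Max takes the same arguments, so the FindMax case can be handled the same way.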
