The first kind of global function is below. I used this function for my calculation and it produced correct results.

__global__ void FindMin(float* InputData, float* MinList, int Vnum, int Len)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    float MinValue = 0.0;
    if (j < Vnum)
    {
        for (int k = 0; k < Len; k++)
        {
            if (InputData[k + j * Len] < MinValue)
            {
                MinValue = InputData[k + j * Len];
            }
        }
        MinList[j] = MinValue;
        MinValue = 0.0;
    }
}

And the second kind is:

__global__ void FindMin(float* InputData, float* MinList, int Vnum, int Len)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    float MinValue = 0.0;
    if (j < Vnum && i < Len)
    {
        if (InputData[i + j * Len] < MinValue)
        {
            MinValue = InputData[i + j * Len];
        }
        MinList[j] = MinValue;
        MinValue = 0.0;
    }
}

It's weird: I used the same grid size and block size, and the parameters were the same, but the second kind of global function did not produce correct results. I don't know why this happens; can you give me a hand?
The second kind of global function is obviously faster, so I would prefer to use it.

I find it really strange. When I wrote global functions the same way to get the maximum of an array, both kinds of global function below produced correct results:
1.

__global__ void FindMax(float* InputData, float* MaxList, int Vnum, int Len)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    float MaxValue = 0.0;
    if (j < Vnum)
    {
        for (int k = 0; k < Len; k++)
        {
            if (InputData[k + j * Len] > MaxValue)
            {
                MaxValue = InputData[k + j * Len];
            }
        }
        MaxList[j] = MaxValue;
        MaxValue = 0.0;
    }
}

2.

__global__ void FindMax(float* InputData, float* MaxList, int Vnum, int Len)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    float MaxValue = 0.0;
    if (j < Vnum && i < Len)
    {
        if (InputData[i + j * Len] > MaxValue)
        {
            MaxValue = InputData[i + j * Len];
        }
        MaxList[j] = MaxValue;
        MaxValue = 0.0;
    }
}

I assume you want to compute the minimum value in each row of a matrix.

Kernel 2 has several issues.
Let's say the input data is 1.0, 2.0, 3.0, 4.0 and you use 4 threads. Then each thread will notice that its value is greater than MinValue (which starts at 0.0) and will not update it. MinValue needs to be initialized with FLT_MAX.

Now, assume MinValue could be updated and is 1.0, 2.0, 3.0, 4.0 for the four threads, respectively. Then each thread will attempt to write its value to the output, so MinList[j] ends up holding whichever value was written last. But you need to compute the minimum of all four values and output only that value.

If you want to implement the operation on your own, there are many resources on how to do a parallel reduction. Otherwise, you could just use a library like CUB or Thrust.

Thank you for your help. You said "there are many resources on how to do a parallel reduction"; can you give me some references?
Besides, why didn't the kernel "FindMax" run into the same problem as the kernel "FindMin"? The logic is the same.

I read your reply once again. It seems that there are some standard ways to find the minimum and maximum of a matrix in the official documentation. Are there any ready-made solutions? If there are, could you offer some to me? If there are not, just ignore what I said.

It could be implemented via thrust::reduce_by_key (see the Reductions section of the Thrust documentation). You would need to generate the row number (the key) for each entry in the matrix.