How to reduce time cost

Hi,
I have been stuck on this code for a long time and have no idea how to reduce its running time. Here is my code. Please help.

struct ShapeUnitStruct
{
	int* sensorOffsets;  // The offset value of each point on the sensor relative to the center point
	int  begin = 0;
	int  end = 0;
};

typedef struct _SHAPE_POS
{
	int				        type;
	ShapeUnitStruct*	    pRayStruct = nullptr;
	ShapeUnitStruct*	    pSensorStruct = nullptr;
}SHAPE_POS;

__device__
void DetectKernel(unsigned char* diImgData, long long iImgPixelIdx, int iThresholdVal,
	unsigned char iMinThreshold, unsigned char iMaxThreshold, SHAPE_POS* diShape, int iShapeIdx, 
	long long iRsltIdx, short* doRslt)
{
	int centerColor = iThresholdVal;
	int gray = 0;
	long long edgeIndex = 0;
	int isExcp = 0;
	int isExcpIdx = 0;
	int weight = 0;
	int sensorRslt = 0;
	int sensorExcpDir = 0;

	// traverse all directions
	for (int dirIdx = 0; dirIdx < 8; ++dirIdx)
	{
		int SensorExcpCount = 0;
		for (int sensorIdx = 0; sensorIdx < 1; ++sensorIdx)
		{
			// sensor begin idx
			int sensorBegin = diShape[iShapeIdx].pSensorStruct[sensorIdx].begin;
			// sensor end idx
			int sensorEnd = diShape[iShapeIdx].pSensorStruct[sensorIdx].end;
			// sensor direction
			int sensorDir = dirIdx;

			for (int lenIdx = sensorBegin; lenIdx <= sensorEnd; ++lenIdx)
			{
				// gray value of pixel in sensor
				int tempVal = *(diImgData + iImgPixelIdx + diShape[iShapeIdx].pSensorStruct[sensorIdx].sensorOffsets[lenIdx]);
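				// Binarization: isEqual is 1 when the pixel's in-range flag (iMinThreshold <= tempVal <= iMaxThreshold) matches centerColor (assumed 0 or 1)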
				int isEqual = !((((iMinThreshold - 1 - tempVal) ^ (iMaxThreshold - tempVal)) < 0) ^ centerColor);
				if (isEqual == 0)
				{
					SensorExcpCount++;
					if (weight < (sensorEnd - lenIdx + 1))
					{
						weight = sensorEnd - lenIdx + 1;
						sensorExcpDir = sensorDir;
					}
					break;
				}
			}
		}
		if (SensorExcpCount == 0)
		{
			continue;
		}

		int isRay = 0;
		for (int rayIdx = 0; rayIdx < 2; ++rayIdx)
		{
			// ray direction
			int rayDir;
			if (rayIdx == 0)
			{
				rayDir = (dirIdx + 7) % 8;
			}
			else
			{
				rayDir = (dirIdx + 1) % 8;
			}
 
			// ray begin
			int rayBegin = diShape[iShapeIdx].pRayStruct[rayIdx].begin;
			// ray end
			int rayEnd = diShape[iShapeIdx].pRayStruct[rayIdx].end;

			for (int lenIdx = rayBegin; lenIdx <= rayEnd; ++lenIdx)
			{
				// gray value of pixel in ray
				int tempVal = *(diImgData + iImgPixelIdx + diShape[iShapeIdx].pRayStruct[rayIdx].sensorOffsets[lenIdx]);
				// Binarization: here isEqual is 1 when the pixel's in-range flag differs from centerColor (assumed 0 or 1)
				int isEqual = ((((iMinThreshold - 1 - tempVal) ^ (iMaxThreshold - tempVal)) < 0) ^ centerColor);
				if (isEqual == 1)
				{
					isRay += isEqual;
					break;
				}
			}
			if (isRay >= 1)
			{
				break;
			}
		}

		if (isRay == 0)
		{
			doRslt[iRsltIdx] = 1;
			doRslt[iRsltIdx] |= weight << 1;
			doRslt[iRsltIdx] |= centerColor << 7;
			doRslt[iRsltIdx] |= sensorExcpDir << 8;
		}
	}
}

The number of pixels on a ray/sensor ranges from 1 to 32. SHAPE_POS stores different types of information, as the picture shows. The original data structure is too complex, so I simplified the code a bit. I also read the filter demo, but it seems there is not enough shared memory for my application.

What is the definition of SHAPE_POS?

When you used the CUDA profiler to analyze this kernel, what did it identify as the bottleneck(s)? What adjustments did you make based on the profiler data?

Glancing at the code, I would guess that performance is dominated by memory accesses. The indirect addressing via the indexes doesn’t look like it would produce coalesced accesses, and you might want to think about how data structures can be arranged to make best use of the “base + thread_index” addressing idiom optimal for coalesced access.
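One possible rearrangement, as a minimal host-side sketch rather than a definitive fix: copy the offsets of every ray and sensor into a single contiguous table, so the kernel indexes one array instead of following `SHAPE_POS` to `ShapeUnitStruct` to `sensorOffsets`. The names `FlatUnit`, `FlatShape`, and `appendUnit` are hypothetical and do not appear in the original post. Whether this actually helps should be checked with the profiler.

```cpp
#include <vector>

// Hypothetical flattened layout (not from the original post).
struct FlatUnit {
    int first;  // index of this unit's first offset inside FlatShape::offsets
    int count;  // number of offsets belonging to this unit
};

struct FlatShape {
    std::vector<int> offsets;     // offsets of all rays/sensors, stored back to back
    std::vector<FlatUnit> units;  // one entry per ray or sensor
};

// Append one ray or sensor, given its offset array and the inclusive
// [begin, end] range, mirroring the begin/end fields of ShapeUnitStruct.
inline void appendUnit(FlatShape& shape, const int* unitOffsets, int begin, int end)
{
    FlatUnit unit;
    unit.first = static_cast<int>(shape.offsets.size());
    unit.count = end - begin + 1;
    shape.offsets.insert(shape.offsets.end(), unitOffsets + begin, unitOffsets + end + 1);
    shape.units.push_back(unit);
}
```

After flattening, `shape.offsets.data()` and `shape.units.data()` can each be uploaded with a single `cudaMemcpy`. Image reads of the form `diImgData[iImgPixelIdx + offset]` then coalesce when consecutive threads handle consecutive pixels and apply the same offset, which is an assumption about how `iImgPixelIdx` is derived from the thread index.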

Generally speaking, memory accesses narrower than the width of a register should be avoided as they can be inefficient. Try loading and processing four pixels at a time via uchar4 (note alignment requirements!). By the same token you would then store results via a ushort4.
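A rough sketch of the four-pixels-at-a-time idea, assuming the address being loaded is 4-byte aligned (which the arbitrary offsets in the posted kernel do not guarantee). The helper names `inRange`, `loadFourPixels`, and `inRangeMask4` are hypothetical. The host-only `uchar4` stand-in exists only so the sketch also compiles without nvcc; under nvcc the built-in `uchar4` is used.

```cpp
#ifndef __CUDACC__
// Host-only stand-in so the sketch compiles without nvcc; CUDA defines uchar4 itself.
struct alignas(4) uchar4 { unsigned char x, y, z, w; };
#define HOST_DEVICE
#else
#define HOST_DEVICE __host__ __device__
#endif

// Returns 1 if lo <= v <= hi, else 0. For lo <= hi this matches the posted
// sign test (((lo - 1 - v) ^ (hi - v)) < 0).
HOST_DEVICE inline int inRange(int v, int lo, int hi)
{
    return (v >= lo) && (v <= hi);
}

// Load four neighbouring pixels with one 32-bit access.
// Assumption: (img + idx) is 4-byte aligned; a misaligned uchar4 load is invalid on the GPU.
HOST_DEVICE inline uchar4 loadFourPixels(const unsigned char* img, long long idx)
{
    return *reinterpret_cast<const uchar4*>(img + idx);
}

// Bit i of the result is set when pixel i of the group lies in [lo, hi].
HOST_DEVICE inline unsigned inRangeMask4(uchar4 p, int lo, int hi)
{
    return  static_cast<unsigned>(inRange(p.x, lo, hi))
         | (static_cast<unsigned>(inRange(p.y, lo, hi)) << 1)
         | (static_cast<unsigned>(inRange(p.z, lo, hi)) << 2)
         | (static_cast<unsigned>(inRange(p.w, lo, hi)) << 3);
}
```

The design choice here is to keep the load and the per-pixel test separate, so the wide load can be swapped back to byte loads wherever alignment cannot be guaranteed.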

It may not matter in a kernel this simple, but generally you would want to use the __restrict__ modifier for a kernel’s pointer arguments, provided the data accessed via the pointers is non-overlapping, which would appear to be the case here.
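For illustration, here is what restrict-qualified pointer parameters could look like, assuming the input image and the result buffer never overlap. The function `binarize` is a hypothetical stand-in, written as a plain loop so it also compiles with a host compiler (GCC and Clang accept `__restrict__`); in a CUDA kernel the loop index would be replaced by the thread's pixel index.

```cpp
// __restrict__ promises the compiler that img and rslt do not alias, so loads
// from img may be reordered or kept in registers across stores to rslt.
// Assumption (not stated in the original post): the two buffers never overlap.
void binarize(const unsigned char* __restrict__ img,
              short* __restrict__ rslt,
              int n, int lo, int hi)
{
    for (int i = 0; i < n; ++i)
    {
        const int v = img[i];
        // 1 when the pixel lies in [lo, hi], otherwise 0
        rslt[i] = static_cast<short>(v >= lo && v <= hi);
    }
}
```

Combining `const` with `__restrict__` on read-only inputs may additionally let the CUDA compiler route those loads through the read-only data cache on GPUs that have one.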