Can someone help me sort out why the unrolled loop is slower in my case?

Hi All!
This is my first time posting here, so please forgive any inconsistency in this post.

I have to calculate the quadrature of the signal in the first array around the peak positions contained in the second array.

           (float *)volts                            (int *)peaks               
s  [0..........................8159]      [0..........................215]
e  [0..........................8159]      [0..........................215]
g  [0..........................8159]      [0..........................215]
m  [0..........................8159]      [0..........................215]
e                                                                          
n          * * * * * *                            * * * * * *              
t                                                                          
_  [0..........................8159]      [0..........................215]
c                                ^                                      ^
n                       segment_size-1                            peaks_cnt-1
t

The function “quadratures” runs with the maximum number of threads allowed by the NVIDIA card and blocks = segment_cnt / max_threads + 1.
The function “quadratures_v2” runs with blocks = segment_cnt and threads = peaks_cnt.
Why is the second variant slower than the first? The segment count is >= 1e5.

Both functions are below. Thanks in advance!

// overall number of threads = u32SegmentCount;
__global__ void quadratures(float *volts, int *peaks, int peaks_cnt, volatile float *quads, 
                            int segment_cnt, int segment_size, double scaler, int integration_interval) {
	INT64 n = blockDim.x * blockIdx.x + threadIdx.x;
	UINT8 *vBuffer = (UINT8 *)volts;
	UINT8 *qBuffer = (UINT8 *)quads;

	if(n < segment_cnt) {
		vBuffer += segment_size * sizeof(float) * n;
		qBuffer += peaks_cnt * sizeof(float) * n;
		float *p = (float *)vBuffer;
		float *q = (float *)qBuffer;
		// Integration for each pulse in a segment
		for(int i = 0; i < peaks_cnt; i++) {
			float voltage_integral = 0.0;
			// Integration around the pulse position of the signal
			for(int j = 0; j < integration_interval; j++) {
				int interval_position = peaks[i] - (int)((float)integration_interval / 2.0) + j;
				voltage_integral += p[interval_position];
			}
			q[i] = (float)(voltage_integral * scaler);
		}
	}
}

// blocks = segment count and threads = peaks count
__global__ void quadratures_v2(float *volts, int *peaks, int peaks_cnt, volatile float *quads, 
                               int segment_cnt, int segment_size, double scaler, int integration_interval) {
	UINT8 *vBuffer = (UINT8 *)volts;
	UINT8 *qBuffer = (UINT8 *)quads;

	if(blockIdx.x < segment_cnt && threadIdx.x < peaks_cnt) {
		vBuffer += segment_size * sizeof(float) * blockIdx.x;
		qBuffer += peaks_cnt * sizeof(float) * blockIdx.x;
		float *p = (float *)vBuffer;
		float *q = (float *)qBuffer;
		int i = threadIdx.x;
		float voltage_integral = 0.0;
		// Integration around the pulse position of the signal
		for(int j = 0; j < integration_interval; j++) {
			int interval_position = peaks[i] - (int)((float)integration_interval / 2.0) + j;
			voltage_integral += p[interval_position];
		}
		q[i] = (float)(voltage_integral * scaler);
	}
}

There would appear to be an occupancy difference between the two cases, with the v2 kernel possibly achieving lower occupancy. I suppose this might make it run a bit slower, but that’s really just a guess. I wouldn’t expect a large difference.
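If you want to quantify that rather than guess, the occupancy API will report the theoretical limit for each launch configuration. A minimal sketch (my illustration only; it assumes it sits in the same .cu file as your kernels and that peaks_cnt is 216 as in your diagram):

// Sketch only: queries how many blocks of each kernel fit on one SM.
// Requires <cstdio> for printf; place it in the same .cu file as the kernels above.
void report_occupancy(int peaks_cnt)
{
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);

	int blocks_v1 = 0, blocks_v2 = 0;
	// 1024 threads per block, as in the quadratures() launch
	cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_v1, quadratures, 1024, 0);
	// peaks_cnt threads per block, as in the quadratures_v2() launch
	cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_v2, quadratures_v2, peaks_cnt, 0);

	printf("quadratures   : %d blocks/SM = %d threads/SM\n", blocks_v1, blocks_v1 * 1024);
	printf("quadratures_v2: %d blocks/SM = %d threads/SM\n", blocks_v2, blocks_v2 * peaks_cnt);
	printf("device limit  : %d threads/SM\n", prop.maxThreadsPerMultiProcessor);
}

Comparing the two threads/SM numbers against the device limit tells you directly whether the small blocks of v2 are actually limiting occupancy.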

For performance questions I usually suggest:

  1. A complete code
  2. The system you are running on (host OS, CUDA version, GPU driver version, GPU)
  3. Compilation commands/instructions
  4. How you are measuring timing (see the event-based timing sketch after this list).

Do as you wish of course.
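On point 4 in particular: kernel launches are asynchronous, so a host timer only measures the launch overhead unless you synchronize before stopping it. A CUDA-event-based measurement avoids that pitfall. A minimal sketch (my illustration only; the variable names are placeholders for whatever your launch actually uses):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
quadratures <<<blocks, threads>>> ((float *)volts, (int *)d_peaks, peaks_cnt, (float *)quads,
                                   u32SegmentCount, i64SegmentSize, scaler, integration_interval);
cudaEventRecord(stop);
cudaEventSynchronize(stop);             // wait until the kernel has actually finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds
printf("quadratures: %.3f ms\n", ms);   // requires <cstdio>

cudaEventDestroy(start);
cudaEventDestroy(stop);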

  1. Code snippets that launch the CUDA kernel for each variant are below.
    The first case, threads = 1024 (the maximum value for this card):
int blocks = (u32SegmentCount % threads == 0)?(u32SegmentCount / threads):(u32SegmentCount / threads) + 1;

quadratures <<<blocks, threads >>> ((float *)volts, (int *)d_peaks, peaks_cnt, (float *)quads, u32SegmentCount, i64SegmentSize, scaler, integration_interval);

The second case:

int blocks_cnt = u32SegmentCount;
int threads_cnt = peaks_cnt;

quadratures_v2 <<<blocks_cnt, threads_cnt >>> ((float *)volts, (int *)d_peaks, peaks_cnt, (float *)quads, u32SegmentCount, i64SegmentSize, scaler, integration_interval);
  2. Windows Server 2016, GPU Device 0: "Ampere" with compute capability 8.0, NVIDIA A100-PCIE-40GB, CUDA SDK 11.4, driver version 30.0.14.7111
  3. C:\Program Files (x86)\Gage\CompuScope\CompuScope C SDK\C Samples\Advanced\GPU\GageAcquireThruGPU-Simple>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\bin\nvcc.exe" -gencode=arch=compute_80,code="sm_80,compute_80" -gencode=arch=compute_86,code="sm_86,compute_86" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64" -x cu -I"C:\Program Files (x86)\Gage\CompuScope\include" -I"…\C Common" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\include" -G --keep-dir x64\Debug -maxrregcount=0 --machine 64 --compile -cudart static -g -DWIN32 -DWIN64 -D_DEBUG -D_CONSOLE -D_CONSOLE -D_CRT_SECURE_NO_DEPRECATE -D_CRT_NON_CONFORMING_SWPRINTFS -D_MBCS -Xcompiler "/EHsc /W4 /nologo /Od /Fdx64\Debug\vc142.pdb /FS /Zi /RTC1 /MDd " -o x64\Debug\DSPEquation_Simple_GPU.cu.obj "C:\Program Files (x86)\Gage\CompuScope\CompuScope C SDK\C Samples\Advanced\GPU\GageAcquireThruGPU-Simple\DSPEquation_Simple_GPU.cu"
  4. I’m using QueryPerformanceCounter().


Nothing positive happened when I tried to increase occupancy to the maximum allowed by this card. Even worse, this variant was slower still. The latest variant is below, with blockSize = 1024 and gridSize = (peaks_cnt * u32SegmentCount + blockSize - 1) / blockSize. It is more than two times slower! What am I doing wrong?

__global__ void quadratures_v2_1(float *volts, int *peaks, int peaks_cnt, volatile float *quads, 
                                 int segment_cnt, int segment_size, double scaler, int integration_interval) {
	// number of threads = segment_cnt * peaks_cnt
	int n = blockDim.x * blockIdx.x + threadIdx.x;
	UINT8 *vBuffer = (UINT8 *)volts;
	UINT8 *qBuffer = (UINT8 *)quads;

	if(n < segment_cnt * peaks_cnt) {
		int row = n / peaks_cnt;
		vBuffer += segment_size * sizeof(float) * row;
		qBuffer += peaks_cnt * sizeof(float) * row;
		float *p = (float *)vBuffer;
		float *q = (float *)qBuffer;
		int i = n % peaks_cnt;

		// Integration around the pulse position of the signal
		float voltage_integral = 0.0;
		int start_pos = peaks[i] - (integration_interval >> 1);
		for(int j = start_pos; j < start_pos + integration_interval; j++) {
			voltage_integral += p[j];
		}
		q[i] = (float)(voltage_integral * scaler);
	}
}
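The corresponding launch, following the description above (a sketch; the argument list mirrors the earlier calls):

// sketch of the launch described above; same arguments as the earlier launches
int blockSize = 1024;
int gridSize = (peaks_cnt * u32SegmentCount + blockSize - 1) / blockSize;

quadratures_v2_1 <<<gridSize, blockSize>>> ((float *)volts, (int *)d_peaks, peaks_cnt, (float *)quads, u32SegmentCount, i64SegmentSize, scaler, integration_interval);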

At the point where one wonders about a performance difference between two pieces of code for which there seems to be no obvious reason, it is a good idea to fire up the CUDA profiler, look for salient differences in the profile data, and then work backwards from there.

With modern GPUs, for many real-life codes, data movement has the largest impact on performance of all potential factors. That may not be the case here, but it is at least conceivable that unrolling a loop allows the compiler to schedule more loads early. While this is generally beneficial (it increases latency tolerance), that is not always the case.
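One inexpensive experiment along those lines (my sketch, not something taken from the code above) is to pin the unrolling of the inner loop explicitly and time the two builds against each other:

	// Inside quadratures(): "#pragma unroll 1" forbids unrolling, while
	// "#pragma unroll 4" would request an unroll factor of 4. Comparing the
	// two isolates the effect of early load scheduling from everything else.
	#pragma unroll 1
	for(int j = 0; j < integration_interval; j++) {
		int interval_position = peaks[i] - (int)((float)integration_interval / 2.0) + j;
		voltage_integral += p[interval_position];
	}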

The CUDA profiler can provide many useful insights, and one has to spend some quality time with it to learn to exploit it to its full potential. I am not an expert in that regard either because I did not have a need to use it very often in recent years.

Thanks for the idea.
I’ve tried to profile those specific kernels (quadratures() and quadratures_v2_1()). It is a little too difficult for me to interpret all of the profiler output; I’m a newbie here. :( I see a lot of widely varying values, but how should they be used to improve performance in my case? I’d be very grateful if anyone could help me interpret those results.
