Performance degradation as task size grows

I am writing code to solve the Lamé equation for anisotropic waveguides and have come across the following situation: the performance of the code that calculates part of the Green's matrix drops after a certain task size is reached.
I understand why performance is jagged at small sizes - tail effects - but what causes the drop at large sizes? In my understanding, performance should asymptotically approach its peak as the size grows.

CUDA 11.8 Win 10 GTX1080 (PC)
CUDA 11.8 Win 10 GTX840M (Laptop)

Part of my code:


#define hv __host__ void

hv test_speed_new_v3(int lo = 1, int hi = 1025, int step = 1, int repetitions = 10)
{
//global variables, nothing interesting
	timer_data_ timer;

	t_device_config device;
	t_math_solver_configs  solver;
	int dim1 = 1;
	int dim2 = 1;
	float default_override = 4;

	int layers = 1;

	int mesh_kind = MESH_GAMMA_ALPHA;
	cuDoubleComplex g1 = make_cuDoubleComplex(1.000000, 0.0);
	cuDoubleComplex g2 = make_cuDoubleComplex(1.000001, 0.0);
	cuDoubleComplex a1 = make_cuDoubleComplex(1.0, 0.0);
	cuDoubleComplex a2 = make_cuDoubleComplex(10.0, 0.0);

	double m__ = 1;
	cuDoubleComplex omega = make_cuDoubleComplex(0.628 * m__, 0.0 * m__);
	cuDoubleComplex z = make_cuDoubleComplex(0.0, 0.0);

	float override_ = default_override;
	int smem = 64 * 1024;
	int th = 128;
	int regs = 128 * 1024;
	int preset = NO_PRESET_BLOCKS;

	int thuse;
	int smemuse;
	int reguse;

	t_math_solver_configs* solver_on_device;

	device_init(device);
	init_empty_vals(device);
	cuda_streams_init(device);

	init_timer(&timer);

	cudaMalloc(&solver_on_device, sizeof(t_math_solver_configs));
	cuda_check(cudaDeviceSynchronize());

//this part is measuring time for different problems
	for (int cur_pc = lo; cur_pc < hi; cur_pc += step)
	{
		solver.dim1 = dim1;
		solver.dim2 = cur_pc ;

//allocating mem
		build_spec_pools_Nxlay_v3_expandedME(device, solver, 1);

		cudaMemcpy(solver_on_device, &solver, sizeof(t_math_solver_configs), cudaMemcpyHostToDevice);

//initialize physics constants
		k1x1_init_com0_1l_caller<<<1, 1>>>(solver_on_device);

		thuse = 32;
		reguse = 110;
		smemuse = 1024;
		int blocks = GetBlocksCountf4(device.device_SM, smem, smemuse, th, thuse, regs, reguse, override_, dim1 * dim2, preset);

//generate mesh
		knxm_generate_points_v2<<<blocks, 32>>>(solver_on_device, g1, g2, a1, a2, omega, z, dim1, dim2, mesh_kind);
		cudaMemcpy(&solver, solver_on_device, sizeof(t_math_solver_configs), cudaMemcpyDeviceToHost);

		start_timer(&timer);

//time consuming part
		for (int rep_id = 0; rep_id < repetitions; rep_id++)
		{
			k1x1_smart_launch_ENEV<<<1, 1>>>(solver_on_device, dim1 * dim2 * layers, device.device_SM);
			k1x1_smart_launch_BLOCKS_V2<<<1, 1>>>(solver_on_device, dim1 * dim2, device.device_SM);
			k1x1_smart_launch_SOLVER<<<1, 1>>>(solver_on_device, dim1 * dim2, device.device_SM);
		}

		double elapsed = stop_timer_avg(&timer, (double)repetitions);

		double points_per_sec = ((double)dim2 * 1000000) / elapsed;

		printf("count %d elapsed %lf us perf pps %lf \n", dim2, elapsed, points_per_sec);

		deallocate_pool(device, solver.multi_pool);
		cuda_check(cudaDeviceSynchronize(), "post cycle check", true, __FILE__, __LINE__);
	}
	kill_timer(&timer);
	device_release(device);
}

Code for the timer:



#define hv __host__ void

struct timer_data_
{
	double executed_us;
	cudaEvent_t start, stop;
};


hv init_timer(timer_data_* timer)
{
	cuda_check(cudaEventCreate(&timer->start));
	cuda_check(cudaEventCreate(&timer->stop));
}

hv start_timer(timer_data_* timer)
{
	cuda_check(cudaEventRecord(timer->start, 0));
	cuda_check(cudaEventRecord(timer->stop, 0));
}

__host__ double stop_timer(timer_data_* timer)
{
	float time;
	cuda_check(cudaEventRecord(timer->stop, 0));
	cuda_check(cudaEventSynchronize(timer->stop));
	cuda_check(cudaEventElapsedTime(&time, timer->start, timer->stop));
	timer->executed_us = (double)time * 1000.0;	// cudaEventElapsedTime returns ms; convert to us
	return timer->executed_us;
}

__host__ double stop_timer_avg(timer_data_* timer, double avg)
{
	float time;
	cuda_check(cudaEventRecord(timer->stop, 0));
	cuda_check(cudaEventSynchronize(timer->stop));
	cuda_check(cudaEventElapsedTime(&time, timer->start, timer->stop));
	timer->executed_us = ((double)time * 1000.0) / avg;	// ms -> us, averaged over 'avg' repetitions
	return timer->executed_us;
}
 
__host__ void kill_timer(timer_data_* timer)
{
	cuda_check(cudaEventDestroy(timer->start));
	cuda_check(cudaEventDestroy(timer->stop));
}

UPD: GPU temperature stays low while computing, ~50-55 degrees Celsius.

Is this what you are measuring on the y axis?

If so (yikes!), there is nothing about a tail effect that applies to any of that.

More generally, you might want to explain how the x and y axis variables are derived from the code you have shown. Regarding the x axis, for me, anyway, “problems” or “problems count” is not self-evident from the code you have shown.

On the y axis is the following value - how many samples my GPU can process per second:
samples per sec = 1000000 / (elapsed in us / (dim1*dim2))
On the x axis is the following value - the sample count in my test:
'samples/10' = dim1*dim2/10
I made an Excel spreadsheet and calculated the performance by copying the output of my program into it.

kernel launches of <<<1,1>>> are generally not a performant way to use a GPU. There is no tail effect with a kernel launch of <<<1,1>>>.

It's just a launcher for a child grid that decides what exactly to call; it's done so that more efficient versions are called on small tasks.

gv k1x1_smart_launch_BLOCKS_V2(t_math_solver_configs* solver, int size_, int sm)
{
	int load_per_sm = size_ / sm;
	float override_ = 8;
	int smem = 64 * 1024;
	int th = 128;
	int regs = 128 * 1024;
	int preset = NO_PRESET_BLOCKS;
	int thuse = 18;
	int smemuse = 1440;
	int regs_use = 61;
	int work_size = size_;

	int optimal_blocks_max = GetBlocksCountf4(sm, smem, smemuse, th, thuse, regs, regs_use, override_, work_size, preset);


	int block_cap = sm * 128 / 2;

	if (optimal_blocks_max > block_cap) optimal_blocks_max = block_cap;

	if (load_per_sm >= 28)
	{
		k18xm_32b_blocks_reverse_ext_v4<<<optimal_blocks_max, 18>>>(solver);
		return;
	}

	if (load_per_sm >= 24)
	{
		k18xm_28b_blocks_reverse_ext_v4<<<size_, 18>>>(solver);
		return;
	}

	if (load_per_sm >= 20)
	{
		k18xm_24b_blocks_reverse_ext_v4<<<size_, 18>>>(solver);
		return;
	}

	if (load_per_sm >= 16)
	{
		k18xm_20b_blocks_reverse_ext_v4<<<size_, 18>>>(solver);
		return;
	}

	if (load_per_sm >= 12)
	{
		k18xm_16b_blocks_reverse_ext_v4<<<size_, 18>>>(solver);
		return;
	}

	if (load_per_sm >= 8)
	{
		k18xm_12b_blocks_reverse_ext_v4<<<size_, 18>>>(solver);
		return;
	}

	k18xm_8b_blocks_reverse_ext_v4<<<size_, 18>>>(solver);
	return; 
}

This code will launch a grid <<<min(sm * max_possible_per_sm, dim1 * dim2), 18>>>.
The 18-thread usage comes from the 3x6 and 6x6 small-matrix calculations.
The difference between the kernel variants is in their launch bounds:

#define gv __global__ void
gv  __launch_bounds__(18, 32)   k18xm_32b_blocks_reverse_ext_v4(t_math_solver_configs* solver)
{
...
}

gv  __launch_bounds__(18, 24)   k18xm_24b_blocks_reverse_ext_v4(t_math_solver_configs* solver)
{
...
}


__host__ int GetBlocksCountf4(int sm, int smem, int smemuse, int th, int thuse, float regs, float regs_use, float override_, int work_size, int preset)
{
	// blocks-per-SM limits implied by shared memory, threads, and registers
	float smemlim = 1024 * 1024;
	if (smemuse > 0)  smemlim = (float)smem / (float)smemuse;
	float thlim = (float)th / (float)thuse;
	float reglim = (float)regs / (float)(regs_use * thuse);

	// take the tightest limit, capped at 32 resident blocks per SM
	if (smemlim < thlim) thlim = smemlim;
	if (reglim < thlim) thlim = reglim;
	if (thlim > 32) thlim = 32;

	int blocks = (int)((float)sm * (thlim * override_));

	if (blocks > work_size) blocks = work_size;	// never launch more blocks than work items
	if (preset > 0) blocks = preset;		// an explicit preset overrides everything

	return blocks;
}
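
For concreteness, here is how I read those limits for the BLOCKS_V2 parameters (my own back-of-the-envelope arithmetic, assuming 20 SMs as on the GTX 1080; the work_size value is just illustrative):

// worked example with the BLOCKS_V2 parameters:
// smem = 64*1024, smemuse = 1440, th = 128, thuse = 18, regs = 128*1024, regs_use = 61, override_ = 8
//   smemlim = 65536 / 1440          ~  45.5
//   thlim   = 128 / 18              ~   7.1
//   reglim  = 131072 / (61 * 18)    ~ 119.4
//   thlim   = min(45.5, 7.1, 119.4) =   7.1   (the cap of 32 is not reached)
//   blocks  = 20 * 7.1 * 8          ~ 1137, then clamped to work_size if that is smaller
int example_blocks = GetBlocksCountf4(20, 64 * 1024, 1440, 128, 18,
                                      128 * 1024.0f, 61.0f, 8.0f, 10000 /* illustrative */, NO_PRESET_BLOCKS);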

Picture for elapsed time:
y axis - elapsed time per repetition in us
x axis - 'samples/10' = dim1*dim2/10

18 threads per block is also not particularly good for a GPU. Launching 32x18 threads per block may be better. You may still have “occupancy quantization” going on, which might lead to the performance concern here.

There are enough levels of abstraction here that I wasn't able to figure out the grid dimensions for the cases around dim1*dim2=500.

What is "occupancy quantization"? I can't find it on Google.
I am running up to 24 * sm blocks with 18-thread blocks on my GPU for this kernel (BLOCKS_V2),
like <<<blocks = 1000, 18>>> in the previous example.
Interestingly, I've found that the efficiency starts to drop as I get closer to the CUDA core count.
So if I understand you right, your idea is to try to improve the kernels by using N*18 threads per block instead of 18?
One of the problems with my code is that it is tied to the math formulas - the choice of threads and algorithms follows directly from the formulas I'm implementing, so choosing a different thread count is not that easy, but I can try to do something along these lines.

It's a term I just made up. Occupancy is the idea of the GPU carrying as many workers as possible. Full occupancy means the GPU is carrying as many workers (e.g. threads) as is possible.

Quantization refers to the idea that if you integer divide one number by another number, then depending on the actual numerator and denominator, the difference between the integer result and actual result may vary, due to quantization - expressing a result in fixed size units. (I’m sure there are better definitions of quantization - this will suffice in my view.)

So this “division” comes about when we take the full occupancy, (to be arbitrary - the number does not matter) let’s say 2048 threads, and divide that by the denominator - the number of threads per threadblock. We use integer division because the GPU does not express or launch fractional threadblocks. If the number of threads per threadblock is 512, the difference between the integer result (4) and the actual result (4) is zero. OTOH if the number of threads per threadblock is 513, the difference between the integer result (3) and the actual result (3.9922…) is not zero. This “quantization” effect can limit your achieved occupancy, which can limit your performance. If you take a kernel and plot the throughput vs. threadblock size, varying from say 1 to 1024 or something like that, you will often see a stepped performance curve, not unlike your plots.
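
A minimal sketch of that arithmetic (plain host-side C; the 2048-threads-per-SM figure is the arbitrary example limit used above, not necessarily your GPU's):

#include <stdio.h>

int main()
{
	const int max_threads_per_sm = 2048;	// example limit from the text above

	for (int block_size = 480; block_size <= 520; block_size += 8)
	{
		int resident_blocks  = max_threads_per_sm / block_size;	// integer division - no fractional blocks
		int resident_threads = resident_blocks * block_size;
		double occupancy     = (double)resident_threads / max_threads_per_sm;
		printf("block size %4d -> %d resident blocks, occupancy %.3f\n",
		       block_size, resident_blocks, occupancy);
	}
	return 0;
}

With these numbers, a block size of 512 gives 4 resident blocks and occupancy 1.000, while 520 drops to 3 resident blocks and occupancy 0.762 - a stepped curve, as described above.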

As a separate topic:

A threadblock that has 18 threads (only) in it will, to a first order approximation, and based on that piece of information only, waste ((32-18)/32)*100% = 44% of your GPU horsepower. Launching any number of such blocks does not change that fact or arithmetic.

A threadblock that has 18x32 threads in it will, to a first order approximation, waste 0% of GPU resources, again considering only that datum.

There are probably additional grid sizing criticisms we could make (note that even 18x32 threads may still be a factor for the occupancy considerations mentioned above), but that one (18 threads per warp/block) in particular is probably the largest single performance factor to be concerned about “on paper”.
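
The same first-order waste figure can be computed directly (just a sketch; the only input is the block size):

// first-order waste from warp granularity: threads are issued in warps of 32,
// so a block of 18 threads still occupies one full 32-lane warp
double warp_waste(int threads_per_block)
{
	int warps_allocated = (threads_per_block + 31) / 32;
	return 1.0 - (double)threads_per_block / (warps_allocated * 32.0);
}

// warp_waste(18)      -> 0.4375  (~44%)
// warp_waste(18 * 32) -> 0.0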


So I have to try something like this:

cool_kernel<<<blocks, 18*32>>>(data);

__global__ void cool_kernel( /* ... */ )
{
	int tid = threadIdx.x;
	int virtual_tid = tid % 18;
	int virtual_block = tid / 18;
	// other code
}

It's a possible option for two of my three kernels; the first will be more difficult due to the use of shared memory.
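
Here is a minimal sketch of how I picture the repacked launch (the kernel name and the work_size parameter are hypothetical, not my real code):

// hypothetical repacked kernel: 18*32 = 576 threads per block,
// each block carries 32 "virtual blocks" of 18 threads
__global__ void cool_kernel_packed(t_math_solver_configs* solver, int work_size)
{
	int tid           = threadIdx.x;
	int virtual_tid   = tid % 18;				// lane inside the 18-thread group
	int virtual_block = tid / 18;				// which of the 32 groups in this block
	int work_item     = blockIdx.x * 32 + virtual_block;	// global work item index

	if (work_item >= work_size) return;			// guard the tail

	// ... per-point math indexed by (work_item, virtual_tid) ...
}

// launch: one block per 32 work items
// int blocks = (work_size + 31) / 32;
// cool_kernel_packed<<<blocks, 18 * 32>>>(solver_on_device, work_size);

One thing to watch with this mapping: an 18-thread group straddles warp boundaries, so any warp-level synchronization or shuffle inside a group would need care.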

I don’t know what the performance limiters are for your case. The most promising approach IMO is to use analysis-driven optimization in order to improve the performance of your application.

But for me, personally, I would certainly question whether a threadblock size of 18 threads may possibly be a limiter to performance. It’s not recommended in any typical CUDA beginner training.


I already tried to optimize the code.
I did the analysis; as I understand it, it's mostly latency and global memory - in all cases roughly 66% of the latency is memory latency.

ENEV:
occupancy 24.9 / 50% max (visual profiler)
global mem throughput about 73 GB/s, max 320

Blocks:
occupancy 22.5 / 50% max (visual profiler)
global mem throughput about 103 GB/s, max 320

Solver:
occupancy 24.9 / 50% max (visual profiler)
global mem throughput about 110 GB/s achieved out of a 320 GB/s max

So as I understand it, your basic idea here is to improve the grid and probably vectorize my global memory loads.
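
For the vectorized loads, a minimal sketch of what I have in mind (a toy kernel on a plain double array, not my real buffers; it assumes the pointers are 16-byte aligned and n is even):

// vectorized global memory access: read/write two doubles (16 bytes) per instruction via double2
__global__ void scale_vectorized(const double* __restrict__ in, double* __restrict__ out,
                                 double factor, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;	// each thread handles one double2 (two doubles)

	const double2* in2  = reinterpret_cast<const double2*>(in);
	double2*       out2 = reinterpret_cast<double2*>(out);

	if (2 * i < n)
	{
		double2 v = in2[i];	// single 16-byte load
		v.x *= factor;
		v.y *= factor;
		out2[i] = v;		// single 16-byte store
	}
}

// launch with one thread per pair of elements:
// int threads = 128;
// int blocks  = (n / 2 + threads - 1) / threads;
// scale_vectorized<<<blocks, threads>>>(d_in, d_out, 2.0, n);

Since cuDoubleComplex is itself a 16-byte double2, for those buffers the bigger win is probably making sure neighboring threads access neighboring elements (coalescing) rather than widening the loads further.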

I am trying to rewrite the first of my functions (the eigensolver) to use 54 threads per block instead of 9 (then I will probably increase the number of warps per block). As I see from the tests, I can probably achieve a 3-5x speedup, depending on the ratio between the task size and the performance limit of the particular video card (bigger is better). Then I will probably apply the technique described here to the other kernels. Thanks Robert, issue resolved.
