call stack / stack frame

and so i cleaned up the stack frame - significantly

however, in its almighty wisdom, the compiler still refuses to let extra blocks be seated

i am confident that it is not due to shared memory, but still due to local memory - i hardly spill anymore, and when i do, it is no more than 16 bytes

exactly how does one interpret the ptxas information on register usage? is the noted value again per thread?

ptxas info : Function properties for _Z8fwd_krnlbbjjjjjjjjjjddddPbS_S_S_S_S_PcS0_PjS1_S1_S1_S1_S1_S1_S1_S1_PdS2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 180 registers, 507 bytes smem, 688 bytes cmem[0]

if so, the value seems excessive, to say the least
is there a way to obtain a more detailed breakdown of the register use build-up, other than to completely strip the kernel, and to systematically put its components back again, noting changes in reported register use with each step?

Yes, the register use reported by ptxas is per thread. To get total register use by the kernel launch, you have to multiply by the number of threads launched, possibly also taking into account register allocation granularity.
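To put that in numbers, a quick host-side check could look like the sketch below (the empty placeholder kernel stands in for your actual kernel symbol, and the 65536-registers-per-SM figure applies to sm_35):

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder standing in for the actual kernel symbol (hypothetical).
__global__ void fwd_krnl() { }

int main()
{
    // ptxas reports registers PER THREAD; multiply by threads per block to
    // get the per-block cost. An sm_35 SM has 65536 32-bit registers, so at
    // 180 registers/thread a 32-thread block costs roughly 180 * 32 = 5760
    // registers (before allocation-granularity rounding), i.e. the register
    // budget alone caps residency at about 11 such blocks per SM.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, fwd_krnl);
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, fwd_krnl, 32, 0);
    printf("regs/thread: %d, max resident blocks/SM at 32 threads/block: %d\n",
           attr.numRegs, blocksPerSM);
    return 0;
}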

thanks, txbob

i believe 180 registers per thread is exorbitant; the kernel itself is basically a shell, calling/controlling a number of functions, and i really cannot think that any function would use that many registers per turn; perhaps the kernel/functions are 'clinging' to registers

is there a way to track this? otherwise i would have to start with an empty kernel shell, and add its functions one by one, to note register use/accumulation

I agree it sounds quite high. I don't know how to tease out some kind of allocated breakdown of register usage; perhaps others will (other than by inspecting the SASS, which is pretty manual). However, you seem to be focused on occupancy. Register usage can have a major effect on occupancy; AFAIK spilling by itself does not (although it can affect performance, obviously). You might want to validate your compass heading by first just using launch bounds to limit register usage (acknowledging the probability of spills) and seeing whether you can drive up occupancy.
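For reference, launch bounds are just a qualifier on the kernel definition; a sketch (the bound values and kernel name here are purely illustrative, not a recommendation):

// Sketch: __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// tells ptxas to cap per-thread register use so that at least
// minBlocksPerMultiprocessor blocks of maxThreadsPerBlock threads can be
// resident, spilling to local memory if it has to. A blunter alternative is
// the nvcc option -maxrregcount=N, which caps every kernel in the file.
__global__ void __launch_bounds__(32, 16) fwd_krnl_bounded()
{
    // kernel body
}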

If you can, then there may be some value in attacking register usage. I’m kind of surprised that you can get 8 threadblocks “seated” while still having a register usage of 180 per thread. You must not be launching very many threads per block (my arithmetic says less than 64).

As a final suggestion, I do think that performance analysis-driven optimization is the way to go, instead of blindly attacking register usage, or occupancy, or any other specific metric. There are good presentations on the basics of analysis driven optimization.
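For the manual SASS route mentioned above, the relevant tool invocations are (executable name illustrative):

# disassemble the SASS for every kernel embedded in the binary
cuobjdump -sass a.out

# the per-kernel register/smem report quoted above comes from verbose ptxas output
nvcc -Xptxas -v ...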

“AFAIK spilling by itself does not”

i was under the impression that, to put it in my own words, the compiler sees spilling into L1 as an orange light, and spilling into global memory as a red light…
in the eyes of the compiler, the scope for squeezing out more performance should shrink as spilling increases

yes, i only seat blocks of 32 threads

this might sound awkward, but, in addition to the parallel implementation, my problem solves faster the more instances are processed concurrently, because, in short, it impacts arrival rates
hence, at present i obtain 120 instances, and i feel semi-accomplished

to be honest, the profiler behaves much like memtest with this particular kernel - simply put, it gets lost…

thanks for the advice

Re: the kernel itself is basically a shell, calling/controlling a number of functions

If those device functions reside in the same compilation unit as the kernel itself, they will likely get inlined into the kernel which can create very large code and a lot of register pressure due to the many variables used across the totality of that code. Again, this is speculation as to the cause of the high register pressure since you have not provided any of the code in question. Note that registers are 32 bit wide, so any 64-bit data objects, e.g. of type double, long long int, or size_t will take up two registers each.

To explore the impact of inlining on register pressure, you could experiment with the __noinline__ function attribute, or try to compile each function separately. I second txbob's recommendation to use the profiler for analysis-driven optimization instead of trying to guess what the performance limiters are in this code.
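A minimal sketch of the attribute placement (the function name and body are made up for illustration; separate compilation would additionally require building with -rdc=true):

// Sketch: __noinline__ asks the compiler to keep this as a true function
// call rather than inlining it into the kernel, so its local variables are
// not merged into the caller's register footprint. Keep in mind that each
// 64-bit argument (double, pointer, size_t) still occupies two 32-bit
// registers.
__noinline__ __device__ double eval_point(const double* d_fwd, unsigned int idx)
{
    return d_fwd[idx] * 2.0;   // placeholder body
}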

i am doing something wrong, missing something, or a combination of the 2…

firstly, noinline does nothing whatsoever to the register usage

secondly, this is the kernel as shell, stripped of all the device functions:

__global__ void _fwd_krnl(bool in_have_ref_case, bool gbl_pnts_rem,
	unsigned int cur_krnl_n, unsigned int comb_arr_size, unsigned int iss_mul_comb_arr_size,
	unsigned int comb_pnt_n, unsigned int tot_block_bnd, unsigned int gbl_pnt_cnt,
	unsigned int block_pnt_cnt_trans_pnt, unsigned int block_pnt_cnt_trans_pnt_block_n,
	unsigned int block_pnt_cntA, unsigned int block_pnt_cntB,
	double in_ref_case, double port_val, double max_fwd, double comb_res,
	bool* d_in_pnt, bool* d_block_bnd_has_range, bool* d_block_posted_ref_case, bool* d_gbl_posted_ref_case,
	bool* d_gbl_set_ref, bool* d_stop_exec,
	char* d_gbl_sol_overflow_type, char* d_block_stat,
	unsigned int* d_pnts_cnt, unsigned int* d_rem_in_val, unsigned int* d_in_pnt_val,
	unsigned int* d_port_pnt_iss_mul_data, unsigned int* d_iss_mul_data, unsigned int* d_iss_port_pnt_in_arr,
	unsigned int* d_1st_ref_upd, unsigned int* d_gbl_block_sol_cnt, unsigned int* d_terminate,
	double* d_fwd, double* d_comb, double* d_start_pnt_range_data, double* d_port_pnt_time,
	double* d_port_pnt_time_diff, double* d_block_start_time_diff_rec, double* d_iss_cpn_in_arr,
	double* d_iss_val, double* d_iss_comb_sum, double* d_port_pnt_mul_sum, double* d_gbl_ref_case,
	double* d_block_ref_case, double* d_best_result_stack, double* d_sol_stack, double* d_worst_ref_case,
	double* d_result_upd, double* d_split_comb, double* d_split_fwd, double* d_split_work_data)
{
	__shared__ bool break_for_overhead;
	__shared__ bool must_do_overhead;
	__shared__ bool must_terminate;
	__shared__ unsigned int rerun_cnt;

	__shared__ unsigned int cur_pnt_port_pnt[3];
	__shared__ double cur_pnt_fwd[3];
	__shared__ double start_pnt_range_data[start_pnt_range_data_length];

	__shared__ Fwd_krnl_data fwd_krnl_data;

	if (threadIdx.x < start_pnt_range_data_length)
	{
		start_pnt_range_data[threadIdx.x] = d_start_pnt_range_data[threadIdx.x];
	}

	__syncthreads();
}

and this is the ptxas information for it, followed by the original kernel's information:

ptxas info : Compiling entry function '_Z9_fwd_krnlbbjjjjjjjjjjddddPbS_S_S_S_S_PcS0_PjS1_S1_S1_S1_S1_S1_S1_S1_PdS2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_' for 'sm_35'
ptxas info : Function properties for _Z9_fwd_krnlbbjjjjjjjjjjddddPbS_S_S_S_S_PcS0_PjS1_S1_S1_S1_S1_S1_S1_S1_PdS2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_
360 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 97 registers, 16 bytes smem, 688 bytes cmem[0]
ptxas info : Compiling entry function '_Z8fwd_krnlbbjjjjjjjjjjddddPbS_S_S_S_S_PcS0_PjS1_S1_S1_S1_S1_S1_S1_S1_PdS2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_' for 'sm_35'
ptxas info : Function properties for _Z8fwd_krnlbbjjjjjjjjjjddddPbS_S_S_S_S_PcS0_PjS1_S1_S1_S1_S1_S1_S1_S1_PdS2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 180 registers, 507 bytes smem, 688 bytes cmem[0]

the shell itself is already at 97 registers…?

97 registers would be more or less the number of registers required to store the kernel parameters…

but those are passed via constant memory, not so…? so why would they end up in registers already…?

by now i have rolled up not only the parameters of the device functions within the kernel, but also the kernel's own parameters

ptxas information reports register use down from 180 to 35 - 41:

ptxas info : Function properties for _Z8fwd_krnljPjPdPPbPPcPS_PS0_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 35 registers, 951 bytes smem, 376 bytes cmem[0]
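in outline, the roll-up looks roughly like the sketch below (names and layout are illustrative, not the real code):

// sketch: instead of ~50 individual kernel parameters, pass a few packed
// device arrays holding the scalars, plus arrays of pointers holding the
// original buffer pointers
__global__ void fwd_krnl(unsigned int   cur_krnl_n,
                         unsigned int*  d_uint_scalars,   // packed unsigned int scalars
                         double*        d_dbl_scalars,    // packed double scalars
                         bool**         d_bool_bufs,      // the former bool* parameters
                         char**         d_char_bufs,      // the former char* parameters
                         unsigned int** d_uint_bufs,      // the former unsigned int* parameters
                         double**       d_dbl_bufs)       // the former double* parameters
{
    // each value is fetched from the packed containers only where it is
    // actually needed, instead of everything being live at kernel entry
    unsigned int comb_arr_size = d_uint_scalars[0];   // illustrative indexing
    double       port_val      = d_dbl_scalars[0];
    // ... kernel work ...
}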

i now manage to seat 210 blocks concurrently across the device, up from 120

i shall later attempt to profile more thoroughly, to see what the optimal loading is

thanks txbob, njuffa, gregory for your input