More likely than not (you should be able to confirm this with the CUDA profiler), this change:
(1) reduced dynamic instruction count by allowing all but the first address to be computed by simple addition
(2) allowed the compiler to schedule loads earlier and batch them, increasing latency tolerance and improving the efficiency of memory accesses (see the sketch below)
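To make (1) and (2) concrete, here is a minimal hypothetical sketch (the kernel names, the 4-element batch, and the scaling operation are all made up for illustration; this is not your actual code). The first kernel recomputes a full address from an index in every loop iteration; the second computes the base address once, derives the remaining addresses by simple addition, and batches the loads so the compiler is free to issue them early:

```cpp
// Before: one multiply plus full address computation per iteration.
__global__ void scale_before(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < 4; k++) {
        int idx = i * 4 + k;                 // index recomputed every iteration
        if (idx < n) out[idx] = 2.0f * in[idx];
    }
}

// After: one base address per pointer; remaining addresses are base + small
// constant offsets, and all four loads are issued up front so the compiler
// can schedule them early and overlap their latency.
__global__ void scale_after(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    const float *p = in  + i * 4;            // single full address computation
    float       *q = out + i * 4;
    if (i * 4 + 3 < n) {                     // tail handling omitted for brevity
        float t0 = p[0];                     // batched loads: base + offset,
        float t1 = p[1];                     // computed by simple addition
        float t2 = p[2];
        float t3 = p[3];
        q[0] = 2.0f * t0;
        q[1] = 2.0f * t1;
        q[2] = 2.0f * t2;
        q[3] = 2.0f * t3;
    }
}
```

Whether the compiler actually applies these transformations depends on the toolchain version and surrounding code, so inspecting the generated SASS (e.g. with cuobjdump) is the way to verify.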
I would not classify this as an ILP-related technique (which is why I asked), though one could argue about the exact definition. To first order, a GPU is a scalar processor that can issue one instruction per thread per cycle, so the question of ILP does not even come up. There are various exceptions to this first-order description, which vary by GPU architecture, but I am not aware of any GPU shipped since CUDA came into existence that issues more than two instructions per thread per cycle under any conditions.
If someone has better information, I encourage comments on this. There have been too many different architectures to keep all details in my memory.