Register count per SM in the GPGPU-Sim simulator


I am using the GPGPU-Sim simulator for a GPU project. The default number of registers per SM is 8192. When I increase the register count to 16384, and further to 32768, execution time gets longer on some benchmarks. I checked the bottleneck analysis, and it shows the stall is caused by register write-back contention. I do not quite understand what this means. If anyone has experience with this, I would greatly appreciate an explanation of why this unexpected slowdown happens. Why does the stall increase in the register write-back stage?
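For context, this is the setting I am changing. In GPGPU-Sim the per-core register file size is an option in `gpgpusim.config`; in the 3.x configs the option name is, if I recall correctly, `-gpgpu_shader_registers` (please check the config shipped with your version):

```
# gpgpusim.config fragment -- per-SM register file size
# (option name assumed from GPGPU-Sim 3.x; verify against your version)
-gpgpu_shader_registers 16384
```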

Thank you so much!


In case anyone is interested: Prof. Tor Aamodt from UBC previously posted a potential explanation for this, as follows:

"When read requests return from memory they contend for register file access with any warps in the pipeline that would like to enter the write-back stage. If your application is limited by the number of registers, then with fewer registers, fewer warps would be able to run on a single shader core (SM). That may tend to reduce the chance that a memory request returning from memory encounters a warp trying to enter write-back. Increasing the number of registers may increase the chance of seeing this contention. Still, if performance is decreasing when you increase the number of CTAs (blocks) running concurrently (by increasing register resources), I would look for other explanations. One possibility (discussed in section 4.7 of our ISPASS'09 paper) is increasing DRAM contention causing lower DRAM efficiency.

In my research group we typically look for bottlenecks using AerialVision (included with the latest version of GPGPU-Sim) since it helps give a more complete picture than aggregate statistics. If the warp divergence breakdown shows a high "W0" component, it typically means the application is DRAM bandwidth limited (another explanation for high W0 even when not accessing memory can be "low occupancy" — e.g., below six warps per SM it is not possible to keep the pipeline full). One thing we always check for if we see counter-intuitive behavior is "load imbalance". By this I mean cases where one SM gets more work than others in one configuration versus another configuration and then takes longer to finish. Generating a parallel intensity plot of IPC per shader core in AerialVision tends to give a hint when load imbalance is affecting performance leading to anomalous results (check if one configuration has a few SMs busy at the end of the simulation even though all other SMs have completed). There is a short discussion of the load imbalance issue in our ISPASS'09 paper. More on AerialVision can be found in our soon-to-be-presented ISPASS 2010 paper."
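The occupancy argument in the quoted explanation can be illustrated with some simple arithmetic. This is a minimal sketch, not GPGPU-Sim code: it assumes a hypothetical kernel needing 16 registers per thread, a 32-thread warp size, and a hardware cap of 48 warps per SM, and shows how growing the register file admits more resident warps, and hence more candidates competing for the write-back stage:

```python
# Illustrative sketch (not GPGPU-Sim code): how the per-SM register file
# size caps the number of resident warps. regs_per_thread, warp size, and
# the hardware warp limit are assumed example values.

WARP_SIZE = 32  # threads per warp

def max_resident_warps(regfile_size, regs_per_thread, hw_warp_limit=48):
    """Warps that fit in the register file, capped by the hardware limit."""
    regs_per_warp = regs_per_thread * WARP_SIZE
    return min(regfile_size // regs_per_warp, hw_warp_limit)

for regfile in (8192, 16384, 32768):
    warps = max_resident_warps(regfile, regs_per_thread=16)
    print(f"{regfile} registers/SM -> {warps} resident warps")
# With 16 regs/thread: 8192 -> 16 warps, 16384 -> 32, 32768 -> 48 (capped)
```

More resident warps generally helps hide memory latency, but, as the explanation above notes, it also raises the chance that a returning memory request collides with a warp trying to enter write-back on the same cycle.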
I read this message. Thank you so much!