Kernels with heavy register spills: tricks to improve memory throughput?

Hi,

GPU Tesla 2070

I have a kernel which is quite big and results in a lot of spills:
568 bytes stack frame, 1028 bytes spill stores, 1116 bytes spill loads

This results in many local loads and stores, and also a heavy dependence on the compiler's ability to keep these loads and stores coalesced.
Since I don't know enough about the nvcc compiler, I really don't know whether it is doing a good job of managing these spilled loads and stores (which now go through global memory).
The application is such that the major part of the global memory traffic comes from these spilled loads/stores, so managing them effectively becomes very important, in my opinion.

Requested Global Load Throughput (bytes/sec): 385659128
Global Load Throughput (bytes/sec): 12584594915
Requested Global Store Throughput (bytes/sec): 296805
Global Store Throughput (bytes/sec): 732814299
Global load efficiency: about 3%
Global store efficiency: 0.04%

As we can see, I currently achieve extremely low global load/store efficiency, which indicates extremely wasteful utilization of bandwidth. Am I correct?

What are the possible ways to improve this situation?

  • Coding methods?
  • Shifting some variables into shared memory (how? based on their longer live ranges, etc.?)
  • Explicitly declaring some internal variables as global so that I can manage their coalesced access myself?

Would like to know your thoughts on these…

Thanks!
Sid.

Are you sure your reads and writes are coalesced?
Does each group of 32 threads request contiguous locations in memory?
As a general rule, your reads and writes should follow the (base_address + tid) pattern to be coalesced.
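To illustrate the (base_address + tid) rule, here is a minimal sketch (hypothetical kernel, not from the original post):

```cuda
// Each warp of 32 threads reads 32 consecutive floats, so the hardware
// can service the warp's loads and stores with coalesced transactions.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x; // base_address + tid pattern
    if (tid < n)
        out[tid] = in[tid]; // thread k touches element k: coalesced
}
```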

I don’t think it is a compiler problem.

As I said, the major portion of my global loads/stores comes disguised as 'local'. That means the spilled registers and some other data structures have automatically been placed into global memory by the compiler, and (in my current implementation) I have no control over them. I cannot apply coalesced read/write rules like the one you mentioned to those local loads/stores; that is probably the compiler's job.

Correct me if I am wrong in my understanding.

Local memory is laid out differently so accesses are always coalesced (unless threads access “automatic” arrays each using a different index).

So suppose I have two arrays declared in my kernel.

double X[4],Y[4];

Variable X is used in a loop which has been completely unrolled, so every index is a fixed, compile-time-constant value.

Variable Y is used in a complex (maybe nested) loop that has not been unrolled.

a) Would it mean that the compiler would attempt to allocate registers for variable X, which may or may not spill to local memory, AND that its accesses would be coalesced even when it spills?

b) For variable Y, would it be a case of an "automatic" indexed array, going directly to local memory and probably not coalesced?

Am I understanding it correctly?

Are we sure that in general local loads/stores are coalesced? (I would assume so; the compiler should be able to do this.) Do you happen to remember where this has been stated by NVIDIA?

So I should probably be focusing on variables of type Y. How do I identify them with certainty?
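For illustration, the two cases above could look like this (hypothetical kernel; the names and values are assumptions, not from the original code):

```cuda
__global__ void spill_example(double *out)
{
    double X[4], Y[4];

    // Case (a): fully unrolled loop -> every index is a compile-time
    // constant, so the compiler can keep X in registers (it spills to
    // local memory only under register pressure).
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        X[i] = i * 2.0;

    // Case (b): the index depends on a runtime value, so Y becomes an
    // "automatic" indexed array that must live in local memory.
    for (int i = 0; i < 4; ++i) {
        int j = (threadIdx.x + i) & 3; // runtime-computed index
        Y[j] = X[j];
    }

    out[threadIdx.x] = Y[threadIdx.x & 3];
}
```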

Yes, I think so.

It is stated in the Programming Guide:

The only way to be sure (I think) is to look at disassembled device code using cuobjdump -sass. But your description makes it pretty clear that the compiler has no better option for this variable. You might however manually place it in shared memory (with suitable threadIdx-dependent addressing).
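A sketch of the manual shared-memory placement suggested above (the block size and layout are assumptions):

```cuda
#define THREADS_PER_BLOCK 128

__global__ void shared_y_kernel(double *out)
{
    // Layout [element][thread]: when all threads of a warp access the
    // same logical element i, they hit consecutive shared-memory banks.
    // Cost: 4 * 128 doubles = 4 KB of shared memory per block.
    __shared__ double Y[4][THREADS_PER_BLOCK];

    for (int i = 0; i < 4; ++i)
        Y[i][threadIdx.x] = i * 1.0;  // threadIdx-dependent addressing

    double sum = 0.0;
    for (int i = 0; i < 4; ++i)
        sum += Y[i][threadIdx.x];

    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}
```

Note that the shared-memory footprint now scales with the block size, which may limit how many blocks fit on an SM.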

Thanks tera. This was very helpful.
I will analyse my kernel and the various profiler stats some more and get back here.

Sid.

I think the best thing is for you to declare your arrays in shared memory and have full control over them.

Or just configure the device for more L1 cache and less shared memory.
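For reference, on Fermi the L1/shared split can be set per kernel with `cudaFuncSetCacheConfig` (the kernel name here is a placeholder):

```cuda
// Prefer 48 KB L1 / 16 KB shared memory for this kernel, which tends to
// help spill-heavy kernels since local memory is cached in L1.
cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);

// Then launch as usual:
my_kernel<<<grid, block>>>(/* args */);
```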

I have this in my implementation currently.

My L1 cache hit rates -

L1 local Hit rate - 86%

L1 Global hit rate - 72%

The shared-memory suggestion should be helpful. I will try it and report back if I see an improvement.

Sid.

PS - Quick question: is there an obvious reason for a kernel configured with a bigger block size to run faster than one with a smaller block size, even when the grid sizes have been rearranged so that roughly the same number of warps are invoked on each SM in both cases? E.g. (blockSize x gridSize): 64 x 96 is faster than 32 x 192. There is no intra-block communication, etc.

Yes: even numbers of warps per block are slightly faster for a number of reasons, not all of them documented. And even 64 threads per block don't allow reaching full occupancy, due to the limit on the number of resident blocks per SM.

OTOH you want to allow for more than one block per SM, and occupancy is not equivalent to speed either.

Do you happen to know the limit on the maximum number of resident blocks for the Tesla 2070 (448 cores, i.e. 14 SMs of 32 cores each)?

Yes: It’s 8 blocks per SM for all but the newest Kepler devices. Check Appendix F of the Programming Guide.