I have a few implementation-related questions after going through Chapter 5 (Performance Guidelines) of the Programming Guide 2.0 and experimenting a bit with the Occupancy Calculator:
From the Occupancy Calculator, I understand the meaning of “active thread blocks”, i.e. the number of thread blocks that can be resident on a multiprocessor at one time, given its limited shared memory and registers and the physical limits of the GPU hardware. I can understand the limits due to shared memory and GPU hardware, but not the one due to registers. I understand that register spills are stored in local memory. So there should be some way for me to decide or specify how much is kept in registers and how much goes to local memory, so that I can minimize register usage. I vaguely remember coming across a discussion in which this could be specified in a file, but I can’t find it now. Can anyone please suggest how to do this?
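To make my understanding of the register limit concrete, here is a minimal C sketch of the arithmetic as I read it. The function name and parameters are my own (hypothetical), and the estimate is simplified: it ignores the allocation granularity that the real Occupancy Calculator applies, as well as the separate limits on blocks and threads per multiprocessor.

```c
#include <stdio.h>

/* Simplified estimate of how many thread blocks fit on one multiprocessor
 * when only the register file is considered.
 * Assumption: every thread of a block needs regs_per_thread registers and
 * no rounding to an allocation unit takes place. */
int blocks_limited_by_registers(int regs_per_sm,
                                int regs_per_thread,
                                int threads_per_block)
{
    int regs_per_block = regs_per_thread * threads_per_block;
    if (regs_per_block <= 0) {
        return 0; /* invalid input, nothing sensible to report */
    }
    /* Integer division: a block is resident only if all of its
     * registers fit at once. */
    return regs_per_sm / regs_per_block;
}
```

Assuming 8192 registers per multiprocessor (the figure for compute capability 1.0/1.1 devices, as far as I know), 256 threads per block at 10 registers per thread gives 3 blocks, while 16 registers per thread gives only 2. This is why I would like to control register usage.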
Quoting from the occupancy calculator:
Now I am confused: how do I figure out whether my code is bottlenecked by bandwidth, by computation, or by global memory accesses?
This is regarding coalesced memory accesses. I understand that the memory accessed by a half-warp should fall within a single “segment”, and that segments are aligned, i.e. they start at address zero and at every multiple of the segment size. Now how do I make sure that the half-warp knows the start address of a segment? E.g. for 64B segments, the starting addresses of successive segments would be 0, 64, 128, 192 and so on, as in figure 5.4 of the programming guide.
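To illustrate what I mean by segment start addresses, here is a minimal C sketch of the arithmetic as I understand it. It runs on the host and only models the addressing. The function names are hypothetical, and I am assuming that the segment size is a power of 2 and that thread k of the half-warp accesses the k-th consecutive word.

```c
#include <stddef.h>

/* Start address of the aligned segment that contains addr.
 * Assumption: segment_size is a power of 2 (e.g. 32, 64 or 128 bytes). */
size_t segment_base(size_t addr, size_t segment_size)
{
    return addr & ~(segment_size - 1);
}

/* Returns 1 if the 16 consecutive words accessed by a half-warp,
 * starting at first_addr, all lie in the same aligned segment,
 * and 0 otherwise. */
int half_warp_in_one_segment(size_t first_addr,
                             size_t word_size,
                             size_t segment_size)
{
    size_t last_byte = first_addr + 16 * word_size - 1;
    return segment_base(first_addr, segment_size) ==
           segment_base(last_byte, segment_size);
}
```

For example, `segment_base(200, 64)` is 192. A half-warp reading 4-byte words starting at address 128 stays inside the segment at 128, whereas one starting at address 132 straddles the segments at 128 and 192. If this model is right, the segment seems to follow from the addresses alone, but I would like to have that confirmed.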
Can I use __align__(16) for arrays too, or should it be used only with structs? Will using __align__(16) with large arrays help reduce the number of load instructions when I move data from device memory to shared memory?
Quoting from section 5.3 of the programming guide:
What variable types are we talking about: device or shared? If intermediate data structures can be created and destroyed, am I correct in thinking that the creation and destruction should happen inside the kernel? If so, why was I not able to declare a new __device__ variable from inside the kernel? I could only ever declare a __device__ variable at global scope (i.e. outside both the CPU and the GPU code, where global variables are declared in normal C code). If my understanding is wrong, can someone please paste a small code snippet illustrating the right way to create and destroy such intermediate variables?
Quoting from the programming guide:
I believe [(i%n) == (i&(n-1))] would hold for any n that is a multiple of 2, and for n=1, and not just for powers of 2. I want to reconfirm this, since I have modulo operations at almost every step in my code. Is there a faster way to compute the modulo for an arbitrary integer n?
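A quick way to test this belief on the host is the minimal C sketch below. The function names are hypothetical, and I only consider unsigned values of i. It counts how often the two expressions disagree for a given n.

```c
/* Number of values i in [0, limit) for which (i % n) and (i & (n - 1))
 * disagree. Assumption: n is at least 1. */
unsigned count_mismatches(unsigned n, unsigned limit)
{
    unsigned mismatches = 0;
    unsigned i;
    for (i = 0; i < limit; ++i) {
        if ((i % n) != (i & (n - 1))) {
            ++mismatches;
        }
    }
    return mismatches;
}

/* Returns 1 if n is a power of 2 (1, 2, 4, 8, ...), and 0 otherwise. */
int is_power_of_two(unsigned n)
{
    return n != 0 && (n & (n - 1)) == 0;
}
```

When I run this, n = 1, 2 and 8 give no mismatches, but n = 6 does: for i = 2, 2 % 6 is 2 while 2 & 5 is 0. So the identity does not seem to extend to every multiple of 2, and n = 1 only works because it is itself a power of 2. A check such as `is_power_of_two(n)` could guard the fast path, which still leaves my question about arbitrary n open.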
Thanks & regards,