Got some questions about PTX code
Tags: ptx code, cubin file, decuda, register usage

Hello,

I have some questions about PTX code. My kernel is using too many registers. I launch each block with (64, 2) threads, and each thread in my algorithm explicitly uses about 100 registers. But when I compile with --ptxas-options=-v, it reports 121 registers plus 120 bytes of local memory. Now I have the following questions:

  1. I looked at the PTX code and there was no local memory usage: I searched for “.local” but found none. So is the local memory simply spilling because I use too many registers? But I use only about 100 registers explicitly in my algorithm! Could other temporary variables take that many extra registers?

  2. The only local memory usage that can be observed in PTX code is the kind caused by careless programming, right? (For example, indexing an array with a value that can’t be determined at compile time.) Only in that case is local memory clearly declared in the PTX code. If the local memory usage is due only to register pressure, we can’t tell that from the PTX code alone.

  3. As far as I can see, the PTX code declares about 900 temporary registers. When the PTX file is compiled to a cubin file, it is optimized further and most of these registers are eliminated, right?

  4. How can I find out where my registers are used and what they are used for? Or which part of my code results in local memory usage? I can’t see any local memory usage in the PTX code right now.

  5. Decuda only works with versions before SDK 3.0; since then, cubin files use the ELF format, which decuda can’t handle. Is this correct? By the way, why can’t I find any decuda download online now? This link (http://github.com/laanwj/decuda/downloads) no longer seems to be available.
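
To make questions 2 and 4 more concrete, here is a minimal sketch of the kind of case I mean. The kernel and function names (dynamic_index_kernel, print_kernel_resources) are made up for illustration and are not my real code. The array is indexed with a value only known at run time, which is the situation where I would expect the compiler to place it in local memory, though that is my assumption and may depend on the compiler version. As far as I understand, cudaFuncGetAttributes reports the same per-thread register count and local memory size that --ptxas-options=-v prints at compile time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: `scratch` is indexed by a value that is only known at
// run time, so the compiler may be unable to keep it in registers.
__global__ void dynamic_index_kernel(const int *idx, float *out)
{
    float scratch[16];
    for (int i = 0; i < 16; ++i)
        scratch[i] = 2.0f * i + threadIdx.x;

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int k = idx[tid] & 15;   // run-time index, masked to stay within 0..15
    out[tid] = scratch[k];
}

// Prints the per-thread resources the runtime reports for the compiled kernel.
cudaError_t print_kernel_resources()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, dynamic_index_kernel);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                     cudaGetErrorString(err));
        return err;
    }
    std::printf("registers per thread : %d\n", attr.numRegs);
    std::printf("local memory (bytes) : %zu\n", attr.localSizeBytes);
    return cudaSuccess;
}
```

Calling print_kernel_resources() from main only gives totals per kernel, so it still does not tell me which variables or which lines are responsible, which is what question 4 is really about.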

Looking forward to your replies and help. Thanks.

Please delete this post as I will post it on another board. Thanks.