I have implemented a stack in shared memory on Fermi, and the performance is excellent.
However, this trick only works when the stack size is at most 8 elements.
Now I want to develop a general version for stacks larger than 8 elements.
I intend to allocate the stacks in global memory. The stack size depends on
input parameters, so I will not use static local memory, and I don’t want to allocate
memory inside the kernel because I need to stay within my time budget.
Suppose the stack size is 16 doubles (128 bytes) and the target platform is a C2070.
Then I will allocate (14 SMs) x (1536 threads/SM) x (128 bytes/thread), and then use
inline assembly to fetch smid, warpid and laneid:
int laneid ;
int warpid ;
int smid ;
int nsmid ;
asm("mov.u32 %0, %%laneid ;" : "=r"(laneid));
asm("mov.u32 %0, %%warpid ;" : "=r"(warpid));
asm("mov.u32 %0, %%smid ;" : "=r"(smid));
asm("mov.u32 %0, %%nsmid ;" : "=r"(nsmid));
Once laneid, warpid and smid are ready, each thread can compute the starting address of its stack.
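The address computation might look roughly like the sketch below. It rests on assumptions that are not verified here: the 1536 resident threads per SM are treated as 48 warp slots of 32 lanes, warpid is assumed to stay below 48, and `stack_slot` and `stack_base` are hypothetical helper names. Also note that the PTX ISA describes %warpid and %smid as volatile (they can change if a thread is rescheduled), so the sketch assumes they are read once and the resulting slot is kept for the lifetime of the thread.

```cpp
#include <cstddef>

// Let the same helpers compile as plain C++ or as CUDA device code under nvcc.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Assumed Fermi limits: 1536 resident threads per SM = 48 warps x 32 lanes.
enum { WARPS_PER_SM = 48, LANES_PER_WARP = 32 };

// Hypothetical helper: a unique slot index for each resident thread,
// built from the values read out of %smid, %warpid and %laneid.
HD inline std::size_t stack_slot(unsigned smid, unsigned warpid, unsigned laneid)
{
    return (static_cast<std::size_t>(smid) * WARPS_PER_SM + warpid) * LANES_PER_WARP
           + laneid;
}

// Hypothetical helper: start of the stack that belongs to a slot, where
// pool is the global-memory buffer and doubles_per_stack is the stack size.
HD inline double* stack_base(double* pool, std::size_t slot, std::size_t doubles_per_stack)
{
    return pool + slot * doubles_per_stack;
}
```

With this layout the largest slot index is (max smid + 1) x 1536 - 1, so the buffer has to cover every smid value that can actually occur, not just the number of physical SMs.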
My concern is that the number of physical SMs is not always equal to nsmid. For example:
GTX480: nsmid = 15
C2070: nsmid = 15, (physical SMs = 14)
This is mentioned in ptx_isa.pdf:
"A predefined, read-only special register that returns the maximum number of SM
identifiers. The SM identifier numbering is not guaranteed to be contiguous, so
%nsmid may be larger than the physical number of SMs in the device."
Of course, I can run a small kernel to fetch nsmid before doing my work.
My questions are:
(1) Does nsmid keep the same value for a given card?
(2) Is there any rule for knowing nsmid in advance, without probing it?
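To make the sizing issue concrete: if smid can take any value below nsmid, a buffer sized for 14 SMs is too small as soon as some thread reports smid = 14. Below is a minimal sketch of the arithmetic, using the figures assumed above (1536 threads/SM, 128 bytes per stack); `pool_bytes` is a hypothetical helper name.

```cpp
#include <cstddef>

// Hypothetical helper: bytes needed to give every resident thread slot its own stack.
// sm_ids should be the number of SM identifiers (nsmid), not the physical SM count.
std::size_t pool_bytes(unsigned sm_ids, unsigned threads_per_sm, std::size_t bytes_per_stack)
{
    return static_cast<std::size_t>(sm_ids) * threads_per_sm * bytes_per_stack;
}
```

For the C2070 numbers, pool_bytes(14, 1536, 128) gives 2,752,512 bytes, while pool_bytes(15, 1536, 128) gives 2,949,120 bytes. Sizing by nsmid therefore costs one extra SM's worth of stacks (196,608 bytes) in this example.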