question about nsmid

I have implemented a stack using shared memory on Fermi and the performance is excellent.

However, this trick only works when the stack size is limited to 8 doubles.

Now I want to develop a general version for stacks larger than 8.

I intend to allocate a global stack in global memory. The stack size depends on input parameters, so I will not use statically sized local memory, and I don't want to allocate memory inside the kernel because of my time budget.

Suppose the stack size is 16 doubles (128 bytes) and the target platform is a C2070. Then I will allocate (14 SMs) x (1536 threads/SM) x (128 bytes/thread), and use inline assembly to fetch smid, warpid and laneid:

int laneid;
int warpid;
int smid;
int nsmid;

asm("mov.u32 %0, %%laneid ;" : "=r"(laneid));
asm("mov.u32 %0, %%warpid ;" : "=r"(warpid));
asm("mov.u32 %0, %%smid ;"   : "=r"(smid));
asm("mov.u32 %0, %%nsmid ;"  : "=r"(nsmid));

Once laneid, warpid and smid are ready, each thread can compute the starting address of its stack.
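For concreteness, a minimal sketch of one way to do this, assuming an interleaved layout in which the 32 lanes of a warp touch consecutive doubles (the constants, names, and layout here are my assumptions, not necessarily the original code):

// Hypothetical layout: one region per SM, indexed by hardware warp slot,
// interleaved by lane so each stack level is one contiguous 256-byte segment.
#define STACK_DEPTH      16   // doubles per thread
#define MAX_WARPS_PER_SM 48   // 1536 threads / 32 lanes on Fermi

__device__ double* stack_base(double* g_stacks, int smid, int warpid, int laneid)
{
    size_t warp_offset = ((size_t)smid * MAX_WARPS_PER_SM + warpid)
                       * STACK_DEPTH * 32;
    // Level i of this thread's stack is then stack_base(...)[i * 32].
    return g_stacks + warp_offset + laneid;
}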

One issue is that the number of physical SMs is not equal to nsmid. For example:

GTX480: nsmid = 15

C2070: nsmid = 15, (physical SMs = 14)

This is mentioned in ptx_isa.pdf:

"A predefined, read-only special register that returns the maximum number of SM

identifiers. The SM identifier numbering is not guaranteed to be contiguous, so

%nsmid may be larger than the physical number of SMs in the device."

Of course, I can use one kernel to fetch nsmid before doing my work.
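Such a probe can be tiny, something like this sketch (the output pointer is my own name):

// One-off probe kernel: read %nsmid once on the target card.
__global__ void probe_nsmid(int* out)
{
    int nsmid;
    asm("mov.u32 %0, %%nsmid ;" : "=r"(nsmid));
    *out = nsmid;
}

launched once as probe_nsmid<<<1,1>>>(d_out) before the real work.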

My questions are:

(1) Does nsmid keep the same value on a given card?

(2) Is there any rule to know nsmid in advance, without probing it?

If you use %smid without remapping, you'll need to allocate 16 sections of global memory instead of 14, and there will be 2 empty sections left unused. (Maybe for the specific C2070 you probed, you'll only need 15 sections with 1 empty.)

As for your questions:

  1. %nsmid is pre-defined. There is no reason to expect it to change on the same card.
  2. I don’t think you could, unless you know which 2 MPs on your card are disabled. Of course, I assume that different cards usually get different MPs disabled when they are manufactured.

Another problem I can see is that while you'll have 2 blocks running concurrently on the same MP, you're not doing anything to differentiate the 2 blocks. The two blocks will get the same smid and the same range of warpids. However, if two concurrent blocks always start and terminate together, I guess one of the 2 blocks will always have an even blockIdx.x and the other an odd one. This could be used to differentiate the 2 concurrent blocks.
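A minimal sketch of that parity idea (assuming exactly 2 resident blocks per SM, with smid fetched by the inline asm shown earlier):

// Use blockIdx.x parity to pick one of 2 stack sections per SM.
int tid           = threadIdx.y * blockDim.x + threadIdx.x;
int warp_in_block = tid >> 5;            // software warp index within the block
int slot          = blockIdx.x & 1;      // 0 or 1 under the even/odd assumption
int section       = smid * 2 + slot;     // so allocate 2 sections per SM
// each warp then uses stack slot (section, warp_in_block)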

As for the shared memory approach, you can reduce the number of threads per SM and rely on ILP to achieve better ALU pipeline utilization. This would allow you to get 128 bytes/thread.

@hyqneuron

Yes, I can increase the stack size by sacrificing occupancy, but my app is memory-bound and the benefit from ILP is unclear.

I prefer to have at least 512 threads per SM, so I still need a general version that works with a large stack.

Oh, thank you for this important point. Concurrent blocks in one SM should work, because each warp indeed has a different warpid, so every thread can have its own stack without conflict. However, when a new block starts, it is possible that some warp of the new block gets the same warpid as a warp of an existing block. I think I need a counter per SM to remove this race condition.

For example, the following code can guarantee conflict-free warpids within one SM, and I allocate

(nsmid) x (2048 threads/SM) x (size of stack per thread)

However, I have to pay the penalty of an atomicAdd.

// counter is a zero-initialized __device__ int array with one entry per SM.
__shared__ volatile int s_warpid[16];

int tid = threadIdx.y * blockDim.x + threadIdx.x;
int laneid;
int smid;
int warpid;

asm("mov.u32 %0, %%laneid ;" : "=r"(laneid));
asm("mov.u32 %0, %%smid ;"   : "=r"(smid));

if (0 == laneid) {
    // One lane per warp grabs a unique slot; & 63 wraps within the
    // 64 warp slots (2048 threads) reserved per SM.
    s_warpid[tid >> 5] = (atomicAdd(&counter[smid], 1) & 63);
}

warpid = s_warpid[tid >> 5];

Could make it more generic by having a small global array with, say, 8 elements per SM. When a block starts, it reserves a stack by setting the first free element (of the 8 for its SM) to its block number + 1, and when it finishes it sets it back to 0 (you definitely won't have to use an atomic to set it back to 0). See the sketch below.
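Something like this, perhaps (hypothetical names; 16 is just an upper bound for nsmid, and one thread per block would call reserve_slot and broadcast the result):

// Zero-initialized reservation table: at most 8 resident blocks per SM.
__device__ int slots[16][8];

__device__ int reserve_slot(int smid)
{
    for (;;) {
        for (int i = 0; i < 8; ++i) {
            // 0 means free; claim the entry by writing blockIdx.x + 1.
            if (atomicCAS(&slots[smid][i], 0, (int)blockIdx.x + 1) == 0)
                return i;
        }
    }
}

__device__ void release_slot(int smid, int i)
{
    slots[smid][i] = 0;   // plain store, no atomic needed as noted above
}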

atomicAdd is not a problem, because the number of atomicAdds is much smaller than the number of memory transactions in the kernel.
However, your idea is great: one SM supports at most 8 resident blocks, so I can use the block index. I will try this.

So far I have developed three versions:
v1: use shared memory to implement the stack
v2: use local memory to implement the stack
v3: use global memory to implement the stack

Obviously v1 runs fastest, and v3 is at most 50% slower than v1.
However, v2 is 8x slower than v1, which is weird.

I compared the assembly code of v2 with that of v3; the only difference is LDL/STL (local) versus LD.E/ST.E (global). I don't know why v2 is so slow.

Local memory bypasses L1.

Edit: are you running in 64-bit mode?
Also, reducing occupancy could still work well even if your kernel is memory-bound, as long as your kernel doesn’t use too many registers. You’ve seen the GTC 2010 doc Better Performance At Lower Occupancy, haven’t you? The author managed to get 84% of peak throughput at 4% occupancy.

I made a mistake in computing the stack pointer, so the accesses lost their coalesced property; that is why v2 was 8x slower than v1. Now v2 is almost the same as v3.
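To illustrate the kind of pitfall (my guess at the mistake, not necessarily the actual one): giving each thread a contiguous private region breaks coalescing, while interleaving by lane keeps a warp's accesses contiguous.

// Non-coalesced: lanes 0..31 hit addresses STACK_DEPTH * 8 bytes apart.
double* bad  = stacks + tid * STACK_DEPTH;   // level i is bad[i]

// Coalesced: a warp's 32 threads share one contiguous segment per level;
// level i is then good[i * 32].
double* good = stacks + (tid >> 5) * STACK_DEPTH * 32 + (tid & 31);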

My work is to evaluate Reverse Polish Notation (RPN). For example, the infix expression a * b + c is transformed to the postfix form a b * c +, and all expressions are packed into a linear array. Each thread reads one expression and uses a stack to evaluate it.

The bottleneck is the efficiency of loading operands. In my experiment I generate the addresses of operands at random, so the loads are fully non-coalesced. Also, I focus on double precision, so the operand type is double. To simplify the prototype design, I only handle +, -, * and /.
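For reference, the evaluation loop could look roughly like this sketch (the token encoding and all names are my assumptions; STRIDE would be 1 for a private per-thread stack and 32 for the interleaved layouts discussed above):

// Tokens >= 0 index an operand; negative tokens are operators.
enum { OP_ADD = -1, OP_SUB = -2, OP_MUL = -3, OP_DIV = -4, OP_END = -5 };

template <int STRIDE>
__device__ double eval_rpn(const int* tokens, const double* operands,
                           double* stack)   // this thread's stack base
{
    int top = 0;
    for (int i = 0; tokens[i] != OP_END; ++i) {
        int t = tokens[i];
        if (t >= 0) {
            // Random operand index: this load is the non-coalesced bottleneck.
            stack[top++ * STRIDE] = operands[t];
        } else {
            double b = stack[--top * STRIDE];
            double a = stack[--top * STRIDE];
            double r;
            switch (t) {
                case OP_ADD: r = a + b; break;
                case OP_SUB: r = a - b; break;
                case OP_MUL: r = a * b; break;
                default:     r = a / b; break;   // OP_DIV
            }
            stack[top++ * STRIDE] = r;
        }
    }
    return stack[0];
}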

If I use shared memory as the stack, then I have following configurations.

(1) stack size = 8, occupancy is 768 threads per SM

(2) stack size = 12, occupancy is 512 threads per SM

(3) stack size = 16, occupancy is 384 threads per SM

(4) stack size = 32, occupancy is 192 threads per SM

(5) stack size = 64, occupancy is 96 threads per SM

(6) stack size = 128, occupancy is 32 threads per SM

My experiments show (1), (2), (3), (4) are almost the same, (5) is 30% slower and (6) is 3x slower.

As I mentioned, this app is memory-bound and the cost of loading operands dominates the bandwidth, so occupancy may not be critical here.

The good news is that I can have a large stack of size 32 without sacrificing performance. The C2070 is 10x~16x faster than a single thread on a Core i7, and the speedup on the GTX480 is over 20x.

If your app is memory-bound, you should be able to switch between two pointers and store the first 32 values in shared memory and the rest in local or global memory?
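Roughly like this, I imagine (a hypothetical sketch; a stride-32 interleaved layout is assumed for both parts):

#define SHARED_DEPTH 32   // stack entries kept in shared memory per thread

__device__ void push(double v, int& top,
                     volatile double* s_stack,   // shared part
                     double* g_stack)            // spill area (local or global)
{
    if (top < SHARED_DEPTH) s_stack[top * 32] = v;
    else                    g_stack[(top - SHARED_DEPTH) * 32] = v;
    ++top;
}

__device__ double pop(int& top,
                      volatile double* s_stack, double* g_stack)
{
    --top;
    return (top < SHARED_DEPTH) ? s_stack[top * 32]
                                : g_stack[(top - SHARED_DEPTH) * 32];
}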

I only use shared memory. If stack size = 32, occupancy is 192 threads per SM, and the total size of the stacks is 192 * 32 * sizeof(double) = 48KB.

I don’t think this is correct. According to the Fermi Tuning Guide (pg 3): “Devices of compute capability 2.x come with an L1/L2 cache hierarchy that is used to cache local and global memory accesses.” In fact, a few paragraphs later it says: “Local memory caching in L1 cannot be disabled…”

I misread the PTX manual. Sorry for the wrong info.

Actually it’s global memory stores that bypass the L1.