Identifying SM number of a block

I am trying to debug unstable block timings: sometimes a certain block runs quickly, and sometimes it seems to stall and take 10x longer than usual. This is Monte Carlo code, so it’s hard to reproduce; the problem tends to jump to different blocks on every run. To aid debugging, I am measuring per-block runtime with clock() sentinels at the start and end of each block.
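For readers who want to try the same measurement, here is a minimal sketch of the sentinel pattern described above. It is not the original poster's code: the kernel name, the dummy arithmetic loop, and the launch sizes are placeholders, and it assumes a GPU that supports clock64() (used here instead of clock() only to avoid 32-bit wraparound).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the loop below merely stands in for the real Monte Carlo work.
__global__ void timed_kernel(long long *block_cycles, float *out)
{
    long long start = clock64();              // sentinel at block start (per-SM cycle counter)

    float x = (float)threadIdx.x;
    for (int i = 0; i < 1000; ++i)
        x = x * 1.0001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;

    __syncthreads();                          // wait until every thread in the block has finished
    if (threadIdx.x == 0)
        block_cycles[blockIdx.x] = clock64() - start;   // sentinel at block end
}

int main()
{
    const int blocks = 64, threads = 128;     // arbitrary sizes for illustration
    long long *d_cycles;
    float *d_out;
    cudaMalloc(&d_cycles, blocks * sizeof(long long));
    cudaMalloc(&d_out, blocks * threads * sizeof(float));

    timed_kernel<<<blocks, threads>>>(d_cycles, d_out);

    long long h_cycles[blocks];
    cudaMemcpy(h_cycles, d_cycles, sizeof(h_cycles), cudaMemcpyDeviceToHost);
    for (int b = 0; b < blocks; ++b)
        printf("block %d: %lld cycles\n", b, h_cycles[b]);

    cudaFree(d_cycles);
    cudaFree(d_out);
    return 0;
}
```

Only thread 0 records the time, so the figure is the elapsed cycle count seen by that thread after the whole block has reached the barrier.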

For this debugging I want to see if “brother” blocks are slowed at the same time, a brother block being one that runs on the same SM simultaneously. CUDA abstracts this away, so blocks don’t really know where they’re being run, but knowing it now would help with debugging.

I do remember a thread here on the forums from two years ago or so (?) which found that SM IDs could be found by sneaking illegal peeks into unallocated shared memory.

Something weird (and clearly an unsupported hack!!) like (from memory):

extern __shared__ unsigned int array[];

int mySMnumber=array[-3];

I’m unsuccessful finding this thread again via Google (or the forum’s own less powerful search). Does anyone remember the details or have a link?

And yes, I do realize how ugly and unsupported this hack is…

Thanks!

Why not use the (also not really supported) inline PTX option? That way you can just read the supported %smid PTX special register to get the multiprocessor ID.

Something like below (warning I have never tried to write ptx ;))

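__device__ unsigned int get_sm_id(void)
{
    unsigned int result;
    // %smid is a read-only PTX special register, so it is named directly in the
    // instruction string; "%%" escapes the literal '%' in inline asm.
    asm volatile("mov.u32 %0, %%smid;" : "=r"(result));
    return result;
}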

If it works, please let me know. I still have a project somewhere in which, if I get some time, I will implement this to produce per-multiprocessor output, so that afterwards I can do a reduction over 32 values instead of over millions of values.
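To tie the %smid suggestion back to the original question, here is a rough sketch that records, for each block, the SM it ran on together with its start and stop cycle counts. The BlockRecord struct, the kernel name, the dummy workload, and the launch sizes are all made up for illustration. Two caveats: the PTX documentation does not promise that %smid stays fixed for the lifetime of a thread, and clock64() counters are per-SM, so timestamps should only be compared between blocks that report the same SM.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the PTX special register %smid (ID of the SM executing this thread).
__device__ unsigned int get_sm_id(void)
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Hypothetical record layout: one entry per block.
struct BlockRecord {
    unsigned int sm;      // SM the block reported
    long long start;      // clock64() at block start
    long long stop;       // clock64() at block end
};

// Hypothetical kernel: the loop stands in for the real per-block work.
__global__ void traced_kernel(BlockRecord *rec, float *out)
{
    long long start = clock64();

    float x = (float)threadIdx.x;
    for (int i = 0; i < 1000; ++i)
        x = x * 1.0001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;

    __syncthreads();                          // all threads of the block are done
    if (threadIdx.x == 0) {
        rec[blockIdx.x].sm    = get_sm_id();
        rec[blockIdx.x].start = start;
        rec[blockIdx.x].stop  = clock64();
    }
}

int main()
{
    const int blocks = 64, threads = 128;     // arbitrary sizes for illustration
    BlockRecord *d_rec;
    float *d_out;
    cudaMalloc(&d_rec, blocks * sizeof(BlockRecord));
    cudaMalloc(&d_out, blocks * threads * sizeof(float));

    traced_kernel<<<blocks, threads>>>(d_rec, d_out);

    BlockRecord h_rec[blocks];
    cudaMemcpy(h_rec, d_rec, sizeof(h_rec), cudaMemcpyDeviceToHost);
    for (int b = 0; b < blocks; ++b)
        printf("block %d: SM %u, cycles [%lld, %lld]\n",
               b, h_rec[b].sm, h_rec[b].start, h_rec[b].stop);

    cudaFree(d_rec);
    cudaFree(d_out);
    return 0;
}
```

Blocks that report the same SM and have overlapping [start, stop] intervals are the “brother” blocks from the question; sorting the records by SM on the host makes a simultaneous slowdown easy to spot.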