any way to know on which SM a thread is running?

Is there any way by which I would know on which SM exactly a thread is running? Actually I just want to confirm that all the SMs have some work to do.

Is there any way by which I would know on which SM exactly a thread is running? Actually I just want to confirm that all the SMs have some work to do.

__device__ uint get_smid(void) {

     uint ret;

     asm("mov.u32 %0, %smid;" : "=r"(ret) );

     return ret;

}

And you only need to call it in one thread of your block.

__device__ uint get_smid(void) {

     uint ret;

     asm("mov.u32 %0, %smid;" : "=r"(ret) );

     return ret;

}

And you only need to call it in one thread of your block.

Well… nice code there. But all the output I’m getting for the smid is just “1”. I have 8 blocks, so it gives me eight “1”.

Well… nice code there. But all the output I’m getting for the smid is just “1”. I have 8 blocks, so it gives me eight “1”.

Riedijk’s code works flawlessly on my linux boxes, with

8600GT and CUDA 3.2

GTX460 and CUDA 3.2

GTX480 and CUDA 3.2

and also on the old CUDA

GTX280 and CUDA 2.3

Try the code below, compiling it with nvcc code.cu and executing it with ./a.out 100

If it fails, describe your hardware, OS and CUDA stack

#include <stdio.h>

#include <stdlib.h>

#include <cuda.h>

/* E.D. Riedijk */

__device__ uint get_smid(void) {

     uint ret;

     asm("mov.u32 %0, %smid;" : "=r"(ret) );

     return ret;

}

__global__ void kern(int *sm){

   if (threadIdx.x==0)

      sm[blockIdx.x]=get_smid();

}

int main(int argc, char *argv[]){

   int N = atoi(argv[1]);

   int *sm, *sm_d;

   sm = (int *) malloc(N*sizeof(*sm));

   cudaMalloc((void**)&sm_d,N*sizeof(*sm_d));

   kern<<<N,N>>>( sm_d);

   cudaMemcpy(sm, sm_d, N*sizeof(int), cudaMemcpyDeviceToHost);

   for (int i=0;i<N;i++)

      printf("%d %d\n",i,sm[i]);

return 0;

}

Riedijk’s code works flawlessly on my linux boxes, with

8600GT and CUDA 3.2

GTX460 and CUDA 3.2

GTX480 and CUDA 3.2

and also on the old CUDA

GTX280 and CUDA 2.3

Try the code below, compiling it with nvcc code.cu and executing it with ./a.out 100

If it fails, describe your hardware, OS and CUDA stack

#include <stdio.h>

#include <stdlib.h>

#include <cuda.h>

/* E.D. Riedijk */

__device__ uint get_smid(void) {

     uint ret;

     asm("mov.u32 %0, %smid;" : "=r"(ret) );

     return ret;

}

__global__ void kern(int *sm){

   if (threadIdx.x==0)

      sm[blockIdx.x]=get_smid();

}

int main(int argc, char *argv[]){

   int N = atoi(argv[1]);

   int *sm, *sm_d;

   sm = (int *) malloc(N*sizeof(*sm));

   cudaMalloc((void**)&sm_d,N*sizeof(*sm_d));

   kern<<<N,N>>>( sm_d);

   cudaMemcpy(sm, sm_d, N*sizeof(int), cudaMemcpyDeviceToHost);

   for (int i=0;i<N;i++)

      printf("%d %d\n",i,sm[i]);

return 0;

}

that might very well be the case if your kernel doesn’t use a lot of registers and shared memory. 8 blocks can run concurrently on one multiprocessor, try with more blocks ;)

I am running my card (1.1) which has 4 SM with CUDA 3.2, and get the following:

$ ./a.out 4

0 0

1 4

2 1

3 5

Could anybody explain to me why those four numbers? Thanks!

Hi benetion
cudaMemcpy(sm, sm_d, N*sizeof(int), cudaMemcpyDeviceToHost);
Lacks a synchronization after it. Add cudaThreadSynchronize() to see if there’s any change. Though I doubt this is actually the reason that the smids appear to be incorrect for you. Maybe you should set N equals to at least twenty and report back the result.

Thanks for the suggestion. Adding cudaThreadSynchronize() does not help.

$ ./a.out 4

0 0

1 4

2 1

3 5

$ ./a.out 20

0 0

1 4

2 1

3 5

4 0

5 4

6 1

7 5

8 0

9 4

10 1

11 5

12 0

13 4

14 1

15 5

16 0

17 4

18 1

19 5

Well, clearly you are having 4 SMs, just that the labelling seems weird. 0, 1, 4 and 5. Nvidia seems to have a consistent problem with their SM management. I was having some even weirder result with block launching on a GTX 460. Perhaps you just have to live with that weird labelling. Or, maybe somebody else could prove me wrong.

GTX 285 also has numbers higher than 29. There are also some gaps in the reported numbers there. I just made a translation table from the smids to an array of 0…29

SMID is a bitfield. On Tesla with CUDA 3.x/4.0, the 2 low-order bits indicate the SM number within a TPC, and the high-order bits point to the TPC number.

So SMID=4 means TPC 1, SM 0.

The code above is compiled to:

S2R R0, SR0;                    // Read PhysID (http://forums.nvidia.com/index.php?showtopic=45458)

LOP.AND R1, R0, c [0x1] [0x0];  // Extract TPC ID (c[1][0]=0x0ff00000)

LOP.AND R2, R0, c [0x1] [0x1];  // Extract SM ID  (c[1][1]=0x000f0000)

SHR R1, R1, 0x12;               // Align to lsb+2

SHR R2, R2, 0x10;               // Align to lsb

LOP.OR R1, R1, R2;              // OR them together

Note that G9x and later GPUs do “breadth-first” scheduling, by first scattering the workload across the first SM of multiple TPCs, then filling the other SMs, and then finally scheduling multiple blocks per SM.

Hi Sylvain

Thanks for the clarification! No wonder %smid is incorrect for the non-Fermi cards. ptxas is such a lazy boy! Doesn’t even take into account the number of SMs under each TPC for CC1.x when calculating the smid.

Do you have more info on SRx and the content of c?

I do not see why %smid is incorrect… It behaves as specified in the documentation. The PTX manual even warns that “The SM identifier numbering is not guaranteed to be contiguous.”

c are just locations in const memory that cuobjdump does not display by default (it does with the -elf option).

Or with good old Decuda:

mov.b32 $r0, %physid

and.b32 $r1, $r0, c1[0x0000]

and.b32 $r2, $r0, c1[0x0004]

shr.u32 $r1, $r1, 0x00000012

shr.u32 $r2, $r2, 0x00000010

or.b32 $r1, $r1, $r2

// segment: const (1:0000)

0000: 0ff00000 000f0000

SRx are described in decuda’s README. They are mostly the same as (a subset of) PTX special registers.

I’m not familiar with the PTX stuff, but I’ve done some search and found out that %smid is a read-only predefined variable, so is there a way to manipulate to which SM a block will go? or the only way is to use the queried smid to distribute the work somehow?

Thanks

This is the easiest way.

The secret behind writing good CUDA code is to not to attempt to know murky/underground/undocumented things.