The PTX and SASS codes corresponding to the device code are empty

I have a simple CUDA demo, when I tried to measure kernel execution time using nsys command, I find that the execution of kernel code is independent of the number of loops in the loop body.

My code is as follows :

#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/time.h>

#define BDIMX 32
#define BDIMY 32

__global__ void setRowReadRow(int * out)
{
    unsigned int idx=threadIdx.y*blockDim.x+threadIdx.x;

    for(unsigned int e=0; e<65536*30000; e++)
    for(unsigned int w=0; w<65536*30000; w++)
    for(unsigned int q=0; q<65536*30000; q++)
    for(unsigned int i=0; i<65535*30000; i++)
    for(unsigned int j=0; j<65535*30000; j++)
    for(unsigned int k=0; k<65536*30000; k++)
    for(unsigned int l=0; l<65535*30000; l++)
    for(unsigned int m=0; m<65536; m++){
      out[idx] +=  m*j   ;
    }

    
}

int main(int argc,char **argv)
{
  int nElem=BDIMX*BDIMY;
  int nBlock=1;
  int nByte=sizeof(int)*nElem*nBlock ;
  int * out = NULL;
  int * host_data = NULL;
  
  host_data = (int *)malloc(sizeof(int) * nElem);
  cudaMalloc((int**)&out,nByte);
  
  dim3 block(BDIMY,BDIMX);
  dim3 grid(1);

  setRowReadRow<<<grid,block>>>(out);
  cudaMemcpy(host_data, out, sizeof(int) * nElem,
                             cudaMemcpyDeviceToHost);
  cudaFree(out);
  return 0;
}

So I tried to get kernel PTX and SASS code,I used the following command to get the PTX and SASS code:

# compile
nvcc demo.cu -arch=sm_80 -o demo
# get ptx code
cuobjdump --dump-ptx ./demo
# get sass code
nvcc demo.cu -arch=sm_80 -c demo.cu
cuobjdump -sass demo.o

PTX code is as follows, only one ret command :

Fatbin elf code:
================
arch = sm_80
code version = [1,7]
host = linux
compile_size = 64bit

Fatbin elf code:
================
arch = sm_80
code version = [1,7]
host = linux
compile_size = 64bit

Fatbin ptx code:
================
arch = sm_80
code version = [8,1]
host = linux
compile_size = 64bit
compressed

.version 8.1
.target sm_80
.address_size 64

.visible .entry _Z13setRowReadRowPi(
.param .u64 _Z13setRowReadRowPi_param_0
)
{
ret;
}

SASS code is as follows:

Fatbin elf code:
================
arch = sm_80
code version = [1,7]
host = linux
compile_size = 64bit

        code for sm_80
                Function : _Z13setRowReadRowPi
        .headerflags    @"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM80 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM80)"
        /*0000*/                   MOV R1, c[0x0][0x28] ;  /* 0x00000a0000017a02 */
                                                           /* 0x000fc40000000f00 */
        /*0010*/                   EXIT ;                  /* 0x000000000000794d */
                                                           /* 0x000fea0003800000 */
        /*0020*/                   BRA 0x20;               /* 0xfffffff000007947 */
                                                           /* 0x000fc0000383ffff */
        /*0030*/                   NOP;                    /* 0x0000000000007918 */
                                                           /* 0x000fc00000000000 */
        /*0040*/                   NOP;                    /* 0x0000000000007918 */
                                                           /* 0x000fc00000000000 */
        /*0050*/                   NOP;                    /* 0x0000000000007918 */
                                                           /* 0x000fc00000000000 */
        /*0060*/                   NOP;                    /* 0x0000000000007918 */
                                                           /* 0x000fc00000000000 */
        /*0070*/                   NOP;                    /* 0x0000000000007918 */
                                                           /* 0x000fc00000000000 */
        /*0080*/                   NOP;                    /* 0x0000000000007918 */
                                                           /* 0x000fc00000000000 */
        /*0090*/                   NOP;                    /* 0x0000000000007918 */
                                                           /* 0x000fc00000000000 */
        /*00a0*/                   NOP;                    /* 0x0000000000007918 */
                                                           /* 0x000fc00000000000 */
        /*00b0*/                   NOP;                    /* 0x0000000000007918 */
                                                           /* 0x000fc00000000000 */
                ..........

So my question is why is this PTX and SASS code empty? My GPU is A800, OS is Ubuntu20.04.

In C++ overflow of a signed integer is undefined behavior. Reducing the for loop nesting and iteration per loop it can be observed that the compiler reduces the loops to a constant value. When the constant exceed MAX_INT the kernel becomes a NOP kernel. It is very quick to test this in godbolt.org using CUDA C++ compiler.

2 Likes

Thank you for your answer, but I have a new problem.

I made some changes to the kernel code as follows:

__global__ void setRowReadRow(int * out)
{

    unsigned int idx=threadIdx.y*blockDim.x+threadIdx.x;

    for(unsigned int j=0; j<65536; j++)
        for(unsigned int m=0; m<65536; m++)
            out[idx] +=  m ;
 }
}

The representation range of the int data type is [-2^31, 2^32-1],and 65536*65536=2^32, Obviously the above kerenl code will overflow.

However in godbolt.org the corresponding ptx code is as follows:

.visible .entry setRowReadRow(int*)(
        .param .u64 setRowReadRow(int*)_param_0
)
{

        ld.param.u64    %rd1, [setRowReadRow(int*)_param_0];
        cvta.to.global.u64      %rd2, %rd1;
        mov.u32         %r1, %tid.y;
        mov.u32         %r2, %ntid.x;
        mov.u32         %r3, %tid.x;
        mad.lo.s32      %r4, %r1, %r2, %r3;
        mul.wide.u32    %rd3, %r4, 4;
        add.s64         %rd4, %rd2, %rd3;
        ld.global.u32   %r5, [%rd4];
        xor.b32         %r6, %r5, -2147483648;
        st.global.u32   [%rd4], %r6;
        ret;
}

So my new question is why the PTX code not empty in the case of this overflow.

As stated above:

Undefined behavior (UB) means just that: the behavior is undefined, anything can happen. In ancient programmer folklore this includes the possibility of nasal demons. When UB is present it makes no sense to try and analyze a compiler’s response.

1 Like