SASS for F2I and I2F conversions

I have a simple kernel and I am trying to understand the reason for the stalls that are showing up for it (sm 52, cc 7.5).

#define datafloat double

#define BX 16
#define BY 16
#define BDIM 256

__global__ void jacobiShared(const int N,
			     const datafloat *rhs,
			     const datafloat *u,
			     datafloat *newu){

  // Get thread indices
  const int i = blockIdx.x*blockDim.x + threadIdx.x;
  const int j = blockIdx.y*blockDim.y + threadIdx.y;

  __shared__ float s_u[BY+2][BX+2];

  // load block of data in to shared memory
  for(int n=threadIdx.y; n<BY+2; n+=BY){
    for(int m=threadIdx.x; m<BX+2; m+=BX){

      const int i = blockIdx.x*blockDim.x + m;
      const int j = blockIdx.y*blockDim.y + n;
      
      datafloat val = 0.;

      if(i < N+2 &&  j < N+2)
        val = u[j*(N+2) + i];
      
      s_u[n][m] = val;
    }
  }

  // barrier until all the shared memory entries are loaded
  __syncthreads();

  if((i < N) && (j < N)){

    // Get padded grid ID
    const int pid = (j + 1)*(N + 2) + (i + 1);

    datafloat invD = 4;

    newu[pid] = (rhs[pid]
		      + s_u[threadIdx.y+0][threadIdx.x+1]
		      + s_u[threadIdx.y+2][threadIdx.x+1]
		      + s_u[threadIdx.y+1][threadIdx.x+0]
		      + s_u[threadIdx.y+1][threadIdx.x+2]
		      )/invD;
  }
}

Here the datafloat is defined as a double (say, case 1). Hence in the SASS file, I see a bunch of F2F.F64.F32 as expected. Now, if I make the datafloat as int (say, case 2), I see a bunch of I2F, also as expected. And finally when I define the datafloat as float (say case 3), there are no such conversions.

If I now look at the stalls (in Nsight Compute for example - Warp State (All cycles)), case 1 has Stall Tex Throttle (8.66), while case 2 and 3 doesn’t have Stall Tex Throttle. So my questions are:

  1. Why does case 1 have Stall Tex Throttle listed as a stall reason (3rd place in the list of stalls), while case 2 and 3 don’t? I am not using/defining any texture memory. Does F2F use texture memory by any chance?
  2. Why does case 2 not have Stall Tex Throttle? If the conversions from one datatype to another uses Tex memory, shouldn’t I2F cause a Tex throttle too?

Questions about specific profiler event counters are best asked in the Nsight Compute sub-forum: Nsight Compute - NVIDIA Developer Forums

The code is being run on a consumer card (sm_52) with low FP64 throughput, which generally makes stalls more likely whenever FP64 functionality is being used. Conversions involving two 32-bit types are handled by higher-throughput FP32 pipes.

  • Why does case 1 have Stall Tex Throttle listed as a stall reason (3rd place in the list of stalls), while case 2 and 3 don’t? I am not using/defining any texture memory. Does F2F use texture memory by any chance?

On CC 7.5 (Turing) and consumer focused GPUs the F2F.F64.F32 is issued to a reduced throughput FP64 unit that is shared by all 4 warp schedulers. The FP64 unit and register write-back share the same data path as the texture unit.

  • Why does case 2 not have Stall Tex Throttle? If the conversions from one datatype to another uses Tex memory, shouldn’t I2F cause a Tex throttle too?

On most GPUs I2F and F2F (32-bit float only) is implemented in the XU/SFU (special function unit) that is per warp scheduler but issued through MIO. These instructions will have a stall reason of mio_throttle and instrutions dependent on the result will have short_scoreboard.

The XU/SFU throughput is much higher than the FP64 throughput on CC 7.5 so you are likely to see a lot more stalls in the F2F.FP64 case than the other 2 cases.