# SASS for F2I and I2F conversions

I have a simple kernel and I am trying to understand the reason for the stalls that are showing up for it (sm 52, cc 7.5).

``````#define datafloat double

#define BX 16
#define BY 16
#define BDIM 256

__global__ void jacobiShared(const int N,
const datafloat *rhs,
const datafloat *u,
datafloat *newu){

const int i = blockIdx.x*blockDim.x + threadIdx.x;
const int j = blockIdx.y*blockDim.y + threadIdx.y;

__shared__ float s_u[BY+2][BX+2];

// load block of data in to shared memory

const int i = blockIdx.x*blockDim.x + m;
const int j = blockIdx.y*blockDim.y + n;

datafloat val = 0.;

if(i < N+2 &&  j < N+2)
val = u[j*(N+2) + i];

s_u[n][m] = val;
}
}

// barrier until all the shared memory entries are loaded

if((i < N) && (j < N)){

const int pid = (j + 1)*(N + 2) + (i + 1);

datafloat invD = 4;

newu[pid] = (rhs[pid]
)/invD;
}
}
``````

Here the `datafloat` is defined as a `double` (say, case 1). Hence in the SASS file, I see a bunch of `F2F.F64.F32` as expected. Now, if I make the `datafloat` as `int` (say, case 2), I see a bunch of `I2F`, also as expected. And finally when I define the `datafloat` as `float` (say case 3), there are no such conversions.

If I now look at the stalls (in Nsight Compute for example - Warp State (All cycles)), case 1 has Stall Tex Throttle (8.66), while case 2 and 3 doesnâ€™t have Stall Tex Throttle. So my questions are:

1. Why does case 1 have Stall Tex Throttle listed as a stall reason (3rd place in the list of stalls), while case 2 and 3 donâ€™t? I am not using/defining any texture memory. Does F2F use texture memory by any chance?
2. Why does case 2 not have Stall Tex Throttle? If the conversions from one datatype to another uses Tex memory, shouldnâ€™t `I2F` cause a Tex throttle too?

Questions about specific profiler event counters are best asked in the Nsight Compute sub-forum: Nsight Compute - NVIDIA Developer Forums

The code is being run on a consumer card (`sm_52`) with low FP64 throughput, which generally makes stalls more likely whenever FP64 functionality is being used. Conversions involving two 32-bit types are handled by higher-throughput FP32 pipes.

• Why does case 1 have Stall Tex Throttle listed as a stall reason (3rd place in the list of stalls), while case 2 and 3 donâ€™t? I am not using/defining any texture memory. Does F2F use texture memory by any chance?

On CC 7.5 (Turing) and consumer focused GPUs the F2F.F64.F32 is issued to a reduced throughput FP64 unit that is shared by all 4 warp schedulers. The FP64 unit and register write-back share the same data path as the texture unit.

• Why does case 2 not have Stall Tex Throttle? If the conversions from one datatype to another uses Tex memory, shouldnâ€™t `I2F` cause a Tex throttle too?

On most GPUs I2F and F2F (32-bit float only) is implemented in the XU/SFU (special function unit) that is per warp scheduler but issued through MIO. These instructions will have a stall reason of mio_throttle and instrutions dependent on the result will have short_scoreboard.

The XU/SFU throughput is much higher than the FP64 throughput on CC 7.5 so you are likely to see a lot more stalls in the F2F.FP64 case than the other 2 cases.

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.