I have a simple kernel and I am trying to understand the reason for the stalls that are showing up for it (sm 52, cc 7.5).
#define datafloat double
#define BX 16
#define BY 16
#define BDIM 256
__global__ void jacobiShared(const int N,
const datafloat *rhs,
const datafloat *u,
datafloat *newu){
// Get thread indices
const int i = blockIdx.x*blockDim.x + threadIdx.x;
const int j = blockIdx.y*blockDim.y + threadIdx.y;
__shared__ float s_u[BY+2][BX+2];
// load block of data in to shared memory
for(int n=threadIdx.y; n<BY+2; n+=BY){
for(int m=threadIdx.x; m<BX+2; m+=BX){
const int i = blockIdx.x*blockDim.x + m;
const int j = blockIdx.y*blockDim.y + n;
datafloat val = 0.;
if(i < N+2 && j < N+2)
val = u[j*(N+2) + i];
s_u[n][m] = val;
}
}
// barrier until all the shared memory entries are loaded
__syncthreads();
if((i < N) && (j < N)){
// Get padded grid ID
const int pid = (j + 1)*(N + 2) + (i + 1);
datafloat invD = 4;
newu[pid] = (rhs[pid]
+ s_u[threadIdx.y+0][threadIdx.x+1]
+ s_u[threadIdx.y+2][threadIdx.x+1]
+ s_u[threadIdx.y+1][threadIdx.x+0]
+ s_u[threadIdx.y+1][threadIdx.x+2]
)/invD;
}
}
Here the datafloat
is defined as a double
(say, case 1). Hence in the SASS file, I see a bunch of F2F.F64.F32
as expected. Now, if I make the datafloat
as int
(say, case 2), I see a bunch of I2F
, also as expected. And finally when I define the datafloat
as float
(say case 3), there are no such conversions.
If I now look at the stalls (in Nsight Compute for example - Warp State (All cycles)), case 1 has Stall Tex Throttle (8.66), while case 2 and 3 doesn’t have Stall Tex Throttle. So my questions are:
- Why does case 1 have Stall Tex Throttle listed as a stall reason (3rd place in the list of stalls), while case 2 and 3 don’t? I am not using/defining any texture memory. Does F2F use texture memory by any chance?
- Why does case 2 not have Stall Tex Throttle? If the conversions from one datatype to another uses Tex memory, shouldn’t
I2F
cause a Tex throttle too?