Hello,
I haven’t been working with CUDA for a while and am now coming back to it. I was very surprised by how inefficiently this simple routine uses resources. Consider this code:
__global__ void kernel(double* sd, const double* rk, const double* rw,
                       int N, int ny, int slice_stride)
{
    __shared__ double rw_row[32];
    double sum = 0.0;
    for (int i = 32; i < N; i += 32)
    {
        if (threadIdx.x < 32)
            rw_row[threadIdx.x] = rw[blockIdx.x * N + i + threadIdx.x];
        __syncthreads();
        for (int m = 0; m < 32; ++m)
        {
            sum += rw_row[m] * rk[(i + m) * slice_stride + ny * blockIdx.x + threadIdx.x];
        }
        __syncthreads();  // keep rw_row from being refilled while other threads still read it
    }
    sd[blockIdx.x * N + threadIdx.x] = sum;
}
Compiled with -O3 --fmad=true for sm_35. This code ends up using 44(!!!) registers for what is essentially three pointers, two offsets, and one accumulation variable. The disassembly reveals that just doing the += with FMA occupies 22 registers (11 doubles). Setting --maxrregcount to anything smaller than 44 makes it spill to local memory (stack). What is going on here? 44 registers is a huge share of the budget for about the simplest kernel ever.
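For anyone who wants to reproduce this, the register count and any spills can be printed at compile time with the ptxas verbose flag (kernel.cu here is just a placeholder file name for the code above):

```shell
# Report per-kernel register usage and spill stores/loads
nvcc -O3 --fmad=true -arch=sm_35 -Xptxas -v -c kernel.cu

# Same build with a forced register cap, which is what triggers
# the stack (local memory) spills I'm describing
nvcc -O3 --fmad=true -arch=sm_35 --maxrregcount=32 -Xptxas -v -c kernel.cu
```

The verbose output shows lines like "Used NN registers" and "NN bytes spill stores", which is how I arrived at the 44-register figure.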
Please shed some light on this.
Thanks!