Hi,
I created an application that runs in three different precisions. I use the same compile line for the FP64, FP32, and FP16 versions. The build line is:
nvcc -I/usr/local/cuda/include/ -L/usr/local/cuda/lib64/ -O3 -std=c++11 -gencode arch=compute_61,code=[sm_61,compute_61] --resource-usage ./cuda_half_mxm.cu -o ./cuda_mxm_half
The problem is that I get different register usage with and without the -g -G flags.
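To make the comparison explicit, the two builds would look roughly like the sketch below. The first command is the build line given above. The second is an assumption about how the debug build is invoked: where -g -G sit on the command line and the `_debug` output name are only illustrative.

```shell
# Build without debug flags (the build line from this post)
nvcc -I/usr/local/cuda/include/ -L/usr/local/cuda/lib64/ -O3 -std=c++11 \
     -gencode arch=compute_61,code=[sm_61,compute_61] \
     --resource-usage ./cuda_half_mxm.cu -o ./cuda_mxm_half

# Build with debug flags: same line plus -g (host debug info) and -G (device debug info).
# The output name cuda_mxm_half_debug is hypothetical, chosen to keep both binaries side by side.
nvcc -g -G -I/usr/local/cuda/include/ -L/usr/local/cuda/lib64/ -O3 -std=c++11 \
     -gencode arch=compute_61,code=[sm_61,compute_61] \
     --resource-usage ./cuda_half_mxm.cu -o ./cuda_mxm_half_debug
```

In both cases --resource-usage makes ptxas print the per-kernel register count, which is where the numbers below come from.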
For FP16 without -g -G:
ptxas info : Used 32 registers, 348 bytes cmem[0]
For FP16 with -g -G:
ptxas info : Used 39 registers, 72 bytes cumulative stack size, 348 bytes cmem[0], 28 bytes cmem[2]
For FP32 without -g -G:
ptxas info : Used 32 registers, 348 bytes cmem[0]
For FP32 with -g -G:
ptxas info : Used 16 registers, 348 bytes cmem[0]
For FP64 without -g -G:
ptxas info : Used 32 registers, 348 bytes cmem[0]
For FP64 with -g -G:
ptxas info : Used 18 registers, 348 bytes cmem[0]
Does anyone know why register usage varies for the same application with and without the debug flags?
Is it right to expect register usage to be smaller for FP16?
Thanks