Hi,
lots of fun here with code compiled via nvcc -cuda and Intel as the host compiler. Excerpts from two of my kernels follow, compiled with:
nvcc -I/usr/local/cuda/2.0/include -g --host-compilation C --ptxas-options=-v -arch sm_10 -cuda -o coproc_axpy_cuda.cu.c coproc_axpy_cuda.cu
icc -g -Wall -strict-ansi -wd981 -I/usr/local/cuda/2.0/include -wd561 -malign-double -o object/coproc_axpy_cuda.o -c coproc_axpy_cuda.cu.c
Without fatbin support (aka -arch sm_10 on a G80):
ptxas info : Compiling entry function ‘__globfunc__Z17knl_dbicgupdate_sdPdS_S_i’
ptxas info : Used 4 registers, 52+48 bytes smem
ptxas info : Compiling entry function ‘__globfunc__Z17knl_sbicgupdate_sPfS_S_fi’
ptxas info : Used 4 registers, 48+44 bytes smem
Kernel declarations:
__global__ void knl_dbicgupdate_s(double alphad, double *s, double *r, double *v, int N) {…}
__global__ void knl_sbicgupdate_s(float *s, float *r, float *v, float alphaf, int N) {…}
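For anyone who wants to poke at the ptxas numbers without my codebase, a dummy body along these lines should do (illustrative sketch only, with made-up arithmetic; obviously not my actual code):

/* Dummy stand-in body: one thread per element, guarded against overrun.
   The arithmetic is made up; only the parameter list matters here. */
__global__ void knl_dbicgupdate_s(double alphad, double *s, double *r, double *v, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        s[i] = r[i] - alphad * v[i];
}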
If I read the mangled names correctly (standard Itanium C++ mangling behind the __globfunc_ prefix, where S_ is a substitution for the previously seen pointer type):
__globfunc__Z17knl_dbicgupdate_sdPdS_S_i
decodes as d (double), Pd (pointer to double), S_ S_ (two more double pointers), i (int), i.e. knl_dbicgupdate_s(double, double*, double*, double*, int), and
__globfunc__Z17knl_sbicgupdate_sPfS_S_fi
decodes as Pf (pointer to float), S_ S_ (two more float pointers), f (float), i (int), i.e. knl_sbicgupdate_s(float*, float*, float*, float, int).
That all matches the declarations above and makes sense: since kernel parameters are passed via smem, smem consumption drops by sizeof(double) - sizeof(float) = 4 bytes for the one scalar. Check.
Let’s look at another op from my codebase:
ptxas info : Compiling entry function ‘__globfunc__Z18knl_dbicgupdate_xrddPdS_S_S_S_S_i’
ptxas info : Used 5 registers, 84+80 bytes smem
ptxas info : Compiling entry function ‘__globfunc__Z18knl_sbicgupdate_xrPfS_S_fS_fS_S_i’
ptxas info : Used 5 registers, 84+80 bytes smem
__global__ void knl_dbicgupdate_xr(double alphad, double omegad, double *x, double *r, double *phat, double *shat, double *s, double *t, int N)
__global__ void knl_sbicgupdate_xr(float *x, float *r, float *phat, float alphaf, float *shat, float omegaf, float *s, float *t, int N)
According to my reading of the mangled names above, everything matches the declarations. So why the heck is there no difference in smem consumption, even though two scalars are now float instead of double??
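To make my expectation explicit, here is the back-of-the-envelope tally I'm doing, as a throwaway host-side snippet (my assumptions: 8-byte device pointers on this 64-bit build, no padding between parameters, and identical fixed header overhead on top of the parameters within each pair):

#include <stdio.h>

/* Naive parameter-block tally for both kernel pairs.
   Assumptions: 8-byte device pointers (64-bit build), no inter-parameter
   padding, identical header overhead within each pair. */
int main(void)
{
    unsigned long ds  = sizeof(double) + 3 * sizeof(double *) + sizeof(int);     /* knl_dbicgupdate_s  */
    unsigned long ss  = sizeof(float)  + 3 * sizeof(float *)  + sizeof(int);     /* knl_sbicgupdate_s  */
    unsigned long dxr = 2 * sizeof(double) + 6 * sizeof(double *) + sizeof(int); /* knl_dbicgupdate_xr */
    unsigned long sxr = 2 * sizeof(float)  + 6 * sizeof(float *)  + sizeof(int); /* knl_sbicgupdate_xr */
    printf("_s  pair: %lu vs %lu bytes (delta %lu)\n", ds,  ss,  ds  - ss);
    printf("_xr pair: %lu vs %lu bytes (delta %lu)\n", dxr, sxr, dxr - sxr);
    return 0;
}

On my box this prints 36 vs. 32 (delta 4) for the _s pair, which lines up with the 52+48 vs. 48+44 above, and 68 vs. 60 (delta 8) for the _xr pair, which is exactly the difference I'm missing in the identical 84+80 numbers.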
This issue has weird consequences, by the way: in my full-fledged code, cublasInit() consistently dies, regardless of whether I call it explicitly or implicitly (when I use CUBLAS without calling cublasInit(), gdb shows it being invoked on the first cublas*() call), with a backtrace like:
#0 0x00002b5b8bdef475 in ?? () from /usr/lib64/libcuda.so
#1 0x00002b5b8bdef9b2 in ?? () from /usr/lib64/libcuda.so
#2 0x00002b5b8bdef834 in ?? () from /usr/lib64/libcuda.so
#3 0x00002b5b8bd536c0 in ?? () from /usr/lib64/libcuda.so
#4 0x00002b5b8bd53795 in ?? () from /usr/lib64/libcuda.so
#5 0x00002b5b8bdef24b in ?? () from /usr/lib64/libcuda.so
#6 0x00002b5b8bd030bc in ?? () from /usr/lib64/libcuda.so
#7 0x00002b5b8bcf94ce in ?? () from /usr/lib64/libcuda.so
#8 0x00002b5b8bed879a in ?? () from /usr/lib64/libcuda.so
#9 0x00002b5b8bf08a42 in ?? () from /usr/lib64/libcuda.so
#10 0x00002b5b8bf18740 in gpudbg_rt_debugger_syscall () from /usr/lib64/libcuda.so
#11 0x00002b5b8bf33c06 in gpudbg_rt_debugger_syscall () from /usr/lib64/libcuda.so
#12 0x00002b5b8bf181a5 in gpudbg_rt_debugger_syscall () from /usr/lib64/libcuda.so
#13 0x00002b5b8bf052e3 in ?? () from /usr/lib64/libcuda.so
#14 0x00002b5b8bcbba87 in ?? () from /usr/lib64/libcuda.so
#15 0x00002b5b89a6f4df in cudaMemcpyFromSymbolAsync () from /usr/local/cuda/2.0/lib/libcudart.so.2
#16 0x00002b5b89a6d6d2 in cudaMemcpyFromSymbolAsync () from /usr/local/cuda/2.0/lib/libcudart.so.2
#17 0x00002b5b89a50add in cudaFree () from /usr/local/cuda/2.0/lib/libcudart.so.2
#18 0x00002b5b89c97e10 in cublasInitCtx () from /usr/local/cuda/2.0/lib/libcublas.so.2
#19 0x00002b5b89cf45b7 in cublasIdamin () from /usr/local/cuda/2.0/lib/libcublas.so.2
#20 0x00002b5b89c97f30 in cublasInit () from /usr/local/cuda/2.0/lib/libcublas.so.2
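For reference, the explicit initialisation path is nothing fancier than this (reduced sketch using the legacy cublasStatus API, not my actual code, which sits inside a larger Fortran-driven setup routine):

#include <stdio.h>
#include <cublas.h>

/* Reduced sketch of the explicit initialisation path -- not my real code. */
int main(void)
{
    cublasStatus stat = cublasInit();   /* dies in here, as in the backtrace above */
    if (stat != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasInit failed with status %d\n", (int)stat);
        return 1;
    }
    /* ... cublasIdamin() and friends would follow here ... */
    cublasShutdown();
    return 0;
}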
Things get worse if I compile to a fatbin, and exceptionally worse when I switch from Intel 10.1 to 11.0. I should mention that the whole thing is occasionally called from Fortran (the above issues are independent of the Fortran linking). None of this is explicitly forbidden by the nvcc docs.
What a waste of time. Could someone please shed some light on nvcc -cuda (which, as things stand, I can't use in my apps), or on what I might be doing wrong? Boiling this down to a repro app would take a full day (I tried today and failed); I hope the above samples suffice.
dom