nvcc -cuda

Hi,

lots of fun here with code compiled via nvcc -cuda using Intel as the host compiler. Excerpts from two of my kernels follow, compiled with:

nvcc -I/usr/local/cuda/2.0/include -g --host-compilation C --ptxas-options=-v -arch sm_10 -cuda -o coproc_axpy_cuda.cu.c coproc_axpy_cuda.cu
icc -g -Wall -strict-ansi -wd981 -I/usr/local/cuda/2.0/include -wd561 -malign-double -o object/coproc_axpy_cuda.o -c coproc_axpy_cuda.cu.c

Without fatbin support (aka -arch sm_10 on a G80):

ptxas info : Compiling entry function ‘__globfunc__Z17knl_dbicgupdate_sdPdS_S_i’
ptxas info : Used 4 registers, 52+48 bytes smem
ptxas info : Compiling entry function ‘__globfunc__Z17knl_sbicgupdate_sPfS_S_fi’
ptxas info : Used 4 registers, 48+44 bytes smem

Kernel declarations:
__global__ void knl_dbicgupdate_s(double alphad, double *s, double *r, double *v, int N) {…}
__global__ void knl_sbicgupdate_s(float *s, float *r, float *v, float alphaf, int N) {…}

If I interpret the ptxas output correctly, these are ordinary Itanium C++ manglings with a __globfunc prefix (c++filt decodes them once the prefix is stripped):

__globfunc__Z17knl_dbicgupdate_sdPdS_S_i
decodes to d (double), Pd (pointer to double), S_ (substitution, i.e. the double* seen before), S_ (double* again), i (int),

__globfunc__Z17knl_sbicgupdate_sPfS_S_fi
decodes to Pf (pointer to float), S_ (float*), S_ (float*), f (float), i (int).

All makes sense. As kernel parameters are passed via smem, resource consumption drops by sizeof(double)-sizeof(float)=4 bytes for the scalar argument. Check.

Let’s look at another op from my codebase:

ptxas info : Compiling entry function ‘__globfunc__Z18knl_dbicgupdate_xrddPdS_S_S_S_S_i’
ptxas info : Used 5 registers, 84+80 bytes smem
ptxas info : Compiling entry function ‘__globfunc__Z18knl_sbicgupdate_xrPfS_S_fS_fS_S_i’
ptxas info : Used 5 registers, 84+80 bytes smem

__global__ void knl_dbicgupdate_xr(double alphad, double omegad, double *x, double *r, double *phat, double *shat, double *s, double *t, int N)
__global__ void knl_sbicgupdate_xr(float *x, float *r, float *phat, float alphaf, float *shat, float omegaf, float *s, float *t, int N)

According to my ptxas interpretation above, everything matches.

Why the heck is there no difference in smem consumption, even though two scalars are now float instead of double??
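For what it's worth, here is one guess that at least matches the numbers (a hypothetical host-side model only, assuming the -cuda stubs on a 64-bit host pack pointer arguments into 8-byte, 8-byte-aligned slots): in the xr pair each float scalar is immediately followed by a pointer, so the 4 bytes saved per scalar are eaten by alignment padding, whereas in the first pair the scalar sits right before the trailing int and the saving survives.

#include <stdio.h>
#include <stddef.h>

/* Hypothetical mock-ups of the two xr parameter lists, laid out like plain C
   structs with 8-byte pointers (an assumption, not taken from nvcc output). */
struct params_dbicgupdate_xr {
    double alphad, omegad;
    double *x, *r, *phat, *shat, *s, *t;
    int N;
};

struct params_sbicgupdate_xr {
    float *x, *r, *phat;
    float  alphaf;   /* 4 bytes, but the next pointer needs 8-byte alignment */
    float *shat;
    float  omegaf;   /* same again: the saved 4 bytes turn into padding */
    float *s, *t;
    int    N;
};

int main(void)
{
    /* On an LP64 host both print 72: the scalar savings vanish in padding. */
    printf("double variant: %zu bytes\n", sizeof(struct params_dbicgupdate_xr));
    printf("float  variant: %zu bytes\n", sizeof(struct params_sbicgupdate_xr));
    return 0;
}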

This issue has weird consequences, btw: in my full-fledged code, cublasInit() consistently dies, regardless of whether I call it explicitly or implicitly (if I use CUBLAS without calling cublasInit(), gdb shows it being invoked on the first cublas*() call). The backtrace looks something like:

#0 0x00002b5b8bdef475 in ?? () from /usr/lib64/libcuda.so
#1 0x00002b5b8bdef9b2 in ?? () from /usr/lib64/libcuda.so
#2 0x00002b5b8bdef834 in ?? () from /usr/lib64/libcuda.so
#3 0x00002b5b8bd536c0 in ?? () from /usr/lib64/libcuda.so
#4 0x00002b5b8bd53795 in ?? () from /usr/lib64/libcuda.so
#5 0x00002b5b8bdef24b in ?? () from /usr/lib64/libcuda.so
#6 0x00002b5b8bd030bc in ?? () from /usr/lib64/libcuda.so
#7 0x00002b5b8bcf94ce in ?? () from /usr/lib64/libcuda.so
#8 0x00002b5b8bed879a in ?? () from /usr/lib64/libcuda.so
#9 0x00002b5b8bf08a42 in ?? () from /usr/lib64/libcuda.so
#10 0x00002b5b8bf18740 in gpudbg_rt_debugger_syscall () from /usr/lib64/libcuda.so
#11 0x00002b5b8bf33c06 in gpudbg_rt_debugger_syscall () from /usr/lib64/libcuda.so
#12 0x00002b5b8bf181a5 in gpudbg_rt_debugger_syscall () from /usr/lib64/libcuda.so
#13 0x00002b5b8bf052e3 in ?? () from /usr/lib64/libcuda.so
#14 0x00002b5b8bcbba87 in ?? () from /usr/lib64/libcuda.so
#15 0x00002b5b89a6f4df in cudaMemcpyFromSymbolAsync () from /usr/local/cuda/2.0/lib/libcudart.so.2
#16 0x00002b5b89a6d6d2 in cudaMemcpyFromSymbolAsync () from /usr/local/cuda/2.0/lib/libcudart.so.2
#17 0x00002b5b89a50add in cudaFree () from /usr/local/cuda/2.0/lib/libcudart.so.2
#18 0x00002b5b89c97e10 in cublasInitCtx () from /usr/local/cuda/2.0/lib/libcublas.so.2
#19 0x00002b5b89cf45b7 in cublasIdamin () from /usr/local/cuda/2.0/lib/libcublas.so.2
#20 0x00002b5b89c97f30 in cublasInit () from /usr/local/cuda/2.0/lib/libcublas.so.2
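For reference, the explicit call is nothing exotic; a minimal sketch of what I mean by initializing CUBLAS explicitly, using the legacy cublas.h API and a made-up wrapper name:

#include <stdio.h>
#include <stdlib.h>
#include <cublas.h>

/* Initialize CUBLAS up front instead of letting the first cublas*() call
   trigger cublasInit() implicitly; bail out on failure. */
static void coproc_cublas_startup(void)
{
    cublasStatus stat = cublasInit();
    if (stat != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasInit failed with status %d\n", (int)stat);
        exit(EXIT_FAILURE);
    }
}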

Things get worse if I compile to a fatbin, and exceptionally worse when I switch from Intel 10.1 to 11.0. I should mention that the whole thing is occasionally called from Fortran (the above issues are independent of the Fortran linking). None of this is explicitly forbidden by the nvcc docs.

What a waste of time. Someone please shed some light on nvcc -cuda (which, as things stand, I can't use in my apps), or on what I might be doing wrong. Boiling this down to a repro app will take a full day (I tried today and failed); I hope the above samples suffice.

dom

Intel’s compiler suite is not currently supported. Does this problem reproduce with gcc?

OpenSuSE 10.2’s gcc works fine, but due to some confirmed bugs in its Fortran compiler, gfortran, it does not compile my main app. I need CUDA support for at least gcc 4.3.2 to compile the whole show with gcc, or support for icc/ifort 9.1.x, 10.1.x, or 11.x.

Sigh. Guess it’s time to drop all the nifty CUDA features in my code (a previous version went through an nvdeveloper bug report without any mention of Intel issues, btw) and start hacking workarounds. Dropping support for fatbins helps a lot for starters; dropping CUBLAS almost does the trick.

Back in the GPGPU-via-OpenGL days, life was much easier. Even though CUDA (see my NVISION presentation) is much faster, it hardly seems worth it at this stage. Sigh.

Sorry for the rant last night. I was just annoyed by having to learn that the Intel compiler suite was unsupported, despite no complaints whatsoever on previous bug reports on nvdeveloper regarding code samples using the Intel compilers.

I revamped our Makefile infrastructure, and even when I compile everything remotely related to CUDA into a static lib with OpenSuSE’s gcc (“4.1.2 20061115 (prerelease) (SUSE Linux)”), I still get the same SIGFPE. Which actually makes sense: I only ruled out the Intel compiler to persuade myself that I am not paranoid. FPEs triggered from within CUDA are still caught by the Intel Fortran runtime, as my whole application is linked with Intel’s ifort. Reproducible with any Fortran runtime I could get my hands on: PGI 7.2.x, gfortran 4.3.x, g95 newer than 0.91. Now please don’t tell me that any of the above are unsupported, because CUDA should not depend on how it is called from applications…
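If the problem is really the Fortran runtime trapping an FPE raised inside libcuda, the obvious thing to try is masking FP exception trapping around the CUDA/CUBLAS initialization. A sketch only, assuming glibc’s non-standard feenableexcept/fedisableexcept from <fenv.h> (and a made-up function name):

#define _GNU_SOURCE
#include <fenv.h>
#include <cublas.h>

/* Temporarily disable the FPE traps enabled by the Fortran runtime, run the
   CUBLAS initialization, then restore the traps. Link with -lm. */
void coproc_init_cublas_without_fpe_traps(void)
{
    int traps = fegetexcept();          /* traps currently enabled by ifort & co. */
    fedisableexcept(FE_ALL_EXCEPT);     /* keep SIGFPE from firing inside libcuda */
    cublasInit();
    if (traps >= 0)
        feenableexcept(traps);          /* restore the Fortran runtime's traps */
}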

Consequently, I claim the following:

  • creating fatbins AND
  • cublas AND
  • compiling with nvcc -cuda (even though it is not technically necessary when using the system gcc) AND
  • a reasonable number of kernels (sophistication does not seem to matter) AND
  • calling from Fortran (to enable catching of SIGFPEs I guess)
    yields a crash. Pretty much impossible to repro without my real code. In the short term I can probably live with messy make rules that statically determine the compute capabilities of the GPUs my code runs on (because dropping fatbin support fixes all problems immediately); a run-time alternative is sketched below.
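For reference, the compute capability can also be queried at run time rather than baked into make rules. A minimal sketch using the runtime API (made-up function name):

#include <stdio.h>
#include <cuda_runtime.h>

/* Query the compute capability of device 0 at run time, e.g. 10 for sm_10. */
int coproc_query_sm_version(void)
{
    struct cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return -1;
    printf("device 0: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    return 10 * prop.major + prop.minor;
}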

dom

I’m sorry to hear that you’re experiencing so many issues. We’ll need a means of reproducing this if you’d like it to be investigated further.

Lonni,

If I just unload my whole application on you (10K lines of C, 100K lines of Fortran) I’ll probably get laughed at, and rightly so :) I’m trying to boil this down (current status: 1k lines of C, 3 lines of Fortran), but as I have other research to do it might take a while. I’ll post something on nvdeveloper as soon as I’ve reached a minimalistic version. Please also note that this affects my feature request #486551 on nvdeveloper.

dom