Hi,
lots of fun here with code compiled via nvcc -cuda and Intel as the host compiler. Excerpts from two of my kernels follow, compiled with:
nvcc -I/usr/local/cuda/2.0/include -g --host-compilation C --ptxas-options=-v -arch sm_10 -cuda -o coproc_axpy_cuda.cu.c coproc_axpy_cuda.cu
icc -g -Wall -strict-ansi -wd981 -I/usr/local/cuda/2.0/include -wd561 -malign-double -o object/coproc_axpy_cuda.o -c coproc_axpy_cuda.cu.c
Without fatbin support (aka -arch sm_10 on a G80):
ptxas info : Compiling entry function ‘__globfunc__Z17knl_dbicgupdate_sdPdS_S_i’
ptxas info : Used 4 registers, 52+48 bytes smem
ptxas info : Compiling entry function ‘__globfunc__Z17knl_sbicgupdate_sPfS_S_fi’
ptxas info : Used 4 registers, 48+44 bytes smem
Kernel declarations:
__global__ void knl_dbicgupdate_s(double alphad, double *s, double *r, double *v, int N) {…}
__global__ void knl_sbicgupdate_s(float *s, float *r, float *v, float alphaf, int N) {…}
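For anyone who wants to poke at the ptxas numbers without my codebase, a dummy body along these lines should do (illustrative sketch only, with made-up arithmetic; obviously not my actual code):

/* Dummy stand-in body: one thread per element, guarded against overrun.
   The arithmetic is made up; only the parameter list matters here. */
__global__ void knl_dbicgupdate_s(double alphad, double *s, double *r, double *v, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        s[i] = r[i] - alphad * v[i];
}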
If I read the mangled names correctly (standard Itanium C++ mangling behind the __globfunc_ prefix, where S_ is a substitution for the previously seen pointer type):
__globfunc__Z17knl_dbicgupdate_sdPdS_S_i
decodes as d (double), Pd (pointer to double), S_ S_ (two more double pointers), i (int), i.e. knl_dbicgupdate_s(double, double*, double*, double*, int), and
__globfunc__Z17knl_sbicgupdate_sPfS_S_fi
decodes as Pf (pointer to float), S_ S_ (two more float pointers), f (float), i (int), i.e. knl_sbicgupdate_s(float*, float*, float*, float, int).
That all matches the declarations above and makes sense: since kernel parameters are passed via smem, smem consumption drops by sizeof(double) - sizeof(float) = 4 bytes for the one scalar. Check.
Let’s look at another op from my codebase:
ptxas info : Compiling entry function ‘__globfunc__Z18knl_dbicgupdate_xrddPdS_S_S_S_S_i’
ptxas info : Used 5 registers, 84+80 bytes smem
ptxas info : Compiling entry function ‘__globfunc__Z18knl_sbicgupdate_xrPfS_S_fS_fS_S_i’
ptxas info : Used 5 registers, 84+80 bytes smem
__global__ void knl_dbicgupdate_xr(double alphad, double omegad, double *x, double *r, double *phat, double *shat, double *s, double *t, int N)
__global__ void knl_sbicgupdate_xr(float *x, float *r, float *phat, float alphaf, float *shat, float omegaf, float *s, float *t, int N)
According to my reading of the mangled names above, everything matches the declarations. So why the heck is there no difference in smem consumption, even though two scalars are now float instead of double??
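To make my expectation explicit, here is the back-of-the-envelope tally I'm doing, as a throwaway host-side snippet (my assumptions: 8-byte device pointers on this 64-bit build, no padding between parameters, and identical fixed header overhead on top of the parameters within each pair):

#include <stdio.h>

/* Naive parameter-block tally for both kernel pairs.
   Assumptions: 8-byte device pointers (64-bit build), no inter-parameter
   padding, identical header overhead within each pair. */
int main(void)
{
    unsigned long ds  = sizeof(double) + 3 * sizeof(double *) + sizeof(int);     /* knl_dbicgupdate_s  */
    unsigned long ss  = sizeof(float)  + 3 * sizeof(float *)  + sizeof(int);     /* knl_sbicgupdate_s  */
    unsigned long dxr = 2 * sizeof(double) + 6 * sizeof(double *) + sizeof(int); /* knl_dbicgupdate_xr */
    unsigned long sxr = 2 * sizeof(float)  + 6 * sizeof(float *)  + sizeof(int); /* knl_sbicgupdate_xr */
    printf("_s  pair: %lu vs %lu bytes (delta %lu)\n", ds,  ss,  ds  - ss);
    printf("_xr pair: %lu vs %lu bytes (delta %lu)\n", dxr, sxr, dxr - sxr);
    return 0;
}

On my box this prints 36 vs. 32 (delta 4) for the _s pair, which lines up with the 52+48 vs. 48+44 above, and 68 vs. 60 (delta 8) for the _xr pair, which is exactly the difference I'm missing in the identical 84+80 numbers.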
This issue has weird consequences, by the way: in my full-fledged code, cublasInit() consistently dies, regardless of whether I call it explicitly or implicitly (when I use CUBLAS without calling cublasInit(), gdb shows it being invoked on the first cublas*() call), with a backtrace like:
#0 0x00002b5b8bdef475 in ?? () from /usr/lib64/libcuda.so
#1 0x00002b5b8bdef9b2 in ?? () from /usr/lib64/libcuda.so
#2 0x00002b5b8bdef834 in ?? () from /usr/lib64/libcuda.so
#3 0x00002b5b8bd536c0 in ?? () from /usr/lib64/libcuda.so
#4 0x00002b5b8bd53795 in ?? () from /usr/lib64/libcuda.so
#5 0x00002b5b8bdef24b in ?? () from /usr/lib64/libcuda.so
#6 0x00002b5b8bd030bc in ?? () from /usr/lib64/libcuda.so
#7 0x00002b5b8bcf94ce in ?? () from /usr/lib64/libcuda.so
#8 0x00002b5b8bed879a in ?? () from /usr/lib64/libcuda.so
#9 0x00002b5b8bf08a42 in ?? () from /usr/lib64/libcuda.so
#10 0x00002b5b8bf18740 in gpudbg_rt_debugger_syscall () from /usr/lib64/libcuda.so
#11 0x00002b5b8bf33c06 in gpudbg_rt_debugger_syscall () from /usr/lib64/libcuda.so
#12 0x00002b5b8bf181a5 in gpudbg_rt_debugger_syscall () from /usr/lib64/libcuda.so
#13 0x00002b5b8bf052e3 in ?? () from /usr/lib64/libcuda.so
#14 0x00002b5b8bcbba87 in ?? () from /usr/lib64/libcuda.so
#15 0x00002b5b89a6f4df in cudaMemcpyFromSymbolAsync () from /usr/local/cuda/2.0/lib/libcudart.so.2
#16 0x00002b5b89a6d6d2 in cudaMemcpyFromSymbolAsync () from /usr/local/cuda/2.0/lib/libcudart.so.2
#17 0x00002b5b89a50add in cudaFree () from /usr/local/cuda/2.0/lib/libcudart.so.2
#18 0x00002b5b89c97e10 in cublasInitCtx () from /usr/local/cuda/2.0/lib/libcublas.so.2
#19 0x00002b5b89cf45b7 in cublasIdamin () from /usr/local/cuda/2.0/lib/libcublas.so.2
#20 0x00002b5b89c97f30 in cublasInit () from /usr/local/cuda/2.0/lib/libcublas.so.2
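For reference, the explicit initialisation path is nothing fancier than this (reduced sketch using the legacy cublasStatus API, not my actual code, which sits inside a larger Fortran-driven setup routine):

#include <stdio.h>
#include <cublas.h>

/* Reduced sketch of the explicit initialisation path -- not my real code. */
int main(void)
{
    cublasStatus stat = cublasInit();   /* dies in here, as in the backtrace above */
    if (stat != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasInit failed with status %d\n", (int)stat);
        return 1;
    }
    /* ... cublasIdamin() and friends would follow here ... */
    cublasShutdown();
    return 0;
}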
Things get worse if I compile to a fatbin, and exceptionally worse when I switch from Intel 10.1 to 11.0. I should mention that the whole thing is occasionally called from Fortran (the above issues are independent of the Fortran linking). None of this is explicitly forbidden by the nvcc docs.
What a waste of time. Could someone please shed some light on nvcc -cuda (which, as things stand, I can't use in my apps), or on what I might be doing wrong? Boiling this down to a repro app would take a full day (I tried today and failed); I hope the above samples suffice.
dom