nvcc -O3 problem

Hi, I have a question about the -O3 option.
I have a simple matrix multiplication code.
It contains a CUDA kernel for the matrix multiply and a CPU version of the same multiply.

If I compile it with nvcc -O3, it seems to optimize only the CUDA code, not the CPU code.

I say this because if I move the CPU code out into its own file and compile it with gcc -O3, the CPU matrix multiplication runs much faster than the same CPU code compiled inside the CUDA file.

Can anyone give me a hint on how to optimize the CPU code inside the .cu file?

I’m surprised, as -O3 should influence only the host code. Anyway, you can explicitly set the optimization level for the host code compilation with -Xcompiler -O3.

Thanks for your reply.

I tried this before and it doesn’t work: the performance doesn’t change.

If I put the CPU code and GPU code in a .cu file and use nvcc -O3 to compile,

the result I got:

CPU cost 2.4sec

GPU cost 0.05sec

If I copy CPU code out to a .c file and use gcc -O3 to compile,

the result I got:

CPU cost 0.55sec
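For the two CPU numbers to be comparable, the same timer should wrap the same loop in both builds. The thread does not say how the times were measured, so the following is only a minimal sketch, assuming a single-threaded host loop and using the standard C clock() function. The helper name cpu_now is hypothetical.

```c
#include <time.h>

/* Hypothetical helper: CPU time consumed by this process so far, in seconds.
   clock() is standard C, so the same source compiles unchanged in a .c file
   built with gcc and in the host part of a .cu file built with nvcc. */
static double cpu_now(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Intended use around the CPU matrix multiply:
 *
 *     double t0 = cpu_now();
 *     ... CPU matrix multiply ...
 *     double t1 = cpu_now();
 *     printf("CPU cost %.2fsec\n", t1 - t0);
 */
```

Because clock() reports processor time rather than wall-clock time, it suits a single-threaded CPU loop but would not be the right tool for timing the GPU kernel.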

It seems the performance hit is caused by something else. Is the gcc you are using directly the same version as the one called by nvcc? What are the outputs of gcc --version and nvcc -Xcompiler --version (with a dummy .cu file)?

In addition to the possible use of two different versions of gcc that tera mentions, there could be other reasons. In general, the host code that nvcc passes to the host compiler is not identical to the host code as written. I believe this has something to do with the way the code is parsed and separated into host and device portions. While I have seen some minor performance deviations from compiling host code with nvcc rather than with gcc directly, I am not aware of cases with significant performance differences. If that is a concern, I would suggest moving the host code in question into a separate .c file that is compiled directly with gcc.
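The separate-file suggestion above could look roughly like this. It is a minimal sketch, not the original poster's code: the file name cpu_matmul.c, the function name cpu_matmul, and the square row-major layout are all assumptions. The one detail that matters is the extern "C" guard, because nvcc compiles the host part of a .cu file as C++ and would otherwise look for a C++-mangled symbol at link time.

```c
/* cpu_matmul.c (hypothetical file name).
 * Assumed build steps, reusing the 2mm.cu name from this thread:
 *     gcc  -O3 -std=c99 -c cpu_matmul.c
 *     nvcc -O3 2mm.cu cpu_matmul.o -o 2mm
 */
#include <stddef.h>

/* Declaration that can also be copied into the .cu file. The guard gives the
   function C linkage when the declaration is seen by a C++ compiler. */
#ifdef __cplusplus
extern "C" {
#endif
void cpu_matmul(const float *a, const float *b, float *c, size_t n);
#ifdef __cplusplus
}
#endif

/* c = a * b for row-major n x n matrices, naive triple loop. */
void cpu_matmul(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

With this split, gcc sees exactly the source as written, so any remaining gap to the standalone timing would point away from nvcc's preprocessing of the host code.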

You can see the code that is actually passed to the host compiler by adding -v --keep to the nvcc invocation. This does a verbose build showing the invocation of each component called by nvcc (which is just a slim compiler driver program), so you can see which intermediate file is sent to gcc, and it keeps the intermediate files around at the end of the compilation instead of deleting them.

A -O{0|1|2|3} flag on the nvcc command line is passed through to the host compiler and is not used to control device code generation. Device code optimization is controlled via two separate component-level flags that default to -O3, and these settings should not normally be changed by the user. They are useful for experiments when researching potential compiler bugs, to narrow down where such bugs may be introduced.

lxu@aji:~/CUDA/C$ gcc --version
gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

lxu@aji:~/CUDA/2mm$ nvcc -Xcompiler --version 2mm.cu
gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Catastrophic error: cannot open source file "/tmp/tmpxft_00000ef3_00000000-4_2mm.cpp1.ii"
1 catastrophic error detected in the compilation of "/tmp/tmpxft_00000ef3_00000000-4_2mm.cpp1.ii".
Compilation terminated.

So I suppose both use the same gcc version.

I compiled the .cu file with -v --keep and got a lot of generated files. Could you please tell me which one is passed to gcc?

2mm.cpp1.ii 2mm.cpp3.i 2mm.cu 2mm.cudafe1.c 2mm.cudafe1.gpu 2mm.cudafe2.c 2mm.cudafe2.stub.c 2mm.fatbin 2mm.hash 2mm.ptx

2mm.cpp2.i 2mm.cpp4.ii 2mm.cu.cpp 2mm.cudafe1.cpp 2mm.cudafe1.stub.c 2mm.cudafe2.gpu 2mm.exe 2mm.fatbin.c 2mm.o 2mm.sm_10.cubin

I don’t know. As I recall, the way to find out is to do a verbose build with -v and check which file(s) get passed to the host compiler invocation(s).