Hi, I have a question about the -O3 option.
I have a simple matrix multiplication code.
It contains a CUDA kernel for the matrix multiply and a CPU version of the same multiply.
If I compile it with nvcc -O3, it seems to optimize only the CUDA code, not the CPU code.
I say this because if I move the CPU code out and compile it with gcc -O3, the CPU matrix multiplication takes much less time than it does when built as part of the CUDA code.
Can anyone give me a hint on how to optimize the CPU code inside the .cu file?
I’m surprised, as -O3 should influence only the host code. Anyway, you can explicitly set the optimization level for the host code compilation with -Xcompiler -O3.
It seems the performance hit is caused by something else. Is the gcc you are using directly the same version as the one called by nvcc? What are the outputs of gcc --version and nvcc -Xcompiler --version (with a dummy .cu file)?
In addition to the possible use of two different versions of gcc that tera mentions, there could be other reasons. In general, the host code that nvcc passes to the host compiler is not identical to the host code as written. I believe this has something to do with the way the code is parsed and separated into host and device portions. While I have seen some minor performance deviations from compiling host code with nvcc rather than through gcc directly, I am not aware of cases of significant performance differences. If there is concern about that, I would suggest moving the host code in question into a separate .c file compiled directly with gcc.
You can see the code that is actually passed to the host compiler by adding -v --keep to the nvcc invocation. This does a verbose build showing the invocation of each component called by nvcc (which is just a slim compiler driver program), so you can see which intermediate file is sent to gcc, and it keeps the intermediate files around at the end of the compilation instead of deleting them.
A -O{0|1|2|3} flag on the nvcc command line is passed through to the host compiler and is not used to control device code generation. Device code optimization is controlled via two separate component-level flags that default to -O3, and these settings should not normally be changed by the user. They are useful for experiments when investigating potential compiler bugs, to narrow down where such bugs may be introduced.