I’ve now managed to get my g++-compiled output running pretty much as fast as nvc++ output for general C++ (non-GPU) code, but I don’t seem to be able to do the same with nvcc.
The result is that the nvcc build now runs much slower than the g++ build (3982 ms vs 2579 ms).
Please see my (abridged) test output
======================================= Testing nvcc =======================================
nvcc -O3 --extended-lambda -I /opt/nvidia/hpc_sdk/Linux_x86_64/cuda/11.0/include -o main main.cpp
Running nvcc on main.cpp
100 iterations of 100000 samples with 100 inner loop iterations.
Elapsed time in nanoseconds : 3982115443 ns
Elapsed time in microseconds : 3982115 µs
Elapsed time in milliseconds : 3982 ms
Elapsed time in seconds : 3 sec
======================================= Testing g++ =======================================
g++ -Ofast -march=native -std=c++17 -Wall -Wextra -pedantic -o main_no_policy main.cpp
100 iterations of 100000 samples with 100 inner loop iterations.
Elapsed time in nanoseconds : 2579204732 ns
Elapsed time in microseconds : 2579204 µs
Elapsed time in milliseconds : 2579 ms
Elapsed time in seconds : 2 sec
The key here for g++ was the options -Ofast -march=native. I don’t seem to be able to specify the same options for nvcc. As I understand it, nvcc uses g++ to compile the host code, so it’s deeply frustrating not to be able to pass those options through to it.
Also, the descriptions of the optimization levels seem to be missing from both the nvcc documentation and the output of nvcc --help. How do I specify the host machine architecture to nvcc, and what are the valid options? In this case I need the native architecture of the machine doing the compilation (an Intel Coffee Lake, which supports AVX2, as every Intel core since Haswell does).
Using my two command lines …
g++ -Ofast -march=native
nvcc -O3 --extended-lambda
How do I combine the g++ options into the nvcc command and get it to work?
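My best guess so far is something along the following lines, using nvcc's -Xcompiler option to forward flags to the host compiler; I'm not certain this is the right mechanism (or that all the g++ flags are safe to forward this way), which is exactly what I'm asking:

```shell
# Hypothetical combined invocation (untested): nvcc compiles the device
# code itself and forwards the quoted options to the host g++ via
# -Xcompiler. -std=c++17 is given to nvcc directly, since nvcc has its
# own -std option and also needs to know the dialect.
nvcc -O3 -std=c++17 --extended-lambda \
     -Xcompiler "-Ofast -march=native -Wall -Wextra -pedantic" \
     -I /opt/nvidia/hpc_sdk/Linux_x86_64/cuda/11.0/include \
     -o main main.cpp
```

If -Xcompiler is indeed the intended pass-through, it would also be good to know whether options like -Ofast and -march=native affect only the host code or interact with the device-side compilation in any way.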