Very slow host code when compiled through nvcc

I have discovered that in an application of mine, the host parts of the code are much slower when built with nvcc than when all CUDA code is disabled and the build is done purely with g++. I managed to shrink it down to an easily reproducible example (Eigen itself can be retrieved from libeigen / eigen · GitLab):

#include <iostream>
#include <Eigen/Dense>
#include <chrono>

using Duration = std::chrono::microseconds;
using Clock = std::chrono::system_clock;

int main() {
  auto tic = Clock::now();
  Eigen::MatrixXf a = Eigen::MatrixXf::Random(1024, 1024);
  Eigen::MatrixXf b = Eigen::MatrixXf::Random(1024, 1024);
  Eigen::MatrixXf c = Eigen::MatrixXf::Random(1024, 1024);

  for (size_t i = 0; i < 100; i++) {
    a.noalias() += b * c;
  }

  std::cout << a.sum() << std::endl; // use the result so the compiler doesn't optimize the whole computation away

  auto toc = Clock::now();
  std::cout << std::chrono::duration_cast<Duration>(toc - tic).count() << std::endl;

  return 0;
}
Note that this is pure C++ without cuda. Then I build this as follows:

g++ toy.cpp -std=c++17 -I eigen/ -Ofast -march=native -mtune=native -o toy-g++
nvcc -x cu toy.cpp -std=c++17 -I eigen/ -Xcompiler=-Ofast,-march=native,-mtune=native --expt-relaxed-constexpr -o toy-nvcc

Then I can just execute the two binaries to compare timings:

❯ ./toy-g++; ./toy-nvcc

and across multiple runs this kind of ~10x or greater difference is consistent.

  1. Is there something trivially obvious that I’m missing, like a missing optimization flag? I have tried a variety of flag combinations while inspecting nvcc with --verbose, without much luck; I also tried -O3 at the top level and various -Xlinker flags.
  2. If not, is this kind of difference expected?
  3. Obviously, when I run the exact same nvcc command as above without -x cu, the file is forwarded more directly to g++ and the two binaries are equally fast. But the actual application does contain CUDA code. Is it generally recommended to separate host and device logic into different source files / translation units, compiled separately by either nvcc or a host compiler (or nvcc without -x cu, I guess)?

Eigen is known to modify its behavior when it detects that the CUDA nvcc compiler is in use. I don’t know whether that applies here or not, but the difference you’re reporting is clearly tied to Eigen, IMO.

Two examples of differences in Eigen behavior are (1) structure alignment (reported in multiple places) and (2) EIGEN_DONT_VECTORIZE. I have no idea why the Eigen developers made these decisions, but you’re not going to work around them purely with nvcc command-line switches (excluding -D, of course; there may be some Eigen defines you can specify that would affect this).
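To make the mechanism concrete, here is a simplified, self-contained paraphrase of the kind of guard involved; this is NOT Eigen’s actual code, and EIGEN_DONT_VECTORIZE_DEMO is a made-up stand-in macro. The point is only that nvcc defines __CUDACC__ for the whole translation unit, so a header can key behavior off it:

#include <cstdio>

// Simplified paraphrase (NOT Eigen's actual logic): when nvcc is the
// compiler, __CUDACC__ is defined for the entire translation unit, so a
// library can decide to disable vectorization based on it alone.
#if defined(__CUDACC__) && !defined(EIGEN_NO_CUDA)
#define EIGEN_DONT_VECTORIZE_DEMO 1  // hypothetical stand-in macro
#else
#define EIGEN_DONT_VECTORIZE_DEMO 0
#endif

int main() {
  // Under plain g++ this prints 0; under nvcc -x cu it would print 1.
  std::printf("vectorization disabled: %d\n", EIGEN_DONT_VECTORIZE_DEMO);
  return 0;
}
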

I would say yes, I expect differences in behavior when eigen is used with CUDA. If you want maximum performance, separate the eigen performance-sensitive host code into .cpp files.
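As a sketch of what that partitioning could look like (the file names kernels.cu, hostmath.cpp, and main.cpp are hypothetical, and the exact flags and library paths depend on your installation):

# Device code goes through nvcc; Eigen-heavy host code goes through g++ only.
nvcc -c kernels.cu -o kernels.o
g++ -c hostmath.cpp -std=c++17 -I eigen/ -Ofast -march=native -mtune=native -o hostmath.o
# Link with the host compiler; -lcudart may also need e.g. -L/usr/local/cuda/lib64.
g++ main.cpp kernels.o hostmath.o -lcudart -o app
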

Thanks! I was thinking that the logic you mention would be contained within the “unsupported” portion of Eigen that works with GPUs, not in the regular Eigen/Dense headers; I guess that’s not true. I will look into this, thank you.

Edit: For the record, adding #undef __CUDACC__ at the top of the snippet above makes it equally fast under nvcc, so this does show that it’s Eigen modifying the behaviour, unless I’m missing something.

It’s a useful data point, thanks for adding your observation. For the benefit of other readers: please don’t assume that is a general workaround. This particular example has no “CUDA code” in it, so undefining that macro may be a non-issue here. But in the general case, where a .cu file contains CUDA-specific code, it’s possible that #undef __CUDACC__ is not a great idea. Even with respect to Eigen, casually modifying system defines like that could expose something unwanted. I don’t know enough about Eigen to say how and why they have firewalled CUDA the way they have; casual grep-ing of the Eigen source suggests they are working around “bugs” in nvcc, though I don’t happen to know what those are.

I think the “safe” thing to do is to partition the code as I suggested, rather than trying to force the two together. Alternatively, a lot more study than I have done so far would be in order.

Yes, for sure; thanks for pointing that out. There are apparently other ways to achieve it, like EIGEN_NO_CUDA, without meddling with system defines. This was the first thing I tried.
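For reference, that define can also be passed on the nvcc command line rather than edited into the source; this is just the earlier build line with -DEIGEN_NO_CUDA added (whether it fully restores host performance in a given setup would need to be verified):

nvcc -x cu toy.cpp -std=c++17 -I eigen/ -Xcompiler=-Ofast,-march=native,-mtune=native --expt-relaxed-constexpr -DEIGEN_NO_CUDA -o toy-nvcc
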

But the bigger picture is this: the more I look into Eigen, the more it seems they designed it on the assumption that using nvcc implies the logic will end up in device code. I’m assuming optimization decisions based on that assumption are hurting host code. This might be incorrect, since it comes from a cursory investigation, but it strongly argues for the separate compilation + linking approach.

There is probably some sensible logic to that; it makes sense to me, anyway. For example, if that assumption holds, they might be turning off “vectorization” (e.g. use of AVX, etc.) when __CUDACC__ is detected, and that would serve the purpose you describe.

However, there are better, well-established ways for developers to differentiate the behavior of a function called in host code from the behavior of the same function called in CUDA device code. Such methods don’t depend on detecting __CUDACC__.
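One such mechanism is __CUDA_ARCH__, which nvcc defines only while compiling the device pass, so a single __host__ __device__ function can carry separate host and device paths while the host path keeps full host-side optimization. A minimal sketch (the qualifier stubs at the top exist only so the same file also builds as plain C++, and scale() is a made-up example function):

#include <cstdio>

// Stub out the CUDA qualifiers when not compiling with nvcc, so this
// sketch also builds as ordinary C++.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// __CUDA_ARCH__ is defined only during nvcc's device-compilation pass,
// so the host pass is free to use host-only features (AVX, etc.).
__host__ __device__ inline float scale(float x) {
#if defined(__CUDA_ARCH__)
  return x * 2.0f;  // device path: simple scalar code for the GPU
#else
  return x * 2.0f;  // host path: could use AVX intrinsics here instead
#endif
}

int main() {
  std::printf("%.1f\n", scale(21.0f));  // prints 42.0 on the host
  return 0;
}
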

I’m really just speculating here. Your description is logical, but I don’t know if Eigen is doing that, exactly. And if they are, there are certainly better ways to do it: ways that would allow a routine executed in device code to behave sensibly while the host version of the same routine uses e.g. AVX “vectorization”.

I’m sure the devil is in the details. I doubt that Eigen developers are naive, or unfamiliar with CUDA.

Fair points; I’m also speculating here. I’m now taking a closer look at the different places where these flags appear.

In my experience, that really depends on the nature of the host code. In the past, I have seen functional issues with host code in .cu files; in particular, some SIMD intrinsics would not compile. In terms of performance, I observed that host code in a .cu file is not passed verbatim to the host compiler, and that the differences can cause the performance of host code in a .cu file to differ from that achieved when the same host code is moved into a separate file and passed directly to the host compiler. These performance differences were on the order of +/- 5 to 10%.

I do not know whether either of these previously observed issues still apply to the latest CUDA versions. My personal approach has been to keep host code in .cu files, splitting it into separate .cpp files only if I see evidence of a problem, and that has worked well for me for the vast majority of my code.

A defensive approach, especially for a large application where restructuring later on would be a major pain, is to keep the amount of host code in .cu files to a minimum from the start.
