Google gpucc vs. Nvidia nvcc?

I am sure people have heard that Google created its own CUDA compiler, gpucc, and mentioned that it could outperform nvcc on some internal and external benchmarks, while also reducing compilation time significantly. See

Since Google will make it open source, will this attract compiler developers to contribute to it, and move application developers from Nvidia's nvcc to Google's gpucc?

I had not heard of this Google effort. Competition is usually beneficial to a market, so let's see what benefit (if any) CUDA programmers will be able to reap.

The slide deck you linked shows gpucc compilation times 8% faster on average than nvcc. Performance of the generated code is within +/-15% of code generated by nvcc and looks about the same on average. Given that NVIDIA’s CUDA compilers have long been based on open-source frameworks (first Open64, now LLVM), I do not perceive any compelling reason for people to jump to gpucc. Do you?

How will compiler developers react to the availability of gpucc? “Objection, Your Honor, calls for speculation!” In general, the world has a shortage of good compiler engineers. Most of them will be more than busy in their well-paying day jobs. As a result, I would not expect many to donate work to gpucc. So I would think the people working on gpucc will be mostly engineers paid by Google to do so. Google has deep pockets, and could support this project indefinitely.

What I’d be most interested in is a compiler that doesn’t take over a second to compile even the simplest .cu file. Does anyone know what nvcc is doing that takes so long? Is it just the overhead of LLVM? ptxas is plenty snappy and assembles on the millisecond time scale.

Fast compilation is important for dynamically generated code.

Ha, I used to have a kernel that would spend 33 minutes per architecture in ‘cicc’. This was CUDA 6.0.

CUDA 6.5 slashed that to 1 minute per arch.

Now it’s down to 15 seconds per arch in CUDA 7.5 but the troublesome kernels are much cleaner.


P.S. You can dump the timing of nvcc's internal phases to a .csv file like so:

nvcc -time time.csv
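As a sketch, the resulting CSV can be summarized per phase with a few lines of Python. Note the column names used here (`phase name`, `metric`) are an assumption for illustration; the exact headers emitted by `nvcc -time` vary by CUDA version, so adjust them to match your file:

```python
import csv
from collections import defaultdict

# Assumed layout: one row per compilation phase, with a phase-name
# column and a timing column in milliseconds. Adjust these two names
# to whatever headers your CUDA version actually emits.
PHASE_COL = "phase name"
TIME_COL = "metric"

def summarize(path):
    """Aggregate per-phase times from an nvcc -time CSV file."""
    totals = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                totals[row[PHASE_COL].strip()] += float(row[TIME_COL])
            except (KeyError, TypeError, ValueError):
                continue  # skip header repeats / rows without usable timing
    return dict(totals)

if __name__ == "__main__":
    # Print the slowest phases first, e.g. cicc vs. ptxas.
    for phase, ms in sorted(summarize("time.csv").items(),
                            key=lambda kv: kv[1], reverse=True):
        print(f"{phase:20s} {ms:10.1f} ms")
```

Sorting the aggregate makes it easy to confirm whether cicc really dominates the build, as discussed above.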

It’s been years since I looked at the details of compile times. From what I recall, the flow involves multiple separate applications invoked by nvcc, with a number of files being used for communication between components. That is likely responsible for basic overhead that dominates the compilation of short files.

While JIT compilation from PTX was part of the CUDA build flow from the beginning, JIT compilation from high-level source is a late addition, and the design goals are probably reflected in the speed of the respective compiler components. As Allan noted, ever since JIT compilation from HLL source was added as a feature, compile times have generally improved.

The default settings of the CUDA tool chain are for full optimization, which means there are a ton of optimization passes that run inside LLVM. There isn’t even a convenient -O[1|2|3] switch one can pass to the LLVM compiler, like one could with the old Open64 compiler, in order to turn off a bunch of these passes. Lowering the optimization level could improve compilation times quite a bit with Open64. Aggressive function inlining is one of the main culprits in ballooning the code size and thus compilation times.

If you are a user of the Intel compiler, you will be aware that builds at full optimization are pretty slow as well, although not quite as slow as with nvcc. There is no free lunch: highly optimized code comes at a cost in compile time.

In practical terms, CUDA programmers facing excessive build times should keep filing bugs/RFEs with NVIDIA, so development effort can be applied where most needed.

@njuffa, yes, I doubt people will jump to another compiler if the gains (in compilation time and performance) are minimal. However, in the gpucc slide deck, they claimed FFT and BLAS 1/2 could be improved by 21–91% (I assume this refers to performance, not compilation time). This seems quite attractive, since it gives a “free performance improvement” to end users.

Just for a single BLAS function, ?GEMM, alone, there are myriad different kernels: for small matrices, large matrices, tall skinny ones, squarish ones, with different transpose modes, for different architectures, and so on. Now multiply that by the 80 or so basic functions provided by BLAS and you have more pieces of code that need optimization than you could ever get to with a reasonably-sized library team.
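To make that combinatorics argument concrete, here is an illustrative back-of-the-envelope count. The specialization axes and their sizes below are made up for illustration, not the actual tuning dimensions inside CUBLAS or any real library:

```python
from itertools import product

# Hypothetical specialization axes for a single GEMM routine.
shape_classes = ["small", "large", "tall-skinny", "squarish"]
transpose_modes = ["NN", "NT", "TN", "TT"]
architectures = ["sm_35", "sm_52", "sm_60"]

gemm_variants = list(product(shape_classes, transpose_modes, architectures))
print(len(gemm_variants))        # 4 * 4 * 3 = 48 variants for one routine

# Multiply by the ~80 basic BLAS functions and the tuning surface explodes.
print(len(gemm_variants) * 80)   # 3840 kernels to hand-optimize
```

Even with these modest made-up axis sizes, the number of kernels to tune quickly exceeds what a small library team can exhaustively optimize.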

This means it is relatively easy to find BLAS functions that can benefit from further optimization. The literature is full of papers comparing a researcher’s latest effort with a vendor implementation of some BLAS function or other (any vendor library, not just CUBLAS). Often that kind of research does feed back into vendor libraries after some time. And only a fraction of the potential optimization work has to do with compiler optimizations; often it is about algorithmic changes that modify data-transfer patterns, for example.

I suspect things are similar for FFT libraries, but have not had significant personal exposure to that field.

I believe it will be big news if Google can release gpucc in March.
Google announced this open-source compiler at SC15, and I believe NVIDIA was likely partially involved in their research.

Has anyone here tried this on their code? It sounds like LLVM 3.9 is close to being released.