I saw that the CUDA 6 compiler was improved with respect to generating memory transfer instructions (noted in another posting), so I recompiled our code under CUDA 6 (previously CUDA 5). But I got a big slowdown on a rather simple kernel on a GRID K520 board.
4% slower on single precision
32% slower on double precision
Has anyone else seen this? The board is compute capability 3.0, on an Amazon AWS node running Windows Server 2008 R2 (similar to a Windows 7 desktop), with 64-bit code compiled in Visual Studio 2008.
I think it would be useful to compare actual kernel execution times before reaching the conclusion that the slowdown is due to different code generated by the newer compiler. It would also be useful to identify specifically which kernels are slower.
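In case it is helpful, a minimal way to time an individual kernel is with CUDA events, roughly as in the sketch below (kernel name, arguments, and launch configuration are placeholders; error checking omitted):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(d_in, d_out, n);   // placeholder kernel launch
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);      // elapsed GPU time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);

The Visual Profiler will also report per-kernel times directly, which makes it easy to compare the CUDA 5 and CUDA 6 builds side by side.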
Compilers are very complex pieces of software consisting of numerous transformation passes, most of them controlled by various heuristics. When changes are made to add new functionality or improve performance, there can be unforeseen interactions and side effects. As a consequence, the performance of some codes may decline. The goal is obviously to keep the fraction of affected codes as small as possible, i.e. changes should have much more upside than downside.
I don’t know whether your code uses any libraries, but similar processes apply there, i.e. changes may improve performance in most code contexts or for most argument combinations, but not all. The goal is to improve performance for the vast majority of known or anticipated use cases, but there may be some that, intentionally or unintentionally, lose some performance. The trade-offs can be complex. One common trade-off in libraries is the need for improved accuracy versus the need for improved performance, with the needs often differing by target market.
Lastly, there is always the possibility of unintentional changes in the form of performance bugs.
When significant performance drops are seen in user code (and a 32% performance drop is certainly significant), the best thing to do is to file a bug report, using the form linked from the registered developer site. Please attach a self-contained repro code. It is very helpful to reduce the repro code to the minimal amount necessary to still reproduce the issue.
Thanks, I am trying to run a profile on both CUDA 5 and CUDA 6 to see whether it’s the kernel causing the delay or not. There is only 1 kernel, and it does not use external libraries.
I am running 4 streams and doing async mem copies, so the added time may have something to do with those tasks instead.
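For context, the structure is roughly as sketched below (heavily simplified; buffer names, chunk sizes, and the kernel are placeholders, and the host buffers are pinned):

const int NSTREAMS = 4;
cudaStream_t streams[NSTREAMS];
for (int i = 0; i < NSTREAMS; ++i)
    cudaStreamCreate(&streams[i]);

for (int i = 0; i < NSTREAMS; ++i) {
    // copy this stream's chunk of pinned host memory to the device
    cudaMemcpyAsync(d_in + i * chunk, h_in + i * chunk,
                    chunk * sizeof(float), cudaMemcpyHostToDevice, streams[i]);
    // run the single kernel on that chunk
    myKernel<<<grid, block, 0, streams[i]>>>(d_in + i * chunk, d_out + i * chunk, chunk);
    // copy results back asynchronously in the same stream
    cudaMemcpyAsync(h_out + i * chunk, d_out + i * chunk,
                    chunk * sizeof(float), cudaMemcpyDeviceToHost, streams[i]);
}
cudaDeviceSynchronize();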
This is what I found: according to the Visual Profiler data, it does appear to be a slowdown of the one kernel itself under CUDA 6.
Also, the Visual Profiler showed that the CUDA 6 version was using 6 more registers (30 vs. 24). So maybe that’s causing the slowdown (e.g. “register pressure”? I’m not sure if that is the proper way to use the term). I will look through the code; maybe I have some unneeded local variables that aren’t getting optimized out under CUDA 6? Could that cause extra register use? Or maybe there is a way to set the maximum number of registers to be used… I thought I saw that somewhere in the docs.
Increased register pressure can cause lower performance via reduced occupancy (although, generally speaking, there is no strong correlation between occupancy and performance). Register-pressure-induced performance drops are fairly common on sm_2x devices, less common on sm_30 devices, and relatively uncommon on sm_35 and later architectures. This is due to architectural improvements: a larger register file, an increased number of registers per thread block, and a larger number of thread blocks simultaneously resident on an SM.
I would suggest filing a bug to get this sorted out.
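For completeness: if you want to experiment with register limits yourself, they can be capped per compilation unit with nvcc’s -maxrregcount option, or per kernel with the __launch_bounds__ qualifier. The sketch below uses made-up numbers and a placeholder kernel:

// file-wide cap for all kernels in this compilation unit:
//   nvcc -maxrregcount=24 ...
// per-kernel cap via launch bounds (max threads per block, min resident blocks per SM):
__global__ void __launch_bounds__(256, 4)
myKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];   // placeholder body
}

Note that the compiler may spill registers to local memory to honor such limits, which is often counterproductive.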
Update:
I found the maxrregcount flag and set it to 24 under CUDA 6 to match the CUDA 5 register count… verified via the Visual Profiler that the new limit is being used. However, the speed does not improve with the lower register count. I may be able to compare the PTX files and see if there is anything obviously different.
Frequently, imposing register limits on the compiler that are more than 1 or 2 registers below the number picked by the compiler itself does not improve execution time, because the compiler (1) tries to reduce register pressure by repeating previously performed computation, i.e. undoing CSE or strength reduction, and (2) if that does not reduce the register count enough, starts spilling registers to local memory. This usually increases execution times due to increases in dynamic instruction count and memory traffic.
Since PTX is just an intermediate representation, that is compiled rather than assembled, looking at PTX usually is not helpful to identify code generation issues. For that, one has to inspect the machine code (SASS) which can be extracted from the binary by use of cuobjdump --dump-sass.
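For example, assuming the application binaries are named app_cuda5.exe and app_cuda6.exe (placeholder names), the comparison could be as simple as:

cuobjdump --dump-sass app_cuda5.exe > sass_cuda5.txt
cuobjdump --dump-sass app_cuda6.exe > sass_cuda6.txt
fc sass_cuda5.txt sass_cuda6.txt

(fc is the Windows file-compare utility; use diff on Linux.)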
I think I have traced the problem to global memory loads. There are 2 within a tight loop, and if I get rid of those, the code speeds up tremendously, of course.
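To give an idea, the loop is roughly of this shape (heavily simplified; array and variable names are placeholders):

for (int k = 0; k < niter; ++k) {
    float a = g_tableA[base + k];   // global load #1
    float b = g_tableB[base + k];   // global load #2
    acc += a * b;                   // rest of the body is arithmetic on registers
}

Getting rid of the two loads (e.g. substituting constants for the test) is what gives the big speedup, so the loads, or how they are scheduled and cached, look like the hot spot.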
I found this option (see below) in the ptxas subtopic of the nvcc arguments, but there does not seem to be a description of what the choices do. I didn’t find anything in the PTX documentation on the site to indicate what the arguments to the PTX assembler command itself are, so maybe I am missing it.
Do you know where I can find the definition of these various cache options?
--def-load-cache ca cg cs lu cv (this options list was in an online book, not from NVIDIA, but NVIDIA’s current docs do list the --def-load-cache option)
See the PTX specification, section 8.7.6.1, “Cache Operators.” Except for rare cases, it is not something you would want to play with on modern architectures. The last time I used -dlcm was years ago. The default used to be .ca.
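If you do want to experiment with it anyway, the usual way to pass the setting through nvcc is via -Xptxas, e.g. (the build line is just an illustration):

nvcc -arch=sm_30 -Xptxas -dlcm=cg -o app kernel.cu

Here .ca caches global loads at all levels (the historical default) and .cg caches them in L2 only; the other operators (.cs, .lu, .cv) are described in the spec section mentioned above.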
Tight loops that dominate overall app-level performance can easily lead to significant performance differences due to code generation artifacts that are mostly harmless otherwise. I have run into several such cases in the past, one of which is still tracked in an open compiler bug that I have worked around through source code changes in the meantime.
I think you have applied enough due diligence at this point to proceed to filing a bug, so that is what I would suggest as the next step.
Thanks for the info. I’ll see about the bug report; that looks more involved since I have to prep a standalone test case. FYI, I tried those cache options to see what would happen, and they just made it a few seconds slower, nothing notable.
By the way, I see some other anecdotal reports of slowdown with CUDA 6, but this was on a Mac:
Standalone test cases are the only way to reproduce reports of performance drops. That is the first step that happens before the compiler team starts root cause analysis. Having filed bugs not only against the CUDA compiler but several other tool chains throughout my career, I understand that creating a cut-down repro code for a bug report requires some work. Thank you for your help.
With every CUDA release, among the millions of CUDA kernels out in the wild, there will be some that experience a performance drop. Generally speaking this happens with every tool chain that is still being actively maintained and improved, and is not an issue particular to CUDA. I tried to outline some of the reasons above. Each reported issue needs to be looked at individually: only after a root cause has been determined can one say with certainty that it is identical to, or different from, the root cause behind some other observation.