CUDA 7.5 on Maxwell 980Ti drops performance by 10x versus CUDA 7.0 and 6.5

Thanks njuffa for the helpful feedback. I will read the Best Practices Guide in more detail. The last time I read the guide carefully was about 5 years ago …

By disabling the offending if() block we found earlier, I was able to run the PC sampling profiler again, and I now see some new findings. I would like some help interpreting the assembly code.

Over the last week, I’ve implemented a new RNG (xorshift128+) in the hope of getting better speed, and I now see some different patterns in the PC sampling profiler output. Memory dependency, which previously accounted for only 2-3% of the latency, is now back in the picture, even though the overall running speed is pretty decent (24k photon/ms on 980Ti, higher than before).
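
For reference, the core update step of xorshift128+ is roughly the sketch below (this uses one published set of shift constants; the per-thread state setup and float conversion in the actual MCX kernel may differ):

// Sketch of the xorshift128+ update step (one published shift-constant set);
// per-thread state handling and float conversion in MCX may differ.
__device__ inline unsigned long long xorshift128p_next(unsigned long long s[2])
{
    unsigned long long x = s[0];
    const unsigned long long y = s[1];
    s[0] = y;
    x ^= x << 23;
    s[1] = x ^ y ^ (x >> 17) ^ (y >> 26);
    return s[1] + y;   // 64-bit output; multiply by 2^-64 to map to [0,1)
}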

I notice that almost 100% of this memory dependency comes from a single line of code (line#622),

if(idx1d!=idx1dold && idx1dold>0 && mediaidold){

which now accounts for 1/3 of the total run time. In the assembly, almost 100% of the memory dependency comes from this single instruction:

I2I.S32.S16 R57, R6;

I am attaching a screenshot of the PC sampling profiler output. The hotspots in both the source code (top-left) and the assembly code (top-right) are highlighted.

[attached screenshot: PC sampling profiler output with the source and SASS hotspots highlighted]

The variable mediaidold is a char (the label of the medium), read from the global memory array media on line #609. I suspect the I2I.S32.S16 instruction is part of retrieving the value of mediaidold? Is there a document where I can read more about these assembly instructions?

PS: I just changed line #609 from mediaidold=media[idx1d]; to mediaidold=mediaid; and MCX got a nice 40% speed improvement (jumping from 24k photon/ms to 34k photon/ms on Maxwell)! I guess this confirms my suspicion.

On the other hand, the improvement on Fermi and Kepler was not as exciting, only about 10%. In comparison, my 980Ti is about 10x faster than one core of a 590 (Fermi). I wish I could see what happened on the older GPU architectures. Unfortunately, the PC sampling profiler only runs on Maxwell.

I2I is an integer type conversion, the mnemonic means “integer to integer”. Here, it converts a signed 16-bit integer to a signed 32-bit integer. It makes sense that you would see this instruction as part of a ‘char’ to ‘int’ conversion. This is not a particularly slow instruction, but if the source data comes directly from a load instruction, it may be stalled due to memory dependency.

Best practice: every integer in a C/C++ program should be ‘int’, unless there is an exceedingly good reason for it to be of some other type.

C and C++ semantics require that in an expression, all integer data with a type narrower than ‘int’ is widened to ‘int’ before being incorporated into the computation. So use of narrow integer types can often decrease efficiency (the compiler may be able to work around some of that under the “as-if” rule).
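
As an illustration (the names below simply mimic the ones quoted in this thread, not the actual MCX source), a narrow type loaded from global memory needs an extra widening conversion before it can participate in an ‘int’-sized expression:

// Illustration only -- names mimic the thread, not the real MCX code.
__global__ void promotion_demo(const char *media, int *out, int idx1d, int idx1dold)
{
    char mediaidold = media[idx1dold];   // narrow (8-bit) load from global memory
    // Integer promotion widens mediaidold to 'int' before it is used below;
    // that widening (e.g. an I2I instruction) cannot issue until the load
    // completes, which the profiler reports as a memory-dependency stall.
    if (idx1d != idx1dold && idx1dold > 0 && mediaidold)
        out[idx1d] = mediaidold;
}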

I don’t see how a minor issue like this would contribute to 1/3 of the runtime; that could be an artifact of the sampling profiler, a common risk of using a sampling approach.

You may want to look into the general efficiency of your global memory accesses. txbob already pointed out their generally low efficiency, I think. The use of ‘const __restrict__’ pointer arguments may also allow for more aggressive re-ordering of loads, leading to better tolerance of high global memory access latency.
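
For example, a kernel prototype along these lines (just a sketch; the parameter names are made up, not MCX’s actual signature) gives the compiler that aliasing guarantee:

// Sketch only -- made-up parameter names, not MCX's actual prototype.
__global__ void simulate(const unsigned char * __restrict__ media,
                         const float * __restrict__ props,
                         float * __restrict__ field,
                         int nvox)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nvox) return;
    // 'const' + '__restrict__' tells the compiler the loads below cannot be
    // clobbered by the store to field[], so it may hoist/reorder them (and
    // route them through the read-only data cache on newer architectures).
    unsigned char m = media[i];
    field[i] += (m != 0) ? props[m] : 0.0f;
}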

Usually when I see “trivial” code changes:

mediaidold=media[idx1d]; to mediaidold=mediaid;

resulting in large speed improvements, I think about the effect on optimization. The classic example is when people try to debug/optimize by commenting things out. Eventually they comment out a “trivial” write to global memory and suddenly their function gets 1000x faster. “Why does this one line of code take 343ms ??” You can find questions like that all over the place.

So I haven’t studied this case, but I would also consider whether the code change in question allowed the compiler to optimize away some significant chunk of code that no longer has any impact on global state. For example, does this change eliminate the dependency on a previous computation involving either idx1d or media[idx1d]? If so, this code change could result in that section of code/computation being dropped/skipped. A contrived illustration of that effect (not your code) is shown below.
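
// Contrived example: if the final store is commented out, nothing in the loop
// affects global state, so the compiler can delete the entire loop.
__global__ void dce_demo(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < 1000; ++k)
        acc += sinf(in[i] + (float)k);   // "expensive" work
    out[i] = acc;   // remove this line and the kernel suddenly looks ~1000x faster
}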

For the original observation (the 10x perf difference from CUDA 6.5 to 7.5) discussed up through about comment 30 in this thread, the dev team seems to have narrowed it down to a particular compiler behavior. As njuffa previously surmised, the modification would be related to ptxas. I am not at liberty to describe it in detail at the moment, and confirmation (AFAIC) cannot be discussed until an actual updated driver appears (see below).

I can’t discuss schedule for an updated ptxas with the proposed change at this time.

With respect to the related components in the driver, a future version of the r361 driver branch may appear that incorporates the proposed change. The proposed change has already been tested internally and shown to restore the performance that is currently “lost” in CUDA 7.5.

Thus it may be possible to test a future r361 driver by eliminating the SASS portion of the fatbinary, and allowing JIT to create the necessary SASS.
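
For example (a sketch only; the file names are placeholders and the flags would need to match your build), compiling with a ‘compute_XX’ code target embeds PTX instead of SASS, so the driver’s JIT generates the machine code at load time:

# sketch only -- placeholder file names; this embeds PTX for sm_52 but no SASS,
# so the installed driver JIT-compiles the SASS when the program loads
nvcc -O3 -gencode arch=compute_52,code=compute_52 -o myapp myapp.cu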

I’ll update the thread when I have more details, but probably not until said r361 driver appears.

Thank you txbob for the update; I also appreciate the dev team’s effort to quickly identify and fix the issue. I look forward to the new driver appearing.

Just want to follow up on this previously reported issue and share some updates.

Recently, I purchased a GTX 1080 and upgraded the driver to 367.35. I also upgraded CUDA from 7.5 to 8.0. I have some very interesting findings.

First of all, with CUDA 8 + 367.35, the 10x slowdown due to that particular if() condition has disappeared. The code runs normally, at a speed comparable to our expectations.

The interesting part happened when we tried CUDA 7.5 + 367.35: we found that this combination gave us 25% higher speed than the CUDA 8 + 367.35 combination!

Here is my speed benchmark summary page:

http://mcx.space/gpubench/

Comparing #6 to #10, or #9 to #14 in the list, you can see a roughly 25% difference between the CUDA 7.5-generated binaries (#6/#9) and the CUDA 8-generated ones (#10/#14) running on the same card. This difference seems to be consistent on both Maxwell and Pascal.

If you want to reproduce the results, you can do:

git clone https://github.com/fangq/mcx.git
cd mcx/src
make clean
make                         # compiles mcx binary with your current cuda
cd ../example/benchmark
./run_benchmark1.sh

When you switch CUDA versions, you should be able to reproduce this difference.

Interestingly, CUDA 7.0 also produces speeds similar to CUDA 7.5; both are about 25% faster than CUDA 8.

In addition, this has no impact on the speed when running on Fermi or Kepler.

I am wondering if any of you can provide some insight into this issue.

thanks

It was expected that a newer driver would “fix” the previously reported 10x perf issue, as I reported in comment 44.

The observation about 25% difference in perf between CUDA 7.5 and CUDA 8.0 may be unrelated (probably unrelated IMO).

A similar approach to the one used previously is recommended:

  1. Use profiling to discover where the differences originate.

  2. Develop a simpler reproducer that shows the issue based on the localities identified in step 1.

  3. File a bug.

You can of course just skip to step 3 if you want, and/or folks here may want to chew on things to see if they have insights to share. Filing a bug with a complex test case (your whole project) in step 3 will most likely result in slower processing of the bug on the NVIDIA side, but that’s a rather subjective statement.