Huge performance difference depending on the machine I put my card in


I’m working on a CUDA port of an OpenCL miner for a new crypto ‘currency’ (Ethereum). It already works great on my primary dev machine (Win7-64, Xeon E5, GTX780), and some others have run successful tests on various hardware too. But now I have this Win 8 home system with a GTX750Ti, and it will only hash at about 1/16th of the speed that others with a GTX750Ti have reported. Even stranger, when I take the video card out and put it in the Xeon workstation, it suddenly hashes at full speed.

So I thought perhaps the CPU or RAM of my Win 8 system was slowing things down, so I replaced the Celeron G1840 with a Core i5-4570 and doubled the RAM to 8GB, but the performance is still bad. Uninstalling drivers, CUDA, etc. didn’t help either.

Also, common graphics and compute benchmarks report normal figures; it’s only my miner that suffers from slowness. It’s also not my build environment, because when I build binaries on the workstation and use them on the home system, I get the same low performance.

The only cause I can think of is some runtime DLL on the home system bogging things down, but which one I don’t know. And how to solve it? If you want to have a look at the source look here:


My only guess is that you’re running a Debug build at home.

Also make sure you’re targeting sm_50 on the GTX 750 Ti machine.

Sounds to me like the PCI Express bandwidth is not set correctly in the BIOS?

Thanks for your replies.

  • the Debug build runs even slower, so that’s not it
  • I’m sure I target sm_50. If I specifically target sm_35, the kernel doesn’t even run
  • nvprof does tell me the memory bandwidth is not good, but when I check that using a different tool, it reports about 11GB/s in both directions, so I guess it’s fine. Also, the algo isn’t very PCIe transfer heavy: it copies about 1GB to the card before hashing starts, but only a few bytes per kernel launch.

CUDA-Z output:

Memory Copy
	Host Pinned to Device: 11.0665 GiB/s
	Host Pageable to Device: 3724.82 MiB/s
	Device to Host Pinned: 11.2835 GiB/s
	Device to Host Pageable: 4787.63 MiB/s
	Device to Device: 32.6689 GiB/s
GPU Core Performance
	Single-precision Float: 1102.17 Gflop/s
	Double-precision Float: 47.5718 Gflop/s
	32-bit Integer: 319.027 Giop/s
	24-bit Integer: 238.472 Giop/s
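As a rough check on the claim that PCIe transfers are not the bottleneck, the one-time upload can be estimated from the CUDA-Z figures above. This is only a sketch: it assumes "about 1GB" means 1 GiB, and `transfer_seconds` is a made-up helper name.

```cpp
#include <cassert>

// Seconds needed to move `bytes` at a rate of `gib_per_s` GiB/s.
// Hypothetical helper for back-of-the-envelope estimates only.
double transfer_seconds(double bytes, double gib_per_s) {
    assert(gib_per_s > 0.0);
    const double bytes_per_gib = 1024.0 * 1024.0 * 1024.0;
    return bytes / (gib_per_s * bytes_per_gib);
}
```

With the pinned host-to-device rate (11.0665 GiB/s) the upload takes roughly 0.09 s, and with the pageable rate (3724.82 MiB/s, about 3.64 GiB/s) roughly 0.27 s. Either way it is a one-off cost before hashing starts, so it should not explain a sustained 16x slowdown.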

Check for potentially forgotten environment settings such as CUDA_LAUNCH_BLOCKING or CUDA_PROFILE. Have you eliminated host activity as a bottleneck by checking with Windows Task Manager what tasks are running? You might want to profile the app on both machines using the CUDA profiler to see whether there is a massive slowdown in any particular kernel. I do not know how you time the execution; could there be a bug of some sort in the benchmarking infrastructure? (You could do a quick sanity check by comparing to wall-clock elapsed time.)
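One way to check for leftover settings from inside the application itself is to look the variables up at startup. The sketch below uses only standard C++; the function names are hypothetical and the variable list is just the two names mentioned above.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Returns those entries of `names` that are currently set in the
// process environment (with any value, including an empty one).
std::vector<std::string> set_variables(const std::vector<std::string>& names) {
    std::vector<std::string> found;
    for (const std::string& name : names) {
        assert(!name.empty());
        if (std::getenv(name.c_str()) != nullptr) {
            found.push_back(name);
        }
    }
    return found;
}

// Hypothetical convenience wrapper for the two settings named in this thread.
std::vector<std::string> leftover_cuda_settings() {
    return set_variables({"CUDA_LAUNCH_BLOCKING", "CUDA_PROFILE"});
}
```

Logging the result of `leftover_cuda_settings()` on both machines would show quickly whether the slow system has one of these set and the fast one does not.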

Thanks @njuffa I’ll look into those. It might be useful to add that I experience a similar slowdown using the original opencl kernel. Anyway I took the card to work today to do some profiling. Also thinking about installing Linux to compare performance there.

Have you compared the card’s power and compute usage under the ‘slow’ hashing with ‘normal’ hashing? Maybe there’re some insights to be gained that way.

Speaking of power: Could there be an issue with either power supply or cooling that keeps the GPU in a low-power, low-performance throttled state? Have you had a chance to do a full cross check by moving the GTX 780 into the machine at home? If the issue is local to the machine, that card should see a slowdown as well.

@njuffa, the GTX 750 Ti doesn’t have an external power connector (except for a couple of highly OC’d SKUs).

@Genoil, as @njuffa suggests, just run "nvprof " to obtain the true kernel run times. This will confirm whether or not you have a measurement issue.
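Independently of the profiler, the wall-clock sanity check mentioned earlier in the thread could look something like the following sketch. The names are hypothetical and the code is plain host C++; when timing a GPU kernel, the callable would have to include a synchronizing call such as `cudaDeviceSynchronize()`, because kernel launches return before the kernel has finished.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Wall-clock microseconds spent inside `work()`.
// For GPU work, `work` must block until the device is done.
template <typename Work>
std::int64_t wall_clock_us(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return static_cast<std::int64_t>(
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count());
}

// True when the time reported by the benchmark is within `tolerance`
// (a fraction, e.g. 0.05 for 5%) of the wall-clock measurement.
bool timings_agree(std::int64_t reported_us, std::int64_t wall_us, double tolerance) {
    assert(wall_us > 0 && tolerance >= 0.0);
    const double diff = static_cast<double>(reported_us) - static_cast<double>(wall_us);
    const double abs_diff = diff < 0.0 ? -diff : diff;
    return abs_diff <= tolerance * static_cast<double>(wall_us);
}
```

If the miner's own figure and the wall-clock figure disagree badly, the problem is in the measurement rather than in the kernel.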

I assume you’re not seeing any massive “spills” to local memory? If you haven’t already, enable verbose output (-Xptxas=-v) and look closely at the result.

Are you running the same CUDA Toolkit on both machines?

Also, I’m not a fan of “-maxrregcount”. I would suggest declaring kernel launch bounds instead.

Your CUDA-Z stats look fine and match my GTX 750 Ti:

[ PCIe 3.0 x8 slot / 1320 MHz boost ]

The benchmarking seems all right. On the slow machine, I have a ~500,000us kernel execution time that does 262,144 hashes. The reported hashrate is ~524,288H/s. I didn’t manage to get good reports on the fast machine, other than that it hashes at the 8MH/s I saw before and that others with the same card have reported. So I’ll have to take the card back to work tomorrow.
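The arithmetic behind that consistency check is simple enough to write down. A minimal sketch, with `hash_rate` as a made-up helper name:

```cpp
#include <cassert>

// Hashes per second for `hashes` completed in `kernel_us` microseconds.
double hash_rate(double hashes, double kernel_us) {
    assert(kernel_us > 0.0);
    return hashes * 1e6 / kernel_us;
}
```

Here `hash_rate(262144, 500000)` gives 524,288 H/s, which matches the rate the miner reports, so the reported figure and the profiled kernel time agree with each other.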

Unfortunately I can’t simply take the 780 home and swap it out, because I don’t have an adequate power supply for it and the fast machine’s power supply won’t fit in my home system either.

GPU-Z reports normal operation while benchmarking: 1188 MHz, 100% GPU load, PerfCap Reason VOp.

With regards to maxregcount, it did give me an advantage on the GTX780. My kernel uses 73 registers on Compute 3.5, so limiting it to 72 yielded better results. On the GTX750Ti however, I found out that this setting didn’t work, as Compute 5.0 seems to need 80 registers for my kernel, leading to too much spillage when limiting it to 72. I’ll have a look at launch bounds as well, but first I have to solve this problem…

Back at work, the same kernel runs in 33,000us. The most notable differences between the two reports I ran is that the slow kernel has:

  • hardly any Eligible Warps per SM: 0.14 vs 2.99. Active Warps per SM (24) and Occupancy (23.9%) are the same
  • therefore a much higher share (97% vs 54%) of “No eligible Warps” in Warp Issue Efficiency
  • caused by a higher share of Memory Dependencies (98% vs 63%). The shares of the other Issue Stall Reasons are more or less proportional to each other between the two.
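To put the two profiles side by side, the ratios can be computed directly from the numbers quoted above. This is only a sketch with hypothetical names; it says nothing about the cause, it just relates the figures.

```cpp
#include <cassert>

// The two figures taken from each profiler report (hypothetical struct).
struct Profile {
    double kernel_us;       // kernel execution time in microseconds
    double eligible_warps;  // eligible warps per SM
};

// How many times longer the slow kernel runs than the fast one.
double slowdown_factor(const Profile& slow, const Profile& fast) {
    assert(slow.kernel_us > 0.0 && fast.kernel_us > 0.0);
    return slow.kernel_us / fast.kernel_us;
}

// How many times more eligible warps per SM the fast run has.
double eligible_warp_ratio(const Profile& slow, const Profile& fast) {
    assert(slow.eligible_warps > 0.0 && fast.eligible_warps > 0.0);
    return fast.eligible_warps / slow.eligible_warps;
}
```

With `Profile{500000, 0.14}` for the slow run and `Profile{33000, 2.99}` for the fast run, the slowdown comes out at roughly 15x, in line with the "1/16th" from the first post, while the eligible-warp ratio is roughly 21x.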

Then of course the reported memory bandwidth is much different between the two on all levels, but so are the reported iops. So I don’t know if I can really draw conclusions from that. If you would like to have a look at the reports:

These differences in statistics pertain to a controlled experiment where you are running the exact same executable (that is, copied, not recompiled) on the same GTX 750, but plugged into different computers, correct?

If so, I am stumped. There has to be a rational explanation, though. Does this app use any kind of online compilation from either PTX or CUDA source code?

The stats are not from the same executables, but they were built from the same source on the two systems. As you can see in the stats, the number and distribution of the instructions are exactly the same. Earlier, I ran binaries built on the fast system on the slow system, resulting in the same slow performance, so I don’t think it’s in the compiler. Apart from the hardware (the GTX750Ti is the same, though), the runtime environment is quite different as well. GeForce drivers and CUDA version are the same, but the OS is different (Win 7 Ent-64 on the fast vs Win 8.1 Pro-64 on the slow). I’m going to try installing Linux with CUDA on a USB stick to rule out any weird hardware issues.

My guess is that I’ll end up wiping the slow machine to get rid of the issue, but I hope not…

Windows has power/throttling options… perhaps it’s throttling the GPU to save power/heat… might want to check that out.

Bumping this thread up with some new findings. My miner is now being used by quite a few people, and performance-wise, I can subdivide them in two groups: those with Win8 / Win10 (low performance) and those with Win7 or Ubuntu (good performance).

What’s also interesting, is that users with GTX9x0 cards and Win8/10 don’t experience the problem.

Another finding is that an alternative kernel written in OpenCL has the same issue.

So this looks like a driver issue that is beyond my control. Is there anywhere I can address this with NVidia?

If you have a crisp repro case of both the slow and fast observations (i.e. you can duplicate the difference yourself in a defined series of steps) then I would encourage you to file a bug at

Best case would be to demonstrate something like 1/16 the performance on the same GPU type.

However, with only a set of “findings” or “reports”, it’s less likely that that would be an effective path to understanding what is going on.

I have more or less narrowed down the problem to a situation where I can run the algorithm in a simulation and a real mode, with the only (apparent) difference being that in simulation mode, the application generates random proof of work packages, whereas in real mode, I get them from a mining pool. The difference in hash rate should be zero, but the bug causes the simulation mode to run at full speed, and the real mode at about 50%.

Then I ran GPU-Z and this came up (left = real, right = simulation):

The main differences are in Memory Controller Load and Bus Interface Load. What does this mean? I would assume that for some reason, in the “real” scenario, the kernel is loading stuff from host RAM.

The 750Ti is running in headless mode; the primary adapter, a GTX780, still runs at full speed in both scenarios (memory controller load @ 67%, bus interface load @ 0%). On Linux, the 750Ti also runs at full speed in both modes.

if you have a real mode case, and can build a simulation mode case, then you should be able to build a hybrid case, i would think

with the hybrid case, you may attempt to ‘import’ and run the sections of code that the real mode has but the simulation mode does not, without actually relying on them
you still depend on the simulation mode, but may port sections of the real mode and run them as well - you simply discard their results, and rely on the results of the simulation case
this way you may be able to better ring fence the problem, by ring fencing the particular section of code it occurs in
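A host-side sketch of that hybrid approach might look like the code below. Everything in it is hypothetical: `Stage` stands for one section of real-mode code ported into the test, `simulation` is the existing simulation run, and the stage results are deliberately thrown away. Stages are switched on one at a time, and the simulation is timed after each addition.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// One section of real-mode code, run only for its side effects.
struct Stage {
    std::string name;
    std::function<void()> run;
};

// Simulation time observed after enabling stages up to `last_stage_added`.
struct Measurement {
    std::string last_stage_added;
    std::int64_t simulation_us;
};

// Enables real-mode stages cumulatively (none, first, first two, ...),
// runs them with results discarded, then times the simulation alone.
std::vector<Measurement> ring_fence(const std::vector<Stage>& stages,
                                    const std::function<void()>& simulation) {
    assert(simulation);
    std::vector<Measurement> out;
    for (std::size_t enabled = 0; enabled <= stages.size(); ++enabled) {
        for (std::size_t i = 0; i < enabled; ++i) {
            stages[i].run();
        }
        const auto start = std::chrono::steady_clock::now();
        simulation();  // must block until the GPU work has finished
        const auto stop = std::chrono::steady_clock::now();
        const auto us = static_cast<std::int64_t>(
            std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count());
        const std::string label =
            enabled == 0 ? std::string("(none)") : stages[enabled - 1].name;
        out.push_back({label, us});
    }
    return out;
}
```

The first measurement whose simulation time jumps points at the stage that was added last, which is the section to look at more closely.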