Huge performance difference depending on the machine I put my card in

Hi,

I’m working on a CUDA port of an OpenCL miner for a new crypto ‘currency’ (Ethereum). It already works great on my primary dev machine (Win7-64, Xeon E5, GTX 780), and others have reported positive tests on various hardware too. But now I have this Win 8 home system with a GTX 750 Ti, and it only hashes at about 1/16th of the speed that others with a GTX 750 Ti have reported. Even stranger: when I take the video card out and put it in the Xeon workstation, it suddenly hashes at full speed.

So I thought perhaps the CPU or RAM of my Win 8 machine was slowing things down, so I replaced the Celeron G1840 with a Core i5-4570 and doubled the RAM to 8GB, but the performance is still bad. Uninstalling and reinstalling drivers, CUDA, etc. doesn’t help either.

Common graphics and compute benchmarks also report normal figures; it’s only my miner that suffers from slowness. It’s not my build environment either, because when I build binaries on the workstation and run them on the home system, I get the same low performance.

The only cause I can think of is some runtime DLL on the home system bogging things down, but I don’t know which one, or how to solve it. If you want to have a look at the source, it’s here:

https://github.com/Genoil/cpp-ethereum/tree/cudaminer/libethash-cu

Thanks

My only guess is that you’re running a Debug build at home.

Also make sure you’re targeting sm_50 on the GTX 750 Ti machine.
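For reference, an explicit sm_50 target on the nvcc command line would look something like this (flags only; file names here are placeholders, not from the actual project):

```shell
# Build real sm_50 SASS for the GTX 750 Ti plus embedded PTX as a
# forward-compatible fallback; a binary built only for sm_35 would
# fail to launch on a Maxwell part, as described below.
nvcc -gencode arch=compute_50,code=sm_50 \
     -gencode arch=compute_50,code=compute_50 \
     -o miner kernel.cu
```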

Sounds to me like the PCI Express bandwidth is not set correctly in the BIOS?

Thanks for your replies.

  • the Debug build runs even slower, so that’s not it
  • i’m sure i target sm_50; if i specifically target sm_35, the kernel doesn’t even run
  • nvprof does tell me the memory bandwidth is not good, but when i check that with a different tool, it reports about 11GB/s in both directions, so I guess it’s fine. Also, the algo isn’t very PCIe-transfer heavy: it copies about 1GB to the card before hashing starts, but only a few bytes per kernel launch.

CUDA-Z output:

Memory Copy
	Host Pinned to Device: 11.0665 GiB/s
	Host Pageable to Device: 3724.82 MiB/s
	Device to Host Pinned: 11.2835 GiB/s
	Device to Host Pageable: 4787.63 MiB/s
	Device to Device: 32.6689 GiB/s
GPU Core Performance
	Single-precision Float: 1102.17 Gflop/s
	Double-precision Float: 47.5718 Gflop/s
	32-bit Integer: 319.027 Giop/s
	24-bit Integer: 238.472 Giop/s

Check for potentially forgotten environment settings such as CUDA_LAUNCH_BLOCKING or CUDA_PROFILE. Have you eliminated host activity as a bottleneck by checking with Windows Task Manager what tasks are running? You might want to profile the app on both machines using the CUDA profiler to see whether there is massive slowdown in any particular kernel. I do not know how you time the execution, could there be a bug of some sort in the benchmarking infrastructure (you could do a quick sanity check by comparing to wall clock elapsed).
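A quick host-side way to dump those variables (a sketch; the helper name is mine, and the two variables are the ones mentioned above):

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>

// Report a CUDA-related environment variable as "NAME = value" or
// "NAME = (not set)". CUDA_LAUNCH_BLOCKING=1 serializes every kernel
// launch, and CUDA_PROFILE enables the legacy profiler, so either one
// left over from an earlier debugging session can cost real throughput.
std::string cudaEnvReport(const char* name) {
    const char* v = std::getenv(name);
    return std::string(name) + " = " + (v ? v : "(not set)");
}
```

Printing `cudaEnvReport("CUDA_LAUNCH_BLOCKING")` and `cudaEnvReport("CUDA_PROFILE")` on both machines makes it easy to spot a forgotten setting on just one of them.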

Thanks @njuffa, I’ll look into those. It might be useful to add that I experience a similar slowdown with the original OpenCL kernel. Anyway, I took the card to work today to do some profiling. I’m also thinking about installing Linux to compare performance there.

Have you compared the card’s power draw and compute utilization under the ‘slow’ hashing with ‘normal’ hashing? Maybe there are some insights to be gained that way.

Speaking of power: could there be an issue with either the power supply or cooling that keeps the GPU in a low-power, low-performance throttled state? Have you had a chance to do a full cross-check by moving the GTX 780 into the machine at home? If the issue is local to that machine, the GTX 780 should see the slowdown as well.

@njuffa, the GTX 750 Ti doesn’t have an external power connector (except for a couple of highly OC’d SKUs).

@Genoil, as @njuffa suggests, just run your miner under nvprof to obtain the true kernel run times. This will confirm whether or not you have a measurement issue.

I assume you’re not seeing any massive “spills” to local memory? If you haven’t already, enable verbose output (-Xptxas=-v) and look closely at the result.
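For reference, the verbose ptxas output has roughly this shape (the kernel name and cmem figure here are illustrative; the register count echoes the sm_50 number mentioned later in the thread):

```
ptxas info    : Compiling entry function 'searchKernel' for 'sm_50'
ptxas info    : Function properties for searchKernel
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 80 registers, 336 bytes cmem[0]
```

Non-zero “spill stores” / “spill loads” is the red flag: those accesses go through local memory, which lives in device DRAM.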

Are you running the same CUDA Toolkit on both machines?

Also, I’m not a fan of “-maxrregcount”. I would suggest declaring kernel launch bounds instead.
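A minimal sketch of what that looks like (the kernel name, signature, and bounds here are hypothetical, not taken from the miner source):

```cuda
#include <cstdint>

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// gives ptxas a per-kernel register budget, so the sm_35 build and the
// sm_50 build can each be tuned without a global -maxrregcount flag
// that over-constrains one of them.
__global__ void
__launch_bounds__(128 /* max threads per block */, 4 /* min blocks per SM */)
searchKernel(const uint32_t* __restrict__ dag, uint32_t* __restrict__ result)
{
    const unsigned gid = blockIdx.x * blockDim.x + threadIdx.x;
    result[gid] = dag[gid]; // placeholder body
}
```

Unlike -maxrregcount, the bound is attached to one kernel and recomputed per architecture, which matters here since the kernel needs 73 registers on compute 3.5 but ~80 on compute 5.0.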

Your CUDA-Z stats look fine and match my GTX 750 Ti:

[ PCIe 3.0 x8 slot / 1320 MHz boost ]

The benchmarking seems all right. On the slow machine, I see a kernel execution time of ~500,000us for 262,144 hashes, and the reported hashrate is ~524,288H/s, which matches. I didn’t manage to get good reports on the fast machine, other than that it hashes at the 8MH/s as before and as reported by others with the same card. So I’ll have to take the card back to work tomorrow.
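That cross-check can be done off-GPU with a one-line helper (hypothetical, not from the miner source), feeding in the hash count and the profiled kernel time:

```cpp
#include <cassert>

// Derive the hashrate implied by a profiled kernel launch: hashes per
// launch divided by the kernel time (nvprof reports microseconds).
double hashrate_hs(double hashesPerLaunch, double kernelTimeUs) {
    return hashesPerLaunch / (kernelTimeUs / 1e6); // hashes per second
}
```

Plugging in the numbers above: 262,144 hashes over ~500,000us gives ~524,288 H/s, so the miner’s own benchmark agrees with the profiler, and the 1/16th slowdown is real kernel time, not a measurement bug.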

Unfortunately I can’t simply take the 780 home and swap it out, because I don’t have an adequate power supply for it and the fast machine’s power supply won’t fit in my home system either.

GPU-Z reports normal operation while benchmarking: 1188MHz, 100% GPU load, PerfCap Reason VOp.

With regard to maxrregcount: it did give me an advantage on the GTX 780. My kernel uses 73 registers on compute 3.5, so limiting it to 72 yielded better results. On the GTX 750 Ti, however, I found that this setting didn’t work: compute 5.0 seems to need 80 registers for my kernel, so limiting it to 72 leads to too much spillage. I’ll have a look at launch bounds as well, but first I have to solve this problem…

Back at work, the same kernel runs in 33,000us. The most notable differences between the two reports I ran are that the slow kernel has:

  • hardly any Eligible Warps per SM: 0.14 vs 2.99. Active Warps per SM (24) and Occupancy (23.9%) are the same
  • consequently, a much higher share (97% vs 54%) of “No Eligible Warps” in Warp Issue Efficiency
  • caused by a higher share of Memory Dependency stalls (98% vs 63%). The shares of the other Issue Stall Reasons are more or less proportional between the two.

Then of course the reported memory bandwidth differs greatly between the two at all levels, but so do the reported iops, so I don’t know if I can really draw conclusions from that. If you would like to have a look at the reports: http://we.tl/hBLbOBhnI9

These differences in statistics pertain to a controlled experiment where you are running the exact same executable (that is, copied, not recompiled) on the same GTX 750, but plugged into different computers, correct?

If so, I am stumped. There has to be a rational explanation, though. Does this app use any kind of online compilation from either PTX or CUDA source code?

The stats are not from the same executable, but they were built from the same source on the two systems. As you can see in the stats, the number and distribution of the instructions are exactly the same. Earlier, I ran binaries built on the fast system on the slow system, resulting in the same slow performance, so I don’t think it’s the compiler. Apart from the hardware (though the GTX 750 Ti is the same), the runtime environment is quite different as well: the GeForce driver and CUDA version are the same, but the OS differs (Win 7 Ent-64 on the fast machine vs Win 8.1 Pro-64 on the slow one). I’m going to try installing Linux with CUDA on a USB stick to rule out any weird hardware issues.

My guess is that I’ll end up wiping the slow machine to get rid of the issue, but I hope not…

Windows has power/throttling options… perhaps it’s throttling the GPU to save power/heat. You might want to check that out.

Bumping this thread up with some new findings. My miner is now being used by quite a few people, and performance-wise I can subdivide them into two groups: those with Win8 / Win10 (low performance) and those with Win7 or Ubuntu (good performance).

What’s also interesting is that users with GTX 9x0 cards on Win8/10 don’t experience the problem.

Another finding is that an alternative kernel written in OpenCL has the same issue.

So this looks like a driver issue that is beyond my control. Is there anywhere I can address this with NVidia?

If you have a crisp repro case of both the slow and fast observations (i.e. you can duplicate the difference yourself in a defined series of steps) then I would encourage you to file a bug at developer.nvidia.com.

Best case would be to demonstrate something like 1/16 the performance on the same GPU type.

However, with only a set of “findings” or “reports”, it’s less likely that that would be an effective path to understanding what is going on.

I have more or less narrowed the problem down to a situation where I can run the algorithm in a simulation mode and a real mode, with the only (apparent) difference being that in simulation mode the application generates random proof-of-work packages, whereas in real mode I get them from a mining pool. The difference in hashrate should be zero, but the bug causes simulation mode to run at full speed and real mode at about 50%.

Then I ran GPU-Z, and this came up (left = real, right = simulation):

The main differences are in Memory Controller Load and Bus Interface Load. What does this mean? I would assume that for some reason, in the “real” scenario, the kernel is loading data from host RAM.

The 750 Ti is running headless; the primary adapter, a GTX 780, still runs at full speed in both scenarios (Memory Controller Load @ 67%, Bus Interface Load @ 0%). On Linux, the 750 Ti also runs at full speed in both modes.

If you have a real-mode case and can build a simulation-mode case, then you should also be able to build a hybrid case, I would think.

In the hybrid case, you could ‘import’ and run the sections of code that real mode has but simulation mode does not, without actually relying on them: you still depend on the simulation mode, but also port and run sections of the real mode, simply discarding their results and relying on the results of the simulation case. This way you may be able to ring-fence the problem better, by ring-fencing the particular section of code it occurs in or with.