GTX 1080: very bad results for mining

Why does the GTX 1080 give such bad results on the Dagger-Hashimoto algorithm, which puts a heavy load on memory? What is the problem, and how can it be fixed?
I have read everything I could find, but cannot find a solution.

What do you mean by “bad result”? What is “dagger-Hashimoto”? Since you mention “mining” I assume it is some sort of cryptocurrency thing.

If the problem is low performance AND this is a new algorithm, the reason is probably that the people who cook up new coin algorithms purposefully try their best to make them as inefficient for accelerators (GPUs and FPGAs) as possible. That game has been going on for years, I really wonder why anybody still cares. Last I checked there were already more than 500 different cryptocurrencies …

Look, I wrote a miner, a program that mines the cryptocurrency Ethereum using the Dagger-Hashimoto algorithm (ethash).
All my kernel code is written in PTX.
On the GTX 1070 I get 31 Mh/s, but on the GTX 1080 only 4 Mh/s.
On older architectures (Kepler, Maxwell) everything works fine.
As I understand it, the cause is the GDDR5X memory.
Maybe I need to initialize the GPU in some other way than for the old architectures,
or set a specific power state. I cannot figure out what the problem is.

When you compare the CUDA profiler outputs for the code running on GTX 1070 and GTX 1080, how are they different? It is entirely possible that the access patterns of your code interact differently with the GTX 1070 GDDR5 memory subsystem compared to the GTX 1080 GDDR5X memory subsystem. With the profiler it may be possible to narrow down relevant differences such as particular kinds of stalls.

You may wish to point people to your source code. Discussing the performance of code other people can inspect (or better yet, run) typically is much more fruitful.

The device code is written in PTX (not C), and the host code is written in PureBasic and uses the CUDA driver API (cuda.lib).
I do not think anyone can inspect the code with the profiler.
I do not use the L1 cache to load data:

ld.global.cg.v4.u32 	 {mem,mem2,mem3,mem4},[mem2];

Only 2 lines in my code work with memory;
the rest of the work is done in registers …
Not everything fits in registers, so 128 bytes are located in local memory.

Unless I am mistaken, you can at least profile the compiled code with nvprof; see the sub-heading ‘Use nvprof to Profile Anything’:
https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-nvprof-your-handy-universal-gpu-profiler/

At this point it is clear that for some CUDA applications this GDDR5x memory has problems.
NVIDIA currently has a myopic focus on ‘machine learning’ and if a bug (or set of bugs) does not impact that sub-set of their market they have little interest in fixing the problem.

Between the “not as advertised” memory bandwidth performance of the high-end Pascal GPUs ( 76% of theoretical global memory bandwidth for the Pascal Titan X vs. 91% of theoretical global memory bandwidth performance for the Maxwell Titan X), the very late release of the final version of CUDA 8.0 and the lack of support for Visual Studio 2015, their priorities are apparent. And on top of this they have been ‘blaming the user’ for any problems.

A while back I filed a bug with NVIDIA for the memory bandwidth issues. They did fix the big problem with random global memory reads for large buffers but the other more general memory bandwidth problems ‘are still under review’.

As annoying, troubling, and frustrating as the delays and bugs are to CUDA users, I do not see evidence that NVIDIA is ‘blaming the user’ for any of these problems.

The symptoms we all observe seem consistent with a company that has bitten off a bit too much, relative to the size of their engineering staff, and the lack of bug fixing (the mfaktc folks for example have been waiting for a fix to a compiler bug introduced with CUDA 7.0; it’s supposed to be fixed in CUDA 8.0 final) can easily be explained by a lack of engineering bandwidth, rather than a lack of good will.

As far as GDDR5X is concerned, it seems to me that this is technology that has been brought to market so rapidly that neither supplier(s) nor users have had adequate time to adjust and work out all the kinks. My expectation is that first-generation GDDR5X products will simply not reap the full benefits of this new memory type. There is only so much one can tweak by fiddling with memory controller settings in the VBIOS or driver.

According to Wikipedia, September 8 marks the one-year anniversary of CUDA 7.5, and I am still having some hope that NVIDIA will use that opportunity to finally ship the final version of CUDA 8.0.

That is exactly the issue. The algorithm uses random global memory reads.

But I know that some miner apps were able to fix it, and I cannot understand how, because in PTX I have no control over the memory access method; I can only choose whether or not to use the cache.
How can I solve this problem? I advised users to install driver versions below 350.xx and to use cuda.lib version 6.5,
but that has not brought results.

Hey Etar/Etayson,

No, neither Claymore nor myself were able to fix it using C++. The fix is in Pascal + WDDM2.1 + right drivers.

Going by the reports, it seems that the random read bug fix (more or less known over here as the TLB trashing issue) works better for GDDR5 than for GDDR5X. Or the people reporting such low numbers to you just haven’t got the right combination of OS/drivers.

But the GDDR5X issue is something different. Even on Linux, where the TLB trashing is not happening at current memory allocation sizes for the algorithm in question, we get only about 22MH/s hashrate, equivalent to about 55% of theoretical max. bandwidth, whereas the 1070 does about 27MH/s, which is about 84%.
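
The percentages quoted above can be roughly reproduced with a back-of-the-envelope calculation. This is only a sketch: it assumes each ethash hash performs 64 reads of 128 bytes from the DAG, and that the spec-sheet peak bandwidths are 256 GB/s for the GTX 1070 and 320 GB/s for the GTX 1080. Neither figure is stated in this thread, and the function name is made up for illustration.

```python
# Assumption: ethash reads 64 DAG pages of 128 bytes per hash.
BYTES_PER_HASH = 64 * 128  # 8192 bytes

# Assumption: spec-sheet peak bandwidths in GB/s (not taken from this thread).
PEAK_GBS = {"GTX 1070": 256.0, "GTX 1080": 320.0}


def ethash_bandwidth(hashrate_mhs, peak_gbs):
    """Return (GB/s of DAG traffic, fraction of peak bandwidth) for a hashrate in MH/s."""
    used_gbs = hashrate_mhs * 1e6 * BYTES_PER_HASH / 1e9
    return used_gbs, used_gbs / peak_gbs


# Hashrates reported above for Linux.
for card, mhs in (("GTX 1070", 27.0), ("GTX 1080", 22.0)):
    used, frac = ethash_bandwidth(mhs, PEAK_GBS[card])
    print(f"{card}: {used:.1f} GB/s, {frac:.0%} of peak")
```

Under these assumptions the 1070 comes out at roughly 86% of peak and the 1080 at roughly 56%, close to the 84% and 55% quoted above; the small gap presumably comes from slightly different hashrate or bandwidth figures.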

I did a bit of research into this a while ago and I can’t remember whom I was discussing this with back then, but at the time we were speculating about the GDDR5X being stuck in DDR mode instead of the new QDR, effectively halving the memory bandwidth. This was pretty much in line with the measurements.

Ultimately, there’s not so much you can do about this.

Thanks Genoil.
I had already been searching for every possible way to fix this. So it turns out the problem is not really in the miner, and one just needs to install the correct drivers.
Then, if it is not a secret, can you tell me the right combination for Win10?
I know that for Win7 the best driver is 369.05.

I ask because otherwise I have nothing to tell people. I do not have a Pascal-series video card on hand.

Don’t know either, haven’t purchased a Pascal card yet.

I looked into the 1080 GDDR5X performance on Ethereum / ethash last month without a satisfactory answer. Here’s some info.

Both 1070/1080 have 8 memory controllers (each is 32-bit wide = 4 bytes)

  • GDDR5 @ 2000 MHz - 8n prefetch (over 2 memory cycles) = 8*4 = 32 bytes in a 2 cycle transaction (or 2000*4 per cycle = 8 GHz)
  • GDDR5X @ 1251 MHz - 16n prefetch (over 2 memory cycles) = 16*4 = 64 bytes in a 2 cycle transaction (or 1251*8 per cycle = 10 GHz)

So GDDR5X runs slower but moves more per cycle, so on paper it should be a win. Assuming the rest of the system is optimized to handle…
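
The arithmetic in the two bullets can be checked with a few lines of Python. This is a sketch only: the 8 controllers of 4 bytes each, the clocks, and the x4 / x8 multipliers come from the post above, while the helper names are hypothetical.

```python
CONTROLLERS = 8            # from the post: 8 memory controllers
BYTES_PER_CONTROLLER = 4   # each 32 bits wide


def burst_bytes(prefetch_n):
    """Bytes one controller moves per burst for an 8n or 16n prefetch."""
    return prefetch_n * BYTES_PER_CONTROLLER


def peak_bandwidth_gbs(clock_mhz, transfers_per_clock):
    """Peak bandwidth in GB/s across all controllers."""
    mega_transfers = clock_mhz * transfers_per_clock   # MT/s per pin
    bus_bytes = CONTROLLERS * BYTES_PER_CONTROLLER     # 32 bytes = 256-bit bus
    return mega_transfers * bus_bytes / 1000


for name, clock, mult, prefetch in (("GDDR5", 2000, 4, 8), ("GDDR5X", 1251, 8, 16)):
    print(f"{name}: {burst_bytes(prefetch)} bytes per burst, "
          f"{clock * mult} MT/s, {peak_bandwidth_gbs(clock, mult):.1f} GB/s")
```

With these inputs GDDR5 works out to 32-byte bursts and 256 GB/s, and GDDR5X to 64-byte bursts and about 320 GB/s, which is the "win on paper" described above.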

I attempted to use Nsight to determine whether the 1080 was doing 16-access bursts, but the tools errored out. Has anyone looked into this? The Genoil/Allanmac memory access program should be a good proxy app for ethash
https://devtalk.nvidia.com/default/topic/878455/cuda-programming-and-performance/gtx750ti-and-buffers-gt-1gb-on-win7/post/4682735/#4682735

Details

Data - see Mem tab - https://docs.google.com/spreadsheets/d/1qFNXpVMxAPiwqI6RBs1b92yftqCe9Khix6xUdMFy5dQ/pubhtml
Notes on 1080 Ethereum mining issues and speculation & links to tech info - https://forum.ethereum.org/discussion/9277/1080-specific-ethereum-mining-issues

For CUDA purposes this GDDR5x memory does not perform as advertised, period.

It should be noted that the latest Tesla GPU models are NOT using this GDDR5x memory and are instead using the more reliable GDDR5 memory;

http://nvidianews.nvidia.com/news/new-nvidia-pascal-gpus-accelerate-deep-learning-inference

At least it seems NVIDIA is cutting their losses with this type of memory (I hope) and will move to HBM2 memory in the future. If there were a fix for the low memory bandwidth issues with CUDA and GDDR5x memory, it would have been released by now. Then again one would have expected CUDA 8 to be done by now as well.

At least the US DOE is going to be using Volta for the two new supercomputers currently being built;

https://www.olcf.ornl.gov/summit/

https://asc.llnl.gov/coral-info

NVIDIA can afford to disappoint the recreational CUDA users with GDDR5x memory, but they cannot get away with the same behavior with the DOE.

For the foreseeable future, I expect HBM2 to remain expensive specialty memory. The manufacturing is quite challenging, you have to thin the wafers and connect with through-silicon vias (TSV), and if anything goes wrong you just wasted some perfectly good dies (or should that be dice?). Lower yields, higher costs. Maybe the pricing would be OK for Titan, but I can’t imagine we’ll see HBM2 in consumer products in the next couple of years (famous last words :-)

GDDR5X looks like a good idea on paper. If the manufacturer can get it under control and GPU vendors can figure out how to adjust their memory controllers and interconnects, I think there is hope yet that it will fill the middle-performance slot (300-500 GB/sec) for GPU memory for a number of years. Everything about the GDDR5X introduction seemed rushed, even the JEDEC standard was produced in record time, from what I gather. Rushing new technology is rarely a good idea.

Yes, it appears the 1080 Ti and the new P4 and P40 cards will not use GDDR5X. I think NVIDIA knows they stuffed up and is going back to GDDR5 for the time being.

I am wondering, do the $1200 Titans also have this problem? Probably. So much for high end.

236 Pascal Paillier
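\[
\forall m_1, m_2 \in \mathbb{Z}_n \ \text{and} \ k \in \mathbb{N}
\]
\[
\begin{aligned}
d(e(m_1)\, e(m_2) \bmod n^2) &= m_1 + m_2 \bmod n \\
d(e(m)^k \bmod n^2) &= k m \bmod n \\
d(e(m_1)\, g^{m_2} \bmod n^2) &= m_1 + m_2 \bmod n \\
\left.\begin{array}{l}
d(e(m_1)^{m_2} \bmod n^2) \\
d(e(m_2)^{m_1} \bmod n^2)
\end{array}\right\} &= m_1 m_2 \bmod n .
\end{aligned}
\]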
These properties are known to be particularly appreciated in the design of voting
protocols, threshold cryptosystems, watermarking and secret sharing schemes,
to quote a few. Server-aided polynomial evaluation (see [27]) is another potential
field of application.
Self-Blinding. Any ciphertext can be publicly changed into another one without
affecting the plaintext :
\[
\forall m \in \mathbb{Z}_n \ \text{and} \ r \in \mathbb{N}: \quad
d(e(m)\, r^n \bmod n^2) = m \quad \text{or} \quad d(e(m)\, g^{nr} \bmod n^2) = m ,
\]
depending on which cryptosystem is considered. Such a property has potential
applications in a wide range of cryptographic settings.
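
The homomorphic and self-blinding properties listed above can be tried out on a toy instance. The sketch below is illustrative only: it assumes the common simplification g = n + 1, uses tiny primes (47 and 59) that give no security at all, and the names keygen, encrypt and decrypt are hypothetical helpers rather than anything defined in the excerpt.

```python
import math
import random


def keygen(p, q):
    """Toy key generation; assumes g = n + 1 (a common choice, not required)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # with g = n + 1, L(g^lam mod n^2) = lam mod n
    return (n, n + 1), (lam, mu)


def encrypt(pub, m, r=None):
    """e(m) = g^m * r^n mod n^2 with r coprime to n."""
    n, g = pub
    while r is None or math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, m, n * n) * pow(r, n, n * n) % (n * n)


def decrypt(pub, priv, c):
    """d(c) = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) / n."""
    n, _ = pub
    lam, mu = priv
    return (pow(c, lam, n * n) - 1) // n * mu % n


pub, priv = keygen(47, 59)  # toy primes, for illustration only
n, g = pub
n2 = n * n
m1, m2, k, r = 1234, 2000, 7, 5
c1, c2 = encrypt(pub, m1), encrypt(pub, m2)

assert decrypt(pub, priv, c1 * c2 % n2) == (m1 + m2) % n               # product of ciphertexts
assert decrypt(pub, priv, pow(c1, k, n2)) == k * m1 % n                # ciphertext to the k-th power
assert decrypt(pub, priv, c1 * pow(g, m2, n2) % n2) == (m1 + m2) % n   # multiply by g^m2
assert decrypt(pub, priv, pow(c1, m2, n2)) == m1 * m2 % n              # ciphertext to the m2-th power
assert decrypt(pub, priv, c1 * pow(r, n, n2) % n2) == m1               # self-blinding with r^n
assert decrypt(pub, priv, c1 * pow(g, n * r, n2) % n2) == m1           # self-blinding with g^(nr)
```

Each assertion corresponds to one of the identities above; the modular inverse via three-argument pow needs Python 3.8 or later.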

I get 25 Mh/s on my 1080 and 32 Mh/s on my 1070. On Windows 10. The 1070 has peaks of well over 40 Mh/s. :-)

Are there any fixes available for this issue?