I’m porting a program to CUDA that uses long arrays of 40-bit (5-byte) words.
x86 handles 64-bit reads/writes at a 5-byte stride without difficulty, and with
no measurable performance penalty. To perform these accesses in CUDA, I have for now
resorted to the following pair of routines:
__device__ void write40(u8 *p64, const u64 x) {
  u32 off = (u64)p64 & 3;                 // byte offset from 4-byte alignment
  u32 *a = (u32 *)(p64 - off);            // aligned 8-byte window covering the 5-byte word
  const u64 y = (u64)a[1] << 32 | a[0];   // current contents of the window
  int s = 8 * off;
  u64 mask = 0xffffffffffULL << s;        // 40-bit field at bit offset s
  const u64 z = (y & ~mask) | ((x << s) & mask); // splice x in; masking guards stray high bits
  a[0] = (u32)z;
  a[1] = (u32)(z >> 32);
}
__device__ u64 read40(const u8 *p64) {
  u32 off = (u64)p64 & 3;                 // byte offset from 4-byte alignment
  const u32 *a = (const u32 *)(p64 - off);
  const u32 lo = a[0];
  const u32 hi = a[1];
  return ((((u64)hi << 32) | lo) >> (8 * off)) & 0xffffffffffULL; // keep only the 40-bit word
}
Not being a CUDA expert, I wonder to what extent these routines can be optimized.
I also wonder how much slower they are than aligned 64-bit reads/writes.
Please let me know if you can shed light on these questions.
regards,
-John