Implementation of a poorly-aligned-memory, on-device std::copy/memcpy-like function?

In some kernels I’m writing, I need to have many warps copy regions of memory from one place to another (global->shared, shared->global, texture->shared, etc.).

The thing is, the memory segments are not well-aligned:

  • The element type might be sub-4-bytes (making a simple out[i] = in[i] loop inappropriate)
  • The source and/or target segments of memory may not be aligned at the boundary of a memory transaction (and I certainly do not want to have 2 transactions for every 4-bytes-per-lane write of a warp)
  • The alignments of the source and the target vis-a-vis the transaction size may be different
  • (but I’m not interested in mis-alignment of individual elements with their own type size, i.e. if I copy 4-byte ints, I am assuming the address is divisible by 4 and I don’t have to use prmt or some such trickery.)

and this suggests I need a general-purpose (though templatized for efficiency) implementation of an std::copy() / memcpy() function. I don’t mean the runtime API’s whole-device memcpy, which is irrelevant in my case (I think).

Has something like that been implemented and released for free? I’d hate to reinvent the wheel here.

Use cudaMemcpy for global <-> global. It also works for a ‘char’ array (1 element = 1 byte), and nothing in the description of cudaMemcpy says that the ‘src’ and ‘dst’ pointers must be aligned to 4 bytes.

For load/store from global to shared memory, the CUB library’s ‘BlockLoad’ might be fine. I suppose it also works for all types (so also for ‘char’).

HannesF99: I’m talking about device-side code, not host-side. I need device-side code. As for block-load, I’m interested in warp-level code, and I also don’t need/want the block-level arrangement CUB uses. Just a straight copy.

In case you do need to (re-?) implement it yourself, have a look at the prmt PTX instruction.
It comes in handy for implementing misaligned memory reads, as it does most of the work apart from actually accessing memory.
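To make the prmt suggestion concrete, here is a minimal host-side C++ sketch of how a byte permute can assemble a misaligned 32-bit value from two aligned words. It models the 3-bit-selector behaviour of CUDA's __byte_perm(x, y, s) intrinsic and assumes little-endian byte order. The names byte_perm_model, misaligned_selector and load_misaligned_model are made up for illustration; in device code you would presumably call __byte_perm directly.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of __byte_perm(x, y, s): each of the four low nibbles
// of s picks one byte out of the 8-byte pool {y:x}, where x supplies
// bytes 0..3 and y supplies bytes 4..7. Only the low 3 bits of each
// nibble are used.
inline std::uint32_t byte_perm_model(std::uint32_t x, std::uint32_t y,
                                     std::uint32_t s) {
    const std::uint64_t pool = (std::uint64_t(y) << 32) | x;
    std::uint32_t result = 0;
    for (int i = 0; i < 4; ++i) {
        const unsigned sel = (s >> (4 * i)) & 0x7u;
        const std::uint32_t byte = std::uint32_t((pool >> (8 * sel)) & 0xFFu);
        result |= byte << (8 * i);
    }
    return result;
}

// Hypothetical helper: selector that picks bytes k, k+1, k+2, k+3
// of the pool, for a byte offset k in 0..3.
inline std::uint32_t misaligned_selector(unsigned k) {
    assert(k < 4);
    return 0x3210u + 0x1111u * k;
}

// Hypothetical helper: the 32-bit value that starts k bytes into the
// aligned word `lo` and spills into the next aligned word `hi`.
inline std::uint32_t load_misaligned_model(std::uint32_t lo, std::uint32_t hi,
                                           unsigned k) {
    return byte_perm_model(lo, hi, misaligned_selector(k));
}
```

For example, with lo = 0x33221100 and hi = 0x77665544, an offset of 1 selects bytes 1 to 4 of the pair. The two aligned loads are still up to the caller; the permute only does the recombination.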

Oh, I didn’t actually mean that kind of non-alignment. I meant that if you copy an array of bytes, it can start anywhere; but we would rather have each thread read and write 4 bytes at a time from/into 4-byte-aligned addresses.
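One way to picture the "4 bytes at a time from aligned addresses" idea is to split the byte range into an unaligned head, a body of whole aligned words, and a leftover tail. The sketch below is plain host-side C++ that only does the arithmetic; CopyPlan and plan_aligned_copy are hypothetical names, and it assumes a 4-byte word and that source and destination share the same misalignment (otherwise a byte shuffle is also needed).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical description of how a byte range splits around
// 4-byte-aligned addresses.
struct CopyPlan {
    std::size_t head_bytes;  // bytes before the first aligned address
    std::size_t body_words;  // whole aligned 4-byte words
    std::size_t tail_bytes;  // bytes after the last whole word
};

// Split `n` bytes starting at address `addr` into head, body and tail.
inline CopyPlan plan_aligned_copy(std::uintptr_t addr, std::size_t n) {
    // Distance from addr to the next 4-byte boundary (0 if already aligned).
    const std::size_t to_boundary = (4 - addr % 4) % 4;
    // A very short range may end before it reaches the boundary.
    const std::size_t head = to_boundary < n ? to_boundary : n;
    const std::size_t rest = n - head;
    return CopyPlan{head, rest / 4, rest % 4};
}
```

In a warp-level copy, lane i would then presumably take body words i, i + 32, i + 64 and so on, with a few lanes handling the head and tail bytes individually.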

That is what I am thinking about as well. Unless you want to copy a misaligned array onto another one with the same misalignment, you’ll have to shuffle bytes within 32-bit words somewhere, no?
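As a rough illustration of that shuffling, here is a host-side C++ model of a copy whose source and destination have different misalignments. Memory is modelled as arrays of 32-bit little-endian words, every full store goes to an aligned destination word, and each stored word is stitched together from two adjacent aligned source words. All names (get_byte, put_byte, extract_word, copy_shifted) are made up for this sketch, the loop is sequential rather than spread over the lanes of a warp, and the regions are assumed not to overlap. On the device, extract_word would presumably map to __funnelshift_r or __byte_perm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Words = std::vector<std::uint32_t>;

// Read one byte at byte position `pos` of a little-endian word array.
inline std::uint8_t get_byte(const Words& w, std::size_t pos) {
    return std::uint8_t((w[pos / 4] >> (8 * (pos % 4))) & 0xFFu);
}

// Overwrite one byte at byte position `pos`, leaving its neighbours intact.
inline void put_byte(Words& w, std::size_t pos, std::uint8_t v) {
    const unsigned sh = 8 * unsigned(pos % 4);
    w[pos / 4] = (w[pos / 4] & ~(std::uint32_t(0xFFu) << sh))
               | (std::uint32_t(v) << sh);
}

// Four bytes starting at byte k (0..3) of the 8-byte pair {hi:lo}.
inline std::uint32_t extract_word(std::uint32_t lo, std::uint32_t hi,
                                  unsigned k) {
    assert(k < 4);
    if (k == 0) return lo;  // avoids an undefined 32-bit shift
    return (lo >> (8 * k)) | (hi << (32 - 8 * k));
}

// Copy n bytes from byte offset src_off in `src` to dst_off in `dst`.
inline void copy_shifted(const Words& src, std::size_t src_off,
                         Words& dst, std::size_t dst_off, std::size_t n) {
    std::size_t i = 0;
    // Head: single bytes until the destination is word-aligned.
    while (i < n && (dst_off + i) % 4 != 0) {
        put_byte(dst, dst_off + i, get_byte(src, src_off + i));
        ++i;
    }
    // Body: one aligned store per word, fed by one or two aligned loads.
    while (n - i >= 4) {
        const std::size_t s = src_off + i;
        const unsigned k = unsigned(s % 4);
        const std::uint32_t lo = src[s / 4];
        // The second word is only touched when the read really spans it.
        const std::uint32_t hi = (k != 0) ? src[s / 4 + 1] : 0u;
        dst[(dst_off + i) / 4] = extract_word(lo, hi, k);
        i += 4;
    }
    // Tail: remaining single bytes.
    while (i < n) {
        put_byte(dst, dst_off + i, get_byte(src, src_off + i));
        ++i;
    }
}
```

When the two misalignments happen to match, k is 0 in the body and the shuffle degenerates to a plain word copy, which matches the point above that the extra work only appears when the misalignments differ.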

Maybe Thrust supports on-device copying?

It does, but its on-device copying is naive. It was a good guess on your part though. ModernGPU doesn’t have this either, nor does CUB (unless I’ve missed something).