In some kernels I’m writing, I need to have many warps copy regions of memory from one place to another (global->shared, shared->global, texture->shared, etc.).
The thing is, the memory segments are not well-aligned:
- The element type might be sub-4-bytes (making a simple out[i] = in[i] loop inappropriate)
- The source and/or target segments of memory may not be aligned at the boundary of a memory transaction (and I certainly do not want to have 2 transactions for every 4-bytes-per-lane write of a warp)
- The alignments of the source and the target vis-a-vis the transaction size may be different
- (but I’m not interested in mis-alignment of individual elements with respect to their own type size, i.e. if I copy 4-byte ints, I am assuming the address is divisible by 4 and I don’t have to use prmt or some such trickery.)
All of this suggests I need a general-purpose (though templatized for efficiency) implementation of a warp-level std::copy() / memcpy() function. I don’t mean the runtime API’s device-wide memcpy, which is irrelevant in my case (I think).
Has something like that been implemented and released for free? I’d hate to reinvent the wheel here.