I’ve noticed that a significant bottleneck in my application is the CPU-side packing and unpacking of the data to be sent and received. The bottleneck is not the transfer over the PCIe bus, but the reformatting that takes place when copying between a CPU array and a GPU-formatted array held in pinned memory (from where it is then copied to / from GPU device memory).
I intend to accelerate my packing / unpacking routines by rewriting them using SSE intrinsics (float4 manipulation seems especially well suited to this). For this to work, however, I require the pinned memory to be 16-byte aligned. So my question is: is memory allocated using cudaMallocHost 16-byte aligned? If not, how can I ensure that it is?
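
To make the requirement concrete, here is a minimal sketch in plain C of the check I have in mind, together with a possible fallback. The helper names `is_aligned` and `align_up` are hypothetical (they are not part of the CUDA API), and the fallback assumes the buffer was over-allocated by at least `alignment - 1` bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if ptr is a multiple of `alignment`.
 * `alignment` must be a non-zero power of two (16 for SSE aligned loads). */
static inline int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Hypothetical helper: rounds ptr up to the next multiple of `alignment`.
 * Assumes the caller allocated at least `alignment - 1` extra bytes,
 * so the returned pointer still leaves room for the payload. */
static inline void *align_up(void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    uintptr_t addr = (uintptr_t)ptr;
    uintptr_t aligned = (addr + (alignment - 1)) & ~(uintptr_t)(alignment - 1);
    /* Offset from the original pointer rather than casting the integer back. */
    return (char *)ptr + (aligned - addr);
}
```

With cudaMallocHost, the idea would be to request `bytes + 15`, keep the original pointer for cudaFreeHost, and pass only the `align_up(ptr, 16)` result to the SSE routines. If `is_aligned(ptr, 16)` already holds for the pointer that cudaMallocHost returns, the padding is unnecessary, but I have not confirmed what the documentation guarantees.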