"cudaMemcpyToArray" and Pointer Type Conversion

I am loading my data on the host side into a ushort array, as that is how the data is formatted, and accessing it on the device as ushort4 to make better use of memory bandwidth, which speeds up the kernel.

However, it seems that when I copy the data from host to device using “cudaMemcpyToArray”, CUDA is doing some sort of conversion on it. That is, I index through all of the data but only get 1 out of every 4 elements.

I assume I am missing something implicit, so any direction about where to read or what to look at would be appreciated.

CUDA arrays use an opaque internal storage format which is different from simple linear storage. This means, at least as far as I know, that a CUDA array of ushort elements is laid out differently from a CUDA array of ushort4 elements. In order to load ushort4 later, you need to set up the array accordingly when you allocate it, by passing a cudaChannelFormatDesc that describes four 16-bit unsigned channels. Depending on what your app is doing, you may not need to use CUDA arrays at all, at least not for performance reasons.
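To make the channel format point concrete, here is a minimal sketch, not your actual code: it allocates a 2D CUDA array with a ushort4 channel descriptor, copies packed host ushort data into it, and reads it back through a texture object. The helper name roundTrip, the kernel name readTexels, and the 4 x 2 size are made up for illustration, and it uses cudaMemcpy2DToArray rather than cudaMemcpyToArray, which is deprecated in recent toolkits.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Bail out of the enclosing function with -1 on any CUDA error.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            return -1;                                                  \
        }                                                               \
    } while (0)

// Fetch every ushort4 texel through the texture and store it linearly.
__global__ void readTexels(cudaTextureObject_t tex, ushort4* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // +0.5f addresses the texel center with unnormalized coordinates.
        out[y * width + x] = tex2D<ushort4>(tex, x + 0.5f, y + 0.5f);
    }
}

// Hypothetical helper: width and height are counted in ushort4 texels.
// Returns the number of texels that did not survive the round trip, or -1 on error.
int roundTrip(int width, int height)
{
    // Host data stays a flat ushort buffer: 4 values per texel.
    std::vector<unsigned short> host(static_cast<size_t>(width) * height * 4);
    for (size_t i = 0; i < host.size(); ++i) host[i] = static_cast<unsigned short>(i);

    // The array is declared as ushort4: four 16-bit unsigned channels.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<ushort4>();
    cudaArray_t arr = nullptr;
    CHECK(cudaMallocArray(&arr, &desc, width, height));

    // Row width and pitch are in bytes, so they use sizeof(ushort4), not sizeof(unsigned short).
    const size_t rowBytes = width * sizeof(ushort4);
    CHECK(cudaMemcpy2DToArray(arr, 0, 0, host.data(), rowBytes, rowBytes, height,
                              cudaMemcpyHostToDevice));

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;      // no interpolation
    td.readMode = cudaReadModeElementType;    // return raw integer values
    td.normalizedCoords = 0;

    cudaTextureObject_t tex = 0;
    CHECK(cudaCreateTextureObject(&tex, &res, &td, nullptr));

    ushort4* dOut = nullptr;
    CHECK(cudaMalloc(&dOut, width * height * sizeof(ushort4)));

    dim3 block(8, 8);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    readTexels<<<grid, block>>>(tex, dOut, width, height);
    CHECK(cudaGetLastError());

    std::vector<ushort4> result(static_cast<size_t>(width) * height);
    CHECK(cudaMemcpy(result.data(), dOut, result.size() * sizeof(ushort4),
                     cudaMemcpyDeviceToHost));

    int mismatches = 0;
    for (size_t i = 0; i < result.size(); ++i) {
        if (result[i].x != host[4 * i]     || result[i].y != host[4 * i + 1] ||
            result[i].z != host[4 * i + 2] || result[i].w != host[4 * i + 3]) {
            ++mismatches;
        }
    }

    CHECK(cudaDestroyTextureObject(tex));
    CHECK(cudaFreeArray(arr));
    CHECK(cudaFree(dOut));
    return mismatches;
}

int main()
{
    printf("mismatching texels: %d\n", roundTrip(4, 2));
    return 0;
}
```

If the array were instead created with cudaCreateChannelDesc<unsigned short>(), each element would hold a single 16-bit channel, and the width and byte counts passed to the copy would have to be expressed in those units. A mismatch between the declared element type and the type used for reading may be what produces the "1 out of every 4" symptom, though I cannot tell without seeing the allocation code.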

Note also that on the GPU, all accesses must be naturally aligned, that is, the address must be a multiple of the access size. While ushort requires only two-byte alignment, ushort4 requires eight-byte alignment, so a ushort pointer can only be reinterpreted as a ushort4 pointer if its address is a multiple of eight.
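The alignment requirement can be checked before casting a pointer. The following is a small sketch under my own naming (isAlignedForUshort4 is a hypothetical helper, not a CUDA API); it relies only on the alignment that the CUDA headers declare for ushort4.

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// The CUDA vector type carries its own alignment; plain unsigned short does not need it.
static_assert(alignof(unsigned short) == 2, "unsigned short is 2-byte aligned");
static_assert(alignof(ushort4) == 8, "ushort4 is 8-byte aligned");
static_assert(sizeof(ushort4) == 4 * sizeof(unsigned short), "ushort4 packs four ushorts");

// Hypothetical helper: true if p can be reinterpreted as a ushort4 pointer.
__host__ __device__ inline bool isAlignedForUshort4(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignof(ushort4) == 0;
}

int main()
{
    unsigned short* d = nullptr;
    if (cudaMalloc(&d, 64 * sizeof(unsigned short)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    // The base pointer from cudaMalloc is suitably aligned; an offset of one
    // ushort (2 bytes) is not, while an offset of four ushorts (8 bytes) is.
    printf("base: %d, base+1: %d, base+4: %d\n",
           isAlignedForUshort4(d), isAlignedForUshort4(d + 1), isAlignedForUshort4(d + 4));
    cudaFree(d);
    return 0;
}
```

In practice this matters when the ushort4 view starts at an offset into a larger ushort buffer: the offset has to be a multiple of four elements.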