I am loading my data on the host side into a ushort array, since that is how the data is formatted, and accessing it on the device as ushort4 to make better use of memory bandwidth, which speeds up the kernel.
However, when I copy the data from host to device using “cudaMemcpyToArray”, CUDA seems to be doing some sort of conversion on it. That is, I index through all of the data, but I only get 1 out of every 4 values.
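To make the symptom concrete, here is a host-only sketch in plain C++ (no GPU needed) of one way this layout could produce it. `UShort4` is a hypothetical stand-in for CUDA's `ushort4`, the function names are made up for illustration, and the idea that only the `.x` component of each element is being read is an assumption, not something I have confirmed in my kernel.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for CUDA's ushort4: four packed 16-bit components.
struct UShort4 {
    std::uint16_t x, y, z, w;
};
static_assert(sizeof(UShort4) == 4 * sizeof(std::uint16_t),
              "UShort4 is expected to have no padding");

// Reinterpret a flat ushort buffer as packed 4-component elements.
// N ushort values become N/4 UShort4 elements; the byte count is unchanged.
std::vector<UShort4> pack(const std::vector<std::uint16_t>& flat) {
    assert(flat.size() % 4 == 0);
    std::vector<UShort4> packed(flat.size() / 4);
    std::memcpy(packed.data(), flat.data(), packed.size() * sizeof(UShort4));
    return packed;
}

// Assumed mistake: one fetch per element, keeping only the .x component.
// This returns every 4th value of the original flat buffer.
std::vector<std::uint16_t> read_x_only(const std::vector<UShort4>& packed) {
    std::vector<std::uint16_t> out;
    for (const UShort4& e : packed) {
        out.push_back(e.x);
    }
    return out;
}

// Reading all four components of each element recovers the full buffer.
std::vector<std::uint16_t> read_all(const std::vector<UShort4>& packed) {
    std::vector<std::uint16_t> out;
    for (const UShort4& e : packed) {
        out.push_back(e.x);
        out.push_back(e.y);
        out.push_back(e.z);
        out.push_back(e.w);
    }
    return out;
}
```

With 8 input values, `pack` yields 2 elements, so a loop that reads only `.x` sees the values at flat indices 0 and 4, while `read_all` returns all 8. If that matches what is happening, the copy itself would not be converting anything.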
I assume I am missing something implicit, so any pointers on where to read or what to look at would be appreciated.