A built in way to quickly convert three float arrays into a single float3 array

As input from host will be getting three float arrays of the same length (x,y,z) and want to use a single device array of float3 type (same length).

Is there any built in way to do this via some variant of cudaMemcpy? I can write a custom kernel, but want to make sure there is not already method to accomplish this in a few operations.

This should be doable with three 2D memcpy operations, where the source data is accessed in unit stride, while written to the destination with a stride of 3 elements (or vice-versa).

If you prefer not to puzzle out the exact memcpy invocations, you could use cublasGetVector / cublasSetVector which let you copy vectors with different source and destination strides, and which use 2D copies under the hood to do this.

I seem to recall the question came up before and I might have given a full code example that time, but I can’t say for sure. [Later:] Here is the relevant thread from Stackoverflow. It dealt with float2 instead of float3, but the idea is obviously the same and can be adjusted easily:

http://stackoverflow.com/questions/13535182/copying-data-to-cufftcomplex-data-struct/13536437#13536437

njuffa,

Thanks, will give that a try.

Using Memcpy2D did work to solve this issue, but for some reason it was faster to allocate additional x,y and z memory in the device, copy over the x,y,z from host, then use a custom kernel to fill in the values of the float3 array.

Not sure why because it seems that method would be slower, but there was about a 50 ms difference between the two for a data set of size apx 1,000,000.

Take a look at the plots in this paper that was posted yesterday: Data Transfer Matters for GPU Computing.

allanmac,

Yes I did see that and am trying to adapt it to my situation.

But since I copy from three separate device x,y, and z arrays to one device float3 array, I am not sure how to use the reinterpret_cast<> on the output float3 array.

Transfer times of 50 ms (milliseconds) seem unexpectedly high for array sizes on the order of one million elements. Does your code use pinned host memory and asynchronous copies to maximize the throughput?

I do not recall ever measuring the throughput of strided copies across PCIe, but off-hand I cannot see why it should be lower than for contiguous copies. While it wouldn’t surprise me if the strided writes on the device side were slower than non-strided writes, given the large disparity between device memory bandwidth and PCIe bandwidth, I would expect the latter to be the limiting factor even for the strided writes.

What kind of GPU is in your system, and what PCIe configuration does it use?

Tesla K20c which is running at PCI-e 2.0 x8 (half its potential speed to due motherboard splitting between two GPUs). Using the TCC driver.

host to device copies= 3,355 MiB/s
device to host copies= 3,109 MiB/s

above are more or less the same for pinned or pageable (pinned about 5% faster)

the 50 ms difference was solely from the use of Memcpy2D, but with using 3 regular cudaMemcpy()s (host to device) and the kernel which copies x,y,z to the float3, it takes at most 3 ms for that same transfer-from-host-and-copy-to-float3-array set of operations.

I assume you’re concerned about efficiently writing out the float3 structure?

Yes.

Also I made a mistake earlier when you referred to a recent article. Initially I thought you were referring to this:

http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-increase-performance-with-vectorized-memory-access/

which was why I brought up casting.

I was thinking about this… some unstructured thoughts on structured load/stores:

  • If you can get the host to cooperate, interleaving x and y would allow you to perform a coalesced float2 load followed by a float load of z.
  • If you can't get the host to help then you need to convert from SoA to AoS. Each warp can permute its load of the x/y/z segments and then, through a few exchanges or a couple SELPs, perform coalesced stores.
  • However, once it's on the device, repeatedly loading and storing a linearly packed AoS float3 is kind of a pain and won't be coalesced unless you transform it into a two-part float2 + float structure or repeatedly perform a transformation (and its inverse) like I described above.
  • I couldn’t find any definitive answers to this problem but I’m sure this is well-trodden territory. Are there better approaches?

    [s]I finally got around to run my own experiments, and I do not see any significant performance difference when copying two float arrays into an interleaved (float2) array compared to a straight copy of the entire data. This is the case I addressed in the Stackoverflow thread: given two float arrays on the host containing the real and imaginary components of complex data (i.e. SOA format) construct a device vector of float2 data, where each float2 object represents one complex number (i.e. AOS format).

    I locally modified my existing memcpy benchmarking framework as follows:

    #if 0
    cudaStat = cudaMemcpyAsync (devBuf, hostBuf, curBufSize,
    cudaMemcpyHostToDevice, s0);
    #else
    cudaStat = cudaMemcpy2DAsync (devBuf+0, 8, hostBuf + 0,            4, 4, curBufSize/8,
    cudaMemcpyHostToDevice, s0);
    cudaStat = cudaMemcpy2DAsync (devBuf+4, 8, hostBuf + curBufSize/2, 4, 4, curBufSize/8,
    cudaMemcpyHostToDevice, s0);
    #endif
    

    The reported performance using a K20c running with PCIe 2 on a 64-bit Linux host platform was as follows:

    1 x cudaMemcpyAsync:
    h2d: bytes= 1048576 time= 180.01 usec rate=5825.23MB/sec

    2 x cudaMemcpy2DAsync:
    h2d: bytes= 1048576 time= 182.87 usec rate=5734.09MB/sec

    For reference, the maximum PCIe throughput on this platform for very large transfers is about 6.0 GB/sec h2d and 6.6 GB/sec d2h, where 1 GB/sec = 1e9 bytes/sec.[/s]

    Sorry, I passed the incorrect commandline flags to my test app (which I hadn’t used in years) and so I measured not what I thought I measured in #12. With the correct settings, I measure the following on a K20c:

    1 x cudaMemcpyAsync()
    h2d: bytes= 1572864 time= 270.84 usec rate=5807.28MB/sec

    2 x cudaMemcpy2DAsync(), stride of two floats
    h2d: bytes= 1572864 time= 26398.90 usec rate=59.58MB/sec

    3x cudaMemcpy2DAsync(), stride of three floats
    h2d: bytes= 1572864 time= 26395.80 usec rate=59.59MB/sec

    Based on this, the strategy of performing straight copies of the three arrays to the device, followed by a kernel that rearranges the data from SOA into AOS format seems to be the best way to achieve the desired functionality. I do not understand the low performance of cudaMemcpy2DAsync() at this time.

    The code for the three copy variants listed above is:

    #if 0
                    cudaStat = cudaMemcpyAsync (devBuf, hostBuf, curBufSize, cudaMemcpyHostToDevice, s0);
    #elif 0
                    cudaStat = cudaMemcpy2DAsync (devBuf+0,  8, hostBuf + 0,            4, 4, curBufSize/8,
                                                  cudaMemcpyHostToDevice, s0);
                    cudaStat = cudaMemcpy2DAsync (devBuf+4,  8, hostBuf + curBufSize/2, 4, 4, curBufSize/8,
                                                  cudaMemcpyHostToDevice, s0);
    #else
                    cudaStat = cudaMemcpy2DAsync (devBuf+0, 12, hostBuf + 0,            4, 4, curBufSize/12,
                                                  cudaMemcpyHostToDevice, s0);
                    cudaStat = cudaMemcpy2DAsync (devBuf+4, 12, hostBuf + curBufSize/3, 4, 4, curBufSize/12,
                                                  cudaMemcpyHostToDevice, s0);
                    cudaStat = cudaMemcpy2DAsync (devBuf+8, 12, hostBuf+2*curBufSize/3, 4, 4, curBufSize/12,
                                                  cudaMemcpyHostToDevice, s0);
    #endif
    

    Great!

    I am going to give this a try, as I think this issue will come up again in future work.