I’ve been given this CFD code to optimise. Many of the variables are bundled into float4s (e.g. position in x, y, z and pressure in w), but in many cases only one component needs to be transferred between devices. The data is held in a C++ vector of float4* called fields, and to transfer a field between devices the following call is made:
You could use cudaMemcpy2DAsync to do a strided copy of just the component you need; however, the performance may not be at maximum rates.
Alternatively, you could create a temporary buffer, gather the selected data into that buffer on the source device, then transfer the buffer. You’d have to do an equivalent scatter operation on the other end.
This can still be a win because device memory bandwidth is usually much higher than device->device transfer bandwidth.
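For what it's worth, that pack/transfer/unpack idea might be sketched roughly as below. The kernel and variable names (pack_w, unpack_w, srcDev, dstDev, and the staging buffers) are made up for illustration, the streams are assumed to have been created on their respective devices, and error checking is omitted:

```
__global__ void pack_w(const float4* __restrict__ field, float* __restrict__ buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = field[i].w;        // gather the pressure (w) component only
}

__global__ void unpack_w(float4* __restrict__ field, const float* __restrict__ buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i].w = buf[i];        // scatter back into the float4 array
}

void transfer_w(float4* srcField, int srcDev, cudaStream_t srcStream,
                float4* dstField, int dstDev, cudaStream_t dstStream,
                float* srcBuf, float* dstBuf, int n)
{
    int threads = 256, blocks = (n + threads - 1) / threads;

    cudaSetDevice(srcDev);
    pack_w<<<blocks, threads, 0, srcStream>>>(srcField, srcBuf, n);

    // Contiguous copy of n floats instead of n float4s between the devices.
    cudaMemcpyPeerAsync(dstBuf, dstDev, srcBuf, srcDev, n * sizeof(float), srcStream);
    cudaStreamSynchronize(srcStream);      // wait until the buffer has arrived

    cudaSetDevice(dstDev);
    unpack_w<<<blocks, threads, 0, dstStream>>>(dstField, dstBuf, n);
}
```

An event could replace the stream synchronisation to keep the host from blocking, but the basic pattern is the same.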
Using 2D copies is the most convenient way. However, such strided copies may run at only about one tenth of the speed of contiguous bulk copies between CPU and GPU. Where performance is important, re-sorting the data first, either in a CPU-side buffer small enough to fit into the L1 cache or on the GPU side, therefore deserves serious consideration. As txbob points out, this takes advantage of memory bandwidth in the range of hundreds of GB per second.
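If the data starts out in host memory, the CPU-side variant of that repacking could look roughly like the sketch below. All names are illustrative, the chunk size is just a cache-friendly guess, and error checking is omitted:

```
#include <cuda_runtime.h>

void upload_w_via_host(const float4* hostField, float* d_wBuf, size_t n, cudaStream_t stream)
{
    const size_t chunk = 4096;                        // 16 KB of floats, small enough to stay cache resident
    float* staging = nullptr;
    cudaMallocHost((void**)&staging, chunk * sizeof(float));  // pinned, so the uploads can be asynchronous

    for (size_t base = 0; base < n; base += chunk) {
        size_t count = (n - base < chunk) ? (n - base) : chunk;
        for (size_t i = 0; i < count; ++i)
            staging[i] = hostField[base + i].w;       // contiguous gather of the w component
        cudaMemcpyAsync(d_wBuf + base, staging, count * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);                // the staging buffer is reused, so wait before refilling
    }
    cudaFreeHost(staging);
}
```

Double-buffering the staging area would avoid the synchronisation inside the loop; the point is only that the gather itself stays cheap because the buffer stays cache resident.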
Yes, I can see the advantage of writing the w component to a buffer and transferring the buffer, but how would I use cudaMemcpy2DAsync to transfer just the w component of a 1d array of float4?
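For illustration, such a strided call could look roughly like this. It treats the float4 array as n rows with a pitch of sizeof(float4) and copies a sizeof(float)-wide column starting at the w member; the pointer names are placeholders, and the destination here is assumed to be a tightly packed float buffer:

```
#include <cuda_runtime.h>
#include <cstddef>

void copy_w_component(const float4* d_field, float* d_wBuf, size_t n, cudaStream_t stream)
{
    const char* src = reinterpret_cast<const char*>(d_field) + offsetof(float4, w);

    cudaMemcpy2DAsync(d_wBuf, sizeof(float),    // dst and dpitch: pack the floats contiguously
                      src,    sizeof(float4),   // src and spitch: stride over whole float4s
                      sizeof(float),            // width of each copied row, in bytes
                      n,                        // height, i.e. the number of elements
                      cudaMemcpyDeviceToDevice, stream);
}
```

To write straight into the w slots of a float4 array on the other end instead of a packed buffer, the destination pointer would be offset in the same way and dpitch would become sizeof(float4); for a copy between two different GPUs with unified addressing, cudaMemcpyDefault would be the kind to pass.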