The fastest copies are those that are avoided altogether. In my experience, any time the question of “fastest bulk copy” comes up in the context of performance tuning, it is a red flag.
Physically, from a hardware perspective, the fastest copies are those that use vector loads and stores. The widest of these are currently 128 bits (16 bytes), which corresponds to CUDA’s uint4 type, for example. Note that on GPUs, all loads and stores must be naturally aligned, otherwise their behavior is undefined. That is, a 16-byte access must be to an address that is evenly divisible by 16.
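To make that concrete, here is a minimal sketch of a copy kernel that moves data in 16-byte chunks. It assumes both pointers are 16-byte aligned and that the copy size is an exact multiple of sizeof(uint4); the kernel name and the grid-stride loop are illustrative, not taken from your code.

```
__global__ void copy_uint4(uint4 *dst, const uint4 *src, size_t nelem)
{
    // grid-stride loop: each thread copies one uint4 (128 bits) per iteration
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < nelem; i += stride) {
        dst[i] = src[i];   // compiles to one 128-bit load and one 128-bit store
    }
}
```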
Such alignment does not fall out of your current structure definition, and the alignment requirement means you cannot simply cast pointers to a type with stricter alignment. Since no context was provided, you will have to figure out the best way to ensure alignment yourself. FWIW, conventional wisdom in performance tuning suggests that it is usually best to sort structure members in order of decreasing element type size, whereas the opposite was done here (assuming that WORD is a wider type than BYTE, which seems reasonable).
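Since the actual structure definition was not posted, the following is purely illustrative: it assumes WORD is a 2-byte type and BYTE a 1-byte type (as in the common Windows typedefs), shows members sorted by decreasing size, and uses CUDA’s __align__(16) specifier to force an alignment that permits 16-byte accesses.

```
struct __align__(16) item_t {   // hypothetical layout; member names are made up
    unsigned int   id;          // widest member first (4 bytes)
    unsigned short w0, w1;      // WORD-sized members next (2 bytes each)
    unsigned char  b0, b1;      // BYTE-sized members last (1 byte each)
    // padded to 16 bytes by the alignment specifier, so an array of item_t
    // can be read and written with 128-bit accesses
};
```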
As for loop unrolling, that is something the CUDA compiler pursues aggressively by itself. You can intervene manually with the help of #pragma unroll if need be.
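For example (the kernel and the unroll factor of 4 are arbitrary, just to show the syntax):

```
__global__ void scale4(float *data, int n, float factor)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    #pragma unroll 4                      // request full unrolling of this 4-iteration loop
    for (int k = 0; k < 4; k++) {
        int idx = base + k;
        if (idx < n) data[idx] *= factor;
    }
}
```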
If you continue with your current approach, and the data in both cases resides in global memory, the suggestion is to make sure your loads and stores are coalesced across threads. This affects your indexing and data storage patterns. There are numerous questions about coalescing on various forums; if you search, you will find some. In a nutshell, you want adjacent threads in a warp to read (or write) adjacent locations in global memory, as in the sketch below.
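Both kernels below are made up for illustration and copy plain float elements. In the first, adjacent threads of a warp touch adjacent 4-byte locations, so the warp’s accesses combine into a small number of memory transactions; in the second, a stride greater than 1 leaves gaps between threads and coalescing degrades.

```
__global__ void copy_coalesced(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread t accesses element t
    if (i < n) dst[i] = src[i];                      // fully coalesced
}

__global__ void copy_strided(float *dst, const float *src, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) dst[i] = src[i];                      // poorly coalesced for stride > 1
}
```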