GPU to GPU transfers most effective method?

Hi!

What is the most effective way to perform GPU to GPU transfers? Is this dependent on the motherboard, etc.?

thanks,

Allocate a buffer of pinned host memory and do a zero-copy transfer via that.
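
In other words, something along these lines (an untested sketch, not from the original reply; it assumes both devices report canMapHostMemory, and my_kernel, N, blocks and threads are placeholder names):

// Map a pinned, portable host buffer into the GPU address space (zero-copy).
float *h_buf;                      // host-side pointer
float *d_buf;                      // device-side alias of the same memory
size_t bytes = N * sizeof(float);

cudaSetDeviceFlags(cudaDeviceMapHost);   // each controlling thread, before its context is created
cudaHostAlloc((void**)&h_buf, bytes,
              cudaHostAllocMapped | cudaHostAllocPortable);
cudaHostGetDevicePointer((void**)&d_buf, h_buf, 0);

// A kernel on either GPU can now dereference d_buf; the accesses travel
// over PCIe to host memory, so no explicit cudaMemcpy is needed.
my_kernel<<< blocks, threads >>>(d_buf);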

I’m also a bit curious about the details here…

So is it correct that, using those API calls, the data won’t just pass over the northbridge straight to GPU #2, but will first have to go via the CPU? Or how does it really work in detail?

Thanks!

This is something of an educated guess, but zero copy effectively maps the host memory allocation into the GPU address space. Access still happens via the PCIe bus, but without any user space intervention or additional host code. So I presume that if two threads holding GPU contexts map the same chunk of zero-copy memory, they wind up with that chunk in both their address spaces. In theory, when one writes to the zero-copy buffer, the other will see those writes. The key question is when. Coherence, or the lack thereof, with zero copy is something I still don’t fully understand.

Presumably, __threadfence_system() on the writer ensures the data is flushed back to host memory, but I don’t know how the reader can be sure it isn’t getting stale values. Use of the volatile keyword?
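
For what it’s worth, the producer/consumer pattern being discussed would look roughly like this (an untested sketch; compute capability 2.0+ assumed, a single writer block, and all names are made up):

// Writer kernel: fill the zero-copy buffer, then publish a ready flag.
__global__ void writer(volatile float *zc_data, volatile int *zc_flag)
{
    zc_data[threadIdx.x] = 42.0f;   // write the payload
    __syncthreads();                // wait until every thread in the block has written
    if (threadIdx.x == 0) {
        __threadfence_system();     // push the writes out so the host / other GPU can see them
        *zc_flag = 1;               // only then raise the flag
    }
}

// The reader (host code, or a kernel on the other GPU) spins on the volatile
// flag and only then reads zc_data; volatile forces a fresh load on each poll.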

Hey,

So I tried tying each GPU context to one thread and, in one of the threads, allocating portable memory as described:

"cudaHostAllocPortable: The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation."

My idea is to let thread 1 copy its device data to this buffer and have thread 2 read from it using some sort of synchronization point, but I have had no success.

I created a threaded process as in the simpleMultiGPU example and wrote some very simple testing code:

static CUT_THREADPROC threadedRoutine(Plan* plan)
{
    int size = N * sizeof(float);
    int threadId = plan[0].threadId;

    cudaSetDevice(threadId);

    // Should be available to all contexts!
    float* h_portBuff;
    cudaHostAlloc((void**)&h_portBuff, size, cudaHostAllocPortable);

    if (threadId == 0)
    {
        printf("\n I'm thread 0! \n");
        float* d_ptr;
        cudaMalloc((void**)&d_ptr, size);
        // set to ~42.0f just for testing
        kernel_set<<< numBlocksX, threads >>>(d_ptr);
        cudaMemcpy(h_portBuff, d_ptr, size, cudaMemcpyDeviceToHost);
        //printf("\n error: %s\n", cudaGetErrorString(cudaGetLastError()));
    }
    else // Thread #2
    {
        printf("\n I'm thread 1! \n");
        // results ptr
        float* h_ptr = (float*)malloc(size);
        float* d_ptr;
        cudaMalloc((void**)&d_ptr, size);
        // Copy onto device #1 memory space
        cudaMemcpy(d_ptr, h_portBuff, size, cudaMemcpyHostToDevice);
        // run some kernel here........
        // Check results
        cudaMemcpy(h_ptr, d_ptr, size, cudaMemcpyDeviceToHost);
        printf("\n Device 0, val %0.3f\n", h_ptr[1]);
        //printf("\n error: %s\n", cudaGetErrorString(cudaGetLastError()));
    }
    CUT_THREADEND;
}

I tried allocating the portable buffer both in a single thread and in both threads, as in the example above, without any success. It seems I’ve misunderstood how to use portable buffers between contexts; any help would be much appreciated!

I assume you’re looking to copy bytes from one GPU to another GPU on the same host system. The fastest way in such a situation is to have two lightweight threads (say pthreads, OpenMP, etc.), each controlling its own GPU. You’d allocate a shared pinned memory buffer on the host, where shared means accessible as pinned to both CPU threads. After that, a sample OpenMP pattern would look something like this (inside some parallel region):

tid = omp_get_thread_num();
if( tid == src )
  cudaMemcpy( shared_pinned_buffer, ... ); // D2H copy
#pragma omp barrier
if( tid == dst )
  cudaMemcpy( ..., shared_pinned_buffer, ... ); // H2D copy

Now, the key here is that the H2D copy doesn’t start until all the bytes of the D2H copy have arrived on the host (so the two transfer times add up). You could improve on this by breaking the message into segments and using async memcopies to pipeline the segment transfers: transfer segment K to the destination GPU while the source GPU is transferring segment (K+1). So, instead of doubling the transfer time, you’d increase it by only a fraction of roughly 1/N, where N is the number of segments.

With larger messages (16-32 MB) and 8 or so segments I was able to get GPU-GPU transfers via the host to sustain ~5.3 GB/s, counting each byte once.
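
To make the pipelined variant concrete, here is one rough way to get the overlap, using plain blocking copies plus per-segment barriers instead of async memcopies (an untested sketch; staging, d_src, d_dst, seg and nseg are made-up names, and staging is assumed to be a portable pinned host buffer large enough for the whole message):

// Inside the same omp parallel region as above.
for (int k = 0; k < nseg; ++k) {
    if (tid == src)   // stage segment k: source GPU -> host
        cudaMemcpy(staging + k * seg, d_src + k * seg,
                   seg * sizeof(float), cudaMemcpyDeviceToHost);

    #pragma omp barrier   // segment k has now fully arrived in host memory

    if (tid == dst)   // upload segment k while the source stages segment k+1
        cudaMemcpy(d_dst + k * seg, staging + k * seg,
                   seg * sizeof(float), cudaMemcpyHostToDevice);
}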

Hey Guys,

It seems there’s currently no direct way to do a GPU-to-GPU transfer, even though it should be possible in theory. I found GPUDirect (GPUDirect | NVIDIA Developer); does anyone know when it will support direct transfers (on one node)?

There is indeed a way to transfer directly from GPU to GPU, without going through system RAM. If you have multiple GPUs, it is likely there is a PCIe switch upstream, and all traffic will be routed through it. It’s kind of like a general-purpose SLI for CUDA.

To do this, you need to map CUDA device memory into the GPU’s 256 MB BAR1 PCI address space. This requires modifying the virtual pagetables that exist on the GPU. After mapping that BAR1 address (you can find it with lspci, etc.) into the other GPU’s CUDA address space as mapped zero-copy memory, you can read/write directly between the two.

Doing this requires you know about some registers on your GPU. You can find those here: https://github.com/pathscale/envytools

Oh my God, you don’t want to use BAR1. To say that this is a bad idea is a ridiculous understatement. (I speak from a lot of experience)

Why, that sounds just great! If it’s fiendishly difficult maybe I will come out the wiser!

Or I will focus on other, more pressing matters, waiting for GPUDirect and PCI Express generation 3 lanes to save the day!

SPOILER ALERT: there are a lot of multi-GPU improvements in the next release of CUDA. (so… don’t try BAR1. just don’t.)