GPU to GPU transfers most effective method?

Hi!

What is the most effective way to perform GPU to GPU transfers? Is this dependent on the motherboard, etc.?

thanks,

Allocate a buffer of pinned host memory and do a zero-copy transfer via that.
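
In other words, something along these lines (an untested sketch, not from the original reply; it assumes both devices report canMapHostMemory, and my_kernel, N, blocks and threads are placeholder names):

// Map a pinned, portable host buffer into the GPU address space (zero-copy).
float *h_buf;                      // host-side pointer
float *d_buf;                      // device-side alias of the same memory
size_t bytes = N * sizeof(float);

cudaSetDeviceFlags(cudaDeviceMapHost);   // each controlling thread, before its context is created
cudaHostAlloc((void**)&h_buf, bytes,
              cudaHostAllocMapped | cudaHostAllocPortable);
cudaHostGetDevicePointer((void**)&d_buf, h_buf, 0);

// A kernel on either GPU can now dereference d_buf; the accesses travel
// over PCIe to host memory, so no explicit cudaMemcpy is needed.
my_kernel<<< blocks, threads >>>(d_buf);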

I’m also a bit curious about the details here…

So is it correct that, using those API calls, the data won’t just pass over the northbridge straight to GPU #2, but will first have to go via the CPU? Or how does it really work in detail?

Thanks!

This is something of an educated guess, but zero copy effectively maps the host memory allocation into the GPU address space. Access still happens via the PCIe bus, but without any user space intervention or additional host code. So I presume that if two threads holding GPU contexts map the same chunk of zero-copy memory, they wind up with that chunk in both their address spaces. In theory, when one writes to the zero-copy buffer, the other will see those writes. The key question is when. Coherence, or the lack thereof, with zero copy is something I still don’t fully understand.

Presumably, __threadfence_system() on the writer ensures the data is flushed back to host memory, but I don’t know how the reader can be sure it isn’t getting stale values. Use of the volatile keyword?
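
For what it’s worth, the producer/consumer pattern being discussed would look roughly like this (an untested sketch; compute capability 2.0+ assumed, a single writer block, and all names are made up):

// Writer kernel: fill the zero-copy buffer, then publish a ready flag.
__global__ void writer(volatile float *zc_data, volatile int *zc_flag)
{
    zc_data[threadIdx.x] = 42.0f;   // write the payload
    __syncthreads();                // wait until every thread in the block has written
    if (threadIdx.x == 0) {
        __threadfence_system();     // push the writes out so the host / other GPU can see them
        *zc_flag = 1;               // only then raise the flag
    }
}

// The reader (host code, or a kernel on the other GPU) spins on the volatile
// flag and only then reads zc_data; volatile forces a fresh load on each poll.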

Hey,

So I tried tying each GPU context to one thread and, in one of the threads, allocating portable memory as described:

"cudaHostAllocPortable: The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation."

My idea is to let thread 1 copy its device data to this buffer and have thread 2 read from it using some sort of synchronization point, but I have had no success.

I created a threaded process as in the simpleMultiGPU example and wrote some very simple testing code:

static CUT_THREADPROC threadedRoutine(Plan* plan)
{
    int size = N * sizeof(float);
    int threadId = plan[0].threadId;

    cudaSetDevice(threadId);

    // Should be available to all contexts!
    float* h_portBuff;
    cudaHostAlloc((void**)&h_portBuff, size, cudaHostAllocPortable);

    if (threadId == 0)
    {
        printf("\n I'm thread 0! \n");
        float* d_ptr;
        cudaMalloc((void**)&d_ptr, size);
        // set to ~42.0f just for testing
        kernel_set<<< numBlocksX, threads >>>(d_ptr);
        cudaMemcpy(h_portBuff, d_ptr, size, cudaMemcpyDeviceToHost);
        //printf("\n error: %s\n", cudaGetErrorString(cudaGetLastError()));
    }
    else // Thread #2
    {
        printf("\n I'm thread 1! \n");
        // results ptr
        float* h_ptr = (float*)malloc(size);
        float* d_ptr;
        cudaMalloc((void**)&d_ptr, size);
        // Copy onto device #1 memory space
        cudaMemcpy(d_ptr, h_portBuff, size, cudaMemcpyHostToDevice);
        // run some kernel here........
        // Check results
        cudaMemcpy(h_ptr, d_ptr, size, cudaMemcpyDeviceToHost);
        printf("\n Device 0, val %0.3f\n", h_ptr[1]);
        //printf("\n error: %s\n", cudaGetErrorString(cudaGetLastError()));
    }
    CUT_THREADEND;
}

I tried allocating the portable buffer both in a single thread and in both threads, as in the example above, without any success. It seems I’ve misunderstood how to use portable buffers between contexts; any help would be much appreciated!

I assume you’re looking to copy bytes from one GPU to another GPU on the same host system. The fastest way in such a situation is to have two lightweight threads (say pthreads, OpenMP, etc.), each controlling its own GPU. You’d allocate a shared pinned memory buffer on the host, where shared means accessible as pinned to both CPU threads. After that, a sample OpenMP pattern would look something like this (inside some parallel region):

tid = omp_get_thread_num();
if( tid == src )
  cudaMemcpy( shared_pinned_buffer, ... ); // D2H copy
#pragma omp barrier
if( tid == dst )
  cudaMemcpy( ..., shared_pinned_buffer, ... ); // H2D copy

Now, the key here is that the H2D copy doesn’t start until all the bytes of the D2H copy have arrived on the host (so the two transfer times add up). You could improve on this by breaking the message into segments and using async memcopies to pipeline the segment transfers: transfer segment K to the destination GPU while the source GPU is transferring segment (K+1). So, instead of doubling the transfer time, you’d increase it by only a fraction of roughly 1/N, where N is the number of segments.

With larger messages (16-32 MB) and 8 or so segments I was able to get GPU-GPU transfers via the host to sustain ~5.3 GB/s, counting each byte once.
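
To make the pipelined variant concrete, here is one rough way to get the overlap, using plain blocking copies plus per-segment barriers instead of async memcopies (an untested sketch; staging, d_src, d_dst, seg and nseg are made-up names, and staging is assumed to be a portable pinned host buffer large enough for the whole message):

// Inside the same omp parallel region as above.
for (int k = 0; k < nseg; ++k) {
    if (tid == src)   // stage segment k: source GPU -> host
        cudaMemcpy(staging + k * seg, d_src + k * seg,
                   seg * sizeof(float), cudaMemcpyDeviceToHost);

    #pragma omp barrier   // segment k has now fully arrived in host memory

    if (tid == dst)   // upload segment k while the source stages segment k+1
        cudaMemcpy(d_dst + k * seg, staging + k * seg,
                   seg * sizeof(float), cudaMemcpyHostToDevice);
}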

Hey Guys,

It seems there’s currently no direct way to do a GPU-to-GPU transfer, even though it should be possible in theory. I found GPUDirect (GPUDirect | NVIDIA Developer); does anyone know when it will support direct transfers (on one node)?

There is indeed a way to transfer directly from GPU to GPU, without going through system RAM. If you have multiple GPUs, it is likely there is a PCIe switch upstream, and all traffic will be routed through it. It’s kind of like a general-purpose SLI for CUDA.

To do this, you need to map CUDA device memory into the GPU’s 256 MB BAR1 PCI address space. This requires modifying the virtual pagetables that exist on the GPU. After mapping that BAR1 address (you can find it with lspci, etc.) into the other GPU’s CUDA address space as mapped zero-copy memory, you can read/write directly between the two.

Doing this requires you know about some registers on your GPU. You can find those here: https://github.com/pathscale/envytools

Oh my God, you don’t want to use BAR1. To say that this is a bad idea is a ridiculous understatement. (I speak from a lot of experience)

Why, that sounds just great! If it’s fiendishly difficult maybe I will come out the wiser!

Or I will focus on other, more pressing matters, waiting for GPUDirect and PCI Express generation 3 lanes to save the day!

SPOILER ALERT: there are a lot of multi-GPU improvements in the next release of CUDA. (so… don’t try BAR1. just don’t.)