cuCtxEnablePeerAccess returns CUDA_ERROR_PEER_ACCESS_UNSUPPORTED even if cuDeviceCanAccessPeer returns 1

Hello,

I am trying to enable P2P data sharing between two GTX 1080s using the NVIDIA Driver API.

So I start by creating two threads:

  • the 1st thread creates a GPU context on device 0 using cuCtxCreate
    • GPU data allocation
  • the 2nd thread creates a GPU context on device 1

I want the 2nd thread to access the GPU data allocated on device 0.

My application starts by verifying that the second device can access memory on the first one by calling cuDeviceCanAccessPeer, and the returned canAccessPeer value is 1.

Then I call cuCtxEnablePeerAccess with the created contexts, and it returns CUDA_ERROR_PEER_ACCESS_UNSUPPORTED.
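To make the process concrete, here is a condensed, single-threaded sketch of what I am doing (error checking omitted; in the real application the two contexts are created from separate threads):

#include <cuda.h>
#include <cstdio>

int main()
{
	cuInit(0);

	CUdevice dev0, dev1;
	cuDeviceGet(&dev0, 0);
	cuDeviceGet(&dev1, 1);

	// Step 1: ask the driver whether device 1 can access memory on device 0.
	int canAccessPeer = 0;
	cuDeviceCanAccessPeer(&canAccessPeer, dev1, dev0);
	printf("canAccessPeer = %d \n", canAccessPeer);	// this prints 1 on my system

	// The contexts, as created by the two threads.
	CUcontext ctx0, ctx1;
	cuCtxCreate(&ctx0, 0, dev0);	// 1st thread: context + data allocation on device 0
	cuCtxCreate(&ctx1, 0, dev1);	// 2nd thread: context on device 1

	// Step 2: with ctx1 current, enable peer access to ctx0.
	cuCtxSetCurrent(ctx1);
	CUresult res = cuCtxEnablePeerAccess(ctx0, 0);
	printf("cuCtxEnablePeerAccess returns : %d \n", (int)res);	// in my application this is CUDA_ERROR_PEER_ACCESS_UNSUPPORTED

	cuCtxDestroy(ctx0);
	cuCtxDestroy(ctx1);
	return 0;
}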

Is my process okay?

Can you please help me?

My system details are:

  • CentOS 7 (3.10.0-123.el7.x86_64) with NVIDIA driver 410.78
  • 2 GTX 1080 founders edition

Regards,

Mathieu

what is the output from the simpleP2P cuda sample code on your system?

Hello,

The output of the test is:

[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2

GPU0 = "GeForce GTX 1080" IS capable of Peer-to-Peer (P2P)
GPU1 = "GeForce GTX 1080" IS capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...

Peer access from GeForce GTX 1080 (GPU0) -> GeForce GTX 1080 (GPU1) : Yes
Peer access from GeForce GTX 1080 (GPU1) -> GeForce GTX 1080 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
GeForce GTX 1080 (GPU0) supports UVA: Yes
GeForce GTX 1080 (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 9.60GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed

Regards,

Mathieu

So your system is configured correctly. Your process should be correct. I would assume therefore that the problem lies in something you haven’t shown.

Let me try to simplify.

In my application I have 20 contexts (10 on each GPU).

An additional context on the 2nd GPU is in charge of synchronizing all the data from all the others.

So I have to enable P2P access between the first 10 contexts and the 21st one.

Is the number of P2P connections limited even if I only have 2 devices? In that case, I would have expected the function to return CUDA_ERROR_TOO_MANY_PEERS.

That sounds like a bad design pattern to me.

I couldn’t agree more with you, but I still have to make it work. This code is legacy and I cannot modify it the way I would like to.

According to the NVIDIA documentation it should be possible to do what I mentioned earlier. So I wrote a simple test with 20 threads creating 20 contexts.

Then another thread creates a new context and tries to enable peer access, and it seems to work.

#include <string>
#include <iostream>
#include <cstdio>
#include <thread>
#include <semaphore.h>
#include <exception>
#include <cuda.h>

using namespace std;

#define NB_THREADS 20

// Contexts on the 1st device
CUcontext context0[NB_THREADS];

// Contexts on the second device
CUcontext context1;

// Threads for the 1st device Contexts
thread t00[NB_THREADS];

// Semaphore used to release the t00 threads once the peer-access tests are done
sem_t mutexStop;

// Semaphore used to unblock the thread working on the second device
sem_t mutexStart;

void task0(int i)
{
	// creates a context on the device 0
	CUresult ret = cuCtxCreate_v2 ( context0+i, 0, 0 ) ;
	if( ret != 0) {
		printf("task0 : cuCtxCreate returns : %d \n" , ret );
	}
	else {
		printf("context0[%d] with thread t00[%d] created \n", i , i);
	}
	if( i == (NB_THREADS-1)) {
		// unlock the 2nd thread
		sem_post(&mutexStart);
	}

	// wait the tests to complete
	sem_wait(&mutexStop);

	cuCtxDestroy_v2(*(context0+i));
}

void task1()
{
	// wait until the last task0 thread has created its context
	sem_wait(&mutexStart);

	// create the context on device 1 (it becomes current for this thread)
	CUresult ret = cuCtxCreate_v2 ( &context1, 0, 1 ) ;
	if( ret != 0) {
		printf("task1 : cuCtxCreate returns : %d \n" , ret );
	}
	else {
		printf("2nd context created \n");
	}

	// Enable peer access from the current context (context1) to each device-0 context
	for (int i = 0 ; i < NB_THREADS ; i++) {
		ret = cuCtxEnablePeerAccess ( context0[i], 0 ) ;
		if( ret != 0) {
			printf("task1 [%d]: cuCtxEnablePeerAccess returns : %d \n" , i , ret );
		}
		else {
			printf("task1 : context1 Enabled Peer Access with context0[%d] \n" , i);
		}
	}

	// Release the task0 threads so they can finish
	for (int i = 0 ; i < NB_THREADS ; i++) {
		sem_post(&mutexStop);
	}

	// context destroy
	cuCtxDestroy_v2(context1);
}

int main()
{
	// init cuda
	CUresult ret = cuInit(0);
	if( ret != 0) {
		printf("main : cuInit returns : %d \n" , ret );
	}

	// init mutex
	sem_init(&mutexStart, 0, 0);
	sem_init(&mutexStop, 0, 0);

	// Constructs the t00 threads and runs them.
	try {
		for( int i = 0 ; i < NB_THREADS ; i++) {
			t00[i] = thread(task0, i);
		}
		thread t1(task1);

		// Make the main thread wait for all the worker threads to finish.
		for( int i =0 ; i < NB_THREADS ; i++) {
			(t00[i]).join();
		}
		t1.join();
    }
	catch (std::exception& e) {
		std::cout << e.what() << std::endl;
	}

	sem_destroy(&mutexStop);
	sem_destroy(&mutexStart);
}

I can’t tell what’s different about this working case vs. the previous failing one. Is it simply the number of P2P connections?

If you’re failing after making 8 or 9 connections, then it wouldn’t surprise me if the limit of P2P connections is per unique connection rather than per device.
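If you want to probe that hypothesis, a rough (untested) sketch like the following should show at which count, if any, enabling starts to fail:

#include <cuda.h>
#include <cstdio>
#include <vector>

int main()
{
	cuInit(0);

	// one context on device 1, made current by cuCtxCreate
	CUcontext ctx1;
	cuCtxCreate(&ctx1, 0, 1);

	std::vector<CUcontext> peers;
	int enabled = 0;
	for (int i = 0; i < 64; ++i) {
		// another context on device 0
		CUcontext c;
		if (cuCtxCreate(&c, 0, 0) != CUDA_SUCCESS)
			break;
		peers.push_back(c);	// keep it for cleanup

		// enable access from ctx1 to the new device-0 context
		cuCtxSetCurrent(ctx1);
		CUresult res = cuCtxEnablePeerAccess(c, 0);
		if (res != CUDA_SUCCESS) {
			printf("enable failed at connection %d, error %d \n", i + 1, (int)res);
			break;
		}
		++enabled;
	}
	printf("enabled %d peer connections \n", enabled);

	for (CUcontext c : peers)
		cuCtxDestroy(c);
	cuCtxDestroy(ctx1);
	return 0;
}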

Unfortunately I can’t tell either. Like you, I was expecting the connection to fail after the 8th or 9th, but it works, so…

It might be something hidden. I will continue investigating and if I find something I will post it.

Regards,

Hello,

I have modified the sample I posted earlier to enable the connection in both directions, and it still works for at least 20 of them.
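The change to the loop in task1 looks roughly like this (a sketch reusing the globals from the sample above; cuCtxEnablePeerAccess works from whichever context is current, hence the cuCtxSetCurrent calls):

	// Enable peer access in both directions for each device-0 context
	for (int i = 0 ; i < NB_THREADS ; i++) {
		// context1 -> context0[i]
		cuCtxSetCurrent(context1);
		ret = cuCtxEnablePeerAccess ( context0[i], 0 ) ;
		if( ret != 0) {
			printf("task1 [%d]: context1 -> context0[%d] returns : %d \n" , i , i , ret );
		}

		// context0[i] -> context1 (reverse direction)
		cuCtxSetCurrent(context0[i]);
		ret = cuCtxEnablePeerAccess ( context1, 0 ) ;
		if( ret != 0) {
			printf("task1 [%d]: context0[%d] -> context1 returns : %d \n" , i , i , ret );
		}
	}
	// restore context1 as the current context for the rest of task1
	cuCtxSetCurrent(context1);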

So, since I agreed that the design of this legacy code was bad, I finally convinced my management to redesign it.

I created a singleton that manages a single context per device, and everything worked as expected.
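In case it helps someone, the new design is roughly the following (a simplified sketch; the names are illustrative, not the actual production code):

#include <cuda.h>
#include <mutex>
#include <vector>
#include <stdexcept>

// One Driver API context per device, created once and shared by all threads.
class ContextManager
{
public:
	static ContextManager& instance()
	{
		static ContextManager mgr;	// constructed once, thread-safe since C++11
		return mgr;
	}

	// Returns the single context owned for the given device ordinal;
	// callers make it current with cuCtxSetCurrent before using it.
	CUcontext context(int device)
	{
		std::lock_guard<std::mutex> lock(mutex_);
		return contexts_.at(device);
	}

private:
	ContextManager()
	{
		cuInit(0);
		int count = 0;
		cuDeviceGetCount(&count);
		contexts_.resize(count);
		for (int i = 0; i < count; ++i) {
			CUdevice dev;
			cuDeviceGet(&dev, i);
			if (cuCtxCreate(&contexts_[i], 0, dev) != CUDA_SUCCESS)
				throw std::runtime_error("cuCtxCreate failed");
		}
		// With exactly one context per device, a single
		// cuCtxEnablePeerAccess call per direction is enough.
	}

	~ContextManager()
	{
		for (CUcontext c : contexts_)
			cuCtxDestroy(c);
	}

	std::mutex mutex_;
	std::vector<CUcontext> contexts_;
};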

Unfortunately we won’t get an explanation, but it took me 2 weeks to rewrite the full application.

Thank you Robert