RegisterResource sometimes fails with multiple cards

We have a problem where doing a cudaGraphicsD3D11RegisterResource on multiple cards fails on some machines. On my machine it has never worked; on a colleague's machine it worked until updating the drivers, and downgrading again didn't help; on a third machine it works flawlessly.

This is a critical issue for us, as many customers have bought multiple NVIDIA cards to improve decoding performance.

My machine: GTX 1080 + GTX 1060
Colleague's machine: dual GTX 1060
Working machine: dual RTX 2080 Ti

Source to reproduce in comment

// TestNvidiaDxThreads.cpp : This file contains the 'main' function. Program execution begins and ends there.
//
// (The forum software stripped the angle-bracketed header names; they are
// reconstructed here from what the code actually uses.)

#include "pch.h"
#include <thread>
#include <chrono>
#include <sstream>
#include <vector>
#include <codecvt>
#include <locale>
#include <cstdint>
#include <string>   // std::string
#include <iostream> // std::cout
#include <cuda_runtime.h>

#include <winerror.h>

#include <cuda.h>
#include <cuda_d3d11_interop.h>

#include <dxgi.h>
#include <d3d11.h>

void thread(int parm)
{

std::stringstream ss;
DXGI_ADAPTER_DESC adapterDesc;

try
{
	CUresult res = cuInit(0);

	if (res != CUDA_SUCCESS)
		throw "Could not init Cuda";

	IDXGIFactory1* pFactory1 = nullptr;

	if(parm == 0)
		std::this_thread::sleep_for(std::chrono::milliseconds(2000));


	if (SUCCEEDED(CreateDXGIFactory1(__uuidof(IDXGIFactory1), (void**)&pFactory1)))
	{
		IDXGIAdapter1 * pAdapter1 = nullptr;

		if (pFactory1->EnumAdapters1(parm, &pAdapter1) == DXGI_ERROR_NOT_FOUND)
			throw "Adapter not found";

		
		pAdapter1->GetDesc(&adapterDesc);

		CUcontext ctx;

		int cudaDevIx = -1;
		auto ores = cudaD3D11GetDevice(&cudaDevIx, pAdapter1);
		if (ores != cudaError::cudaSuccess)
			return;
		res = cuCtxCreate(&ctx, CU_CTX_SCHED_BLOCKING_SYNC, cudaDevIx);
		if (res != CUDA_SUCCESS)
			throw "Couldnt create cuda context";

		res = cuCtxPushCurrent(ctx);
		if (res != CUDA_SUCCESS)
			throw "Couldnt push cuda context";
		int m_flags = 0;

		ID3D11Device*  pDevice;
		ID3D11DeviceContext* pContext;

		HRESULT hRes = D3D11CreateDevice(pAdapter1,
			D3D_DRIVER_TYPE_UNKNOWN,
			NULL,
			m_flags,
			NULL,
			NULL,
			D3D11_SDK_VERSION,
			&pDevice,
			NULL,
			&pContext);

		if (hRes != S_OK)
			throw "Couldnt create device";

		int height = 10;

		const size_t totalNumberOfRows = 15;
		const unsigned int bytesPrPixel = 1;
		const unsigned int bytesPrReadOperation = 4;

		ID3D11Texture2D* pTexture;

		D3D11_TEXTURE2D_DESC TextureDesc;
		memset(&TextureDesc, 0, sizeof(D3D11_TEXTURE2D_DESC));

		TextureDesc.Width = 10;
		TextureDesc.Height = 10;
		TextureDesc.MipLevels = 1;
		TextureDesc.ArraySize = 1;
		TextureDesc.Format = DXGI_FORMAT::DXGI_FORMAT_NV12;
		TextureDesc.SampleDesc.Count = 1;
		TextureDesc.SampleDesc.Quality = 0;
		TextureDesc.Usage = D3D11_USAGE_DEFAULT;
		TextureDesc.CPUAccessFlags = 0;
		TextureDesc.MiscFlags = D3D11_RESOURCE_MISC_SHARED;
		TextureDesc.BindFlags = D3D11_BIND_RENDER_TARGET | D3D11_BIND_SHADER_RESOURCE;


		hRes = pDevice->CreateTexture2D(&TextureDesc, nullptr, &pTexture);
		if (hRes != S_OK)
			throw "Couldnt create texture";

		size_t pitch = 10;
		uint8_t *rawPointer;
		CUresult result = cuMemAllocPitch((CUdeviceptr *)(uintptr_t *)&rawPointer, &pitch, (size_t)10, totalNumberOfRows, bytesPrReadOperation);
		if (result != CUDA_SUCCESS)
			throw "couldnt allocate";

		cudaGraphicsResource* pCudaResource;
		auto cuError = cudaGraphicsD3D11RegisterResource(&pCudaResource, pTexture, CU_GRAPHICS_REGISTER_FLAGS_NONE);
		if (cuError != cudaSuccess)
			throw "couldnt register resource";

	}

	ss << "Device " << parm << " (" << std::wstring_convert<std::codecvt_utf8<wchar_t>>().to_bytes(adapterDesc.Description) << ") succeeded\r\n";
}
catch (const char* s)
{
	ss << "Device " << parm << "(" << std::wstring_convert<std::codecvt_utf8<wchar_t>>().to_bytes(adapterDesc.Description) << "): " << s << "\r\n";
}

std::cout << ss.str();

}

int main()
{
std::vector<std::thread> threads;

for (int a = 0; a < 3; a++)
//for (int a = 3; a >=0 ; a--)
{
	std::thread threadObj(thread, a);
	threads.emplace_back(std::move(threadObj));
}

std::this_thread::sleep_for(std::chrono::milliseconds(100000));

for (auto &t : threads)
{
	t.join();
}

}


What's the cudaError that cudaGraphicsD3D11RegisterResource() returned?

Hi Chris, thank you for answering :-)

RegisterResource returns cudaErrorInvalidDevice (101).

Best regards,
Michael Bodenhoff

Oh, and I realized that I had left important information out:

the problem isn't GPU dependent; the first GPU that does RegisterResource works, subsequent GPUs don't.

The GPU that works continues to be able to do RegisterResource, but the other(s) still can't.

You can try switching the two for loops in main, one incrementing, the other decrementing.

Sorry for spamming.

It is my theory that deprecating cudaD3D11SetDirect3DDevice has broken something that only shows up in rare cases.

Hi Mibosripl,
Sorry for the late response!
The error "cudaErrorInvalidDevice" indicates that the CUDA API call did not run on the target device.
Could you refer to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#direct3d-9-version and call cudaSetDevice() after the CreateDeviceEx() call:

// Use the same device
cudaSetDevice(dev);
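Fleshed out against the repro above, the suggested ordering might look like this (a sketch only; error handling is elided, `pAdapter1` and `pTexture` are the names from the repro, and the function name is made up):

```cpp
#include <cuda_runtime.h>
#include <cuda_d3d11_interop.h>
#include <d3d11.h>

// Sketch: bind the runtime to the CUDA device that backs the DXGI adapter
// before registering the D3D texture with CUDA.
void registerOnAdapter(IDXGIAdapter1* pAdapter1, ID3D11Texture2D* pTexture)
{
    int cudaDev = -1;
    if (cudaD3D11GetDevice(&cudaDev, pAdapter1) != cudaSuccess)
        return; // adapter is not a CUDA device

    cudaSetDevice(cudaDev); // use the same device for subsequent runtime calls

    cudaGraphicsResource* pCudaResource = nullptr;
    cudaGraphicsD3D11RegisterResource(&pCudaResource, pTexture,
                                      cudaGraphicsRegisterFlagsNone);
}
```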

I tried that, but then I get a CUDA_ERROR_CONTEXT_IS_DESTROYED (709) when doing cuMemAllocPitch :-(

And then there's the detail that it is currently working on two other machines, one with dual GTX 1080 Ti and one with dual RTX 2080 Ti, running the same code I posted in this thread; it fails on my GTX 1060 + GTX 1080 machine and on another with dual GTX 1060.

Could you help do two more experiments on the GTX 1060 + GTX 1080 machine?

Experiment #1:

  1. set CUDA_VISIBLE_DEVICES=0
  2. Run your application

Experiment #2:

  1. set CUDA_VISIBLE_DEVICES=2
  2. Run your application

Thanks!

But ofc
😊

C:\TestNvidia>TestNvidiaDxThreads.exe

Device 0 (NVIDIA GeForce GTX 1080) succeeded

Device 1(NVIDIA GeForce GTX 1060 6GB): couldnt register resource

Device 2(Intel® UHD Graphics 630): Adapter not Nvidia

Device 3(Microsoft Basic Render Driver): Adapter not Nvidia

^C

C:\TestNvidia>set CUDA_VISIBLE_DEVICES=0

C:\TestNvidia>TestNvidiaDxThreads.exe

Device 0 (NVIDIA GeForce GTX 1080) succeeded

Device 1(NVIDIA GeForce GTX 1060 6GB): Adapter not Nvidia

Device 2(Intel® UHD Graphics 630): Adapter not Nvidia

Device 3(Microsoft Basic Render Driver): Adapter not Nvidia

^C

C:\TestNvidia>set CUDA_VISIBLE_DEVICES=2

C:\TestNvidia>TestNvidiaDxThreads.exe

Device 0(ÝíÇýƺ╔í): Could not init Cuda

Device 1(): Could not init Cuda

Device 2(): Could not init Cuda

Device 3(): Could not init Cuda

Device 4(): Could not init Cuda

Device 5(): Could not init Cuda


@mchi
@ChrisDing

I played around with it some more, and I'm kind of baffled by the fact that the problem isn't a specific GPU in my machine: the first GPU to do RegisterResource works, but the following ones don't.

cudaGetDevice returns the correct device id (the one cudaD3D11GetDevice returned) even without cudaSetDevice.

cuCtxGetCurrent returns the same context I set earlier using the device id that cudaD3D11GetDevice returned.

cuCtxGetDevice returns the same device id I set earlier, the one cudaD3D11GetDevice returned.

Everything I can think of is exactly as expected, but still the RegisterResource call fails with cudaErrorInvalidDevice (101).
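For reference, the sanity checks described above can be expressed like this (a sketch mixing the two APIs the same way the repro does; the function name is made up):

```cpp
#include <cuda.h>
#include <cuda_runtime.h>
#include <cassert>

// Sketch: after cuCtxCreate/cuCtxPushCurrent with the id returned by
// cudaD3D11GetDevice, verify that both APIs agree on device and context.
void verifyBinding(CUcontext expectedCtx, int expectedDev)
{
    int rtDev = -1;
    cudaGetDevice(&rtDev);      // runtime view of the current device

    CUcontext curCtx = nullptr;
    cuCtxGetCurrent(&curCtx);   // driver view of the current context

    CUdevice drvDev = -1;
    cuCtxGetDevice(&drvDev);    // device owning the current context

    assert(rtDev == expectedDev);
    assert(curCtx == expectedCtx);
    assert(drvDev == expectedDev);
}
```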

@mchi
@ChrisDing

Sorry for tagging both of you; I don't know if you're both still involved.

I tried making it run serially instead of in parallel, and RegisterResource still works on the first GPU while subsequent GPUs fail.

Unless ALL of the first GPU's resources are released, attempts on subsequent GPUs will fail,

i.e.:

pTexture->Release(); // registered Dx11 texture
cudaGraphicsUnregisterResource(pCudaResource); // CUDA resource registered for that texture

pContext->Release(); // Dx11 context of the registered texture
pDevice->Release(); // Dx11 device of the registered texture

If any resource of the first GPU is still alive when trying to register a resource on other CUDA GPUs, it will fail every time on my machine.

Please remember that we have other dual-GPU machines where it works on both GPUs in parallel, without releasing everything.

And since we're trying to decode a lot of streams at the same time, load balancing between all GPUs, not being able to utilize multiple NVIDIA GPUs is a major showstopper :-(

  1. According to the log, your 1080 should have id 0 and the 1060 id 1.
    As a result, you should not use "set CUDA_VISIBLE_DEVICES=2" but "set CUDA_VISIBLE_DEVICES=1" instead.
    Could you try "CUDA_VISIBLE_DEVICES=1"?

  2. What are your CUDA and GPU driver versions?
    It seems that CUDA is not the latest version on your side.

Hi @leif,

I just did what I was told :-)

Here are the results of what you asked for:

Question 1:

C:\TestNvidia>set CUDA_VISIBLE_DEVICES=0

C:\TestNvidia>TestNvidiaDxThreads.exe
Device 0 (NVIDIA GeForce GTX 1080) succeeded
Device 1(NVIDIA GeForce GTX 1060 6GB): Adapter not Nvidia
Device 2(Intel(R) UHD Graphics 630): Adapter not Nvidia
Device 3(Microsoft Basic Render Driver): Adapter not Nvidia
^C
C:\TestNvidia>set CUDA_VISIBLE_DEVICES=1

C:\TestNvidia>TestNvidiaDxThreads.exe
Device 0(NVIDIA GeForce GTX 1080): Adapter not Nvidia
Device 1 (NVIDIA GeForce GTX 1060 6GB) succeeded
Device 2(Intel(R) UHD Graphics 630): Adapter not Nvidia
Device 3(Microsoft Basic Render Driver): Adapter not Nvidia
^C
C:\TestNvidia>

If CUDA_VISIBLE_DEVICES just enables/disables one or more GPUs, I'm not surprised by the result. The code works fine on each adapter on its own, but trying to do RegisterResource on two GPUs will always fail on some machines, and it's always the second GPU that RegisterResource fails on; i.e. if I do RegisterResource on my 1080 the 1060 will fail, and vice versa.

BUT if the first GPU unregisters and releases its DX resources, the second GPU will be fine; as long as the first GPU (1080 or 1060, it doesn't matter) still has registered resources, the second GPU will always fail.

We have made a fallback solution where, instead of doing RegisterResource, we copy from CUDA to sysmem and from there to DX memory, and that works, except for the horrible performance on the GPUs that can't do RegisterResource.

One thing we noticed, though, was that if RegisterResource fails, ALL operations thereafter fail on that GPU/thread; we have to reinitialize on that GPU again. It's as if RegisterResource corrupts the context.
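The fallback described above (CUDA device memory → system memory → D3D texture) could be sketched roughly like this; `pContext` and `pTexture` are the names from the repro, the function name is made up, and the single-plane copy is a simplification (a real NV12 texture would also need its chroma plane):

```cpp
#include <cuda_runtime.h>
#include <d3d11.h>
#include <cstdint>
#include <vector>

// Sketch: stage a pitched CUDA allocation through system memory into a
// D3D11 texture, avoiding cudaGraphicsD3D11RegisterResource entirely.
void copyViaSysmem(ID3D11DeviceContext* pContext, ID3D11Texture2D* pTexture,
                   const uint8_t* devPtr, size_t devPitch,
                   unsigned width, unsigned height)
{
    std::vector<uint8_t> staging(width * height);

    // Device -> host, collapsing the device pitch to a tight host pitch.
    cudaMemcpy2D(staging.data(), width, devPtr, devPitch,
                 width, height, cudaMemcpyDeviceToHost);

    // Host -> D3D11 default-usage texture.
    pContext->UpdateSubresource(pTexture, 0, nullptr,
                                staging.data(), width, 0);
}
```

This is the slow path: two extra copies per frame, which matches the "horrible performance" observed.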

Question 2:

The CUDA version is 10.1, but I downloaded 10.2 and tried that as well, with the same outcome.

The driver version is 445.87.

Hi Mibisripl,
Sorry for the delay! Since we don't have the HW (two GPU cards) and SW (Windows) setup, it's hard for us to debug.

BTW, I think you could submit a ticket on our CUDA sub-forum - https://forums.developer.nvidia.com/c/accelerated-computing/cuda/cuda-programming-and-performance/7 - or Graphics - https://forums.developer.nvidia.com/c/visual-and-game-development/general-graphics-programming/197 ; they may have more insight than us.

Thanks!

Hi Mibisripl,
Because it's hard for us to debug the issue, could you try these two actions:

Action 1:

  1. Move "CUresult res = cuInit(0);" to main() and keep one instance.
    This system-wide API is for all devices.
  2. Use cuD3D11GetDevice instead of cudaD3D11GetDevice.
    The device id returned by cudaD3D11GetDevice() is an ordinal value, which cannot be used for cuCtxCreate(); cuCtxCreate() requires a CUdevice, which is a device handle (check the data types below).
    From
    int cudaDevIx = -1;
    cudaD3D11GetDevice ( int* pcudaDevIx, IDXGIAdapter* pAdapter )
    cuCtxCreate ( CUcontext* pctx, unsigned int flags, CUdevice cudaDevIx )
    to
    CUdevice cudaDevIx = -1;
    cuD3D11GetDevice ( CUdevice* pCudaDevice, IDXGIAdapter* pAdapter )
    cuCtxCreate ( CUcontext* pctx, unsigned int flags, CUdevice dev )
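Put together, Action 1 might look like this (a sketch; `cuD3D11GetDevice` is declared in `cudaD3D11.h`, the function name is made up, and error handling is minimal):

```cpp
#include <cuda.h>
#include <cudaD3D11.h>
#include <dxgi.h>

// Sketch of Action 1: cuInit(0) is called once in main(), then the
// driver-API interop call hands cuCtxCreate a real CUdevice handle.
CUcontext createContextForAdapter(IDXGIAdapter* pAdapter)
{
    CUdevice dev = -1;
    if (cuD3D11GetDevice(&dev, pAdapter) != CUDA_SUCCESS)
        return nullptr; // not a CUDA-capable adapter

    CUcontext ctx = nullptr;
    cuCtxCreate(&ctx, CU_CTX_SCHED_BLOCKING_SYNC, dev);
    return ctx;
}
```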

OR

Action 2:
Use the CUDA runtime API instead of the CUDA driver API completely:

  1. Remove "cuInit(0);", "cuCtxCreate();" and "cuCtxPushCurrent();"
  2. Use cudaMallocPitch() instead of cuMemAllocPitch();
  3. Use cudaSetDevice(0) for the 1080 / cudaSetDevice(1) for the 1060 before cudaMallocPitch()

Anyway, we recommend the CUDA runtime API because it is easier to use.
And the code should not be a mixture of the CUDA runtime API and the CUDA driver API.
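A runtime-API-only version of the repro's per-GPU setup might then reduce to something like this (a sketch; no cuInit/cuCtxCreate, device selection via cudaSetDevice, function name made up, sizes taken from the repro):

```cpp
#include <cuda_runtime.h>
#include <cuda_d3d11_interop.h>
#include <d3d11.h>

// Sketch of Action 2: pure runtime API. cudaSetDevice establishes the
// context implicitly; cudaMallocPitch replaces cuMemAllocPitch.
cudaError_t setupOnDevice(IDXGIAdapter* pAdapter, ID3D11Texture2D* pTexture,
                          cudaGraphicsResource** ppResource)
{
    int dev = -1;
    cudaError_t err = cudaD3D11GetDevice(&dev, pAdapter);
    if (err != cudaSuccess) return err;

    cudaSetDevice(dev); // e.g. 0 for the 1080, 1 for the 1060

    void*  devPtr = nullptr;
    size_t pitch  = 0;
    err = cudaMallocPitch(&devPtr, &pitch, 10, 15); // width/rows as in the repro
    if (err != cudaSuccess) return err;

    return cudaGraphicsD3D11RegisterResource(ppResource, pTexture,
                                             cudaGraphicsRegisterFlagsNone);
}
```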

Hi leif,

I realize that locating the problem without being able to reproduce it takes a lot of trial and error, and I'll do my best to test things for you guys and report my findings as accurately as possible 😊

After #ifdef'ing the heck out of my test program, I have established that it fails the same way with:

  • RT and Driver API

  • CUDA 10.1 and 10.2

  • Debug and release

  • RegisterResource on the 1080 first and on the 1060 first

I have tested all combinations; all fail.

Neither of the GPUs shows any issues on its own, but after doing RegisterResource on one or the other, the subsequent GPUs fail. The first adapter can continue doing RegisterResource without any problem, though.

Not only does RegisterResource fail on subsequent GPUs, all CUDA operations on that context fail; even freeing the memory allocated with cuMemAllocPitch/cudaMallocPitch fails with 101 (invalid device).

If, however, the first GPU unregisters and releases all its Dx11 resources, the second GPU works just fine.

My theory is that something about the RegisterResource code corrupts the context in rare situations/installations. My manager has a theory that it only happens on developer machines. I'm not sure I buy that; however, the two machines that fail are developer machines, and the two that don't fail aren't.


If an error happens during a CUDA call, the subsequent API calls cannot work.
Could you help with this:

  1. Use the CUDA runtime API instead of the CUDA driver API, as in comment #18 from Apr 23.
  2. Add cudaGetLastError() after every CUDA RT API call:
    auto status = cudaGetLastError();
    std::cout << "Error Code: " << status << " ErrorString " << cudaGetErrorString(status) << std::endl;
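The per-call check could be wrapped in a small helper macro, for example (a sketch; `CUDA_CHECK` is a made-up name):

```cpp
#include <cuda_runtime.h>
#include <iostream>

// Hypothetical helper: run a runtime-API call, then report both its return
// status and cudaGetLastError so sticky errors are surfaced immediately.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t status = (call);                                      \
        if (status != cudaSuccess)                                        \
            std::cout << "Error Code: " << status << " ErrorString "      \
                      << cudaGetErrorString(status) << std::endl;         \
        status = cudaGetLastError();                                      \
        if (status != cudaSuccess)                                        \
            std::cout << "Sticky Error: " << cudaGetErrorString(status)   \
                      << std::endl;                                       \
    } while (0)

// Usage: CUDA_CHECK(cudaSetDevice(0));
```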