cudaGraphicsMapResources returns cudaErrorUnknown

Hello,
I’m writing a Unity (D3D11) program. It calls a native plugin which uses CUDA to do some post-processing. The processing loop looks like this:

	cudaGraphicsMapResources(1, &inputRes, gPostSR->getCudaStream()); // return cudaErrorUnknown after a few loops
	cudaGraphicsSubResourceGetMappedArray(&inArr, inputRes, 0, 0);
	cudaMemcpy2DFromArrayAsync(gPostSR->inputBuffer(), iw * ic, inArr, 0, 0, iw * ic, ih,
		cudaMemcpyDeviceToDevice, gPostSR->getCudaStream());
	cudaGraphicsUnmapResources(1, &inputRes, gPostSR->getCudaStream());

	bool ret = gPostSR->infer();

	cudaArray_t outArr;
	cudaGraphicsMapResources(1, &outputRes, gPostSR->getCudaStream());
	cudaGraphicsSubResourceGetMappedArray(&outArr, outputRes, 0, 0);
	cudaMemcpy2DToArrayAsync(outArr, 0, 0, gPostSR->outputBuffer(), ow * oc, ow * oc, oh,
		cudaMemcpyDeviceToDevice, gPostSR->getCudaStream());
	cudaGraphicsUnmapResources(1, &outputRes, gPostSR->getCudaStream());

This loop is called from OnRenderImage() in Unity. The input and output resources are registered before entering the loop (cudaGraphicsD3D11RegisterResource).

For the first few iterations it runs fine, but soon it crashes inside nvwgf2umx.dll.

I debugged it a little and found that, before the crash in the driver DLL, cudaGraphicsMapResources returns cudaErrorUnknown. I don’t know which of the two is the root cause. Please help me with advice on how to debug further.
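
One way to narrow down which call is the root cause: CUDA API calls may return error codes left over from earlier asynchronous work, so the cudaErrorUnknown seen at cudaGraphicsMapResources could originate in a previous step such as infer(). A minimal sketch of the idea follows, assuming each step of the loop is wrapped in a callable that returns the CUDA status as an int (cudaSuccess is 0). Step, firstFailingStep and the sync callable are hypothetical names for illustration, not part of the plugin or of CUDA; in the plugin, sync would wrap cudaStreamSynchronize on the stream used by the loop.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Stand-in for a CUDA status code; the runtime's cudaSuccess is 0.
using Status = int;

// One step of the per-frame loop (hypothetical wrapper type).
struct Step {
    std::string name;             // label reported when the step fails
    std::function<Status()> run;  // e.g. would wrap cudaGraphicsMapResources(...)
};

// Runs the steps in order and calls `sync` after each one, so that an error
// raised asynchronously is attributed to the step that queued the work
// instead of surfacing at some later, unrelated call.
// Returns the label of the first failure, or an empty string if all passed.
std::string firstFailingStep(const std::vector<Step>& steps,
                             const std::function<Status()>& sync) {
    for (const Step& s : steps) {
        if (s.run() != 0) return s.name + " (immediate)";
        if (sync() != 0) return s.name + " (reported at sync)";
    }
    return "";
}
```

This is meant for debugging only, since synchronizing after every step serializes the frame. If the label points at infer rather than at a map call, the interop resource is probably not the culprit.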

My environment:
Windows 11 22H2
NVIDIA driver 31.0.15.3818
CUDA SDK: 11.7.0
Unity 2023.2.20f1c1
Graphics card: RTX 4070 Laptop

Attaching the debug output from Visual Studio:
log.txt (101.8 KB)

Is the outputRes still valid?

Hi,
By printing outputRes, I can see that its value is unchanged from registration until cudaGraphicsMapResources returns cudaErrorUnknown. Is this sufficient to prove that outputRes is still valid?

Printing the value is not enough; you should make sure that the resource is not deleted by another API call.

For example, if the resource was created with cudaGraphicsGLRegisterBuffer, it should not be unregistered, and so on.

I see what you mean.

But my code is a really simple demo. I’m pretty sure unregister is not called while the loop is running.

Here’s the C++ side of the code:

	UNITY_INTERFACE_EXPORT bool PRSR_inferUnityTexture()
	{
		// model input/output is NHWC
		int ih = gPostSR->getInputSize(1);
		int iw = gPostSR->getInputSize(2);
		int ic = gPostSR->getInputSize(3);
		int oh = gPostSR->getOutputSize(1);
		int ow = gPostSR->getOutputSize(2);
		int oc = gPostSR->getOutputSize(3);
		size_t isize = 1 * ic * ih * iw;
		size_t osize = 1 * oc * oh * ow;

		cudaArray_t inArr;
		check(cudaGraphicsMapResources(1, &inputRes, gPostSR->getCudaStream()));
		check(cudaGraphicsSubResourceGetMappedArray(&inArr, inputRes, 0, 0));

		check(cudaMemcpy2DFromArrayAsync(gPostSR->inputBuffer(), iw * ic, inArr, 0, 0, iw * ic, ih,
			cudaMemcpyDeviceToDevice, gPostSR->getCudaStream()));

		check(cudaGraphicsUnmapResources(1, &inputRes, gPostSR->getCudaStream()));
		
		bool ret = gPostSR->infer();

		cudaArray_t outArr;
		check(cudaGraphicsMapResources(1, &outputRes, gPostSR->getCudaStream()));

		check(cudaGraphicsSubResourceGetMappedArray(&outArr, outputRes, 0, 0));

		check(cudaMemcpy2DToArrayAsync(outArr, 0, 0, gPostSR->outputBuffer(), ow * oc, ow * oc, oh,
			cudaMemcpyDeviceToDevice, gPostSR->getCudaStream()));

		check(cudaGraphicsUnmapResources(1, &outputRes, gPostSR->getCudaStream()));
		
		check(cudaStreamSynchronize(gPostSR->getCudaStream()));

		return true;
	}

	UNITY_INTERFACE_EXPORT bool PRSR_registerTextures(void* inputTexture, void* outputTexture)
	{
		if (!registered)
		{
			check(cudaGraphicsD3D11RegisterResource(&inputRes, (ID3D11Resource*)inputTexture, cudaGraphicsRegisterFlagsNone));
			check(cudaGraphicsD3D11RegisterResource(&outputRes, (ID3D11Resource*)outputTexture, cudaGraphicsRegisterFlagsNone));
			registered = true;
			LOGI("PRSR registerTextures done");
		}
		return true;
	}

	UNITY_INTERFACE_EXPORT bool PRSR_unregisterTextures()
	{
		if (registered)
		{
			check(cudaGraphicsUnregisterResource(inputRes));
			check(cudaGraphicsUnregisterResource(outputRes));
			registered = false;
			LOGI("PRSR unregisterTextures done");
		}
		return true;
	}

Here’s the Unity C# side:

    private void OnRenderImage(RenderTexture source, RenderTexture destination)
    {
        try
        {
            if (!textureRegisterd)
            {
                PRSR_registerTextures(image.GetNativeTexturePtr(), imageOut.GetNativeTexturePtr());
                textureRegisterd = true;
            }

            if (PRSR_inferUnityTexture())
            {
                Graphics.Blit(imageOut, destination);
            }
        }
        catch (System.Exception ex)
        {
            Debug.LogError("exception: " + ex.Message);
        }
    }

    private void OnDestroy()
    {
        if (textureRegisterd)
        {
            PRSR_unregisterTextures();
            textureRegisterd = false;
        }
        PRSR_deinit();
    }

Another observation:
I tried the same code on two different PCs: my laptop (RTX 4070) and my desktop (RTX 3090). On the laptop it crashes almost immediately after launch, but on the desktop it can run for tens of minutes before crashing.

I’ve checked whether it’s a memory leak, on both GPU and CPU, but found no evidence (the system monitor shows no increase in memory usage while running).

Perhaps make sure that image and imageOut stay the same in further calls to the function (when textureRegistered == true)?

How often are OnRenderImage with textureRegistered == false and OnDestroy called? Is it the same number of times on your laptop and your desktop? Once?

I did print the pointer values of image and imageOut, and they never changed.
OnRenderImage with textureRegistered == false and OnDestroy are each called only once: the former on the first frame, the latter before quitting.

Another sub-optimal workaround is putting register/unregister into the loop. I found that it’s actually quite fast, less than 1 ms most of the time, contrary to what the documentation states:

This call potentially has a high-overhead and should not be called every frame in interactive applications.
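
If register/unregister does stay inside the per-frame loop, a small scope guard keeps the two calls paired even on an early return. A minimal sketch, assuming the real calls are wrapped in callables: ScopedRegistration is a hypothetical helper, registerFn would wrap the two cudaGraphicsD3D11RegisterResource calls and report success, and unregisterFn would wrap the matching cudaGraphicsUnregisterResource calls.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical RAII guard: registers on construction and unregisters on
// destruction, but only if registration succeeded.
class ScopedRegistration {
public:
    ScopedRegistration(const std::function<bool()>& registerFn,
                       std::function<void()> unregisterFn)
        : unregister_(std::move(unregisterFn)), ok_(registerFn()) {}

    ~ScopedRegistration() {
        if (ok_) unregister_();
    }

    // Non-copyable, so the unregister call cannot run twice.
    ScopedRegistration(const ScopedRegistration&) = delete;
    ScopedRegistration& operator=(const ScopedRegistration&) = delete;

    // True when registerFn reported success.
    bool ok() const { return ok_; }

private:
    std::function<void()> unregister_;
    bool ok_;
};
```

Inside PRSR_inferUnityTexture the guard would be created at the top of the function, and the map/copy/unmap sequence would run only when ok() is true.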

I have run out of further ideas, perhaps somebody from Nvidia can help you on this forum/in this thread.

At least you have around 2(+4) approaches:

  • Use a different GPU
  • Put register/unregister into the loop
  • (Try out a different driver version on the laptop, also check WDDM vs. TCC)
  • (Check for cudaErrorUnknown on cudaGraphicsMapResources and then restart application or 3D view?)
  • (Reinitialize texture each frame)
  • (Find an alternative way to transfer the resources, host-based?)
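
The fourth option above (check the map result and recover) could look roughly like this. It is a sketch under assumptions: mapFn and reRegisterFn are hypothetical callables that would wrap cudaGraphicsMapResources and an unregister/register pair, each returning true on success. Whether re-registering actually clears the error is not established in this thread, so the caller still has to handle Failed, for example by skipping the frame or restarting.

```cpp
#include <cassert>
#include <functional>

// Result of one map attempt with a single recovery step.
enum class MapOutcome { Ok, RecoveredAfterReRegister, Failed };

// Tries mapFn once. On failure, calls reRegisterFn and, if that succeeds,
// tries mapFn a second time. Never loops more than once, so a persistent
// error cannot stall the frame.
MapOutcome mapWithOneRetry(const std::function<bool()>& mapFn,
                           const std::function<bool()>& reRegisterFn) {
    if (mapFn()) return MapOutcome::Ok;
    if (!reRegisterFn()) return MapOutcome::Failed;
    return mapFn() ? MapOutcome::RecoveredAfterReRegister : MapOutcome::Failed;
}
```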

Thank you, Curefab. I appreciate your time and effort. I’ll make further checks based on your advice.

Meanwhile, I’ll leave an even simpler project here. Anyone kind enough is welcome to try to reproduce the issue.

D3DResourceRegisterTest.zip (64.0 KB)