CuMemcpyDtoHAsync - CUDA Launch Error

OK, so I’m developing a renderer for the company I work for using OptiX. It has been in development for a number of months now and has generally been going very well, but we recently implemented area lights in our system and I am now receiving an error in certain circumstances. As this is proprietary code I can’t really post any of it here, but I will attempt to describe everything we’re doing, including the scene hierarchy, as well as possible.

The bug
The bug manifests itself when we turn our new area lights implementation on. We essentially load triangle mesh geometry and create area lights for submeshes with emissive materials. Some time into the render I get an error which reads as follows:

Unknown error (Details: Function "_rtContextLaunch2D" caught exception: Encountered a CUDA error: cudaDriver().CuMemcpyDtoHAsync( dstHost, srcDevice, byteCount, hStream.get() ) returned (719): Launch failed)

When it happens
The first thing to mention is that this bug does not happen on my colleague’s machine (“works on my machine!”). I will post the details of both of our systems below.

The bug happens for me while rendering a single equirectangular image. We render this with a single launch that iterates directly in the camera CUDA program to render multiple samples:

for( unsigned int i = 0; i < g_samples; ++i )
{
    ... // Ray generation code
    rtTrace( g_top_object, ray, prd );
    g_output_buffer[launch_index] = make_color( prd.result );
}

The bug has only started happening since enabling our new area lights feature; with this disabled it renders fine. It’s worth noting that we also use a number of other techniques that require multiple samples, such as ambient occlusion, reflection, refraction and shadow mapping, and all of these work fine.

We also have a test app that uses our rendering library to render to a realtime GL window. The camera used in this is a standard pinhole accumulation camera, and we iterate context->launch calls to obtain multiple samples. This works fine even with area lights on.

Scene structure
We load a project file that contains information about all of the items in the scene (.obj meshes), their locations, material data and also some information about the walls that make up a room in that scene. We deserialise it and set up our scene as follows:

Create a top group object and set a Trbvh accel structure on it.
Create a geometry object for each wall/floor object. These are polygon geometry with their own bounds and intersection programs. We create materials and then create a geometry instance for each wall/floor geometry object.
Create a geometry group and add all wall/floor geometry instances as children of the group, set a Trbvh accel structure on the geometry group.
Add the geometry group as a child of the top group.
Go through each .obj mesh in the scene and: create a geometry object from the mesh; create a geometry instance from that geometry and its materials; create a geometry group and add the single geometry instance as a child; set a Trbvh accel structure on the geometry group; create a transform with the object position; add the geometry group as a child of the transform.
Add the transform as a child of the top group.
Set the top group as the context “g_top_object” variable.

System Information

OptiX SDK 4.1.0
Compiling with CUDA version 7.5

My machine
Alienware 15r2 laptop.
Windows 10.0 64-bit
Intel Core i7 6700HQ, 2.6 GHz
NVIDIA GeForce GTX 970M, 3GB
CUDA cores: 1280
Driver version: 385.41 (Just updated, was receiving error before update too)

Colleague machine
Dell XPS desktop.
Windows 10.0 64-bit
Intel Core i7 4790, 3.6 GHz
NVIDIA GeForce GTX 1060, 3GB
CUDA Cores: 1152
Driver version: 384.76

I guess what I’m asking first is: are there any blatantly obvious errors, maybe to do with the scene setup? Also, the error is extremely vague; is there any way to get extra information out of OptiX/CUDA?
As the problem happens when enabling an intensive feature and does not happen on my colleague’s machine, my guess is that it is potentially a memory issue, especially as the CUDA function reporting the error, CuMemcpyDtoHAsync, is about copying memory from the device to the host machine.

Any help or ideas would be greatly appreciated.

Hi. Thanks for all the details. The scene setup looks ok to me, although I am not clear on how you’re sampling the emissive meshes for area lights. Do these go into a buffer that is exposed to the programs for direct sampling?

For this question:

Also, the error is extremely vague; is there any way to get extra information out of OptiX/CUDA?

The CuMemcpyDtoHAsync error just means that the launch failed. It’s a generic error, unfortunately.

The first thing to try is enabling all OptiX exceptions, which will make things run slower but will check for typical errors, e.g., out-of-bounds buffer access, and throw a more specific exception for these errors. Check the header for syntax on rtContextSetExceptionEnabled, or the C++ wrapper equivalent.

Hopefully that will catch the bug, but if not…

The most typical reason for a launch failure is a memory access error, as you guessed. If you can limit the launch indices in the ray gen program to just a small region of pixels, or render a very small image, then sprinkling printfs in the code will usually narrow down the crash. You can include <stdio.h> and use regular CUDA printfs; ignore the older OptiX printfs described in some of the docs.
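
As a rough illustration of limiting the work to a small region of pixels, here is a minimal sketch in plain C++. The DebugRegion struct and in_debug_region function are hypothetical names invented for this example; they are not part of OptiX. In device code the helper would also need the __device__ qualifier.

```cpp
// Hypothetical helper, not part of OptiX: a small window of pixels to debug.
struct DebugRegion {
    unsigned x0, y0;         // top-left pixel of the window
    unsigned width, height;  // window size in pixels
};

// True only when pixel (x, y) lies inside the debug window.
constexpr bool in_debug_region(DebugRegion r, unsigned x, unsigned y) {
    return x >= r.x0 && x < r.x0 + r.width &&
           y >= r.y0 && y < r.y0 + r.height;
}
```

Assuming launch_index is the usual 2D launch index, the ray gen program could return early whenever in_debug_region(region, launch_index.x, launch_index.y) is false, so that only the handful of pixels inside the window trace rays and print.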

Let’s see how those suggestions work out before continuing.

Thanks for the reply. Yeah, we essentially build geometry for the emissive meshes and send them through in a buffer.

So the first thing I tried after posting this was removing the sampling loops from our camera program and accumulating samples by iterating launch calls instead. This seems to have fixed the issue.
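
For readers wondering what accumulating across launches involves, here is a minimal sketch of the running average for a single colour channel, in plain C++. The accumulate function and its parameter names are hypothetical and are not taken from the renderer discussed here or from the OptiX API. It assumes the host passes a 0-based frame counter to each launch.

```cpp
// Hypothetical sketch: blend the sample from launch number `frame` (0-based)
// into the running mean of the samples from all previous launches.
constexpr float accumulate(float previous_mean, float new_sample, unsigned frame) {
    return previous_mean + (new_sample - previous_mean) / static_cast<float>(frame + 1);
}
```

Under these assumptions the host would issue one context->launch call per sample and increment the frame counter each time, while the camera program would read the previous value from an accumulation buffer, call accumulate and write the result back. Each launch then does one sample of work per pixel rather than g_samples.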

Thanks for the info on exceptions though, that seems very useful.