cudaGraphicsD3D11RegisterResource performance. Any tips?

Just to be 100% sure I am not missing anything, I'm going to ask here as well.

I need to send a few thousand objects to cuda/optix. This works fine, except it takes a few seconds just to register the buffers. (Doing this only once, but it needs to be done there)

Now I am thinking about multi-threading it, but I was just wondering if I am missing something.

This is outside my area of expertise. However, given the general nature of the activity, I would consider it unlikely that multi-threading will result in speedup, as the bottleneck is probably an internal lock which effectively results in only a single thread actively engaging at any time. This is a common limitation of allocation and mapping functions.

If my hypothesis is correct, my usual advice is to use a host platform with high single-thread performance, which generally means using a CPU with high base (not boost) frequency (> 3.5 GHz).

If you do attempt to multi-thread this portion of the work, I would be very much interested in seeing a brief note posted here on what performance differences were observed.

Thanks! It’s not quite clear to me what registering and mapping actually does.

The app runs in Unity, and my idea with multi-threading was to get that work off the blocking main thread. My thinking was that it wouldn’t be faster per se, but it wouldn’t impact fps as much.

I would expect (speculation!) cudaGraphicsD3D11RegisterResource to be a thin wrapper around various OS API calls, including ioctl calls into the D3D driver. If there is a system-level trace utility in Windows (like strace or dtrace on other platforms) that should enable you to get a reasonable idea what is happening under the hood.

If the multi-threading plan involves simply offloading the registration work to a separate thread while the application proceeds with other host-side activity, that would seem likely to succeed, as long as that work doesn’t need OS resources also needed for D3D registration.

Thanks! I guess there is only one way to find out ;)

This may or may not be useful. If it’s just noise, please disregard. It probably doesn’t go to the level of description that you would like.

Registering a resource does all the work that can be done one time, ahead of time (usually/hopefully, apparently not in this case) to make a resource potentially useful on both the graphics side and the compute side. As you’ve already discovered, this is not zero effort, so it is broken out so that it can be done once, ahead of time.

Mapping a resource does the work necessary to actually make it available to either the compute side or the graphics side. This must be done each time the resource changes hands: among other things, it establishes a current/up-to-date pointer/reference/handle that can be used to actually access the resource, and it makes access from the side in question “reliable” (whereas it may be “unreliable” on the other side). These things, including pointer retrieval, must be repeated every time the resource changes hands.
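A minimal sketch of that lifecycle with the actual CUDA interop calls might look like this (error checking omitted; `d3dBuffer` and `stream` are assumed to exist already, so this is illustrative rather than a complete program):

```cpp
#include <cuda_runtime.h>
#include <cuda_d3d11_interop.h>

void shareBuffer(ID3D11Buffer* d3dBuffer, cudaStream_t stream) {
    // One-time, up-front work: register once per resource, early in the app.
    cudaGraphicsResource* res = nullptr;
    cudaGraphicsD3D11RegisterResource(&res, d3dBuffer,
                                      cudaGraphicsRegisterFlagsNone);

    // Per hand-off work: map, fetch a current device pointer, use it, unmap.
    cudaGraphicsMapResources(1, &res, stream);   // resource now owned by CUDA
    void*  devPtr = nullptr;
    size_t size   = 0;
    cudaGraphicsResourceGetMappedPointer(&devPtr, &size, res);
    // devPtr is only valid while mapped:
    // ... launch kernels / OptiX work on devPtr here ...
    cudaGraphicsUnmapResources(1, &res, stream); // hand resource back to D3D

    // Teardown, once, when the resource is no longer shared.
    cudaGraphicsUnregisterResource(res);
}
```

The map/get-pointer/unmap trio is the part repeated per hand-off; the register/unregister pair brackets the resource’s whole shared lifetime.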

For the reasons njuffa stated, I’d be surprised if multithreading made any of the actual runtime calls themselves proceed faster. I suspect they will serialize due to a lock inside the CUDA runtime. As you point out, there may be benefit by doing it asynchronously to other work you are doing.

I’m confused by the juxtaposition of these statements:

It’s not obvious that a one-time operation would have any general impact on fps except for the point at which it occurred (maybe that is your point). Doing it one time, the idea is you do it early enough in your application that the fps impact (of the registering) doesn’t matter.

It’s also confusing that you believe you must do it at a specific point in your application but at the same time you are willing to relegate it to a thread to get it off the “blocking path”.

Anyway, no need to connect the dots, I’m sure it makes sense in the context of your application.

I would also say, wrt Optix: I view Optix as a “higher level” programming paradigm that to a certain degree rests on CUDA. Therefore, if you’re facing a problem with resource sharing, it might be that this problem is already solved (or at least that a best practice is defined) at the Optix level. There is an Optix sub-forum on these forums. There aren’t likely to be many Optix experts on this sub-forum, but probably more on that one. They may have some ideas. For that matter, there is a Unity forum as well.

Thanks Robert! Yes, I guess there was some miscommunication on my part. Happy to connect the dots!

(Doing this only once, but it needs to be done there)

This was just to get ahead of “You’re sure you’re not doing this every frame?” or something like that. Currently in the app geometry is loaded and unloaded pretty much continuously. Think: map tiles. Right now the user presses a button and what’s currently loaded is shipped to cuda → optix.

That button press takes too long for my liking and there are also new requirements that would mean that the user has to press that button each time they make specific changes to the scene.

I know the optix sub forum quite well. The mods there are also extremely helpful. I came here because it was about this one specific function call, which I assumed was purely a cuda issue.

Thanks again for taking the time! It helped tie things up for me.