Has anyone used 2 cards for CUDA rendering?

Hi all~

I’m using two GTX 260+ cards for rendering; the computation on each card is essentially the same, and independent.
(Imagine a stereo display system: one card is responsible for the left-eye scene, the other for the right-eye scene.)

My program works fine when it draws only one scene.

I’m now trying to use two host threads on the CPU; each thread does the following:

Thread1:

  1. CreateOpenGLContext
  2. cudaGLSetGLDevice(0);
  3. CreatePBO(0) & cudaGLRegisterBufferObject();
  4. AllocCudaMemory();
  5. LaunchCudaKernel();
  6. RenderToScreenUsingPBO();

Thread2:

  1. CreateOpenGLContext
  2. cudaGLSetGLDevice(1);
  3. CreatePBO(1) & cudaGLRegisterBufferObject();
  4. AllocCudaMemory();
  5. LaunchCudaKernel();
  6. RenderToScreenUsingPBO();

but it always fails at cudaGLRegisterBufferObject() with “unknown error” (a simplified sketch of the per-thread code is below).
Does anyone know what’s wrong with that?
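
For reference, each thread does roughly the following (a simplified sketch rather than my real code; CreateOpenGLContext(), LaunchCudaKernel() and RenderToScreenUsingPBO() stand for my own functions, and width/height are the image size):

```cpp
#include <GL/glew.h>          // buffer object entry points
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

void CreateOpenGLContext();                // platform-specific (wgl/glX), not shown
void LaunchCudaKernel(uchar4* pixels);     // fills one eye's image
void RenderToScreenUsingPBO(GLuint pbo);   // glTexSubImage2D from the PBO + draw a quad

// Simplified sketch of one rendering thread; error checking omitted.
void RenderThread(int device, int width, int height)
{
    CreateOpenGLContext();            // the GL context is bound to this thread
    cudaGLSetGLDevice(device);        // called before any other CUDA call in this thread

    // Create a PBO and register it with CUDA
    GLuint pbo = 0;
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, width * height * 4, 0, GL_DYNAMIC_DRAW);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
    cudaGLRegisterBufferObject(pbo);  // <-- this is where I get "unknown error"

    while (true)
    {
        uchar4* devPtr = 0;
        cudaGLMapBufferObject((void**)&devPtr, pbo);
        LaunchCudaKernel(devPtr);     // writes one eye's image into the PBO
        cudaGLUnmapBufferObject(pbo);

        RenderToScreenUsingPBO(pbo);
    }
}
```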

thanks !

I’m no expert by any means, but why are you doing the initialization in two different threads? If I were you, I would do all the initialization first and then create separate threads to do the rendering. It’s possible that creating and registering two buffers simultaneously isn’t handled properly, or something like that.

thanks for replying~

There are two reasons for doing this:

  1. OpenGL doesn’t really support multi-threading: a GL context can only be current in one thread, so I have to do the OpenGL initialization and set the GL context in each thread;

  2. It seems that if I initialize in the main thread, the CUDA arrays can’t be seen by the sub-threads, though I’m not completely sure about that.

I’m using cudaGLSetGLDevice() in each thread so that each thread allocates memory on a different card, and the PBO registration in the two threads is supposed to be independent. I don’t know if I’ve misunderstood something here.

I implemented what you describe for a dual-GPU stereo raytracer, but with a twist.

The problem is that unless you have a Quadro card, only the “primary” device may create an OpenGL context.
Also, the CUDA context and the OpenGL context are bound to the thread in which they were created; they are not accessible from another thread.
=> If you want to share data between CUDA and OpenGL, both contexts must be created by the same thread (maybe wglShareLists() can relax this restriction, I did not try).

So my thread1 does basically what you describe (hint: it is faster to update the texture directly with the CUDA 3.0 mapping instead of updating the PBO).
In my thread2, the kernel writes the resulting pixels to host memory (memory-mapped).
After the kernel is done on thread2, thread1 uploads the data from host memory as usual with glTexSubImage2D() and draws a fullscreen quad.
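
In code it is roughly the following (a simplified sketch, not my actual implementation; it assumes the secondary GPU is device 1, and RaytraceKernel and the thread-to-thread signaling are placeholders):

```cpp
#include <GL/gl.h>
#include <cuda_runtime.h>

static const int width = 800, height = 600;
static uchar4*   hostPixels = 0;   // pinned host memory, readable by thread1
static uchar4*   devPixels  = 0;   // device-side alias of hostPixels

__global__ void RaytraceKernel(uchar4* out, int w, int h);   // placeholder

// Thread2: runs on the secondary GPU, no OpenGL context here.
void InitThread2()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);   // before the CUDA context is created
    cudaSetDevice(1);
    cudaHostAlloc((void**)&hostPixels, width * height * sizeof(uchar4),
                  cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&devPixels, hostPixels, 0);
}

void ComputeFrameThread2()
{
    dim3 block(16, 16), grid(width / 16, height / 16);
    RaytraceKernel<<<grid, block>>>(devPixels, width, height);  // writes straight into host memory
    cudaThreadSynchronize();     // the frame is now complete in hostPixels
    // ...signal thread1 that the frame is ready...
}

// Thread1: owns the OpenGL context, uploads thread2's pixels into a texture.
void UploadFrameThread1(GLuint rightEyeTexture)
{
    glBindTexture(GL_TEXTURE_2D, rightEyeTexture);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, hostPixels);
    // ...draw the fullscreen quad for this eye...
}
```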

If you can choose your API freely, you should consider using Direct3D, because there you can create a Direct3D context on the secondary device as well (and use the interop, I suppose).

If someone from Nvidia is reading this: Please give us access to secondary consumer devices via OpenGL as well!
This API has its share of problems already, no need to restrict it more than necessary.

Many thanks to Nighthawk13 for the CUDA 3.0 hint and the Direct3D advice!

I’ll try these methods.

In your implementation, thread2 writes its output to host memory; isn’t that a time-consuming operation?

Thanks again:)

Not at all; the kernel runtime does not increase by a significant amount (the high latency of the few host memory accesses is hidden well by the many ALU instructions needed for raytracing).

However, the upload in thread1 from host memory to texture memory (via PBO + glTexSubImage2D) does cost about 2 milliseconds extra (at 800x600 with 32 bits per pixel).

This extra time could be avoided with the interop on the second GPU.
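
For completeness, the CUDA 3.0 texture mapping I mentioned looks roughly like this on the GL device (a sketch with assumed names; devPixels is the kernel’s output buffer on that same device, and tex is an existing GL_RGBA8 texture):

```cpp
#include <GL/gl.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// Register once after creating the texture.
cudaGraphicsResource* RegisterTexture(GLuint tex)
{
    cudaGraphicsResource* res = 0;
    cudaGraphicsGLRegisterImage(&res, tex, GL_TEXTURE_2D,
                                cudaGraphicsRegisterFlagsNone);
    return res;
}

// Each frame: copy the kernel output straight into the texture, no PBO upload needed.
void UpdateTexture(cudaGraphicsResource* res, const uchar4* devPixels,
                   int width, int height)
{
    cudaGraphicsMapResources(1, &res, 0);
    cudaArray* texArray = 0;
    cudaGraphicsSubResourceGetMappedArray(&texArray, res, 0, 0);
    cudaMemcpyToArray(texArray, 0, 0, devPixels,
                      width * height * sizeof(uchar4),
                      cudaMemcpyDeviceToDevice);
    cudaGraphicsUnmapResources(1, &res, 0);
}
```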

Thanks~

I’m planning to try this:

The main thread takes responsibility for rendering, and it creates two threads for the CUDA computation.

These two sub-threads write their results to host memory after each computation; the main thread then uploads the data to texture memory and renders.
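
Roughly something like this (a rough sketch with assumed names; the synchronization between the threads is left out, and RenderEyeKernel is a placeholder for my kernel):

```cpp
#include <cuda_runtime.h>

static const int width = 800, height = 600;    // assumed image size
static uchar4*   eyePixels[2] = { 0, 0 };      // pinned host buffer per eye

__global__ void RenderEyeKernel(uchar4* out, int w, int h, int eye);  // placeholder

// One of these runs in each sub-thread, with device = 0 (left eye) or 1 (right eye).
void ComputeThread(int device)
{
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaSetDevice(device);

    uchar4* devPtr = 0;
    cudaHostAlloc((void**)&eyePixels[device], width * height * sizeof(uchar4),
                  cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&devPtr, eyePixels[device], 0);

    for (;;)
    {
        dim3 block(16, 16), grid(width / 16, height / 16);
        RenderEyeKernel<<<grid, block>>>(devPtr, width, height, device);
        cudaThreadSynchronize();
        // ...signal the main thread that this eye's frame is ready,
        //    then wait until it has been uploaded...
    }
}

// Main thread, each frame, after both sub-threads have signalled:
//   glTexSubImage2D(..., eyePixels[0]);   // left-eye texture
//   glTexSubImage2D(..., eyePixels[1]);   // right-eye texture
//   draw both quads and swap buffers
```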