I’m using two GTX 260+ cards for rendering. The computation on each card is basically the same, and the two are independent.
(Imagine a stereo display system: one card is responsible for the left-eye scene and the other for the right-eye scene.)
My program works fine when it draws only one scene.
Now I’m trying to use two host threads on the CPU. Each thread does the following:
Thread1:
    CreateOpenGLContext();
    cudaGLSetGLDevice(0);
    CreatePBO(0) & cudaRegisterBufferObject();
    AllocCudaMemory();
    LaunchCudaKernel();
    RenderToScreenUsingPBO();
Thread2:
    CreateOpenGLContext();
    cudaGLSetGLDevice(1);
    CreatePBO(1) & cudaRegisterBufferObject();
    AllocCudaMemory();
    LaunchCudaKernel();
    RenderToScreenUsingPBO();
But it always fails at “cudaRegisterBufferObject” with “unknown error”.
Does anyone know what’s wrong?
I’m not a pro by any means, but why are you doing the initialization in two different threads? If I were you, I would do all the initialization first, then create separate threads to do the rendering. It’s possible that it’s trying to create and register two buffers simultaneously and that case isn’t handled.
An OpenGL context can only be current in one thread at a time, so I have to initialize OpenGL and set the GL context in each thread.
It seems that if I initialize in the main thread, the CUDA arrays can’t be seen by the sub-threads, though I’m not sure about that.
I’m also calling cudaGLSetGLDevice() in each thread so that each one allocates memory on a different card, and the PBO registrations in the two threads are supposed to be independent. I don’t know whether I’ve misunderstood this.
I implemented what you describe for a dual-GPU stereo raytracer, but with a twist.
The problem is that, unless you have a Quadro card, only the “primary” device can create an OpenGL context.
Also, the CUDA context and the OpenGL context are bound to the thread that created them; they are not accessible from another thread.
=> If you want to share data between CUDA and OpenGL, both contexts must be created by the same thread. (Maybe wglShareLists() can relax this restriction; I did not try.)
So my thread1 does basically what you describe. (Hint: it is faster to update the texture directly with the CUDA 3.0 mapping instead of updating the PBO.)
In my thread2, the kernel writes the resulting pixels to mapped host memory.
After the kernel is done on thread2, thread1 uploads the data from host memory as usual with glTexSubImage2D() and draws a fullscreen quad.
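For what it’s worth, the handoff between the two threads could look roughly like the sketch below. It is plain C++ with no CUDA or OpenGL calls: the fill loop stands in for the kernel on the second GPU, the copy stands in for the glTexSubImage2D() upload, and all the names (SharedFrame, produceFrame, consumeFrame, runHandoff) are made up for illustration. It only shows one way to make thread1 wait until thread2 has finished a frame, not how my raytracer actually does it.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared frame: stands in for the host memory that the
// kernel on the second GPU writes its pixels into.
struct SharedFrame {
    std::vector<std::uint32_t> pixels;  // one 32-bit value per pixel
    bool ready = false;                 // true while a finished frame awaits upload
    std::mutex m;
    std::condition_variable cv;
};

// Worker side (thread2 in the post). The fill loop is a stand-in for
// launching the kernel and waiting for it to finish.
void produceFrame(SharedFrame& f, std::uint32_t value) {
    std::unique_lock<std::mutex> lock(f.m);
    // Do not overwrite a frame that has not been uploaded yet.
    f.cv.wait(lock, [&] { return !f.ready; });
    for (auto& p : f.pixels) p = value;
    f.ready = true;
    lock.unlock();
    f.cv.notify_all();
}

// Render side (thread1 in the post). The copy is a stand-in for the
// glTexSubImage2D() upload; the "uploaded" pixels are returned.
std::vector<std::uint32_t> consumeFrame(SharedFrame& f) {
    std::unique_lock<std::mutex> lock(f.m);
    // Block until the worker has finished a complete frame.
    f.cv.wait(lock, [&] { return f.ready; });
    std::vector<std::uint32_t> uploaded = f.pixels;
    f.ready = false;
    lock.unlock();
    f.cv.notify_all();
    return uploaded;
}

// Runs `frames` handoffs (frame i is filled with the value i) and returns
// the sum of the first pixel of every uploaded frame.
// Assumes width * height > 0.
std::uint64_t runHandoff(int width, int height, int frames) {
    SharedFrame f;
    f.pixels.assign(static_cast<std::size_t>(width) * height, 0);
    std::thread worker([&] {
        for (int i = 1; i <= frames; ++i)
            produceFrame(f, static_cast<std::uint32_t>(i));
    });
    std::uint64_t sum = 0;
    for (int i = 0; i < frames; ++i)
        sum += consumeFrame(f).front();
    worker.join();
    return sum;
}
```

Calling runHandoff(800, 600, 5) runs five frames through the single shared buffer. With only one buffer the two threads take turns; a second buffer would let the kernel for the next frame overlap with the upload of the current one.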
If you can choose your API freely, you should consider using Direct3D, because there you can create a Direct3D context on the secondary device as well (and use the interop, I suppose).
If someone from NVIDIA is reading this: please give us access to secondary consumer devices via OpenGL as well!
This API has its share of problems already, no need to restrict it more than necessary.
Not at all. The kernel runtime does not increase by a significant amount. (The high latency of the few host memory accesses is hidden well by the many ALU instructions needed for raytracing.)
However, the upload in thread1 from host memory to texture memory (via PBO + glTexSubImage2D) does cost ~2 milliseconds extra (at 800x600 with 32 bits per pixel).
This extra time could be avoided with the interop on the second GPU.
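As a rough sanity check on that figure, here is the arithmetic as a small C++ sketch. The helper names are made up, and the throughput is only what the ~2 ms number implies, not something I measured separately.

```cpp
#include <cassert>
#include <cstdint>

// Size of one frame in bytes for the given resolution and bit depth.
std::uint64_t frameBytes(std::uint64_t width, std::uint64_t height,
                         std::uint64_t bitsPerPixel) {
    return width * height * bitsPerPixel / 8;
}

// Effective transfer rate in bytes per second if `bytes` take `milliseconds`.
double effectiveBytesPerSecond(std::uint64_t bytes, double milliseconds) {
    return static_cast<double>(bytes) * 1000.0 / milliseconds;
}
```

frameBytes(800, 600, 32) gives 1,920,000 bytes per frame, and effectiveBytesPerSecond(1920000, 2.0) gives 960,000,000 bytes per second, so the extra upload path is moving a little under 1 GB/s.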