Error executing two threads using OpenGL

Hi, first of all, sorry for my poor English.

I keep developing in CUDA; I'm programming a few examples using CUDA and OpenGL, and the results are great, but now I have a problem I can't solve after a lot of tries.

I'm now trying to use two devices for computation. They're two GeForce 9600 GT cards on a 780 SLI motherboard. I've disabled SLI mode, so CUDA can see the two devices.

I'm following the 'simpleMultiGPU' example in the CUDA SDK. That example compiles and runs well (although the CPU [Quad 9300] compute time is lower than the GPU compute time, twice as fast!), CUDA detects two devices on my PC, and the two threads are launched correctly.

Let me tell you my problem:

My example is very simple. I have a plane mesh, defined with vertices and indices; each frame, the position of every vertex changes by -0.1 units along the Y axis. I'm using VBOs (vertex buffer objects) and IBOs (index buffer objects).

Executing my example without threads (I mean application threads, not CUDA threads), I have no problem. The plane moves quickly and smoothly along the Y axis. What I do is:

[indent]1. Create VBO and IBO
2. Register VBO and IBO in CUDA
3. For each frame:
[indent]3.1. Map VBO in device memory
3.2. Modify vertex in device memory
3.3. Unmap VBO
3.4. Draw plane using OpenGL[/indent][/indent]
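For context, steps 3.1 to 3.4 above could be sketched roughly like this (a minimal sketch using the old CUDA 2.x GL-interop API mentioned in the post; the kernel name, `updateFrame`, and the launch configuration are my placeholders, not the actual code):

```cuda
#include <cuda_gl_interop.h>

// Sketch of one frame of the single-threaded version.
__global__ void movePlane(float3 *verts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        verts[i].y -= 0.1f;   // step 3.2: shift every vertex -0.1 in Y
}

void updateFrame(GLuint vboId, int nVerts)
{
    float3 *d_verts;

    // 3.1: map the registered VBO into device memory
    cudaGLMapBufferObject((void **)&d_verts, vboId);

    // 3.2: modify the vertices on the device
    int nThreads = 256;
    int nBlocks  = (nVerts + nThreads - 1) / nThreads;
    movePlane<<<nBlocks, nThreads>>>(d_verts, nVerts);

    // 3.3: unmap so OpenGL owns the buffer again
    cudaGLUnmapBufferObject(vboId);

    // 3.4: the caller now draws the plane with OpenGL (e.g. glDrawElements)
}
```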

As I told you before, it works perfectly. But when I use multithreading, the application crashes:

[indent]1. Create VBO and IBO
2. Register VBO and IBO in CUDA
3. For each frame:
[indent]3.1. Create two threads.
3.2. Set the device in CUDA for each thread.
3.3. Map VBO in device memory
3.4. Modify vertex in device memory
3.5. Unmap VBO
3.6. Draw plane using OpenGL[/indent][/indent]

I'm using the same thread-creation code as 'simpleMultiGPU':

threads[0] = cutStartThread((CUT_THREADROUTINE)dispatcher, (void *)&data1);
threads[1] = cutStartThread((CUT_THREADROUTINE)dispatcher, (void *)&data2);

The 'dispatcher' function sets the device and maps the VBO, then executes the kernel and unmaps the VBO:

// Set the device
cudaSetDevice(data->device);

// Map the VBO into device memory
float3 *d_vboPlane;
CUDA_SAFE_CALL(cudaGLMapBufferObject((void **)&d_vboPlane, data->planeId)); // CRASHES

// Create launch dimensions
dim3 blk(nBlocks, 1, 1);
dim3 thrd(nThreads, 1, 1);

// Call the kernel
sampleMultiThread_kernel<<<blk, thrd>>>(d_vboPlane, data->planeSize);

// Wait for the kernel to finish
cudaThreadSynchronize();

// Unmap the object
CUDA_SAFE_CALL(cudaGLUnmapBufferObject(data->planeId));

When the application reaches 'cudaGLMapBufferObject', it crashes. The message in the output is:

First-chance exception at 0x77d4dd10 in testOpenGLCubo.exe: Microsoft C++ exception: cudaError_enum at memory location 0x051efdb8…

If I execute only one thread, the application crashes in the same way.

I've searched for this error in the forum, and the partial solutions discussed here haven't solved my problem :(. Please, could you help me?

Thanks in advance.

Anybody, please?

I don't understand exactly what you're trying to do. Do you have a single OpenGL context? You have to make sure all GL calls come from the same thread.

Anyway, there is no real advantage to using OpenGL interop across multiple GPUs (it just does copies internally). If I were you, I would simply read the geometry back to the CPU and render from there.
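A rough sketch of that suggestion: each GPU works on a plain device buffer and only the GL-owning thread ever touches OpenGL. All names here, and the fixed -0.1 Y step, are illustrative assumptions, not code from this thread:

```cuda
#include <cuda_runtime.h>

// Per-thread job: one half of the mesh, computed on one GPU, written back
// to host memory. The GL thread later uploads both halves into the VBO.
struct HalfJob {
    int     device;   // CUDA device this thread uses
    float3 *h_verts;  // host pointer to this half's vertices (in/out)
    int     count;    // number of vertices in this half
};

__global__ void moveHalf(float3 *verts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        verts[i].y -= 0.1f;
}

void workerThread(HalfJob *job)
{
    cudaSetDevice(job->device);

    float3 *d_verts;
    size_t bytes = job->count * sizeof(float3);
    cudaMalloc((void **)&d_verts, bytes);
    cudaMemcpy(d_verts, job->h_verts, bytes, cudaMemcpyHostToDevice);

    int nThreads = 256;
    moveHalf<<<(job->count + nThreads - 1) / nThreads, nThreads>>>(d_verts, job->count);

    // Read the result back; note that no GL call happens in this thread.
    cudaMemcpy(job->h_verts, d_verts, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_verts);
}
```

After both worker threads finish, the thread that owns the GL context uploads the updated vertices into the VBO (for example with glBufferSubData) and draws.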

Sorry, I have a very, very poor English level; I'll try to explain my problem better.

Soon (in a few months) we have to write an application that performs a lot of calculations on various meshes. Basically, we have to modify the mesh vertex positions each frame using an advanced, weighted mathematical algorithm. This algorithm is so heavy that we get only 0.5 to 1 frames per second on a quad-core CPU.

Our goal is to implement this algorithm in CUDA, using parallel computation to get better results, at least real-time performance (25 fps). The vertex calculations are independent: you can apply the algorithm to one vertex without knowing the values of the rest.

We want to take advantage of our two GeForce 9600 cards, so we have to use multithreading. Basically, our intention is to map the mesh in CUDA and then have each device calculate half of the mesh vertices:

VBO Mesh → Map VBO in CUDA (device memory) → Create Threads

– Thread 1 (on device 1) calculates half of the vertices

– Thread 2 (on device 2) calculates the other half
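The half-and-half split above can be expressed as a small host-side helper (a minimal sketch; the convention that device 0 takes the extra vertex when the count is odd is my assumption):

```cpp
#include <cassert>

// Split nVerts vertices into two contiguous ranges, one per device.
// Device 0 gets the extra vertex when the count is odd.
void splitRange(int nVerts, int device, int *offset, int *count)
{
    int half  = nVerts / 2;
    int extra = nVerts % 2;
    if (device == 0) {
        *offset = 0;
        *count  = half + extra;   // first half, plus the remainder
    } else {
        *offset = half + extra;
        *count  = half;           // second half
    }
}
```

Each thread would then run its kernel on `verts + offset` with `count` elements, so the two devices never touch the same vertices.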

Could CUDA still give me any advantage for this problem?

Sorry again for my English, and thanks, thanks a lot for your time.