Strange behavior pattern of compiled program
Emulation works fine: device execution works (approx) 1


I’m a CUDA baby (about 3 months old).

I’ve been playing with the code sample “simpleGL”, adjusting it for an application similar to “Life”.

I’ve gotten to the point of an object moving around on the screen.

In previous iterations, as complexity increased,
both emulation mode and device execution mode worked identically.

In the last iteration of increasing complexity (the only thing added was a y-axis of motion), a strange behavior
began. Emulation works perfectly 100% of the time, but when compiled for device execution, 3 outcomes alternate
completely at random (that is, every sequence you might ask about occurs with equal frequency). They are:

  1. blank screen (the output window that used to display simpleGL’s waves and now displays my objects)
  2. correct function (an object moves across the screen)
  3. 2 runtime errors saying: CudaSafeCall() Runtime API error in file <simpleGL.cpp>, line 336 : unspecified launch failure.
    The two lines referred to in the error messages are the un-mapping and de-registering CUDA calls.

At first I thought I just wasn’t cleaning up memory properly after execution, which could explain
differing successive behaviors by identical executables, but I have since discovered that the sequence
of errors, blank screens and correct runs
doesn’t fit that explanation consistently.

Can anyone out there tell me where to look?

I don’t think it’s appropriate to post the entire code, but if you think you have a clue, I’ll email it to you.
In the meantime, here’s a summary. If you have this summary plus the code samples “simpleGL” and “matrixMul”,
plus the code examples in the Programmers’ Guide, you can get a good idea of what I’m doing:

all includefiles
most constants & variables (I removed everything possible from the original,
at the very beginning of experimentation, e.g.
command-line argument uses, mouse controls, etc)
forward declarations

(FROM various sources)
runTest(argc, argv);
cutilExit(argc, argv);

CUTBoolean runTest{
set some integer variables & object size_t stuff
allocate a tiny amount of host mem using malloc (exactly as in Programmer’s Guide)
define some pointers for transferring data to device
allocate tiny amount of global device mem using MatrixMul code
copy it to device as in MatrixMul
register callbacks and create VBO as in simpleGL
create, initialize & allocate for buffer (exactly as simpleGL)
atexit (delete vbo & checkRender - from simpleGL)
memory cleanup (from matrixMul, taking care of the small memory amounts I allocated
that were NOT included in simpleGL)
cudaThreadExit (why this is duplicated here, I don’t know - see simpleGL)


Init GL routine - exactly as simpleGL, with some removed (all removed long before current
problems surfaced)

RunCuda {

Create VBO exactly as simpleGL

display routine as called from glutMainLoop {
runcuda, then the following routines from simpleGL:
glClear, glLoadIdentity, glTranslatef, glBindBuffer, glVertexPointer,
glEnableClientState, glColor3f, glDrawArrays

glDisableClientState, glutSwapBuffers, glutPostRedisplay


calculate x & y as in simpleGL
calculate u,v,w, using different functions, but conceptually identical to simpleGL
write output vertex using pos exactly as simpleGL

the same “launch_kernel” as in simpleGL, but with a small grid
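
Since the kernel writes each vertex at an index built from x and y, one thing that can be checked on the host is whether the small grid ever produces an index outside the VBO. The sketch below is only an illustration under assumptions: it assumes the write index is y*width + x as in simpleGL, and the names Dim2 and countOutOfBoundsWrites are hypothetical helpers, not part of the SDK or of my code.

```cpp
#include <cstddef>

// Hypothetical stand-in for the x/y parts of CUDA's dim3.
struct Dim2 {
    unsigned int x;
    unsigned int y;
};

// Walks every (block, thread) pair of the launch configuration on the host
// and counts the threads whose write would land outside the vertex buffer.
// Assumption: the kernel computes x and y as in simpleGL and writes to
// pos[y * width + x]; vboElements is the number of vertices in the VBO.
std::size_t countOutOfBoundsWrites(Dim2 grid, Dim2 block,
                                   unsigned int width,
                                   std::size_t vboElements)
{
    std::size_t bad = 0;
    for (unsigned int by = 0; by < grid.y; ++by) {
        for (unsigned int bx = 0; bx < grid.x; ++bx) {
            for (unsigned int ty = 0; ty < block.y; ++ty) {
                for (unsigned int tx = 0; tx < block.x; ++tx) {
                    // Same arithmetic as blockIdx * blockDim + threadIdx.
                    unsigned int x = bx * block.x + tx;
                    unsigned int y = by * block.y + ty;
                    std::size_t index =
                        static_cast<std::size_t>(y) * width + x;
                    // A column past the row width or an index past the
                    // end of the buffer both count as a bad write.
                    if (x >= width || index >= vboElements) {
                        ++bad;
                    }
                }
            }
        }
    }
    return bad;
}
```

A result of zero for the actual grid, block, width and VBO size would suggest the indexing is not the culprit; a nonzero result would point at threads that need a bounds guard in the kernel.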