CUDA DLL AND JNI

immuner · November 12, 2008, 3:31pm

Hi,

We are porting our Java application, which uses CUDA to perform some mathematical calculations, write to an image buffer and return the result to Java through JNI. I am using Vista 32-bit with the latest CUDA and JDK, running on a Quadro FX 1700 512 MB card.
So far so good (most of the problems from my previous posts have been solved).
However, what I would like to ask is whether this combination can cause CUDA memory loss or other issues.
To explain:
Inside the DLL I have a number of functions to init CUDA, allocate memory on the host, allocate memory on the device, execute the kernels (two of them: the first is for init, the second for writing onto the buffer), etc.
When I allocate memory on the device, I return a pointer to Java, which I pass back to CUDA when I need it again.
When I need to write to a Java image buffer, I allocate memory in Java, pass the pointer to CUDA, run the kernel and copy the result buffer to the memory block that the pointer passed from Java refers to.
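
The pointer hand-off described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual DLL code: `handle_t` stands in for JNI's 64-bit `jlong`, `std::malloc` and `std::memcpy` stand in for `cudaMalloc` and a device-to-host `cudaMemcpy` so that the sketch runs without a GPU, and the function names `allocVolume`, `toPointer`, `copyResult` and `freeVolume` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Stand-in for JNI's jlong, which is a 64-bit signed integer.
using handle_t = std::int64_t;

// Convert an opaque handle received from the Java side back into a pointer.
void* toPointer(handle_t h) {
    return reinterpret_cast<void*>(static_cast<std::uintptr_t>(h));
}

// Hypothetical "allocate volume" entry point. The real DLL would call
// cudaMalloc here; std::malloc is an assumption made so the sketch is runnable.
// Returns 0 if the allocation failed.
handle_t allocVolume(std::size_t bytes) {
    void* p = std::malloc(bytes);
    return static_cast<handle_t>(reinterpret_cast<std::uintptr_t>(p));
}

// Hypothetical "copy result" entry point. dst is the buffer allocated on the
// Java side; the real DLL would use cudaMemcpy with cudaMemcpyDeviceToHost.
bool copyResult(handle_t h, void* dst, std::size_t bytes) {
    void* src = toPointer(h);
    if (src == nullptr || dst == nullptr) {
        return false;  // refuse to touch a null handle or a null destination
    }
    std::memcpy(dst, src, bytes);
    return true;
}

// Hypothetical "free volume" entry point (cudaFree in the real DLL).
void freeVolume(handle_t h) {
    assert(h != 0 && "freeVolume called with a null handle");
    std::free(toPointer(h));
}
```

Storing the handle in a 64-bit integer on the Java side (a `long`) keeps the full pointer value on both 32-bit and 64-bit builds.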

It seems to work alright, but here are some issues I am dealing with. The main problem is instability. I am allocating a large volume on the device (it could easily be 200 MB), and in some cases the application crashes because of a memory access at the volume pointer, or it gets through and crashes afterwards when it tries to free the volume pointer. In other cases it crashes on my first kernel with an out-of-memory error (that could be because I am using one card for both the desktop and CUDA). Either way, it is quite unstable.
I have come to assume that the volume pointer is getting lost on the GPU after some point, or something similar. Given my assumption that I cannot use a global device pointer in a DLL (can I?), could the issue be that CUDA frees memory by itself after some time has passed when it is called from a DLL?
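
One way to narrow down whether the volume pointer is really being lost would be for the DLL to keep a table of the handles it has given out, and to check every incoming handle against that table before touching or freeing the memory. The sketch below is an illustration under assumptions, not a fix: `HandleRegistry`, `HandleStatus` and `Allocation` are hypothetical names, and the handle is treated as a plain 64-bit integer. It also records which host thread registered each handle, on the assumption that Java may call into the DLL from more than one thread and that this may matter to the CUDA runtime.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>
#include <thread>

// Result of validating a handle that came back from the Java side.
enum class HandleStatus { Ok, Unknown, WrongThread };

// What the DLL remembers about each live allocation.
struct Allocation {
    std::size_t bytes;      // size requested at allocation time
    std::thread::id owner;  // host thread that registered the handle
};

// Hypothetical bookkeeping class; not part of CUDA or JNI.
class HandleRegistry {
public:
    // Record a handle right after a successful device allocation.
    void add(std::int64_t handle, std::size_t bytes) {
        std::lock_guard<std::mutex> lock(mutex_);
        live_[handle] = Allocation{bytes, std::this_thread::get_id()};
    }

    // Validate a handle before using it. The caller id defaults to the
    // current thread; it is a parameter only to make the check easy to test.
    HandleStatus check(std::int64_t handle,
                       std::thread::id caller = std::this_thread::get_id()) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = live_.find(handle);
        if (it == live_.end()) return HandleStatus::Unknown;  // never allocated or already freed
        if (it->second.owner != caller) return HandleStatus::WrongThread;
        return HandleStatus::Ok;
    }

    // Forget a handle just before freeing it; a second call reports Unknown,
    // which turns a double free into a reportable error instead of a crash.
    HandleStatus remove(std::int64_t handle) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = live_.find(handle);
        if (it == live_.end()) return HandleStatus::Unknown;
        if (it->second.owner != std::this_thread::get_id()) return HandleStatus::WrongThread;
        live_.erase(it);
        return HandleStatus::Ok;
    }

private:
    mutable std::mutex mutex_;
    std::map<std::int64_t, Allocation> live_;
};
```

If `check` ever reports `Unknown` or `WrongThread` just before a crash, that would point at the Java/JNI side (a stale handle, a double free, or a call from a different thread) rather than at CUDA releasing the memory on its own.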

These are my kernels:
1>ptxas info : Compiling entry function ‘__globfunc__Z18renderkernelPsS_4int3S0_s6float3Pf’
1>ptxas info : Used 40 registers, 68+64 bytes smem, 52 bytes cmem[1]
1>ptxas info : Compiling entry function ‘__globfunc__Z17initkernelPsiiiiPdS0_S0_S0_S0_S0_Pis’
1>ptxas info : Used 25 registers, 9600+0 bytes lmem, 66+64 bytes smem, 8 bytes cmem[1]

Any help would be really appreciated as this is a critical stage for our project.

P.S. Is there any way we (my company) can contact NVIDIA members to discuss development issues such as this, apart from this forum?