Crashes at cuCtxDestroy() when running multiple threads

Hello,

I’m using the CUDA driver-level API, and each thread function makes calls like:

at init:
cuDeviceGet()
cuCtxCreate()
cuModuleLoadData()
etc

then a main loop containing multiple calls to:
cuLaunchGrid()
cuCtxSynchronize()

and clean-up code before the thread exits:
cuMemFree()
cuModuleUnload()
cuCtxDestroy()
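
A minimal sketch of the per-thread lifecycle listed above, with every driver call's return code checked so the first failing call is reported by name. It is not the original code: the helper names run_worker and run_threads, the 1 MiB buffer, and the thread and iteration counts are made up for illustration. Module loading and kernel launches (cuModuleLoadData(), cuLaunchGrid()) are left out because they need a compiled kernel image. It assumes the three-argument cuCtxCreate() of toolkits from that era; when cuda.h is not found, stand-ins that always succeed are compiled instead, so only the threading skeleton is exercised.

```cpp
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

#if __has_include(<cuda.h>)
#include <cuda.h>
#else
// ASSUMPTION: no CUDA toolkit available. These stand-ins always succeed and
// exist only so the threading skeleton compiles; they are not the driver.
typedef int CUresult;
typedef int CUdevice;
typedef void* CUcontext;
typedef unsigned long long CUdeviceptr;
const CUresult CUDA_SUCCESS = 0;
inline CUresult cuInit(unsigned) { return CUDA_SUCCESS; }
inline CUresult cuDeviceGet(CUdevice* d, int ordinal) { *d = ordinal; return CUDA_SUCCESS; }
inline CUresult cuCtxCreate(CUcontext* c, unsigned, CUdevice) { *c = nullptr; return CUDA_SUCCESS; }
inline CUresult cuMemAlloc(CUdeviceptr* p, std::size_t) { *p = 0; return CUDA_SUCCESS; }
inline CUresult cuCtxSynchronize() { return CUDA_SUCCESS; }
inline CUresult cuMemFree(CUdeviceptr) { return CUDA_SUCCESS; }
inline CUresult cuCtxDestroy(CUcontext) { return CUDA_SUCCESS; }
#endif

// Hypothetical helper: runs one thread's init / main loop / clean-up sequence.
// Returns the name of the first driver call that failed, or nullptr on success.
const char* run_worker(int ordinal, int iterations) {
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr buf;

    // Init: pick the device and create a context owned by this thread.
    if (cuDeviceGet(&dev, ordinal) != CUDA_SUCCESS) return "cuDeviceGet";
    if (cuCtxCreate(&ctx, 0, dev) != CUDA_SUCCESS) return "cuCtxCreate";

    const char* failed = nullptr;
    if (cuMemAlloc(&buf, 1 << 20) != CUDA_SUCCESS) {
        failed = "cuMemAlloc";
    } else {
        // Main loop: kernel launches omitted, only the synchronize step is kept.
        for (int i = 0; i < iterations; ++i) {
            if (cuCtxSynchronize() != CUDA_SUCCESS) {
                failed = "cuCtxSynchronize";
                break;
            }
        }
        // Clean-up: free the buffer, keeping the earliest failure if there was one.
        if (cuMemFree(buf) != CUDA_SUCCESS && !failed) failed = "cuMemFree";
    }
    // Clean-up: destroy the context on the same thread that created it.
    if (cuCtxDestroy(ctx) != CUDA_SUCCESS && !failed) failed = "cuCtxDestroy";
    return failed;
}

// Hypothetical helper: starts `threads` workers on device `ordinal`, each with
// its own context, and returns how many of them reported a failed call.
int run_threads(int ordinal, int threads, int iterations) {
    if (cuInit(0) != CUDA_SUCCESS) return threads;

    std::vector<const char*> failed(threads, nullptr);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        // Each thread writes only its own slot, so no locking is needed here.
        pool.emplace_back([&failed, t, ordinal, iterations] {
            failed[t] = run_worker(ordinal, iterations);
        });
    }
    for (auto& th : pool) th.join();

    int bad = 0;
    for (int t = 0; t < threads; ++t) {
        if (failed[t]) {
            std::fprintf(stderr, "thread %d: %s failed\n", t, failed[t]);
            ++bad;
        }
    }
    return bad;
}
```

For example, run_threads(0, 4, 100) would start four threads that each create and destroy their own context on device 0. Building against the real driver needs something like -lcuda -pthread. A hard crash inside the driver would not show up as a return code, so this only narrows things down when a call fails cleanly.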

I’m using at least one thread for each GPU. Sometimes several threads create contexts on the same GPU, since that makes the overall computation faster, and as far as I can tell this should be fine even with CUDA 4.0 because I’m using the driver-level API.

However, with a brand-new GT 640 (an SM 3.0 device, with 301.42 drivers) I’ve run into crashes in cuCtxDestroy() when multiple threads were spawned. It looks like only the first call to cuCtxDestroy() succeeds. When I limit everything to one thread per GPU, it works with a single GT 640. But as soon as two devices are used (GTX 470 + GT 640), the same crash appears again. It sometimes works, roughly 1 run out of 10, but it is not stable at all.

Is anyone else facing something similar?