Calling CUDA kernels from 2 or more CPU threads simultaneously gives “unknown error”

I am developing a Java application that uses CUDA through a native DLL. The problem I ran into recently is related to calling the CUDA code from two Java threads in parallel.

A simplified version of the C++ code is below (I omitted the kernel code):

void calcDistances(...) {
  cudaStream_t stream; 
  cudaStreamCreate(&stream);

  HANDLE_ERROR(cudaMalloc(...));
  ....more cudaMalloc...

  HANDLE_ERROR(cudaMemcpyAsync(...));

  for (int index = 0; index < anglesCount; index++) {
    ... kernel1<<< >>> ...
    ... kernel2<<< >>> ...
    ... kernel3<<< >>> ...
    rotAngle += angleStep; 
  }

  HANDLE_ERROR(cudaMemcpyAsync(...));

  HANDLE_ERROR(cudaFree(dFloats1));
  ... more cudaFree()...
}

Symptoms:

  1. When this code is called serially (with “synchronized” on the Java side), everything is OK.
  2. When two Java threads call this code in parallel, it gives “unknown error” (with ~50% chance, so sometimes it works) on a random line after the cudaMemcpyAsync() that copies to the host.
  3. Commenting out the cudaFree() calls makes the error disappear.
  4. Decreasing the input data sizes also makes the GPU happy: no error.
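
For reference, the serialization from symptom 1 can also be done on the native side instead of with “synchronized” in Java. Below is a minimal sketch of that idea. The names calcDistancesStub(), calcDistancesSerialized() and runFromThreads() are hypothetical, and the stub only stands in for the real calcDistances(...), so no GPU is involved. It only shows a way to keep two host threads from entering the CUDA code at the same time; it does not explain why the parallel case fails.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real calcDistances(...) in the DLL.
// It only records how many callers are inside it at the same time.
static int g_active = 0;     // callers currently inside the stub
static int g_maxActive = 0;  // highest concurrency observed so far
static int g_calls = 0;      // total completed calls

void calcDistancesStub() {
  ++g_active;
  if (g_active > g_maxActive) g_maxActive = g_active;
  std::this_thread::yield();  // give other threads a chance to interleave
  ++g_calls;
  --g_active;
}

// One process-wide lock around the whole native call (assumption: the
// real function would be wrapped the same way at the DLL entry point).
static std::mutex g_cudaMutex;

void calcDistancesSerialized() {
  std::lock_guard<std::mutex> lock(g_cudaMutex);
  calcDistancesStub();  // the counters above are only touched under the lock
}

// Starts `threads` host threads, each making `callsPerThread` calls,
// and waits for all of them to finish.
void runFromThreads(int threads, int callsPerThread) {
  assert(threads > 0 && callsPerThread > 0);
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([callsPerThread] {
      for (int i = 0; i < callsPerThread; ++i) calcDistancesSerialized();
    });
  }
  for (auto& th : pool) th.join();
}
```

With the lock in place, g_maxActive stays at 1 regardless of how many threads call in, which matches the "called serially" case above.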

Thanks for your comments and suggestions, guys.

PS: I posted this question on stackoverflow.com but got no reply there.