Single device, multithreaded host, CUDA error: unspecified launch failure

I’m writing a large CUDA module that contains multiple kernel calls, mainly cuFFT plus my own kernels (I can’t post the code).

The module runs perfectly on a single GPU from a single host thread (no problems at all).

When I create two threads, each with its own instance of the CUDA module class, the module fails after several kernel calls and I receive an error message.
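For context, here is a simplified sketch of how my two host threads drive their own module instances. The kernel and class names are placeholders, not my real code, and the cuFFT plans are omitted for brevity:

```cuda
#include <cuda_runtime.h>
#include <thread>

// Placeholder kernel standing in for the module's real kernels.
__global__ void dummyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Placeholder for my real module class: each instance owns its own
// stream and device buffer, and launches several kernels in sequence.
struct CudaModule {
    cudaStream_t stream;
    float* d_data;
    int n;

    explicit CudaModule(int n_) : n(n_) {
        cudaStreamCreate(&stream);
        cudaMalloc(&d_data, n * sizeof(float));
    }
    void run() {
        // Several kernel launches in sequence, as in the real module.
        for (int iter = 0; iter < 100; ++iter)
            dummyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
        cudaStreamSynchronize(stream);
    }
    ~CudaModule() {
        cudaFree(d_data);
        cudaStreamDestroy(stream);
    }
};

int main() {
    auto worker = [] {
        CudaModule mod(1 << 20);  // each thread gets its own instance
        mod.run();
    };
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    return 0;
}
```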

Sometimes the multithreaded module works fine, but it often fails and exits, printing “unspecified launch failure” at line # to a small black screen (the line number differs between runs).

1st, I looked for a segmentation fault but couldn’t find any; keep in mind that the single-threaded code raised no errors (so if there were a segmentation fault, I would already have discovered it in the single-threaded version).

2nd, I ran cuda-memcheck and it didn’t report anything.

3rd, I compiled the project with -G and -g; no signs of anything suspicious.

4th, I built my own multithreaded test version that uses smaller kernel calls, and everything works fine with my multithreading methodology.

5th, I’m quite familiar with CUDA contexts; since I’m using two threads from the same process, both threads share the same CUDA context.
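For completeness, this is roughly how I check errors after each launch, which is how I get the failing line number. The macro name is mine, but the pattern is the standard runtime-API error check:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Wraps every runtime call; prints file and line on failure.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                  \
                    __FILE__, __LINE__, cudaGetErrorString(err));         \
            exit(EXIT_FAILURE);                                          \
        }                                                                 \
    } while (0)

// Kernel launches don't return an error directly, so after each launch:
//   myKernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());      // launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize()); // asynchronous execution errors
```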

I’m running the latest CUDA version, 5.5, on a Tesla C2075.
My hunch is that I’m overloading the GPU beyond its computational capacity. I’d expect the GPU work queue to queue my kernel calls, even when they come from different threads in the same process. Somehow, instead of being queued, everything on the GPU collapses and the error is thrown.

A few questions:

How can I know if my GPU is overloaded (i.e., out of resources) in a multithreaded system?
I’m using texture memory; could it cause extra trouble?
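On the first question, the only resource check I know of is querying free device memory from each thread; a minimal sketch using the runtime API:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    size_t freeBytes = 0, totalBytes = 0;
    // Reports free/total device memory for the current context;
    // calling this from each host thread shows whether the combined
    // allocations are exhausting the device.
    cudaMemGetInfo(&freeBytes, &totalBytes);
    printf("Free: %zu MB / Total: %zu MB\n",
           freeBytes >> 20, totalBytes >> 20);
    return 0;
}
```

This only covers memory, though, not launch resources like registers or shared memory, which is why I’m asking.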
Any help, hint, or advice would be very much appreciated.