Seg fault when calling cudaThreadExit in a multi-GPU application

We have an application that spawns several threads, each of which does some computation on a Tesla S870. At the start of each thread, cudaSetDevice is called to select that thread's device (and thus its CUDA context). The work performed during the thread's execution appears correct, and no error is returned after calling cudaThreadSynchronize. However, when the thread terminates, the program seg faults. I finally figured out that cudaThreadExit is where the seg fault occurs, but I have no idea why. If we set the application up to use only one device, the program executes as expected with no seg faults; it's only when two or more devices are used that the seg fault happens.
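For reference, here is a minimal sketch of the per-thread pattern described above. The kernel, buffer sizes, and Boost plumbing (workerKernel, workerThread, etc.) are placeholders, not our actual code:

```cpp
#include <boost/thread.hpp>
#include <boost/bind.hpp>
#include <cuda_runtime.h>

__global__ void workerKernel(float *d, int n)        // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

void workerThread(int device)
{
    cudaSetDevice(device);                 // bind this host thread to one GPU
    const int n = 1 << 20;
    float *d = 0;
    cudaMalloc(&d, n * sizeof(float));
    workerKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaThreadSynchronize();               // returns cudaSuccess in our runs
    cudaFree(d);                           // free only what this thread allocated
    cudaThreadExit();                      // tear down this thread's context; seg fault shows up here
}

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    boost::thread_group threads;
    for (int dev = 0; dev < deviceCount; ++dev)
        threads.create_thread(boost::bind(workerThread, dev));
    threads.join_all();
    return 0;
}
```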

Has anyone seen this type of behavior before, or does anyone have a hint as to what could make cudaThreadExit seg fault? We're using Boost threads on Ubuntu, and I'm not convinced we're doing everything correctly. Thanks.

A quick update: we've tried using the low-level (driver) API to detach the context, but the program still crashes at thread exit. Maybe cudaThreadExit is still being called? Any ideas?
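A rough sketch of the kind of driver-API detach we tried (hypothetical helper name; the exact calls in our code may have differed):

```cpp
#include <cuda.h>

void detachCurrentContext()
{
    CUcontext ctx = 0;
    // Attach to the context the runtime created for this thread to get a
    // handle to it (this bumps its usage count)...
    if (cuCtxAttach(&ctx, 0) != CUDA_SUCCESS)
        return;
    // ...then release that reference before the thread exits. The runtime
    // still holds its own reference, which may be why the implicit
    // cudaThreadExit-style teardown still runs when the thread terminates.
    cuCtxDetach(ctx);
}
```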

Just to close this out in case anyone else ever sees this error: the problem was definitely in our code. Just before exiting, each thread calls a destroy function to free all of the memory it allocated on the device. In this destroy function, a host variable was also being freed that was not allocated by the thread, but by the master thread. Once that was fixed, the seg fault disappeared. This is not the first time I was convinced something was wrong with CUDA, only to find the mistake in code I had written.
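An illustrative sketch of the bug pattern, with hypothetical names rather than our real code. Each worker's cleanup freed a host buffer owned by the master thread, so with two or more workers the same pointer was freed more than once and the heap corruption surfaced at cudaThreadExit:

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

float *g_hostScratch = 0;    // allocated once by the master thread

void destroyThreadResources(float *d_data)
{
    cudaFree(d_data);        // fine: device memory this thread allocated
    free(g_hostScratch);     // BUG: host memory owned by the master thread;
                             // the second worker to run this double-frees it
    cudaThreadExit();        // the crash showed up here, after the corruption
}

// Fix: only the master thread frees g_hostScratch, after all workers have joined.
```

With a single device there is only one worker, so the stray free happens once and nothing obviously breaks, which matches the behavior we saw.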