It looks like you are running into an issue with the CUDA context.
A CUDA context is current per thread, so in general each thread should make the context current before it issues CUDA work and restore the previous one when it switches tasks.
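As a rough illustration of the store/restore idea, here is a minimal sketch using the CUDA driver API. It assumes a single GPU (device 0) and shares the device's primary context between two worker threads. The `worker` function, the `CHECK_CU` macro, and the 1 KiB buffer are hypothetical names and values chosen for this example only; your application may manage contexts differently.

```cpp
#include <cuda.h>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical helper: abort with a message if a driver API call fails.
#define CHECK_CU(call)                                                  \
    do {                                                                \
        CUresult res_ = (call);                                         \
        if (res_ != CUDA_SUCCESS) {                                     \
            const char* msg_ = nullptr;                                 \
            cuGetErrorString(res_, &msg_);                              \
            std::fprintf(stderr, "%s failed: %s\n", #call,              \
                         msg_ ? msg_ : "unknown error");                \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical per-thread task: bind the context, do some work, unbind.
void worker(CUcontext ctx, int id) {
    // Make the shared context current on this thread.
    CHECK_CU(cuCtxPushCurrent(ctx));

    // Placeholder work: allocate, clear, and free a small device buffer.
    CUdeviceptr buf = 0;
    CHECK_CU(cuMemAlloc(&buf, 1024));
    CHECK_CU(cuMemsetD8(buf, 0, 1024));
    CHECK_CU(cuCtxSynchronize());
    CHECK_CU(cuMemFree(buf));

    // Restore whatever was current on this thread before the push.
    CUcontext popped = nullptr;
    CHECK_CU(cuCtxPopCurrent(&popped));

    std::printf("thread %d finished\n", id);
}

int main() {
    CHECK_CU(cuInit(0));

    CUdevice dev;
    CHECK_CU(cuDeviceGet(&dev, 0));  // assumption: device 0 is the target GPU

    // Retain the primary context; this does not make it current anywhere.
    CUcontext ctx = nullptr;
    CHECK_CU(cuDevicePrimaryCtxRetain(&ctx, dev));

    std::vector<std::thread> threads;
    for (int i = 0; i < 2; ++i) {
        threads.emplace_back(worker, ctx, i);
    }
    for (auto& t : threads) {
        t.join();
    }

    CHECK_CU(cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```

This should build with something like `nvcc example.cpp -o example -lcuda`, though the exact command depends on your toolkit installation. The point of the sketch is the push/pop pair around each task: every thread binds the context before its first CUDA call and unbinds it afterwards, so no thread relies on another thread's current context.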
A sample that uses CUDA with threads can be found here:
Please let me know if I misunderstood your question.
Thanks.