Strange problem with CUDA .dlls

Here’s the problem. I have three separate .dlls that wrap CUDA functions; let’s call them A, B and C. C depends on B, which in turn depends on A. Each runs fine if I call it from its own .exe launch. However, if I use all three .dlls in the same .exe and run A, then B, then C, C fails with the good old “unspecified launch failure”. I can run just A and C, or just B and C, but the sequences A, B, C and B, A, C both fail.

I don’t want to go into too much detail, but I do know that none of my kernels takes more than a few ms to run, and my thread and block sizes are well within the limits. I have also checked available card memory at the beginning of each set, and that is not an issue. If I run A and B, and then call C from a different .exe, everything runs fine and as expected. I have an 8800 GTX on Windows XP, and I’m still using CUDA 1.1. Anyone have any ideas?
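One thing that might help narrow it down: kernel launches are asynchronous, so the error that shows up in C could in principle have been raised by an earlier launch in A or B. Below is a minimal sketch of a check that could go after each wrapped call. The names `checkCuda` and `dummyKernel` are made up for illustration and are not part of my .dlls, and `cudaDeviceSynchronize()` is the current runtime call (on a toolkit as old as 1.1 the equivalent would be `cudaThreadSynchronize()`).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wait for all queued GPU work to finish, then report
// any pending error together with a caller-supplied label.
static bool checkCuda(const char *label)
{
    // A failure inside a kernel is normally reported by a later runtime
    // call. Synchronizing here ties the error to the step that just ran.
    cudaError_t err = cudaDeviceSynchronize();  // cudaThreadSynchronize() on very old toolkits
    if (err == cudaSuccess)
        err = cudaGetLastError();               // also catches launch-configuration errors
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", label, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Placeholder kernel standing in for the real work inside one of the .dlls.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1024;
    float *d = NULL;

    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess)
        return 1;
    cudaMemset(d, 0, n * sizeof(float));

    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    bool ok = checkCuda("after dummyKernel");   // label names the step just executed

    cudaFree(d);
    return ok ? 0 : 1;
}
```

With a call like `checkCuda("after A")`, `checkCuda("after B")` and so on in the host .exe, the first label that prints an error would point at the step that actually fails, which may not be the one where the failure is currently being noticed.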

Thank you.