I have a program that works on a single GPU, and I'm porting it to a multi-GPU version using the multi-GPU project in the SDK as a reference. What I want to do is set up all the global memory, textures, etc. up front, store the handles in per-GPU structures, and then spawn one thread per GPU to run the kernels. Inside each thread, the structure would be used to access the per-GPU variables, upload data, run the kernel, and download the results.
The problem is that when I call cudaMalloc outside of the thread (before spawning it), I get garbage out of my kernel. When I call cudaMalloc inside the thread, the kernel works fine. In the failing case, if I print the address of the pointer returned by cudaMalloc both outside and inside the thread, the two values are the same, yet something still breaks. Any ideas?
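For reference, here is a minimal sketch of the arrangement that does work for me, with cudaMalloc called inside the thread. It is not my actual code: `GpuPlan`, `scaleKernel`, and `worker` are hypothetical names made up for illustration, and it assumes `std::thread` rather than the threading helpers the SDK project uses.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                  \
                    cudaGetErrorString(err));                          \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Hypothetical per-GPU structure; the real one would also hold textures etc.
struct GpuPlan {
    int device = 0;            // which GPU this thread drives
    std::vector<float> host;   // input, overwritten with the result
    float *d_data = nullptr;   // device buffer, allocated inside the thread
};

// Hypothetical stand-in kernel: doubles every element.
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Runs in its own host thread: select the device, then allocate, upload,
// launch, download, and free, all from this same thread.
static void worker(GpuPlan *plan) {
    const int n = static_cast<int>(plan->host.size());
    const size_t bytes = n * sizeof(float);

    CHECK(cudaSetDevice(plan->device));
    CHECK(cudaMalloc(&plan->d_data, bytes));
    CHECK(cudaMemcpy(plan->d_data, plan->host.data(), bytes,
                     cudaMemcpyHostToDevice));

    scaleKernel<<<(n + 255) / 256, 256>>>(plan->d_data, n);
    CHECK(cudaGetLastError());

    // This blocking copy also waits for the kernel to finish.
    CHECK(cudaMemcpy(plan->host.data(), plan->d_data, bytes,
                     cudaMemcpyDeviceToHost));
    CHECK(cudaFree(plan->d_data));
    plan->d_data = nullptr;
}

int main() {
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));

    // Only host-side setup happens here; no device memory is allocated yet.
    std::vector<GpuPlan> plans(count);
    for (int d = 0; d < count; ++d) {
        plans[d].device = d;
        plans[d].host.assign(1024, 1.0f);
    }

    std::vector<std::thread> threads;
    for (int d = 0; d < count; ++d) threads.emplace_back(worker, &plans[d]);
    for (auto &t : threads) t.join();

    for (int d = 0; d < count; ++d)
        printf("GPU %d: first element = %f\n", d, plans[d].host[0]);
    return 0;
}
```

The failing variant is the same thing with the cudaMalloc call moved out of `worker` and into the setup loop in `main`, before the threads are spawned.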