I think I have to ask here to get this problem solved.
In the MultiGPU example of the CUDA SDK, it seems that device memory can only be allocated inside the thread, i.e. inside the gpuThread function. If I allocate the device memory for each GPU before I create the threads, the whole program hangs.
Are there any solutions for this kind of problem? I would prefer to allocate all the needed memory before using CUDA to compute anything.
You can do the allocation whenever you like - just make sure to talk to the correct device! The demo ensures this by doing the allocations in the thread that holds the correct context. You can pull that out of the thread code and do the initialization beforehand, switching to the right context for each allocation.
I personally would dislike your “global” approach. It is much cleaner the way the demo does it, because the resources are local to each GPU after all.
When I print out the memory addresses, I think I can see the problem.
If I allocate the memory in gpuThread, the same variable gets the same address on GPU0 and GPU1. However, if I allocate the memory before starting the threads, the same variable gets different addresses. I promise that whenever I allocate the memory I always call cudaSetDevice first to make sure each allocation goes to the right device.
Are you sure you are not accidentally overwriting a thread-global variable with the addresses returned from cudaMalloc? If you get the device addresses before the thread fork, you’ll need individual variables for them.