These segments of code aren’t enough to say what the problem is. Where and how is hostDeformImage[id] allocated? How are you running the kernel that performs the computation?
What I show here is exactly what I changed in my code, nothing else. I even call cudaThreadSynchronize() to make sure the data transfer is completed before I perform any computation. So if the two are equivalent, my implementation should give exactly the same result.
hostDeformImage[id] is allocated separately on the CPU side; there is no overlap between any two buffers.
Typo - I meant to ask: where is dataGPU[id].d_template3D allocated?
To do multi-GPU, you need to spawn one thread per GPU that grabs its GPU with cudaSetDevice(), allocates memory on that GPU, copies the data over, runs the computation, and copies the results back to the host. The thread cannot exit between any of these steps.
I asked for more code because I can’t tell if you’re doing that from the code you posted; it appears you are not. Maybe the description above is enough to point you in the right direction.
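For illustration only, here is a minimal sketch of that pattern using pthreads and the runtime API. The kernel, buffer sizes, and names (worker, dummyKernel, WorkerArgs) are placeholders, not your code; the point is that each thread grabs one GPU and does its allocation, copies, and computation before it exits.

#include <pthread.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)
#define MAX_GPUS 8

__global__ void dummyKernel(float *d_data, int n)   /* placeholder for the real computation */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 2.0f;
}

typedef struct {
    int    device;   /* which GPU this thread owns        */
    float *h_in;     /* host input slice for this GPU     */
    float *h_out;    /* host output slice for this GPU    */
} WorkerArgs;

static void *worker(void *p)
{
    WorkerArgs *a = (WorkerArgs *)p;

    cudaSetDevice(a->device);                           /* bind a context to THIS thread    */

    float *d_data = NULL;
    cudaMalloc((void **)&d_data, N * sizeof(float));    /* allocate inside the same thread  */

    cudaMemcpy(d_data, a->h_in, N * sizeof(float), cudaMemcpyHostToDevice);
    dummyKernel<<<(N + 255) / 256, 256>>>(d_data, N);
    cudaMemcpy(a->h_out, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_data);    /* clean up before the thread exits                       */
    return NULL;         /* the thread's context is released implicitly at exit    */
}

int main(void)
{
    int numGPUs = 0;
    cudaGetDeviceCount(&numGPUs);
    if (numGPUs > MAX_GPUS) numGPUs = MAX_GPUS;

    pthread_t  threads[MAX_GPUS];
    WorkerArgs args[MAX_GPUS];

    for (int g = 0; g < numGPUs; ++g) {
        args[g].device = g;
        args[g].h_in   = (float *)malloc(N * sizeof(float));
        args[g].h_out  = (float *)malloc(N * sizeof(float));
        for (int i = 0; i < N; ++i) args[g].h_in[i] = (float)i;
        pthread_create(&threads[g], NULL, worker, &args[g]);
    }
    for (int g = 0; g < numGPUs; ++g)
        pthread_join(threads[g], NULL);

    return 0;
}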
Thank you, your question is now clearer. dataGPU[i].d_template3D is allocated on each GPU. I'm pretty sure it is there, because the version without the threads runs correctly; it gives the same answer as the single-GPU solution.
I was asking where in your code the GPU memory is allocated, not where the buffer is physically located.
Still, I think you’re missing my point. You need one thread per GPU for allocation, copying, computation, and copying the results back to the CPU. Each thread may NOT exit between these steps. GPU resources are freed at thread exit because cudaThreadExit() is called implicitly.
The code you posted does not do this, and since I haven’t seen the rest of your code, I can only guess that is the problem.
I think you need to understand the concept of a CUDA “context” and the relation between a “context” and a “thread”, even if you use the runtime API. Please read the programming guide carefully.
No. What I said above goes for an OS thread, regardless of what API you use to create and manage the thread. When it exits, any resources on a GPU assigned to the thread will be released.
There’s a point that is unclear to me here. I don’t allocate the memory inside the thread. I allocate it in the main thread, by calling cudaSetDevice() for each GPU and then allocating the memory. So I think it should only be freed when the main program exits. Why can another CPU thread (the copy thread in my program) free my memory?
Thread 1: allocates memory on the GPU (d_mem).
Thread 2: d_mem is no longer a valid pointer to memory on the device, since each thread has a different CUDA context.
So you cannot exchange pointers to device memory between threads.
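To make that concrete, here is a minimal sketch of the failure mode (assuming pthreads and a runtime of the era discussed in this thread, where each host thread gets its own context; the names copyThread and d_mem are just for illustration): the main thread allocates d_mem, a second thread tries to use it, and the copy fails with an error such as “invalid device pointer”.

#include <pthread.h>
#include <cuda_runtime.h>
#include <stdio.h>

static float *d_mem = NULL;   /* device pointer shared (incorrectly) across threads */

static void *copyThread(void *unused)
{
    (void)unused;
    cudaSetDevice(0);         /* creates a NEW context in this thread */

    float h_val = 1.0f;
    /* d_mem belongs to the main thread's context, so on a runtime with
       one context per thread this copy fails.                          */
    cudaError_t err = cudaMemcpy(d_mem, &h_val, sizeof(float),
                                 cudaMemcpyHostToDevice);
    printf("copy thread: %s\n", cudaGetErrorString(err));
    return NULL;
}

int main(void)
{
    cudaSetDevice(0);                              /* context for the MAIN thread */
    cudaMalloc((void **)&d_mem, sizeof(float));    /* valid only in this context  */

    pthread_t t;
    pthread_create(&t, NULL, copyThread, NULL);
    pthread_join(t, NULL);

    cudaFree(d_mem);
    return 0;
}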