Multi-GPU newbie question: data transfer problem

I have two GPUs, and they both work.

In my program, I transfer data from each device's memory to the host.

for (int id = 0; id < 2; ++id) {
    cudaSetDevice(id);
    cudaMemcpy(hostDeformImage[id], dataGPU[id].d_template3D, sizeof(float) * nElems, cudaMemcpyDeviceToHost);
}

Because I want to run this in parallel, I define a download function and create two threads:

void* download(void* deviceId) {
    int* pid = (int*)deviceId;
    int id = *pid;
    cudaSetDevice(id);
    cudaMemcpy(hostDeformImage[id], dataGPU[id].d_template3D, sizeof(float) * nElems, cudaMemcpyDeviceToHost);
    pthread_exit((void*)0);
}
pthread_t thread[MAX_NUM_INPUTS];
int threadIds[MAX_NUM_INPUTS];
pthread_attr_t attr;
void* status;
int rc;

for (int i = 0; i < 2; ++i) threadIds[i] = i;

pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

for (int id = 0; id < 2; ++id)
    rc = pthread_create(&thread[id], &attr, download, (void*)&threadIds[id]);

for (int id = 0; id < 2; ++id) {
    rc = pthread_join(thread[id], &status);
}

It seems pretty easy. However, the parallel version does not give the same result as the sequential version.

Can anyone tell me what the problem is, and how to solve it?

Thank you

These segments of code aren’t enough to say what the problem is. Where and how is hostDeformImage[id] allocated? How are you running the kernel that performs the computation?

What I show here is exactly what I changed in my code, nothing else. I even use cudaThreadSynchronize() to make sure the data transfer is complete before I perform any computation. So if the two versions were equivalent, my implementation would give exactly the same result.

The hostDeformImage[id] buffers are allocated separately on the CPU side; there is no overlap between any two.

Is that clear enough?

Typo - I meant to ask: where is dataGPU[id].d_template3D allocated?

To do multi-GPU, you need to spawn a thread that grabs the GPU with cudaSetDevice(), allocates memory on the GPU, runs the computation, and copies the results back to the host. The thread cannot exit between any of these steps.

I asked for more code because I can’t tell if you’re doing that from the code you posted; it appears you are not. Maybe the description above is enough to point you in the right direction.
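
In other words, each worker thread should own the whole pipeline for its GPU. Here is a minimal sketch of that pattern (the scale kernel, buffer names, and sizes are placeholders I made up, not your code):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define NUM_GPUS 2
#define N (1 << 20)

float *h_result[NUM_GPUS];   /* one host buffer per GPU, allocated in main() */

/* Placeholder kernel standing in for the real computation. */
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

/* One worker per GPU: select the device, allocate, compute, and copy the
 * results back, all inside the same host thread, before it exits. */
void *worker(void *arg)
{
    int id = *(int *)arg;
    cudaSetDevice(id);

    float *d_buf;
    cudaMalloc((void **)&d_buf, N * sizeof(float));
    cudaMemset(d_buf, 0, N * sizeof(float));

    scale<<<(N + 255) / 256, 256>>>(d_buf, N);

    cudaMemcpy(h_result[id], d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_GPUS];
    int ids[NUM_GPUS];

    for (int i = 0; i < NUM_GPUS; ++i) {
        h_result[i] = (float *)malloc(N * sizeof(float));
        ids[i] = i;
        pthread_create(&t[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NUM_GPUS; ++i)
        pthread_join(t[i], NULL);

    printf("done\n");
    return 0;
}

The important part is that cudaMalloc, the kernel launch, and the cudaMemcpy back to the host all happen in the same thread that called cudaSetDevice.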

Thank you, your question is clearer now. The dataGPU[i].d_template3D buffers were allocated on each GPU. I'm pretty sure they are there, because the version without threads runs correctly; it gives the same answer as the single-GPU solution.


I was asking where in your code the GPU memory is allocated, not where the buffer is located physically.

Still, I think you're missing my point. You need one thread per GPU for allocation, copying, computation, and copying the results back to the CPU. Each thread may NOT exit between these steps. GPU resources are freed at thread exit because cudaThreadExit is called implicitly.

The code you posted does not do this, and since I haven’t seen the rest of your code, I can only guess that is the problem.

I don't use CUDA threads, I use pthreads instead, so the memory should not be freed. Am I right?

I think you need to understand the concept of a CUDA "context" and the relationship between a CUDA "context" and a host "thread", even if you use the runtime API. Please read the programming guide carefully.

Good luck

No. The CUDA thread class used in the multi-GPU example is just a thin wrapper on top of pthreads, if you look at it…

For better or worse, CUDA is very tightly coupled to the host thread.

No. What I said above goes for an OS thread, regardless of what API you use to create and manage the thread. When it exits, any resources on a GPU assigned to the thread will be released.

There is a point that is unclear to me here. I don't allocate the memory inside the thread. I allocate it in the main thread, by calling cudaSetDevice() for each GPU and then allocating the memory. So I think it should only be freed when the main program exits. Why can another CPU thread (the copy thread in my program) free my memory?

No, you don't understand:

thread 1: allocates memory on the GPU (d_mem)
thread 2: d_mem is no longer a valid pointer to device memory, since each thread has a different CUDA context.

So you cannot exchange pointers to device memory between threads.
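
You can check this with a small test like the one below: with one CUDA context per host thread (the behavior being described here), the cudaMemcpy in the second thread fails because d_mem belongs to the first thread's context. The names and sizes are mine, and later CUDA releases changed the context model, so the exact outcome depends on your toolkit version.

#include <pthread.h>
#include <stdio.h>
#include <cuda_runtime.h>

#define N 1024

static float *d_mem;   /* allocated by thread 1, then (incorrectly) used by thread 2 */

void *alloc_thread(void *unused)
{
    cudaSetDevice(0);
    cudaMalloc((void **)&d_mem, N * sizeof(float));
    return NULL;
}

void *copy_thread(void *unused)
{
    float h_buf[N];
    cudaSetDevice(0);
    /* This thread has its own context, so d_mem is not a valid device pointer here. */
    cudaError_t err = cudaMemcpy(h_buf, d_mem, N * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    printf("cudaMemcpy in second thread: %s\n", cudaGetErrorString(err));
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, alloc_thread, NULL);
    pthread_join(t1, NULL);
    pthread_create(&t2, NULL, copy_thread, NULL);
    pthread_join(t2, NULL);
    return 0;
}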

You are right. I wrote a program to check that.

Thank you all, guys.