MultiGPUs newbie question: Data transformation problem

Linh_Ha · March 11, 2008, 4:10pm

I have two GPUs, and they both work.

In my program, I transfer data from each device's memory to the host:

for (int id = 0; id < 2; ++id) {
    cudaSetDevice(id);
    cudaMemcpy(hostDeformImage[id], dataGPU[id].d_template3D, sizeof(float) * nElems, cudaMemcpyDeviceToHost);
}

Because I want to run it in parallel, I defined a download function and created two threads:

void* download(void* deviceId) {
    int* pid = (int*)deviceId;
    int id = *pid;
    cudaSetDevice(id);
    cudaMemcpy(hostDeformImage[id], dataGPU[id].d_template3D, sizeof(float) * nElems, cudaMemcpyDeviceToHost);
    pthread_exit((void*) 0);
}

pthread_t thread[MAX_NUM_INPUTS];
int threadIds[MAX_NUM_INPUTS];
pthread_attr_t attr;
void* status;
int rc;

for (int i = 0; i < 2; ++i) threadIds[i] = i;

pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

for (int id = 0; id < 2; ++id)
    rc = pthread_create(&thread[id], &attr, download, (void*)&threadIds[id]);

for (int id = 0; id < 2; ++id) {
    rc = pthread_join(thread[id], &status);
}

It seems pretty easy. However, the parallel version does not give the same result as the sequential version.

Can anyone tell me what the problem is and how to solve it?

Thank you

jimh · March 11, 2008, 7:28pm

These segments of code aren’t enough to say what the problem is. Where and how is hostDeformImage[id] allocated? How are you running the kernel that performs the computation?

Linh_Ha · March 11, 2008, 9:47pm

What I show here is exactly what I changed in my code, nothing else. I even call cudaThreadSynchronize() to make sure the data transfer has completed before I perform any computation. So if the two versions were equivalent, my implementation would give exactly the same result.

Each hostDeformImage[id] buffer is allocated separately on the CPU side; no two of them overlap.

Is that clear enough?

jimh · March 11, 2008, 10:01pm

Typo - I meant to ask where is dataGPU[id].d_template3D allocated?

To do multi-GPU, you need to spawn a thread that grabs the GPU with cudaSetDevice(), allocates memory on the GPU, runs the computation, and copies the results back to the host. The thread cannot exit between any of these steps.

I asked for more code because I can’t tell if you’re doing that from the code you posted; it appears you are not. Maybe the description above is enough to point you in the right direction.

Linh_Ha · March 12, 2008, 12:38am

Thank you, your question is clearer now. dataGPU[i].d_template3D was allocated on each GPU. I'm pretty sure it is there because the version without threads runs correctly and gives the same answer as the single-GPU solution.

jimh · March 12, 2008, 7:53pm

I was asking where in your code is the GPU memory allocated, not where the buffer is located physically.

Still, I think you’re missing my point. You need one thread per GPU for allocation, copying, computation, and copying the results back to the CPU. Each thread may NOT exit between these steps. GPU resources are freed at thread exit because cudaThreadExit is called implicitly.
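A minimal sketch of the one-thread-per-GPU structure described above, assuming pthreads on the host. The names scaleKernel, WorkerArgs, and gpuWorker are hypothetical and do not come from the original program, and the kernel is only a stand-in for the real computation. Each worker binds to its GPU, allocates, uploads, computes, downloads, and frees without exiting in between, so no device pointer ever crosses a thread boundary.

```cuda
#include <cstdio>
#include <vector>
#include <pthread.h>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real computation: doubles each element.
__global__ void scaleKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical per-GPU job description (not from the original post).
struct WorkerArgs {
    int         device;   // GPU index passed to cudaSetDevice
    float*      hostBuf;  // host buffer owned by the caller
    int         nElems;   // number of floats in hostBuf
    cudaError_t status;   // first error seen by the worker
};

// Runs entirely in one host thread: bind, allocate, upload, compute, download, free.
void* gpuWorker(void* p) {
    WorkerArgs* a = static_cast<WorkerArgs*>(p);
    size_t bytes = sizeof(float) * a->nElems;
    float* d_buf = nullptr;

    a->status = cudaSetDevice(a->device);
    if (a->status == cudaSuccess)
        a->status = cudaMalloc((void**)&d_buf, bytes);
    if (a->status == cudaSuccess)
        a->status = cudaMemcpy(d_buf, a->hostBuf, bytes, cudaMemcpyHostToDevice);
    if (a->status == cudaSuccess) {
        int threads = 256;
        int blocks = (a->nElems + threads - 1) / threads;
        scaleKernel<<<blocks, threads>>>(d_buf, a->nElems);
        a->status = cudaGetLastError();  // reports launch failures
    }
    if (a->status == cudaSuccess)        // blocking copy also waits for the kernel
        a->status = cudaMemcpy(a->hostBuf, d_buf, bytes, cudaMemcpyDeviceToHost);
    if (d_buf) cudaFree(d_buf);
    return nullptr;
}

int main() {
    int nDevices = 0;
    if (cudaGetDeviceCount(&nDevices) != cudaSuccess || nDevices == 0) {
        printf("no CUDA device found\n");
        return 1;
    }
    const int nElems = 1 << 20;
    std::vector<std::vector<float>> host(nDevices, std::vector<float>(nElems, 1.0f));
    std::vector<WorkerArgs> args(nDevices);
    std::vector<pthread_t> threads(nDevices);

    for (int i = 0; i < nDevices; ++i) {
        args[i] = {i, host[i].data(), nElems, cudaSuccess};
        pthread_create(&threads[i], nullptr, gpuWorker, &args[i]);
    }
    for (int i = 0; i < nDevices; ++i) {
        pthread_join(threads[i], nullptr);
        printf("GPU %d: %s, first element = %f\n", i,
               cudaGetErrorString(args[i].status), host[i][0]);
    }
    return 0;
}
```

Only host pointers and plain values are shared with the main thread here. Building it would need something like `nvcc` with pthread linking enabled; the exact flags depend on the toolchain.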

The code you posted does not do this, and since I haven’t seen the rest of your code, I can only guess that is the problem.

Linh_Ha · March 13, 2008, 10:53pm

I don’t use CUDA threads; I use pthreads instead, so the memory should not be freed. Am I right?

hqyang · March 14, 2008, 12:47am

I think you need to understand the concept of a CUDA “context” and the relation between a CUDA “context” and a “thread”, even if you use the runtime API. Please read the programming guide carefully.

Good luck

eelsen · March 14, 2008, 1:10am

No. The CUDA thread class used in the multi-gpu example is just a thin wrapper on top of pthreads if you look at it…

For better or worse, CUDA is very tightly coupled to the host thread.

jimh · March 14, 2008, 5:05pm

No. What I said above goes for an OS thread, regardless of what API you use to create and manage the thread. When it exits, any resources on a GPU assigned to the thread will be released.

Linh_Ha · March 15, 2008, 4:56pm

One point is still unclear to me. I don’t allocate the memory inside the copy thread. I allocate it in the main thread, by calling cudaSetDevice for each GPU and then allocating the memory. So I think it should only be freed when the main program exits. Why can another CPU thread (the copy thread in my program) free my memory?

DenisR · March 15, 2008, 7:33pm

No, you don’t understand:

thread 1 : allocate memory on GPU (d_mem)
thread 2 : d_mem is no longer a valid pointer to memory on the device, since each thread has a different CUDA context.

So you cannot exchange pointers to device memory between threads.
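The download thread in the original post ignores the value returned by cudaMemcpy, so a copy from a pointer that is invalid in the calling thread's context would go unnoticed. A minimal sketch of checking every runtime call follows; the CUDA_CHECK macro is a hypothetical helper, not part of the CUDA API, and the buffer size is arbitrary.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print the failing call's location and error string, then exit.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

int main() {
    const int nElems = 1024;  // arbitrary size for illustration
    float host[nElems];
    float* d_mem = nullptr;

    // Allocation and copy happen in the same host thread, so d_mem is valid here.
    CUDA_CHECK(cudaSetDevice(0));
    CUDA_CHECK(cudaMalloc((void**)&d_mem, sizeof(float) * nElems));
    CUDA_CHECK(cudaMemset(d_mem, 0, sizeof(float) * nElems));
    CUDA_CHECK(cudaMemcpy(host, d_mem, sizeof(float) * nElems, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_mem));

    printf("copy succeeded\n");
    return 0;
}
```

Wrapping the cudaSetDevice and cudaMemcpy calls inside a worker thread the same way would report a failed copy at the call site instead of leaving the host buffer with stale contents.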

Linh_Ha · March 18, 2008, 9:23am

You are right. I wrote a program to check that.

Thank you all.
