Multiple GPU memory address problem help

I am trying to implement an algorithm using multiple GPUs, following the simpleMultiGPU example from the CUDA SDK, but I am running into a memory address problem. I create the worker threads the same way the example does:

ThreadList[i] = (HANDLE)_beginthreadex( NULL, 0, &solverThread, plan, 0, NULL );

where “plan” is a struct. In the main thread I allocate mapped memory for the variable “x” with

cudaHostAlloc((void **)&x, sizeof(float) * 2, cudaHostAllocMapped);

Then, once the program enters the function “solverThread”, I call

cutilSafeCall( cudaHostGetDevicePointer((void **)&x_gpu, (void *)x, 0) );

to get the mapped address on the GPU, and the program reports an error from cutilSafeCall (I think).

Meanwhile, if I call cudaHostGetDevicePointer((void **)&x_gpu, (void *)x, 0) in the main thread instead and pass the resulting GPU pointer directly to “solverThread”, the program runs, but the result is wrong compared with calling my kernel function directly from main. So I think the problem is caused by _beginthreadex. Does anyone know how to solve it? Thank you very much!
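Boiled down, the failing pattern looks like this (an untested minimal sketch; device 0, the NULL thread argument and the error printout are just for illustration):

// Minimal sketch of the failing pattern described above.
#include <stdio.h>
#include <windows.h>
#include <process.h>      // _beginthreadex
#include <cuda_runtime.h>

static float *x = NULL;   // mapped host pointer, allocated in main's thread

static unsigned __stdcall solverThread(void *arg)
{
    (void)arg;            // unused in this sketch
    float *x_gpu = NULL;
    cudaSetDevice(0);
    // Fails here (pre-CUDA 4.0): this worker thread gets its own CUDA
    // context, and x was pinned/mapped in the context of main's thread.
    cudaError_t err = cudaHostGetDevicePointer((void **)&x_gpu, (void *)x, 0);
    printf("cudaHostGetDevicePointer: %s\n", cudaGetErrorString(err));
    return 0;
}

int main(void)
{
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **)&x, sizeof(float) * 2, cudaHostAllocMapped);

    HANDLE h = (HANDLE)_beginthreadex(NULL, 0, &solverThread, NULL, 0, NULL);
    WaitForSingleObject(h, INFINITE);
    CloseHandle(h);

    cudaFreeHost(x);
    return 0;
}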

I tried an experiment with the simpleMultiGPU example. The following is what I changed, but it still shows the error at

cutilSafeCall( cudaHostGetDevicePointer((void **)&d_Data, (void *)(plan->h_Data), 0) );

//main part

    cutilSafeCall(cudaSetDeviceFlags(cudaDeviceMapHost));
    // One allocation with both flags combined; calling cudaHostAlloc
    // twice on h_Data would leak the first buffer.
    cutilSafeCall(cudaHostAlloc((void **)&h_Data, DATA_N * sizeof(float),
                                cudaHostAllocMapped | cudaHostAllocPortable));
    for(i = 0; i < DATA_N; i++)
        h_Data[i] = (float)rand() / (float)RAND_MAX;

    for(i = 0; i < GPU_N; i++)
        plan[i].dataN = DATA_N / GPU_N;

    //Take into account "odd" data sizes
    for(i = 0; i < DATA_N % GPU_N; i++)
        plan[i].dataN++;

    //Assign data ranges to GPUs
    gpuBase = 0;
    for(i = 0; i < GPU_N; i++){
        plan[i].device = i;
        plan[i].h_Data = h_Data + gpuBase;
        plan[i].h_Sum  = h_SumGPU + i;
        gpuBase += plan[i].dataN;
    }

    //Start timing of GPU code
    printf("main(): waiting for GPU results...\n");
    cutilCheckError(cutResetTimer(hTimer));
    cutilCheckError(cutStartTimer(hTimer));
    for(i = 0; i < GPU_N; i++)
        threadID[i] = cutStartThread((CUT_THREADROUTINE)solverThread, (void *)(plan + i));

//solverThread part
cutilSafeCall( cudaSetDevice(plan->device) );
cutilSafeCall( cudaSetDeviceFlags(cudaDeviceMapHost));
cutilSafeCall( cudaMalloc((void**)&d_Sum, ACCUM_N * sizeof(float)) );
cutilSafeMalloc( h_Sum = (float *)malloc(ACCUM_N * sizeof(float)) );
cutilSafeCall( cudaHostGetDevicePointer((void **)&d_Data, (void *)(plan->h_Data), 0) );  // <-- this is the call that fails

launch_reduceKernel(d_Sum, d_Data, plan->dataN, BLOCK_N, THREAD_N);


On multiple GPUs, memory used by a KERNEL must be allocated by the SAME thread that launches that kernel. This is a common multi-GPU mistake: each host thread gets its own CUDA context, and an allocation is only valid inside the context that created it.

That means that if I allocate some bytes of cudaHostAllocMapped memory in “main”, I cannot use cudaHostGetDevicePointer to get the device pointer inside a thread created by _beginthreadex, and I also cannot pass a device pointer obtained in “main” directly to that thread.

Yes! Right: for kernels called in THREAD1, all memory allocations have to happen in THREAD1.
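Something like the following keeps every allocation in the worker thread. This is an untested sketch: TGPUplan, ACCUM_N, BLOCK_N, THREAD_N, launch_reduceKernel, cutilSafeCall/cutilSafeMalloc and the SDK headers are assumed to be the ones from the simpleMultiGPU example, and the signature is written for _beginthreadex:

#include <stdlib.h>      // malloc, free
#include <string.h>      // memcpy

static unsigned __stdcall solverThread(void *arg)
{
    TGPUplan *plan = (TGPUplan *)arg;
    float *h_Chunk = NULL;   // mapped host buffer owned by THIS thread
    float *h_Sum   = NULL;
    float *d_Data  = NULL;
    float *d_Sum   = NULL;
    double sum = 0.0;
    int i;

    cutilSafeCall( cudaSetDevice(plan->device) );
    cutilSafeCall( cudaSetDeviceFlags(cudaDeviceMapHost) );

    // Pin + map in the same thread (hence the same context) that will
    // launch the kernel, then copy this GPU's slice of the input into it.
    cutilSafeCall( cudaHostAlloc((void **)&h_Chunk,
                                 plan->dataN * sizeof(float),
                                 cudaHostAllocMapped) );
    memcpy(h_Chunk, plan->h_Data, plan->dataN * sizeof(float));

    // Now the mapped device alias can be queried from this thread.
    cutilSafeCall( cudaHostGetDevicePointer((void **)&d_Data,
                                            (void *)h_Chunk, 0) );
    cutilSafeCall( cudaMalloc((void **)&d_Sum, ACCUM_N * sizeof(float)) );
    cutilSafeMalloc( h_Sum = (float *)malloc(ACCUM_N * sizeof(float)) );

    launch_reduceKernel(d_Sum, d_Data, plan->dataN, BLOCK_N, THREAD_N);

    // Fold the per-block partial sums into this GPU's result slot.
    cutilSafeCall( cudaMemcpy(h_Sum, d_Sum, ACCUM_N * sizeof(float),
                              cudaMemcpyDeviceToHost) );
    for(i = 0; i < ACCUM_N; i++)
        sum += h_Sum[i];
    *(plan->h_Sum) = (float)sum;

    free(h_Sum);
    cutilSafeCall( cudaFree(d_Sum) );
    cutilSafeCall( cudaFreeHost(h_Chunk) );
    return 0;
}

The extra memcpy per thread is the price of keeping the mapped allocation thread-local; the alternative is to skip the shared h_Data in “main” entirely and have each worker thread generate or load its own slice.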

Thank you!