Multiple GPU memory address problem help

I am trying to implement an algorithm using multiple GPUs, following the simpleMultiGPU example from the CUDA SDK, but I am facing a memory address problem. I create the worker threads the same way the example does:

ThreadList[i] = (HANDLE)_beginthreadex(NULL, 0, &solverThread, plan, 0, NULL);

where "plan" is a struct. In the main thread I allocate some memory for the variable "x" with

cudaHostAlloc((void **)&x, sizeof(float) * 2, cudaHostAllocMapped);

Then, once the program enters "solverThread", I call

cutilSafeCall( cudaHostGetDevicePointer((void **)&x_gpu, (void *)x, 0) );

to get the mapped address on the GPU, and the program reports an error (from cutilSafeCall, I think).

Meanwhile, if I call cudaHostGetDevicePointer((void **)&x_gpu, (void *)x, 0) in the main thread instead, and pass the GPU pointer to "solverThread" directly, the program runs, but the result is wrong compared with calling my kernel function directly from main. So I think the problem is caused by _beginthreadex. Does anyone know how to solve it? Thank you very much!
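A stripped-down sketch of what I am doing (CUTIL error checks and cleanup omitted; x, x_gpu and solverThread are the names from my code above, and I am assuming x is visible to the worker thread, e.g. as a global or a field of plan):

```cuda
// main thread: allocate mapped pinned memory, then start the worker
float *x = NULL;
cudaHostAlloc((void **)&x, 2 * sizeof(float), cudaHostAllocMapped);
HANDLE h = (HANDLE)_beginthreadex(NULL, 0, &solverThread, plan, 0, NULL);

// worker thread started by _beginthreadex
unsigned __stdcall solverThread(void *arg)
{
    float *x_gpu = NULL;
    // this is the call that errors when x was allocated in the main thread
    cudaHostGetDevicePointer((void **)&x_gpu, (void *)x, 0);
    // ... launch kernel on x_gpu ...
    return 0;
}
```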

I tried an experiment in simpleMultiGPU. The following is what I changed, but it still shows the error at

cutilSafeCall( cudaHostGetDevicePointer((void **)&d_Data, (void *)(plan->h_Data), 0) );

//main part

    // one allocation with both flags (two separate cudaHostAlloc calls on the
    // same pointer would leak the first buffer)
    cutilSafeCall(cudaHostAlloc((void **)&h_Data, DATA_N * sizeof(float),
                                cudaHostAllocMapped | cudaHostAllocPortable));
    for(i = 0; i < DATA_N; i++)
        h_Data[i] = (float)rand() / (float)RAND_MAX;

    //Subdivide input data across GPUs
    for(i = 0; i < GPU_N; i++)
        plan[i].dataN = DATA_N / GPU_N;
    //Take into account "odd" data sizes
    for(i = 0; i < DATA_N % GPU_N; i++)
        plan[i].dataN++;
    //Assign data ranges to GPUs
    gpuBase = 0;
    for(i = 0; i < GPU_N; i++){
        plan[i].device = i;
        plan[i].h_Data = h_Data + gpuBase;
        plan[i].h_Sum = h_SumGPU + i;
        gpuBase += plan[i].dataN;
    }

//Start timing of GPU code
printf("main(): waiting for GPU results...\n");
    for(i = 0; i < GPU_N; i++)
        threadID[i] = cutStartThread((CUT_THREADROUTINE)solverThread, (void *)(plan + i));

//solverThread part
cutilSafeCall( cudaSetDevice(plan->device) );
cutilSafeCall( cudaSetDeviceFlags(cudaDeviceMapHost) );
cutilSafeCall( cudaMalloc((void **)&d_Sum, ACCUM_N * sizeof(float)) );
cutilSafeMalloc( h_Sum = (float *)malloc(ACCUM_N * sizeof(float)) );
cutilSafeCall( cudaHostGetDevicePointer((void **)&d_Data, (void *)(plan->h_Data), 0) );

launch_reduceKernel(d_Sum, d_Data, plan->dataN, BLOCK_N, THREAD_N);


On multiple GPUs, memory used by a KERNEL must be allocated by the SAME host thread that launches that kernel. This is a common multi-GPU mistake.

That means if I allocate some bytes of cudaHostAllocMapped memory in "main", I cannot use cudaHostGetDevicePointer to get the device pointer in a thread created by _beginthreadex, and I also cannot pass the device pointer obtained in "main" directly to that thread.

Yes! Right. For kernels launched in THREAD1, all the memory allocations have to happen in THREAD1.
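To make that concrete, here is a minimal sketch of a solverThread that follows this rule: the mapped allocation, the device-pointer lookup, and the kernel launch all live in the same worker thread (and therefore the same CUDA context). The TGPUplan fields match the SDK example; h_x, d_x and the kernel launch are placeholders, and the memcpy from the main thread's buffer is just one way to get the input data in:

```cuda
static unsigned __stdcall solverThread(void *arg)
{
    TGPUplan *plan = (TGPUplan *)arg;
    float *h_x = NULL, *d_x = NULL;

    cudaSetDevice(plan->device);
    // the map-host flag must be set before the first call that creates
    // this thread's context (e.g. cudaMalloc or cudaHostAlloc)
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // allocate the mapped buffer in THIS thread, not in main
    cudaHostAlloc((void **)&h_x, plan->dataN * sizeof(float), cudaHostAllocMapped);
    // copy the input out of the buffer that main prepared
    memcpy(h_x, plan->h_Data, plan->dataN * sizeof(float));

    // succeeds, because the allocation belongs to this thread's context
    cudaHostGetDevicePointer((void **)&d_x, (void *)h_x, 0);

    // ... launch the kernel with d_x, cudaThreadSynchronize(), read results ...

    cudaFreeHost(h_x);
    return 0;
}
```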

Thank you!