Problem with multiGPU code using cuda Threads to launch the same kernel to multiple GPUs


I've run into the following strange problem.
I wrote code for 2D correlation which runs perfectly on one device.
I am trying to divide the problem and distribute it across 2 identical devices.
For that purpose I launch, from the main function, 2 threads using the
For sizes up to 100x100 matrices the program runs fine.
The code also runs correctly when launching 1 thread (without actually
dividing the problem).
When running 2 threads on bigger sizes I get the following:
Both threads are launched.
Data is transferred correctly to both devices.
One thread returns correct results after calling the kernel function.
The other returns zeros, not only for the Outcome but also for the input data
which, as said, was transferred correctly.
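
To make the split concrete, here is a minimal host-side sketch of how the rows could be divided among devices. It is not the code from this post: `Slice` and `make_slice` are hypothetical names, and it assumes the outcome has one row per valid template position, so that output row r reads padded-frame rows r through r + template_rows - 1. Under that assumption each device needs a few overlapping "halo" rows beyond its own share.

```c
#include <assert.h>

/* Hypothetical description of one device's share of the work. */
typedef struct {
    int out_row_begin;  /* first output row computed by this device      */
    int out_rows;       /* number of output rows computed by this device */
    int in_row_begin;   /* first padded-frame row the device reads       */
    int in_rows;        /* padded-frame rows read, halo rows included    */
} Slice;

/* Split out_rows_total output rows as evenly as possible across
   num_devices devices and return the share of device dev.
   Assumption: output row r reads padded rows r .. r + template_rows - 1. */
Slice make_slice(int out_rows_total, int template_rows,
                 int num_devices, int dev)
{
    assert(num_devices > 0 && dev >= 0 && dev < num_devices);

    int base  = out_rows_total / num_devices;
    int extra = out_rows_total % num_devices;  /* first `extra` devices get one more row */

    Slice s;
    s.out_rows      = base + (dev < extra ? 1 : 0);
    s.out_row_begin = dev * base + (dev < extra ? dev : extra);
    s.in_row_begin  = s.out_row_begin;
    s.in_rows       = s.out_rows > 0 ? s.out_rows + template_rows - 1 : 0;
    return s;
}
```

With such a layout, the per-device byte counts would be the row counts multiplied by the row width and `sizeof(float)`. A mismatch between the host offset and the byte count passed to cudaMemcpy is one thing worth ruling out when a problem only shows up above a certain size.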

Any thoughts on how to solve the problem?

PS: If needed I can post part of the code.

  1. How do you verify that the data was transferred correctly to both GPUs?
  2. Are you following one of the multi-GPU samples in the SDK? If so, what’s different between your code and the sample?


To transfer the data, I allocate memory in the CUT_THREADPROC routine using cudaMalloc and then copy the data to the GPUs using cudaMemcpy. So each thread does the same thing for its own device and returns the data to a different matrix, which is declared as a pointer in a structure.

I do that with the following code:




printf("The device running now is %d\n", dev);

/* Allocate this device's input and output buffers. */
CUDA_SAFE_CALL( cudaMalloc((void**)&d_Pad_Frame, str->PAD_FRAME_SIZE_PER_DEVICE));
CUDA_SAFE_CALL( cudaMalloc((void**)&d_OutcomeGPU, str->OUTCOME_SIZE_PER_DEVICE));

/* Timed upload: template to the device symbol, frame slice to device memory. */
gettimeofday (&str->start_in, NULL);
CUDA_SAFE_CALL( cudaMemcpyToSymbol(d_Template, h_Template, TEMPLATE_SIZE));
CUDA_SAFE_CALL( cudaMemcpy(d_Pad_Frame, str->h_Pad_Frame, str->PAD_FRAME_SIZE_PER_DEVICE, cudaMemcpyHostToDevice));
gettimeofday (&str->end_in, NULL);

/* Copy the frame slice back to the host and print it to check the upload. */
CUDA_SAFE_CALL( cudaMemcpy(str->test_Pad_Frame, d_Pad_Frame, str->PAD_FRAME_SIZE_PER_DEVICE, cudaMemcpyDeviceToHost) );
Printing_Matrix(PAD_FRAME_M, str->PAD_FRAME_SIZE_PER_DEVICE/(PAD_FRAME_M * sizeof(float)), str->test_Pad_Frame);


So in effect I copy the data back from each device and print the matrices.
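
Printed matrices get hard to check by eye once the sizes grow, so a programmatic comparison may be easier. The following is only a sketch: `first_mismatch` is a hypothetical helper, and it assumes both buffers are plain `float` arrays of the same length.

```c
#include <stddef.h>

/* Returns the index of the first element where the two buffers differ by
   more than tol, or -1 if all n elements match. A NaN in either buffer
   counts as a mismatch. */
long first_mismatch(const float *expected, const float *actual,
                    size_t n, float tol)
{
    for (size_t i = 0; i < n; ++i) {
        float diff = expected[i] - actual[i];
        if (diff < 0.0f)
            diff = -diff;
        if (!(diff <= tol))   /* written this way so NaN fails the check */
            return (long)i;
    }
    return -1;
}
```

In the snippet above, the call after the device-to-host copy could look like `first_mismatch(str->h_Pad_Frame, str->test_Pad_Frame, str->PAD_FRAME_SIZE_PER_DEVICE / sizeof(float), 0.0f)`. A tolerance of zero is reasonable for a plain round-trip copy, which should return the same values.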

As for your second question: to launch the threads I follow the steps of the MonteCarloMultiGpu sample from the NVIDIA CUDA SDK.

If you know pthreads I would recommend using them. I had a similar problem until mranderson42 pointed out that cutthreads are simply wrappers around pthreads. Using pthreads gives you so much more control and functionality.
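
A rough sketch of the one-thread-per-device pattern with plain pthreads might look like the following. All names here (`DevicePlan`, `device_worker`, `run_on_devices`, `MAX_DEVICES`) are hypothetical, and a host loop stands in for the GPU work so the example runs without CUDA. In real code the worker would call cudaSetDevice(plan->device) before any other CUDA call and then do its own cudaMalloc, cudaMemcpy and kernel launch.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_DEVICES 8

/* Hypothetical per-thread plan: everything one device thread needs. */
typedef struct {
    int device;          /* device index this thread would select   */
    const float *input;  /* this device's slice of the host input   */
    float *output;       /* this device's slice of the host output  */
    size_t count;        /* number of elements in the slice         */
    int status;          /* set to 0 by the worker on success       */
} DevicePlan;

/* Worker body. The loop is a placeholder for the per-device CUDA work. */
static void *device_worker(void *arg)
{
    DevicePlan *plan = (DevicePlan *)arg;
    for (size_t i = 0; i < plan->count; ++i)
        plan->output[i] = 2.0f * plan->input[i];
    plan->status = 0;
    return NULL;
}

/* Launch one thread per device over contiguous slices of the input and
   wait for all of them. Returns 0 on success, -1 on any failure. */
int run_on_devices(const float *input, float *output,
                   size_t n, int num_devices)
{
    pthread_t threads[MAX_DEVICES];
    DevicePlan plans[MAX_DEVICES];
    if (num_devices < 1 || num_devices > MAX_DEVICES)
        return -1;

    size_t chunk = (n + (size_t)num_devices - 1) / (size_t)num_devices;
    int started = 0;
    int rc = 0;
    for (int d = 0; d < num_devices; ++d) {
        size_t begin = (size_t)d * chunk;
        if (begin > n) begin = n;
        size_t end = begin + chunk;
        if (end > n) end = n;

        plans[d].device = d;
        plans[d].input  = input + begin;
        plans[d].output = output + begin;
        plans[d].count  = end - begin;
        plans[d].status = -1;
        if (pthread_create(&threads[d], NULL, device_worker, &plans[d]) != 0) {
            rc = -1;
            break;
        }
        ++started;
    }
    /* Join only the threads that were actually created. */
    for (int d = 0; d < started; ++d) {
        pthread_join(threads[d], NULL);
        if (plans[d].status != 0)
            rc = -1;
    }
    return rc;
}
```

Each thread gets its own plan and writes only to its own slice, so no locking is needed for the results. On most systems this is compiled with the `-pthread` flag.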

You could also use OpenMP.