Problem with a multi-GPU application

Hi all,

I am trying to write a simple multi-threaded program that works with multiple GPUs. I have defined two thread functions; each one copies two vectors to a GPU and multiplies them using a multiplication kernel that I defined. I am using the Windows API to create the threads and events and to synchronize them. Because I want to simulate a time-domain problem in which the values of the variables change at each time step, I wrote a while-loop: in each iteration the input vectors mentioned above are incremented, and then the thread functions carry on with the multiplication. Here is one of the thread functions (the other one is exactly the same):

static void Thread1 (LPVOID lpParam){
	float *d_Data1, *d_Data2, *d_Mul;
	cublasStatus status;

	status = cublasInit();
	status = cublasAlloc(plan[0].dataN * 1, sizeof(float), (void**)&d_Data1);
	status = cublasAlloc(plan[0].dataN * 1, sizeof(float), (void**)&d_Data2);
	status = cublasAlloc(plan[0].dataN * 1, sizeof(float), (void**)&d_Mul);

	while (true){
		WaitForMultipleObjects(MAX_GPU_COUNT, StartEvent, TRUE, INFINITE);

		CUDA_SAFE_CALL(cudaMemcpy(d_Data1, plan[0].pointA, sizeof(float)*32, cudaMemcpyHostToDevice));
		CUDA_SAFE_CALL(cudaMemcpy(d_Data2, plan[0].pointB, sizeof(float)*32, cudaMemcpyHostToDevice));

		vecDOTvec(d_Data1, d_Data2, d_Mul, plan[0].dataN);

		status = cublasGetVector(plan[0].dataN,sizeof(float), d_Mul, 1, plan[0].AtimesB, 1);

and here is the while-loop:

	for (int i=0; i<MAX_GPU_COUNT; i++){
		hEvent[i] = CreateEvent( NULL, TRUE, TRUE, NULL );
		StartEvent[i] = CreateEvent( NULL, TRUE, TRUE, NULL );
	}

	//start two new threads
	_beginthread( Thread1, 0, NULL );
	_beginthread( Thread2, 0, NULL );

	while (i<4){
		for (int j=0; j<MAX_GPU_COUNT; j++)

		WaitForMultipleObjects(MAX_GPU_COUNT, hEvent, TRUE, INFINITE);

		for (int j=0; j<MAX_GPU_COUNT; j++)	

		printf("\n Thread1, round: %d \n",i);
		printf("vec_A	 vec_B 	   AxB \n");
		for(int i=0; i<32; ++i)
			printf("%f %f %f\n",plan[0].pointA[i], plan[0].pointB[i], plan[0].AtimesB[i]);

		printf("\n Thread2, round: %d \n",i);
		for(int i=0; i<32; ++i)
			printf("%f %f %f\n",plan[1].pointA[i], plan[1].pointB[i], plan[1].AtimesB[i]);		

		for(int j = 0; j < 32; j++){
			plan[0].pointA[j] = plan[0].pointA[j]+1;
			plan[1].pointA[j] = plan[1].pointA[j]+1;

Although the two thread functions are identical, I found that Thread1's output lags in the second iteration of the while-loop. In round 2, even though Thread1's input has changed, its result does not change and it returns the round 1 output, while Thread2 works fine. Because the problem is hard to explain in words, I have attached the files here.

I would be grateful if anybody could take a look at this code, or run it on their own devices, and let me know what is wrong with the program. (Please look at the output under "Thread1, round 2".)
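
For reference, the per-round hand-off between the main loop and the two workers described above can be sketched in portable C++. This is only a sketch under assumptions: `std::thread` and a condition variable stand in for the Win32 events, a plain loop stands in for the GPU work, and `Plan`, `run_rounds`, `round_started` and `done_count` are hypothetical names that do not appear in the program.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-GPU plan: two inputs and one output.
struct Plan {
    std::vector<float> a, b, product;
};

// Runs `rounds` time steps with one worker per plan. For every round it
// records product[0] of each plan as seen by the main loop, then adds 1
// to every element of `a`, as the main loop in the post does.
std::vector<std::vector<float>> run_rounds(std::vector<Plan>& plans, int rounds) {
    std::mutex m;
    std::condition_variable cv;
    int round_started = 0;  // latest round released by the main loop
    int done_count = 0;     // workers that finished the current round
    bool quit = false;

    for (Plan& p : plans) {
        assert(!p.a.empty() && p.a.size() == p.b.size());
        p.product.assign(p.a.size(), 0.0f);
    }

    auto worker = [&](Plan& p) {
        int seen = 0;  // last round this worker processed
        for (;;) {
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return quit || round_started > seen; });
                if (round_started == seen) return;  // quit, nothing pending
                seen = round_started;
            }
            // Stand-in for the copy, kernel launch and read-back.
            for (std::size_t i = 0; i < p.a.size(); ++i)
                p.product[i] = p.a[i] * p.b[i];
            {
                std::lock_guard<std::mutex> lock(m);
                ++done_count;  // report this round as finished
            }
            cv.notify_all();
        }
    };

    std::vector<std::thread> threads;
    for (Plan& p : plans) threads.emplace_back(worker, std::ref(p));

    std::vector<std::vector<float>> seen_by_main;
    for (int r = 1; r <= rounds; ++r) {
        {
            std::lock_guard<std::mutex> lock(m);
            done_count = 0;
            round_started = r;  // release every worker for round r
        }
        cv.notify_all();
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] {
                return done_count == static_cast<int>(plans.size());
            });
        }
        // All workers are idle here: read the outputs, then update the inputs.
        std::vector<float> row;
        for (Plan& p : plans) {
            row.push_back(p.product[0]);
            for (float& x : p.a) x += 1.0f;
        }
        seen_by_main.push_back(row);
    }

    {
        std::lock_guard<std::mutex> lock(m);
        quit = true;
    }
    cv.notify_all();
    for (std::thread& t : threads) t.join();
    return seen_by_main;
}
```

In this scheme the main loop reads `product` and updates `a` only after every worker has reported the current round, and a worker cannot start the next round before the main loop releases it.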

Since the forum does not let me upload CUDA files, I am pasting the code here:

this is file:

#include <stdlib.h>
#include <stdio.h>
#include <cutil.h>
#include <>

extern "C" void vecDOTvec(float* DEVICE1, float* DEVICE2, float* RESULT, int N)
{
	vecDOTvecKernel<<< (N+127)/128, 128 >>>(DEVICE1, DEVICE2, RESULT, N);
	CUT_CHECK_ERROR("Kernel execution failed");
}


and this is file:



__global__ void
vecDOTvecKernel(float* DEVICE1, float* DEVICE2, float* RESULT, const int N)

	int idx = threadIdx.x + blockDim.x * blockIdx.x;
	if (idx < N)	


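Assuming the kernel computes an element-wise product guarded by `idx < N` (an assumption; the post only describes it as a multiplication kernel), the indexing and the `(N+127)/128` launch configuration can be emulated on the host. The names `grid_size` and `emulate_vecDOTvec` are hypothetical and exist only for this sketch.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed to cover N elements, rounding up.
// With block = 128 this is the (N+127)/128 used in the launch.
int grid_size(int N, int block = 128) {
    return (N + block - 1) / block;
}

// Host emulation of the launch: every (block, thread) pair computes idx the
// same way the kernel does, and the idx < N guard skips the padding threads
// of the last block. The element-wise product in the body is an assumption.
std::vector<float> emulate_vecDOTvec(const std::vector<float>& a,
                                     const std::vector<float>& b) {
    assert(a.size() == b.size());
    const int N = static_cast<int>(a.size());
    const int blockDim = 128;
    const int gridDim = grid_size(N, blockDim);
    std::vector<float> result(a.size(), 0.0f);
    for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx) {
        for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx) {
            const int idx = threadIdx + blockDim * blockIdx;
            if (idx < N)
                result[idx] = a[idx] * b[idx];
        }
    }
    return result;
}
```

For the 32-element vectors printed in the main loop this gives a single block of 128 threads, of which 96 do nothing because of the guard.
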
Thanks for your time.
I guess the problem is not in the threading, because if I replace the GPU and CUDA code in each thread function (i.e. Thread1 and Thread2) with a for-loop that multiplies the two vectors, it works fine for both threads. The problem comes from CUDA: even if I free the vectors at the end of the thread function and re-allocate them at the beginning, the output in the second round still holds the previous round's values!
If you take a look, you can see that it works fine for rounds 1, 3, 4, …; it fails only in round 2 of Thread1!
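
A CPU loop like the one mentioned above can also be used to check each round's GPU output automatically instead of reading the printed table. The following is a minimal sketch; `first_mismatch` and its tolerance are hypothetical and not part of the program, and the pointers are meant to be arrays such as `plan[0].pointA`, `plan[0].pointB` and `plan[0].AtimesB`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Compares result[i] against a[i] * b[i] computed on the host.
// Returns the index of the first element that differs by more than a
// relative tolerance, or -1 if all n elements agree.
int first_mismatch(const float* a, const float* b, const float* result,
                   int n, float rel_tol = 1e-6f) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        const float expected = a[i] * b[i];
        const float scale = std::max(1.0f, std::fabs(expected));
        if (std::fabs(result[i] - expected) > rel_tol * scale)
            return i;
    }
    return -1;
}
```

A stale result of the kind described here, where the output still belongs to the previous round's inputs, would show up as a mismatch at index 0 as soon as the inputs have been incremented.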

I found the problem. I have a GTX 280 (device 0) and a Tesla S1070 unit (devices 1 to 4). My program was using devices 0 and 1 (i.e. the GTX 280 card and one GPU of the Tesla unit), which causes some delay or loss of synchronization; when I changed it to devices 1 and 2 (i.e. not using the GTX 280), it worked well.