Hi all,
I am trying to write a simple multi-threaded program that works with multiple GPUs. The program defines two thread functions; each one copies two vectors to a GPU and multiplies them element-wise with a small multiplication kernel. I am using Windows API calls to create the threads and events and to synchronize them. Since I want to simulate a time-domain problem in which the variables change at every time step, the main thread runs a while-loop: at each iteration the input vectors are incremented, and the thread functions then repeat their multiplication. Here is one of the thread functions (the other one is identical except that it uses device 1 and plan[1]):
static void Thread1(LPVOID lpParam){
    CUDA_SAFE_CALL(cudaSetDevice(0));
    float *d_Data1, *d_Data2, *d_Mul;
    cublasStatus status;
    status = cublasInit();
    status = cublasAlloc(plan[0].dataN, sizeof(float), (void**)&d_Data1);
    status = cublasAlloc(plan[0].dataN, sizeof(float), (void**)&d_Data2);
    status = cublasAlloc(plan[0].dataN, sizeof(float), (void**)&d_Mul);
    while (true){
        WaitForMultipleObjects(MAX_GPU_COUNT, StartEvent, TRUE, INFINITE);
        CUDA_SAFE_CALL(cudaMemcpy(d_Data1, plan[0].pointA, sizeof(float)*32, cudaMemcpyHostToDevice));
        CUDA_SAFE_CALL(cudaMemcpy(d_Data2, plan[0].pointB, sizeof(float)*32, cudaMemcpyHostToDevice));
        vecDOTvec(d_Data1, d_Data2, d_Mul, plan[0].dataN);
        CUDA_SAFE_CALL(cudaThreadSynchronize());
        status = cublasGetVector(plan[0].dataN, sizeof(float), d_Mul, 1, plan[0].AtimesB, 1);
        ResetEvent(StartEvent[0]);
        SetEvent(hEvent[0]);
    }
    cublasFree(d_Data1);
    cublasFree(d_Data2);
    cublasFree(d_Mul);
}
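For reference, this is the start/done handshake I am trying to implement, sketched portably with std::condition_variable instead of Win32 events. All names here (Handshake, run_worker, run_round) are just for this sketch and are not in my attached files:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// One flag pair per worker: "start" is raised by the controller,
// "done" is raised by the worker; both are protected by one mutex.
struct Handshake {
    std::mutex m;
    std::condition_variable cv;
    bool start = false;  // controller -> worker
    bool done  = false;  // worker -> controller
};

// Worker: wait for start, do the multiply, signal done. Stands in for
// the per-GPU thread function; the "kernel" is an elementwise multiply.
void run_worker(Handshake& hs, std::vector<float>& a,
                const std::vector<float>& b, std::vector<float>& out,
                int rounds) {
    for (int r = 0; r < rounds; ++r) {
        std::unique_lock<std::mutex> lk(hs.m);
        hs.cv.wait(lk, [&]{ return hs.start; });
        hs.start = false;                      // consume the start signal
        for (size_t j = 0; j < a.size(); ++j)
            out[j] = a[j] * b[j];
        hs.done = true;
        hs.cv.notify_all();                    // wake the controller
    }
}

// Controller side of one round: raise start, then block until done.
void run_round(Handshake& hs) {
    std::unique_lock<std::mutex> lk(hs.m);
    hs.start = true;
    hs.cv.notify_all();
    hs.cv.wait(lk, [&]{ return hs.done; });
    hs.done = false;                           // consume the done signal
}
```

The point of the pattern is that the controller only updates the input vectors after run_round returns, so the worker never reads half-updated inputs.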
and here is the main loop (the event setup plus the while-loop):
for (int i = 0; i < MAX_GPU_COUNT; i++){
    hEvent[i] = CreateEvent(NULL, TRUE, TRUE, NULL);
    StartEvent[i] = CreateEvent(NULL, TRUE, TRUE, NULL);
}
// start two new threads
_beginthread(Thread1, 0, NULL);
_beginthread(Thread2, 0, NULL);
Sleep(3000);
int i = 0;
while (i < 4){
    for (int j = 0; j < MAX_GPU_COUNT; j++)
        SetEvent(StartEvent[j]);
    ++i;
    WaitForMultipleObjects(MAX_GPU_COUNT, hEvent, TRUE, INFINITE);
    for (int j = 0; j < MAX_GPU_COUNT; j++)
        ResetEvent(hEvent[j]);
    printf("\n Thread1, round: %d \n", i);
    printf("vec_A    vec_B    AxB \n");
    for (int k = 0; k < 32; ++k)
        printf("%f    %f    %f\n", plan[0].pointA[k], plan[0].pointB[k], plan[0].AtimesB[k]);
    printf("\n Thread2, round: %d \n", i);
    for (int k = 0; k < 32; ++k)
        printf("%f    %f    %f\n", plan[1].pointA[k], plan[1].pointB[k], plan[1].AtimesB[k]);
    for (int j = 0; j < 32; j++){
        plan[0].pointA[j] = plan[0].pointA[j] + 1;
        plan[1].pointA[j] = plan[1].pointA[j] + 1;
    }
}
Although the two thread functions are identical, I found that Thread1 lags by one iteration starting from the second round of the while-loop: even though Thread1's input has changed, its result does not change and it prints round 1's output, while Thread2 works completely fine. Because this is a little confusing to explain in words, I have attached the files here.
I would be thankful if anybody could take a look at this code, or run it on his/her devices, and let me know what is wrong. (Please look at the output results under "Thread1, round: 2".)
Since the forum doesn't let me upload .cu files, I am pasting them here.
This is the cppIntegration.cu file:
#include <stdlib.h>
#include <stdio.h>
#include <cutil.h>
#include <cppIntegration_kernel.cu>

extern "C" void vecDOTvec(float* DEVICE1, float* DEVICE2, float* RESULT, int N)
{
    vecDOTvecKernel<<< (N + 127) / 128, 128 >>>(DEVICE1, DEVICE2, RESULT, N);
    CUT_CHECK_ERROR("Kernel execution failed");
}
and this is the cppIntegration_kernel.cu file:
#ifndef _CPP_INTEGRATION_KERNEL_H_
#define _CPP_INTEGRATION_KERNEL_H_

__global__ void
vecDOTvecKernel(float* DEVICE1, float* DEVICE2, float* RESULT, const int N)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < N)
        RESULT[idx] = DEVICE1[idx] * DEVICE2[idx];
}

#endif
Thanks for your time.
main.cpp (4.44 KB)