The structure of my original single-GPU program is:
[codebox]Input<<<grid, block>>>(…);          // set up initial data on the device
for(time = 0; time < maxtime; time++)
{
    kernel1<<<grid, block>>>(…);
    kernel2<<<grid, block>>>(…);
    kernel3<<<grid, block>>>(…);
}
[/codebox]
In moving to multiple GPUs, the kernel calls above have been replaced by calls to the cutil thread functions cutStartThread / cutWaitForThreads, as follows:
[codebox]for(time = 0; time < maxtime; time++)
{
    for(thread = 0; thread < NGPU; thread++)
        threadID[thread] = cutStartThread((CUT_THREADROUTINE)Thread_Kernel1, (void *)(plan + thread));
    cutWaitForThreads(threadID, NGPU);

    for(thread = 0; thread < NGPU; thread++)
        threadID[thread] = cutStartThread((CUT_THREADROUTINE)Thread_Kernel2, (void *)(plan + thread));
    cutWaitForThreads(threadID, NGPU);

    for(thread = 0; thread < NGPU; thread++)
        threadID[thread] = cutStartThread((CUT_THREADROUTINE)Thread_Kernel3, (void *)(plan + thread));
    cutWaitForThreads(threadID, NGPU);
}[/codebox]
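For reference, plan is an array of NGPU per-GPU work descriptors in the style of the SDK's simpleMultiGPU sample. The exact fields below are just my own placeholders, not the SDK's actual struct:
[codebox]// Hypothetical per-GPU work descriptor (field names are my own placeholders).
typedef struct
{
    int    device;              // CUDA device index this host thread should use
    int    n;                   // number of elements this GPU works on
    float *h_x, *h_y, *h_z;     // host input arrays (this GPU's slice)
    float *h_a, *h_b;           // host output arrays (this GPU's slice)
} TGPUplan;

TGPUplan  plan[NGPU];
CUTThread threadID[NGPU];
[/codebox]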
and the thread functions themselves all have the structure:
[codebox]static CUT_THREADPROC Thread_KernelX(…)
{
    // lots of cudaMalloc();
    // some cudaMemcpy(..., cudaMemcpyHostToDevice);
    kernelX<<<grid, block>>>(…);
    // some cudaMemcpy(..., cudaMemcpyDeviceToHost);
    // lots of cudaFree();
    CUT_THREADEND;
}
[/codebox]
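Fleshed out, Thread_Kernel1 currently looks roughly like the sketch below. Array names, sizes, and the grid/block dimensions stand in for my real ones; the cudaSetDevice call is there because each host thread has to bind to its own GPU before making any CUDA calls:
[codebox]static CUT_THREADPROC Thread_Kernel1(TGPUplan *p)
{
    // Each host thread binds to its own GPU first.
    cudaSetDevice(p->device);

    float *d_x, *d_y, *d_z, *d_a, *d_b;
    size_t bytes = p->n * sizeof(float);

    // Device buffers, currently re-allocated every time step.
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_y, bytes);
    cudaMalloc((void **)&d_z, bytes);
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);

    // Inputs host -> device.
    cudaMemcpy(d_x, p->h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, p->h_y, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_z, p->h_z, bytes, cudaMemcpyHostToDevice);

    dim3 block(256);
    dim3 grid((p->n + block.x - 1) / block.x);
    kernel1<<<grid, block>>>(d_x, d_y, d_z, d_a, d_b, p->n);

    // Results device -> host so the other GPUs' threads can read them.
    cudaMemcpy(p->h_a, d_a, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(p->h_b, d_b, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x); cudaFree(d_y); cudaFree(d_z);
    cudaFree(d_a); cudaFree(d_b);

    CUT_THREADEND;
}
[/codebox]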
I am finding this disappointingly slow, and would like to know where the slow points in this structure are and how to speed them up. Note that this scheme spawns and joins 3*NGPU host threads every time step, and that every thread function redoes all of its allocations, copies and frees each time it runs.
For instance, kernel1 reads arrays x, y and z and returns arrays a and b, while kernel2 uses the same values of x, y and z as kernel1, as well as the arrays a and b. There needs to be a synchronisation at the end of kernel1 because all threads globally need to read arrays a and b. But do I need to deallocate x, y and z at the end of the kernel1 thread function, then reallocate x, y and z in the kernel2 one and copy their values across again too? It seems unnecessary. Is it?
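To make the question concrete, what I would like to do instead is something like the sketch below: allocate the device arrays once per GPU before the time loop, keep the device pointers in the plan struct (the d_* fields are hypothetical additions of mine), and only free everything after the loop. I am not sure whether a device pointer allocated in one host thread can legally be reused from a different host thread later on, which is really part of my question:
[codebox]// Hypothetical extra TGPUplan fields: float *d_x, *d_y, *d_z, *d_a, *d_b;

static CUT_THREADPROC Thread_Alloc(TGPUplan *p)   // run once per GPU before the time loop
{
    cudaSetDevice(p->device);
    size_t bytes = p->n * sizeof(float);

    // Allocate once and keep the pointers in the plan for later time steps.
    cudaMalloc((void **)&p->d_x, bytes);
    cudaMalloc((void **)&p->d_y, bytes);
    cudaMalloc((void **)&p->d_z, bytes);
    cudaMalloc((void **)&p->d_a, bytes);
    cudaMalloc((void **)&p->d_b, bytes);

    // x, y and z never change, so copy them up once here as well.
    cudaMemcpy(p->d_x, p->h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(p->d_y, p->h_y, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(p->d_z, p->h_z, bytes, cudaMemcpyHostToDevice);

    CUT_THREADEND;
}
[/codebox]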
And I would like to know a lot more about the cutil threads. How do they work? How much memory do they occupy on the GPU for thread management? Do they need to be destroyed? Is there a more efficient way of employing them than what I have done so far?
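For context, my current understanding, from reading the SDK's multithreading.cpp (so please correct me if I have misread it), is that cutStartThread is just a thin wrapper over an ordinary host thread, roughly as below. If that is right, the threads live entirely on the host and occupy no GPU memory at all:
[codebox]// Simplified sketch of what multithreading.cpp appears to do on Linux
// (the Windows build uses CreateThread / WaitForMultipleObjects instead).
#include <pthread.h>

typedef pthread_t CUTThread;
typedef void *(*CUT_THREADROUTINE)(void *);

CUTThread cutStartThread(CUT_THREADROUTINE func, void *data)
{
    pthread_t thread;
    pthread_create(&thread, NULL, func, data);
    return thread;
}

void cutWaitForThreads(const CUTThread *threads, int num)
{
    for(int i = 0; i < num; i++)
        pthread_join(threads[i], NULL);   // join also releases the thread's resources
}
[/codebox]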