Can I allocate memory just once with multiple GPUs?

The structure of my program is


for(time=0 ; time<maxtime ; time++)







In moving to multiple GPUs, the kernel calls above have been replaced by calls to cutthread functions as follows:

for(time=0 ; time<maxtime ; time++)
{
    for(thread=0; thread<NGPU; thread++)
        threadID[thread] = cutStartThread((CUT_THREADROUTINE)Thread_Kernel1, (void *)(plan + thread));
    cutWaitForThreads(threadID, NGPU);

    for(thread=0; thread<NGPU; thread++)
        threadID[thread] = cutStartThread((CUT_THREADROUTINE)Thread_Kernel2, (void *)(plan + thread));
    cutWaitForThreads(threadID, NGPU);

    for(thread=0; thread<NGPU; thread++)
        threadID[thread] = cutStartThread((CUT_THREADROUTINE)Thread_Kernel3, (void *)(plan + thread));
    cutWaitForThreads(threadID, NGPU);
}

and the cutthread functions have the structure

static CUT_THREADPROC Thread_kernelX(…)
{
    lots of cudaMalloc();
    some cudaMemcpy(hosttodevice);

    some cudaMemcpy(devicetohost);
    lots of cudaFree();
}



I am finding this disappointingly slow, and would like to know the slow points in this structure and how to speed them up.

For instance, kernel 1 uses arrays x, y and z and returns arrays a and b, while kernel 2 uses the same values of x, y and z as kernel 1 as well as the arrays a and b. There needs to be a synchronisation at the end of kernel 1 because all threads globally need to read arrays a and b. But do I need to deallocate x, y and z at the end of kernel 1 and then reallocate x, y and z in kernel 2 AND pass their values too? It seems unnecessary. Is it?

And I would like to know a lot more about cutthreads. How do they work? How much memory do they occupy on the GPU for thread management? Do they need to be destroyed? Is there a more efficient way of employing them than what I have done so far?

Initialization and memory allocation are extremely time-consuming, so I’m not surprised it is slow. What you want is:

input data

launch as many threads as there are GPUs
// in each thread

for(time=0; time<maxtime; time++)





( inter-thread communication, if needed)
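The structure outlined above could look roughly like the following in plain C with pthreads. This is only a minimal sketch under stated assumptions: `WorkerPlan`, `worker`, `run_workers` and the values of `NGPU`, `MAXTIME` and `N` are hypothetical names made up for illustration, and host `malloc`/`free` stand in for the device allocations so that the sketch runs without a GPU. In a real program the commented spots are where `cudaSetDevice`, `cudaMalloc`, `cudaMemcpy`, the kernel launches and `cudaFree` would go.

```c
#include <pthread.h>
#include <stdlib.h>

#define NGPU    2      /* assumed GPU count: one host thread per GPU */
#define MAXTIME 100    /* assumed number of time steps */
#define N       1024   /* assumed array length */

/* Hypothetical per-thread plan, standing in for the poster's plan[] entries. */
typedef struct {
    int   device;       /* GPU index this thread would bind to */
    int   alloc_calls;  /* how many times buffers were allocated */
    int   steps_done;   /* how many time steps were executed */
    float result;       /* first element of the buffer after the last step */
} WorkerPlan;

static void *worker(void *arg)
{
    WorkerPlan *p = (WorkerPlan *)arg;

    /* One-time setup. With CUDA: cudaSetDevice(p->device), then the
       cudaMalloc calls and the initial host-to-device cudaMemcpy. */
    float *x = malloc(N * sizeof *x);   /* stand-in for a device buffer */
    if (x == NULL)
        return NULL;
    p->alloc_calls++;
    for (int i = 0; i < N; i++)
        x[i] = 0.0f;

    /* The time loop lives inside the thread, so x persists across steps. */
    for (int t = 0; t < MAXTIME; t++) {
        /* Stand-in for launching kernel 1, kernel 2 and kernel 3. */
        for (int i = 0; i < N; i++)
            x[i] += 1.0f;
        p->steps_done++;
    }

    /* One-time teardown. With CUDA: device-to-host cudaMemcpy, cudaFree. */
    p->result = x[0];
    free(x);
    return NULL;
}

/* Starts one thread per GPU, waits for all of them, returns 0 on success. */
int run_workers(WorkerPlan plan[NGPU])
{
    pthread_t tid[NGPU];
    int started = 0, rc = 0;

    for (int g = 0; g < NGPU; g++) {
        plan[g].device = g;
        plan[g].alloc_calls = 0;
        plan[g].steps_done = 0;
        plan[g].result = 0.0f;
        if (pthread_create(&tid[g], NULL, worker, &plan[g]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    for (int g = 0; g < started; g++)
        pthread_join(tid[g], NULL);
    return rc;
}
```

Calling `run_workers` with an array of `NGPU` plans creates each thread once and allocates each buffer once, instead of once per kernel per time step as in the original loop.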


Cutthreads are just a wrapper around the host thread library. You will do better if you read up on pthreads or whatever your OS's thread library is.

They occupy no memory on the GPU; they are host threads.

Yes, that would be ideal, but I need each cutthread or pthread to exchange two arrays, and at the end of kernel 3 a similar transaction is required for several different arrays.

I think this can be done in pthreads, and I would like to know how it could be done with cutthreads.

CUT threads is a small wrapper solely for NVIDIA’s simple SDK examples. It doesn’t provide that kind of functionality.

Here is the entirety of the CUT threads code from the SDK to prove it:

//Create thread
CUTThread cutStartThread(CUT_THREADROUTINE func, void * data){
    pthread_t thread;
    pthread_create(&thread, NULL, func, data);
    return thread;
}

//Wait for thread to finish
void cutEndThread(CUTThread thread){
    pthread_join(thread, NULL);
}

//Destroy thread
void cutDestroyThread(CUTThread thread){
    pthread_cancel(thread);
}

//Wait for multiple threads
void cutWaitForThreads(const CUTThread * threads, int num){
    for(int i = 0; i < num; i++)
        cutEndThread(threads[i]);
}

So if you know how to do what you want in pthreads, do it that way. CUT threads is not the solution.
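For the array exchange described earlier, one common pthreads pattern is a barrier around a shared host buffer. The sketch below is an illustration under assumptions, not the poster's code: `ExchangePlan`, `exchange_worker`, `run_exchange`, `shared_a` and the sizes are hypothetical, and plain host loops stand in for the kernels and for the `cudaMemcpy` calls that would move each slice between device and host. It relies on `pthread_barrier_t`, an optional POSIX feature that is available on Linux but not on every platform.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>

#define NTHREADS 2   /* assumed: one host thread per GPU */
#define CHUNK    4   /* assumed: elements of a owned by each thread */
#define STEPS    3   /* assumed number of time steps */

/* Host staging buffer for array a, visible to every thread. */
static float shared_a[NTHREADS * CHUNK];
static pthread_barrier_t step_barrier;

/* Hypothetical per-thread plan. */
typedef struct {
    int   id;        /* which slice of shared_a this thread writes */
    float last_sum;  /* sum over all of a, as seen in the final step */
} ExchangePlan;

static void *exchange_worker(void *arg)
{
    ExchangePlan *p = (ExchangePlan *)arg;

    for (int t = 0; t < STEPS; t++) {
        /* "Kernel 1": produce this thread's slice of a. With CUDA this
           would be a kernel launch plus a device-to-host cudaMemcpy. */
        for (int i = 0; i < CHUNK; i++)
            shared_a[p->id * CHUNK + i] = (float)(t + p->id);

        /* Wait until every thread has written its slice. */
        pthread_barrier_wait(&step_barrier);

        /* "Kernel 2": every thread reads the whole of a. With CUDA this
           would be a host-to-device cudaMemcpy plus a kernel launch. */
        float sum = 0.0f;
        for (int i = 0; i < NTHREADS * CHUNK; i++)
            sum += shared_a[i];
        p->last_sum = sum;

        /* Wait until every thread has finished reading, so the next
           step does not overwrite data that is still in use. */
        pthread_barrier_wait(&step_barrier);
    }
    return NULL;
}

/* Runs the exchange loop on NTHREADS threads; returns 0 on success. */
int run_exchange(ExchangePlan plan[NTHREADS])
{
    pthread_t tid[NTHREADS];

    if (pthread_barrier_init(&step_barrier, NULL, NTHREADS) != 0)
        return -1;
    for (int g = 0; g < NTHREADS; g++) {
        plan[g].id = g;
        plan[g].last_sum = 0.0f;
        /* A missing thread would leave the others stuck at the barrier,
           so this sketch simply aborts if creation fails. */
        if (pthread_create(&tid[g], NULL, exchange_worker, &plan[g]) != 0)
            abort();
    }
    for (int g = 0; g < NTHREADS; g++)
        pthread_join(tid[g], NULL);
    pthread_barrier_destroy(&step_barrier);
    return 0;
}
```

The second barrier matters as much as the first: without it, a fast thread could start writing its slice for the next step while a slower thread is still reading the previous one.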

I thought that because cutthreads were used in the simpleMultiGPU project in the CUDA 2.0 SDK, cutthreads were the only way. I didn’t know they simply used pthreads.

Right then. pthreads it is.

Thanks for that.