MultiGPU start help

KingGuru · July 28, 2010, 8:17am

I’m trying to make a program work on a multi-GPU setup, but I have a few questions.

Before running the code I need to allocate some memory on each device. I do that in an init function because it’s for OpenGL and I can’t allocate in the loop, because then I would quite quickly run out of memory.

Is that just

cudaSetDevice(0);
cudaMalloc((void**)&d_Data_0, number * sizeof(float));
cudaSetDevice(1);
cudaMalloc((void**)&d_Data_1, number * sizeof(float));

and then, when I have to run my program,

cudaSetDevice(0);
launch_Kernel(d_Data_0);
cudaSetDevice(1);
launch_Kernel(d_Data_1);

or is there a smarter way to do that?

After my calculations on the device, I put my data in a VBO. Is there a way to merge those, now that I will have multiple?

KG

YDD · July 28, 2010, 1:49pm

Spawn a thread for each GPU you wish to control, and give work to it.
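
A minimal sketch of that pattern in host-side C++, assuming std::thread is available. The helper name run_on_each_device and the work callback are hypothetical, and the CUDA calls each worker would make appear only as comments:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (not from this thread): run `work(device)` on one host
// thread per device and wait for all of them to finish.
// In a real CUDA program, `work` would start with cudaSetDevice(device) and
// then issue that device's copies and kernel launches.
inline void run_on_each_device(int num_devices,
                               const std::function<void(int)>& work) {
    assert(num_devices > 0);
    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(num_devices));
    for (int device = 0; device < num_devices; ++device) {
        // Each worker thread gets its own copy of the callback and its device index.
        workers.emplace_back(work, device);
    }
    for (std::thread& t : workers) {
        t.join();  // block until this device's work is done
    }
}
```

Each worker should only touch buffers that belong to its own device, so the threads need no locking between them.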

Gregory_Diamos · July 28, 2010, 8:00pm

You can do this on Ocelot if you want (using the NVIDIA backend). Here is an example that I wrote a while back:

report(" Launching kernels...");
// Allocate and zero one counter on each device, then launch the kernel there
for( int device = 0; device != devices; ++device )
{
    report("  for device " << device);
    cudaSetDevice( device );
    cudaMalloc( (void**) &pointers[ device ], sizeof( unsigned int ) );
    cudaMemset( pointers[ device ], 0, sizeof( unsigned int ) );

    cudaConfigureCall( dim3( 1, 1, 1 ), dim3( 1, 1, 1 ), 0, 0 );
    long long unsigned int p = (long long unsigned int)pointers[device];
    cudaSetupArgument( &p, sizeof( long long unsigned int ), 0 );
    ocelot::launch( "simpleKernels", "increment" );
}

report(" Loading results.");
// Read each counter back from the device that owns it
for( int device = 0; device != devices; ++device )
{
    cudaSetDevice( device );
    unsigned int result = 0;
    cudaMemcpy( &result, pointers[ device ],
        sizeof( unsigned int ), cudaMemcpyDeviceToHost );
    if( result != 1 )
    {
        status << "Test Point 1 FAILED:\n";
        status << " Expected result 1 for device " << device << " ("
            << getDeviceName(device) << "), but got " << result << "\n";
        return false;
    }
    cudaFree( pointers[ device ] );
}

I personally think that it is simpler than launching a bunch of worker threads…

Edit: Note that the Ocelot API calls (e.g. ocelot::launch) are optional; you can use the standard <<< >>> notation as well.

KingGuru · July 29, 2010, 6:27am

Thanks for the replies…

I can’t see how any of them would be easy to play with. I would prefer not to declare pointers within a thread, since it loops, and I reuse the buffers instead of allocating new ones, because allocating in the loop has given me memory leak problems. My best guess is that the devices can’t use the same device pointer, but I might be wrong here. I can see the simplicity in Ocelot, but I think it has the same problem.
But if they can use the same pointers, as long as they are within a “thread”, then I could see some smart things. Though I don’t have different devices in the system, I can imagine that would give allocation problems with a threaded approach.

KG

Gregory_Diamos · July 29, 2010, 4:56pm

Yeah, even with Ocelot, if you allocate memory on one device, pointers to that memory will only be valid in kernels that are launched on that device. It is one of the drawbacks of each card having a separate address space.

gthazmatt · July 30, 2010, 1:41am

If I understand what you’re asking correctly, you’re trying to do something very similar to what I’m doing. Let me take a stab at what I think you want to accomplish.

[codebox]
#include <cuda_runtime.h>
#include <omp.h>

//Set these to whatever you need
#define SIZE 1024
#define COND 1

float *h_data, *g_data0, *g_data1;
int num_gpus;

cudaGetDeviceCount(&num_gpus);

//cudaSetDevice has to be called before any other CUDA call that uses a device
omp_set_num_threads(num_gpus);

#pragma omp parallel
{
	//One OpenMP thread per GPU; each thread allocates its half of the data
	cudaSetDevice(omp_get_thread_num());
	switch(omp_get_thread_num()) {
	case 0:
		cudaMalloc((void**) &g_data0, SIZE/2*sizeof(float));
		break;
	case 1:
		cudaMalloc((void**) &g_data1, SIZE/2*sizeof(float));
		break;
	}
}

//Pinned host buffer that holds the data for both GPUs
cudaMallocHost((void**) &h_data, SIZE*sizeof(float));

while(COND) {
	//Prepare your host data
	#pragma omp parallel
	{
		switch(omp_get_thread_num()) {
		case 0:
			//First half of h_data goes to GPU 0
			cudaMemcpy(g_data0, &h_data[0], SIZE/2*sizeof(float), cudaMemcpyHostToDevice);
			//Of course, this won't work without the thread information
			launch_Kernel(g_data0);
			cudaMemcpy(&h_data[0], g_data0, SIZE/2*sizeof(float), cudaMemcpyDeviceToHost);
			break;
		case 1:
			//Second half of h_data goes to GPU 1
			cudaMemcpy(g_data1, &h_data[SIZE/2], SIZE/2*sizeof(float), cudaMemcpyHostToDevice);
			launch_Kernel(g_data1);
			cudaMemcpy(&h_data[SIZE/2], g_data1, SIZE/2*sizeof(float), cudaMemcpyDeviceToHost);
			break;
		}
	}
}
[/codebox]

Two things to note. In the memory transfers, you are using an offset for h_data so that the end result is a combination of the two arrays. The other thing is that while you have to use different device pointers for each device, you are able to reuse them throughout the loop.
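
As a rough sketch of that offset idea for any number of GPUs, the slice each device handles could be computed like this. The names Chunk and chunk_for_device are hypothetical and not part of the code above, and the even split with the remainder going to the first devices is just one possible choice:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the thread): the slice of a host array
// that one device should process.
struct Chunk {
    std::size_t offset;  // index of the first element in the host array
    std::size_t count;   // number of elements in this slice
};

// Split `total` elements as evenly as possible across `num_devices`.
// The first `total % num_devices` devices each get one extra element.
inline Chunk chunk_for_device(std::size_t total, int num_devices, int device) {
    assert(num_devices > 0 && device >= 0 && device < num_devices);
    const std::size_t n = static_cast<std::size_t>(num_devices);
    const std::size_t d = static_cast<std::size_t>(device);
    const std::size_t base = total / n;
    const std::size_t extra = total % n;
    Chunk c;
    c.count = base + (d < extra ? 1 : 0);
    // Devices before this one hold `base` elements each, plus one extra
    // element for every earlier device that received a remainder element.
    c.offset = d * base + (d < extra ? d : extra);
    return c;
}
```

With SIZE = 1024 and two GPUs, device 1 gets offset 512 and count 512, so its copies would use &h_data[512] and 512 * sizeof(float).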

KingGuru · July 30, 2010, 4:44am


OK… it’s something like this I’m thinking. Thanks…

Jackal7 · August 10, 2010, 4:15pm

Hi guys,

I hope I’m posting in the correct area!

I’m trying to write a simple program to understand how CUDA contexts work.

This is my program:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

#define NUM_THREADS 2

float d1, d2;
float * m1, * m2;
int devnumber = 1;
CUcontext hcuContext = 0;

void *
inizialize (void *)
{
    CUdevice hcuDevice;
    cuDeviceGet( &hcuDevice, devnumber );
    cuCtxCreate( &hcuContext, 0, hcuDevice );

    cudaMalloc ((void **) &m1, sizeof (float));
    cudaMalloc ((void **) &m2, sizeof (float));

    float dd1 = 1.0;
    float dd2 = 2.0;
    cudaMemcpy (m1, &dd1, sizeof (float), cudaMemcpyHostToDevice);
    cudaMemcpy (m2, &dd2, sizeof (float), cudaMemcpyHostToDevice);

    //cudaMemcpy (&d1, m1, sizeof (float), cudaMemcpyDeviceToHost);
    //cudaMemcpy (&d2, m2, sizeof (float), cudaMemcpyDeviceToHost);
    //fprintf (stdout, "%f %f \n", d1, d2);

    cuCtxPopCurrent(&hcuContext);
    cudaThreadSynchronize ();
    pthread_exit (NULL);
}

void *
compute_function (void *)
{
    cuCtxPushCurrent( hcuContext );

    cudaMemcpy (&d1, m1, sizeof (float), cudaMemcpyDeviceToHost);
    cudaMemcpy (&d2, m2, sizeof (float), cudaMemcpyDeviceToHost);
    fprintf (stdout, "%f %f \n", d1, d2);

    cudaThreadSynchronize ();
    pthread_exit (NULL);
}

int
main (int argc, char *argv[])
{
    pthread_t threads;

    pthread_create (&threads, NULL, inizialize, NULL);
    if (pthread_join (threads, NULL))
    {
        fprintf (stderr, "error pthread_join\n");
        return EXIT_FAILURE;
    }

    pthread_create (&threads, NULL, compute_function, NULL);
    if (pthread_join (threads, NULL))
    {
        fprintf (stderr, "error pthread_join\n");
        return EXIT_FAILURE;
    }

    cuCtxDestroy(hcuContext);
    return EXIT_SUCCESS;
}

Is it correct to use the context in this way? I need the second thread to print 1.0 and 2.0, but without a CUDA context it doesn’t work. With this solution the build fails with the following linker errors:

/tmp/tmpxft_00006a46_00000000-12_th1.o: In function `main':
tmpxft_00006a46_00000000-1_th1.cudafe1.cpp:(.text+0x10a5c): undefined reference to `cuCtxDestroy'
/tmp/tmpxft_00006a46_00000000-12_th1.o: In function `compute_function(void*)':
tmpxft_00006a46_00000000-1_th1.cudafe1.cpp:(.text+0x10a80): undefined reference to `cuCtxPushCurrent'
/tmp/tmpxft_00006a46_00000000-12_th1.o: In function `inizialize(void*)':
tmpxft_00006a46_00000000-1_th1.cudafe1.cpp:(.text+0x10b12): undefined reference to `cuDeviceGet'
tmpxft_00006a46_00000000-1_th1.cudafe1.cpp:(.text+0x10b24): undefined reference to `cuCtxCreate'
tmpxft_00006a46_00000000-1_th1.cudafe1.cpp:(.text+0x10b90): undefined reference to `cuCtxPopCurrent'

Can someone give me a hand? Please, I need this to work for my degree thesis.
