Using CUDA to run many instances

tommy000001 · March 31, 2012, 9:09am

Hi all,

I just got an overview of CUDA and I’m impressed.
But before spending too much time with this system, I need to know whether CUDA is suited to running multiple instances of one program. My program can’t be parallelized, but I have to run many instances with different parameters. Every instance needs about 5 MB of memory, and I need an arbitrary number of instances (the more, the better).

My question:
Is it possible to write a wrapper program which distributes my current program to the graphics chip and then collects the results? How efficient would this be? My program takes (depending on parameters) about 1 hour on a 3 GHz CPU.

Any help is appreciated.
Thanks.
Tommy

pasoleatis · March 31, 2012, 12:05pm

It is possible. I ran multiple programs on the same GPU and it worked. There is something better, though. In my problem there are large gaps in time during which something else could run, so I used streams to run several copies of the same system in parallel.

tommy000001 · March 31, 2012, 12:08pm

Hi, thank you very much.
Any examples, tutorials, …?

Kind regards,
Tommy

pasoleatis · March 31, 2012, 12:30pm

Here are some pieces of code showing how I use it. In the code, the particle positions are stored as double3 and each system has Np particles. I define arrays of pointers and allocate them one by one on both host and device.

#define nstr 3 // number of streams I use

// Np, gss, bsl, Neq, Nout, lx, ly, lz, diamsq, init_config(), genrand64_real2()
// and the kernels newMCenergyarray and vsu are defined elsewhere in my program.

int main(void)
{
    double3 *pos[nstr];                      // host positions, one system per stream
    double3 *dev_pos[nstr];                  // device positions, one system per stream
    double *dev_newuuu[nstr], *dev_olduuu;
    double *dev_charge;
    double jump;                             // maximum trial displacement, set elsewhere (not shown)
    double *h_ene[nstr];                     // host energies (pinned memory)
    static int h_acc[1];
    double *dev_energy[nstr], *dev_totalene;
    int *dev_acceptance;
    double3 jxyz[nstr];
    double rnd[nstr];
    int atom_i[nstr];
    double enepene[nstr];
    cudaStream_t stream[nstr];               // streams

    // memory allocations
    for (int is = 0; is < nstr; is++)
    {
        cudaStreamCreate(&stream[is]);
        cudaMalloc(&dev_pos[is], sizeof(double3)*Np);
        cudaMalloc(&dev_energy[is], sizeof(double));
        cudaMalloc(&dev_newuuu[is], sizeof(double)*gss);
        cudaHostAlloc(&pos[is], sizeof(double3)*Np, cudaHostAllocDefault);
        cudaHostAlloc(&h_ene[is], sizeof(double), cudaHostAllocDefault);
    }
    cudaMalloc(&dev_acceptance, sizeof(int)); // acceptance counter shared by all streams

    float gputime;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    init_config(pos, Np, lx, ly, lz, diamsq); // initialize the positions on the host

    // copy the initial configurations to the device
    for (int is = 0; is < nstr; is++)
    {
        cudaMemcpy(dev_pos[is], pos[is], sizeof(double3)*Np, cudaMemcpyHostToDevice);
    }

    // example of how I run the streams in parallel
    for (int ist = 0; ist < nstr; ist++)
    {
        cudaMemcpy(dev_energy[ist], h_ene[ist], sizeof(double), cudaMemcpyHostToDevice);
    }

    cudaEventRecord(start, 0);

    for (int imes = 0; imes < Neq; imes++)
    {
        h_acc[0] = 0;
        cudaMemcpy(dev_acceptance, h_acc, sizeof(int), cudaMemcpyHostToDevice);

        for (int idl = 0; idl < Nout; idl++)
        {
            for (int isp = 0; isp < Np; isp++)
            {
                for (int ist = 0; ist < nstr; ist++) // here calling the same function for different streams
                {
                    jxyz[ist].x = jump*(2.0*genrand64_real2() - 1.0);
                    jxyz[ist].y = jump*(2.0*genrand64_real2() - 1.0);
                    jxyz[ist].z = jump*(2.0*genrand64_real2() - 1.0);

                    atom_i[ist] = round((Np-1)*genrand64_real2());

                    rnd[ist] = genrand64_real2();

                    newMCenergyarray<<<gss, 2*bsl, 0, stream[ist]>>>(dev_pos[ist], dev_newuuu[ist], Np, jxyz[ist], atom_i[ist]); // first step
                }

                for (int ist = 0; ist < nstr; ist++) // here calling the same function for different streams
                {
                    vsu<<<1, 1, 0, stream[ist]>>>(dev_pos[ist], dev_newuuu[ist], jxyz[ist], atom_i[ist], dev_acceptance, dev_energy[ist], rnd[ist]); // second step
                }
            }
        }
    }

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&gputime, start, stop); // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    printf(" \n");
    printf("Time = %g s\n", gputime/1000.0f);
    printf(" \n");

    // release streams and memory
    for (int is = 0; is < nstr; is++)
    {
        cudaStreamDestroy(stream[is]);
        cudaFree(dev_pos[is]);
        cudaFree(dev_energy[is]);
        cudaFree(dev_newuuu[is]);
        cudaFreeHost(pos[is]);
        cudaFreeHost(h_ene[is]);
    }
    cudaFree(dev_acceptance);

    return 0;
}

tommy000001 · March 31, 2012, 3:05pm

Thank you … very instructive, I’ll give it a try.

pasoleatis · March 31, 2012, 3:17pm

I just looked on the internet for stream tutorials and ripped off some parts of the code.

seibert · April 1, 2012, 11:43am

While CUDA does allow multiple programs to run on one GPU, their kernels do not run concurrently. The GPU driver switches control of the GPU from one client program to another between kernel launches. If your source of parallelism is running multiple instances, CUDA will not accelerate this at all.

pasoleatis · April 1, 2012, 12:17pm

So using streams is the only way to gather more statistics for programs which would not fill the GPU on their own? At least for my problem it worked to some extent.

seibert · April 1, 2012, 3:07pm

Yes, multiple streams in the same process can launch concurrent kernels (on Fermi and later, anyway). Different processes have to time slice the entire GPU. Moreover, the multitasking is cooperative, not preemptive, so the GPU can only switch processes between operations.
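To make the stream case concrete, here is a minimal sketch, not taken from anyone's real code in this thread. The kernel `scale`, the helper `run_in_streams`, the `CHECK` macro and all sizes are made up for illustration. Each stream gets its own buffers and its own queue of copy-in, kernel and copy-out; the hardware is allowed, but not guaranteed, to overlap work from different streams.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NSTREAMS 4
// Hypothetical helper: bail out of the calling function on any CUDA error.
#define CHECK(call) do { if ((call) != cudaSuccess) return 1; } while (0)

// Deliberately small kernel: a single block cannot fill the GPU by itself.
__global__ void scale(float *data, int n, float factor)
{
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        data[i] *= factor;
}

// Runs one small job per stream and stores element 0 of each result in first[].
// Returns 0 on success, 1 if a CUDA call reported an error.
int run_in_streams(float first[NSTREAMS])
{
    const int n = 1 << 12;
    const size_t bytes = n * sizeof(float);
    cudaStream_t stream[NSTREAMS];
    float *h_data[NSTREAMS], *d_data[NSTREAMS];

    for (int s = 0; s < NSTREAMS; ++s) {
        CHECK(cudaStreamCreate(&stream[s]));
        // Pinned host memory, so the async copies can really be asynchronous.
        CHECK(cudaHostAlloc(&h_data[s], bytes, cudaHostAllocDefault));
        CHECK(cudaMalloc(&d_data[s], bytes));
        for (int i = 0; i < n; ++i)
            h_data[s][i] = 1.0f;
    }

    // Queue the work; operations inside one stream stay in order.
    for (int s = 0; s < NSTREAMS; ++s) {
        CHECK(cudaMemcpyAsync(d_data[s], h_data[s], bytes, cudaMemcpyHostToDevice, stream[s]));
        scale<<<1, 128, 0, stream[s]>>>(d_data[s], n, (float)(s + 1));
        CHECK(cudaMemcpyAsync(h_data[s], d_data[s], bytes, cudaMemcpyDeviceToHost, stream[s]));
    }

    // Wait for all streams before touching the results on the host.
    CHECK(cudaDeviceSynchronize());

    for (int s = 0; s < NSTREAMS; ++s) {
        first[s] = h_data[s][0];
        cudaStreamDestroy(stream[s]);
        cudaFree(d_data[s]);
        cudaFreeHost(h_data[s]);
    }
    return 0;
}

int main()
{
    float first[NSTREAMS];
    if (run_in_streams(first) != 0) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    for (int s = 0; s < NSTREAMS; ++s)
        printf("stream %d: first element = %g\n", s, first[s]);
    return 0;
}
```

Whether the kernels actually overlap depends on the device and on how much of the GPU each launch occupies; a profiler timeline is the usual way to check.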

The reason I wanted to jump in here is that it sounds like the original poster might think that the GPU is like a multicore CPU, and that parallelism can be achieved by launching many independent serial processes. That absolutely does not work on CUDA devices, unfortunately. The parallelism has to be found within a single process, and the goal is to exploit data parallelism rather than task parallelism.

That said, the compute problem described could be amenable to a data-parallel calculation. I just wanted to throw some caution in so no one is surprised later. :)
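For what a data-parallel version of "many instances with different parameters" could look like, here is a rough sketch under heavy assumptions. The function `simulate` is a made-up stand-in for the serial program (a simple iterated map), and the kernel name `run_instances`, the parameter range and the sizes are hypothetical. The idea is one GPU thread per parameter set, all launched from a single process.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for one serial instance: iterate x <- p * x * (1 - x).
__host__ __device__ float simulate(float p, int steps)
{
    float x = 0.5f;
    for (int i = 0; i < steps; ++i)
        x = p * x * (1.0f - x);
    return x;
}

// One thread per parameter set; thread i handles instance i.
__global__ void run_instances(const float *params, float *results, int n, int steps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        results[i] = simulate(params[i], steps);
}

int main()
{
    const int n = 1 << 16;   // number of instances (made-up value)
    const int steps = 1000;  // serial work per instance (made-up value)
    const size_t bytes = n * sizeof(float);

    std::vector<float> h_params(n), h_results(n);
    for (int i = 0; i < n; ++i)
        h_params[i] = 2.5f + 1.5f * i / n;  // different parameter per instance

    float *d_params = nullptr, *d_results = nullptr;
    if (cudaMalloc(&d_params, bytes) != cudaSuccess ||
        cudaMalloc(&d_results, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(d_params, h_params.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    run_instances<<<blocks, threads>>>(d_params, d_results, n, steps);

    // This copy waits for the kernel and also reports any execution error.
    cudaError_t err = cudaMemcpy(h_results.data(), d_results, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("first result = %g, last result = %g\n", h_results[0], h_results[n - 1]);

    cudaFree(d_params);
    cudaFree(d_results);
    return 0;
}
```

This only pays off if the instances follow similar control flow: threads in a warp execute together, so instances that branch differently are serialized, and a single GPU thread is much slower than a 3 GHz CPU core. At roughly 5 MB of state per instance, device memory rather than core count would likely limit how many instances fit at once.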

pasoleatis · April 1, 2012, 3:25pm

Yes, I have been asked that before. People wanted to know if they could just run 448 processes at the same time, as if each core were a CPU.

tommy000001 · April 1, 2012, 7:31pm

Thanks for your replies …

Unfortunately, I thought it could be used like a many-core CPU. Too good to be true.

Kind regards.
