Neural network on GPU, physics on CPU?

Hello,

I’m quite new to GPGPU computing, I’ve read through NVIDIA’s OpenCL and Cuda guides and still have a question about how to best leverage the capabilities of CUDA in my particular case, which typically looks like this:

• the system to simulate is composed of an artificial neural network coupled to a physics engine

• one wants to run a few hundred of such systems in parallel, for a specified number of iteration, and then collect some statistics about the results.

• a single run can be described as follows:

[codebox]while time < maxTime:

take one neural network step

feed output into physics engine

take one physics step

feed output into neural network

return some statistics

[/codebox]

In an ideal world, the physics engine should also run on the GPU so as to minimise memory transfers between host and device. However I don’t know of any GPU-only physics engine around there and I don’t think I can write one myself ;-) Moreover there are things that MUST run on the CPU, for example one might replace the physics engine with data from actual sensors and the same problem would arise: how do we efficiently stream data to and from the gpu in such a situation?

So… would you

(a) invoke the neural network kernel once for each timestep (a typical run is 500-5,000 timesteps): straightforward, but possibly very inefficient, since data structures in shared memory would be lost between invocations and would need to be reallocated every time?

(b) invoke the neural network kernel once for the entire simulation, and have the host write and read data to/from the device’s global memory repeatedly during the kernel’s lifetime: potentially more efficient, but (correct me if I’m wrong) non-deterministic because it is not possible to synchronise at this level?

Don’t worry about saving the state of the GPU between kernel invocations unless your timescale is extremely detailed. The overhead is insignificant for kernel runs down to 0.3ms and virtually undetectable for kernels above 3ms.

EDIT: You will not need to “reallocate” memory between kernel calls. You reserve an area at startup, initialize it to a known state and then just let the kernel use that to save/restore persistent registers and shared memory at the beginning and end of each call.

Hello,

Can you jma explain to me more about your response:

You will not need to “reallocate” memory between kernel calls. You reserve an area at startup, initialize it to a known state and then just let the kernel use that to save/restore persistent registers and shared memory at the beginning and end of each call.

thanks a lot in advance

Hello,

The global arrays need to be allocated only once at the beginning. You are write the shared memory content is lost between calls, this means you need to save this data into a global array.

Hello,

The global arrays need to be allocated only once at the beginning. You are write the shared memory content is lost between calls, this means you need to save this data into a global array.

Thans a lot for this response, but my arrays will be different in each generation, so in this case i must always update this array because in each generation i must restart my neural network.

I think that it is not clear, so i will explain more:
I have creatures simulated in physic motor , this creatures are controlled by an RNN (RECURRENT NEURAL NETWORK) evolved by a genetic algorithm.

my first purpose is to port the genetic algorithm into the GPU, but when i am trying to do this i found my self obliged to look to the RNN in order to take it to the GPU too.

My question is : how can i minimize transfers of memory because my data are arrived from the host ( physic motor ),and if the response is the same what you said please illustrate it to with an example.

Thanks a lot for any device or suggestion.

If the data comes from the host so often you should check if you can ovelap the copying of data with calculations using streams. For more on the topic you should check CUDA BEST PRACTICES http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#memory-optimizations on memory optimizations.

If the data comes from the host so often you should check if you can ovelap the copying of data with calculations using streams. For more on the topic you should check CUDA BEST PRACTICES http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#memory-optimizations on memory optimizations.

Thanks a lot , I am tryine to undertand how it work an how i can use it. Have you any advice about it??

Hello pasoleatis,

i have read some documents and websites about how Can I make my host array “visible” from the device function directly in some way or do I have to copy it to the kernel first?

for that i found somthing about Default Pinned Memory Vs Zero-Copy Memory and Mapped Memory and Page-Locked Host Memory and i found a responce in forum which is:

CUDA 2.2. introduced “mapped pinned memory,” which is mapped into the CUDA address space so can be directly accessed by kernels; but the memory must be allocated by cudaHostAlloc() . CUDA 4.0 added the ability to page-lock existing virtual address ranges with cudaHostRegister() which like cudaHostAlloc() optionally can map the host memory into the device address space.

my question here is :
if cudaHostAloc can map the host memory into the device address space, can you please explian to more about its principe and how it work, because i found a small exemple in “cuda by exemple” and i don’t really understand the diffrence between malloc and cudahostmalloc.

float cuda_malloc_test( int size, bool up ) {
cudaEvent_t start, stop;
int a, dev_a;
float elapsedTime;
HANDLE_ERROR( cudaEventCreate( &start ) );
HANDLE_ERROR( cudaEventCreate( &stop ) );
a = (int
)malloc( size * sizeof( a ) );
HANDLE_NULL( a );
HANDLE_ERROR( cudaMalloc( (void
)&dev_a,
size * sizeof( *dev_a ) ) );
HANDLE_ERROR( cudaEventRecord( start, 0 ) );
for (int i=0; i<100; i++) {
if (up)
HANDLE_ERROR( cudaMemcpy( dev_a, a,
size * sizeof( *dev_a ),
cudaMemcpyHostToDevice ) );
else
HANDLE_ERROR( cudaMemcpy( a, dev_a,
size * sizeof( *dev_a ),
cudaMemcpyDeviceToHost ) );
}
HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
HANDLE_ERROR( cudaEventSynchronize( stop ) );
HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
start, stop ) );
free( a );
HANDLE_ERROR( cudaFree( dev_a ) );
HANDLE_ERROR( cudaEventDestroy( start ) );
HANDLE_ERROR( cudaEventDestroy( stop ) );
return elapsedTime;
}

the second one which use cudahostalloc:

float cuda_host_alloc_test( int size, bool up ) {
cudaEvent_t start, stop;
int a, dev_a;
float elapsedTime;
HANDLE_ERROR( cudaEventCreate( &start ) );
HANDLE_ERROR( cudaEventCreate( &stop ) );
HANDLE_ERROR( cudaHostAlloc( (void
)&a,
size * sizeof( a ),
cudaHostAllocDefault ) );
HANDLE_ERROR( cudaMalloc( (void
*)&dev_a,
size * sizeof( *dev_a ) ) );
HANDLE_ERROR( cudaEventRecord( start, 0 ) );
for (int i=0; i<100; i++) {
if (up)
HANDLE_ERROR( cudaMemcpy( dev_a, a,
size * sizeof( *a ),
cudaMemcpyHostToDevice ) );
else
HANDLE_ERROR( cudaMemcpy( a, dev_a,
size * sizeof( *a ),
cudaMemcpyDeviceToHost ) );
}
HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
HANDLE_ERROR( cudaEventSynchronize( stop ) );
HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
start, stop ) );
HANDLE_ERROR( cudaFreeHost( a ) );
HANDLE_ERROR( cudaFree( dev_a ) );
HANDLE_ERROR( cudaEventDestroy( start ) );
HANDLE_ERROR( cudaEventDestroy( stop ) );
return elapsedTime;
}

thanks a lot for any explanation

Hello,

My understanding about mapped memory is that it is only useful for integrated cards which do not have their own ram. In general I suggest to always cop the data from the cpu RAM to gpu RAM. If you do not want to sue this, there is also the possibility to use UVA (Universal Virtual (space) Address) http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-virtual-address-space. With UVA you can use directly the cpu pointers as arguments for the kernel calls. Please keep in mind that the data is still physically in the system RAM which means that it still needs to be copied to the gpu RAM, but this is done automatically. I think the UVA mode is activated from nvidia driver. (nvidia-smi???)

The cuda hostalloc creates pinned memory. The details are here, but I do not fully understand how it works because I do not know does page-locked memory mean, http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#page-locked-host-memory. This makes the transfers between cpu and gpu faster, but if you use too much will make you computer unresponsive. The mapped memory is useless unless you have some laptop.

Useful reading about UVA:

Check the simpleP2P Sample, SDK 4.0 (Demonstrates both P2P &UVA)

My reccomandention is to do manually the copying because you can control how is done and make optimization in a transparent way.

Hello,

I want to thank you alot for all these advices, i will look for UVA.

And for the page locked, if i undrestand them these pages just means that the the memory pages are locked and will not go and change physical adress or be swapped out to disk by the os. you can write them just like any pointer.

To be more clear about my problem , i just want to use a vector which is the result of CPU function , this vector will be an input parameter to a device kernel for that i found the idea of using mapped memory useful in my case but i don’t know if there another options to this without lost the performance.

if you need any other clarifications about my application there is no problem.

THanks for another time :)

best regards

At this point2 options will work for sure.

  1. copy the vector to gpu. If the copying is very often you will have to see what optimizations can you, such as overlapping calculations with copy
  2. UVA will work, also, but in this case might be or not be faster as the first option.

Any option you use you need optimization if the copying is very often. The idea is to saturate the pci bus while keeping the gpu with high occupancy in the same time.

The simplest optimization is to run 2 or more independent copy of the system in the same time which is faster to implement and very useful if you need to do averaging. It worked for my Monte Carlo problem.

My understanding about mapped memory is that can be used only when the card uses the main memory which happens on some laptops, otherwise can not be used.

Hello pasoleatis,

I want to thank you always.

my question about what ou said in this sentence:

The simplest optimization is to run 2 or more independent copy of the system in the same time which is faster to implement and very useful if you need to do averaging. It worked for my Monte Carlo problem.

   How can i do 2 or more independent copy of the system in the same time (exemple please)?

best regards

In my case there was no communication with cpu. An example of code:

// define array of pointers for the positions , each pointer corresponds to one stream (copy of the system)
double3 *pos[nstr];
double3 *dev_pos[nstr]; 

...


    for (int is = 0; is < nstr; is++)
    {
    cudaStreamCreate(&stream[is]); // create the streams
    cudaMalloc(&dev_pos[is],sizeof(double3)*Np); // allocate the memory for the positions on device
    cudaHostAlloc(&pos[is],sizeof(double3)*Np,cudaHostAllocDefault); // allocate the memory on host 
    }
....

for(int ist=0;ist<nstr;ist++)
			{
                              // kernel calls are asynchronous. 
			      newMCenergyarray<<<gss,2*bsl,0,stream[ist]>>>(dev_pos[ist],dev_newuuu[ist], Np,jxyz[ist],atom_i[ist]); // calling a kernel function is asynchronous. 
			}

You also need to use cudamcpyasync and handle copying without unnecessary synchronizations. The idea is that while 1 stream waits for data to come from cpu, the other streams can work. This also means you need to have you cpu to create the data for each stream.

THanks a lot for your help, i will see with it.

Best regards