How to handle big data with a machine learning algorithm on a CUDA device if the whole data set does not fit into De

I have recently learned CUDA and want to apply it to the machine learning problems I work on. I generally work with large-scale data, so device memory is usually the constraint. I was hoping to hear some good approaches to large-scale data problems that preserve numerical precision. I know it is possible to convert double data to float to reduce the memory requirement somewhat, but the loss of precision is a drawback. I am looking for other solutions that keep the precision of the data while keeping the CUDA code efficient. What do the experts suggest?

If the data does not fit in device memory, you need to overlap the copying with execution using streams. If you manage to saturate both the bus and the device, you will get the maximum out of the card.
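
A minimal sketch of that idea, assuming the computation can be applied to the data one chunk at a time. The kernel `scale`, the helper `scaleInChunks`, and the chunk sizes are hypothetical stand-ins, not part of any library. The host buffer is pinned with `cudaMallocHost`, since asynchronous copies need pinned memory to overlap with kernels.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical stand-in for the real per-chunk computation.
__global__ void scale(double *x, size_t n, double a) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Scales `total` doubles held in pinned host memory, `chunk` elements at a
// time, so only 2 * chunk elements are resident on the device at once.
void scaleInChunks(double *hPinned, size_t total, size_t chunk, double a) {
    const int kStreams = 2;
    cudaStream_t stream[kStreams];
    double *dBuf[kStreams];
    for (int k = 0; k < kStreams; ++k) {
        CHECK(cudaStreamCreate(&stream[k]));
        CHECK(cudaMalloc((void **)&dBuf[k], chunk * sizeof(double)));
    }

    size_t c = 0;
    for (size_t off = 0; off < total; off += chunk, ++c) {
        int k = (int)(c % kStreams);
        size_t n = (total - off < chunk) ? (total - off) : chunk;
        // Work queued on one stream runs in order, so reusing dBuf[k] is
        // safe; work queued on the other stream is free to overlap with it.
        CHECK(cudaMemcpyAsync(dBuf[k], hPinned + off, n * sizeof(double),
                              cudaMemcpyHostToDevice, stream[k]));
        scale<<<(unsigned int)((n + 255) / 256), 256, 0, stream[k]>>>(
            dBuf[k], n, a);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(hPinned + off, dBuf[k], n * sizeof(double),
                              cudaMemcpyDeviceToHost, stream[k]));
    }
    CHECK(cudaDeviceSynchronize());

    for (int k = 0; k < kStreams; ++k) {
        CHECK(cudaStreamDestroy(stream[k]));
        CHECK(cudaFree(dBuf[k]));
    }
}

int main() {
    const size_t total = 1 << 22;  // elements in the full host data set
    const size_t chunk = 1 << 18;  // elements per device buffer
    double *h;
    CHECK(cudaMallocHost((void **)&h, total * sizeof(double)));
    for (size_t i = 0; i < total; ++i) h[i] = 1.0;

    scaleInChunks(h, total, chunk, 2.0);

    assert(h[0] == 2.0 && h[total - 1] == 2.0);
    printf("h[0] = %f, h[last] = %f\n", h[0], h[total - 1]);
    CHECK(cudaFreeHost(h));
    return 0;
}
```

This should build with something like `nvcc overlap.cu -o overlap`. Whether the copies and kernels actually overlap depends on the card's copy engines and on the kernel being heavy enough to hide the transfer, so it is worth checking the timeline in a profiler.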

My typical strategies are:

  1. Convert the input data from double to float, but still do important parts of the calculation in double precision. Many kinds of input data do not have high precision due to limitations of sensors or the presence of noise, so there is no reason to store 16 significant figures of information. In fact, if the input data all have similar magnitude, it is often possible to use a 16-bit fixed point representation. Intermediate calculations and storage can be done at higher precision.

  2. Buy a card with more memory. It is now quite easy to find inexpensive 4 GB cards, and if you purchase a GeForce Titan, you get 6 GB.

  3. If the data can be split into “frequently used” and “infrequently used” segments, then infrequently used data can be stored in pinned memory on the CPU and accessed directly from the GPU. I have used this trick when I have a very large tree structure. The nodes near the root of the tree are stored on the GPU, and the leaves are stored on the CPU.
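
For the third strategy, here is a rough sketch of what accessing pinned host memory directly from a kernel can look like. It assumes a device that supports mapped (zero-copy) host memory. The kernel `addHotCold`, the helper `addColdOnHost`, and the flat arrays are hypothetical placeholders for the tree described above.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// `hot` lives in device memory; `cold` is an alias of pinned host memory,
// so every read of cold[i] travels over the bus.
__global__ void addHotCold(const float *hot, const float *cold, float *out,
                           size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = hot[i] + cold[i];
}

// hColdMapped must come from cudaHostAlloc(..., cudaHostAllocMapped).
void addColdOnHost(const float *hHot, float *hColdMapped, float *hOut,
                   size_t n) {
    // Device-side alias of the host allocation; no bulk copy takes place.
    float *dColdAlias;
    CHECK(cudaHostGetDevicePointer((void **)&dColdAlias, hColdMapped, 0));

    // Frequently used data gets ordinary device memory.
    float *dHot, *dOut;
    CHECK(cudaMalloc((void **)&dHot, n * sizeof(float)));
    CHECK(cudaMalloc((void **)&dOut, n * sizeof(float)));
    CHECK(cudaMemcpy(dHot, hHot, n * sizeof(float), cudaMemcpyHostToDevice));

    addHotCold<<<(unsigned int)((n + 255) / 256), 256>>>(dHot, dColdAlias,
                                                         dOut, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(dHot));
    CHECK(cudaFree(dOut));
}

int main() {
    const size_t n = 1 << 20;
    // Allow pinned host allocations to be mapped into the device address space.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    float *hCold;
    CHECK(cudaHostAlloc((void **)&hCold, n * sizeof(float),
                        cudaHostAllocMapped));
    float *hHot = (float *)malloc(n * sizeof(float));
    float *hOut = (float *)malloc(n * sizeof(float));
    for (size_t i = 0; i < n; ++i) {
        hCold[i] = 1.0f;
        hHot[i] = 2.0f;
    }

    addColdOnHost(hHot, hCold, hOut, n);

    assert(hOut[0] == 3.0f && hOut[n - 1] == 3.0f);
    printf("out[0] = %f, out[last] = %f\n", hOut[0], hOut[n - 1]);

    CHECK(cudaFreeHost(hCold));
    free(hHot);
    free(hOut);
    return 0;
}
```

Reads through the mapped pointer are much slower than reads from device memory, which is why this only pays off for the data that is touched rarely.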

Of course, there are other options (such as the overlapping stream suggestion from pasoleatis), but I have not personally used them.
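
To illustrate the first strategy in the list (inputs stored as float, the sensitive arithmetic done in double), here is a small sketch of a sum reduction. The kernel `sumAsDouble` and the helper `sumOnDevice` are hypothetical names, and a plain sum stands in for whatever part of the calculation actually needs the extra precision.

```cuda
#include <cassert>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define THREADS 256  // threads per block; must be a power of two
#define BLOCKS 64

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Reads float inputs (half the memory of double) but accumulates in double.
// Each block writes one partial sum to partial[blockIdx.x].
__global__ void sumAsDouble(const float *x, size_t n, double *partial) {
    __shared__ double s[THREADS];
    double acc = 0.0;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
         i += stride)
        acc += (double)x[i];
    s[threadIdx.x] = acc;
    __syncthreads();
    // Tree reduction in shared memory, still in double.
    for (unsigned int w = blockDim.x / 2; w > 0; w >>= 1) {
        if (threadIdx.x < w) s[threadIdx.x] += s[threadIdx.x + w];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = s[0];
}

// Copies n floats to the device and returns their sum computed in double.
double sumOnDevice(const float *hx, size_t n) {
    float *dx;
    double *dPartial;
    CHECK(cudaMalloc((void **)&dx, n * sizeof(float)));
    CHECK(cudaMalloc((void **)&dPartial, BLOCKS * sizeof(double)));
    CHECK(cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice));

    sumAsDouble<<<BLOCKS, THREADS>>>(dx, n, dPartial);
    CHECK(cudaGetLastError());

    double hPartial[BLOCKS];
    CHECK(cudaMemcpy(hPartial, dPartial, sizeof(hPartial),
                     cudaMemcpyDeviceToHost));  // waits for the kernel
    double total = 0.0;
    for (int b = 0; b < BLOCKS; ++b) total += hPartial[b];

    CHECK(cudaFree(dx));
    CHECK(cudaFree(dPartial));
    return total;
}

int main() {
    const size_t n = 1 << 24;
    float *hx = (float *)malloc(n * sizeof(float));
    for (size_t i = 0; i < n; ++i) hx[i] = 0.1f;

    // Single-precision running sum on the host, for comparison only.
    float naive = 0.0f;
    for (size_t i = 0; i < n; ++i) naive += hx[i];

    double mixed = sumOnDevice(hx, n);
    double ref = (double)n * (double)0.1f;
    assert(fabs(mixed - ref) <= 1e-9 * ref);

    printf("float accumulator:  %f\n", naive);
    printf("double accumulator: %f\n", mixed);
    free(hx);
    return 0;
}
```

The input array takes 4 bytes per element on the device instead of 8, while the accumulated result avoids most of the rounding error a float accumulator would pick up over millions of additions.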