How to handle big data with a machine learning algorithm on a CUDA device if the whole dataset does not fit into device memory?

I’ve recently learned CUDA and I want to apply it to the machine learning problems I work on. I generally deal with large-scale data, so device memory is usually the constraint. I was hoping to hear some good approaches to large-scale data problems that preserve numerical precision. I know it is possible to convert double data to float to reduce the memory requirement somewhat, but losing precision is the drawback. I am looking for other solutions that keep the data precision while keeping the CUDA code efficient. What do the experts suggest?

A GTX Titan or a K20 Tesla has 6 GB of RAM on board, so choose one of those GPUs if a small consumer card is too limiting. If you need even more RAM, GPU code can still transparently read and write host system memory, but with significantly lower bandwidth. (Read the CUDA C Programming Guide and search for “Unified Virtual Address”.) Host memory reads are still cached, but of course if that bandwidth is still a limitation you can start redesigning your algorithm to do its own copying to the device, provided you can predict which data is most likely to be used. There are even fancier methods of managing RAM (changing the data representation itself, compressing the data and expanding it on the fly, using multiple GPUs and having them share each other’s memory), but they’re very dependent on your actual algorithm implementation.
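To make the “read host memory directly from the GPU” option concrete, here is a minimal sketch using pinned, mapped (zero-copy) host memory; the kernel, buffer size, and scale-by-two workload are invented for illustration, and error checking is omitted:

```cuda
// Minimal sketch of the mapped ("zero-copy") host-memory approach, assuming a
// 64-bit build on a UVA-capable GPU. Names and sizes are illustrative only.
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride kernel that reads and writes a buffer living in host RAM.
// Every access travels over PCIe, hence the much lower bandwidth.
__global__ void scale(float *data, size_t n, float factor)
{
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x)
        data[i] *= factor;
}

int main()
{
    const size_t n = size_t(1) << 28;        // ~1 GB of floats, illustrative only
    float *h_data = NULL, *d_data = NULL;

    // Allow mapped pinned allocations (must be set before the context is created).
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Pinned + mapped host allocation: the GPU can dereference it directly.
    cudaHostAlloc((void **)&h_data, n * sizeof(float), cudaHostAllocMapped);
    for (size_t i = 0; i < n; ++i) h_data[i] = 1.0f;

    // With UVA the host pointer is usable on the device as-is, but asking the
    // runtime for the device alias keeps the code portable.
    cudaHostGetDevicePointer((void **)&d_data, h_data, 0);

    scale<<<1024, 256>>>(d_data, n, 2.0f);   // no cudaMemcpy anywhere
    cudaDeviceSynchronize();

    printf("data[0] = %f (expected 2.0)\n", h_data[0]);
    cudaFreeHost(h_data);
    return 0;
}
```

In practice, when zero-copy bandwidth becomes the bottleneck, the “do its own copying” redesign usually means splitting the data into chunks, staging each chunk into device memory with cudaMemcpyAsync on a couple of streams, and overlapping transfers with computation.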