I’m processing data using CUDA Fortran on a Tesla C1060. The data size is around 100 GB, which of course exceeds the GPU memory. How can I process such a big file? Is there a good strategy for reading the file, copying it to device memory, and hiding the transfer delay?
Can you process the data in chunks? If so, then you just need an outer loop that iterates over smaller chunks of size N, where N is the amount of data that fits in device memory. (Note that you can use the cudaGetDeviceProperties routine to query a device's memory size at runtime.) Even better, you can write your code to use multiple GPUs. (See: http://www.pgroup.com/lit/articles/insider/v3n3a2.htm)
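A minimal sketch of that outer loop in CUDA Fortran. The file I/O, the chunk count, and the `process_chunk` kernel are placeholders you would replace with your own; the device-memory query is the real cudaGetDeviceProperties call:

```fortran
program chunked
  use cudafor
  implicit none
  integer, parameter :: CHUNK = 16*1024*1024   ! elements per chunk; tune to fit the device
  real(4), allocatable :: h_buf(:)
  real(4), device, allocatable :: d_buf(:)
  type(cudaDeviceProp) :: prop
  integer :: istat, i, total_chunks

  ! Query how much memory the device actually has
  istat = cudaGetDeviceProperties(prop, 0)
  print *, 'Device global memory (bytes):', prop%totalGlobalMem

  allocate(h_buf(CHUNK), d_buf(CHUNK))
  total_chunks = 4        ! in a real run: file size / (CHUNK * 4 bytes)

  do i = 1, total_chunks
     ! read the next CHUNK elements from the file into h_buf here
     d_buf = h_buf                    ! host -> device copy
     ! call process_chunk<<<grid,block>>>(d_buf, CHUNK)   ! your kernel
     h_buf = d_buf                    ! device -> host copy of the results
     ! write this chunk's results back to disk here
  end do
end program chunked
```

The point is simply that only one chunk's worth of data ever lives on the device at a time, so the 100 GB file never needs to fit in GPU memory.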
Another interesting article covers asynchronous data copies ( http://www.pgroup.com/lit/articles/insider/v3n1a4.htm ). While one chunk is being processed on the GPU, the next chunk's data can be streamed to the GPU, hiding much of the transfer time.
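A rough sketch of that double-buffering pattern, assuming a hypothetical `process_chunk` kernel: the host buffers are declared `pinned` (page-locked) so cudaMemcpyAsync can truly overlap with kernel execution, and two streams alternate so buffer 2 is filling while buffer 1 is being processed:

```fortran
program overlap
  use cudafor
  implicit none
  integer, parameter :: CHUNK = 1024*1024
  real(4), pinned, allocatable :: h_buf(:,:)     ! page-locked host buffers
  real(4), device, allocatable :: d_buf(:,:)
  integer(kind=cuda_stream_kind) :: stream(2)
  integer :: istat, i, b, nchunks

  allocate(h_buf(CHUNK,2), d_buf(CHUNK,2))
  istat = cudaStreamCreate(stream(1))
  istat = cudaStreamCreate(stream(2))

  nchunks = 8                                    ! placeholder chunk count
  do i = 1, nchunks
     b = mod(i-1, 2) + 1                         ! alternate between the two buffers
     ! fill h_buf(:,b) with the next chunk from the file here
     istat = cudaMemcpyAsync(d_buf(:,b), h_buf(:,b), CHUNK, stream(b))
     ! call process_chunk<<<grid, block, 0, stream(b)>>>(d_buf(:,b), CHUNK)
  end do
  istat = cudaDeviceSynchronize()                ! wait for all streams to finish
end program overlap
```

Note that on a C1060 (compute capability 1.3, one copy engine) you get copy/compute overlap but not bidirectional copy overlap; the pinned host buffers are still required either way.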
Hope this helps,