Let’s say I have a bunch of data stored in a file on a hard drive. To move this data, right now, I open the file and then cudaMemcpy it to the GPU. Is there a way to copy data directly from the hard drive to the GPU and, if so, would it be faster? Thanks!
…but, if possible on your operating system, and if you’re going to use the file contents only to copy them to GPU memory, be sure to memory-map the file - this way the whole operation should be faster than going through, say, the standard C library sequence of fopen()/fread()/fclose().
The fastest way would be to use DMA to transfer the file from hard drive to RAM, then from RAM to GPU. You can do that using asynchronous I/O (I/O completion ports on Windows, AIO on Linux): allocate a pinned buffer using CUDA, give it to the async I/O operation and, upon completion, give it to cudaMemcpyAsync. Memory mapping might work fine on Linux, but it performs poorly under Windows for streaming reads, and it would prevent you from using asynchronous DMA without the overhead of additional copies.
Here is a short example of memory mapping for a POSIX-compliant platform - the program copies files:
/* The program is a primitive cp command clone (file names should be
 * provided as command-line arguments), built around the mmap() syscall.
 * Errors are handled with assert() for brevity, so do not build it
 * with NDEBUG defined. */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int file, copy;     /* File descriptors for source and
                         * destination files. */
    void *src, *dst;    /* Addresses of memory regions
                         * corresponding to above files. */
    struct stat stats;  /* The data structure needed for stat()
                         * syscall. */
    /* Check that file names are provided in the command line. */
    assert(argc == 3);
    /* Open the first file. */
    assert((file = open(argv[1], O_RDONLY)) >= 0);
    /* Open the second file. */
    assert((copy = open(argv[2], O_RDWR | O_CREAT | O_TRUNC,
                        S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)) >= 0);
    /* Read information about the first file. */
    assert(fstat(file, &stats) == 0);
    /* Write a byte at the end of second file, so that its size is
     * appropriate for the memory mapping operation. */
    assert(lseek(copy, stats.st_size - 1, SEEK_SET) >= 0);
    assert(write(copy, " ", 1) == 1);
    /* Setup memory mapping for files. */
    assert((src = mmap(0, stats.st_size, PROT_READ,
                       MAP_FILE | MAP_SHARED, file, 0)) != MAP_FAILED);
    assert((dst = mmap(0, stats.st_size, PROT_READ | PROT_WRITE,
                       MAP_FILE | MAP_SHARED, copy, 0)) != MAP_FAILED);
    /* Copy the contents of the first file into the second file. */
    memcpy(dst, src, stats.st_size);
    /* Finish the program: release the mappings and close the files. */
    assert(munmap(src, stats.st_size) == 0);
    assert(munmap(dst, stats.st_size) == 0);
    assert(close(file) == 0);
    assert(close(copy) == 0);
    return 0;
}
Under Windows, the syscalls would be completely different, and note also the comments above from Oxydius - I really don’t know much about Windows, and I was thinking memory mapping would be the fastest way to access file contents on Windows too, but it seems that may not be the case…
Use cgorac’s approach, it’s definitely the easiest way to copy straight from disk to GPU. If you’re using Windows, replace open and mmap with CreateFile, CreateFileMapping and MapViewOfFile, call cudaMemcpy on the mapped view, and when you’re done call UnmapViewOfFile, CloseHandle(mapping) and CloseHandle(file). File mapping on Windows will let you read at around 40 MB/s regardless of drive speed, unless you’re reading from a slower drive of course. If you need something faster (hundreds of MB/s with no CPU usage), read about asynchronous direct I/O.
I still fail to see in the above discussion how the file is getting copied directly to GPU memory. blahCuda wanted to copy the file directly from the hard drive to the GPU without using cudaMemcpy and so on, but the code given above makes a copy of the file on the host itself. How do you get it to the GPU?
If you use memory mapping, data transfer will be synchronous and performed by the CPU, as data is moved from disk to system cache and from system cache to GPU.
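To make the memory-mapped path concrete, here is a minimal sketch in plain C, assuming a POSIX system. The names upload_fn, upload_mapped_file, host_sink and copy_to_host_sink are made up for this illustration. The callback is a stand-in for the GPU copy so that the sketch runs without a GPU; with CUDA it would wrap cudaMemcpy(dev_ptr, src, len, cudaMemcpyHostToDevice), with dev_ptr allocated beforehand by cudaMalloc.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback type: receives the mapped bytes and copies them to
 * their destination. With CUDA, the body would be a cudaMemcpy() call with
 * cudaMemcpyHostToDevice. Returns 0 on success. */
typedef int (*upload_fn)(void *ctx, const void *src, size_t len);

/* Stand-in "device": a plain host buffer, so the sketch runs without a GPU. */
struct host_sink { unsigned char *buf; size_t cap; size_t len; };

int copy_to_host_sink(void *ctx, const void *src, size_t len)
{
    struct host_sink *sink = ctx;
    if (len > sink->cap)
        return -1;
    memcpy(sink->buf, src, len);
    sink->len = len;
    return 0;
}

/* Map `path` read-only and hand the whole mapping to `upload`.
 * Returns the number of bytes handed over, or -1 on error
 * (an empty file counts as an error, since mmap() rejects length 0). */
ssize_t upload_mapped_file(const char *path, upload_fn upload, void *ctx)
{
    assert(path != NULL && upload != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)st.st_size;

    void *src = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (src == MAP_FAILED)
        return -1;

    /* Pages are faulted in from the page cache while this copy runs,
     * so the transfer is synchronous and driven by the CPU. */
    int rc = upload(ctx, src, len);

    munmap(src, len);
    return rc == 0 ? (ssize_t)len : -1;
}
```

A call would look like upload_mapped_file("data.bin", copy_to_host_sink, &sink), where "data.bin" is a placeholder file name. No intermediate read buffer is allocated; the mapped pointer is used directly as the copy source.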
If you use asynchronous I/O, the CPU will allocate a non-pageable buffer and let the disk controller handle the data transfer (DMA) from disk to RAM; then the CPU will wake up to tell the GPU it can copy the same buffer into its video RAM, again through DMA (this time performed by the GPU). This minimizes copies and layers of software, allowing data to move from disk controller to GPU at full speed and without involving the CPU to do actual reads and stores. To use the GPU’s DMA engine, use cudaMemcpyAsync with a non-zero CUDA stream.
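The overlap described above can be sketched with two buffers and POSIX AIO; this is only an illustration of the pattern, not a tuned implementation. The names chunk_fn, stream_file_async, host_image and copy_chunk_to_host are made up here, and the callback again stands in for the GPU copy so the code runs without CUDA. In a real version the buffers would presumably come from cudaHostAlloc() so that they are pinned, and the callback would issue cudaMemcpyAsync on a stream and wait (for example with cudaStreamSynchronize) before returning, since the buffer is reused afterwards. Note also that glibc implements POSIX AIO with helper threads, so this shows the structure of the pipeline rather than true kernel-level direct I/O.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback: consumes one chunk read from file offset `offset`.
 * It must be finished with `chunk` when it returns. Returns 0 on success. */
typedef int (*chunk_fn)(void *ctx, const void *chunk, size_t len, off_t offset);

/* Stand-in "device memory": a host buffer, so the sketch runs without a GPU. */
struct host_image { unsigned char *buf; size_t cap; size_t len; };

int copy_chunk_to_host(void *ctx, const void *chunk, size_t len, off_t offset)
{
    struct host_image *img = ctx;
    if ((size_t)offset + len > img->cap)
        return -1;
    memcpy(img->buf + offset, chunk, len);
    img->len = (size_t)offset + len;
    return 0;
}

/* Queue an asynchronous read; returns 0 if queued, -1 on error. */
static int start_read(struct aiocb *cb, int fd, void *buf, size_t len, off_t off)
{
    memset(cb, 0, sizeof *cb);
    cb->aio_fildes = fd;
    cb->aio_buf = buf;
    cb->aio_nbytes = len;
    cb->aio_offset = off;
    cb->aio_sigevent.sigev_notify = SIGEV_NONE; /* we poll, no signal wanted */
    return aio_read(cb);
}

/* Block until the read completes; returns bytes read, 0 at EOF, -1 on error. */
static ssize_t wait_for_read(struct aiocb *cb)
{
    const struct aiocb *const list[1] = { cb };
    while (aio_error(cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL); /* may return early on EINTR; re-check */
    return aio_return(cb);
}

/* Stream `path` through `consume` in chunks, reading chunk N+1 from disk
 * while chunk N is being consumed. Returns total bytes, or -1 on error. */
ssize_t stream_file_async(const char *path, size_t chunk_size,
                          chunk_fn consume, void *ctx)
{
    assert(chunk_size > 0 && consume != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    void *mem = NULL; /* one allocation, split into two buffers */
    if (posix_memalign(&mem, 4096, 2 * chunk_size) != 0) {
        close(fd);
        return -1;
    }
    char *buf[2] = { mem, (char *)mem + chunk_size };

    struct aiocb cb;
    off_t offset = 0;
    ssize_t total = -1;
    int cur = 0;

    if (start_read(&cb, fd, buf[cur], chunk_size, offset) == 0) {
        total = 0;
        for (;;) {
            ssize_t got = wait_for_read(&cb);
            if (got < 0) { total = -1; break; }
            if (got == 0) break; /* end of file */

            int filled = cur;
            off_t filled_off = offset;
            offset += got;
            cur ^= 1;
            /* Queue the next read first, so it overlaps with the upload. */
            int queued = start_read(&cb, fd, buf[cur], chunk_size, offset) == 0;

            int rc = consume(ctx, buf[filled], (size_t)got, filled_off);
            if (!queued || rc != 0) {
                if (queued)
                    wait_for_read(&cb); /* drain before freeing the buffers */
                total = -1;
                break;
            }
            total += got;
        }
    }
    free(mem);
    close(fd);
    return total;
}
```

The key design point is the ordering inside the loop: the next read is queued before the current chunk is handed to the callback, so the disk and the (stand-in) upload work at the same time instead of taking turns.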