Move data to GPU from hard drive

Greetings.

Let’s say I have a bunch of data stored inside a file on a hard drive. Right now, to move this data, I open the file and then cudaMemcpy it to the GPU. Is there a way to copy data directly from the hard drive to the GPU, and if so, would it be faster? Thanks!

There isn’t any way you can do that sort of I/O that I know of.

…but if your operating system allows it, and if you’re only going to use the file contents to copy them to GPU memory, be sure to use memory mapping of the file - that way the whole operation should be faster than going through, say, the standard C library sequence of fopen()/fread()/fclose().

Can you provide me with an example of this? I’m not really sure how this would work in practice.

The fastest way would be to use DMA to transfer the file from the hard drive to RAM, then from RAM to the GPU. You can do that using asynchronous I/O (I/O completion ports on Windows, AIO on Linux): allocate a pinned buffer using CUDA, hand it to the async I/O operation, and upon completion pass it to CUDA’s cudaMemcpyAsync. Memory mapping might work fine on Linux, but it performs poorly under Windows for streaming reads, and it would prevent you from using asynchronous DMA without the overhead of additional copies.
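For illustration, here is a minimal sketch of that approach on Linux, assuming POSIX AIO (<aio.h>, linked with -lrt on glibc) for the disk side and a pinned buffer from cudaHostAlloc on the CUDA side; the function name is made up and error handling is mostly omitted:

#include <aio.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cuda_runtime.h>

/* Read a whole file into a pinned host buffer via an asynchronous read,
 * then push it to the GPU with cudaMemcpyAsync. */
int load_file_to_gpu(const char *path, void **d_data, size_t *size)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0 || fstat(fd, &st) != 0)
        return -1;
    *size = st.st_size;

    /* Pinned (page-locked) host buffer - needed for truly async GPU copies. */
    void *h_buf;
    cudaHostAlloc(&h_buf, *size, cudaHostAllocDefault);
    cudaMalloc(d_data, *size);

    /* Kick off the asynchronous read from disk into the pinned buffer. */
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = h_buf;
    cb.aio_nbytes = *size;
    cb.aio_offset = 0;
    aio_read(&cb);

    /* Wait for the read to finish (real code would overlap other work here). */
    const struct aiocb *const list[1] = { &cb };
    aio_suspend(list, 1, NULL);
    if (aio_return(&cb) != (ssize_t)*size)
        return -1;

    /* Now let the GPU's DMA engine pull the data out of the pinned buffer. */
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(*d_data, h_buf, *size, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFreeHost(h_buf);
    close(fd);
    return 0;
}

(Note that glibc implements POSIX AIO with user-space threads; for true kernel-driven direct I/O one would look at io_submit or io_uring with O_DIRECT, but the structure is the same.)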

Here is a short example of memory mapping for a POSIX-compliant platform - the program copies one file to another:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* The program is a primitive cp command clone (file names should be
 * provided as command line arguments), built around the mmap() syscall. */
int
main(int argc, char **argv)
{
    int         file, copy;    /* File descriptors for the source and
                                * destination files. */
    char        *src, *dst;    /* Addresses of the memory regions
                                * corresponding to the above files. */
    struct stat stats;         /* The data structure needed for the fstat()
                                * syscall. */

    /* Check that both file names are provided on the command line. */
    assert(argc == 3);

    /* Open the source file. */
    assert((file = open(argv[1], O_RDONLY)) >= 0);

    /* Create/open the destination file. */
    assert((copy =
            open(argv[2], O_RDWR | O_CREAT | O_TRUNC,
                 S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)) >= 0);

    /* Read information (in particular, the size) about the source file. */
    assert(fstat(file, &stats) == 0);

    /* Write a byte at the end of the destination file, so that its size is
     * appropriate for the memory mapping operation. */
    assert(lseek(copy, stats.st_size - 1, SEEK_SET) >= 0);
    assert(write(copy, " ", 1) == 1);

    /* Set up memory mappings for both files. */
    assert((src =
            mmap(0, stats.st_size, PROT_READ, MAP_FILE | MAP_SHARED, file,
                 0)) != MAP_FAILED);
    assert((dst =
            mmap(0, stats.st_size, PROT_READ | PROT_WRITE,
                 MAP_FILE | MAP_SHARED, copy, 0)) != MAP_FAILED);

    /* Copy the contents of the source file into the destination file. */
    memcpy(dst, src, stats.st_size);

    /* Finish the program. */
    exit(EXIT_SUCCESS);
}
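For the GPU case being asked about in this thread, the destination file and its mapping would simply be replaced by device memory and a cudaMemcpy straight out of the mapped region. A rough sketch along those lines (the helper name is invented, error handling is minimal):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cuda_runtime.h>

/* Map a file read-only and copy its contents directly from the mapped
 * region into freshly allocated device memory. */
void *load_file_to_device(const char *path, size_t *size)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0 || fstat(fd, &st) != 0)
        return NULL;
    *size = st.st_size;

    char *src = mmap(0, *size, PROT_READ, MAP_FILE | MAP_SHARED, fd, 0);
    if (src == MAP_FAILED)
        return NULL;

    void *d_data = NULL;
    cudaMalloc(&d_data, *size);
    /* Pages are faulted in from disk as cudaMemcpy reads through them. */
    cudaMemcpy(d_data, src, *size, cudaMemcpyHostToDevice);

    munmap(src, *size);
    close(fd);
    return d_data;
}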

Under Windows, the syscalls would be completely different; note also the comments above from Oxydius - I really don’t know much about Windows, and I was thinking memory mapping would be the fastest way to access file contents there as well, but it seems that may not be the case…

I am not understanding much of this. Here is what I have thus far (in pseudocode):

  • dynamically allocate memory for an array (e.g. malloc)
  • open the file and read it, so that after the call all of the data stored in the file has been copied (read) into the previously dynamically allocated array
  • cudaMalloc (same size as the dynamically allocated array)
  • cudaMemcpy (from host to device)

What should I change here to expedite the whole process? Thanks.
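For reference, here is roughly what that pseudocode corresponds to in C, assuming the data size is known up front (all names are placeholders):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Baseline approach: read the file into a heap buffer, then copy it over. */
int baseline_load(const char *path, void **d_data, size_t size)
{
    /* 1. Dynamically allocate host memory. */
    char *h_buf = malloc(size);
    if (h_buf == NULL)
        return -1;

    /* 2. Open the file and read its contents into the host buffer. */
    FILE *f = fopen(path, "rb");
    if (f == NULL) {
        free(h_buf);
        return -1;
    }
    size_t got = fread(h_buf, 1, size, f);
    fclose(f);
    if (got != size) {
        free(h_buf);
        return -1;
    }

    /* 3. Allocate device memory of the same size. */
    cudaMalloc(d_data, size);

    /* 4. Copy from host to device. */
    cudaMemcpy(*d_data, h_buf, size, cudaMemcpyHostToDevice);

    free(h_buf);
    return 0;
}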

Use cgorac’s approach; it’s definitely the easiest way to copy straight from disk to GPU. If you’re using Windows, replace open and mmap with CreateFile, CreateFileMapping, and MapViewOfFile, then cudaMemcpy, and when you’re done UnmapViewOfFile, CloseHandle(mapping), CloseHandle(file). File mapping on Windows tends to top out at around 40 MB/s regardless of drive speed (unless, of course, the drive itself is slower than that). If you need something faster (hundreds of MB/s with no CPU usage), read about asynchronous direct I/O.
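A rough sketch of that Windows sequence, with error handling omitted and the helper name invented:

#include <windows.h>
#include <cuda_runtime.h>

/* Map a file with the Win32 file-mapping API and copy it to the GPU. */
void *load_file_to_device_win(const char *path, size_t *size)
{
    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return NULL;

    LARGE_INTEGER fsize;
    GetFileSizeEx(file, &fsize);
    *size = (size_t)fsize.QuadPart;

    HANDLE mapping = CreateFileMapping(file, NULL, PAGE_READONLY, 0, 0, NULL);
    void *view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);

    void *d_data = NULL;
    cudaMalloc(&d_data, *size);
    cudaMemcpy(d_data, view, *size, cudaMemcpyHostToDevice);

    /* Tear down in the order suggested above. */
    UnmapViewOfFile(view);
    CloseHandle(mapping);
    CloseHandle(file);
    return d_data;
}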

So what would be your estimate of the ratio between the transfer rate of what I am doing now (on Linux) and what you suggest (asynchronous direct I/O)?

Where do I read about “asynchronous direct I/O”? Also DMA transfers? Thanks!

If you are asking about Linux asynchronous I/O, then this is a pretty reasonable introduction.

I still fail to see in the above discussion how the file gets copied directly to the GPU memory. blahCuda wanted to copy the file directly from the hard drive to the GPU without using cudaMemcpy and the like, but the code given above only makes a copy of the file on the host itself. How do you get it to the GPU?

cudaMemcpy the data from the host memory to the device memory.

If you use memory mapping, the data transfer will be synchronous and performed by the CPU, as data is moved from disk to the system cache and from the system cache to the GPU.

If you use asynchronous I/O, the CPU will allocate a non-pageable buffer and let the disk controller handle the data transfer (DMA) from disk to RAM; the CPU then wakes up only to tell the GPU it can copy that same buffer into its video RAM, again through DMA (this time the GPU’s). This minimizes copies and layers of software, allowing data to move from the disk controller to the GPU at full speed without involving the CPU in the actual reads and stores. To use the GPU’s DMA engine, use cudaMemcpyAsync with a non-zero CUDA stream.
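As a sketch of what that overlap could look like, here is a hypothetical double-buffered loop: while the GPU’s DMA engine drains one pinned buffer, the disk fills the other. POSIX AIO is used for brevity, and the chunk size, names, and missing error handling are arbitrary choices; for the direct-I/O variant the file descriptor would be opened with O_DIRECT (which adds alignment requirements not shown here), and d_dst is assumed to be cudaMalloc’ed for the full size beforehand:

#include <aio.h>
#include <string.h>
#include <cuda_runtime.h>

#define CHUNK (4 << 20)   /* 4 MB chunks - an arbitrary choice. */

/* Stream a file to the GPU in chunks, overlapping the disk transfer for
 * the next chunk with the GPU transfer of the current one. */
int stream_file_to_gpu(int fd, void *d_dst, size_t total)
{
    void *h_buf[2];
    cudaHostAlloc(&h_buf[0], CHUNK, cudaHostAllocDefault);
    cudaHostAlloc(&h_buf[1], CHUNK, cudaHostAllocDefault);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    struct aiocb cb[2];
    size_t offset = 0, pending[2] = { 0, 0 };
    int cur = 0;

    /* Issue the disk read for the first chunk. */
    memset(&cb[cur], 0, sizeof(cb[cur]));
    cb[cur].aio_fildes = fd;
    cb[cur].aio_buf    = h_buf[cur];
    cb[cur].aio_nbytes = pending[cur] = (total < CHUNK) ? total : CHUNK;
    cb[cur].aio_offset = 0;
    if (pending[cur] > 0)
        aio_read(&cb[cur]);

    while (pending[cur] > 0) {
        /* Wait for the disk to finish filling the current buffer. */
        const struct aiocb *const list[1] = { &cb[cur] };
        aio_suspend(list, 1, NULL);
        size_t got = aio_return(&cb[cur]);
        size_t next_off = offset + got;

        /* Start the disk read for the next chunk into the other buffer. */
        int nxt = 1 - cur;
        size_t left = total - next_off;
        pending[nxt] = (left < CHUNK) ? left : CHUNK;
        if (pending[nxt] > 0) {
            memset(&cb[nxt], 0, sizeof(cb[nxt]));
            cb[nxt].aio_fildes = fd;
            cb[nxt].aio_buf    = h_buf[nxt];
            cb[nxt].aio_nbytes = pending[nxt];
            cb[nxt].aio_offset = next_off;
            aio_read(&cb[nxt]);
        }

        /* Meanwhile, let the GPU pull the chunk that just finished. */
        cudaMemcpyAsync((char *)d_dst + offset, h_buf[cur], got,
                        cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);

        offset = next_off;
        cur = nxt;
    }

    cudaStreamDestroy(stream);
    cudaFreeHost(h_buf[0]);
    cudaFreeHost(h_buf[1]);
    return 0;
}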

I have tried doing cudaMemcpyAsync from the GPU to an mmap’ed address, and have the following questions.

  • In such an operation, does the GPU DMA write into the system cache asynchronously, with the CPU taking over upon completion and writing the data to the file?

  • Would that be slow if multiple streams are launched to do this, since a single CPU has to take care of each stream’s cudaMemcpyAsync?

  • If it is slow, can I resolve the issue by spawning multiple threads, each of which has its own stream to perform this operation?

Thank you