Move data to GPU from hard drive

Greetings.

Let’s say I have a bunch of data stored inside a file on a hard drive. Right now, to move this data, I open the file and then cudaMemcpy it to the GPU. Is there a way to copy data directly from the hard drive to the GPU, and if so, would it be faster? Thanks!

There isn’t any way you can do that sort of I/O that I know of.

…but if your operating system allows it, and if you’re only going to use the file contents to copy them to GPU memory, be sure to use memory mapping of the file - that way the whole operation should be faster than going through, say, the standard C library sequence of fopen()/fread()/fclose().

Can you provide me with an example of this? I’m not really sure how this would work in practice.

The fastest way would be to use DMA to transfer the file from the hard drive to RAM, then from RAM to the GPU. You can do that using asynchronous I/O (I/O completion ports on Windows, AIO on Linux): allocate a pinned buffer using CUDA, hand it to the async I/O operation, and upon completion pass it to CUDA’s cudaMemcpyAsync. Memory mapping might work fine on Linux, but it performs poorly under Windows for streaming reads, and it would prevent you from using asynchronous DMA without the overhead of additional copies.
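For illustration, here is a minimal sketch of that approach on Linux, assuming POSIX AIO (<aio.h>, linked with -lrt on glibc) for the disk side and a pinned buffer from cudaHostAlloc on the CUDA side; the function name is made up and error handling is mostly omitted:

#include <aio.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cuda_runtime.h>

/* Read a whole file into a pinned host buffer via an asynchronous read,
 * then push it to the GPU with cudaMemcpyAsync. */
int load_file_to_gpu(const char *path, void **d_data, size_t *size)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0 || fstat(fd, &st) != 0)
        return -1;
    *size = st.st_size;

    /* Pinned (page-locked) host buffer - needed for truly async GPU copies. */
    void *h_buf;
    cudaHostAlloc(&h_buf, *size, cudaHostAllocDefault);
    cudaMalloc(d_data, *size);

    /* Kick off the asynchronous read from disk into the pinned buffer. */
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = h_buf;
    cb.aio_nbytes = *size;
    cb.aio_offset = 0;
    aio_read(&cb);

    /* Wait for the read to finish (real code would overlap other work here). */
    const struct aiocb *const list[1] = { &cb };
    aio_suspend(list, 1, NULL);
    if (aio_return(&cb) != (ssize_t)*size)
        return -1;

    /* Now let the GPU's DMA engine pull the data out of the pinned buffer. */
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(*d_data, h_buf, *size, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFreeHost(h_buf);
    close(fd);
    return 0;
}

(Note that glibc implements POSIX AIO with user-space threads; for true kernel-driven direct I/O one would look at io_submit or io_uring with O_DIRECT, but the structure is the same.)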

Here is a short example of memory mapping for a POSIX-compliant platform - the program copies one file to another:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* The program is a primitive cp command clone (file names should be
 * provided as command line arguments), built around the mmap() syscall. */
int
main(int argc, char **argv)
{
    int         file, copy;    /* File descriptors for the source and
                                * destination files. */
    char        *src, *dst;    /* Addresses of the memory regions
                                * corresponding to the above files. */
    struct stat stats;         /* The data structure needed for the fstat()
                                * syscall. */

    /* Check that both file names are provided on the command line. */
    assert(argc == 3);

    /* Open the source file. */
    assert((file = open(argv[1], O_RDONLY)) >= 0);

    /* Create/open the destination file. */
    assert((copy =
            open(argv[2], O_RDWR | O_CREAT | O_TRUNC,
                 S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)) >= 0);

    /* Read information (in particular, the size) about the source file. */
    assert(fstat(file, &stats) == 0);

    /* Write a byte at the end of the destination file, so that its size is
     * appropriate for the memory mapping operation. */
    assert(lseek(copy, stats.st_size - 1, SEEK_SET) >= 0);
    assert(write(copy, " ", 1) == 1);

    /* Set up memory mappings for both files. */
    assert((src =
            mmap(0, stats.st_size, PROT_READ, MAP_FILE | MAP_SHARED, file,
                 0)) != MAP_FAILED);
    assert((dst =
            mmap(0, stats.st_size, PROT_READ | PROT_WRITE,
                 MAP_FILE | MAP_SHARED, copy, 0)) != MAP_FAILED);

    /* Copy the contents of the source file into the destination file. */
    memcpy(dst, src, stats.st_size);

    /* Finish the program. */
    exit(EXIT_SUCCESS);
}
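For the GPU case being asked about in this thread, the destination file and its mapping would simply be replaced by device memory and a cudaMemcpy straight out of the mapped region. A rough sketch along those lines (the helper name is invented, error handling is minimal):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cuda_runtime.h>

/* Map a file read-only and copy its contents directly from the mapped
 * region into freshly allocated device memory. */
void *load_file_to_device(const char *path, size_t *size)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0 || fstat(fd, &st) != 0)
        return NULL;
    *size = st.st_size;

    char *src = mmap(0, *size, PROT_READ, MAP_FILE | MAP_SHARED, fd, 0);
    if (src == MAP_FAILED)
        return NULL;

    void *d_data = NULL;
    cudaMalloc(&d_data, *size);
    /* Pages are faulted in from disk as cudaMemcpy reads through them. */
    cudaMemcpy(d_data, src, *size, cudaMemcpyHostToDevice);

    munmap(src, *size);
    close(fd);
    return d_data;
}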

Under Windows, the syscalls would be completely different; note also the comments above from Oxydius - I really don’t know much about Windows, and I was thinking memory mapping would be the fastest way to access file contents there as well, but it seems that may not be the case…

I am not understanding much of this. Here is what I have thus far (in pseudocode):

  • dynamically allocate memory for an array (e.g. malloc)
  • open the file and read it, so that after the call all of the data stored in the file has been copied (read) into the previously dynamically allocated array
  • cudaMalloc (same size as the dynamically allocated array)
  • cudaMemcpy (from host to device)

What should I change here to expedite the whole process? Thanks.
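For reference, here is roughly what that pseudocode corresponds to in C, assuming the data size is known up front (all names are placeholders):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Baseline approach: read the file into a heap buffer, then copy it over. */
int baseline_load(const char *path, void **d_data, size_t size)
{
    /* 1. Dynamically allocate host memory. */
    char *h_buf = malloc(size);
    if (h_buf == NULL)
        return -1;

    /* 2. Open the file and read its contents into the host buffer. */
    FILE *f = fopen(path, "rb");
    if (f == NULL) {
        free(h_buf);
        return -1;
    }
    size_t got = fread(h_buf, 1, size, f);
    fclose(f);
    if (got != size) {
        free(h_buf);
        return -1;
    }

    /* 3. Allocate device memory of the same size. */
    cudaMalloc(d_data, size);

    /* 4. Copy from host to device. */
    cudaMemcpy(*d_data, h_buf, size, cudaMemcpyHostToDevice);

    free(h_buf);
    return 0;
}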

Use cgorac’s approach; it’s definitely the easiest way to copy straight from disk to GPU. If you’re using Windows, replace open and mmap with CreateFile, CreateFileMapping, and MapViewOfFile, then cudaMemcpy, and when you’re done UnmapViewOfFile, CloseHandle(mapping), CloseHandle(file). File mapping on Windows tends to top out at around 40 MB/s regardless of drive speed (unless, of course, the drive itself is slower than that). If you need something faster (hundreds of MB/s with no CPU usage), read about asynchronous direct I/O.
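A rough sketch of that Windows sequence, with error handling omitted and the helper name invented:

#include <windows.h>
#include <cuda_runtime.h>

/* Map a file with the Win32 file-mapping API and copy it to the GPU. */
void *load_file_to_device_win(const char *path, size_t *size)
{
    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return NULL;

    LARGE_INTEGER fsize;
    GetFileSizeEx(file, &fsize);
    *size = (size_t)fsize.QuadPart;

    HANDLE mapping = CreateFileMapping(file, NULL, PAGE_READONLY, 0, 0, NULL);
    void *view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);

    void *d_data = NULL;
    cudaMalloc(&d_data, *size);
    cudaMemcpy(d_data, view, *size, cudaMemcpyHostToDevice);

    /* Tear down in the order suggested above. */
    UnmapViewOfFile(view);
    CloseHandle(mapping);
    CloseHandle(file);
    return d_data;
}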

So what would be your estimate of the ratio between the transfer rate of what I am doing now (on Linux) and what you suggest (asynchronous direct I/O)?

Where do I read about “asynchronous direct I/O”? Also DMA transfers? Thanks!

If you are asking about Linux asynchronous I/O, then this is a pretty reasonable introduction.

I still fail to see in the above discussion how the file gets copied directly to the GPU memory. blahCuda wanted to copy the file directly from the hard drive to the GPU without using cudaMemcpy and the like, but the code given above only makes a copy of the file on the host itself. How do you get it to the GPU?

cudaMemcpy the data from the host memory to the device memory.

If you use memory mapping, the data transfer will be synchronous and performed by the CPU, as data is moved from disk to the system cache and from the system cache to the GPU.

If you use asynchronous I/O, the CPU will allocate a non-pageable buffer and let the disk controller handle the data transfer (DMA) from disk to RAM; the CPU then wakes up only to tell the GPU it can copy that same buffer into its video RAM, again through DMA (this time the GPU’s). This minimizes copies and layers of software, allowing data to move from the disk controller to the GPU at full speed without involving the CPU in the actual reads and stores. To use the GPU’s DMA engine, use cudaMemcpyAsync with a non-zero CUDA stream.
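As a sketch of what that overlap could look like, here is a hypothetical double-buffered loop: while the GPU’s DMA engine drains one pinned buffer, the disk fills the other. POSIX AIO is used for brevity, and the chunk size, names, and missing error handling are arbitrary choices; for the direct-I/O variant the file descriptor would be opened with O_DIRECT (which adds alignment requirements not shown here), and d_dst is assumed to be cudaMalloc’ed for the full size beforehand:

#include <aio.h>
#include <string.h>
#include <cuda_runtime.h>

#define CHUNK (4 << 20)   /* 4 MB chunks - an arbitrary choice. */

/* Stream a file to the GPU in chunks, overlapping the disk transfer for
 * the next chunk with the GPU transfer of the current one. */
int stream_file_to_gpu(int fd, void *d_dst, size_t total)
{
    void *h_buf[2];
    cudaHostAlloc(&h_buf[0], CHUNK, cudaHostAllocDefault);
    cudaHostAlloc(&h_buf[1], CHUNK, cudaHostAllocDefault);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    struct aiocb cb[2];
    size_t offset = 0, pending[2] = { 0, 0 };
    int cur = 0;

    /* Issue the disk read for the first chunk. */
    memset(&cb[cur], 0, sizeof(cb[cur]));
    cb[cur].aio_fildes = fd;
    cb[cur].aio_buf    = h_buf[cur];
    cb[cur].aio_nbytes = pending[cur] = (total < CHUNK) ? total : CHUNK;
    cb[cur].aio_offset = 0;
    if (pending[cur] > 0)
        aio_read(&cb[cur]);

    while (pending[cur] > 0) {
        /* Wait for the disk to finish filling the current buffer. */
        const struct aiocb *const list[1] = { &cb[cur] };
        aio_suspend(list, 1, NULL);
        size_t got = aio_return(&cb[cur]);
        size_t next_off = offset + got;

        /* Start the disk read for the next chunk into the other buffer. */
        int nxt = 1 - cur;
        size_t left = total - next_off;
        pending[nxt] = (left < CHUNK) ? left : CHUNK;
        if (pending[nxt] > 0) {
            memset(&cb[nxt], 0, sizeof(cb[nxt]));
            cb[nxt].aio_fildes = fd;
            cb[nxt].aio_buf    = h_buf[nxt];
            cb[nxt].aio_nbytes = pending[nxt];
            cb[nxt].aio_offset = next_off;
            aio_read(&cb[nxt]);
        }

        /* Meanwhile, let the GPU pull the chunk that just finished. */
        cudaMemcpyAsync((char *)d_dst + offset, h_buf[cur], got,
                        cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);

        offset = next_off;
        cur = nxt;
    }

    cudaStreamDestroy(stream);
    cudaFreeHost(h_buf[0]);
    cudaFreeHost(h_buf[1]);
    return 0;
}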

I have tried doing cudaMemcpyAsync from the GPU to an mmap’ed address, and have the following questions.

  • In such an operation, does the GPU DMA write into the system cache asynchronously, with the CPU taking over upon completion and writing the data to the file?

  • Would that be slow if multiple streams are launched to do this, since a single CPU has to take care of each stream’s cudaMemcpyAsync?

  • If it is slow, can I resolve the issue by spawning multiple threads, each of which has its own stream to perform this operation?

Thank you