Reading a file's contents to perform an operation on the buffer takes a long time

I tried the following, where I read the entire file chunk by chunk and perform an operation on each chunk as it is received:
ReadFile(1024 Bytes);
Copy buffer to GPU memory;
Perform GPU computation like XOR each byte of this buffer read from file; // cuda gpu function called here
Copy Result back to CPU memory;

After running this algorithm on a half-GB file, I observed that it takes on the order of minutes to complete, whereas the same algorithm implemented on the CPU only (i.e., with no offloading to the GPU) completes in a few seconds. Is there a particular reason for this?

Welcome to the wonderful world of the PCIe bus bottleneck. Copying the data to the GPU and back is bandwidth limited.

The only way to get a performance benefit is if the time savings of GPU processing (vs the CPU) are bigger than the cost of copying the data to the GPU and the results back.
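As a rough sketch of that break-even condition (the symbols here are introduced only for illustration and are not from the original post): let $N$ be the number of bytes processed, $B$ the effective PCIe bandwidth, and $T_{\text{launch}}$ the fixed per-transfer and per-launch overhead. Offloading pays off only when

```latex
T_{\text{CPU}}(N) \;>\; \underbrace{\frac{N}{B}}_{\text{host to device}}
  + T_{\text{GPU}}(N)
  + \underbrace{\frac{N}{B}}_{\text{device to host}}
  + T_{\text{launch}}
```

For a byte-wise XOR, $T_{\text{CPU}}(N)$ is itself limited by memory bandwidth, so the two $N/B$ terms alone are likely to exceed it. With 1024-byte chunks, $T_{\text{launch}}$ is also paid once per chunk, which is roughly half a million times for a half-GB file.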

In other words: your XOR is too simple to provide a benefit. Try multiplying matrices with millions of elements.


Thank you. I tried performing only the CPU-to-GPU data copy thousands of times, and it is indeed an expensive operation.

Depending on the file size and the specifications of your mass storage subsystem, file processing can easily become I/O limited, creating an even more severe bottleneck than the one represented by PCIe.

Your best bet is to minimize the number of transfers so the overhead becomes a smaller percentage of the overall time. For example, read the entire file into one big buffer and then transfer that to the GPU in one operation. If it is done in pieces then you will have the overhead in each piece.

You may be doing this already, but it isn’t clear because you wrote ReadFile(1024 Bytes) and then said the file is half a GB. I advocate reading that entire half GB into one buffer and transferring all of it to the GPU in one go. That will reduce the overhead to a minimum.
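A minimal sketch of the "one big buffer, one transfer" approach might look like the following. The names readWholeFile, xorWholeBuffer, and xorKernel are hypothetical, as are the one-byte XOR key and the 256x256 launch configuration; the original post does not show the actual kernel. It also assumes the whole file fits in both host and device memory.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical operation: XOR every byte with a one-byte key.
// A grid-stride loop lets a fixed-size grid cover any buffer length.
__global__ void xorKernel(unsigned char *data, size_t n, unsigned char key)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        data[i] ^= key;
}

// Read the entire file into one host buffer with a single fread call.
bool readWholeFile(const char *path, std::vector<unsigned char> &buf)
{
    FILE *f = fopen(path, "rb");
    if (!f) return false;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    bool ok = size >= 0;
    if (ok) {
        buf.resize((size_t)size);
        ok = fread(buf.data(), 1, buf.size(), f) == buf.size();
    }
    fclose(f);
    return ok;
}

// One upload, one kernel launch, one download for the whole buffer.
bool xorWholeBuffer(std::vector<unsigned char> &buf, unsigned char key)
{
    size_t n = buf.size();
    if (n == 0) return true;
    unsigned char *d = nullptr;
    if (cudaMalloc((void **)&d, n) != cudaSuccess) return false;
    bool ok = cudaMemcpy(d, buf.data(), n, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        xorKernel<<<256, 256>>>(d, n, key);
        // The blocking cudaMemcpy below also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(buf.data(), d, n, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d);
    return ok;
}
```

Compared with the 1024-byte loop, the fixed cost of each cudaMemcpy and kernel launch is paid once here instead of once per chunk.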

Also, using pinned (page-locked) host memory will speed up the transfers.
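One way to check this on your own system is to time the same upload from a pageable buffer (malloc) and from a pinned buffer (cudaMallocHost). The sketch below is only an illustration; timeUpload and comparePageableAndPinned are hypothetical helper names, and the measured difference will depend on your hardware.

```cuda
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Time one host-to-device copy in milliseconds using CUDA events.
// Returns a negative value on failure.
float timeUpload(const void *hostSrc, void *devDst, size_t bytes)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }
    cudaEventRecord(start);
    cudaError_t err = cudaMemcpy(devDst, hostSrc, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = -1.0f;
    if (err == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Upload the same number of bytes from pageable and from pinned host memory.
bool comparePageableAndPinned(size_t bytes, float *pageableMs, float *pinnedMs)
{
    void *dev = nullptr;
    void *pinned = nullptr;
    void *pageable = malloc(bytes);               // ordinary pageable memory
    bool ok = pageable != nullptr
           && cudaMalloc(&dev, bytes) == cudaSuccess
           && cudaMallocHost(&pinned, bytes) == cudaSuccess;  // page-locked memory
    if (ok) {
        memset(pageable, 1, bytes);
        memset(pinned, 1, bytes);
        *pageableMs = timeUpload(pageable, dev, bytes);
        *pinnedMs = timeUpload(pinned, dev, bytes);
        ok = *pageableMs >= 0.0f && *pinnedMs >= 0.0f;
    }
    if (pinned) cudaFreeHost(pinned);
    if (dev) cudaFree(dev);
    free(pageable);
    return ok;
}
```

Pinned memory is a limited resource, so it is usually better to allocate a few reusable pinned staging buffers than to pin a very large allocation.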

Thanks. This worked for me in the same test.
The performance problem appears when, after the computation on the GPU, we wait for the results before starting the next read.

Is that a true dependency (you need the result from the GPU to decide what to read next) or a false dependency caused by the structure of your code?

GPU kernel launches are asynchronous, so you should be able to overlap GPU kernel work with host-side work such as loading the next chunk of a file. Ideally you would build a nice pipeline that streams data through the GPU as fast as the slowest pipeline portion will allow. A double-buffering scheme might help with this, for example.
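A double-buffering pipeline along those lines could be sketched as follows, assuming there is no true dependency between chunks. Two slots each own a pinned host buffer, a device buffer, and a stream; while the GPU works on one slot, the host reads the next chunk into the other. The function name xorFilePipelined, the XOR key, and the launch configuration are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical operation: XOR every byte with a one-byte key.
__global__ void xorKernel(unsigned char *data, size_t n, unsigned char key)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        data[i] ^= key;
}

// Stream `in` through the GPU in chunks and write the results to `out`.
bool xorFilePipelined(FILE *in, FILE *out, unsigned char key, size_t chunkBytes)
{
    unsigned char *host[2] = {nullptr, nullptr};
    unsigned char *dev[2] = {nullptr, nullptr};
    cudaStream_t stream[2] = {nullptr, nullptr};
    size_t pending[2] = {0, 0};   // bytes currently in flight in each slot
    bool ok = true;

    for (int s = 0; s < 2 && ok; ++s)
        ok = cudaMallocHost((void **)&host[s], chunkBytes) == cudaSuccess
          && cudaMalloc((void **)&dev[s], chunkBytes) == cudaSuccess
          && cudaStreamCreate(&stream[s]) == cudaSuccess;

    bool eof = false;
    for (int s = 0; ok && (!eof || pending[0] || pending[1]); s ^= 1) {
        // Before reusing a slot, wait for its previous chunk and write it out.
        if (pending[s]) {
            ok = cudaStreamSynchronize(stream[s]) == cudaSuccess
              && fwrite(host[s], 1, pending[s], out) == pending[s];
            pending[s] = 0;
        }
        if (!ok || eof) continue;

        // This read overlaps with the GPU work still queued on the other slot.
        size_t n = fread(host[s], 1, chunkBytes, in);
        if (n < chunkBytes) eof = true;
        if (n == 0) continue;

        // Queue upload, kernel, and download; none of these calls block the host.
        ok = cudaMemcpyAsync(dev[s], host[s], n, cudaMemcpyHostToDevice,
                             stream[s]) == cudaSuccess;
        if (ok) {
            xorKernel<<<64, 256, 0, stream[s]>>>(dev[s], n, key);
            ok = cudaGetLastError() == cudaSuccess
              && cudaMemcpyAsync(host[s], dev[s], n, cudaMemcpyDeviceToHost,
                                 stream[s]) == cudaSuccess;
        }
        pending[s] = n;
    }
    ok = ok && !ferror(in);

    cudaDeviceSynchronize();      // make sure nothing is in flight before freeing
    for (int s = 0; s < 2; ++s) {
        if (stream[s]) cudaStreamDestroy(stream[s]);
        if (dev[s]) cudaFree(dev[s]);
        if (host[s]) cudaFreeHost(host[s]);
    }
    return ok;
}
```

The chunk size should be much larger than 1024 bytes (megabytes rather than kilobytes) so that per-transfer overhead stays small. Throughput is then bounded by the slowest stage, which for a simple XOR is likely to be the file read or the PCIe transfer rather than the kernel.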