I tried an experiment where I read an entire file's contents chunk by chunk and perform an operation on each chunk of data:
Copy the buffer to GPU memory;
Perform a GPU computation, e.g. XOR each byte of the buffer read from the file; // CUDA kernel launched here
Copy the result back to CPU memory;
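A minimal sketch of the loop described above (the filename, chunk size, and XOR key below are arbitrary placeholders, and error checking is omitted for brevity):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Kernel: XOR every byte of the buffer with a fixed key.
__global__ void xorKernel(unsigned char *buf, size_t n, unsigned char key)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] ^= key;
}

int main(void)
{
    const size_t CHUNK = 4 * 1024 * 1024;        // 4 MB chunk size (arbitrary)
    FILE *fp = fopen("input.bin", "rb");         // placeholder input file
    if (!fp) return 1;

    unsigned char *h_buf = (unsigned char *)malloc(CHUNK);
    unsigned char *d_buf;
    cudaMalloc(&d_buf, CHUNK);

    size_t n;
    while ((n = fread(h_buf, 1, CHUNK, fp)) > 0) {
        // Step 1: copy the chunk to GPU memory
        cudaMemcpy(d_buf, h_buf, n, cudaMemcpyHostToDevice);

        // Step 2: XOR each byte on the GPU
        int threads = 256;
        int blocks  = (int)((n + threads - 1) / threads);
        xorKernel<<<blocks, threads>>>(d_buf, n, 0x5A);

        // Step 3: copy the result back to CPU memory
        cudaMemcpy(h_buf, d_buf, n, cudaMemcpyDeviceToHost);
        /* ... use h_buf ... */
    }

    cudaFree(d_buf);
    free(h_buf);
    fclose(fp);
    return 0;
}
```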
After running this algorithm on a file of about half a GB, I observed that it takes around several minutes to complete, whereas the same algorithm implemented entirely on the CPU (i.e., with no offloading to the GPU) completes in a few seconds. Is there a particular reason for this?