Special case overlapping execution and transfer times with streams

Hello all,

I had a question regarding using CUDA streams. Particularly for my application, I was wondering if there was a way to reduce the amount of things I am actually transferring between CPU and GPU. In my current implementation, the flow of my code is as follows:

  1. I transfer an array of “weights” to the GPU
  2. I transfer my inputs to the GPU
  3. I launch the kernel, which calculates a dot product with the weights and the inputs, then updates the weights
  4. I transfer the updated weights and the outputs BACK to the CPU.
  5. I repeat this sequence for thousands of inputs
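In code, the loop looks roughly like this (a simplified sketch; `d_weights`, `d_in`, `d_out`, `dotAndUpdate`, `N`, etc. are placeholder names, and error checking is omitted):

```cuda
for (int i = 0; i < numInputs; ++i) {
    // step 1: weights to GPU
    cudaMemcpy(d_weights, h_weights, N * sizeof(float), cudaMemcpyHostToDevice);
    // step 2: inputs to GPU
    cudaMemcpy(d_in, h_inputs[i], N * sizeof(float), cudaMemcpyHostToDevice);
    // step 3: dot product + weight update
    dotAndUpdate<<<blocks, threads>>>(d_weights, d_in, d_out, N);
    // step 4: updated weights and output back to CPU
    cudaMemcpy(h_weights, d_weights, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_outputs[i], d_out, sizeof(float), cudaMemcpyDeviceToHost);
}
```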

What I am particularly wondering is: is there a way I could just STREAM my inputs and outputs, without copying the “weights” back every time? I really only need to copy the weights BACK from the GPU after the last iteration of my kernel call.

Does anyone have any better ideas how to organize/do this? Thanks!

It sounds like you have the answer in the question. Why do you need to perform step 4 at all except at the end?

I don’t think you need streams to accomplish your goal. Global memory retains its values between kernel calls, so the updated weights can stay on the GPU.
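To illustrate, a minimal sketch of that flow (same placeholder names as above — `d_weights`, `dotAndUpdate`, etc. are assumptions, not your actual code):

```cuda
// Copy the weights to the GPU ONCE, up front. They persist in global
// memory between kernel launches, so there is no need to round-trip them.
cudaMemcpy(d_weights, h_weights, N * sizeof(float), cudaMemcpyHostToDevice);

for (int i = 0; i < numInputs; ++i) {
    // Per iteration: only the input goes up and the output comes back.
    cudaMemcpy(d_in, h_inputs[i], N * sizeof(float), cudaMemcpyHostToDevice);
    dotAndUpdate<<<blocks, threads>>>(d_weights, d_in, d_out, N);  // updates d_weights in place
    cudaMemcpy(&h_outputs[i], d_out, sizeof(float), cudaMemcpyDeviceToHost);
}

// Copy the final weights back ONCE, after the last iteration.
cudaMemcpy(h_weights, d_weights, N * sizeof(float), cudaMemcpyDeviceToHost);
```

If you later do want to overlap the per-iteration input/output copies with kernel execution, that is where streams and `cudaMemcpyAsync` (with pinned host memory) would come in, but it is a separate optimization from keeping the weights resident.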

Yup, you guys are right. Thanks for the help!