I have a question about using CUDA streams. For my application, I am wondering whether there is a way to reduce the amount of data I transfer between the CPU and GPU. In my current implementation, the flow of my code is as follows:
- I transfer an array of “weights” to the GPU
- I transfer my inputs to the GPU
- I launch the kernel, which calculates a dot product with the weights and the inputs, then updates the weights
- I transfer the updated weights and the outputs BACK to the CPU.
- I repeat this sequence for thousands of inputs
What I am wondering in particular: is there a way to just STREAM my inputs and outputs, without copying the “weights” back every time? I only really need to copy the weights BACK from the GPU after the last kernel call.
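To make the idea concrete, here is a rough sketch of the flow I have in mind. It is not my actual code: the names `dot_and_update` and `run_all`, the single-block kernel, the `lr` parameter, and the update rule `w[i] += lr * y * x[i]` are all placeholders I made up for illustration. The weights are uploaded once, stay in device memory across all kernel calls, and are downloaded once at the end; only the inputs and outputs move on each iteration.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",          \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

constexpr int kThreads = 256;  // block size, must be a power of two

// Placeholder kernel, launched with ONE block of kThreads threads.
// Computes y = dot(w, x), stores y in *out, then applies a made-up
// update w[i] += lr * y * x[i]. The real update rule would go here.
__global__ void dot_and_update(float* w, const float* x, float* out,
                               int n, float lr)
{
    __shared__ float partial[kThreads];
    float sum = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) sum += w[i] * x[i];
    partial[threadIdx.x] = sum;
    __syncthreads();
    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    float y = partial[0];
    if (threadIdx.x == 0) *out = y;
    // Each thread updates only the weights it read above, so no race.
    for (int i = threadIdx.x; i < n; i += blockDim.x) w[i] += lr * y * x[i];
}

// h_weights: n floats (read at start, overwritten with final weights)
// h_inputs:  num_inputs * n floats, one input vector after another
// h_outputs: num_inputs floats, one dot product per input
void run_all(float* h_weights, const float* h_inputs, float* h_outputs,
             int n, int num_inputs, float lr)
{
    assert(n > 0 && num_inputs > 0);
    float *d_w = nullptr, *d_x = nullptr, *d_out = nullptr;
    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));
    CUDA_CHECK(cudaMalloc(&d_w, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&d_out, num_inputs * sizeof(float)));

    // Weights go to the GPU once and stay there.
    CUDA_CHECK(cudaMemcpyAsync(d_w, h_weights, n * sizeof(float),
                               cudaMemcpyHostToDevice, stream));

    for (int k = 0; k < num_inputs; ++k) {
        // Operations in one stream run in issue order, so the kernel sees
        // input k and the weights left behind by kernel k - 1.
        CUDA_CHECK(cudaMemcpyAsync(d_x, h_inputs + (size_t)k * n,
                                   n * sizeof(float),
                                   cudaMemcpyHostToDevice, stream));
        dot_and_update<<<1, kThreads, 0, stream>>>(d_w, d_x, d_out + k, n, lr);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaMemcpyAsync(h_outputs + k, d_out + k, sizeof(float),
                                   cudaMemcpyDeviceToHost, stream));
    }

    // Weights come back only once, after the last kernel.
    CUDA_CHECK(cudaMemcpyAsync(h_weights, d_w, n * sizeof(float),
                               cudaMemcpyDeviceToHost, stream));
    CUDA_CHECK(cudaStreamSynchronize(stream));

    CUDA_CHECK(cudaFree(d_w));
    CUDA_CHECK(cudaFree(d_x));
    CUDA_CHECK(cudaFree(d_out));
    CUDA_CHECK(cudaStreamDestroy(stream));
}
```

As far as I understand, a single stream like this only removes the repeated weight transfers; it does not overlap copies with compute. For overlap I assume the host buffers would need to be pinned (for example allocated with `cudaMallocHost`) and the input copies moved to a second stream, with events keeping the kernels in order since each one depends on the weights written by the previous one.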
Does anyone have better ideas for how to organize this? Thanks!