I want to improve the performance of my code using streams. My idea is as follows, but I don't know if it's right. Can anybody help check it? Any suggestions are appreciated.
There is one module that calls 3 kernels, which run on the GPU in order, and another module that transfers memory from host to device. These 2 modules can run in parallel. So I want to bind the 3 kernels of the first module to one stream (stream A), and bind the host-to-device memory transfer of the other module to a second stream (stream B). This should let the kernel execution overlap with the memory transfer.
The current layout is as follows:
Stream A        Stream B
kernel 1        memory transfer from host to device
(There is no dependency between A and B, so they can run in parallel.)
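
To make the idea concrete, here is a minimal sketch of the two-stream layout described above. It is only an illustration under some assumptions: `scaleKernel`, `run_overlap`, the buffer names and the sizes are hypothetical placeholders, not the real code, and one kernel launched three times stands in for the 3 kernels. One thing I am fairly sure of is that `cudaMemcpyAsync` can only overlap with kernel execution when the host buffer is page-locked (pinned), so the sketch allocates it with `cudaMallocHost`. The device also needs at least one copy engine for the overlap to actually happen.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the three kernels of the first module.
__global__ void scaleKernel(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Report a CUDA error and leave the calling function with `false`.
// (Cleanup on the error path is omitted to keep the sketch short.)
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                    cudaGetErrorString(err_));                   \
            return false;                                        \
        }                                                        \
    } while (0)

bool run_overlap() {
    const int n = 1 << 20;                  // assumed problem size
    const size_t bytes = n * sizeof(float);

    float *dA = nullptr;  // device data used by the kernels (stream A)
    float *dB = nullptr;  // device destination of the transfer (stream B)
    float *hB = nullptr;  // pinned host source of the transfer
    CHECK(cudaMalloc(&dA, bytes));
    CHECK(cudaMalloc(&dB, bytes));
    CHECK(cudaMallocHost(&hB, bytes));      // pinned memory, needed for async copies
    for (int i = 0; i < n; ++i) hB[i] = 1.0f;
    CHECK(cudaMemset(dA, 0, bytes));

    cudaStream_t streamA, streamB;
    CHECK(cudaStreamCreate(&streamA));
    CHECK(cudaStreamCreate(&streamB));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Stream A: the three kernels run in issue order within the stream.
    scaleKernel<<<blocks, threads, 0, streamA>>>(dA, n, 2.0f);
    scaleKernel<<<blocks, threads, 0, streamA>>>(dA, n, 3.0f);
    scaleKernel<<<blocks, threads, 0, streamA>>>(dA, n, 4.0f);
    CHECK(cudaGetLastError());

    // Stream B: independent host-to-device copy that may overlap with stream A.
    CHECK(cudaMemcpyAsync(dB, hB, bytes, cudaMemcpyHostToDevice, streamB));

    // Wait for both streams before using the results.
    CHECK(cudaStreamSynchronize(streamA));
    CHECK(cudaStreamSynchronize(streamB));

    CHECK(cudaStreamDestroy(streamA));
    CHECK(cudaStreamDestroy(streamB));
    CHECK(cudaFreeHost(hB));
    CHECK(cudaFree(dA));
    CHECK(cudaFree(dB));
    return true;
}

int main() {
    return run_overlap() ? 0 : 1;
}
```

Both streams are created with `cudaStreamCreate`, so work issued to the legacy default stream would still synchronize with them. Whether the copy really overlaps the kernels can be checked on a profiler timeline.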