A question about the asynchronous mechanism and streams

Hi, all:
I want to improve the performance of my code using streams. My idea is as follows, and I don't know if it's right. Can anybody help to check it? Any suggestions are appreciated:
There is one model that calls 3 kernels which run on the GPU in order, and another model that transfers memory from host to device. These 2 models can run in parallel. So I want to bind the 3 kernels of the first model to one stream (stream A), and bind the host-to-device memory transfer of the other model to a second stream (stream B). This should overlap kernel execution and memory transfer.
Now, the situation is as follows:

Stream A ( model A )
kernel 1
kernel 2
kernel 3

Stream B ( model B )
memory transfer from host to device

(there is no dependency between model A and model B, so they can run in parallel)
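
In code, the setup I have in mind is roughly the following (the kernel names, launch configuration, and buffer names are only placeholders, not my real code):

cudaStream_t streamA, streamB;
cudaStreamCreate(&streamA);
cudaStreamCreate(&streamB);

// model A: three kernels issued in order on stream A
kernel1<<<grid, threads, 0, streamA>>>(dA);
kernel2<<<grid, threads, 0, streamA>>>(dA);
kernel3<<<grid, threads, 0, streamA>>>(dA);

// model B: host-to-device copy on stream B (hB should be pinned host memory)
cudaMemcpyAsync(dB, hB, totalSize, cudaMemcpyHostToDevice, streamB);

// wait for both streams before using the results
cudaStreamSynchronize(streamA);
cudaStreamSynchronize(streamB);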

:rolleyes:

Yes, you can run model A and model B in two separate streams.

There is a problem now: model A also has a memory copy from host to device, and I'm afraid it will conflict with model B.

model A:

kernel1<<<grid, threads, 0, streamA>>>

CUDA_SAFE_CALL( cudaMemcpy(X, Y, totalSize, cudaMemcpyHostToDevice));

kernel2<<<grid, threads, 0, streamA>>>

kernel3<<<grid, threads, 0, streamA>>>

model B:

CUDA_SAFE_CALL(cudaMemcpyAsync((void*)XX, YY, totalSize, cudaMemcpyHostToDevice, streamB));

Is this also right? (Will the synchronous cudaMemcpy in model A conflict with model B?)

thank you!

Allocate the host memory with cudaMallocHost and use cudaMemcpyAsync().
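
For model A that would look something like this (the buffer names, size, and launch configuration are placeholders, and the kernel arguments are left out as in your post):

// pinned (page-locked) host memory is needed for the copy to be truly asynchronous
float *Y;                                    // host source buffer (placeholder)
cudaMallocHost((void**)&Y, totalSize);
// ... fill Y on the host ...

kernel1<<<grid, threads, 0, streamA>>>(...);

// issued into the same stream as the kernels: it stays ordered after kernel1
// and before kernel2 within stream A, but can overlap with the copy in stream B
cudaMemcpyAsync(X, Y, totalSize, cudaMemcpyHostToDevice, streamA);

kernel2<<<grid, threads, 0, streamA>>>(...);
kernel3<<<grid, threads, 0, streamA>>>(...);

cudaStreamSynchronize(streamA);              // Y must stay valid until the copy has finished
cudaFreeHost(Y);

A plain cudaMemcpy goes into the default stream and blocks the host thread until the copy is done, so it will not overlap with the transfer in stream B; with pinned memory and cudaMemcpyAsync on stream A, the copy only has to respect the ordering within stream A.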