I am working on a multi-threaded application where the whole memory pipeline is set up on the GPU as follows.
Process1 -> Process2 -> Process3 -> Process4 and so on…
Several processes run in parallel, each driven by its own CPU thread. All CPU threads share the same CUDA context because GPU memory has to be shared between processes: for instance, the output of Process1 is the input to Process2, so I simply pass Process1's output memory pointer to Process2. I use CPU semaphores to ensure that Process1 writes to the buffer before Process2 reads from it. Each process also launches several kernels; Process1, say, has kernel_11, kernel_12, kernel_13 and so on. The kernels depend on each other's memory as well: the output of kernel_11 is the input to kernel_12.
kernel_11(writes to memory1);
kernel_12(reads from memory1 and writes to memory2);
kernel_13(reads from memory2 and writes to memory3);
Similarly for other processes.
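To make the setup concrete, the kernel chain of one process can be sketched as below. This is a minimal illustration, not the real code: the kernel bodies, the function name run_process1, the buffer size and the block sizes are all made up. The three kernels are launched back to back with no stream argument and no explicit synchronization between them, and with different block sizes, since the thread counts of the kernels may differ. The final cudaMemcpy is a blocking copy, so the host only checks the result after the preceding kernels have finished.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-ins for kernel_11, kernel_12 and kernel_13.
__global__ void kernel_11(int *memory1, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) memory1[i] = i;                    // writes memory1
}

__global__ void kernel_12(const int *memory1, int *memory2, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) memory2[i] = memory1[i] + 1;       // reads memory1, writes memory2
}

__global__ void kernel_13(const int *memory2, int *memory3, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) memory3[i] = 2 * memory2[i];       // reads memory2, writes memory3
}

// Runs the chain in the default stream.
// Returns the number of wrong elements in memory3, or -1 on a CUDA error.
int run_process1(int n) {
    size_t bytes = n * sizeof(int);
    int *memory1 = NULL, *memory2 = NULL, *memory3 = NULL;
    int result = -1;

    if (cudaMalloc(&memory1, bytes) == cudaSuccess &&
        cudaMalloc(&memory2, bytes) == cudaSuccess &&
        cudaMalloc(&memory3, bytes) == cudaSuccess) {
        // No stream argument: all three launches go to the default stream.
        // Block sizes differ on purpose (128, 256, 64 threads per block).
        kernel_11<<<(n + 127) / 128, 128>>>(memory1, n);
        kernel_12<<<(n + 255) / 256, 256>>>(memory1, memory2, n);
        kernel_13<<<(n + 63) / 64, 64>>>(memory2, memory3, n);

        // Blocking copy: returns only after the kernels above have completed.
        std::vector<int> host(n);
        if (cudaMemcpy(host.data(), memory3, bytes,
                       cudaMemcpyDeviceToHost) == cudaSuccess) {
            result = 0;
            for (int i = 0; i < n; ++i)
                if (host[i] != 2 * (i + 1)) ++result;
        }
    }
    cudaFree(memory1);   // cudaFree(NULL) is a no-op, so this is safe on failure
    cudaFree(memory2);
    cudaFree(memory3);
    return result;
}

int main() {
    printf("mismatches: %d\n", run_process1(1 << 20));
    return 0;
}
```

If in-order execution in the default stream is enough to respect the memory dependence, the mismatch count stays at zero even though there is no synchronization call between the launches.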
I am not using any streams at the moment, so I believe all the kernels are issued to the default stream. My questions are:
Do I need to call cudaThreadSynchronize between consecutive kernel launches of the same process, i.e. kernel_11, then cudaThreadSynchronize, then kernel_12? Or does the driver itself guarantee that the kernels execute in order? (The manual says that execution in the default stream is in-order, but my doubt is whether I need to ensure "explicitly" that memory1 has been written by kernel_11 before kernel_12 starts reading it.) Note that kernel_11 and kernel_12 may be launched with different numbers of GPU threads.
Since the CPU threads already synchronize with semaphores, can I just launch the kernel in the second process (Process2) without any issue? Process2 waits until the write by Process1 is complete. (I am only passing pointers, and the memory stays on the GPU, so I don't need to transfer it to the CPU.)
Will the usage of Streams by different Processes give any speedup?
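For the second question, the handoff between two CPU threads can be sketched as follows. Again this is only an illustration under stated assumptions, not the real code: the kernels produce and consume, the Signal struct (a mutex plus condition variable standing in for the CPU semaphore) and the function run_handoff are hypothetical names. It also assumes a CUDA 4.0 or newer runtime, where host threads of one process share the device context, and it uses cudaDeviceSynchronize, the newer name for the deprecated cudaThreadSynchronize. The producer thread waits for its kernel to finish before signalling, which is the conservative choice; whether that wait is actually required is exactly what the question is about.

```cuda
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the last kernel of Process1 and the first of Process2.
__global__ void produce(int *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = i;
}

__global__ void consume(const int *buf, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = buf[i] + 1;
}

// Hypothetical one-shot signal standing in for the CPU semaphore.
struct Signal {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
    void post() {
        { std::lock_guard<std::mutex> lock(m); ready = true; }
        cv.notify_one();
    }
    void wait() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return ready; });
    }
};

// Returns the number of wrong elements in out, or -1 on a CUDA error.
int run_handoff(int n) {
    size_t bytes = n * sizeof(int);
    int *buf = NULL, *out = NULL;
    int result = -1;
    if (cudaMalloc(&buf, bytes) == cudaSuccess &&
        cudaMalloc(&out, bytes) == cudaSuccess) {
        Signal sig;
        std::thread process1([&] {
            produce<<<(n + 127) / 128, 128>>>(buf, n);
            cudaDeviceSynchronize();   // assumption: finish the write before signalling
            sig.post();
        });
        std::thread process2([&] {
            sig.wait();                // only the pointer is handed over, no copy
            consume<<<(n + 255) / 256, 256>>>(buf, out, n);
            cudaDeviceSynchronize();
        });
        process1.join();
        process2.join();

        std::vector<int> host(n);
        if (cudaMemcpy(host.data(), out, bytes,
                       cudaMemcpyDeviceToHost) == cudaSuccess) {
            result = 0;
            for (int i = 0; i < n; ++i)
                if (host[i] != i + 1) ++result;
        }
    }
    cudaFree(buf);
    cudaFree(out);
    return result;
}

int main() {
    printf("mismatches: %d\n", run_handoff(1 << 20));
    return 0;
}
```

Depending on the toolkit and platform, building this may need C++11 enabled and the host threading library linked (for example passing -pthread to the host compiler on Linux).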
I am asking the first two questions because I am getting unexpected results, and I want to be sure that this is not because CUDA expects me to do something that I am not doing.
PS: The graphics card I am using is a GTX 580.