Hi Everybody,
I have to carry out a multiplication of a large block-structured matrix, M, with a vector, V. I can already do this with a single kernel, and I intend to stick with a single kernel. Now I have to do it for a different V at each time step. One way to do this is:
1. Copy M and V to global memory from the host.
start (time step loop)
2. Call the kernel that multiplies.
3. The kernel threads load M and V from global memory into shared memory and multiply them.
4. Wait for the kernel to complete with cudaThreadSynchronize();
5. Copy the result R from device to host.
6. Copy the new V into global memory.
end (time step loop)
This looks fairly straightforward and should work. But the matrix is the same in every time step. Is there a way to avoid copying M from global memory to shared memory in step (3)?
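In code, the loop above would look roughly like this. This is only a sketch: matvec_kernel, hM/hV/hR, next_V, N, nsteps, and the launch geometry are placeholder names I made up, not actual code.

```cuda
// Sketch of the straightforward per-time-step scheme (placeholder names).
float *dM, *dV, *dR;                      // device copies of M, V, R
cudaMalloc(&dM, N * N * sizeof(float));
cudaMalloc(&dV, N * sizeof(float));
cudaMalloc(&dR, N * sizeof(float));
cudaMemcpy(dM, hM, N * N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(dV, hV, N * sizeof(float), cudaMemcpyHostToDevice);

for (int t = 0; t < nsteps; ++t) {
    // Kernel loads M and V into shared memory internally, then multiplies.
    matvec_kernel<<<grid, block, shmemBytes>>>(dM, dV, dR, N);
    cudaThreadSynchronize();              // wait for the kernel to finish
    cudaMemcpy(hR, dR, N * sizeof(float), cudaMemcpyDeviceToHost);
    // produce the next V on the host, then push it to the device
    cudaMemcpy(dV, next_V(t), N * sizeof(float), cudaMemcpyHostToDevice);
}
```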
I am trying to exploit the fact that kernels run asynchronously. Is the following possible?
1. Copy M to global memory from the host. Allocate the vectors V and R with cudaHostAlloc(…,cudaHostAllocMapped) so they are visible to both host and device. Copy the first V in, and initialize R to, say, all zeros.
2. Create a mapped int variable called flag, also with cudaHostAlloc(…,cudaHostAllocMapped), and set flag = 1.
3. Call the new kernel.
start (time loop)
4. Check if flag = 0; if not, sleep. Once flag = 0 (cleared by the kernel), display the result vector R (populated by the kernel).
5. Copy the V for the next time step into V.
6. Set flag = 1.
end (time loop)
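A host-side sketch of what I have in mind, assuming the device supports mapped pinned memory; persistent_matvec, first_V, next_V, display, and the launch geometry are again placeholders:

```cuda
// Host side of the persistent-kernel variant (placeholder names).
float *V, *R;
int   *flag;
cudaSetDeviceFlags(cudaDeviceMapHost);    // must precede context creation
cudaHostAlloc((void**)&V,    N * sizeof(float), cudaHostAllocMapped);
cudaHostAlloc((void**)&R,    N * sizeof(float), cudaHostAllocMapped);
cudaHostAlloc((void**)&flag, sizeof(int),       cudaHostAllocMapped);
memcpy(V, first_V, N * sizeof(float));
memset(R, 0, N * sizeof(float));
*flag = 1;

float *dV, *dR; int *dflag;               // device-side views of the mapped buffers
cudaHostGetDevicePointer((void**)&dV,    V,    0);
cudaHostGetDevicePointer((void**)&dR,    R,    0);
cudaHostGetDevicePointer((void**)&dflag, flag, 0);

persistent_matvec<<<grid, block, shmemBytes>>>(dM, dV, dR, dflag, N, nsteps);  // launched once

for (int t = 0; t < nsteps; ++t) {
    while (*flag != 0) { /* sleep/poll until the kernel clears the flag */ }
    display(R);                           // result for this time step
    memcpy(V, next_V(t), N * sizeof(float));
    *flag = 1;                            // hand the new V to the kernel
}
```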
The new kernel, in turn, runs its own time loop:
1. Copy M from global memory to shared memory.
start (time loop)
2. Check if flag = 1; if not, keep checking. If yes, read V into shared memory, multiply M with V, and store the result in R (which lives on the host; I hope the memory controller takes care of the copy).
3. Reset flag = 0 (the flag is again on the host; I hope the DMA has its operations queued in order and finishes step 2 before it updates the flag).
end (time loop)
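The kernel side might look like the sketch below (single block, for clarity, and all names are placeholders). The volatile qualifier on the flag, and the __threadfence_system() call to push R out before the flag flips, are my guesses at what such a spin loop would need, not something I know to be sufficient:

```cuda
// Persistent kernel: cache M in shared memory once, then serve every
// time step from one launch (sketch only, placeholder names).
__global__ void persistent_matvec(const float *M, const float *V,
                                  float *R, volatile int *flag,
                                  int N, int nsteps)
{
    extern __shared__ float sM[];          // M cached in shared memory once
    for (int i = threadIdx.x; i < N * N; i += blockDim.x)
        sM[i] = M[i];
    __syncthreads();

    for (int t = 0; t < nsteps; ++t) {
        while (*flag != 1) { /* spin until the host posts a new V */ }
        for (int row = threadIdx.x; row < N; row += blockDim.x) {
            float sum = 0.0f;
            for (int col = 0; col < N; ++col)
                sum += sM[row * N + col] * V[col];
            R[row] = sum;                  // written straight to mapped host memory
        }
        __syncthreads();
        __threadfence_system();            // try to make R visible before the flag flips
        if (threadIdx.x == 0) *flag = 0;   // tell the host the result is ready
    }
}
```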
In this variation the one kernel keeps running after a single initial call. Advantage: step 1 of the new kernel, copying the big matrix M from global memory to shared memory, happens just once. I don't know how much overhead a kernel launch carries, but whatever it is gets saved too.
I tried this kind of IPC with simple addition programs, and it does not work. Step 2 of the new kernel, a kernel-initiated memory copy, only becomes visible when the host calls cudaThreadSynchronize(); but the host can never call that if the kernel is always 'on', expecting new data to process.
- Should the above work (meaning I only have to debug), or is it simply not possible? The documentation says that mapped memory helps amortize one-time copy latencies, without spelling out what is and is not possible.
- Is there any other mechanism offered by CUDA to achieve this?
Thank you very much,
Elan.