CUDA streams with PyCUDA

I found an article about parallel seam carving on CPU-GPU and I am trying to follow the process ("CPU-GPU Hybrid and Concurrency Method for Seam Carving", Figure 7). I have already created the two streams, but I have an issue when I try to make an asynchronous copy of the indexMap from device to host: the data in indexMap do not seem to be updated, and I always get the all-zeros array from the initialization.

Please find my code here.

import pycuda.driver as cuda
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule
import numpy

stream1 = cuda.Stream()
stream2 = cuda.Stream()

energyMap = numpy.random.randn(4,4).astype(numpy.float32)
indexMap = numpy.zeros((3,4)).astype(numpy.float32)
sumMap = numpy.zeros((3,4)).astype(numpy.float32)

energyMap_gpu = gpuarray.to_gpu_async(energyMap, stream = stream2)
indexMap_gpu = cuda.mem_alloc(indexMap.nbytes)
sumMap_gpu = gpuarray.to_gpu_async(sumMap, stream = stream2)

cuda.memcpy_htod(indexMap_gpu, indexMap)

mod = SourceModule("""
__global__ void getIndexMap(float *energyMap, float *indexMap){
      // some code....
}

__global__ void getSumMap(float *indexMap, float *sumMap){
      // some code....
}
""")

getIndexMap = mod.get_function("getIndexMap")
getSumMap = mod.get_function("getSumMap")

getIndexMap(energyMap_gpu, indexMap_gpu, block=(4,4,1), stream=stream2)
cuda.memcpy_dtoh_async(indexMap, indexMap_gpu, stream=stream1)
getSumMap(indexMap_gpu, sumMap_gpu, block=(4,4,1), stream=stream2)


When I replace cuda.memcpy_dtoh_async(indexMap, indexMap_gpu, stream=stream1) with the blocking cuda.memcpy_dtoh(indexMap, indexMap_gpu), I get the correct result.

Should I use a CUDA event (stream wait event)? If so, how should I use it?
Thanks !

If you are modifying indexMap_gpu in the getIndexMap launch on stream2, why would you then copy indexMap_gpu to indexMap using an async copy on stream1? That seems like a broken use of streams to me, assuming you want the values updated by the kernel to show up in your host copy of the data.

So it seems to me like the first thing you would need to do is change the memcpy_dtoh_async call to use stream=stream2, so that it does not begin until the getIndexMap kernel is complete.

After that, if you want the host to wait for the device-to-host copy to finish, one way is to use the non-async version of the dtoh copy, as you have already indicated. Another way is to use a CUDA stream synchronize.


Yes, you could use an event and then wait on the event (CUDA event synchronize, not CUDA stream wait event), but that seems like overkill to me.

Thanks for your answer.

But I am just trying to proceed as shown in the quote below.

"After the index map is made, stream1 starts to copy the index map to the CPU, and stream2 computes the sum map and finds the index value of the optimal seam that has the minimum energy sum at the same time. By doing this, the transfer time of the index map is almost hidden. Then, the index value is transferred to the CPU, and the CPU starts to trace the path of the optimal seam based on the index map. The CPU also needs to access DDR memory as many times as the height of the image, but it takes much less time than the GPU. Finally, the optimal seam path is copied into the GPU to use it for resizing the energy map."

As you can see, I start transferring the indexMap from device to host on stream1. At the same time, I continue computing on stream2.

Fair enough. I haven't studied the whole thing, nor have you provided complete code (based on the code you provided, I have no idea whether getIndexMap modifies indexMap_gpu or not, but it certainly seemed that it might).

If you believe the stream organization is correct, then simply add a stream synchronize on stream1 before printing out the host data, and you should certainly get whatever was copied by the dtoh async copy. If it is still all zeros, then the problem is somewhere else in your overall algorithm (in that case, the source data was also zero).

And, yes, if you want the async dtoh copy on stream1 to wait for the getIndexMap results from stream2 to be complete, you will need a stream wait event or something similar to enforce that. Otherwise, the stream1 activity will not wait for the stream2 activity.

event1 = cuda.Event()
getIndexMap(energyMap_gpu, indexMap_gpu, block=(4,4,1), stream=stream2)
event1.record(stream2)              # record the event in stream2, after the kernel
stream1.wait_for_event(event1)      # stream1 will not proceed until the kernel is done
cuda.memcpy_dtoh_async(indexMap, indexMap_gpu, stream=stream1)

Something like that. Coded in browser, not tested.

Even with the above mods, you will still need a CUDA stream synchronize on stream1 before printing out indexMap, or you may get stale data on the host.

You are right, I didn't provide complete code.

I just tried yours and it works!

Thank you very much, and an early happy New Year! :)