Two streams are not working asynchronously

Hi there,

I'm running inference on two images sequentially on a Jetson Nano.
The source code is roughly structured like this:

array<vector<float*>, 2> buffer;
buffer[0].resize(3);
buffer[1].resize(3);
for(int i = 0; i < 2; ++i){
    cudaMallocManaged(&buffer[i][0], inputSize, cudaMemAttachHost);
    cudaMallocManaged(&buffer[i][1], output0Size, cudaMemAttachGlobal);
    cudaMallocManaged(&buffer[i][2], output1Size, cudaMemAttachGlobal);
}

array<cudaStream_t, 2> stream;
for(int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);

for(int i = 0; i < 2; ++i){
    Mat img = readImage();
    float* data = (float*)malloc(img.rows*img.cols*sizeof(float));
    normalizeImage(img, data);
    cudaMemcpyAsync(buffer[i][0], data, img.rows*img.cols*sizeof(float), cudaMemcpyDefault, stream[i]);
    free(data);

    cudaStreamAttachMemAsync(stream[i], buffer[i][0], inputSize*sizeof(float), cudaMemAttachGlobal);
    context->enqueueV2((void**)buffer[i].data(), stream[i], nullptr);
    cudaStreamAttachMemAsync(stream[i], buffer[i][1], output0Size*sizeof(float), cudaMemAttachHost);
    cudaStreamAttachMemAsync(stream[i], buffer[i][2], output1Size*sizeof(float), cudaMemAttachHost);
}

for(int i = 0; i < 2; ++i){
    cudaStreamSynchronize(stream[i]);
    postprocess(buffer[i][1], buffer[i][2]);
}

With the code above, buffer[i][1] and buffer[i][2] come back as NaN (inference fails).
But if I replace cudaMemcpyAsync with cudaMemcpy, it works fine:

cudaMemcpyAsync(buffer[i][0], data, img.rows*img.cols*sizeof(float), cudaMemcpyDefault, stream[i])
cudaMemcpy(buffer[i][0], data, img.rows*img.cols*sizeof(float), cudaMemcpyDefault)

Also, why don't the two streams run in parallel?

You are not allocating the buffers correctly. cudaMalloc* takes a pointer to a pointer; you need to pass &buffer[i][0] instead of buffer[i][0].

Streams in an error-free CUDA program are not required to execute in parallel.

That was a typo in my post; I just edited that part.
In my actual source code I do pass &buffer[i][0].

It doesn't make sense to me that they don't have to run in parallel.
My understanding is that as many operations can run in parallel as there are streams (given enough resources).

That is the point. How big are your images? A Jetson Nano has a GPU with a single Maxwell SM, and that SM can process at most 2048 threads at a time. A CUDA processing task on an image larger than about 50x50 (2500 pixels, already more than 2048) can easily use all the resources of that GPU.

I understand that because my images are larger than 50x50, the streams won't run in parallel.
But I still want (cudaMemcpyAsync, cudaStreamAttachMemAsync, enqueueV2, cudaStreamAttachMemAsync, cudaStreamAttachMemAsync) queued on each stream, as in the code above.
Why does cudaMemcpyAsync fail while cudaMemcpy works?
I don't want to synchronize by using cudaMemcpy; I want to synchronize with cudaStreamSynchronize(stream[i]).

This is incorrect use of cudaMemcpyAsync. Your code frees data, the source of the transfer, without waiting for the transfer to finish. cudaMemcpy, by contrast, always blocks until the transfer is complete.

I modified it as follows and tested it.
But the symptoms are still there.

array<vector<float*>, 2> buffer;
buffer[0].resize(3);
buffer[1].resize(3);

for(int i = 0; i < 2; ++i){
    cudaMallocManaged(&buffer[i][0], inputSize, cudaMemAttachHost);
    cudaMallocManaged(&buffer[i][1], output0Size, cudaMemAttachGlobal);
    cudaMallocManaged(&buffer[i][2], output1Size, cudaMemAttachGlobal);
}

array<float*, 2> data;
array<cudaStream_t, 2> stream;
for(int i = 0; i < 2; ++i){
    data[i] = (float*)malloc(IMAGE_SIZE * IMAGE_SIZE * sizeof(float));
    cudaStreamCreate(&stream[i]);
}

for(int i = 0; i < 2; ++i){
    Mat img = readImage();
    normalizeImage(img, data[i]);
    cudaMemcpyAsync(buffer[i][0], data[i], img.rows*img.cols*sizeof(float), cudaMemcpyDefault, stream[i]);

    cudaStreamAttachMemAsync(stream[i], buffer[i][0], inputSize*sizeof(float), cudaMemAttachGlobal);
    context->enqueueV2((void**)buffer[i].data(), stream[i], nullptr);
    cudaStreamAttachMemAsync(stream[i], buffer[i][1], output0Size*sizeof(float), cudaMemAttachHost);
    cudaStreamAttachMemAsync(stream[i], buffer[i][2], output1Size*sizeof(float), cudaMemAttachHost);
}

for(int i = 0; i < 2; ++i){
    cudaStreamSynchronize(stream[i]);
    postprocess(buffer[i][1], buffer[i][2]);
}

for(int i = 0; i < 2; ++i){
    free(data[i]);
    cudaStreamDestroy(stream[i]);
}

By "symptoms" do you mean wrong results with cudaMemcpyAsync? You may not need the copy in the first place. Since managed memory is accessible from both host and device, you should be able to call normalizeImage(img, buffer[i][0]) directly.
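That suggestion could look roughly like this. This is an untested sketch, not a verified fix: it assumes normalizeImage only needs CPU access to its output buffer, and that the buffer is not still in use by the GPU when the CPU writes to it (a length of 0 in cudaStreamAttachMemAsync means the entire allocation).

```cpp
// Sketch: fill the managed input buffer directly, no host staging copy.
for (int i = 0; i < 2; ++i) {
    Mat img = readImage();
    cudaStreamSynchronize(stream[i]);   // buffer must not be in use by the GPU
    normalizeImage(img, buffer[i][0]);  // CPU writes straight into managed memory
    cudaStreamAttachMemAsync(stream[i], buffer[i][0], 0, cudaMemAttachGlobal);
    context->enqueueV2((void**)buffer[i].data(), stream[i], nullptr);
}
```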

I don't think I can help you any further. There could be CUDA errors that you do not check for (at least the code you have shown does no error checking). The mistake could also be in code you have not shown.