Two images are inferred sequentially on a Jetson Nano.
The source code is roughly structured like this:
    array<vector<float*>, 2> buffer;
    buffer[0].resize(3);
    buffer[1].resize(3);
    for(int i = 0; i < 2; ++i){
        cudaMallocManaged(&buffer[i][0], inputSize, cudaMemAttachHost);
        cudaMallocManaged(&buffer[i][1], output0Size, cudaMemAttachGlobal);
        cudaMallocManaged(&buffer[i][2], output1Size, cudaMemAttachGlobal);
    }

    array<cudaStream_t, 2> stream;
    for(int i = 0; i < 2; ++i)
        cudaStreamCreate(&stream[i]);

    for(int i = 0; i < 2; ++i){
        Mat img = readImage();
        float* data = (float*)malloc(img.rows*img.cols*sizeof(float));
        normalizeImage(img, data);
        cudaMemcpyAsync(buffer[i][0], data, img.rows*img.cols*sizeof(float), cudaMemcpyDefault, stream[i]);
        free(data);
        cudaStreamAttachMemAsync(stream[i], buffer[i][0], inputSize*sizeof(float), cudaMemAttachGlobal);
        context->enqueueV2((void**)buffer[i].data(), stream[i], nullptr);
        cudaStreamAttachMemAsync(stream[i], buffer[i][1], output0Size*sizeof(float), cudaMemAttachHost);
        cudaStreamAttachMemAsync(stream[i], buffer[i][2], output1Size*sizeof(float), cudaMemAttachHost);
    }

    for(int i = 0; i < 2; ++i){
        cudaStreamSynchronize(stream[i]);
        postprocess(buffer[i][1], buffer[i][2]);
    }
With the code above, buffer[i][1] and buffer[i][2] come out as NaN (inference fails).
But if I replace cudaMemcpyAsync with cudaMemcpy, it works fine:

    cudaMemcpyAsync(buffer[i][0], data, img.rows*img.cols*sizeof(float), cudaMemcpyDefault, stream[i])
    → cudaMemcpy(buffer[i][0], data, img.rows*img.cols*sizeof(float), cudaMemcpyDefault)

Also, why don’t the two streams run in parallel?
There was a typo in my post; I just edited that part. In my source code I passed &buffer[i][0].
It doesn’t make sense to me that this shouldn’t be able to run in parallel. My understanding is that as many operations can run in parallel as there are streams (given enough resources).
That is the point. How big are your images? A Jetson Nano has a GPU with a single Maxwell SM. That SM can process at most 2048 threads at a time, so a CUDA kernel working on an image larger than about 50x50 can easily use all the resources of that GPU.
I understand that because my images are larger than 50x50, the streams don’t run in parallel.
But I still want each stream’s queue to execute (cudaMemcpyAsync, cudaStreamAttachMemAsync, enqueueV2, cudaStreamAttachMemAsync, cudaStreamAttachMemAsync) as in the source code above.
Why does cudaMemcpyAsync fail while cudaMemcpy works?
I don’t want to synchronize through cudaMemcpy; I want to synchronize with cudaStreamSynchronize(stream[i]).
This is incorrect use of cudaMemcpyAsync. Your code frees data, which is still being transferred to the device, without waiting for the transfer to finish. cudaMemcpy, by contrast, always waits until the transfer is complete.
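One way to keep the copy asynchronous while fixing the lifetime bug (a sketch against the code above; error checking omitted, and `inputBytes` is a stand-in for the image byte count) is to make sure the staging buffer outlives everything enqueued on the stream:

```cpp
// Sketch: the staging buffer must stay alive until the stream has
// finished reading from it, so free it only after synchronizing.
float* data = (float*)malloc(inputBytes);
normalizeImage(img, data);
cudaMemcpyAsync(buffer[i][0], data, inputBytes, cudaMemcpyDefault, stream[i]);
// ... enqueue the attach calls and enqueueV2 on stream[i] as before ...
cudaStreamSynchronize(stream[i]);  // the copy (and enqueued work) is done
free(data);                        // now it is safe to release the host buffer
```

If you want to keep the two streams overlapped, an alternative along the same lines is one staging buffer per stream, freed only after the final cudaStreamSynchronize loop.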
By symptoms, do you mean wrong results with cudaMemcpyAsync? You may not need that copy in the first place. Since managed memory is accessible from both host and device, you should be able to do normalizeImage(img, buffer[i][0]) directly.
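A sketch of that approach, using the buffers allocated above (passing 0 as the length to cudaStreamAttachMemAsync attaches the whole allocation):

```cpp
// Sketch: normalize straight into the managed input buffer instead of
// staging through a malloc'd host copy. While the buffer is attached to
// the host (cudaMemAttachHost at allocation), the CPU can write it.
Mat img = readImage();
normalizeImage(img, buffer[i][0]);  // CPU writes the managed buffer directly
// Hand the buffer over to the GPU on this stream before inference.
cudaStreamAttachMemAsync(stream[i], buffer[i][0], 0, cudaMemAttachGlobal);
context->enqueueV2((void**)buffer[i].data(), stream[i], nullptr);
```

This removes both the extra copy and the lifetime problem, since there is no temporary host buffer to free.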
I don’t think I can help you any further. There could be CUDA errors which you do not check for (at least the code you have shown does no error checking). The mistake could also be in code you have not shown.