Multiple cameras cause running slow of API

xxxss · December 25, 2017, 9:47am

We built a object detection system of 4 cameras with TX1.

If we run application with only 1 camera active and detection. the whole loop of detection can be run 21 times in 1 second.

When four cameras are streaming and only one of them are used for detection algorithm, the detection can only be done 11 times in 1 second.

I tried nvprof. Below is result of 1 camera streaming:

==7317== Profiling application: ./detection -c1
==7317== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 21.90%  33.1770s      3747  8.8543ms  7.4923ms  14.441ms  void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=7, int=4>(cublasGemmSmallNParams<float, float, float>,
 float const *, float const *, float, float, int)
 15.23%  23.0770s     26229  879.83us  70.940us  2.1807ms  maxwell_sgemm_128x64_raggedMn_nn
 12.04%  18.2380s     33723  540.82us  59.430us  3.5146ms  im2col_gpu_kernel(int, float const *, int, int, int, int, int, int, int, float*)
  7.26%  10.9959s     93676  117.38us  1.8750us  1.9100ms  fill_kernel(int, float, float*, int)
  7.07%  10.7105s     29976  357.30us  28.282us  3.2339ms  normalize_kernel(int, float*, float*, float*, int, int, int)
  6.70%  10.1537s     71193  142.62us  1.6660us  2.0321ms  activate_array_kernel(float*, int, ACTIVATION)
  6.08%  9.21664s     33723  273.30us  5.0520us  2.0859ms  add_bias_kernel(float*, float*, int, int, int)
  5.77%  8.74433s      3747  2.3337ms  2.0663ms  3.9664ms  void magma_lds128_sgemm_kernel<bool=0, bool=0, int=6, int=5, int=3, int=3, int=3>(int, int, int, float const *, int, float const *,
 int, float*, int, int, int, float const *, float const *, float, float, int)
  5.39%  8.16327s     33723  242.07us  2.5000us  2.1818ms  copy_kernel(int, float*, int, int, float*, int, int)
  5.36%  8.11565s     29976  270.74us  28.908us  2.4257ms  scale_bias_kernel(float*, float*, int, int)
  4.91%  7.43280s     22482  330.61us  57.034us  1.7648ms  forward_maxpool_layer_kernel(int, int, int, int, int, int, int, float*, float*, int*)
  1.25%  1.89259s      3747  505.09us  353.56us  1.0201ms  convertIntToFloatKernelShow(unsigned __int64, int, int, void*, int, char*, int)
  1.04%  1.57767s     11451  137.78us     207ns  39.722ms  [CUDA memcpy HtoH]
  0.02%  29.780ms      3747  7.9470us  7.2930us  12.344us  softmax_kernel(float*, int, int, int, int, int, int, float, float*)
  0.00%  1.1460us         1  1.1460us  1.1460us  1.1460us  [CUDA memcpy HtoD]

==7317== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 77.96%  124.254s     11452  10.850ms  37.033us  241.49ms  cudaMemcpy
 15.53%  24.7526s    389689  63.518us  29.950us  14.282ms  cudaLaunch
  1.66%  2.64968s      3747  707.15us  47.606us  7.9432ms  cudaStreamSynchronize
  1.15%  1.83160s   2577940     710ns     416ns  2.7011ms  cudaSetupArgument
  1.00%  1.59649s      3748  425.96us  12.032us  1.46191s  cudaFree
  0.87%  1.38763s      3747  370.33us  156.93us  5.0750ms  cuGraphicsEGLRegisterImage
  0.30%  482.49ms      3747  128.77us  56.618us  2.9200ms  cuGraphicsUnregisterResource
  0.27%  423.85ms    389689  1.0870us     469ns  3.0148ms  cudaConfigureCall
  0.24%  374.82ms    435021     861ns     468ns  2.6683ms  cudaGetLastError
  0.22%  357.68ms         2  178.84ms  160.33ms  197.35ms  cuCtxCreate
  0.21%  341.01ms    318496  1.0700us     572ns  2.1004ms  cudaPeekAtLastError
  0.13%  211.42ms      7494  28.211us  7.7090us  2.1873ms  cudaBindTexture
  0.12%  188.53ms       157  1.2008ms  54.743us  78.482ms  cudaMallocManaged
  0.09%  144.28ms     33724  4.2780us  1.6150us  2.7024ms  cudaGetDevice
  0.07%  111.71ms         1  111.71ms  111.71ms  111.71ms  cuCtxDestroy
  0.05%  82.356ms      3747  21.979us  11.511us  1.5091ms  cudaStreamCreate
  0.04%  67.692ms      3747  18.065us  8.3340us  2.3707ms  cudaStreamDestroy
  0.03%  45.516ms      7494  6.0730us  1.7190us  1.7276ms  cudaUnbindTexture
  0.02%  31.411ms      3749  8.3780us  4.2710us  1.0104ms  cudaSetDevice
  0.01%  23.511ms         2  11.755ms  562.53us  22.948ms  cudaMallocHost
  0.01%  12.465ms      3747  3.3260us  1.6660us  560.76us  cuEGLStreamProducerPresentDevicePtr
  0.00%  7.5535ms         1  7.5535ms  7.5535ms  7.5535ms  cudaDeviceSynchronize
  0.00%  1.5153ms         2  757.64us  595.29us  919.99us  cudaFreeHost
  0.00%  1.1524ms         3  384.14us  41.565us  681.65us  cudaMalloc
  0.00%  266.52us       261  1.0210us     364ns  53.649us  cuDeviceGetAttribute
  0.00%  50.316us        16  3.1440us  1.8230us  14.272us  cudaEventCreateWithFlags
  0.00%  45.993us         3  15.331us  8.4900us  27.137us  cuDeviceTotalMem
  0.00%  42.086us         1  42.086us  42.086us  42.086us  cudaGetDeviceProperties
  0.00%  16.044us        11  1.4580us     938ns  5.2610us  cudaDeviceGetAttribute
  0.00%  14.064us         4  3.5160us  1.7190us  6.0940us  cuInit
  0.00%  13.751us         7  1.9640us     677ns  6.3030us  cuDeviceGetCount
  0.00%  10.417us         1  10.417us  10.417us  10.417us  cudaSetDeviceFlags
  0.00%  7.3440us         4  1.8360us  1.0420us  2.6040us  cuDeviceGetName
  0.00%  6.9280us         7     989ns     573ns  1.8230us  cuDeviceGet
  0.00%  6.3550us         1  6.3550us  6.3550us  6.3550us  cudaGetDeviceCount
  0.00%  5.5210us         3  1.8400us     885ns  3.3860us  cuDriverGetVersion
  0.00%  3.2810us         2  1.6400us  1.3020us  1.9790us  cuCtxSetCurrent

Here is 4 cameras streaming, only 1 detecting:

==7615== Profiling application: ./detection -c0 -c1 -c2 -c3
==7615== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 22.71%  42.5082s      2695  15.773ms  7.6831ms  18.547ms  void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=7, int=4>(cublasGemmSmallNParams<float, float, float>,
 float const *, float const *, float, float, int)
 13.19%  24.6859s     18865  1.3086ms  73.074us  5.0093ms  maxwell_sgemm_128x64_raggedMn_nn
 12.61%  23.6057s     24255  973.23us  59.064us  6.5400ms  im2col_gpu_kernel(int, float const *, int, int, int, int, int, int, int, float*)
  7.13%  13.3473s     67376  198.10us  1.9260us  4.4152ms  fill_kernel(int, float, float*, int)
  6.50%  12.1739s     24255  501.91us  2.3440us  5.0512ms  copy_kernel(int, float*, int, int, float*, int, int)
  6.44%  12.0468s      2695  4.4701ms  2.0662ms  7.6630ms  void magma_lds128_sgemm_kernel<bool=0, bool=0, int=6, int=5, int=3, int=3, int=3>(int, int, int, float const *, int, float const *,
 int, float*, int, int, int, float const *, float const *, float, float, int)
  6.25%  11.7044s     21560  542.88us  28.593us  5.5883ms  normalize_kernel(int, float*, float*, float*, int, int, int)
  6.09%  11.3933s     21560  528.45us  29.062us  4.7549ms  scale_bias_kernel(float*, float*, int, int)
  6.03%  11.2827s     51205  220.34us  1.6660us  4.8346ms  activate_array_kernel(float*, int, ACTIVATION)
  5.95%  11.1430s     24255  459.41us  5.4690us  4.8695ms  add_bias_kernel(float*, float*, int, int, int)
  4.89%  9.16334s     16170  566.69us  58.492us  3.7779ms  forward_maxpool_layer_kernel(int, int, int, int, int, int, int, float*, float*, int*)
  1.30%  2.43517s      2695  903.59us  353.23us  3.2210ms  convertIntToFloatKernelShow(unsigned __int64, int, int, void*, int, char*, int)
  0.90%  1.67766s      8295  202.25us     208ns  39.597ms  [CUDA memcpy HtoH]
  0.02%  30.554ms      2695  11.337us  7.6560us  17.345us  softmax_kernel(float*, int, int, int, int, int, int, float, float*)
  0.00%  1.3550us         1  1.3550us  1.3550us  1.3550us  [CUDA memcpy HtoD]

==7615== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 74.58%  149.874s      8296  18.066ms  37.501us  240.87ms  cudaMemcpy
 18.24%  36.6611s    280281  130.80us  30.209us  33.917ms  cudaLaunch
  2.29%  4.59812s      2695  1.7062ms  58.699us  15.252ms  cudaStreamSynchronize
  1.14%  2.29234s      2695  850.59us  155.37us  24.756ms  cuGraphicsEGLRegisterImage
  0.94%  1.88761s   1854164  1.0180us     417ns  9.6395ms  cudaSetupArgument
  0.85%  1.71233s      2696  635.14us  12.188us  1.54340s  cudaFree
  0.39%  776.94ms      2695  288.29us  58.439us  13.236ms  cuGraphicsUnregisterResource
  0.28%  555.74ms    280281  1.9820us     469ns  9.6958ms  cudaConfigureCall
  0.22%  444.38ms    312989  1.4190us     468ns  5.6845ms  cudaGetLastError
  0.21%  421.82ms    229076  1.8410us     572ns  5.0321ms  cudaPeekAtLastError
  0.21%  417.74ms         2  208.87ms  157.93ms  259.80ms  cuCtxCreate
  0.14%  271.40ms      5390  50.352us  8.6460us  9.5489ms  cudaBindTexture
  0.11%  217.05ms     24256  8.9480us  1.7180us  4.3617ms  cudaGetDevice
  0.09%  173.35ms       157  1.1042ms  53.074us  64.627ms  cudaMallocManaged
  0.08%  152.08ms         8  19.010ms  70.574us  73.692ms  cudaMallocHost
  0.07%  131.61ms      2695  48.833us  11.979us  8.6015ms  cudaStreamCreate
  0.06%  119.68ms      2695  44.407us  8.4890us  6.9904ms  cudaStreamDestroy
  0.06%  115.08ms         1  115.08ms  115.08ms  115.08ms  cuCtxDestroy
  0.03%  58.050ms      5390  10.770us  1.7180us  4.1501ms  cudaUnbindTexture
  0.03%  52.572ms      2697  19.492us  4.1150us  5.7802ms  cudaSetDevice
  0.01%  15.497ms      2695  5.7500us  1.6660us  2.8814ms  cuEGLStreamProducerPresentDevicePtr
  0.00%  7.6157ms         8  951.96us  317.24us  1.7291ms  cudaFreeHost
  0.00%  7.4092ms         1  7.4092ms  7.4092ms  7.4092ms  cudaDeviceSynchronize
  0.00%  918.92us         3  306.31us  42.293us  481.52us  cudaMalloc
  0.00%  297.92us       261  1.1410us     364ns  50.886us  cuDeviceGetAttribute
  0.00%  49.060us        16  3.0660us  1.8750us  14.791us  cudaEventCreateWithFlags
  0.00%  43.021us         3  14.340us  8.9580us  24.323us  cuDeviceTotalMem
  0.00%  40.887us         1  40.887us  40.887us  40.887us  cudaGetDeviceProperties
  0.00%  16.249us        11  1.4770us     937ns  5.4160us  cudaDeviceGetAttribute
  0.00%  14.323us         4  3.5800us  2.5520us  6.1980us  cuInit
  0.00%  12.448us         7  1.7780us     677ns  5.4690us  cuDeviceGetCount
  0.00%  8.0720us         4  2.0180us  1.4580us  2.3440us  cuDeviceGetName
  0.00%  7.4480us         1  7.4480us  7.4480us  7.4480us  cudaSetDeviceFlags
  0.00%  7.1360us         1  7.1360us  7.1360us  7.1360us  cudaGetDeviceCount
  0.00%  6.3030us         7     900ns     573ns  1.3030us  cuDeviceGet
  0.00%  5.8850us         3  1.9610us     937ns  3.3850us  cuDriverGetVersion
  0.00%  2.7610us         2  1.3800us  1.3550us  1.4060us  cuCtxSetCurrent

We can see the Avg cudaMemcpy with 1 cam activate is 10ms and 4cam activate is 18ms.

I thought maybe it cause by initial phase, so I use visual profile to check cost of each loop.

1 cam streaming and detecting:

External Image

4 cam streaming and only 1 detecting:

External Image

From the result we can see the cost of cudaMemcpy in each loop increase from 33ms to 53ms.

My question is Why multi streaming cameras cause API slow?

DaneLLL · December 26, 2017, 1:23am

Hi xxxss,
Have you tried to run in max performance? By running ‘sudo ./jetson_clocks.sh’, the CPU/GPU/EMC frequency keeps at max clocks. Should help to get better performance.

Also could you share the full pipeline of your case? You will get YUV420 from Argus. Do you convert it to RGBA via NvVideoConverter or CUDA? For the other three cameras not running object detection, cudaMemcpy is performed? Or you just get frames and do nothing?

xxxss · December 26, 2017, 6:55am

Yes, I have tried jetson_clocks.sh. The result does not change much. The cost difference between 1 cam and 4 cam remain still.

For the other three cameras not running object detection, cudaMemcpy is not performed.

bool ConsumerThread::converterCapturePlaneDqCallback(
    struct v4l2_buffer *v4l2_buf,
    NvBuffer * buffer,
    NvBuffer * shared_buffer,
    void *arg)
{
    ConsumerThread *thiz = (ConsumerThread*)arg;

if (!v4l2_buf)
    {
        REPORT_ERROR("Failed to dequeue buffer from conv capture plane");
        thiz->abort();
        return false;
    }

    if (v4l2_buf->m.planes[0].bytesused == 0)
    {
        return false;
    }

    pthread_mutex_lock(&ctx.net_Lock);
   
    thiz->egl_image = NvEGLImageFromFd(egl_display, buffer->planes[0].fd);
    if (thiz->egl_image == NULL)
        printf("egl_image NULL %s%d\n", __func__, __LINE__);
    mapEGLImage2Float(&thiz->egl_image, p_ctx->net_w, p_ctx->net_h, (char *)thiz->fbuf, thiz->showImg->imageData);
    NvDestroyEGLImage(egl_display, thiz->egl_image);

	object_detect_run(mdata,  thiz->fbuf);

	pthread_mutex_unlock(&ctx.net_Lock);

    e = thiz->m_ImageConverter->capture_plane.qBuffer(*v4l2_buf, NULL);
    if (e < 0)
        ORIGINATE_ERROR("qBuffer failed");

    return true;
}

bool ConsumerThread::converterOutputPlaneDqCallback(
    struct v4l2_buffer *v4l2_buf,
    NvBuffer * buffer,
    NvBuffer * shared_buffer,
    void *arg)
{
    ConsumerThread *thiz = (ConsumerThread*)arg;

    if (!v4l2_buf)
    {
        REPORT_ERROR("Failed to dequeue buffer from conv capture plane");
        thiz->abort();
        return false;
    }

    if (v4l2_buf->m.planes[0].bytesused == 0)
    {
        return false;
    }

    NvBufferDestroy(shared_buffer->planes[0].fd);

    //CONSUMER_PRINT("releaseFd %d (%d frames)\n", shared_buffer->planes[0].fd, thiz->m_numPendingFrames);
    pthread_mutex_lock(&thiz->m_queueLock);
    thiz->m_numPendingFrames--;
    thiz->m_ConvOutputPlaneBufQueue->push(buffer);
    pthread_cond_broadcast(&thiz->m_queueCond);
    pthread_mutex_unlock(&thiz->m_queueLock);

    return true;
}

bool ConsumerThread::createImageConverter()
{
    int ret = 0;
    char cname[10];

    sprintf(cname, "conv%d", cam_idx);
    // YUV420 --> RGB32 converter
    m_ImageConverter = NvVideoConverter::createVideoConverter(cname);
    if (!m_ImageConverter)
        ORIGINATE_ERROR("Could not create m_ImageConverteroder");

    if (DO_STAT)
        m_ImageConverter->enableProfiling();

m_ImageConverter->capture_plane.
        setDQThreadCallback(converterCapturePlaneDqCallback);
    m_ImageConverter->output_plane.
        setDQThreadCallback(converterOutputPlaneDqCallback);

ret = m_ImageConverter->setOutputPlaneFormat(V4L2_PIX_FMT_YUV420M, m_pContext->width,
                                    m_pContext->height, V4L2_NV_BUFFER_LAYOUT_BLOCKLINEAR);
    if (ret < 0)
        ORIGINATE_ERROR("Could not set output plane format");

    ret = m_ImageConverter->setCapturePlaneFormat(V4L2_PIX_FMT_ABGR32, m_pContext->net_w,
                                    m_pContext->net_h, V4L2_NV_BUFFER_LAYOUT_PITCH);
    if (ret < 0)
        ORIGINATE_ERROR("Could not set capture plane format");

    ret = m_ImageConverter->setCropRect(962, 159, 726, 726);
    if (ret < 0)
        ORIGINATE_ERROR("Could not set crop  rect");

    // Query, Export and Map the output plane buffers so that we can read
    // raw data into the buffers
    ret = m_ImageConverter->output_plane.setupPlane(V4L2_MEMORY_DMABUF, conv_buf_num, false, false);
    if (ret < 0)
        ORIGINATE_ERROR("Could not setup output plane");

    // Query, Export and Map the output plane buffers so that we can write
    // m_ImageConverteroded data from the buffers
    ret = m_ImageConverter->capture_plane.setupPlane(V4L2_MEMORY_MMAP, conv_buf_num, true, false);
    if (ret < 0)
        ORIGINATE_ERROR("Could not setup capture plane");

    // Add all empty conv output plane buffers to m_ConvOutputPlaneBufQueue
    for (uint32_t i = 0; i < m_ImageConverter->output_plane.getNumBuffers(); i++)
    {
        m_ConvOutputPlaneBufQueue->push(
            m_ImageConverter->output_plane.getNthBuffer(i));
    }

    // conv output plane STREAMON
    ret = m_ImageConverter->output_plane.setStreamStatus(true);
    if (ret < 0)
        ORIGINATE_ERROR("fail to set conv output stream on");

    // conv capture plane STREAMON
    ret = m_ImageConverter->capture_plane.setStreamStatus(true);
    if (ret < 0)
        ORIGINATE_ERROR("fail to set conv capture stream on");

    // Start threads to dequeue buffers on conv capture plane,
    // conv output plane and capture plane
    m_ImageConverter->capture_plane.startDQThread(this);
    m_ImageConverter->output_plane.startDQThread(this);

    // Enqueue all empty conv capture plane buffers
    for (uint32_t i = 0; i < m_ImageConverter->capture_plane.getNumBuffers(); i++)
    {
        struct v4l2_buffer v4l2_buf;
        struct v4l2_plane planes[MAX_PLANES];

        memset(&v4l2_buf, 0, sizeof(v4l2_buf));
        memset(planes, 0, MAX_PLANES * sizeof(struct v4l2_plane));

        v4l2_buf.index = i;
        v4l2_buf.m.planes = planes;

        ret = m_ImageConverter->capture_plane.qBuffer(v4l2_buf, NULL);
        if (ret < 0) {
            abort();
            ORIGINATE_ERROR("Error queueing buffer at conv capture plane");
        }
        printf(" i: %d\n", i);
    }

    printf("create vidoe converter return true\n");
    return true;
}

/*******************************************************************************
 * Argus Producer thread:
 *   Opens the Argus camera driver, creates an OutputStream to output to a
 *   FrameConsumer, then performs repeating capture requests for CAPTURE_TIME
 *   seconds before closing the producer and Argus driver.
 ******************************************************************************/
static void *capture_thread_3(void *arg)
{
   -
    UniqueObj<CaptureSession> captureSession(
            iCameraProvider->createCaptureSession(cameraDevices[cam_idx]));
    ICaptureSession *iCaptureSession = interface_cast<ICaptureSession>(captureSession);
    if (!iCaptureSession)
        ORIGINATE_ERROR("Failed to get ICaptureSession interface");

        printf("%s, %s, %d\n", __FILE__, __func__, __LINE__);
    // Create the OutputStream.
    PRODUCER_PRINT("Creating output stream\n");
    UniqueObj<OutputStreamSettings> streamSettings(iCaptureSession->createOutputStreamSettings());
    IOutputStreamSettings *iStreamSettings = interface_cast<IOutputStreamSettings>(streamSettings);
    if (iStreamSettings)
    {
        iStreamSettings->setPixelFormat(PIXEL_FMT_YCbCr_420_888);
        iStreamSettings->setResolution(Size(ctx.width, ctx.height));
//        iStreamSettings->setResolution(Size(p_ctx->, 1080));
#ifdef RENDER
        iStreamSettings->setEGLDisplay(m_renderer->getEGLDisplay());
#endif
        printf("%s, %s, %d\n", __FILE__, __func__, __LINE__);
    }
    else
    {
        ORIGINATE_ERROR("NULL for output stream settings!");
    }
    UniqueObj<OutputStream> outputStream(iCaptureSession->createOutputStream(streamSettings.get()));

    // Launch the FrameConsumer thread to consume frames from the OutputStream.
    PRODUCER_PRINT("Launching consumer thread\n");
    ConsumerThread frameConsumerThread(outputStream.get(), cam_idx);
    PROPAGATE_ERROR(frameConsumerThread.initialize());

    // Wait until the consumer is connected to the stream.
    PROPAGATE_ERROR(frameConsumerThread.waitRunning());

    // Create capture request and enable output stream.
    UniqueObj<Request> request(iCaptureSession->createRequest());
    IRequest *iRequest = interface_cast<IRequest>(request);
    if (!iRequest)
        ORIGINATE_ERROR("Failed to create Request");
    iRequest->enableOutputStream(outputStream.get());

    ISourceSettings *iSourceSettings = interface_cast<ISourceSettings>(iRequest->getSourceSettings());
    if (!iSourceSettings)
        ORIGINATE_ERROR("Failed to get ISourceSettings interface");

    iSourceSettings->setFrameDurationRange(Range<uint64_t>(1e9/DEFAULT_FPS));

    // Submit capture requests.
    PRODUCER_PRINT("Starting repeat capture requests.\n");
    if (iCaptureSession->repeat(request.get()) != STATUS_OK)
        ORIGINATE_ERROR("Failed to start repeat capture request");
        
    for (int i = 0; i < CAPTURE_TIME && !frameConsumerThread.isInError(); i++ )
		sleep(1);

    // Stop the repeating request and wait for idle.
    iCaptureSession->stopRepeat();
    iCaptureSession->waitForIdle();

    // Destroy the output stream to end the consumer thread.
    outputStream.reset();

    // Wait for the consumer thread to complete.
    PROPAGATE_ERROR(frameConsumerThread.shutdown());

PRODUCER_PRINT("Done -- exiting.\n");

}

int main(int argc, const char *argv[])
{
  
  // Create the CameraProvider object and get the core interface.
    UniqueObj<CameraProvider> cameraProvider = UniqueObj<CameraProvider>(CameraProvider::create());
    iCameraProvider = interface_cast<ICameraProvider>(cameraProvider);
    if (!iCameraProvider)
        ORIGINATE_ERROR("Failed to create CameraProvider");

    // Get the camera devices.
    iCameraProvider->getCameraDevices(&cameraDevices);
    if (cameraDevices.size() == 0)
        ORIGINATE_ERROR("No cameras available");

    for (int i = 0; i < 4; i++) {
        printf("i: %d, cam:%d\n", i, p_ctx->cam[i]);
        if (cameraDevices[i] && p_ctx->cam[i]) {
            if (i == 0)
            pthread_create(&p_ctx->camera_feed_handle[i], NULL, ArgusSamples::capture_thread_0, (void *)&p_ctx->cam[i]);

            if (i == 1)
            pthread_create(&p_ctx->camera_feed_handle[i], NULL, ArgusSamples::capture_thread_1, (void *)&p_ctx->cam[i]);

            if (i == 2)
            pthread_create(&p_ctx->camera_feed_handle[i], NULL, ArgusSamples::capture_thread_2, (void *)&p_ctx->cam[i]);

            if (i == 3)
            pthread_create(&p_ctx->camera_feed_handle[i], NULL, ArgusSamples::capture_thread_3, (void *)&p_ctx->cam[i]);
            sleep(3);
        }
#if 0
        if (!ArgusSamples::execute(p_ctx))
            return EXIT_FAILURE;
#endif
    }
    return ret;
}

DaneLLL · December 27, 2017, 1:57am

Hi xxxss,
Can you check tegrastats of both cases?

xxxss · December 27, 2017, 2:13am

1 cam:

RAM 2488/3995MB (lfb 19x4MB) cpu [60%,25%,60%,29%]@1734 GR3D 99%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [20%,35%,57%,71%]@1734 GR3D 65%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [20%,28%,28%,99%]@1734 GR3D 99%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [26%,30%,23%,99%]@1734 GR3D 77%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [28%,29%,78%,47%]@1734 GR3D 99%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [27%,29%,99%,30%]@1734 GR3D 88%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [29%,28%,97%,31%]@1734 GR3D 99%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [32%,32%,73%,47%]@1734 GR3D 99%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [34%,37%,41%,61%]@1734 GR3D 99%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [34%,50%,64%,30%]@1734 GR3D 99%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [87%,37%,32%,29%]@1734 GR3D 99%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [68%,35%,43%,34%]@1734 GR3D 65%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [26%,30%,97%,30%]@1734 GR3D 99%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [36%,33%,73%,31%]@1734 GR3D 99%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [27%,55%,40%,51%]@1734 GR3D 94%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [26%,30%,24%,97%]@1734 GR3D 99%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [44%,27%,32%,77%]@1734 GR3D 86%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [34%,27%,28%,99%]@1734 GR3D 66%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [34%,31%,31%,86%]@1734 GR3D 42%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [32%,34%,34%,83%]@1734 GR3D 41%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [37%,43%,24%,75%]@1734 GR3D 53%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [49%,35%,34%,65%]@1734 GR3D 82%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [54%,44%,38%,36%]@1734 GR3D 99%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [30%,38%,27%,89%]@1734 GR3D 99%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [30%,33%,27%,95%]@1734 GR3D 99%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [31%,36%,56%,62%]@1734 GR3D 60%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [71%,43%,38%,24%]@1734 GR3D 64%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [97%,32%,23%,20%]@1734 GR3D 86%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [59%,44%,41%,42%]@1734 GR3D 78%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [27%,35%,43%,73%]@1734 GR3D 99%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [31%,30%,69%,53%]@1734 GR3D 99%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [38%,29%,92%,32%]@1734 GR3D 99%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [37%,33%,98%,21%]@1734 GR3D 99%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [35%,37%,92%,23%]@1734 GR3D 99%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [34%,40%,86%,24%]@1734 GR3D 99%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [42%,25%,82%,33%]@1734 GR3D 57%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [33%,27%,23%,86%]@1734 GR3D 29%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [40%,46%,31%,69%]@1734 GR3D 47%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [51%,36%,54%,38%]@1734 GR3D 65%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [28%,34%,27%,91%]@1734 GR3D 59%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [35%,35%,49%,71%]@1734 GR3D 87%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [42%,47%,47%,41%]@1734 GR3D 99%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [29%,30%,89%,35%]@1734 GR3D 75%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [47%,50%,52%,35%]@1734 GR3D 67%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [35%,33%,94%,19%]@1734 GR3D 99%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [30%,25%,96%,28%]@1734 GR3D 99%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [27%,22%,97%,28%]@1734 GR3D 94%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [39%,64%,50%,26%]@1734 GR3D 75%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [29%,37%,84%,32%]@1734 GR3D 99%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [27%,34%,97%,23%]@1734 GR3D 99%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [82%,28%,33%,27%]@1734 GR3D 77%@998 EDP limit 0

4 cam but only 1 of them detection:

RAM 3329/3995MB (lfb 18x4MB) cpu [69%,74%,69%,86%]@1734 GR3D 81%@998 EDP limit 0
RAM 3329/3995MB (lfb 18x4MB) cpu [76%,81%,69%,74%]@1734 GR3D 99%@998 EDP limit 0
RAM 3329/3995MB (lfb 18x4MB) cpu [68%,73%,73%,86%]@1734 GR3D 99%@998 EDP limit 0
RAM 3328/3995MB (lfb 18x4MB) cpu [84%,75%,69%,76%]@1734 GR3D 5%@998 EDP limit 0
RAM 3329/3995MB (lfb 18x4MB) cpu [84%,81%,71%,64%]@1734 GR3D 99%@998 EDP limit 0
RAM 3329/3995MB (lfb 18x4MB) cpu [74%,76%,73%,78%]@1734 GR3D 99%@998 EDP limit 0
RAM 3329/3995MB (lfb 18x4MB) cpu [79%,79%,86%,66%]@1734 GR3D 95%@998 EDP limit 0
RAM 3332/3995MB (lfb 18x4MB) cpu [75%,80%,75%,78%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [76%,76%,81%,71%]@1734 GR3D 30%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [76%,79%,82%,68%]@1734 GR3D 97%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [77%,71%,73%,70%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [85%,78%,70%,75%]@1734 GR3D 96%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [76%,74%,72%,79%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [77%,70%,73%,81%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [72%,77%,74%,81%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [74%,77%,71%,79%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [82%,78%,77%,68%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [68%,74%,71%,78%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [72%,71%,74%,87%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [62%,82%,74%,79%]@1734 GR3D 8%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [82%,77%,70%,78%]@1734 GR3D 41%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [70%,85%,74%,75%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [73%,70%,77%,76%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [77%,72%,78%,75%]@1734 GR3D 0%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [85%,76%,72%,67%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [84%,74%,73%,66%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [90%,68%,76%,71%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [86%,80%,74%,66%]@1734 GR3D 87%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [73%,86%,78%,76%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [75%,88%,73%,68%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [67%,76%,76%,86%]@1734 GR3D 51%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [69%,82%,72%,77%]@1734 GR3D 3%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [65%,80%,78%,72%]@1734 GR3D 40%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [77%,76%,80%,73%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [86%,70%,79%,74%]@1734 GR3D 90%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [67%,84%,77%,74%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [77%,77%,76%,76%]@1734 GR3D 3%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [84%,79%,78%,67%]@1734 GR3D 3%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [66%,64%,75%,95%]@1734 GR3D 19%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [73%,76%,72%,78%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [77%,75%,76%,74%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [67%,70%,83%,83%]@1734 GR3D 3%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [70%,72%,78%,82%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [70%,75%,79%,74%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [74%,73%,89%,71%]@1734 GR3D 11%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [69%,80%,74%,81%]@1734 GR3D 38%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [72%,71%,76%,80%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [78%,77%,70%,74%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [80%,77%,80%,67%]@1734 GR3D 8%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [84%,74%,75%,72%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [74%,78%,78%,75%]@1734 GR3D 99%@998 EDP limit 0

DaneLLL · December 27, 2017, 3:10am

Hi xxxss,
Please try not to map buffers to CPU:

ret = m_ImageConverter->capture_plane.setupPlane(V4L2_MEMORY_MMAP, conv_buf_num, <b>false</b>, false);

xxxss · December 27, 2017, 5:30am

Could you tell me how to fix this? I don’t know how and why.

DaneLLL · December 27, 2017, 5:42am

Hi xxxsss,
We don’t have test app to reproduce the issue, but from the code you attached, it looks like you can disable mmap in capture_plane for a try. Could you give it a try?

xxxss · December 28, 2017, 7:29am

I don’t know how to disable mmap in capture_plane, so I tried remove this line. It’s not work.
Could you give me an example how to disable it ?

DaneLLL · December 28, 2017, 7:41am

Hi xxxss,

/**
     * Helper method that encapsulates all the method calls required to
     * set up the plane for streaming.
     *
     * Calls reqbuf internally. Then, for each of the buffers, calls #queryBuffer,
     * #exportBuffer and maps the buffer/allocates the buffer memory depending
     * on the memory type.
     *
     * @sa deinitPlane
     *
     * @param[in] mem_type V4L2 Memory to use on the buffer.
     * @param[in] num_buffers Number of buffer to request on the plane.
     * @param[in] map boolean value indicating if the buffers should be mapped to
                      memory (Only for V4L2_MEMORY_MMAP).
     * @param[in] allocate boolean valued indicating whether the buffers should be
                           allocated memory (Only for V4L2_MEMORY_USERPTR).
     * @return 0 for success, -1 otherwise.
     */
    int setupPlane(enum v4l2_memory mem_type, uint32_t num_buffers, bool map, bool allocate);

The following line sets bool map=false:

ret = m_ImageConverter->capture_plane.setupPlane(V4L2_MEMORY_MMAP, conv_buf_num, <b>false</b>, false);

xxxss · December 28, 2017, 8:40am

Sorry for that. I tried this. The result of nvprofile doesn’t change much.

DaneLLL · December 28, 2017, 9:12am

Hi xxxss,
We will try to simulate and reproduce the issue, but we don’t have the test app. It will take some time.

So you see cudaMemcpy() is slow in running 4 cameras like:
Argus camera 1 → NvVideoConverter → cudaMemcpy()
Argus camera 2 → NvVideoConverter → do nothing
Argus camera 3 → NvVideoConverter → do nothing
Argus camera 4 → NvVideoConverter → do nothing

We don’t have your detection algorithm, so it will be skipped in the simulation.

xxxss · December 29, 2017, 7:47am

We create a sample base on our project. You can try this.
If start with 4 cameras, the second cudaMemcpy cost 700us~1200us. (profile_4cam_201712291135)
If run with only 1 camera, the second cudaMemcpy cost 400us~500us. (profile_1cam_201712291137)

I run the nvprof by this command:

nvprof --print-api-trace -o profile_4cam_201712291135 --log-file 201712291135.log ./detection -c0 -c1 -c2 -c3

or

nvprof --print-api-trace -o profile_4cam_201712291137 --log-file 201712291137.log ./detection -c1

xxxss_sample.tar.gz (15.4 MB)

DaneLLL · January 2, 2018, 4:46am

Hi xxxss,
What is the purpose of copying data from GPU to CPU( cudaMemcpy(output, x_gpu, size, cudaMemcpyDeviceToHost) )? Your detection algorithm is running on CPU?

xxxss · January 2, 2018, 6:07am

after feature abstract from network, there is a copy from gpu to cpu side to do nms for prediction, it finished on cpu.

DaneLLL · January 2, 2018, 6:14am

Hi xxxss,
So the cudaMemcpy() is not to copy the Argus frames to CPU? Looks independent in the following code:

if (thiz->cam_idx == 1) {
	    thiz->egl_image = NvEGLImageFromFd(egl_display, buffer->planes[0].fd);
	    if (thiz->egl_image == NULL)
		    printf("egl_image NULL %s%d\n", __func__, __LINE__);
	    float *x_gpu;
	    size_t size = sizeof(float)*67600;
	    float *output;
	    output = (float *)calloc(67600, sizeof(float));
	    cudaError_t status = cudaMallocManaged((void **)&x_gpu, size, cudaMemAttachGlobal);
	    if(status) printf("%s\n", status);
	    status = cudaMemcpy(x_gpu, output, size, cudaMemcpyHostToDevice);
	    if(status) printf("%s\n", status);
	    if(!x_gpu) printf("Cuda malloc failed\n");
	    status = cudaMemcpy(output, x_gpu, size, cudaMemcpyDeviceToHost);
	    if(status) printf("%s\n", status);
	    //m_buff_h=(int*)malloc(1000*1024*1024*100);
	    //cudaMalloc((void**)m_buff_d,1000*1024*1024*100);
	    //cudaMemcpy(m_buff_d,m_buff_h,1000*1024*1024*100,cudaMemcpyHostToDevice);
	    //cudaFree(m_buff_d);
	    //free(m_buff_h);
	    cudaFree(x_gpu);
	    free(output);

	    mapEGLImage2Float(&thiz->egl_image, p_ctx->net_w, p_ctx->net_h, (char *)thiz->fbuf);

	    //printf("%s, %s, %d\n", __FILE__, __func__, __LINE__);

	    NvDestroyEGLImage(egl_display, thiz->egl_image);
    }

xxxss · January 2, 2018, 8:59am

It’s just a simple demo code to illustrate the different time of cudaMemcpy in those two situations.

We may optimize the code later to remove this cudaMemcpy.

But the point is, if you check the result of nvprof of two settings in the quote(1 cam vs 4 cam but 1 detection), most API spend double time in 4 cam than 1 cam.

API call                         1 cam       4 cam(only 1 detection)
cudaMemcpy:                      10.850ms    18.066ms
cudaLaunch:                      63.518us    130.80us
cudaStreamSynchronize:           707.15us    1.7062ms
cudaSetupArgument:               710ns       1.0180us
cudaFree:                        425.96us    635.14us
cuGraphicsEGLRegisterImage:      370.33us    850.59us
cuGraphicsUnregisterResource:    128.77us    288.29us
cudaConfigureCall:               1.0870us    1.9820us
cudaGetLastError:                861ns       1.4190us     
cudaPeekAtLastError:             1.0700us    1.8410us
cudaBindTexture:                 28.211us    50.352us
cudaGetDevice:                   4.2780us    8.9480us
cudaMallocHost:                  11.755ms    19.010ms

But some are almost same:
cuCtxDestroy:                    111.71ms    115.08ms
cudaMallocManaged:               1.2008ms    1.1042ms

We built a object detection system of 4 cameras with TX1.

If we run application with only 1 camera active and detection. the whole loop of detection can be run 21 times in 1 second.

When four cameras are streaming and only one of them are used for detection algorithm, the detection can only be done 11 times in 1 second.

I tried nvprof. Below is result of 1 camera streaming:

==7317== Profiling application: ./detection -c1
==7317== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 21.90%  33.1770s      3747  8.8543ms  7.4923ms  14.441ms  void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=7, int=4>(cublasGemmSmallNParams<float, float, float>,
 float const *, float const *, float, float, int)
 15.23%  23.0770s     26229  879.83us  70.940us  2.1807ms  maxwell_sgemm_128x64_raggedMn_nn
 12.04%  18.2380s     33723  540.82us  59.430us  3.5146ms  im2col_gpu_kernel(int, float const *, int, int, int, int, int, int, int, float*)
  7.26%  10.9959s     93676  117.38us  1.8750us  1.9100ms  fill_kernel(int, float, float*, int)
  7.07%  10.7105s     29976  357.30us  28.282us  3.2339ms  normalize_kernel(int, float*, float*, float*, int, int, int)
  6.70%  10.1537s     71193  142.62us  1.6660us  2.0321ms  activate_array_kernel(float*, int, ACTIVATION)
  6.08%  9.21664s     33723  273.30us  5.0520us  2.0859ms  add_bias_kernel(float*, float*, int, int, int)
  5.77%  8.74433s      3747  2.3337ms  2.0663ms  3.9664ms  void magma_lds128_sgemm_kernel<bool=0, bool=0, int=6, int=5, int=3, int=3, int=3>(int, int, int, float const *, int, float const *,
 int, float*, int, int, int, float const *, float const *, float, float, int)
  5.39%  8.16327s     33723  242.07us  2.5000us  2.1818ms  copy_kernel(int, float*, int, int, float*, int, int)
  5.36%  8.11565s     29976  270.74us  28.908us  2.4257ms  scale_bias_kernel(float*, float*, int, int)
  4.91%  7.43280s     22482  330.61us  57.034us  1.7648ms  forward_maxpool_layer_kernel(int, int, int, int, int, int, int, float*, float*, int*)
  1.25%  1.89259s      3747  505.09us  353.56us  1.0201ms  convertIntToFloatKernelShow(unsigned __int64, int, int, void*, int, char*, int)
  1.04%  1.57767s     11451  137.78us     207ns  39.722ms  [CUDA memcpy HtoH]
  0.02%  29.780ms      3747  7.9470us  7.2930us  12.344us  softmax_kernel(float*, int, int, int, int, int, int, float, float*)
  0.00%  1.1460us         1  1.1460us  1.1460us  1.1460us  [CUDA memcpy HtoD]

==7317== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 77.96%  124.254s     11452  10.850ms  37.033us  241.49ms  cudaMemcpy
 15.53%  24.7526s    389689  63.518us  29.950us  14.282ms  cudaLaunch
  1.66%  2.64968s      3747  707.15us  47.606us  7.9432ms  cudaStreamSynchronize
  1.15%  1.83160s   2577940     710ns     416ns  2.7011ms  cudaSetupArgument
  1.00%  1.59649s      3748  425.96us  12.032us  1.46191s  cudaFree
  0.87%  1.38763s      3747  370.33us  156.93us  5.0750ms  cuGraphicsEGLRegisterImage
  0.30%  482.49ms      3747  128.77us  56.618us  2.9200ms  cuGraphicsUnregisterResource
  0.27%  423.85ms    389689  1.0870us     469ns  3.0148ms  cudaConfigureCall
  0.24%  374.82ms    435021     861ns     468ns  2.6683ms  cudaGetLastError
  0.22%  357.68ms         2  178.84ms  160.33ms  197.35ms  cuCtxCreate
  0.21%  341.01ms    318496  1.0700us     572ns  2.1004ms  cudaPeekAtLastError
  0.13%  211.42ms      7494  28.211us  7.7090us  2.1873ms  cudaBindTexture
  0.12%  188.53ms       157  1.2008ms  54.743us  78.482ms  cudaMallocManaged
  0.09%  144.28ms     33724  4.2780us  1.6150us  2.7024ms  cudaGetDevice
  0.07%  111.71ms         1  111.71ms  111.71ms  111.71ms  cuCtxDestroy
  0.05%  82.356ms      3747  21.979us  11.511us  1.5091ms  cudaStreamCreate
  0.04%  67.692ms      3747  18.065us  8.3340us  2.3707ms  cudaStreamDestroy
  0.03%  45.516ms      7494  6.0730us  1.7190us  1.7276ms  cudaUnbindTexture
  0.02%  31.411ms      3749  8.3780us  4.2710us  1.0104ms  cudaSetDevice
  0.01%  23.511ms         2  11.755ms  562.53us  22.948ms  cudaMallocHost
  0.01%  12.465ms      3747  3.3260us  1.6660us  560.76us  cuEGLStreamProducerPresentDevicePtr
  0.00%  7.5535ms         1  7.5535ms  7.5535ms  7.5535ms  cudaDeviceSynchronize
  0.00%  1.5153ms         2  757.64us  595.29us  919.99us  cudaFreeHost
  0.00%  1.1524ms         3  384.14us  41.565us  681.65us  cudaMalloc
  0.00%  266.52us       261  1.0210us     364ns  53.649us  cuDeviceGetAttribute
  0.00%  50.316us        16  3.1440us  1.8230us  14.272us  cudaEventCreateWithFlags
  0.00%  45.993us         3  15.331us  8.4900us  27.137us  cuDeviceTotalMem
  0.00%  42.086us         1  42.086us  42.086us  42.086us  cudaGetDeviceProperties
  0.00%  16.044us        11  1.4580us     938ns  5.2610us  cudaDeviceGetAttribute
  0.00%  14.064us         4  3.5160us  1.7190us  6.0940us  cuInit
  0.00%  13.751us         7  1.9640us     677ns  6.3030us  cuDeviceGetCount
  0.00%  10.417us         1  10.417us  10.417us  10.417us  cudaSetDeviceFlags
  0.00%  7.3440us         4  1.8360us  1.0420us  2.6040us  cuDeviceGetName
  0.00%  6.9280us         7     989ns     573ns  1.8230us  cuDeviceGet
  0.00%  6.3550us         1  6.3550us  6.3550us  6.3550us  cudaGetDeviceCount
  0.00%  5.5210us         3  1.8400us     885ns  3.3860us  cuDriverGetVersion
  0.00%  3.2810us         2  1.6400us  1.3020us  1.9790us  cuCtxSetCurrent

Here is 4 cameras streaming, only 1 detecting:

==7615== Profiling application: ./detection -c0 -c1 -c2 -c3
==7615== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 22.71%  42.5082s      2695  15.773ms  7.6831ms  18.547ms  void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=7, int=4>(cublasGemmSmallNParams<float, float, float>,
 float const *, float const *, float, float, int)
 13.19%  24.6859s     18865  1.3086ms  73.074us  5.0093ms  maxwell_sgemm_128x64_raggedMn_nn
 12.61%  23.6057s     24255  973.23us  59.064us  6.5400ms  im2col_gpu_kernel(int, float const *, int, int, int, int, int, int, int, float*)
  7.13%  13.3473s     67376  198.10us  1.9260us  4.4152ms  fill_kernel(int, float, float*, int)
  6.50%  12.1739s     24255  501.91us  2.3440us  5.0512ms  copy_kernel(int, float*, int, int, float*, int, int)
  6.44%  12.0468s      2695  4.4701ms  2.0662ms  7.6630ms  void magma_lds128_sgemm_kernel<bool=0, bool=0, int=6, int=5, int=3, int=3, int=3>(int, int, int, float const *, int, float const *,
 int, float*, int, int, int, float const *, float const *, float, float, int)
  6.25%  11.7044s     21560  542.88us  28.593us  5.5883ms  normalize_kernel(int, float*, float*, float*, int, int, int)
  6.09%  11.3933s     21560  528.45us  29.062us  4.7549ms  scale_bias_kernel(float*, float*, int, int)
  6.03%  11.2827s     51205  220.34us  1.6660us  4.8346ms  activate_array_kernel(float*, int, ACTIVATION)
  5.95%  11.1430s     24255  459.41us  5.4690us  4.8695ms  add_bias_kernel(float*, float*, int, int, int)
  4.89%  9.16334s     16170  566.69us  58.492us  3.7779ms  forward_maxpool_layer_kernel(int, int, int, int, int, int, int, float*, float*, int*)
  1.30%  2.43517s      2695  903.59us  353.23us  3.2210ms  convertIntToFloatKernelShow(unsigned __int64, int, int, void*, int, char*, int)
  0.90%  1.67766s      8295  202.25us     208ns  39.597ms  [CUDA memcpy HtoH]
  0.02%  30.554ms      2695  11.337us  7.6560us  17.345us  softmax_kernel(float*, int, int, int, int, int, int, float, float*)
  0.00%  1.3550us         1  1.3550us  1.3550us  1.3550us  [CUDA memcpy HtoD]

==7615== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 74.58%  149.874s      8296  18.066ms  37.501us  240.87ms  cudaMemcpy
 18.24%  36.6611s    280281  130.80us  30.209us  33.917ms  cudaLaunch
  2.29%  4.59812s      2695  1.7062ms  58.699us  15.252ms  cudaStreamSynchronize
  1.14%  2.29234s      2695  850.59us  155.37us  24.756ms  cuGraphicsEGLRegisterImage
  0.94%  1.88761s   1854164  1.0180us     417ns  9.6395ms  cudaSetupArgument
  0.85%  1.71233s      2696  635.14us  12.188us  1.54340s  cudaFree
  0.39%  776.94ms      2695  288.29us  58.439us  13.236ms  cuGraphicsUnregisterResource
  0.28%  555.74ms    280281  1.9820us     469ns  9.6958ms  cudaConfigureCall
  0.22%  444.38ms    312989  1.4190us     468ns  5.6845ms  cudaGetLastError
  0.21%  421.82ms    229076  1.8410us     572ns  5.0321ms  cudaPeekAtLastError
  0.21%  417.74ms         2  208.87ms  157.93ms  259.80ms  cuCtxCreate
  0.14%  271.40ms      5390  50.352us  8.6460us  9.5489ms  cudaBindTexture
  0.11%  217.05ms     24256  8.9480us  1.7180us  4.3617ms  cudaGetDevice
  0.09%  173.35ms       157  1.1042ms  53.074us  64.627ms  cudaMallocManaged
  0.08%  152.08ms         8  19.010ms  70.574us  73.692ms  cudaMallocHost
  0.07%  131.61ms      2695  48.833us  11.979us  8.6015ms  cudaStreamCreate
  0.06%  119.68ms      2695  44.407us  8.4890us  6.9904ms  cudaStreamDestroy
  0.06%  115.08ms         1  115.08ms  115.08ms  115.08ms  cuCtxDestroy
  0.03%  58.050ms      5390  10.770us  1.7180us  4.1501ms  cudaUnbindTexture
  0.03%  52.572ms      2697  19.492us  4.1150us  5.7802ms  cudaSetDevice
  0.01%  15.497ms      2695  5.7500us  1.6660us  2.8814ms  cuEGLStreamProducerPresentDevicePtr
  0.00%  7.6157ms         8  951.96us  317.24us  1.7291ms  cudaFreeHost
  0.00%  7.4092ms         1  7.4092ms  7.4092ms  7.4092ms  cudaDeviceSynchronize
  0.00%  918.92us         3  306.31us  42.293us  481.52us  cudaMalloc
  0.00%  297.92us       261  1.1410us     364ns  50.886us  cuDeviceGetAttribute
  0.00%  49.060us        16  3.0660us  1.8750us  14.791us  cudaEventCreateWithFlags
  0.00%  43.021us         3  14.340us  8.9580us  24.323us  cuDeviceTotalMem
  0.00%  40.887us         1  40.887us  40.887us  40.887us  cudaGetDeviceProperties
  0.00%  16.249us        11  1.4770us     937ns  5.4160us  cudaDeviceGetAttribute
  0.00%  14.323us         4  3.5800us  2.5520us  6.1980us  cuInit
  0.00%  12.448us         7  1.7780us     677ns  5.4690us  cuDeviceGetCount
  0.00%  8.0720us         4  2.0180us  1.4580us  2.3440us  cuDeviceGetName
  0.00%  7.4480us         1  7.4480us  7.4480us  7.4480us  cudaSetDeviceFlags
  0.00%  7.1360us         1  7.1360us  7.1360us  7.1360us  cudaGetDeviceCount
  0.00%  6.3030us         7     900ns     573ns  1.3030us  cuDeviceGet
  0.00%  5.8850us         3  1.9610us     937ns  3.3850us  cuDriverGetVersion
  0.00%  2.7610us         2  1.3800us  1.3550us  1.4060us  cuCtxSetCurrent

We can see the Avg cudaMemcpy with 1 cam activate is 10ms and 4cam activate is 18ms.

I thought maybe it cause by initial phase, so I use visual profile to check cost of each loop.

1 cam streaming and detecting:

Pasteboard - Uploaded Image

4 cam streaming and only 1 detecting:

Pasteboard - Uploaded Image

From the result we can see the cost of cudaMemcpy in each loop increase from 33ms to 53ms.

My question is Why multi streaming cameras cause API slow?

DaneLLL · January 3, 2018, 6:04am

Hi xxxss,
We can see difference of calling cudaMemcpy(cudaMemcpyDeviceToHost). It is ~4ms in one camera and ~6.5ms in 4 cameras. The reason is that Argus is EGLStream based implementation. Each stream would take some GPU resources.

A few suggestions for your case is that you can get RGBA by the following code:

if (m_dmabuf < 0) {
    m_dmabuf = iNativeBuffer->createNvBuffer(STREAM_SIZE,
                                       NvBufferColorFormat_ABGR32,
                                       NvBufferLayout_Pitch);
    if (VERBOSE_ENABLE)
        CONSUMER_PRINT("Acquired Frame. %d\n", m_dmabuf);
} else
    iNativeBuffer->copyToNvBuffer(m_dmabuf);

And the buffer allocated with

cudaError_t status = cudaMallocManaged((void **)&x_gpu, size, cudaMemAttachGlobal);

is a share buffer between CPU and GPU. x_gpu is a glocal pointer to CPU and GPU so you do not need cudaMemcpy() here.
main.cpp (12.4 KB)

Topic		Replies	Views
Sony camera module cannot be opened with OpenCV on Xavier Jetson AGX Xavier	19	8202	October 18, 2021
ARGUS API for 6 Cameras TX1 Jetson TX1	8	1027	October 18, 2021
CUDA consumption from ARGUS API 6 Cameras Jetson TX1	5	1049	December 20, 2017
Error generated while running the code after connecting the camera Jetson Xavier NX gstreamer , nvbugs	45	1915	January 2, 2024
Gstreamer / DeepStream nvcamerasrc has high CPU usage (around 60 %) Jetson TX1	6	670	October 18, 2021
Multi-Cam Image Capturing Gstreamer Jetson Xavier Jetson AGX Xavier	11	4151	October 18, 2021
Video converter with many cameras (Tegra APIs) Jetson TX2	18	2805	October 18, 2021
Finding the bottleneck in video stitching application Jetson TX1	42	8534	October 18, 2021
Jetson Nano only support streaming on two cameras Jetson Nano camera , kernel	23	2255	May 9, 2023
[TX2] yolo-v2-tiny handles video slowly ? DeepStream SDK	11	1646	October 12, 2021

Multiple cameras cause running slow of API

Related topics