Efficiency of the GPU

Hi!

Actually, I need to do a simple operation. I have 5 images of 48 MP, and I need to compute the mean image.
I have two methods, one on the GPU and one on the CPU.

Using the CPU, I have:

#pragma omp parallel for simd
    for (int i=0 ; i<height*width ; i++){
        img_sum[i] = (unsigned char)(((unsigned short)(img1[i])+img2[i]+img3[i]+img4[i]+img5[i])/5);
    }

I use multithreading with OpenMP. The calculation time is around 25 ms.

Using the GPU, I have:

__global__ void
vectorAdd(const unsigned char *A, const unsigned char *B, const unsigned char *C,
          const unsigned char *D, const unsigned char *E, unsigned short *F,
          const int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        F[i] = (unsigned short)(A[i] + B[i] + C[i] + D[i] + E[i])/5;
    }
}
....
int threadsPerBlock = 256;
int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
    
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, d_D, d_E, d_S, numElements);

The calculation time is about 35 ms.

I expected better results using the GPU, but apparently that is not the case. It is a very simple operation, and the GPU should be more efficient than the CPU for it.
Is it because of the number of cores, only 256? Or maybe I don’t use blocks and threads correctly.
Is the Jetson’s GPU efficient for image processing, or is the GPU only more efficient for tasks like AI? I ask because I also coded a simple median filter using shared memory, and the result was better on the CPU.

Thank you!

Could you boost the system to performance mode and investigate again?

sudo nvpmode -m 0
sudo jetson_clocks

Hi,

Thank you for your answer. I can’t launch your first command…

jetson@jetson-desktop:~$ sudo nvpmode -m 0
sudo: nvpmode: command not found

Furthermore, I am using the MAXN power mode. I have installed the latest L4T version.

Typo, it should be nvpmodel -m 0

OK, thank you, now the command works. Is it the same if I change the power mode directly from the desktop?

And the result is the same; it is no more efficient. I need to understand why the GPU is less efficient for simple tasks.

Thank you!

Hi,

nvpmodel is a Jetson-specific tool.
To figure out the bottleneck, could you profile your sample with nvvp first?

1. Set up a root password on the TX2

2. Launch nvvp

$ /usr/local/cuda-10.0/bin/nvvp

If you meet error when launching it, please check this comment for help.

3. Create a new session
File → New Session

  Connection: Type your device info (user name must be 'root')
  Toolkit/Script: click detect
  File: Choose your app
  Working directory: Choose your working directory
  Argument: Type if any

→ Finish

Then you can find the GPU utilization and the bottleneck on the nvvp timeline.
Thanks.

Hi AastaLLL,

Thank you for your response.
I profiled the code using Nsight Eclipse as well as nvprof on the command line.

The compute utilization is very low, around 6%, and the kernel overlap is 0% … Are there other parameters besides the number of threads and the number of blocks that control this?

Thank you

Hi,

This is implementation-dependent.
Would you mind sharing the nvvp timeline with us?

Thanks.

Hi,

Find below the nvvp timeline:

I noted that the compute utilization is not a constant number.

Thank you!

Hi,

The app has a dependency between the memory copy and the kernel code.
Since Jetson is a shared-memory system, you can try to use zero-copy memory to remove the dependency.
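A minimal zero-copy sketch could look like the following (the meanFive kernel, buffer names, and sizes are illustrative placeholders, not the poster's actual code). With cudaHostAllocMapped, the host pointer and its device alias refer to the same physical DRAM on Jetson, so no cudaMemcpy is needed:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void meanFive(const unsigned char* A, const unsigned char* B,
                         const unsigned char* C, const unsigned char* D,
                         const unsigned char* E, unsigned char* F, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) F[i] = (unsigned char)((A[i] + B[i] + C[i] + D[i] + E[i]) / 5);
}

int main() {
    const int n = 1 << 20;
    cudaSetDeviceFlags(cudaDeviceMapHost);          // must precede context creation

    unsigned char *h[6], *d[6];
    for (int k = 0; k < 6; k++) {
        cudaHostAlloc(&h[k], n, cudaHostAllocMapped); // pinned, CPU-visible
        cudaHostGetDevicePointer(&d[k], h[k], 0);     // GPU alias of the same memory
    }
    for (int k = 0; k < 5; k++)                     // fill inputs on the CPU
        for (int i = 0; i < n; i++) h[k][i] = (unsigned char)(10 * (k + 1));

    int threads = 256, blocks = (n + threads - 1) / threads;
    meanFive<<<blocks, threads>>>(d[0], d[1], d[2], d[3], d[4], d[5], n);
    cudaDeviceSynchronize();                        // result is visible in h[5]

    printf("%d\n", h[5][0]);                        // (10+20+30+40+50)/5 = 30
    for (int k = 0; k < 6; k++) cudaFreeHost(h[k]);
    return 0;
}
```

Note the complete absence of cudaMemcpy: the kernel reads the host buffers directly, which is what removes the copy/kernel dependency from the timeline.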

Please find our document for more details:

Thanks.

Hi,

I am already using zero copy memory. Using unified memory, the result is the same.

I’ve read a lot of things about zero-copy and unified memory. Sometimes users advise unified memory for Jetson users, and sometimes zero-copy memory …
As the RAM is shared between the CPU and the GPU, is unified memory better?
Moreover, in the user guide, I read that unified memory is not supported on embedded OSes, yet it works with L4T …

Thank you!

Hi,

Zero copy is a name for shared memory.
You can use either unified memory (GPU-based) or pinned memory (CPU-based) to achieve this.
Usually, we recommend pinned memory for one-time, read-only jobs.
Otherwise, you can use unified memory to reduce the latency of pinned pages.
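As an illustration, a managed-memory version (names are hypothetical, not from the poster's code) needs only one pointer that is valid on both sides; the cudaDeviceSynchronize() before the CPU reads the result back is required on Jetson-class devices:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scaleDown(unsigned char* p, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) p[i] = p[i] / 5;
}

int main() {
    const int n = 1 << 20;
    unsigned char* buf;
    cudaMallocManaged(&buf, n);                 // one pointer, valid on CPU and GPU
    for (int i = 0; i < n; i++) buf[i] = 100;   // CPU writes, no explicit upload
    scaleDown<<<(n + 255) / 256, 256>>>(buf, n); // GPU reads/writes the same memory
    cudaDeviceSynchronize();                    // CPU must not touch buf before this
    printf("%d\n", buf[0]);                     // 100 / 5 = 20
    cudaFree(buf);
    return 0;
}
```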

By the way, if you are using zero-copy memory, why do you still need a memcpy call?
The GPU kernel is waiting for the memcpy, which leads to the low usage rate.
A zero-copy buffer is shared between the CPU and GPU.

Thanks.