# Efficiency of the GPU

Hi!

I need to do a simple operation: I have 5 images of 48 MP each, and I need to calculate the mean image.
I have two methods, one using the GPU and one using the CPU.

Using CPU, I have :

``````
#pragma omp parallel for simd
for (int i = 0; i < height * width; i++) {
    img_sum[i] = (unsigned char)(((unsigned short)(img1[i]) + img2[i] + img3[i] + img4[i] + img5[i]) / 5);
}
``````

I use multithreading with OpenMP. The calculation time is around 25 ms.

Using the GPU, I have :

``````
__global__ void
vectorAdd(const unsigned char *A, const unsigned char *B, const unsigned char *C, const unsigned char *D, const unsigned char *E, unsigned short *F, const int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        F[i] = (unsigned short)(A[i] + B[i] + C[i] + D[i] + E[i])/5;
    }
}
....
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;

vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, d_D, d_E, d_S, numElements);
``````

The calculation time is about 35ms.
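The post does not say how the 35 ms was measured. A minimal sketch, assuming the goal is to time the kernel alone, separately from any host/device copies, is to bracket the launch with CUDA events. The helper name `timeKernelMs`, the kernel name `meanOfFive` and the fill values are hypothetical; the kernel mirrors the one posted above.

```cuda
#include <cuda_runtime.h>

// Averaging kernel in the style of the one posted above.
__global__ void meanOfFive(const unsigned char *A, const unsigned char *B,
                           const unsigned char *C, const unsigned char *D,
                           const unsigned char *E, unsigned short *F,
                           int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        F[i] = (unsigned short)((A[i] + B[i] + C[i] + D[i] + E[i]) / 5);
    }
}

// Hypothetical helper: returns the kernel-only time in milliseconds,
// or a negative value if a CUDA call failed.
float timeKernelMs(int numElements)
{
    unsigned char *d_in[5] = {nullptr, nullptr, nullptr, nullptr, nullptr};
    unsigned short *d_out = nullptr;
    bool ok = true;

    // Plain device buffers; the contents are arbitrary test data.
    for (int k = 0; k < 5; ++k) {
        ok = ok && cudaMalloc((void **)&d_in[k], numElements) == cudaSuccess;
        if (ok) cudaMemset(d_in[k], k + 1, numElements);
    }
    ok = ok && cudaMalloc((void **)&d_out,
                          numElements * sizeof(unsigned short)) == cudaSuccess;

    float ms = -1.0f;
    if (ok) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        int threadsPerBlock = 256;
        int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;

        // Warm-up launch so one-time start-up costs are not measured.
        meanOfFive<<<blocksPerGrid, threadsPerBlock>>>(
            d_in[0], d_in[1], d_in[2], d_in[3], d_in[4], d_out, numElements);
        cudaDeviceSynchronize();

        // Timed launch: the events are recorded in the same stream as the kernel.
        cudaEventRecord(start);
        meanOfFive<<<blocksPerGrid, threadsPerBlock>>>(
            d_in[0], d_in[1], d_in[2], d_in[3], d_in[4], d_out, numElements);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);  // wait until the kernel has finished

        if (cudaGetLastError() == cudaSuccess) {
            cudaEventElapsedTime(&ms, start, stop);
        }
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }

    for (int k = 0; k < 5; ++k) cudaFree(d_in[k]);
    cudaFree(d_out);
    return ms;
}
```

Calling `timeKernelMs(48000000)` would approximate the 48 MP case, assuming one byte per pixel. Comparing that figure with the end-to-end time indicates how much of the 35 ms is spent outside the kernel.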

I expected a better result with the GPU, but apparently that is not the case. It is a very simple operation, and the GPU should be more efficient than the CPU.
Is it because of the number of cores, only 256? Or maybe I am not using blocks and threads correctly.
Is the Jetson's GPU efficient for image processing? Or is the GPU more efficient only for tasks like AI? I ask because I also coded a simple median filter using shared memory, and the result was better on the CPU.

Thank you !

Could you test again after boosting the system to performance mode?

sudo nvpmode -m 0
sudo jetson_clocks

Hi,

Thank you for your answer. I can’t launch your first command…

``````
jetson@jetson-desktop:~$ sudo nvpmode -m 0
sudo: nvpmode: command not found
``````

Furthermore, I am using the MAXN power mode, and I have installed the latest L4T version.

Typo, it should be nvpmodel -m 0

OK, thank you, the command works now. Is it the same if I change the power mode directly from the desktop?

And the result is the same: the GPU is no more efficient. I need to understand why the GPU is less efficient for a simple task.

Thank you !

Hi,

nvpmodel is a Jetson specific tool.
To figure out the bottleneck, could you profile your sample with nvvp first?

1. Set up a root password for the TX2

2. Launch nvvp

``````
$ /usr/local/cuda-10.0/bin/nvvp
``````

If you hit an error when launching it, please check this comment for help.

3. Create a new session
File → New Session

``````
Connection: Type your device info (user name must be 'root')
Toolkit/Script: click detect
File: Choose your app
Working directory: Choose your working directory
Argument: Type if any
``````

→ Finish

Then you can find the GPU utilization and the bottleneck on the nvvp timeline.
Thanks.

Hi AastaLLL,

Thank you for your response.
I profiled the code using Nsight Eclipse as well as nvprof from the command line.

The compute utilization is very low, around 6%, and the kernel overlap is 0%. Are there parameters other than the number of threads and the number of blocks that affect these figures?

Thank you

Hi,

This is implementation dependent.
Would you mind sharing the nvvp timeline with us?

Thanks.

Hi,

Find below the nvvp timeline:

I noted that the compute utilization is not constant.

Thank you !

Hi,

The app has a dependency between the memory copy and the kernel code.
Since Jetson is a shared-memory system, you can try zero-copy memory to remove the dependency.

Please see our documentation for more details:
https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-management
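As a rough sketch of what zero-copy could look like for the averaging case, the example below allocates the buffers as pinned, mapped host memory and lets the kernel work on them directly, with no `cudaMemcpy` between CPU and GPU. The names `meanOfFive` and `meanZeroCopy` and the fill values are assumptions for illustration, not code from this thread.

```cuda
#include <cuda_runtime.h>

// Mean of five 8-bit images, one thread per pixel.
__global__ void meanOfFive(const unsigned char *A, const unsigned char *B,
                           const unsigned char *C, const unsigned char *D,
                           const unsigned char *E, unsigned short *F,
                           int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        F[i] = (unsigned short)((A[i] + B[i] + C[i] + D[i] + E[i]) / 5);
    }
}

// Hypothetical helper: runs the kernel on zero-copy buffers and returns
// true if every output pixel holds the expected mean.
bool meanZeroCopy(int numElements)
{
    // Enable mapping of pinned host memory. Older CUDA releases expect this
    // before the first call that creates a context; the result is ignored here.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    unsigned char *h_in[5] = {nullptr, nullptr, nullptr, nullptr, nullptr};
    unsigned char *d_in[5] = {nullptr, nullptr, nullptr, nullptr, nullptr};
    unsigned short *h_out = nullptr, *d_out = nullptr;
    bool ok = true;

    // Pinned host allocations that the GPU can address directly.
    for (int k = 0; k < 5; ++k) {
        ok = ok && cudaHostAlloc((void **)&h_in[k], numElements,
                                 cudaHostAllocMapped) == cudaSuccess;
    }
    ok = ok && cudaHostAlloc((void **)&h_out,
                             numElements * sizeof(unsigned short),
                             cudaHostAllocMapped) == cudaSuccess;

    if (ok) {
        // The CPU fills the inputs in place: 10, 20, 30, 40, 50 -> mean 30.
        for (int k = 0; k < 5; ++k)
            for (int i = 0; i < numElements; ++i)
                h_in[k][i] = (unsigned char)(10 * (k + 1));

        // Device-side aliases of the same physical buffers.
        for (int k = 0; k < 5; ++k)
            cudaHostGetDevicePointer((void **)&d_in[k], h_in[k], 0);
        cudaHostGetDevicePointer((void **)&d_out, h_out, 0);

        int threadsPerBlock = 256;
        int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
        meanOfFive<<<blocksPerGrid, threadsPerBlock>>>(
            d_in[0], d_in[1], d_in[2], d_in[3], d_in[4], d_out, numElements);

        // Wait for the kernel, then read the result on the CPU without a copy.
        ok = cudaDeviceSynchronize() == cudaSuccess;
        for (int i = 0; ok && i < numElements; ++i)
            ok = (h_out[i] == 30);
    }

    for (int k = 0; k < 5; ++k)
        if (h_in[k]) cudaFreeHost(h_in[k]);
    if (h_out) cudaFreeHost(h_out);
    return ok;
}
```

The only synchronization point is `cudaDeviceSynchronize()`; the timeline for such a program should show no memcpy activity for these buffers.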

Thanks.

Hi,

I am already using zero-copy memory. With unified memory, the result is the same.

I’ve read a lot about zero-copy and unified memory. Sometimes users recommend unified memory on Jetson, and sometimes zero-copy memory…
As the RAM is shared between the CPU and the GPU, is unified memory better?
Moreover, in the user guide I read that unified memory is not supported on embedded OSes, but it works with L4T…

Thank you !

Hi,

Zero copy is a name for shared memory.
You can use either unified memory (GPU-based) or pinned memory (CPU-based) to achieve this.
Usually, we recommend pinned memory for a one-time, read-only job.
Otherwise, you can use unified memory to reduce the latency of pinned pages.
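For comparison, here is a minimal sketch of the unified-memory route, assuming `cudaMallocManaged` is available on the platform (the thread reports that it works with L4T). The same pointer is used on the CPU and in the kernel; `meanManaged`, `meanOfFive` and the fill values are hypothetical.

```cuda
#include <cuda_runtime.h>

// Mean of five 8-bit images, one thread per pixel.
__global__ void meanOfFive(const unsigned char *A, const unsigned char *B,
                           const unsigned char *C, const unsigned char *D,
                           const unsigned char *E, unsigned short *F,
                           int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        F[i] = (unsigned short)((A[i] + B[i] + C[i] + D[i] + E[i]) / 5);
    }
}

// Hypothetical helper: runs the kernel on managed buffers and returns
// true if every output pixel holds the expected mean.
bool meanManaged(int numElements)
{
    unsigned char *in[5] = {nullptr, nullptr, nullptr, nullptr, nullptr};
    unsigned short *out = nullptr;
    bool ok = true;

    // Managed allocations: one pointer, valid on both CPU and GPU.
    for (int k = 0; k < 5; ++k) {
        ok = ok && cudaMallocManaged((void **)&in[k], numElements) == cudaSuccess;
    }
    ok = ok && cudaMallocManaged((void **)&out,
                                 numElements * sizeof(unsigned short)) == cudaSuccess;

    if (ok) {
        // The CPU writes the inputs: 10, 20, 30, 40, 50 -> mean 30.
        for (int k = 0; k < 5; ++k)
            for (int i = 0; i < numElements; ++i)
                in[k][i] = (unsigned char)(10 * (k + 1));

        int threadsPerBlock = 256;
        int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
        meanOfFive<<<blocksPerGrid, threadsPerBlock>>>(
            in[0], in[1], in[2], in[3], in[4], out, numElements);

        // Synchronize before the CPU touches managed buffers again.
        ok = cudaDeviceSynchronize() == cudaSuccess;
        for (int i = 0; ok && i < numElements; ++i)
            ok = (out[i] == 30);
    }

    for (int k = 0; k < 5; ++k) cudaFree(in[k]);
    cudaFree(out);
    return ok;
}
```

Whether this is faster than pinned memory for a given workload is something to measure on the target board rather than assume.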

By the way, if you are using zero-copy memory, why do you still need a memcpy call?
The GPU kernel is waiting for the memcpy, which leads to the low utilization.
A zero-copy buffer is shared between the CPU and the GPU.

Thanks.