Jetson Xavier GPU: serious time consumption running a CUDA algorithm?

Hi NVIDIA,
I tested the Xavier GPU's computing power and speed with a standard bilinear demosaic algorithm, and the results do not seem to match the claimed 32 TOPS (INT8) / 11 TFLOPS (FP16).

My test code is pasted below, and the GPU (CUDA) time consumption is far beyond my acceptable range. I found that most of the time is spent in the integer division.

template <typename T>
__device__ void demosaic_bilinear_kernel(int width, int height,
    uint16_t* R_apci, uint16_t* G_apci, uint16_t* B_apci,
    int x, int y, bool xEven, bool yEven)
{
    // ... (fetches of current, up, down, left, right, up_left, etc. elided) ...

    /************ START ************/
    if (!isBlue && !isRed)
    {
        G_apci[y * width + x] = current;
        if (isGreenInRedRow) {
            R_apci[y * width + x] = (left + right) / 2;
            B_apci[y * width + x] = (up + down) / 2;
        } else {
            B_apci[y * width + x] = (left + right) / 2;
            R_apci[y * width + x] = (up + down) / 2;
        }
    } else if (isBlue) {
        G_apci[y * width + x] = (up + down + left + right) / 4;
        B_apci[y * width + x] = current;
        R_apci[y * width + x] = (up_left + up_right + down_left + down_right) / 4;
    } else if (isRed) {
        G_apci[y * width + x] = (up + down + left + right) / 4;
        R_apci[y * width + x] = current;
        B_apci[y * width + x] = (up_left + up_right + down_left + down_right) / 4;
    }
    /************ STOP ************/
}

The START–STOP part above costs nearly 8 ms. But when I change the code to avoid the divider, as follows, it takes only about 1 ms. The difference is too big.

template <typename T>
__device__ void demosaic_bilinear_kernel(int width, int height,
    uint16_t* R_apci, uint16_t* G_apci, uint16_t* B_apci,
    int x, int y, bool xEven, bool yEven)
{
    // ... (fetches of current, up, down, left, right, up_left, etc. elided) ...

    /************ START ************/
    if (!isBlue && !isRed)
    {
        G_apci[y * width + x] = current;
        if (isGreenInRedRow) {
            R_apci[y * width + x] = (left + right) >> 1;
            B_apci[y * width + x] = (up + down) >> 1;
        } else {
            B_apci[y * width + x] = (left + right) >> 1;
            R_apci[y * width + x] = (up + down) >> 1;
        }
    } else if (isBlue) {
        G_apci[y * width + x] = (up + down + left + right) >> 2;
        B_apci[y * width + x] = current;
        R_apci[y * width + x] = (up_left + up_right + down_left + down_right) >> 2;
    } else if (isRed) {
        G_apci[y * width + x] = (up + down + left + right) >> 2;
        R_apci[y * width + x] = current;
        B_apci[y * width + x] = (up_left + up_right + down_left + down_right) >> 2;
    }
    /************ STOP ************/
}


Is the Jetson Xavier GPU's divider performance really that bad?
Is there any way to improve the GPU's division performance and speed?

Any help to improve the performance and speed of the Xavier GPU would be much appreciated.
Thank you so much.

Hi,

Have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

More, could you monitor the system to see if the GPU is fully occupied?

$ sudo tegrastats

Thanks.

Thank you for your reply, but the commands don't seem to take effect; the time consumption is not reduced.
First I ran tegrastats, then I ran my test case. After seeing that the time consumption had not decreased, I exited the program. The process's print log is attached below.
tegrastats.txt (18.5 KB)

Hi,

Based on the log, the GPU resource is not fully occupied.
Maybe you can try to profile the application to find the bottleneck first.

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.