I’m currently running a series of tests on the Tegra X2, where I use the board to extract features with the first layers of a pre-trained DNN.
The test scenario is the following:
* pre-trained VGG16 model
* input image size of 224x224x3
* TensorFlow 2.0 as the backend
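For reference, the truncated network is built roughly like this (a sketch, not my exact script; `block1_pool` is the standard Keras layer name for the first pooling layer of VGG16, and `weights=None` is used here only to avoid downloading the pre-trained weights):

```python
import tensorflow as tf

# Load VGG16 without the classifier head, for 224x224x3 inputs.
base = tf.keras.applications.VGG16(weights=None, include_top=False,
                                   input_shape=(224, 224, 3))

# Truncate the network at the first pooling layer ('block1_pool'),
# i.e. keep only the first two conv layers plus the pool.
feature_extractor = tf.keras.Model(
    inputs=base.input,
    outputs=base.get_layer('block1_pool').output)
```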
The first layer produces 12 MB of data, as does the second one. The third layer, a pooling layer, reduces the amount of data to 3 MB.
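These figures match a quick float32 size calculation (assuming the 64-channel first block of VGG16, 4 bytes per element):

```python
# Back-of-the-envelope size of a feature map in MB, assuming float32.
def fmap_mb(h, w, c, bytes_per_elem=4):
    return h * w * c * bytes_per_elem / 2**20

conv1 = fmap_mb(224, 224, 64)  # first conv output  -> ~12.25 MB
conv2 = fmap_mb(224, 224, 64)  # second conv output -> ~12.25 MB
pool1 = fmap_mb(112, 112, 64)  # after 2x2 max pool -> ~3.06 MB
print(conv1, conv2, pool1)
```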
So I noticed that performance is strongly correlated with the amount of data produced: running the network up to the third layer takes one-third of the time needed to compute only the first two layers.
Is there any information available regarding the memory access time, the memory latency, and the bus specifications of the Tegra X2 that I could use to understand this behavior?