Hello, I'm also sharing my results with VGG16 (from Keras applications; the last layer is a Flatten):
with DLA (FP16): ~16.5 ms
without DLA (FP16): ~5.0 ms
I assume this is because almost half of the network is still running on the GPU (see the layer placement report below).
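For context, this is roughly how such an engine gets built. It is only a minimal sketch: it assumes an ONNX export of the model named vgg16.onnx (the filename is just a placeholder), and the exact API names may differ slightly between TensorRT versions (build_engine is replaced by build_serialized_network on newer releases).

import tensorrt as trt

# Minimal sketch: build an FP16 engine that prefers the DLA and lets
# unsupported layers fall back to the GPU. "vgg16.onnx" is a placeholder
# for however the Keras model was exported.
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("vgg16.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)            # DLA requires FP16 or INT8
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)    # unsupported layers run on the GPU
config.default_device_type = trt.DeviceType.DLA  # place every layer on the DLA if possible
config.DLA_core = 0                              # which DLA core to use

engine = builder.build_engine(network, config)   # build_serialized_network() on newer releases

With an INFO-level logger the builder also reports where each layer ended up, which is where the listing below comes from.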
******************************
Layers running on DLA:
block1_conv1/convolution, block1_conv1/BiasAdd, block1_conv1/Relu, block1_conv2/convolution, block1_conv2/BiasAdd, block1_conv2/Relu, block1_pool/MaxPool, block2_conv1/convolution, block2_conv1/BiasAdd, block2_conv1/Relu, block2_conv2/convolution, block2_conv2/BiasAdd, block2_conv2/Relu, block2_pool/MaxPool, block3_conv1/convolution, block3_conv1/BiasAdd, block3_conv1/Relu, block3_conv2/convolution, block3_conv2/BiasAdd, block3_conv2/Relu, block3_conv3/convolution, block3_conv3/BiasAdd, block3_conv3/Relu, block3_pool/MaxPool, block4_conv1/convolution, block4_conv1/BiasAdd, block4_conv1/Relu, block4_conv2/convolution, block4_conv2/BiasAdd, block4_conv2/Relu, block4_conv3/convolution, block4_conv3/BiasAdd, block4_conv3/Relu, block4_pool/MaxPool, block5_conv1/convolution, block5_conv1/BiasAdd, block5_conv1/Relu, block5_conv2/convolution, block5_conv2/BiasAdd, block5_conv2/Relu, block5_conv3/convolution, block5_conv3/BiasAdd, block5_conv3/Relu, block5_pool/MaxPool,
******************************
******************************
Layers running on GPU:
block1_conv1/kernel, block1_conv1/bias, block1_conv2/kernel, block1_conv2/bias, block2_conv1/kernel, block2_conv1/bias, block2_conv2/kernel, block2_conv2/bias, block3_conv1/kernel, block3_conv1/bias, block3_conv2/kernel, block3_conv2/bias, block3_conv3/kernel, block3_conv3/bias, block4_conv1/kernel, block4_conv1/bias, block4_conv2/kernel, block4_conv2/bias, block4_conv3/kernel, block4_conv3/bias, block5_conv1/kernel, block5_conv1/bias, block5_conv2/kernel, block5_conv2/bias, block5_conv3/kernel, block5_conv3/bias, reshape_1/strided_slice/stack, reshape_1/strided_slice/stack_1, reshape_1/Reshape/shape/1, (Unnamed Layer* 73) [Shuffle], reshape_1/Reshape,
******************************
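For a quick check without writing any code, trtexec should build an equivalent engine and print a similar placement report when run with --verbose (the ONNX filename is again just a placeholder):

trtexec --onnx=vgg16.onnx --fp16 --useDLACore=0 --allowGPUFallback --verbose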