Bluefield-2 DOCA SHA Engine Performance

Hello,
I have been trying to use the SHA engine on Bluefield-2 through OpenSSL, but I've noticed that the engine's throughput is considerably lower than software, and it gets worse as I lower the message size (about 7x lower for 256-byte messages).
I am following the instructions here:

I have tested this with DOCA Bluefield 2.10, which is the latest version at the moment.
I would like to know what I am doing wrong and what performance I should expect. The two runs below use 10000-byte blocks, first with the engine and then in software only.

ubuntu@localhost:~$ openssl speed -evp sha256 -bytes 10000 -elapsed --engine /opt/mellanox/doca/tools/doca_sha_offload_engine/libdoca_sha_offload_engine.so -async_jobs 256
Engine "doca_sha_offload_engine" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing sha256 for 3s on 10000 size blocks: 259980 sha256's in 3.00s
version: 3.0.2
built on: Tue Aug 20 17:27:32 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -ffile-prefix-map=/build/openssl-BW0rDL/openssl-3.0.2=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_TLS_SECURITY_LEVEL=2 -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
CPUINFO: OPENSSL_armcap=0xbf
The 'numbers' are in 1000s of bytes per second processed.
type          10000 bytes
sha256          866600.00k


ubuntu@localhost:~$ openssl speed -evp sha256 -bytes 10000 -elapsed 
You have chosen to measure elapsed time instead of user CPU time.
Doing sha256 for 3s on 10000 size blocks: 393892 sha256's in 3.00s
version: 3.0.2
built on: Tue Aug 20 17:27:32 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -ffile-prefix-map=/build/openssl-BW0rDL/openssl-3.0.2=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_TLS_SECURITY_LEVEL=2 -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
CPUINFO: OPENSSL_armcap=0xbf
The 'numbers' are in 1000s of bytes per second processed.
type          10000 bytes
sha256         1312973.33k
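
To make the comparison explicit, here is a small sketch that recomputes the throughput from the "Doing sha256 ..." summary lines of the two runs above and takes their ratio. The regular expression and the helper name `throughput_kbytes_per_s` are my own illustration, not part of OpenSSL or DOCA, and the pattern is only assumed to match the line format shown in this post.

```python
import re

# Assumed format of the summary line, copied from the runs above, e.g.
# "Doing sha256 for 3s on 10000 size blocks: 259980 sha256's in 3.00s"
LINE_RE = re.compile(
    r"Doing \S+ for \d+s on (?P<size>\d+) size blocks: "
    r"(?P<count>\d+) \S+ in (?P<secs>[\d.]+)s"
)

def throughput_kbytes_per_s(line):
    """Return throughput in 1000s of bytes per second (hypothetical helper)."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError("unrecognized openssl speed summary line")
    size = int(m["size"])      # bytes hashed per operation
    count = int(m["count"])    # operations completed
    secs = float(m["secs"])    # elapsed wall-clock seconds
    return size * count / secs / 1000.0

engine_line = "Doing sha256 for 3s on 10000 size blocks: 259980 sha256's in 3.00s"
software_line = "Doing sha256 for 3s on 10000 size blocks: 393892 sha256's in 3.00s"

engine = throughput_kbytes_per_s(engine_line)
software = throughput_kbytes_per_s(software_line)
ratio = engine / software

print(f"engine:   {engine:.2f}k")
print(f"software: {software:.2f}k")
print(f"engine / software: {ratio:.2f}")
```

The recomputed values agree with the 866600.00k and 1312973.33k figures that OpenSSL printed, so at 10000 bytes the engine reaches roughly two thirds of the software throughput.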

Thanks

Hi amohammadrez,

Many factors can affect OpenSSL performance. For further support, please contact networking-support@nvidia.com.

Thank you
Quanying Sun