Why is the GPU version of Zstd slower than the CPU version?

[configuration]
CUDA Version: 11.4
OS in Docker container: Ubuntu 20.04.2 LTS
GPU card: A10 x 1
CPU: AMD EPYC 7413 24-Core Processor
nvcomp binary version: nvcomp_2.4.1_x86_64_11.x

[testing data]
image data size: 675 x 78

[Zstd decompression time]
On CPU (Meta API): 0.000069 s
On GPU (nvcomp API): 0.000038 s

[problem description]
I’m creating a small program that invokes the Zstd APIs of nvcomp. I simply copied and pasted the low_level_quickstart_example into my code so that I can do both compression and decompression. However, I find that the function cudaStreamSynchronize() takes 0.014081 s, which makes the GPU version of Zstd slower than the CPU version.
I also used Nsight Systems to generate a profile, and it confirms this behavior. May I ask what is wrong?

Thanks in advance!