GDS tool gdsio: throughput is less than 500 MB/s

I used gdsio to test GPUDirect Storage performance. The underlying SSD is a Samsung 970 Pro and the GPU is a Tesla10. I ran sequential tests, transferring data in 4K blocks between a multi-gigabyte file and the GPU GDS buffer. Why is the throughput less than 500 MB/s no matter how many workers (-w) I use?
How can I improve the performance?

The tests are below:

1) ./gdsio -x 6 -f /mnt/gdsio.001 -d 0 -w 8 -s 5G -i 4k -I 1

IoType: WRITE XferType: GPU_BATCH Threads: 1 IoDepth: 8 DataSetSize: 5242880/5242880(KiB) IOSize: 4(KiB) Throughput: 0.363297 GiB/sec, Avg_Latency: 84.000275 usecs ops: 163840 total_time 13.762840 secs

2) ./gdsio -x 6 -f /mnt/gdsio.001 -d 0 -w 4 -s 5G -i 4k -I 1

IoType: WRITE XferType: GPU_BATCH Threads: 1 IoDepth: 4 DataSetSize: 5242880/5242880(KiB) IOSize: 4(KiB) Throughput: 0.241293 GiB/sec, Avg_Latency: 63.237042 usecs ops: 327680 total_time 20.721715 secs

3) ./gdsio -x 6 -f /mnt/gdsio.001 -d 0 -w 16 -s 5G -i 4k -I 1

IoType: WRITE XferType: GPU_BATCH Threads: 1 IoDepth: 16 DataSetSize: 5242880/5242880(KiB) IOSize: 4(KiB) Throughput: 0.435547 GiB/sec, Avg_Latency: 140.130884 usecs ops: 81920 total_time 11.479828 secs

4) ./gdsio -x 6 -f /mnt/gdsio.001 -d 0 -w 32 -s 5G -i 4k -I 1

IoType: WRITE XferType: GPU_BATCH Threads: 1 IoDepth: 32 DataSetSize: 5242880/5242880(KiB) IOSize: 4(KiB) Throughput: 0.473013 GiB/sec, Avg_Latency: 258.058984 usecs ops: 40960 total_time 10.570538 secs

5) ./gdsio -x 6 -f /mnt/gdsio.001 -d 0 -w 64 -s 5G -i 4k -I 1

IoType: WRITE XferType: GPU_BATCH Threads: 1 IoDepth: 64 DataSetSize: 5242880/5242880(KiB) IOSize: 4(KiB) Throughput: 0.460453 GiB/sec, Avg_Latency: 530.181250 usecs ops: 20480 total_time 10.858860 secs

6) ./gdsio -x 6 -f /mnt/gdsio.001 -d 0 -w 100 -s 10G -i 4k -I 1

INFO: Truncated down the data size to lower size which is a multiple of batch sizes * no of batches 10737254400

IoType: WRITE XferType: GPU_BATCH Threads: 1 IoDepth: 100 DataSetSize: 10485600/10485760(KiB) IOSize: 4(KiB) Throughput: 0.486352 GiB/sec, Avg_Latency: 784.306744 usecs ops: 26214 total_time 20.560912 secs

7) ./gdsio -x 6 -f /mnt/gdsio.001 -d 0 -w 128 -s 10G -i 4k -I 1

IoType: WRITE XferType: GPU_BATCH Threads: 1 IoDepth: 128 DataSetSize: 10485760/10485760(KiB) IOSize: 4(KiB) Throughput: 0.487373 GiB/sec, Avg_Latency: 1001.797021 usecs ops: 20480 total_time 20.518156 secs


8) ./gdsio -x 6 -f /mnt/gdsio.001 -d 0 -w 128 -s 5G -i 4k -I 1

IoType: WRITE XferType: GPU_BATCH Threads: 1 IoDepth: 128 DataSetSize: 5242880/5242880(KiB) IOSize: 4(KiB) Throughput: 0.477510 GiB/sec, Avg_Latency: 1022.418457 usecs ops: 10240 total_time 10.470994 secs

Not my area of expertise. What throughput did you expect, and why?

Based on mass storage tests I have seen over the past twenty years, people generally use two test modes, reflecting different usage patterns: (1) random blocks with small transfer sizes, like 4K, and (2) sequential blocks with large transfer sizes, >= 64K.
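As a hypothetical illustration of the two modes with gdsio, reusing the file, device, and batch transfer type from your runs (per gdsio's usage text, -I 0 is a sequential read and -I 2 is a random read; the -w and -s values are simply carried over from your first test):

  ./gdsio -x 6 -f /mnt/gdsio.001 -d 0 -w 8 -s 5G -i 4k -I 2   (random, small blocks: measures IOPS)
  ./gdsio -x 6 -f /mnt/gdsio.001 -d 0 -w 8 -s 5G -i 1M -I 0   (sequential, large blocks: measures throughput)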

Using large transfer sizes minimizes the impact of per-I/O overhead. Because SSDs have much smaller per-I/O overhead than HDDs, a 64K transfer size is likely sufficient, while an HDD might need something like a 1M transfer size to reach maximum sequential throughput.
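A back-of-the-envelope check using the numbers from your test 4 (-w 32) makes that overhead visible:

  0.473 GiB/sec at 4 KiB per I/O is about 124,000 IOPS, i.e. roughly 8 usecs of effective cost per I/O.
  At that same IOPS rate, 64 KiB transfers would move about 124,000 x 64 KiB ≈ 7.6 GiB/sec,
  at which point the drive's sequential bandwidth, not per-I/O overhead, becomes the limit.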

In general, I would expect throughput to be limited by SSD performance if the GPU is attached using a PCIe x16 link, since an x16 link provides on the order of 16 GB/sec (for PCIe 3.0), far more than a single NVMe drive can sustain. From published tests of the Samsung 970 Pro, it seems that measured throughput at a 64K transfer size is 1.5+ GB/sec (the numbers may differ somewhat by capacity variant).

It seems all your tests were performed with a small transfer size of 4K, which is good for determining IOPS but too small to achieve peak sequential throughput. Try increasing the transfer size for throughput tests.
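For example, keeping your command line otherwise unchanged and raising only -i (the two sizes below are illustrative; sweeping a range of sizes is more rigorous):

  ./gdsio -x 6 -f /mnt/gdsio.001 -d 0 -w 8 -s 5G -i 64k -I 1
  ./gdsio -x 6 -f /mnt/gdsio.001 -d 0 -w 8 -s 5G -i 1M -I 1

If throughput climbs toward the drive's rated sequential write speed as -i grows, the 4K results simply reflect per-I/O overhead rather than a GDS misconfiguration.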