Howdy
Summary: I am trying to achieve maximum random read throughput using GDS batch read APIs.
I am working on a research project to accelerate random reads from NVMe SSDs, and I'm currently testing NVIDIA's GPUDirect Storage. I have a single-node machine with the following specifications.
CPU: 1x Intel® Xeon® W-2255, GPU: 2x NVIDIA® RTX™ A5000, PCIe generation: 3.0
Storage: 1x Samsung PM983 U.2 NVMe SSD (Seq read: 3.2 GB/s, Rand read: 540K IOPS), local storage connected via PCIe x4
I am trying to saturate the read bandwidth while reading 4-16 MB of data as random 4K blocks from the NVMe SSD, i.e. approximately 1024-4096 random locations per read. I would hope to achieve a throughput close to 540K IOPS * 4 KB = 2.16 GB/s. I get the following results with the gdsio tool while testing batch reads.
IoType: RANDREAD XferType: GPU_BATCH Threads: 1 IoDepth: 128 DataSetSize: 1024/1024(KiB) IOSize: 4(KiB) Throughput: 0.669796 GiB/sec, Avg_Latency: 768.000000 usecs ops: 257 total_time 0.001458 secs
IoType: RANDREAD XferType: GPU_BATCH Threads: 1 IoDepth: 128 DataSetSize: 4096/4096(KiB) IOSize: 4(KiB) Throughput: 1.142847 GiB/sec, Avg_Latency: 777.000000 usecs ops: 1025 total_time 0.003418 secs
IoType: RANDREAD XferType: GPU_BATCH Threads: 1 IoDepth: 128 DataSetSize: 8192/8192(KiB) IOSize: 4(KiB) Throughput: 1.276344 GiB/sec, Avg_Latency: 786.000000 usecs ops: 2049 total_time 0.006121 secs
IoType: RANDREAD XferType: GPU_BATCH Threads: 1 IoDepth: 128 DataSetSize: 16384/16384(KiB) IOSize: 4(KiB) Throughput: 1.329787 GiB/sec, Avg_Latency: 789.000000 usecs ops: 4097 total_time 0.011750 secs
The read throughput increases with the data set size, which makes sense since the SSD bandwidth hasn't been saturated yet. I tried to replicate this by modifying cufile_sample_022.cc from the samples in /cuda-12.1/gds/samples/. You can find my code here: GDS/gds_batch at main · susavlsh10/GDS · GitHub
I tried changing cufile.json to increase the io_batchsize, but it seems we are limited to 256 IOs per batch? Going beyond this gave me an error. To read more blocks, I created multiple batch_ids and io_batch_params and submitted all of them in a loop, shown right after the config snippet below.
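For reference, the edit to cufile.json looked roughly like this (a trimmed sketch, not my full file; io_batchsize, which I believe sits under the "properties" section, is the only value I changed):

{
    "properties": {
        // ... other properties left at their defaults ...
        "io_batchsize": 256
    }
}

And this is the submission loop: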
// Submit each pre-built batch back-to-back, checking the error code per
// submission and printing a timestamp for each submit.
for (j = 0; j < NUM_BATCH; j++) {
    errorBatch[j] = cuFileBatchIOSubmit(batch_id[j], batch_size, io_batch_params[j], flags);
    if (errorBatch[j].err != 0) {
        std::cerr << "Error in IO Batch Submit" << std::endl;
        goto out3;
    }
    std::cout << "Batch " << j << " submitted at "
              << (double)(clock() - start) / CLOCKS_PER_SEC << " s" << std::endl;
}
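After the submit loop, I wait for all IOs with cuFileBatchIOGetStatus, roughly like this (a simplified sketch of my reaping loop; it reuses batch_id, batch_size, and errorBatch from above, and MAX_BATCH_SIZE is just a placeholder for the events-array capacity):

CUfileIOEvents_t io_batch_events[MAX_BATCH_SIZE];  // placeholder capacity, >= batch_size
for (j = 0; j < NUM_BATCH; j++) {
    unsigned int num_completed = 0;
    while (num_completed < batch_size) {
        // On input nr is the capacity of io_batch_events; on return it holds
        // the number of IOs reported complete by this call.
        unsigned int nr = batch_size;
        errorBatch[j] = cuFileBatchIOGetStatus(batch_id[j], batch_size, &nr, io_batch_events, NULL);
        if (errorBatch[j].err != 0) {
            std::cerr << "Error in IO Batch Get Status" << std::endl;
            goto out3;
        }
        num_completed += nr;
    }
}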
The tests below were performed with the default io_batchsize=128; these are the results.
IO size = 4096 Bytes
Number of batches to read = 8
Reading from file /home/grads/s/sls7161/nvme/float_save/float_4194304_a.dat
Batch 0 submitted at 0.000832 s
Batch 1 submitted at 0.001605 s
Batch 2 submitted at 0.002347 s
Batch 3 submitted at 0.003074 s
Batch 4 submitted at 0.003791 s
Batch 5 submitted at 0.004506 s
Batch 6 submitted at 0.005206 s
Batch 7 submitted at 0.005807 s
Total Data size = 4 MB
Time taken = 0.005837
Read Bandwidth = 0.669222 GB/s
It seems that multiple batches cannot be submitted at the same time; I believe subsequent batches are submitted only after the previous batch's reads are complete. Furthermore, increasing the data size does not increase the throughput: I get approximately 0.7 GB/s for all data sizes. How can I achieve the same read throughput as the gdsio tool when I need to read more than 128 blocks? The reason I'm using batch reads is that async I/O is currently unavailable, and batch reads give me the highest throughput.
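As a sanity check on those submit timestamps, I'm also planning to re-measure with a wall clock instead of clock(), since clock() counts CPU time and can under-report time spent blocked inside a call. This is only a sketch (wall_start, after_submit, and after_completion are new names, and the reaping step is elided):

// Sketch: wall-clock timing to separate the cost of cuFileBatchIOSubmit
// itself from the time spent waiting for completions. Needs <chrono>.
auto wall_start = std::chrono::steady_clock::now();
for (j = 0; j < NUM_BATCH; j++) {
    errorBatch[j] = cuFileBatchIOSubmit(batch_id[j], batch_size, io_batch_params[j], flags);
    // ... same error handling as in the loop above ...
}
auto after_submit = std::chrono::steady_clock::now();

// ... reap all completions with cuFileBatchIOGetStatus as sketched earlier ...
auto after_completion = std::chrono::steady_clock::now();

std::cout << "Submit: "
          << std::chrono::duration<double>(after_submit - wall_start).count() << " s, "
          << "completion: "
          << std::chrono::duration<double>(after_completion - after_submit).count() << " s"
          << std::endl;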
Thank you so much for your help.