Getting the best performance from NVIDIA GPUDirect Storage APIs (batch I/O)

Howdy

Summary: I am trying to achieve maximum random read throughput using GDS batch read APIs.

I am working on a research project to accelerate random reads from NVMe SSDs, and I'm currently testing out NVIDIA's GPUDirect Storage. I have a single-node machine with the following specification.

CPU: 1x Intel® Xeon® W-2255, GPU: 2x NVIDIA® RTX™ A5000, PCIe generation: 3.0
Storage: 1x Samsung PM983 U.2 NVMe SSD (seq. read: 3.2 GB/s, rand. read: 540K IOPS), local storage connected via PCIe x4

I am trying to saturate the read bandwidth while reading 4-16 MB of data from random 4K blocks on the NVMe SSD. I hope to achieve a throughput close to 540K IOPS * 4KB = 2.16 GB/s. I need to read from approximately 1024-4096 random locations. I get the following results with the gdsio tool while testing batch reads.

IoType: RANDREAD XferType: GPU_BATCH Threads: 1 IoDepth: 128 DataSetSize: 1024/1024(KiB) IOSize: 4(KiB) Throughput: 0.669796 GiB/sec, Avg_Latency: 768.000000 usecs ops: 257 total_time 0.001458 secs
IoType: RANDREAD XferType: GPU_BATCH Threads: 1 IoDepth: 128 DataSetSize: 4096/4096(KiB) IOSize: 4(KiB) Throughput: 1.142847 GiB/sec, Avg_Latency: 777.000000 usecs ops: 1025 total_time 0.003418 secs
IoType: RANDREAD XferType: GPU_BATCH Threads: 1 IoDepth: 128 DataSetSize: 8192/8192(KiB) IOSize: 4(KiB) Throughput: 1.276344 GiB/sec, Avg_Latency: 786.000000 usecs ops: 2049 total_time 0.006121 secs
IoType: RANDREAD XferType: GPU_BATCH Threads: 1 IoDepth: 128 DataSetSize: 16384/16384(KiB) IOSize: 4(KiB) Throughput: 1.329787 GiB/sec, Avg_Latency: 789.000000 usecs ops: 4097 total_time 0.011750 secs

The read throughput increases with the dataset size, which makes sense since the SSD bandwidth hasn't been saturated yet. I tried to replicate this by modifying cufile_sample_022.cc from the sample codes in /cuda-12.1/gds/samples/. You can find my code here: https://github.com/susavlsh10/GDS/tree/main/gds_batch
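For context, the core batch flow my code follows is roughly the sketch below (error handling trimmed; cf_handle, dev_ptr, io_size, batch_size, MAX_BATCH_IOS, and random_4k_offset are placeholders for my registered file handle, registered GPU buffer, 4 KiB I/O size, per-batch I/O count, array bound, and offset generator):

CUfileBatchHandle_t batch_id;
CUfileIOParams_t io_batch_params[MAX_BATCH_IOS];
CUfileIOEvents_t io_batch_events[MAX_BATCH_IOS];

// Describe one 4 KiB random read per batch slot.
for (unsigned i = 0; i < batch_size; i++) {
    io_batch_params[i].mode = CUFILE_BATCH;
    io_batch_params[i].fh = cf_handle;                         // registered CUfileHandle_t
    io_batch_params[i].opcode = CUFILE_READ;
    io_batch_params[i].cookie = nullptr;
    io_batch_params[i].u.batch.devPtr_base = dev_ptr;          // registered GPU buffer
    io_batch_params[i].u.batch.devPtr_offset = i * io_size;
    io_batch_params[i].u.batch.file_offset = random_4k_offset(i);
    io_batch_params[i].u.batch.size = io_size;                 // 4096 bytes
}

cuFileBatchIOSetUp(&batch_id, batch_size);
cuFileBatchIOSubmit(batch_id, batch_size, io_batch_params, 0);

// Poll until every I/O in the batch has completed.
unsigned num_completed = 0;
while (num_completed < batch_size) {
    unsigned nr = batch_size;
    cuFileBatchIOGetStatus(batch_id, batch_size, &nr, io_batch_events, NULL);
    num_completed += nr;
}
cuFileBatchIODestroy(batch_id);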

I tried changing cufile.json to increase io_batchsize, but it seems we are limited to 256 I/Os per batch? Going beyond this gave me an error. I created multiple batch_ids and io_batch_params arrays and submitted all of them in a loop, as shown below.

for (j = 0; j < NUM_BATCH; j++) {
    // Submit each pre-built batch back to back.
    errorBatch[j] = cuFileBatchIOSubmit(batch_id[j], batch_size, io_batch_params[j], flags);
    if (errorBatch[j].err != 0) {
        std::cerr << "Error in IO Batch Submit" << std::endl;
        goto out3;
    }

    std::cout << "Batch " << j << " submitted at " << (double)(clock() - start) / CLOCKS_PER_SEC << " s" << std::endl;
}
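For reference, this is the knob I was changing; a trimmed excerpt from my /etc/cufile.json (everything else left at defaults; the value shown is just an example, and on my system the entry lives under "properties"):

"properties": {
    // maximum number of I/Os per batch (default 128)
    "io_batchsize": 128
}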

The following tests were performed with the default io_batchsize=128, and I got the results below.

IO size = 4096 Bytes
Number of batches to read = 8
Reading from file /home/grads/s/sls7161/nvme/float_save/float_4194304_a.dat
Batch 0 submitted at 0.000832 s
Batch 1 submitted at 0.001605 s
Batch 2 submitted at 0.002347 s
Batch 3 submitted at 0.003074 s
Batch 4 submitted at 0.003791 s
Batch 5 submitted at 0.004506 s
Batch 6 submitted at 0.005206 s
Batch 7 submitted at 0.005807 s
Total Data size = 4 MB
Time taken = 0.005837
Read Bandwidth = 0.669222 GB/s

It seems that multiple batches cannot be submitted at the same time; I believe each subsequent batch is submitted only after the previous batch read is complete. Furthermore, increasing the data size does not increase the throughput: I'm getting approximately 0.7 GB/s for all data sizes. How can I achieve the same read throughput as the gdsio tool while reading more than 128 blocks? The reason I'm using batch reads is that async I/O is currently unavailable, and batch reads give me the highest throughput.

Thank you so much for your help.

I would appreciate any feedback on improving concurrent batch submission and batch I/O throughput.

Hi, thanks for the detailed information. It seems that you are submitting the batches sequentially, one after another, in a loop. Each submission takes approximately 0.0008 s, so you are seeing the cumulative submission timings (8 batches × ~0.0008 s ≈ 0.006 s, which matches your log). If you want to submit multiple batches in parallel, I would suggest using multiple threads, each performing one batch I/O.

Thank you for your response. I have tried submitting batches using pthreads, but the throughput goes down even further. I'm not sure if there is an internal lock that's preventing concurrent submissions. In the example below, I created 8 threads and submitted one batch per thread, using the thread function shown below.
The code can be found here: https://github.com/susavlsh10/GDS/tree/main/gds_batch in the file cu_bread.cc.

Thread function:

typedef struct thread_data
{
    CUfileIOParams_t *io_batch_params;  // this thread's batch parameters
    CUfileBatchHandle_t batch_id;       // this thread's batch handle
    int j;                              // thread index
    int batch_offset;
    int size;
    int batch_size;
    clock_t start;                      // shared start time for logging
} thread_data_t;

static void *thread_batch_io(void *data)
{
    thread_data_t *t = (thread_data_t *)data;

    cudaSetDevice(0);

    CUfileError_t errorBatch = cuFileBatchIOSubmit(t->batch_id, t->batch_size, t->io_batch_params, 0);
    std::cout << "Thread " << t->j << " submitted batch at " << (double)(clock() - t->start) / CLOCKS_PER_SEC << " s" << std::endl;
    if (errorBatch.err != 0) {
        std::cerr << "Error in IO Batch Submit" << std::endl;
    }
    pthread_exit(NULL);
}

In the main function:

    start = clock();
    for (j = 0; j < NUM_BATCH; j++) {
        t[j].batch_offset = batch_offset;
        t[j].batch_size = batch_size;
        t[j].io_batch_params = io_batch_params[j];
        t[j].batch_id = batch_id[j];
        t[j].size = size;
        t[j].j = j;
        t[j].start = start;
        pthread_create(&threads[j], NULL, &thread_batch_io, &t[j]);
    }
    std::cout << "Waiting " << std::endl;
    std::cout << "All threads created " << (double)(clock() - start) / CLOCKS_PER_SEC << std::endl;

    for (j = 0; j < NUM_BATCH; j++) {
        pthread_join(threads[j], NULL);
    }
    std::cout << "All threads joined " << (double)(clock() - start) / CLOCKS_PER_SEC << std::endl;

Output:

Waiting
All threads created 0.000843
Thread 6 submitted batch at 0.012364 s
Thread 4 submitted batch at 0.014712 s
Thread 0 submitted batch at 0.017258 s
Thread 7 submitted batch at 0.019806 s
Thread 2 submitted batch at Thread 1 submitted batch at 0.02498 s
0.024997 s
Thread 3 submitted batch at Thread 5 submitted batch at 0.030134 s
0.030143 s
All threads joined 0.030535
Total Data size = 4 MB
Time taken = 0.030573
Read Bandwidth = 0.127768 GB/s

As we can see, the read throughput decreased from ~0.7 GB/s to 0.127 GB/s. Is there a right way to submit batches using threads? Are there any examples of submitting multiple batches concurrently?

Thank you for your time. Any helpful feedback would be appreciated.

Thanks for sharing the code. I see that you have made the submission multi-threaded; however, the cuFileBatchIOGetStatus call is still single-threaded (part of the main thread). Just to give an idea, I would probably do it the following way:
thread_batch_io(…)
{
    // each thread submits its own batch, then polls it to completion
    call cuFileBatchIOSubmit(…)
    call cuFileBatchIOGetStatus(…)
}

In the main function:

for (j = 0; j < NUM_BATCH; j++) {
    pthread_create(…, thread_batch_io, …);
}
pthread_join(…);
The main idea here is that each thread performs its own I/O (submit + wait until done), operating on its individual set of parameters, while the main function creates the threads and waits for them. Unfortunately, we do not have any multi-threaded samples, otherwise I would have provided one. Please note that pthread calls have their own CPU overheads.

Just to add: in case you are running an older GDS version, I would suggest upgrading to the latest (CUDA 12.1 or later), which has more performance optimizations. You can check the installed release with gdscheck -p.

Hope this helps.

Thank you for your response. I did try this a few weeks ago but did not get better results. Here is the updated thread function.

static void *thread_batch_io(void *data)
{
    thread_data_t *t = (thread_data_t *)data;
    CUfileIOEvents_t io_batch_events[t->batch_size];

    cudaSetDevice(0);
    std::cout << "Thread " << t->j << " beginning at " << (double)(clock() - t->start) / CLOCKS_PER_SEC << " s" << std::endl;
    CUfileError_t errorBatch = cuFileBatchIOSubmit(t->batch_id, t->batch_size, t->io_batch_params, 0);
    std::cout << "Thread " << t->j << " submitted at " << (double)(clock() - t->start) / CLOCKS_PER_SEC << " s" << std::endl;
    if (errorBatch.err != 0) {
        std::cerr << "Error in IO Batch Submit" << std::endl;
    }

    // Busy-wait polling until every I/O in this thread's batch has completed.
    // (Note: clock() reports process CPU time, which grows with every spinning thread.)
    unsigned int nr = 0;
    int num_completed = 0;  // completed I/Os so far
    while (num_completed != t->batch_size)
    {
        memset(io_batch_events, 0, sizeof(io_batch_events));  // clear the whole events array
        nr = t->batch_size;
        errorBatch = cuFileBatchIOGetStatus(t->batch_id, t->batch_size, &nr, io_batch_events, NULL);
        if (errorBatch.err != 0) {
            std::cerr << "Error in IO Batch Get Status" << std::endl;
            //goto out4;
        }
        num_completed += nr;
    }

    pthread_exit(NULL);
}

The main function does the following.

start = clock();
for (j = 0; j < NUM_BATCH; j++) {
    t[j].batch_offset = batch_offset;
    t[j].batch_size = batch_size;
    t[j].io_batch_params = io_batch_params[j];
    t[j].batch_id = batch_id[j];
    t[j].size = size;
    t[j].j = j;
    t[j].start = start;
    pthread_create(&threads[j], NULL, &thread_batch_io, &t[j]);
}
std::cout << "Waiting " << std::endl;
std::cout << "All threads created " << (double)(clock() - start) / CLOCKS_PER_SEC << std::endl;

for (j = 0; j < NUM_BATCH; j++) {
    pthread_join(threads[j], NULL);
}
end = clock();
std::cout << "All threads joined " << (double)(clock() - start) / CLOCKS_PER_SEC << std::endl;

Here is a sample of the output on my machine.

Waiting 
Thread 2 beginning at Thread 1 beginning at 0.000377 s
0.00044 sThread 
0 beginning at Thread 4 beginning at 0.000653 s
0.000707 s
Thread 3
All threads created 0.000876
 beginning at Thread 5 beginning at 0.000933 s
Thread 7 beginning at 0.001032 s
0.000951 s
Thread 6 beginning at 0.001069 s
Thread 7 submitted at Thread 2 submitted at 0.0211220.019723 s s

Thread 4 submitted at 0.022591 s
Thread 0 submitted at 0.024836 s
Thread 6 submitted at 0.026554 s
Thread 5 submitted at 0.02846 s
Thread 1 submitted at 0.030719 s
Thread 3 submitted at 0.033191 s
All threads joined 0.033994
Total Data size = 4 MB
Time taken  = 0.033991
Read Bandwidth = 0.114920 GB/s

The single-threaded GDS batch reads with a for loop are significantly faster than the threaded code. I am not sure why this is the case: the overhead of creating all the threads is only ~1 ms. I believe they are stalling at cuFileBatchIOSubmit. I recorded the time just before each thread reaches cuFileBatchIOSubmit and immediately after. The output shows that almost all threads reach cuFileBatchIOSubmit at around ~1 ms, but it takes ~30 ms for all threads to finish submitting their batches, whereas the single-threaded code with a for loop took around ~5 ms to submit all the batches. Is there an internal lock inside cuFileBatchIOSubmit that I'm unaware of?

Is this how the gdsio tool is submitting multiple batches? Is there a way to submit more than 256 I/Os per batch? I tried changing the cufile.json file, but increasing it beyond 256 gave errors. I have updated the cu_bread.cc code on GitHub if you would like to check it out: https://github.com/susavlsh10/GDS/blob/main/gds_batch/cu_bread.cc

Thank you so much for your help.