About the performance of creating cuFFT plans

I previously used CUDA 11.1 with VS2019, and cuFFT performance basically met my requirements. A while ago I updated to CUDA 12.6 and VS2022 and found that plan creation has become significantly slower. I cannot create the plans up front when the program starts, because the image size depends on dynamic input parameters. I have three questions:

  1. Why did the plan creation performance of cuFFT change?
  2. Can I create a plan for the largest image size and reuse it for images of all sizes?
  3. What should I do in a situation like mine?

Can you provide some actual numbers, so discussion does not have to take place in a vacuum without any data? Please confirm that the “before” and “after” numbers you will provide were generated with the exact same physical system, with only CUDA and MSVS updates applied, and utilize FFTs with the exact same configuration settings.

It is unfortunate that both MSVS and CUDA were updated at the same time, which means this change is not really a controlled experiment, where only one variable is changed in any one step.

I have read this sentence multiple times now and it is not clear (to me at least) how this observation ties in with the decreased speed of plan generation mentioned earlier. Can you clarify how these two points are related?

Generally speaking, CUFFT plan generation is an activity that takes place on the host system. Changes to host hardware, changes to host software, and load on the host system can therefore impact the performance of plan generation. The first order of business is to track down which change actually caused a negative impact on CUFFT plan generation time. While it seems plausible that this was caused by an update of the CUDA software stack, there is not enough information provided here to refute or confirm this hypothesis.

I am saying the impact of CUDA is plausible because the CUFFT plan generation is driven by heuristics from what I understand, and newer versions of CUFFT may use more complex heuristics or require more input data than older versions, making plan generation slower. If you can demonstrate significant slowdown in an apples-to-apples comparison (i.e. with fixed FFT configuration parameters) between two CUDA versions, you would probably want to file a performance bug with NVIDIA.

If there is a slowdown in plan generation due to more complex heuristics being used, it may be unavoidable. In that case you would want to use a faster host system, in particular paying attention to baseline single-thread performance. My long-standing recommendation for systems designated to run CUDA-accelerated applications is to select host systems whose CPUs operate with >= 3.5 GHz baseline frequency.

Questions about CUFFT usage belong on this forum. This possibly related topic discusses that the CUFFT team is/was aware of issues and changes in CUFFT plan creation. It seems evident from the description there that CUFFT plan creation (now) may also cause module loading. This is consistent with a general trend in CUDA towards lazy loading, which has a variety of reasons supporting it, but is not without some associated issues.

Certainly plan reuse is a good option. Also, as described there, cufftDestroy can cause a situation where module reloading takes place at the next plan creation, therefore as suggested there, another option to consider is storing all your plans in a vector and not destroying them until performance is no longer a concern. Obviously that will have some limits as well, from a workaround perspective. You would not want to store a vector of trillions of plans.
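To make the "keep the plans around" idea concrete, here is a minimal sketch of a cache that creates one plan per image size and hands the same handle back on later requests. PlanCache and its creator callback are hypothetical names, not part of cuFFT; the callback stands in for a wrapper around cufftPlan2d so that the sketch compiles without the CUDA toolkit.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <utility>

// Hypothetical helper (not part of cuFFT): caches one plan handle per (nx, ny).
// Handle stands in for cufftHandle; the creator callback stands in for a
// wrapper around cufftPlan2d.
template <typename Handle>
class PlanCache {
public:
    using Creator = std::function<Handle(int nx, int ny)>;

    explicit PlanCache(Creator creator) : creator_(std::move(creator)) {}

    // Returns the cached handle for this size, creating it on first use only.
    Handle get(int nx, int ny) {
        const auto key = std::make_pair(nx, ny);
        auto it = plans_.find(key);
        if (it == plans_.end()) {
            it = plans_.emplace(key, creator_(nx, ny)).first;
        }
        return it->second;
    }

    // Number of distinct sizes currently held.
    std::size_t size() const { return plans_.size(); }

private:
    Creator creator_;
    std::map<std::pair<int, int>, Handle> plans_;
};
```

With cuFFT, Handle would be cufftHandle and the callback would call cufftPlan2d and check its result. The cache deliberately never destroys a plan; the handles would be released with cufftDestroy only at shutdown. Each live plan may hold GPU memory for its work area, so the number of cached sizes needs a sensible upper bound.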

The plan expects a certain size. You can reuse a plan on a smaller size if the data set is padded to the size the plan expects. Padding of FFT data is a common scenario (in my view) but may not fit your needs. It will require you to pad the data and it will also affect the output numerically.
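As an illustration of the padding step, the following sketch copies a smaller row-major image into the top-left corner of a zero-filled buffer of the size the plan expects. zeroPad is a hypothetical helper operating on host memory with float pixels; a real application would more likely pad directly in device memory and might use a complex element type.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: copies a rows x cols row-major image into the top-left
// corner of a zero-filled paddedRows x paddedCols buffer.
std::vector<float> zeroPad(const std::vector<float>& src, int rows, int cols,
                           int paddedRows, int paddedCols) {
    if (rows < 0 || cols < 0 || rows > paddedRows || cols > paddedCols ||
        src.size() != static_cast<std::size_t>(rows) * cols) {
        throw std::invalid_argument("zeroPad: inconsistent dimensions");
    }
    std::vector<float> dst(static_cast<std::size_t>(paddedRows) * paddedCols, 0.0f);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            // The destination row stride is paddedCols, not cols.
            dst[static_cast<std::size_t>(r) * paddedCols + c] =
                src[static_cast<std::size_t>(r) * cols + c];
        }
    }
    return dst;
}
```

The transform of the padded buffer samples the spectrum on a paddedRows x paddedCols frequency grid, so its output is not bin-for-bin identical to the FFT of the original size. Whether that is acceptable depends on what is done with the result.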

Unless there is some objection, I’ll plan to move this topic over to the other forum I referenced, shortly.

What do you want to do with the Fourier transformed data? Is it used for convolution or correlation and the FFT size is flexible or do you need a specifically sized FFT?

Do you have a small set of sizes or is any size possible?

As those are images (at least or exactly 2D), is the number of rows or columns constant?

Are there parameters to create a possibly slightly slower, but simpler plan in cuFFT?

I use VS 2019 and tested CUDA versions 11.1 and 12.6. Creating a cuFFT plan takes about 0.002 ms with CUDA 11.1, but about 1.1 ms with CUDA 12.6. The driver is 580.97 and the GPU is an RTX A4000.

This is the test code.

#include <cstdio>
#include <cstdlib>
#include <cufft.h>
#include <cuda_runtime.h>

// Check for CUDA errors
#define CHECK_CUDA_ERROR(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA error at %s:%d - %s\n", __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

// Check for cuFFT errors
#define CHECK_CUFFT_ERROR(call) \
    do { \
        cufftResult result = call; \
        if (result != CUFFT_SUCCESS) { \
            fprintf(stderr, "cuFFT error at %s:%d - code %d\n", __FILE__, __LINE__, result); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

// Create the CUDA events and record the start event
float create_and_record_event(cudaEvent_t &start, cudaEvent_t &end) {
    CHECK_CUDA_ERROR(cudaEventCreate(&start));
    CHECK_CUDA_ERROR(cudaEventCreate(&end));
    CHECK_CUDA_ERROR(cudaEventRecord(start));
    return 0.0f;
}

// Compute the elapsed time between the two events
float get_elapsed_time(cudaEvent_t start, cudaEvent_t end) {
    CHECK_CUDA_ERROR(cudaEventSynchronize(end));
    float milliseconds = 0.0f;
    CHECK_CUDA_ERROR(cudaEventElapsedTime(&milliseconds, start, end));
    CHECK_CUDA_ERROR(cudaEventDestroy(start));
    CHECK_CUDA_ERROR(cudaEventDestroy(end));
    return milliseconds;
}

int main() {
    const int nx = 1000;
    const int ny = 1000;
    const int num_iterations = 10;
    
    printf("Testing cuFFT plan creation performance for %dx%d C2C 2D FFT\n", nx, ny);
    printf("Number of iterations: %d\n\n", num_iterations);
    
    // Warm-up: create and destroy one plan
    printf("Warming up...\n");
    cufftHandle warmup_plan;
    CHECK_CUFFT_ERROR(cufftPlan2d(&warmup_plan, nx, ny, CUFFT_C2C));
    CHECK_CUFFT_ERROR(cufftDestroy(warmup_plan));
    printf("Warmup completed.\n\n");
    
    // Store the timing results
    float create_times[num_iterations];
    float destroy_times[num_iterations];
    float total_create_time = 0.0f;
    float total_destroy_time = 0.0f;
    
    // Main test loop
    for (int i = 0; i < num_iterations; i++) {
        cufftHandle plan;
        cudaEvent_t start_create, end_create;
        cudaEvent_t start_destroy, end_destroy;
        
        // Time plan creation
        create_and_record_event(start_create, end_create);
        CHECK_CUFFT_ERROR(cufftPlan2d(&plan, nx, ny, CUFFT_C2C));
        CHECK_CUDA_ERROR(cudaEventRecord(end_create));
        
        // Time plan destruction
        create_and_record_event(start_destroy, end_destroy);
        CHECK_CUFFT_ERROR(cufftDestroy(plan));
        CHECK_CUDA_ERROR(cudaEventRecord(end_destroy));
        
        // Read back the elapsed times
        create_times[i] = get_elapsed_time(start_create, end_create);
        destroy_times[i] = get_elapsed_time(start_destroy, end_destroy);
        
        total_create_time += create_times[i];
        total_destroy_time += destroy_times[i];
        
        printf("Iteration %d: Create = %.3f ms, Destroy = %.3f ms\n", 
               i + 1, create_times[i], destroy_times[i]);
    }
    
    // Print summary statistics
    printf("\n=== Performance Summary ===\n");
    printf("Average creation time: %.3f ms\n", total_create_time / num_iterations);
    printf("Average destruction time: %.3f ms\n", total_destroy_time / num_iterations);
    printf("Total creation time: %.3f ms\n", total_create_time);
    printf("Total destruction time: %.3f ms\n", total_destroy_time);
    printf("Total time: %.3f ms\n", total_create_time + total_destroy_time);
    
    // Find the minimum and maximum times
    float min_create = create_times[0];
    float max_create = create_times[0];
    float min_destroy = destroy_times[0];
    float max_destroy = destroy_times[0];
    
    for (int i = 1; i < num_iterations; i++) {
        if (create_times[i] < min_create) min_create = create_times[i];
        if (create_times[i] > max_create) max_create = create_times[i];
        if (destroy_times[i] < min_destroy) min_destroy = destroy_times[i];
        if (destroy_times[i] > max_destroy) max_destroy = destroy_times[i];
    }
    
    printf("\nCreation time - Min: %.3f ms, Max: %.3f ms\n", min_create, max_create);
    printf("Destruction time - Min: %.3f ms, Max: %.3f ms\n", min_destroy, max_destroy);
    
    return 0;
}

Using your benchmark code, I see very similar CUFFT “create” times (a bit over 1 millisecond) using CUDA 12.8.

You might want to follow-up on the point raised by Robert Crovella above: CUFFT apparently now defaults to lazy module loading, such that the first plan creation now also includes the time for that. To confirm or refute that this explains your observations, you would want to change your benchmark from a create-destroy cycle to a create-create- …. create .. destroy … configuration, and measure the time it takes for each create step separately. If lazy module loading is the issue, we would expect the first plan creation to be slow and all following plan creations to be very fast.

I do not know (1) whether there is an environment variable that allows a user to configure CUFFT module loading behavior analogous to how this can be controlled for the CUDA runtime (2) whether there is an innocuous CUFFT function one can call to trigger lazy module loading at a time that is more convenient to the programmer (for the CUDA runtime, cudaFree(0) used to do that; now it is supposedly a call to cudaSetDevice()).

If the issue is verified to be module loading time, my expectation is that this cost is always there, it’s just that the time is now accounted for at a different place in the overall software execution flow.

If you are unable to resolve the issue to your satisfaction, you may want to file a performance-regression bug with NVIDIA to have them sort it out.

thanks for your reply. @njuffa @Robert_Crovella

I have modified the timing method and retested. I suspect that timing with CUDA events may not be accurate.

vs2019+cuda11.1:0.12 ms

vs2019+cuda11.6:13.4 ms

vs2019+cuda12.6: 1.3 ms

The code is as follows. Next I will test the create..create..create sequence without destroy.

#include <iostream>
#include <chrono>
#include <vector>
#include <algorithm>
#include <iomanip>
#include <cmath>    // std::abs for double
#include <cstdlib>  // exit, EXIT_FAILURE
#include <cufft.h>
#include <cuda_runtime.h>

// Check for CUDA errors
#define CHECK_CUDA_ERROR(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            std::cerr << "CUDA error at " << __FILE__ << ":" << __LINE__ \
                      << " - " << cudaGetErrorString(err) << std::endl; \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

// Check for cuFFT errors
#define CHECK_CUFFT_ERROR(call) \
    do { \
        cufftResult result = call; \
        if (result != CUFFT_SUCCESS) { \
            std::cerr << "cuFFT error at " << __FILE__ << ":" << __LINE__ \
                      << " - code " << result << std::endl; \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

// Time unit conversion helpers
class Timer {
public:
    static double toMilliseconds(const std::chrono::duration<double>& duration) {
        return duration.count() * 1000.0;
    }
    
    static double toMicroseconds(const std::chrono::duration<double>& duration) {
        return duration.count() * 1000000.0;
    }
};

int main() {
    const int nx = 1000;
    const int ny = 1000;
    const int num_iterations = 10;
    
    std::cout << "Testing cuFFT plan creation performance for " 
              << nx << "x" << ny << " C2C 2D FFT" << std::endl;
    std::cout << "Number of iterations: " << num_iterations << std::endl << std::endl;
    
    // Warm-up: create and destroy one plan, timing both steps
    std::cout << "Warming up..." << std::endl;
    
    cufftHandle warmup_plan;
    
    // Time the warm-up creation
    auto warmup_start_create = std::chrono::high_resolution_clock::now();
    CHECK_CUFFT_ERROR(cufftPlan2d(&warmup_plan, nx, ny, CUFFT_C2C));
    auto warmup_end_create = std::chrono::high_resolution_clock::now();
    
    // Time the warm-up destruction
    auto warmup_start_destroy = std::chrono::high_resolution_clock::now();
    CHECK_CUFFT_ERROR(cufftDestroy(warmup_plan));
    auto warmup_end_destroy = std::chrono::high_resolution_clock::now();
    
    // Compute the warm-up times
    std::chrono::duration<double> warmup_create_duration = warmup_end_create - warmup_start_create;
    std::chrono::duration<double> warmup_destroy_duration = warmup_end_destroy - warmup_start_destroy;
    
    double warmup_create_time_ms = Timer::toMilliseconds(warmup_create_duration);
    double warmup_destroy_time_ms = Timer::toMilliseconds(warmup_destroy_duration);
    
    std::cout << "Warmup completed." << std::endl;
    std::cout << "Warmup - Create: " << std::fixed << std::setprecision(3) 
              << warmup_create_time_ms << " ms, Destroy: " 
              << warmup_destroy_time_ms << " ms" << std::endl << std::endl;
    
    // Store the timing results
    std::vector<double> create_times_ms;
    std::vector<double> destroy_times_ms;
    
    double total_create_time_ms = 0.0;
    double total_destroy_time_ms = 0.0;
    
    // Main test loop
    for (int i = 0; i < num_iterations; i++) {
        cufftHandle plan;
        
        // Time plan creation
        auto start_create = std::chrono::high_resolution_clock::now();
        CHECK_CUFFT_ERROR(cufftPlan2d(&plan, nx, ny, CUFFT_C2C));
        auto end_create = std::chrono::high_resolution_clock::now();
        
        std::chrono::duration<double> create_duration = end_create - start_create;
        double create_time_ms = Timer::toMilliseconds(create_duration);
        
        // Time plan destruction
        auto start_destroy = std::chrono::high_resolution_clock::now();
        CHECK_CUFFT_ERROR(cufftDestroy(plan));
        auto end_destroy = std::chrono::high_resolution_clock::now();
        
        std::chrono::duration<double> destroy_duration = end_destroy - start_destroy;
        double destroy_time_ms = Timer::toMilliseconds(destroy_duration);
        
        // Save the results
        create_times_ms.push_back(create_time_ms);
        destroy_times_ms.push_back(destroy_time_ms);
        
        total_create_time_ms += create_time_ms;
        total_destroy_time_ms += destroy_time_ms;
        
        std::cout << "Iteration " << std::setw(2) << (i + 1) << ": "
                  << "Create = " << std::setw(8) << std::fixed << std::setprecision(3) << create_time_ms << " ms, "
                  << "Destroy = " << std::setw(8) << std::fixed << std::setprecision(3) << destroy_time_ms << " ms"
                  << std::endl;
    }
    
    // Compute statistics
    double avg_create_time = total_create_time_ms / num_iterations;
    double avg_destroy_time = total_destroy_time_ms / num_iterations;
    
    auto [min_create, max_create] = std::minmax_element(create_times_ms.begin(), create_times_ms.end());
    auto [min_destroy, max_destroy] = std::minmax_element(destroy_times_ms.begin(), destroy_times_ms.end());
    
    // Print summary statistics
    std::cout << std::endl << "=== Performance Summary ===" << std::endl;
    std::cout << std::fixed << std::setprecision(3);
    std::cout << "Warmup creation time:   " << std::setw(8) << warmup_create_time_ms << " ms" << std::endl;
    std::cout << "Warmup destruction time:" << std::setw(8) << warmup_destroy_time_ms << " ms" << std::endl;
    std::cout << "Average creation time:  " << std::setw(8) << avg_create_time << " ms" << std::endl;
    std::cout << "Average destruction time:" << std::setw(8) << avg_destroy_time << " ms" << std::endl;
    std::cout << "Total creation time:    " << std::setw(8) << total_create_time_ms << " ms" << std::endl;
    std::cout << "Total destruction time: " << std::setw(8) << total_destroy_time_ms << " ms" << std::endl;
    std::cout << "Total time:             " << std::setw(8) << (total_create_time_ms + total_destroy_time_ms) << " ms" << std::endl;
    
    std::cout << std::endl;
    std::cout << "Creation time  - Min: " << std::setw(8) << *min_create << " ms, "
              << "Max: " << std::setw(8) << *max_create << " ms" << std::endl;
    std::cout << "Destruction time - Min: " << std::setw(8) << *min_destroy << " ms, "
              << "Max: " << std::setw(8) << *max_destroy << " ms" << std::endl;
    
    // Compute the difference between warm-up and average times
    double create_diff = warmup_create_time_ms - avg_create_time;
    double destroy_diff = warmup_destroy_time_ms - avg_destroy_time;
    
    std::cout << std::endl << "=== Warmup vs Average Comparison ===" << std::endl;
    std::cout << "Creation time difference (Warmup - Avg): " << std::setw(8) << create_diff << " ms" << std::endl;
    std::cout << "Destruction time difference (Warmup - Avg): " << std::setw(8) << destroy_diff << " ms" << std::endl;
    
    if (create_diff > 0) {
        std::cout << "Warmup creation was slower than average by " << std::abs(create_diff) << " ms" << std::endl;
    } else {
        std::cout << "Warmup creation was faster than average by " << std::abs(create_diff) << " ms" << std::endl;
    }
    
    if (destroy_diff > 0) {
        std::cout << "Warmup destruction was slower than average by " << std::abs(destroy_diff) << " ms" << std::endl;
    } else {
        std::cout << "Warmup destruction was faster than average by " << std::abs(destroy_diff) << " ms" << std::endl;
    }
    
    return 0;
}

I deleted the cufftDestroy call so that the test only creates cuFFT plans. The results are as follows.

vs2019+cuda11.1: 0.16ms

vs2019+cuda12.6: 1.4ms

By the way, I did not set the lazy-loading environment variable during testing, so I believe it defaults to EAGER. So I think this has nothing to do with lazy loading.

This is what I see on an L4 GPU, on Linux, with CUDA 13: a little less than 1 ms for subsequent plan creations when there is no intervening cufftDestroy:

# cat t413.cu
#include <cufft.h>
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start=0){

    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

const int nx = 1024;
const int ny = 1024;

int main(){
  cufftHandle plan;
  for (int i = 0; i < 10; i++){
    unsigned long long dt = dtime_usec(0);
    cufftResult r = cufftPlan2d(&plan, nx, ny, CUFFT_C2C);
    if (r != CUFFT_SUCCESS) std::cout << "cufft error: " << (int)r << std::endl;
    dt = dtime_usec(dt);
    std::cout << "iteration: "  << i << " microseconds: " << dt << std::endl;}
}

# nvcc -o t413 t413.cu -lcufft
# ./t413
iteration: 0 microseconds: 1845381
iteration: 1 microseconds: 719
iteration: 2 microseconds: 675
iteration: 3 microseconds: 702
iteration: 4 microseconds: 654
iteration: 5 microseconds: 656
iteration: 6 microseconds: 670
iteration: 7 microseconds: 664
iteration: 8 microseconds: 672
iteration: 9 microseconds: 662
#

On godbolt, which has a cc7.5 GPU on linux, using the same code, I observe:

CUDA version    Plan creation time after first (us)
12.8            531
12.0            523
11.8            260
11.0            1786
10.0            1660

Starting with CUDA 12.2, the default is lazy, not eager, if you don’t set any env variables.
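One way to remove doubt about which setting a given run actually saw is to have the benchmark print it. The sketch below assumes the loading mode is being selected through the CUDA_MODULE_LOADING environment variable; describeModuleLoading and currentModuleLoading are hypothetical helper names. The code only reads the variable and does not query what the driver actually did.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: turns the raw environment value into a log message.
// A null or empty value means the variable is not set for this process.
std::string describeModuleLoading(const char* value) {
    if (value == nullptr || *value == '\0') {
        return "unset (toolkit default applies)";
    }
    return std::string("set to ") + value;
}

// Reads CUDA_MODULE_LOADING from the environment of the current process.
std::string currentModuleLoading() {
    return describeModuleLoading(std::getenv("CUDA_MODULE_LOADING"));
}
```

Printing currentModuleLoading() at the top of each benchmark run makes it easier to tell the eager and default runs apart when the results are compared later.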

@Robert_Crovella

My test environment is Win10 x64 with an RTX A4000 GPU.

I find that the performance seems to have nothing to do with lazy loading. It may be related to the Windows OS.

I tested with your code. The executables with the -e suffix in the file name use EAGER loading; the others use the default lazy loading.

The results are as follows:

D:\>D:\cufftTest\VS2019-cuda11.1\cufftTest11-1.exe 
iteration: 0 milliseconds: 521.193 
iteration: 1 milliseconds: 0.207 
iteration: 2 milliseconds: 0.155 
iteration: 3 milliseconds: 0.389 
iteration: 4 milliseconds: 0.295 
iteration: 5 milliseconds: 0.158 
iteration: 6 milliseconds: 0.141 
iteration: 7 milliseconds: 0.148 
iteration: 8 milliseconds: 0.179 
iteration: 9 milliseconds: 0.38 

D:\>D:\cufftTest\VS2019-cuda11.1\cufftTest11-1-e.exe 
iteration: 0 milliseconds: 432.137 
iteration: 1 milliseconds: 0.182 
iteration: 2 milliseconds: 0.151 
iteration: 3 milliseconds: 0.137 
iteration: 4 milliseconds: 0.13 
iteration: 5 milliseconds: 0.128 
iteration: 6 milliseconds: 0.222 
iteration: 7 milliseconds: 0.148 
iteration: 8 milliseconds: 0.13 
iteration: 9 milliseconds: 0.127 

D:\>D:\cufftTest\VS2019-cuda12.6\cufftTest12-6.exe 
iteration: 0 milliseconds: 192.286 
iteration: 1 milliseconds: 1.403 
iteration: 2 milliseconds: 1.001 
iteration: 3 milliseconds: 1.078 
iteration: 4 milliseconds: 1.58 
iteration: 5 milliseconds: 1.043 
iteration: 6 milliseconds: 1.317 
iteration: 7 milliseconds: 1.02 
iteration: 8 milliseconds: 1.665 
iteration: 9 milliseconds: 1.438 

D:\>D:\cufftTest\VS2019-cuda12.6\cufftTest12-6-e.exe 
iteration: 0 milliseconds: 219.478 
iteration: 1 milliseconds: 1.201 
iteration: 2 milliseconds: 1.14 
iteration: 3 milliseconds: 1.131 
iteration: 4 milliseconds: 1.046 
iteration: 5 milliseconds: 1.101 
iteration: 6 milliseconds: 1.121 
iteration: 7 milliseconds: 1.061 
iteration: 8 milliseconds: 1.075 
iteration: 9 milliseconds: 1.16

@Robert_Crovella @njuffa

To summarize: could cuFFT plan creation (beyond the first call) have slowed down due to a change in the heuristics used to create the plan? I can report a bug to NVIDIA and ask them to help resolve it.

During the following testing process, I did not change the CPU, GPU, or graphics card driver.

Unless I misunderstand the data posted by Robert Crovella above, it indicates that CUFFT got faster for both eager and non-eager initialization between CUDA 11.1 and CUDA 12.6.

If so, that would contradict your own observations. In any event you are always free to submit a bug report to NVIDIA. The first step in the handling of bug reports is the attempt by NVIDIA’s engineers to reproduce the reported behavior in house. From what I have seen, this may take multiple iterations depending on the supporting materials included with the bug report.

I retested the performance of VS2019+cuda11.1, cuda12.6, and cuda12.8. The OS is win10x64, CPU is Intel 4215R, GPU is A4000. Below are the test data:

VS2019+cuda11.1:

iteration: 0 milliseconds: 530.928

iteration: 1 milliseconds: 0.188

iteration: 2 milliseconds: 0.18

iteration: 3 milliseconds: 0.157

iteration: 4 milliseconds: 0.171

iteration: 5 milliseconds: 0.148

iteration: 6 milliseconds: 0.169

iteration: 7 milliseconds: 0.161

iteration: 8 milliseconds: 0.144

iteration: 9 milliseconds: 0.146

VS2019+cuda12.6:

iteration: 0 milliseconds: 190.786

iteration: 1 milliseconds: 1.211

iteration: 2 milliseconds: 1.128

iteration: 3 milliseconds: 1.13

iteration: 4 milliseconds: 1.118

iteration: 5 milliseconds: 1.135

iteration: 6 milliseconds: 1.184

iteration: 7 milliseconds: 1.062

iteration: 8 milliseconds: 1.079

iteration: 9 milliseconds: 1.115

VS2019+cuda12.8:

iteration: 0 milliseconds: 185.072

iteration: 1 milliseconds: 1.143

iteration: 2 milliseconds: 1.088

iteration: 3 milliseconds: 1.069

iteration: 4 milliseconds: 1.087

iteration: 5 milliseconds: 1.075

iteration: 6 milliseconds: 1.069

iteration: 7 milliseconds: 1.127

iteration: 8 milliseconds: 1.022

iteration: 9 milliseconds: 1.024

From the test data above, it can be seen that the performance of cuda12.6 and cuda12.8 is similar, but significantly worse than that of cuda11.1. This is the test code:

#include <cufft.h>
#include <chrono>
#include <iostream>

const int nx = 1024;
const int ny = 1024;

int main() {
    cufftHandle plan;
    for (int i = 0; i < 10; i++) {
        // Time each plan creation on the host; plans are intentionally not destroyed.
        auto start_create = std::chrono::high_resolution_clock::now();
        cufftResult r = cufftPlan2d(&plan, nx, ny, CUFFT_C2C);
        if (r != CUFFT_SUCCESS) std::cout << "cufft error: " << (int)r << std::endl;

        auto end_create = std::chrono::high_resolution_clock::now();
        double create_time_ms = std::chrono::duration_cast<std::chrono::microseconds>(end_create - start_create).count() / 1000.0;

        std::cout << "iteration: " << i << " milliseconds: " << create_time_ms << std::endl;
    }
}
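For a bug report it may help to condense the per-iteration listings into one number per toolkit. The sketch below averages every sample except the first, on the assumption that iteration 0 is dominated by one-time initialization; steadyStateMean is a hypothetical helper name.

```cpp
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical helper: mean of all samples except the first one, which is
// assumed to include one-time initialization cost.
double steadyStateMean(const std::vector<double>& timesMs) {
    if (timesMs.size() < 2) {
        throw std::invalid_argument("steadyStateMean: need at least two samples");
    }
    const double sum = std::accumulate(timesMs.begin() + 1, timesMs.end(), 0.0);
    return sum / static_cast<double>(timesMs.size() - 1);
}
```

Applied to the VS2019 listings above, this gives roughly 0.16 ms per plan creation for CUDA 11.1 and roughly 1.13 ms for CUDA 12.6, a factor of about 7.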

It is not going to be possible to resolve the issue in this forum. If you would like to resolve it, I would suggest filing a bug report with NVIDIA.