cufftPlanMany CUFFT_INTERNAL_ERROR on previously working unit test

I have a unit test that has been working for years. Now, I take the code to a new machine and a new version of CUDA, and it suddenly fails. I read this thread, and the symptoms are similar, but I can’t believe I’m stressing the memory.

Image is based on nvidia/cuda:12.3.2-devel-ubi8
Driver version is 550.54.15
GPU is A100-PCIE-40GB
Compiler is GCC 12.2.1, compiling for -std=c++20

Simply,

    cufftHandle plan;
    std::vector<int> sizes = { 512 };

    fftCheck(cufftPlanMany(&plan,
                           sizes.size(),     // rank (1)
                           sizes.data(),     // n: one dimension of size 512
                           (int*)nullptr,    // inembed: default (contiguous) input layout
                           1,                // istride
                           512,              // idist
                           (int*)nullptr,    // onembed: default (contiguous) output layout
                           1,                // ostride
                           512,              // odist
                           CUFFT_C2C,        // single-precision complex-to-complex
                           8));              // batch of 8 transforms

ldd reports I’m linked against

        libcufft.so.11 => /usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.11 (0x00007f2a5c60c000)
        libnvidia-ml.so.1 => /lib64/libnvidia-ml.so.1 (0x00007f2a5b462000)

Unfortunately CUFFT_INTERNAL_ERROR is not very informative. What should I be looking at?

CUFFT_INTERNAL_ERROR may sometimes be related to memory size or availability. I don’t think that is a universal explanation, however.

My suggestion would be to provide a complete test case that others could use to observe the issue. To be clear, that means code I could copy, paste, compile, and run to observe the issue, without having to add or change anything. When I build a complete case around what you have shown, it does not fail for me in two different environments, one based on CUDA 12.2 and one based on CUDA 12.4.

For example, if the memory is largely used up by the time you run this particular test, it’s possible that memory availability could be an issue.
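If you want to rule that out, one quick check is to query free device memory immediately before the plan call. Here is a minimal sketch of what I mean (the helper name is mine, not something from your code):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Print free/total device memory; call this right before cufftPlanMany.
    void reportDeviceMemory()
    {
        size_t freeBytes = 0, totalBytes = 0;
        if (cudaMemGetInfo(&freeBytes, &totalBytes) == cudaSuccess) {
            std::printf("device memory: %zu MiB free of %zu MiB total\n",
                        freeBytes >> 20, totalBytes >> 20);
        }
    }

A 1D 512-point C2C plan with a batch of 8 should need only a trivial amount of workspace, so if this reports plenty of free memory, memory pressure is probably not your issue.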

I’m a little surprised not to see any indication of a dependency on cudart in your ldd output, but there are plausible explanations: it may be that you are making no use of the runtime API in this code, or perhaps you are linking cudart statically (the nvcc default).

For further diagnosis, I sometimes find it useful to run such codes with compute-sanitizer. Occasionally the output from that tool gives me additional clues as to whatever the problem is in the library in question.

You are, of course, correct. It was not an MVRE. I unraveled my test framework into this example. The culprit is cudaDeviceReset(). In previous environments, this worked (whether it is advisable or not, I defer to your judgement).

    #include <vector>

    #include <gtest/gtest.h>
    #include <cuda_runtime.h>
    #include <cufft.h>

    TEST(mvre, fft)
    {
        int  count(-1);
        auto res = cudaGetDeviceCount(&count);

        ASSERT_EQ(res, cudaSuccess);
        ASSERT_GT(count, 0);

        ASSERT_EQ(cudaSetDevice(0), cudaSuccess);
        ASSERT_EQ(cudaDeviceReset(), cudaSuccess);

        cufftHandle plan;
        std::vector<int> sizes = { 512 };
        EXPECT_EQ(cufftPlanMany(&plan,
                                sizes.size(),
                                sizes.data(),
                                (int*)nullptr,
                                1,
                                512,
                                (int*)nullptr,
                                1,
                                512,
                                CUFFT_C2C,
                                8), CUFFT_SUCCESS);
    }

If the device reset is removed, the test passes. If it is present, the test fails.

According to my testing, if you add another cudaSetDevice(0); after the cudaDeviceReset(); call, the problem goes away.

I’m not suggesting that should be necessary, or that use of cudaDeviceReset() like this should be a problem, but evidently it is in this case.
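In the context of your test, the workaround amounts to something like this (a sketch only):

    ASSERT_EQ(cudaSetDevice(0), cudaSuccess);
    ASSERT_EQ(cudaDeviceReset(), cudaSuccess);
    ASSERT_EQ(cudaSetDevice(0), cudaSuccess);  // added: re-establish a valid context after the reset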

You could file a bug if this is a matter of concern for you.

As a general rule, I advise folks that there is no need ever to use cudaDeviceReset() and I personally never use it or recommend its use.

It is possible to gain some additional information with some experimentation with cufft and compute-sanitizer.

If you omit all CUDA runtime API calls in the test case (e.g. cudaSetDevice(), cudaDeviceReset(), etc.), then the cufft call still works (returns a zero status), but compute-sanitizer reveals something curious: a call to cuCtxPopCurrent fails with a CUDA_ERROR_INVALID_CONTEXT result. This sort of makes sense. With that preamble, no CUDA context has been established, and apparently CUFFT needs one for its work. But in this case the error report from CUFFT is zero, so CUFFT apparently knows how to handle that case (presumably it causes the CUDA runtime to create a new default context for its use).

But when we put the cudaDeviceReset() in there, after runtime context creation has been triggered by the cudaSetDevice() call, compute-sanitizer indicates CUDA_ERROR_CONTEXT_IS_DESTROYED on a call to cuMemGetInfo. This, apparently, CUFFT does not know how to handle (or assumes is an indicator of a serious problem), and so it returns error code 5 (CUFFT_INTERNAL_ERROR) from the plan call.

Adding another cudaSetDevice() call after the cudaDeviceReset() seems to cause the runtime to create another new valid default context, to replace the one that was destroyed. Now, when cufft checks context availability, it is happy.
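For reference, here is a minimal standalone sketch of the failing sequence outside the test framework (the plan parameters are the ones from your test; the printing and exit status are mine). On a setup that shows the problem, uncommenting the second cudaSetDevice(0) should make the plan call succeed:

    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>
    #include <cufft.h>

    int main()
    {
        cudaSetDevice(0);     // triggers runtime context creation for device 0
        cudaDeviceReset();    // destroys that context again
        // cudaSetDevice(0);  // uncomment to restore a valid context before the plan call

        cufftHandle plan;
        std::vector<int> sizes = { 512 };
        cufftResult status = cufftPlanMany(&plan, (int)sizes.size(), sizes.data(),
                                           nullptr, 1, 512,
                                           nullptr, 1, 512,
                                           CUFFT_C2C, 8);
        std::printf("cufftPlanMany returned %d\n", (int)status);  // 5 == CUFFT_INTERNAL_ERROR when it fails
        return (status == CUFFT_SUCCESS) ? 0 : 1;
    }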

Thank you. I think the intention was to ensure the unit test was uncorrupted by any previous (potentially failed) test, but I moved the cudaDeviceReset() to the TearDown() method after reading What is the role of cudaDeviceReset() in Cuda - Stack Overflow.
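For anyone who lands here later, the fixture now looks roughly like this (names and structure are illustrative, not the exact production code):

    #include <vector>

    #include <gtest/gtest.h>
    #include <cuda_runtime.h>
    #include <cufft.h>

    class FftTest : public ::testing::Test
    {
    protected:
        void SetUp() override
        {
            ASSERT_EQ(cudaSetDevice(0), cudaSuccess);
        }

        void TearDown() override
        {
            // Reset after the test body has run, so the next test starts from a clean
            // device and the reset cannot invalidate the context that cuFFT relies on.
            cudaDeviceReset();
        }
    };

    TEST_F(FftTest, fft)
    {
        cufftHandle plan;
        std::vector<int> sizes = { 512 };
        EXPECT_EQ(cufftPlanMany(&plan, (int)sizes.size(), sizes.data(),
                                nullptr, 1, 512,
                                nullptr, 1, 512,
                                CUFFT_C2C, 8), CUFFT_SUCCESS);
        cufftDestroy(plan);
    }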
