Has anyone seen the following issue?
## Summary
`cufftEstimate1d(nx=512, type=CUFFT_R2C, batch=1, &workSize)` — documented as a pure
host-side workspace-size lookup that does **not** touch the device — returns
`CUFFT_INTERNAL_ERROR (5)` on an H100 when `nvidia-smi conf-compute -f` reports
`CC status: ON`. The same call returns `CUFFT_SUCCESS` on the same H100, same driver,
same library binary when CC is OFF, and on an A100 (CC unavailable).
`cufftPlan1d`, `cufftMakePlanMany64`, and PyTorch’s `torch.fft.rfft` all fail with
the same status on the same machine in CC mode. The failure is independent of FFT
size (256, 512, 1024, 4096 all fail identically) and independent of cuFFT version
(both `12.0.0.61` from CUDA 13.0 and `11.3.3.83` from CUDA 12.8 fail).
## Environment
| Item | Value |
|---|---|
| GPU | NVIDIA H100 (sm_90, 100 GB HBM3) |
| Driver | 590.48.01 (Open Kernel Module, MIG off) |
| CUDA driver API | 13.0.0 (`Driver: 13000` per CUPTI) |
| CUDA runtime | 13.0.0 (libcudart.so.13.0.96) |
| cuFFT | **12.0.0.61** (libcufft.so.12.0.0.61) |
| cuBLAS / cuBLASLt | 13.1.0.3 — works correctly |
| NVRTC | 13.0 — works correctly |
| Confidential Computing | **ON** (verified by `nvidia-smi conf-compute -f`) |
| Host environment | Confidential VM (CVM) over a CC-capable hypervisor |
| Reproducer language | C (also reproduced via PyTorch `torch.fft.rfft` and via Python ctypes) |
As a control, the same `libcufft.so.12.0.0.61` binary works on:
- H100 in MAST/HPC environment with CC=OFF
- A100 (no CC) on a developer host
## Reproducer (minimal C)
Compile with the cuFFT shipped in CUDA 13.0:
```bash
nvcc cufft_cc_repro.c -lcufft -o cufft_cc_repro
./cufft_cc_repro
```
```c
#include <cufft.h>
#include <stdio.h>

static const char* status_name(cufftResult r) {
    switch (r) {
        case CUFFT_SUCCESS:        return "CUFFT_SUCCESS";
        case CUFFT_INVALID_PLAN:   return "CUFFT_INVALID_PLAN";
        case CUFFT_ALLOC_FAILED:   return "CUFFT_ALLOC_FAILED";
        case CUFFT_INVALID_TYPE:   return "CUFFT_INVALID_TYPE";
        case CUFFT_INVALID_VALUE:  return "CUFFT_INVALID_VALUE";
        case CUFFT_INTERNAL_ERROR: return "CUFFT_INTERNAL_ERROR";
        case CUFFT_EXEC_FAILED:    return "CUFFT_EXEC_FAILED";
        case CUFFT_SETUP_FAILED:   return "CUFFT_SETUP_FAILED";
        case CUFFT_INVALID_SIZE:   return "CUFFT_INVALID_SIZE";
        default:                   return "UNKNOWN";
    }
}

int main(void) {
    int v = 0;
    cufftGetVersion(&v);
    printf("cuFFT version: %d\n", v);

    size_t ws = 0;
    cufftResult rc;

    /* (1) Pure host-side workspace estimation. Should always succeed for N=512 R2C. */
    rc = cufftEstimate1d(512, CUFFT_R2C, 1, &ws);
    printf("cufftEstimate1d(512, R2C, 1) = %d (%s), ws=%zu\n",
           rc, status_name(rc), ws);

    /* (2) Simplest end-to-end plan creation. */
    cufftHandle plan = 0;
    rc = cufftPlan1d(&plan, 512, CUFFT_R2C, 1);
    printf("cufftPlan1d(N=512, R2C, batch=1) = %d (%s)\n", rc, status_name(rc));
    if (rc == CUFFT_SUCCESS) cufftDestroy(plan);

    /* (3) Same shape via the modern Many64 API. */
    rc = cufftCreate(&plan);
    printf("cufftCreate = %d (%s)\n", rc, status_name(rc));
    rc = cufftSetAutoAllocation(plan, 0);
    printf("cufftSetAutoAllocation(plan, 0) = %d (%s)\n", rc, status_name(rc));
    long long n[1] = {512};
    rc = cufftMakePlanMany64(plan, 1, n,
                             NULL, 1, 0,
                             NULL, 1, 0,
                             CUFFT_R2C, 1, &ws);
    printf("cufftMakePlanMany64(N=512, R2C) = %d (%s), ws=%zu\n",
           rc, status_name(rc), ws);
    cufftDestroy(plan);

    return rc == CUFFT_SUCCESS ? 0 : 1;
}
```
### Expected output (CC=OFF, any H100/A100)
```
cuFFT version: 12000
cufftEstimate1d(512, R2C, 1) = 0 (CUFFT_SUCCESS), ws=0
cufftPlan1d(N=512, R2C, batch=1) = 0 (CUFFT_SUCCESS)
cufftCreate = 0 (CUFFT_SUCCESS)
cufftSetAutoAllocation(plan, 0) = 0 (CUFFT_SUCCESS)
cufftMakePlanMany64(N=512, R2C) = 0 (CUFFT_SUCCESS), ws=0
```
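To cross-check the `cuFFT version` line against the library file name, here is a small sketch that splits the integer from `cufftGetVersion` into its parts. It assumes the usual `CUFFT_VERSION` layout (`major * 1000 + minor * 100 + patch`); the type `cufft_version` and the function `decode_cufft_version` are hypothetical names used only for illustration. The build suffix (the `.61` in `12.0.0.61`) is not part of that integer.

```c
#include <stdio.h>

/* Hypothetical helper type: the three fields packed into the version integer. */
typedef struct {
    int major;
    int minor;
    int patch;
} cufft_version;

/* Assumption: value = major * 1000 + minor * 100 + patch. */
cufft_version decode_cufft_version(int value) {
    cufft_version out;
    out.major = value / 1000;
    out.minor = (value % 1000) / 100;
    out.patch = value % 100;
    return out;
}

/* Prints the decoded version in dotted form, e.g. for a log line. */
void print_cufft_version(int value) {
    cufft_version ver = decode_cufft_version(value);
    printf("cuFFT %d.%d.%d\n", ver.major, ver.minor, ver.patch);
}
```

Under that layout, `12000` decodes to 12.0.0, and the CUDA 12.8 library (`11.3.3.83`) would be expected to report `11303`.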
### Actual output (CC=ON, H100 in CVM)
```
cuFFT version: 12000
cufftEstimate1d(512, R2C, 1) = 5 (CUFFT_INTERNAL_ERROR), ws=2048
cufftPlan1d(N=512, R2C, batch=1) = 5 (CUFFT_INTERNAL_ERROR)
cufftCreate = 0 (CUFFT_SUCCESS)
cufftSetAutoAllocation(plan, 0) = 0 (CUFFT_SUCCESS)
cufftMakePlanMany64(N=512, R2C) = 5 (CUFFT_INTERNAL_ERROR), ws=2048
```
`cufftCreate` and `cufftSetAutoAllocation` succeed; the failure appears to be in any API
that needs to consult per-device FFT capability tables. We suspect that cuFFT queries
device attributes (e.g., shared-memory size, max threads per block) through a path that
returns no valid data when CC mode restricts device introspection, and that cuFFT then
gives up and returns the generic `CUFFT_INTERNAL_ERROR`.
## Sanity-check matrix already verified
| Component on the failing CVM | Status |
|---|---|
| `cudaMalloc` / `cudaMemcpy` | ✅ works |
| `cudaGetLastError` after each cuFFT failure | returns 0 (no driver error) |
| `cublasCreate`, `cublasSgemm` (256×256 fp32) | ✅ works |
| `nvrtcCompileProgram` (trivial kernel) | ✅ works (NVRTC v13.0) |
| `dmesg` during failure | no Xid, no NVRM messages |
| GPU memory free | 78 GB (not OOM) |
| `CUFFT_LOG_LEVEL=5` env var | no log output emitted (cuFFT logger appears disabled in this build) |
| `CUDA_MODULE_LOADING=EAGER` | no effect |
| FFT sizes attempted | 256, 512, 1024, 4096 all fail identically |
| Same binary on same H100 with CC=OFF | ✅ all sizes succeed |
| Same binary on A100 (no CC) | ✅ all sizes succeed |