cuBLAS FP8: cublasLtMatmulAlgoGetHeuristic returns 0 results - nvcc issue

After hours of debugging I'm kind of lost. I have the CUDA sample for FP8 running and it works, but in my real code the heuristic routine returns 0 results on the same input that the sample code handles fine.

I tried passing it demo data that exactly matches the sample (see below) and I still get the failure, as if I were on an old card.
I have the latest NVIDIA drivers, an RTX 4090, and CUDA/cuBLAS 12.2 installed.
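
To rule out the "old card" theory: FP8 matmul in cuBLASLt needs compute capability 8.9 or newer (Ada/Hopper), and the 4090 reports sm_89. A minimal sketch for checking at runtime:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);  // device 0; adjust for multi-GPU systems
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    // cuBLASLt FP8 matmul requires sm_89 (Ada) or newer
    if (prop.major < 8 || (prop.major == 8 && prop.minor < 9))
        printf("FP8 matmul not supported on this device\n");
    return 0;
}
```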

```cpp
#include <cublasLt.h>

cublasLtHandle_t ltHandle;
cublasLtMatmulDesc_t operationDesc = NULL;
cublasLtMatrixLayout_t Adesc = NULL, Bdesc = NULL, Cdesc = NULL, Ddesc = NULL;

int m, n, k, lda, ldb, ldc;
m = n = k = lda = ldb = ldc = 64;
cublasOperation_t transa = CUBLAS_OP_T;  // FP8 matmul requires TN: A transposed, B non-transposed
cublasOperation_t transb = CUBLAS_OP_N;

CUBLAS_CHECK(cublasLtCreate(&ltHandle));  // handle must be created before any Lt call
CUBLAS_CHECK(cublasLtMatmulDescCreate(&operationDesc, CUBLAS_COMPUTE_32F, CUDA_R_32F));
CUBLAS_CHECK(cublasLtMatmulDescSetAttribute(operationDesc, CUBLASLT_MATMUL_DESC_TRANSA, &transa, sizeof(transa)));
CUBLAS_CHECK(cublasLtMatmulDescSetAttribute(operationDesc, CUBLASLT_MATMUL_DESC_TRANSB, &transb, sizeof(transb)));
CUBLAS_CHECK(cublasLtMatrixLayoutCreate(&Adesc, CUDA_R_8F_E4M3, transa == CUBLAS_OP_N ? m : k, transa == CUBLAS_OP_N ? k : m, lda));
CUBLAS_CHECK(cublasLtMatrixLayoutCreate(&Bdesc, CUDA_R_8F_E4M3, transb == CUBLAS_OP_N ? k : n, transb == CUBLAS_OP_N ? n : k, ldb));
CUBLAS_CHECK(cublasLtMatrixLayoutCreate(&Cdesc, CUDA_R_16BF, m, n, ldc));
CUBLAS_CHECK(cublasLtMatrixLayoutCreate(&Ddesc, CUDA_R_8F_E4M3, m, n, ldc));

cublasLtMatmulPreference_t preference = NULL;
cublasLtMatmulHeuristicResult_t heuristicResult = {};
int returnedResults = 0;

CUBLAS_CHECK(cublasLtMatmulPreferenceCreate(&preference));

CUBLAS_CHECK(cublasLtMatmulAlgoGetHeuristic(ltHandle, operationDesc, Adesc, Bdesc, Cdesc, Ddesc,
                                            preference, 1, &heuristicResult, &returnedResults));
```
**>>> returnedResults == 0 <<<**
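
Since I removed the workspace allocation, one difference from the sample is worth flagging: the sample also sets a maximum workspace size on the preference object, and with the default of 0 bytes the heuristic can only return algorithms that need no scratch space, which shrinks the candidate list. Restoring it would look roughly like this (the 32 MiB value is just an example, not from my code):

```cpp
// Allow the heuristic to consider algorithms that need workspace memory.
size_t workspaceSize = 32 * 1024 * 1024;  // example value; the samples use a similar size
CUBLAS_CHECK(cublasLtMatmulPreferenceSetAttribute(
    preference, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
    &workspaceSize, sizeof(workspaceSize)));
// ...then call cublasLtMatmulAlgoGetHeuristic as above.
```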

cublasGemmEx works fine, in case that matters.

My code matches the sample_cublasLt_LtFp8Matmul.cu sample; I only removed the scale factors and the workspace allocation from the sample code. Otherwise it's identical.
I also looked at how nvcc compiles the two projects, and the command lines are similar (no arch specifications, etc.).
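
To see what the heuristic is actually rejecting when two builds behave differently, cuBLASLt's built-in logging can help; it can be enabled from code or via the CUBLASLT_LOG_LEVEL environment variable. A sketch, with level 5 being the most verbose API trace as far as I can tell:

```cpp
#include <cublasLt.h>

// Enable verbose cuBLASLt logging before the heuristic call;
// equivalent to setting CUBLASLT_LOG_LEVEL=5 in the environment.
cublasLtLoggerSetLevel(5);                   // 5 = API trace (most verbose)
cublasLtLoggerOpenFile("cublaslt_log.txt");  // or cublasLtLoggerSetFile(stdout)
```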

Compilation of the sample:

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\bin\nvcc.exe"  --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\
  bin\HostX64\x64" -x cu   -I"C:\program files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\include" -I"C:\program files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\lib\x64" -IC:\temp\examples\CUDALibrarySamples\cuBLASLt\LtFp8Matmul\..\Common -I"C
  :\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\include"     --keep-dir x64\Debug  -maxrregcount=0   --machine 64 --compile -cudart static -Xcompiler="/EHsc -Z
  i -Ob0" -g  -D_WINDOWS -D"CMAKE_INTDIR=\"Debug\"" -D_MBCS -DWIN32 -D_WINDOWS -D"CMAKE_INTDIR=\"Debug\"" -Xcompiler "/EHsc /W3 /nologo /Od /FS /Zi /RTC1 /MDd /GR"

Compilation of my cuda.cu:
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\bin\nvcc.exe" --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\HostX64\x64" -x cu -I"C:\program files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\include" -I"C:\program files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\lib\x64" -IC:\temp\. -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\include" --keep-dir x64\Debug -maxrregcount=0 --machine 64 --compile -cudart static -Xcompiler="/EHsc -Zi -Ob0" -g

Solved …
The problem turned out to be that all my "PowerShell" windows were stale (I had installed CUDA 12.2 that same day); some strange CUDA/Windows environment behavior.
Compiling with CMake against 12.2 worked fine, and the paths were fine, all pointing to the new CUDA.
However, launching the application from those windows had a high chance of running it in a "low CUDA compute mode".
The behavior was unpredictable: I had PowerShell windows where it worked for a while and then suddenly stopped working again.

The solution was to close every single command shell (window) before launching a new one; the first freshly started shell then worked right away.
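
In hindsight, printing the versions the process actually loads at startup would have exposed the stale shell environment right away. Something like this minimal sketch:

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <cublasLt.h>

int main() {
    int runtimeVer = 0, driverVer = 0;
    cudaRuntimeGetVersion(&runtimeVer);  // CUDA runtime the process actually loaded
    cudaDriverGetVersion(&driverVer);    // latest CUDA version the installed driver supports
    printf("runtime %d, driver %d, cublasLt %zu\n",
           runtimeVer, driverVer, cublasLtGetVersion());
    return 0;
}
```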