Second cuCtxCreate() returns CUDA_ERROR_LAUNCH_FAILED with A2 GPU

Has anyone seen cuCtxCreate() succeed on the first call, then fail with CUDA_ERROR_LAUNCH_FAILED on the second?

The calling code is as simple as this (full error checking elided here for readability):

CUdevice dev;
CUcontext ctx;

cuInit(0);
cuDeviceGet(&dev, 0);
cuCtxCreate(&ctx, 0, dev);
// create NVENC context, don't actually encode anything
cuCtxDestroy(ctx);
cuCtxCreate(&ctx, 0, dev); // returns 0x2cf / CUDA_ERROR_LAUNCH_FAILED
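
For anyone who wants to try the same sequence with the error checking left in, here is a rough sketch. It is not the code from the failing process: it skips the NVENC step entirely, and it loads the driver with dlopen() so that it builds without cuda.h. The names ctx_create_cycle, log_step and the cu_* typedefs are made up for this sketch; cuCtxCreate_v2 and cuCtxDestroy_v2 are the symbols that cuda.h maps cuCtxCreate and cuCtxDestroy to in these CUDA 12.x releases.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Stand-ins for CUresult, CUdevice and CUcontext so this builds without cuda.h. */
typedef int   cu_result;
typedef int   cu_device;
typedef void *cu_context;

typedef cu_result (*cu_init_fn)(unsigned int);
typedef cu_result (*cu_device_get_fn)(cu_device *, int);
typedef cu_result (*cu_ctx_create_fn)(cu_context *, unsigned int, cu_device);
typedef cu_result (*cu_ctx_destroy_fn)(cu_context);

/* Log a failing step in hex and decimal, then pass the code through. */
static cu_result log_step(const char *step, cu_result rc)
{
    if (rc != 0)
        fprintf(stderr, "%s failed: 0x%x (%d)\n", step, (unsigned)rc, rc);
    return rc;
}

/* Returns 0 on success, -1 if the driver library or a symbol is missing,
 * otherwise the result code of the first failing driver call. */
int ctx_create_cycle(void)
{
    /* The library is intentionally left loaded for the life of the process. */
    void *lib = dlopen("libcuda.so.1", RTLD_NOW);
    if (lib == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return -1;
    }

    cu_init_fn        init        = (cu_init_fn)dlsym(lib, "cuInit");
    cu_device_get_fn  device_get  = (cu_device_get_fn)dlsym(lib, "cuDeviceGet");
    cu_ctx_create_fn  ctx_create  = (cu_ctx_create_fn)dlsym(lib, "cuCtxCreate_v2");
    cu_ctx_destroy_fn ctx_destroy = (cu_ctx_destroy_fn)dlsym(lib, "cuCtxDestroy_v2");
    if (!init || !device_get || !ctx_create || !ctx_destroy)
        return -1;

    cu_device  dev = 0;
    cu_context ctx = NULL;
    cu_result  rc;

    if ((rc = log_step("cuInit", init(0))) != 0) return rc;
    if ((rc = log_step("cuDeviceGet", device_get(&dev, 0))) != 0) return rc;
    if ((rc = log_step("cuCtxCreate #1", ctx_create(&ctx, 0, dev))) != 0) return rc;
    /* The NVENC session would be opened and closed here; omitted in this sketch. */
    if ((rc = log_step("cuCtxDestroy #1", ctx_destroy(ctx))) != 0) return rc;
    if ((rc = log_step("cuCtxCreate #2", ctx_create(&ctx, 0, dev))) != 0) return rc;
    return log_step("cuCtxDestroy #2", ctx_destroy(ctx));
}
```

Call ctx_create_cycle() from main() and link with -ldl on older glibc. A return value of 0x2cf (719) from the "cuCtxCreate #2" step would match what I am seeing.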

Any ideas of things to try?

  • Environment

    • Ubuntu 18.04.6 LTS on AMD EPYC 74F3
    • NVIDIA A2
    • Driver: 535.54.03
    • CUDA: 12.2
  • Only happens in particular process, not minimal test sample
    (So yes, the obvious next step is to bisect, but access to the broken system isn’t easy, so I’m checking whether there is a known issue I missed.)

  • Tried different driver versions with the same result

    • CUDA 12.1 / driver 530.30.02
    • CUDA 12.3 / driver 545.23.08
  • Works on every other GPU type I’ve used (mainly Quadro/GRID).

(1) What happens when you run the application under the control of Compute Sanitizer?
(2) Re “Only happens in particular process, not minimal test sample”. The first thing I would check is whether any memory corruption is taking place. Does valgrind have any complaints?

Good ideas to try if/when I get access to the system. Thanks.

I got access to the system and bisected between the good and bad processes, ending up at setuid() as the trigger.

Scenario:

  1. System service runs as root, triggers process for user.
  2. This process drops privilege via setuid() to become that user.
  3. After setuid() is called, cuCtxCreate() always fails with CUDA_ERROR_LAUNCH_FAILED, even if cuInit() itself is only called after setuid().
  4. Launching the same process manually as that user, so that setuid() is never called, succeeds, so the failure isn’t about that user’s permissions.
  5. Only seen on the system with the A2 GPU, never on K/M/P/RTX series Quadros, GRID, or Tesla cards.
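
The scenario above can be approximated with a small harness like the following. This is only a sketch under my own assumptions, not the service's actual code: it forks a child, drops privileges there with setgid()/setuid(), runs a check function in that same process (no exec), and passes the integer result back over a pipe. The names drop_privileges, run_after_setuid and probe_identity are hypothetical, and clearing the supplementary groups is an extra step I added that the real service may or may not perform.

```c
#define _DEFAULT_SOURCE
#include <grp.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Permanently switch to uid/gid. Returns 0 on success, -1 on failure. */
int drop_privileges(uid_t uid, gid_t gid)
{
    /* Assumption: clear supplementary groups when root drops to another user. */
    if (geteuid() == 0 && uid != 0 && setgroups(0, NULL) != 0) {
        perror("setgroups");
        return -1;
    }
    /* Group first: after setuid() we may no longer be allowed to change it. */
    if (setgid(gid) != 0) { perror("setgid"); return -1; }
    if (setuid(uid) != 0) { perror("setuid"); return -1; }

    if (getuid() != uid || geteuid() != uid)
        return -1;
    /* After dropping to a non-root user, regaining root must fail. */
    if (uid != 0 && setuid(0) == 0)
        return -1;
    return 0;
}

/* Fork, drop privileges in the child, run probe() there, return its result.
 * Returns -1 if the harness itself fails or the privilege drop fails. */
int run_after_setuid(uid_t uid, gid_t gid, int (*probe)(void))
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {                      /* child: drop, probe, report */
        int rc = -1;
        close(fds[0]);
        if (drop_privileges(uid, gid) == 0)
            rc = probe();
        ssize_t n = write(fds[1], &rc, sizeof rc);
        _exit(n == (ssize_t)sizeof rc ? 0 : 1);
    }

    int rc = -1;                         /* parent: collect the result */
    close(fds[1]);
    ssize_t n = read(fds[0], &rc, sizeof rc);
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return n == (ssize_t)sizeof rc ? rc : -1;
}

/* Placeholder probe: reports the identity the check runs under. */
int probe_identity(void)
{
    fprintf(stderr, "uid=%d euid=%d\n", (int)getuid(), (int)geteuid());
    return getuid() == geteuid() ? 0 : 1;
}
```

To compare the two cases, run it as root with the target user's uid/gid and pass a probe that calls cuInit() and cuCtxCreate() instead of probe_identity, then run the same probe directly as that user without the harness.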

I will file a bug report.