Rocky Linux 9, Vulkan troubles

I’ve migrated one of my workstations to the VFX Reference Standard, which runs on Rocky 9. (RHEL-ish for future searchers, which I always thought was a better name :)

This workstation has two RTX A5000s.

Group 0:
	Properties:
		physicalDevices: count = 2
			NVIDIA RTX A5000 (ID: 0)
			NVIDIA RTX A5000 (ID: 1)
		subsetAllocation = 0

and is running

| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |

Some of my code seems to have a lot of trouble with Vulkan memory allocation.

For example, UE complains a lot at startup, with warnings like:

[2024.02.08-13.52.25:141][  0]LogVulkanRHI: Warning: Failed to allocate Device Memory, Requested=131072.00Kb MemTypeIndex=1

The vulkaninfo command core dumps part way through its report, right after the start of the Device Groups section.

I’ve tried various driver install strategies, without success. Since even vulkaninfo crashes, it seems my troubles go beyond the user code, and I can’t find much out there about what next steps to take when the basic Vulkan tests don’t work.

Thoughts?

I’d check first which ICD files are installed in
/usr/share/vulkan/icd.d/
/etc/vulkan/icd.d/
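
If it helps, here is a rough sketch of that check in Python. It only looks at the two directories above (the loader may search other places too), and `list_icds` is just a name I made up for illustration. It prints each manifest it finds along with the library and API version it points at.

```python
import json
from pathlib import Path

# The two directories named above; the Vulkan loader may search others as well.
ICD_DIRS = ["/usr/share/vulkan/icd.d", "/etc/vulkan/icd.d"]


def list_icds(dirs=ICD_DIRS):
    """Return (manifest path, library_path, api_version) for each ICD JSON found."""
    found = []
    for d in dirs:
        p = Path(d)
        if not p.is_dir():
            continue  # directory missing: nothing installed there
        for manifest in sorted(p.glob("*.json")):
            try:
                data = json.loads(manifest.read_text())
            except (OSError, json.JSONDecodeError) as exc:
                found.append((str(manifest), f"unreadable: {exc}", None))
                continue
            # A well-formed manifest has an "ICD" object; tolerate anything else.
            icd = data.get("ICD", {}) if isinstance(data, dict) else {}
            found.append((str(manifest), icd.get("library_path"), icd.get("api_version")))
    return found


if __name__ == "__main__":
    for path, lib, api in list_icds():
        print(f"{path}: library_path={lib} api_version={api}")
```

More than one manifest pointing at different drivers, or a manifest whose library no longer exists, would be the first things I'd look for in the output.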

Good plan - didn’t think of that, because I’m honestly a little new to how Vulkan / RHEL / Rocky play together. Too many details, too little time.

As of this minute, there are none in the /etc/... pathway, and a single one in the /usr/share pathway, containing:

{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "libGLX_nvidia.so.0",
        "api_version" : "1.3.260"
    }
}
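
Since the manifest only names a library, one more thing that could be checked is whether the dynamic linker can actually find and load it. A minimal sketch, assuming Python with ctypes; `can_load` is a hypothetical helper name, and a successful load does not prove Vulkan itself works, only that the file resolves.

```python
import ctypes


def can_load(library_path):
    """Try to load the shared library; return (True, None) or (False, error text)."""
    try:
        ctypes.CDLL(library_path)
    except OSError as exc:
        return False, str(exc)
    return True, None


if __name__ == "__main__":
    # library_path value taken from the ICD manifest above
    ok, err = can_load("libGLX_nvidia.so.0")
    print("loads" if ok else f"failed: {err}")
```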

One interesting side note - there are 4,320 different suggested ways to manage drivers in Rocky. I’ve tried a few different routes, and seem to have found one that doesn’t core dump, but I have to go back to my notes to see which it was.