For example, assume a CUDA-accelerated open-world game engine that streams the necessary read-only data from RAM into VRAM. Would it be faster to simply use unified memory and let the CUDA driver decide when to load pages into VRAM? When a player visits the same point of the 2D world frequently, does the driver re-load the data from RAM each time, or does it keep it in VRAM like an LRU/LFU cache? Since the data is read-only, it is never changed on the host, but the total data set can't fit into VRAM, so only the currently required data should be resident in VRAM.
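Something like the following is what I mean by "let the driver decide" (just a rough sketch; the 4 km x 4 km world size is a made-up example):

```cpp
// Rough sketch of the unified-memory setup I have in mind; the world size is a
// made-up example. The terrain lives in one managed allocation that is larger
// than VRAM and is marked read-mostly, so the driver can keep read-only copies
// of the hot pages resident on the GPU instead of migrating them back and forth.
#include <cuda_runtime.h>
#include <cstddef>

int main() {
    const size_t bytesPerSqMeter = 1024;        // 1 kB per m^2 (as assumed above)
    const size_t worldSideMeters = 4096;        // hypothetical 4 km x 4 km world
    const size_t totalBytes = bytesPerSqMeter * worldSideMeters * worldSideMeters; // ~16 GiB

    int device = 0;
    cudaSetDevice(device);

    unsigned char* terrain = nullptr;
    cudaMallocManaged(&terrain, totalBytes);    // backed by system RAM, paged into VRAM on demand

    // The GPU only ever reads this data, so advise read-mostly: the driver may then
    // keep read-only copies of hot pages in VRAM instead of migrating them.
    cudaMemAdvise(terrain, totalBytes, cudaMemAdviseSetReadMostly, device);

    // ... fill terrain on the host, then launch kernels that read terrain[...];
    // pages fault into VRAM on first GPU access and stay there until the driver evicts them.

    cudaFree(terrain);
    return 0;
}
```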
If it doesn’t provide LRU/LFU behavior, can cudaMemPrefetchAsync be called from a CUDA kernel (to prefetch a region before the player approaches it)? Also, has anyone implemented a parallel LRU in CUDA (to be used with unified memory as the backing store and device memory as the cache) for general-purpose caching, under a permissive license?
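For reference, the host-side form of such prefetching would look roughly like this (the tile layout, offsets and helper names are hypothetical):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Host-side prefetch/evict helpers (ordinary host code). Before a player gets
// close to a region, its slice of the managed terrain buffer is prefetched to
// the GPU; a slice left far behind is prefetched back to the CPU as a crude
// eviction. The tile layout, offsets and helper names here are hypothetical.
static cudaStream_t gCopyStream;

void initStreaming() {
    cudaStreamCreateWithFlags(&gCopyStream, cudaStreamNonBlocking);
}

void prefetchTile(unsigned char* terrain, size_t tileOffset, size_t tileBytes, int device) {
    // Migrate the pages of this tile into VRAM ahead of use.
    cudaMemPrefetchAsync(terrain + tileOffset, tileBytes, device, gCopyStream);
}

void evictTile(unsigned char* terrain, size_t tileOffset, size_t tileBytes) {
    // "Eviction": migrate the pages back to host memory to free VRAM for other tiles.
    cudaMemPrefetchAsync(terrain + tileOffset, tileBytes, cudaCpuDeviceId, gCopyStream);
}
```

A real implementation would keep a small table of which tiles are currently resident and call these two from the game's streaming thread, which is basically the LRU bookkeeping asked about above.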
Assume non-procedural terrain at 1 kB per square meter, with the bandwidth budget focused on loading the outer border at 1 km distance around the player (roughly circular). If the player walks 1 meter, an arc of ~3.14 km length of data elements has to be loaded and kept in memory until the player moves too far away (or it gets evicted, if eviction is supported). That is still a lot of memory, even just for the frequently visited places on the 2D map (with height, objects, etc. per point). But if the visibility range is less than 0.5 km, latency wouldn't matter much, because the GPU would have enough time to load the data before the player walks that 0.5 km. Again, this assumes a single player; with multiple cameras (e.g. a car racing game), it would require multiple asynchronous loads of potentially overlapping areas, hence redundant loads (which is where LRU/LFU caching would help, imo).
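A quick back-of-the-envelope check of these numbers, using only the assumptions above:

```cpp
// Back-of-the-envelope numbers for the scenario above: 1 kB per m^2, 1 km
// radius, and a 180-degree arc that has to be refreshed per meter of movement.
#include <cstdio>

int main() {
    const double kPi             = 3.14159265358979323846;
    const double bytesPerSqMeter = 1024.0;   // 1 kB per m^2
    const double radiusMeters    = 1000.0;   // 1 km streaming radius

    const double arcLengthMeters = kPi * radiusMeters;                      // ~3141.6 m (180 deg)
    const double bytesPerStep    = arcLengthMeters * 1.0 * bytesPerSqMeter; // ~3 MiB per 1 m step
    const double filledCircle    = kPi * radiusMeters * radiusMeters * bytesPerSqMeter; // ~3 GiB

    printf("new data per 1 m step: %.1f MiB\n", bytesPerStep / (1024.0 * 1024.0));
    printf("filled 1 km circle   : %.2f GiB\n", filledCircle / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```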
I think that is quite unlikely. Compared to any automated heuristic, a programmer is likely to have more information available to decide when data is best moved. From my limited past interactions with the gaming world, programmers often want tighter manual control rather than an automated system with a certain amount of abstraction. This is how we got from OpenGL to Vulkan.
If you reserve 4 MB/s of PCIe bandwidth per player, at 1 kB/m² you get 4000 m²/s of new terrain, or 4 m² per ms. With a 180° arc (the direction the player is looking at and moving toward) at 1 km distance, the arc length is 3.14 km, so the arc only advances by about 1.27 mm per ms. The maximum supported speed would be about 4.6 km/h.
One player would need roughly pi GiB of GPU memory to cover a filled circle (360°) with a 1 km radius.
If you use different levels of detail depending on distance, the bandwidth and memory requirements would be more relaxed.
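For example (the ring radii and per-ring detail factors below are made-up illustration values), the resident set for the filled 1 km circle drops from roughly 3 GiB to a few hundred MiB:

```cpp
// Memory needed for the filled 1 km circle when detail decreases with distance.
// The ring radii and detail factors are made-up example values.
#include <cstdio>

int main() {
    const double kPi             = 3.14159265358979323846;
    const double bytesPerSqMeter = 1024.0;   // full detail: 1 kB per m^2

    // Rings [inner, outer) in meters, and the fraction of full detail kept in each ring.
    const double radii[]  = {0.0, 250.0, 500.0, 1000.0};
    const double detail[] = {1.0, 0.25, 1.0 / 16.0};

    double totalBytes = 0.0;
    for (int i = 0; i < 3; ++i) {
        const double ringArea = kPi * (radii[i + 1] * radii[i + 1] - radii[i] * radii[i]);
        totalBytes += ringArea * bytesPerSqMeter * detail[i];
    }

    const double uniformBytes = kPi * 1000.0 * 1000.0 * bytesPerSqMeter;
    printf("with LOD rings : %.0f MiB\n", totalBytes / (1024.0 * 1024.0));                // ~480 MiB
    printf("uniform detail : %.2f GiB\n", uniformBytes / (1024.0 * 1024.0 * 1024.0));     // ~3 GiB
    return 0;
}
```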
Or are you expecting that the players stay in the same regions?
Then either their movement is small compared to the visibility range (e.g. 100 m) and you more or less keep everything in VRAM, or the players move a lot and caching would not work.
Alternatively, you could make the 1 km range not an open field but hide huge areas: within a city, for example, the buildings occlude their interiors and the next street behind them. Then you could combine a theoretically wide visibility range with low memory needs and intelligent caching.
Or are you expecting that the players stay in the same regions?
I’m expecting movements in random directions over random distances, which creates some redundancy both within a single player's path and between players (all on the same GPU, in a simulation with no player control but AI-generated movement choices per player; pure simulation with patterns not known at compile time). I'm thinking about cloud gaming on a single GPU for multiple players, as if it were a single-player simulation that is streamed to multiple players after the simulation ends.