My 3D reconstruction's front projection algorithm uses texture memory for value retrieval. On the P4000 and RTX4000, the texture version is significantly faster than the global memory version, but on the A4000 and A2000 it is slower than the global memory version. Furthermore, on the A4000 the texture memory version runs more than 30% slower than on the RTX4000. What is the reason for this, and is it related to the architecture of the A2000 and A4000 graphics cards? My kernel function has not been specifically optimized for certain GPUs, nor does it use any low-level instruction optimizations; it simply involves some value retrieval and calculation operations.
I’ve read the question three times now and I am still confused as to what the relative performance of the various GPUs and access modes actually is. Could you express it as a performance ratio normalized to the slowest configuration, that is:
                | global memory | texture | bandwidth  |
----------------+---------------+---------+------------+
Quadro P4000    |      1.0      |    ?    | 243.3 GB/s |
Quadro RTX 4000 |       ?       |    ?    | 416.0 GB/s |
RTX A2000       |       ?       |    ?    | 288.0 GB/s |
RTX A4000       |       ?       |    ?    | 448.0 GB/s |
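As a minimal sketch of the normalization asked for here, the snippet below turns measured kernel times into ratios relative to the slowest configuration. The function name `normalize_to_slowest`, the configuration labels, and the timings are all hypothetical placeholders, not measurements from any of the GPUs in this thread; substitute your own numbers.

```python
def normalize_to_slowest(times_ms):
    """Convert kernel run times into performance ratios.

    The slowest configuration (largest time) gets 1.0; a configuration
    that runs twice as fast gets 2.0, and so on.
    """
    slowest = max(times_ms.values())
    return {config: slowest / t for config, t in times_ms.items()}


# Placeholder timings in milliseconds (NOT real measurements).
example_times_ms = {
    "gpu_x / global memory": 4.0,
    "gpu_x / texture": 2.0,
    "gpu_y / global memory": 1.0,
}

for config, ratio in normalize_to_slowest(example_times_ms).items():
    print(f"{config:24s} {ratio:.2f}")
```

Filling one such ratio into each `?` cell makes it possible to compare access modes and GPUs at a glance.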
In the absence of actual data, and setting aside changes in raw memory bandwidth between GPU architectures, a plausible hypothesis is that performance of the non-texture access path has improved more than that of the texture path. The general-purpose cache hierarchy is the more commonly used path and is likely to have received more of the improvements, whereas texture is a specialized (and increasingly niche) way of accessing memory (“make the common case fast, and keep the uncommon case functional”).
[Later:]
I have added the raw GPU memory bandwidth for the different GPUs to the table (the data was taken from the TechPowerUp database). This data suggests that you should see slightly better (by a few percent only) performance on the RTX A4000 compared to the Quadro RTX 4000.
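For reference, the bandwidth comparison can be worked out directly from the figures in the table. This is only a rough sketch: it assumes the kernel is purely limited by memory bandwidth, which may not hold for the algorithm in question, and the variable names are illustrative.

```python
# Peak memory bandwidth in GB/s, copied from the table above.
bandwidth_gbs = {
    "Quadro P4000": 243.3,
    "Quadro RTX 4000": 416.0,
    "RTX A2000": 288.0,
    "RTX A4000": 448.0,
}

# Normalize to the GPU with the lowest bandwidth.
baseline = min(bandwidth_gbs.values())
relative = {gpu: bw / baseline for gpu, bw in bandwidth_gbs.items()}

for gpu, rel in relative.items():
    print(f"{gpu:16s} {rel:.2f}x")

# Direct comparison of the two GPUs discussed in the question.
a4000_vs_rtx4000 = bandwidth_gbs["RTX A4000"] / bandwidth_gbs["Quadro RTX 4000"]
print(f"RTX A4000 vs Quadro RTX 4000: {a4000_vs_rtx4000:.3f}x")
```

A ratio only slightly above 1.0 for the RTX A4000 over the Quadro RTX 4000 means that bandwidth alone cannot account for a 30% slowdown in the texture version.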