Question about shaders and L2 texture caches

I have a simple shader that sums many textures together in a render pass. There is no kernel or random access involved; the final output should just be the sum, at each texture coordinate, of the corresponding pixel from every texture. For this example, say I’m trying to do this with 200 textures.
Previously, with OpenGL, the limited number of texture image units meant I’d split this into groups of 16 or 32 textures per pass and ping-pong the running-total texture with the output texture to accumulate the result. With Vulkan and its large number of bindable image descriptors, I decided to try doing this all in a single render pass.
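For reference, the single-pass version is essentially the following fragment shader (a sketch; the descriptor layout, array size, and names are illustrative, not my exact code):

```glsl
#version 450

// Array of all 200 input textures bound in one descriptor set.
layout(set = 0, binding = 0) uniform sampler2D uTextures[200];

layout(location = 0) in vec2 vUV;
layout(location = 0) out vec4 fragColor;

void main() {
    vec4 sum = vec4(0.0);
    // Sample every texture at the same coordinate and accumulate.
    for (int i = 0; i < 200; ++i) {
        sum += texture(uTextures[i], vUV);
    }
    fragColor = sum;
}
```

The multi-pass variant is the same loop capped at 32 textures, plus one extra sample of the previous pass’s running-total texture added into `sum`.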
However, this ended up being much slower than splitting it across multiple passes, which surprised me. I profiled with Nsight, which showed far fewer L2 cache hits than with the 32-textures-per-pass workflow.
With 32 textures per pass, I would have assumed that by the time the 32nd texture is sampled, data from the 1st texture has been evicted from the L2 cache anyway. However, that doesn’t seem to be the case. Does anyone have insight into what is happening here? Is it that threads go to sleep after issuing some of their texture fetches (waiting for the data to arrive), causing new threads to wake up and start new fetches, which ends up thrashing the cache when I try to do 200 textures in a single pass?
Is there any way I could make this fast in a single pass, even in a compute shader?

Hi @MalcolmB,

sadly I am unable to answer your question here. I am just wondering if there might be a better category than GPU Hardware for this topic. Maybe Vulkan, since you mentioned this happens after switching to it from OpenGL?

And in case you are not aware, there is also a general NVIDIA developer Discord where someone might be able to comment.