For my histogramming application, shared memory is very important because it supports random access in parallel. When the entire histogram can’t fit in shared memory, though, part of it has to spill into global memory. And global memory can’t do random access in parallel, even when cached, because of its large access granularity.
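A minimal sketch of the kind of spill described above, to make the problem concrete. Everything specific here is an assumption for illustration, not taken from a real application: the bin counts `NUM_BINS` and `SHARED_BINS`, the split point, the names `hybridHistogram` and `runHistogramCheck`, and the launch configuration. Bins below the split are counted with shared-memory atomics; the rest fall back to global-memory atomics.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical sizes, chosen only for illustration.
constexpr int NUM_BINS    = 16384;  // full histogram: 64 KiB at 4 bytes per bin
constexpr int SHARED_BINS = 4096;   // bins [0, SHARED_BINS) live in shared memory (16 KiB)

__global__ void hybridHistogram(const unsigned int* keys, int n, unsigned int* hist)
{
    __shared__ unsigned int local[SHARED_BINS];

    // Zero this block's private sub-histogram.
    for (int i = threadIdx.x; i < SHARED_BINS; i += blockDim.x)
        local[i] = 0;
    __syncthreads();

    // Grid-stride loop over the input keys.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        unsigned int bin = keys[i] % NUM_BINS;
        if (bin < SHARED_BINS)
            atomicAdd(&local[bin], 1u);  // resident bin: shared-memory atomic
        else
            atomicAdd(&hist[bin], 1u);   // spilled bin: global-memory atomic
    }
    __syncthreads();

    // Merge the shared sub-histogram into the global result.
    for (int i = threadIdx.x; i < SHARED_BINS; i += blockDim.x)
        if (local[i] != 0)
            atomicAdd(&hist[i], local[i]);
}

// Host-side check: runs the kernel on synthetic keys and compares the
// result against a CPU reference. Returns true when they match.
bool runHistogramCheck()
{
    const int n = 1 << 20;
    std::vector<unsigned int> keys(n), expected(NUM_BINS, 0), result(NUM_BINS, 0);
    for (int i = 0; i < n; ++i) {
        keys[i] = (static_cast<unsigned int>(i) * 2654435761u) >> 7;  // cheap scrambling
        ++expected[keys[i] % NUM_BINS];
    }

    unsigned int* dKeys = nullptr;
    unsigned int* dHist = nullptr;
    if (cudaMalloc(&dKeys, n * sizeof(unsigned int)) != cudaSuccess) return false;
    if (cudaMalloc(&dHist, NUM_BINS * sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(dKeys);
        return false;
    }
    cudaMemcpy(dKeys, keys.data(), n * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemset(dHist, 0, NUM_BINS * sizeof(unsigned int));

    hybridHistogram<<<64, 256>>>(dKeys, n, dHist);

    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(result.data(), dHist, NUM_BINS * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dKeys);
    cudaFree(dHist);
    return err == cudaSuccess && result == expected;
}
```

Calling `runHistogramCheck()` from a `main` and compiling with `nvcc` should be enough to try it. The sketch only shows where the spill happens; it does not try to measure how much slower the global-memory path is.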
Are the L2 cache and external DRAM banked, or are they only capable of accessing one huge word at a time? If they’re banked, I don’t understand why you can’t allocate memory in the L2 cache or DRAM as shared memory. I think this would be a large benefit, especially on Maxwell, where you have huge amounts (> 2 MiB) of L2 cache. And even for the memory backed by DRAM, there seems to be at least 8x parallelism (256-bit / 32-bit) that could be exploited.
I understand that one of the main reasons you can’t access global memory in parallel is that it would require 32 virtual-to-physical address translations, which would be prohibitively expensive to do in parallel. Even Intel’s AVX2 gather loads are really serial operations.
But I don’t mind giving up virtual memory and instead treating this memory as its own address space.
Also, can someone tell me, or speculate, how much L2 cache GM204 will have? I’m dying to know.