Allocate data for a __device__ function and work on it in DDR memory

How can I make sure that my device function is working in DDR memory?
The problem is that the array I’m using is too big to fit into shared memory, and I would like to boost performance.
Maybe you can suggest a better approach. I’m not sure my idea is a correct one.
Thank you

Just make sure you don’t connect any SDR memory to your GPU.

Joking aside, where else do you suspect the allocation could end up?

Unfortunately, that does not answer my question:
I would like to skip L2, write the data, and do the work in GDDR (video memory).
Are there instructions to control this?
So far I could not find any.

No, you can’t bypass the L2.

My reply indeed did not answer your question because I did not understand it. And I am still not sure I do now.

You cannot find instructions in the documentation to bypass L2 cache because there are none.
However, PTX has cache modifiers that allow you to hint at using a write-back and/or evict-first policy.

How do you think bypassing L2 would boost performance? The only case that comes to my mind is if you know the data is not going to be used again (soon), while other data in the cache will be, so you want to keep that other data from being evicted.
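To illustrate the cache hints mentioned above, here is a minimal sketch of a streaming kernel that reads and writes every element exactly once. It uses the CUDA intrinsics `__ldcs` and `__stcs`, which as far as I know map to the PTX `.cs` (cache streaming, evict-first) modifiers. The kernel name `streamingScale` and the host helper `runStreamingScale` are made up for this example, and the hints are only hints: the data still passes through L2.

```cuda
#include <cuda_runtime.h>

// Streaming scale: every element is touched once, so the loads and stores
// carry the "cache streaming" hint to mark their lines as evict-first.
__global__ void streamingScale(const float* __restrict__ in,
                               float* __restrict__ out,
                               float factor, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        float v = __ldcs(in + i);     // load with the .cs hint
        __stcs(out + i, v * factor);  // store with the .cs hint
    }
}

// Hypothetical host helper: fills the input with 1.0f, runs the kernel and
// returns the last output element, or -1.0f if any CUDA call fails.
float runStreamingScale(size_t n, float factor)
{
    float *in = nullptr, *out = nullptr;
    if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMallocManaged(&out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(in);
        return -1.0f;
    }
    for (size_t i = 0; i < n; ++i) in[i] = 1.0f;

    streamingScale<<<256, 256>>>(in, out, factor, n);
    cudaError_t err = cudaDeviceSynchronize();

    float last = (err == cudaSuccess) ? out[n - 1] : -1.0f;
    cudaFree(in);
    cudaFree(out);
    return last;
}
```

Whether this helps at all depends on the access pattern and the architecture, so it is worth profiling with and without the hints.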

[as usual, Robert was faster to reply]

Thank you for your replies.

The idea was to store all the data (a large array which does not fit into the L2 cache) in GDDR memory, believing that this would make rotating the data as simple as swapping array indexes in GDDR, instead of reading the data and writing it back and forth.

Am I correct that the data will be split between the L2 cache and GDDR, and that address manipulation will therefore not work the way I wanted?

Is there a way to guarantee that all data is written to GDDR, even if it goes through L2?

All data written to L2 will eventually get to GPU DRAM, unless it is overwritten first (in which case the newer data that replaced it will get sent to GPU DRAM). However, the program-visible behavior shouldn’t be any different than if there were no L2 cache.

You should be able to do pointer swapping to avoid copying data just as you would expect.
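A minimal sketch of the pointer-swapping idea, assuming two device buffers used in ping-pong fashion. The kernel `rotateLeftByOne` and the helper `rotateWithPointerSwap` are invented for this example; the rotation is just a stand-in for whatever the real per-step work is.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// Hypothetical step: dst becomes src rotated left by one element.
__global__ void rotateLeftByOne(const int* src, int* dst, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[(i + 1) % n];
}

// Runs `steps` rotations on the array 0, 1, ..., n-1 (n must be > 0) and
// returns element 0 of the final array, or -1 if a CUDA call fails.
int rotateWithPointerSwap(size_t n, int steps)
{
    std::vector<int> host(n);
    for (size_t i = 0; i < n; ++i) host[i] = (int)i;

    int *cur = nullptr, *next = nullptr;
    if (cudaMalloc(&cur, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&next, n * sizeof(int)) != cudaSuccess) {
        cudaFree(cur);
        return -1;
    }
    cudaMemcpy(cur, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);
    for (int s = 0; s < steps; ++s) {
        rotateLeftByOne<<<blocks, threads>>>(cur, next, n);
        // No data is copied here: only the two host-side pointers trade roles.
        std::swap(cur, next);
    }

    int first = -1;
    cudaError_t err = cudaMemcpy(&first, cur, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(cur);
    cudaFree(next);
    return err == cudaSuccess ? first : -1;
}
```

The swap works the same whether a given cache line currently sits in L2 or in DRAM, because the cache is transparent to the addresses the program uses.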

Thanks again. Very helpful

BTW, regarding cache behavior, I was wondering whether prefetch or prefetchu in PTX really works. I ran several microbenchmarks, but I didn’t notice any latency improvement from using prefetch, except for the TLB miss penalty. Maybe it’s architecture dependent? It does not seem to work on sm_52 or sm_75.

I also tried prefetch on sm75 with similar results. Will continue …
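For anyone who wants to repeat the experiment, this is roughly how a PTX prefetch can be issued from CUDA C++ through inline assembly. It is only a sketch: the wrapper `prefetchL2`, the kernel `sumWithPrefetch`, the helper `runSumWithPrefetch` and the prefetch distance `ahead` are all names I made up, and the code checks correctness only, not whether the prefetch has any effect on latency.

```cuda
#include <cuda_runtime.h>

// Issues a PTX prefetch into L2 for the line containing p (generic address).
__device__ __forceinline__ void prefetchL2(const void* p)
{
    asm volatile("prefetch.L2 [%0];" :: "l"(p));
}

// Grid-stride sum that prefetches the element `ahead` iterations in advance.
__global__ void sumWithPrefetch(const float* data, float* result,
                                size_t n, size_t ahead)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    float acc = 0.0f;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        size_t future = i + ahead * stride;
        if (future < n) prefetchL2(data + future);  // stay inside the array
        acc += data[i];
    }
    atomicAdd(result, acc);
}

// Hypothetical host helper: sums n ones and returns the total,
// or -1.0f if any CUDA call fails.
float runSumWithPrefetch(size_t n, size_t ahead)
{
    float *data = nullptr, *result = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMallocManaged(&result, sizeof(float)) != cudaSuccess) {
        cudaFree(data);
        return -1.0f;
    }
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;
    *result = 0.0f;

    sumWithPrefetch<<<32, 128>>>(data, result, n, ahead);
    cudaError_t err = cudaDeviceSynchronize();

    float total = (err == cudaSuccess) ? *result : -1.0f;
    cudaFree(data);
    cudaFree(result);
    return total;
}
```

Timing the kernel with different values of `ahead` (and with the prefetch line removed) would be one way to see whether a given architecture reacts to the instruction.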

While I haven’t tried prefetch on sm_75, I’d also recommend focussing your efforts elsewhere. Are you keeping enough data in flight to fully hide memory latency? Note that the latency of a fully loaded memory system is much higher than that of an idle one.

Vasily Volkov’s thesis covers the subject extensively.
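One common way to get more data in flight per thread is to issue several independent loads before consuming any of them. The following is a sketch under that assumption, with made-up names (`sumFourInFlight`, `runSumFourInFlight`) and an unroll factor of four picked arbitrarily; the compiler may already do something similar on its own.

```cuda
#include <cuda_runtime.h>

// Grid-stride sum in which each thread issues four independent loads
// before using any of the values, so four requests can overlap.
__global__ void sumFourInFlight(const float* __restrict__ data,
                                float* result, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    // Main loop: none of the four loads depends on another one.
    for (; i + 3 * stride < n; i += 4 * stride) {
        float a = data[i];
        float b = data[i + stride];
        float c = data[i + 2 * stride];
        float d = data[i + 3 * stride];
        acc += (a + b) + (c + d);
    }
    // Tail loop for the elements the unrolled loop did not reach.
    for (; i < n; i += stride) acc += data[i];

    atomicAdd(result, acc);
}

// Hypothetical host helper: sums n ones and returns the total,
// or -1.0f if any CUDA call fails.
float runSumFourInFlight(size_t n)
{
    float *data = nullptr, *result = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMallocManaged(&result, sizeof(float)) != cudaSuccess) {
        cudaFree(data);
        return -1.0f;
    }
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;
    *result = 0.0f;

    sumFourInFlight<<<32, 128>>>(data, result, n);
    cudaError_t err = cudaDeviceSynchronize();

    float total = (err == cudaSuccess) ? *result : -1.0f;
    cudaFree(data);
    cudaFree(result);
    return total;
}
```

More resident threads and more independent loads per thread both raise the number of outstanding requests; which of the two is cheaper depends on register and occupancy limits for the kernel in question.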

Thank you. Very interesting