Memory-intensive CUDA benchmarks

Can someone point me to CUDA applications that are bottlenecked by memory throughput? I am using a Pascal GTX 1070, which has a theoretical memory bandwidth of 256 GB/s (256-bit bus at 8 Gbps). By memory throughput bottleneck I mean that global memory throughput is the limiting factor, not compute or shared memory.

It would also be helpful if you could point out applications that are memory throughput bottlenecked and also don’t fully utilize all the SMs (the GTX 1070 has 15 SMs, each supporting up to 2048 resident threads, with a limit of 1024 threads per block).
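To make clear what kind of behavior I am after, here is a minimal sketch of a STREAM-style triad kernel. It does two loads and one store per element with almost no arithmetic, so I would expect it to be limited by global memory traffic. The array size, the block size of 256, and the `blocks` multiplier are arbitrary choices of mine, not tuned values, and I have not verified what fraction of peak bandwidth it actually reaches.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// STREAM-style triad: two loads and one store per element, one multiply-add.
// Grid-stride loop, so any grid size covers the whole array.
__global__ void triad(const float* a, const float* b, float* c, float s, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        c[i] = a[i] + s * b[i];
}

int main() {
    const size_t n = (size_t)1 << 26;        // 64 Mi floats per array (assumed size)
    const size_t bytes = n * sizeof(float);  // 256 MiB per array
    float *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);
    cudaMemset(a, 0, bytes);
    cudaMemset(b, 0, bytes);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Assumed launch shape: several blocks per SM. Set `blocks` below the
    // SM count to deliberately leave some SMs without work.
    int blocks = 8 * prop.multiProcessorCount;
    int threads = 256;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    triad<<<blocks, threads>>>(a, b, c, 2.0f, n);  // warm-up launch
    cudaEventRecord(start);
    triad<<<blocks, threads>>>(a, b, c, 2.0f, n);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Three arrays are touched once each: bytes moved = 3 * bytes.
    printf("blocks=%d effective bandwidth: %.1f GB/s\n",
           blocks, 3.0 * (double)bytes / (ms * 1e6));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}
```

Compiled with something like `nvcc -O2 triad.cu -o triad`, this prints an effective bandwidth figure that can be compared against the card's theoretical number. I am looking for real applications or benchmark suites that behave like this, rather than a synthetic kernel.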

I need these benchmarks for some analysis.