I’m benchmarking it with 5,000x5,000 matrices, and it runs at ~1,350 iterations per second for both my kernel and the Thrust version. I’m using a 3070 Ti, which has a peak memory bandwidth of 604 GB/s. 604e9 / (4 /* sizeof(float) */ * 5000 * 5000) is ~6,000. Since I’m working with two 5,000-by-5,000 matrices in the kernel case, I’d expect to run at 6,000 / 2 = 3,000 iterations per second. I can tell that the overall program is bottlenecked by memory, because when I switch from float to double the iterations per second get cut precisely in half, and because based on the FLOPS for the 3070 Ti, computation is not an issue. I’ve also profiled with ncu, and ncu reports that I’m utilizing >90% of the available bandwidth during this kernel. Any ideas what might be the issue? Thanks.
You have 2 loads and 1 store, each of size float: B[idx] has to be read, A[idx] has to be read, the two are multiplied together, and the result is stored back to B[idx]. So you might “expect” it to run at 6,000 / 3 = 2,000 iterations per second, if you could achieve peak bandwidth. Generally you cannot. So I would say you are getting around 1,350 / 2,000 = 67.5% of peak, based on what you are saying here. A mix of reads and writes is not the way to measure best achievable bandwidth, as bus turnaround (for example) costs something. Even without that, there are various overheads in DRAM communication which mean the peak number is not realizable in practice. (The peak number of 604 GB/s is computed as something like the maximum Gbit/s that can be delivered on a single wire times the bus width in pins/wires. That calculation doesn’t take into account many factors related to what you can actually achieve.)
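In other words, the kernel presumably looks something like this (a hypothetical reconstruction from your description; names are illustrative), which makes the 2-load/1-store traffic per thread explicit:

```cuda
// Assumed shape of the kernel under discussion: element-wise B = A * B.
// Per thread: load A[idx], load B[idx], store B[idx] -> 3 float transfers.
__global__ void multiply(const float *A, float *B, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        B[idx] = A[idx] * B[idx];
}
```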
If you want a fairly good measurement of achievable bandwidth, try the bandwidthTest sample code, and use that as your estimate of peak achievable. (You may well find you are at ~90% of that number.)
You can then try things like doing only reads in the kernel to see if you can make improvements in your test case. This sort of benchmarking has to be done carefully, as the compiler may “optimize out” reads if they don’t actually modify globally observable state. Your kernel also has a lot of overhead (like the whole kernel launch) for each set of two reads and one write per thread. You could try doing more work per thread (e.g. a grid-stride loop, perhaps even with the grid carefully sized to match your GPU, hitting 100% occupancy). It won’t help a lot, but it may help a little.
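A grid-stride version of such a kernel might look like this (a sketch, assuming the same element-wise B = A * B operation; the launch sizing is illustrative):

```cuda
// Grid-stride loop: each thread handles multiple elements, so a fixed-size
// grid (sized to just fill the GPU) amortizes per-thread launch overhead.
__global__ void multiply_gs(const float * __restrict__ A,
                            float * __restrict__ B, int n)
{
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < n;
         idx += gridDim.x * blockDim.x)
        B[idx] = A[idx] * B[idx];
}

// Illustrative launch: pick a grid that matches the device rather than n,
// e.g. (number of SMs) * (resident blocks per SM) blocks of 256 threads.
// multiply_gs<<<numSMs * blocksPerSM, 256>>>(A, B, n);
```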