Coalescing for compute capability above 2.0?

I’m implementing a kernel that involves complex arithmetic, and the Visual Profiler tells me the kernel is bandwidth bound and suffers from memory dependency stalls. As background, I’m using thrust::complex<float>, which I believe has the same layout as cuFloatComplex (a 32-bit real part and a 32-bit imaginary part laid out next to each other).

The result isn’t surprising, since my kernel does involve a lot of global reads and writes. The problem is that I have already tried to coalesce the memory accesses by having adjacent threads access adjacent complex numbers in the arrays. After reading this Stack Overflow post http://stackoverflow.com/a/5044424, though, I’m confused about how Kepler devices handle global memory access. Is my current approach (again, adjacent threads access adjacent complex numbers, which means each thread’s accesses hit every other 32-bit word) good enough for coalescing?

Here is my reasoning:
Since each warp has 32 threads and each thread accesses one 64-bit (8-byte) complex number, one warp will access 32 * 8 = 256 bytes, so a single global read such as

thrust::complex<float> entry = complexArray[threadIdx.x];

will cause each warp to issue at least two global memory requests (since a cache line is only 128 bytes). But I’m not sure whether this will in fact cause four memory accesses: since a thread has to read the real part and then the imaginary part of its complex number, the first 16 threads of a warp read every other 32-bit float in complexArray for the real parts, and so do the other 16 threads of the warp; then, when the threads read the imaginary parts, there are another two read requests, but since the real parts have just been fetched, the imaginary parts should already be in cache.
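
Just to be concrete, here is a minimal sketch of the access pattern I mean (the kernel and array names are placeholders, not my actual code):

#include <thrust/complex.h>

// Adjacent threads read adjacent 8-byte complex values, so one warp touches
// 32 * 8 = 256 contiguous bytes, i.e. two 128-byte cache lines.
__global__ void complexSquare(const thrust::complex<float>* in,
                              thrust::complex<float>* out,
                              int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        thrust::complex<float> entry = in[i];  // 64-bit read per thread
        out[i] = entry * entry;                // 64-bit write per thread
    }
}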

There’s also this post that might be relevant, but I don’t understand the answer… http://stackoverflow.com/questions/17418257/cuda-coalescing-performance-of-small-data-types-fermi-kepler

Being memory bandwidth bound and having memory dependency stalls don’t necessarily indicate a coalescing issue. Use the profiler to figure out whether you actually have a coalescing issue first. The gld_efficiency and gst_efficiency metrics are a useful indicator of this.
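
For example, you can collect them with nvprof (the binary name here is just a placeholder):

nvprof --metrics gld_efficiency,gst_efficiency ./myApp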

Kepler and Fermi don’t have any significant differences at this level of coalescing discussion.

I don’t know for sure that you are accessing 64 bits per thread, but there shouldn’t be any coalescing issue with doing so. In fact, properly accessing 64 bits per thread may be slightly more efficient than accessing 32 bits per thread. Yes, it will generate 2 global load transactions, since these are limited to 128 bytes, but that is not a concern.

The profiler has metrics that can tell you the number of actual global load transactions. If you use these, along with some analysis of your code or a test kernel, you can figure out exactly whether you’re getting the right “ratio” of warp reads to global transactions.
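
Something like this (again, the binary name is a placeholder):

nvprof --metrics gld_transactions,gld_transactions_per_request ./myApp

gld_transactions_per_request is essentially that ratio: the average number of transactions generated per warp-level load request.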

Thanks @txbob!

The concern was that nvvp generates recommendations like these: https://www.dropbox.com/s/9d0hywbgooqkrn6/Screen%20Shot%202015-08-21%20at%2010.35.17%20PM.png?dl=0

https://www.dropbox.com/s/dk8qnpxw11durei/Screen%20Shot%202015-08-21%20at%2010.35.26%20PM.png?dl=0

Note that one of them refers to thrust/complex.h and the other to my kernel (so basically every complex load/store yields a message).

And here are the gld/gst_efficiency metrics you suggested I test. The copy kernel simply copies from complexA to complexB, and batchCHQL is my kernel (its numbers make sense, since most of the kernel’s operations are on complex arrays, and there are also some float arrays that would “increase” the efficiency):
https://www.dropbox.com/s/rlut5mdjwwtbmkj/Screen%20Shot%202015-08-21%20at%2010.35.40%20PM.png?dl=0
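
The copy kernel is essentially this (simplified, not the exact code):

__global__ void copyComplex(const thrust::complex<float>* complexA,
                            thrust::complex<float>* complexB,
                            int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        complexB[i] = complexA[i];
}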

So it seems the two profilers (nvvp and nvprof’s efficiency metrics) agree that the load/store efficiency is not 100% but rather about 50%? Were you suggesting that this is OK because Fermi and Kepler handle this kind of access gracefully through the L2 cache (or whatever different load/store scheme they have compared to earlier hardware)? If so, is it nvvp’s problem that it fails to recognize that this access pattern is fine (I got three whole pages of the above warnings, so I thought they might be important)?

Also, it would be nice to know where/how to upload a viewable image on this forum!