I’m implementing a kernel that involves complex arithmetic, and the visual profiler tells me the kernel is bandwidth bound and suffers from memory dependency stalls. As background, I’m using thrust::complex<float>, which I believe has the same layout as cuFloatComplex (32-bit real and imaginary parts stored side by side).
The result is not surprising, since my kernel does involve a lot of global reads and writes. But here is the problem: I have already tried to coalesce the memory accesses by having adjacent threads access adjacent complex numbers in the arrays. After reading this Stack Overflow answer, http://stackoverflow.com/a/5044424, I’m confused about how Kepler devices handle global memory access. Is my current approach (again, each thread accesses an adjacent complex number, which in turn means threads may be accessing every other 32-bit word) good enough for coalescing?
Here are my thoughts:
It seems that since each warp has 32 threads, and each thread accesses one 64-bit (8-byte) complex number, one warp will access 32 * 8 = 256 bytes; so one global read such as
thrust::complex<float> entry = complexArray[threadIdx.x];
will cause each warp to issue at least two global memory requests (since a cache line is only 128 bytes). But I’m not sure whether this in fact causes four memory accesses: since one needs to read the real part and then the imaginary part of each complex number, the first 16 threads of a warp would access every other 32-bit float in complexArray for the real parts, and so would the other 16 threads of the warp; then, when the threads go on to access the imaginary parts, we would have another two read requests, but since the real parts were just fetched, the imaginary parts should already be in cache.
There’s also this post that might be relevant, but I don’t understand the answer: http://stackoverflow.com/questions/17418257/cuda-coalescing-performance-of-small-data-types-fermi-kepler