Method to get global memory throughput of GPU: questions about Sylvain's code

http://perso.univ-perp.fr/sylvain.collange…roughput.tar.gz
I downloaded Sylvain’s “cuda_throughput”, which measures the global memory throughput of a GPU, but I still have some questions about his method. Is there anyone who can help?

In kernel “throughput_stream_pchase_kernel” (throughput_kernel.cu):
1. I don’t quite understand the exact meaning of the variables “requestsize”, “size” and “stride”.
2. Why is “offset” calculated like that? (I mean, would “offset=threadNum” work?)
3. What does “warmup” mean? Is it to warm up the GPU? (line 38)
4. The code from line 47 to 50 is timed. It seems that the index should increase steadily by threadNumInBlock within a warp, so these memory accesses are coalesced. Does it mean that this routine is trying to get the maximum throughput?
5. Why is “start_time” divided by 2?

So many questions…

Thanks very much!!

Hi Xingjian,

These are just some micro-benchmarks I wrote a couple of years ago for my own use… They are undocumented and most likely buggy. I am actually surprised the code still compiles and runs on current CUDA versions…

The goal is not to measure the maximum memory bandwidth, but rather to understand what may cause reduced throughput, by using specific access patterns.

The output is used to produce a plot like this:

To understand what the parameters mean, you can think of it as a parallelization of this pseudocode:

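char A[dataset_size];
int i = 0;

while(i < testsize)
{
	// y is the first byte of each block of request_size bytes
	for(int y = 0; y + request_size < dataset_size; y += stride * request_size)
	{
		// read one contiguous block
		for(int x = 0; x < request_size; x++)
		{
			read(A[y + x]);
			i++;
		}
	}
}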

So we read blocks of size ‘request_size’, each spaced by ‘stride’ (in the pseudocode above, consecutive blocks start ‘stride * request_size’ bytes apart), looping through an array of size ‘dataset_size’.

The two tricky parts are:

  • The index calculations needed to map this sequential ordering to a parallel grid of threads.

  • Moving those calculations out of the inner loop. My initial attempts caused the code to become compute-bound when trying to measure the texture cache throughput. Also, I wanted to make sure that I only have at most one pending request per thread to be able to study the effect of occupancy on throughput.

So I pre-compute the access order and store the indices inside the array itself, resulting in a big linked list. The code that is timed then just becomes a linked list traversal.
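As a rough illustration of the linked-list idea, here is a minimal single-threaded sketch in plain C. It is not the code from throughput_kernel.cu: the names `build_chain` and `chase` are hypothetical, the chain is stored in a separate index array rather than in the benchmark's own buffer, and the loop bounds simply mirror the pseudocode above.

```c
#include <assert.h>
#include <stddef.h>

/* Fill `next` so that next[i] holds the index to visit after i.
 * The order follows the pseudocode: blocks of `request_size` elements
 * whose starts are `stride * request_size` elements apart.
 * The last visited element points back to index 0, closing the cycle.
 * Returns the number of elements in the chain. */
size_t build_chain(size_t *next, size_t dataset_size,
                   size_t request_size, size_t stride)
{
    assert(request_size > 0 && stride > 0);
    size_t count = 0;
    size_t prev = 0;
    int have_prev = 0;
    for (size_t y = 0; y + request_size < dataset_size;
         y += stride * request_size) {
        for (size_t x = 0; x < request_size; x++) {
            size_t cur = y + x;
            if (have_prev)
                next[prev] = cur;   /* link previous element to this one */
            prev = cur;
            have_prev = 1;
            count++;
        }
    }
    if (have_prev)
        next[prev] = 0;             /* wrap around to the first element */
    return count;
}

/* The part that would be timed: `steps` dependent loads, nothing else.
 * Each load needs the result of the previous one, so at most one
 * request is pending at a time. */
size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t idx = start;
    for (size_t s = 0; s < steps; s++)
        idx = next[idx];
    return idx;
}
```

With `dataset_size = 16`, `request_size = 2` and `stride = 2`, the chain visits 0, 1, 4, 5, 8, 9, 12, 13 and then returns to 0. All index arithmetic happens in `build_chain`, so the traversal loop contains only loads.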

‘offset’ is the first index we access. Afterwards, we just have to follow the linked list. Because I want to measure non-sequential throughput, I need to take ‘stride’ and ‘requestsize’ into account to compute it.
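One way to picture why ‘offset’ cannot simply be the thread number is the sketch below. It assumes that thread t starts at the t-th element of the access order from the pseudocode; this is an assumption made for illustration, and `first_index` is a hypothetical helper, not the formula actually used in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the t-th element in the access order, assuming blocks of
 * `request_size` elements whose starts are `stride * request_size`
 * elements apart (as in the pseudocode). */
size_t first_index(size_t t, size_t request_size, size_t stride)
{
    assert(request_size > 0);
    size_t block  = t / request_size;   /* which block the thread falls in */
    size_t within = t % request_size;   /* position inside that block */
    return block * stride * request_size + within;
}
```

Under this model, `first_index(t, request_size, 1)` is just `t`, so “offset=threadNum” would only be correct for a stride of 1. For larger strides, threads beyond the first block have to skip the gaps between blocks.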

The “warmup” pass walks through the entire array once before the timing starts. This initializes caches and TLBs and eliminates the initialization overhead from the timing.

Regarding coalescing and maximum throughput: yes and no. You can choose to make the accesses coalesced or not by adjusting ‘requestsize’. A requestsize of 128 or 256 will result in coalesced accesses, but will cause severe partition camping at larger strides. A requestsize of 256 * number of memory partitions distributes the load evenly across the memory bus.
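The partition camping effect can be sketched with a simplified address model. The model below assumes that memory is interleaved across partitions in 256-byte chunks, i.e. partition = (address / 256) % number of partitions, and that block starts are `stride * request_size` bytes apart as in the pseudocode. Both are assumptions for illustration; real hardware mappings may differ, and `partition_histogram` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

#define PARTITION_WIDTH 256  /* bytes per interleaving chunk (assumed) */

/* Count how many bytes of the first `num_blocks` requests land in each
 * partition under the simplified model
 *   partition = (address / PARTITION_WIDTH) % num_partitions. */
void partition_histogram(size_t *hist, size_t num_partitions,
                         size_t request_size, size_t stride,
                         size_t num_blocks)
{
    assert(num_partitions > 0);
    for (size_t p = 0; p < num_partitions; p++)
        hist[p] = 0;
    for (size_t b = 0; b < num_blocks; b++) {
        size_t base = b * stride * request_size;  /* start of block b */
        for (size_t x = 0; x < request_size; x++)
            hist[((base + x) / PARTITION_WIDTH) % num_partitions]++;
    }
}
```

In this model, with 8 partitions, a request size of 256 and a stride of 8, every block starts at a multiple of 2048 bytes and all traffic lands in partition 0. With a request size of 256 * 8 = 2048, each block covers all 8 partitions equally, whatever the stride.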

Mmmh… The division by 2 is there because I used to use the “slow” SM clock as a reference, which runs at half the rate of the clock returned by “clock()”. On the other hand, my host code seems to use the clock rate returned by cudaGetDeviceProperties, which refers to the fast clock… This is likely a bug.

But multiplying the results by two does not make sense either (it gives more than the theoretical bandwidth), so there might be another bug that compensates for the error. ;) [Edit: never mind, there is indeed a multiplication by 2 in host code…]
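The unit issue can be made concrete with a small helper. This is only a sketch of the arithmetic: `bandwidth` is a hypothetical function, not part of the benchmark, and where exactly the host code applies its factor of 2 is not shown here.

```c
#include <assert.h>

/* Bandwidth in bytes per second, from a cycle count and the rate (in Hz)
 * of the clock that produced that count. The two must refer to the
 * same clock, otherwise the result is off by the ratio of the clocks. */
double bandwidth(double bytes, double cycles, double clock_hz)
{
    assert(cycles > 0.0 && clock_hz > 0.0);
    return bytes * clock_hz / cycles;
}
```

If the cycle count has been halved (slow-clock cycles) but `clock_hz` is the fast clock rate, the result is twice the real bandwidth. Doubling the halved count before the call, or passing the slow clock rate, restores the correct value.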

This is old, dusty code, so take its results with a grain of salt…