Method to get global memory throughput of GPU: questions about Sylvain's code

lixingjian · March 14, 2010, 12:55pm

http://perso.univ-perp.fr/sylvain.collange/ware/cuda_throughput.tar.gz
I downloaded Sylvain’s “cuda_throughput”, which is used to measure the global memory throughput of a GPU, but I still have some questions about his method. Can anyone help?

In kernel “throughput_stream_pchase_kernel” (throughput_kernel.cu):
1. I don’t quite understand the exact meaning of the variables “requestsize”, “size” and “stride”.
2. Why is “offset” calculated like that? (I mean, would “offset = threadNum” work?)
3. What does “warmup” mean? To warm up the GPU? (line 38)
4. The code from line 47 to 50 is timed. It seems that index should be steadily increased by threadNumInBlock within a warp, so these memory accesses are coalesced. Does it mean that this routine is trying to get the maximum throughput?
5. Why is “start_time” divided by 2?

So many questions…

Thanks very much!!

Sylvain_Collange · March 15, 2010, 6:55pm

Hi Xingjian,

These are just some micro-benchmarks I wrote a couple of years ago for my own use… They are not documented and most likely buggy. I am actually surprised they still compile and run on current CUDA versions…

The goal is not to measure the maximum memory bandwidth, but rather to understand what may cause reduced throughput, by using specific access patterns.

The output is used to produce a plot like this:

[External media: plot of the benchmark output]

To understand what the parameters mean, you can think of it as a parallelization of this pseudocode:

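char A[dataset_size];	// array being read
int i = 0;	// number of reads issued so far

while(i < testsize)
{
	// one pass over the dataset, one block of request_size bytes at a time
	for(int y = 0; y + request_size < dataset_size; y += stride * request_size)
	{
		for(int x = 0; x < request_size; x++)
		{
			read(A[y + x]);
			i++;
		}
	}
}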

So we read blocks of size ‘request_size’, each spaced by ‘stride’ bytes, looping through an array of size ‘dataset_size’.
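To make the pattern concrete, here is a small host-side C sketch that lists the byte offsets visited by one pass of the pseudocode. It follows the loop bounds of the pseudocode literally, so consecutive blocks start `stride * request_size` bytes apart. The function name `access_pattern` and the `out`/`max_out` parameters are illustrative only and do not come from cuda_throughput.

```c
#include <assert.h>
#include <stddef.h>

/* Fills `out` with the byte offsets visited by one pass over the dataset,
   in visiting order, and returns how many were written (at most max_out). */
size_t access_pattern(size_t dataset_size, size_t request_size, size_t stride,
                      size_t *out, size_t max_out)
{
    assert(request_size > 0 && stride > 0);
    size_t n = 0;
    /* Same bounds as the pseudocode: blocks start stride * request_size apart. */
    for (size_t y = 0; y + request_size < dataset_size; y += stride * request_size) {
        for (size_t x = 0; x < request_size; x++) {
            if (n == max_out)
                return n;          /* output buffer full */
            out[n++] = y + x;
        }
    }
    return n;
}
```

With dataset_size = 16, request_size = 2 and stride = 2, the visited offsets are 0, 1, 4, 5, 8, 9, 12, 13.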

The two tricky parts are:

1. The index calculations needed to map this sequential ordering to a parallel grid of threads.
2. Moving those calculations out of the inner loop. My initial attempts caused the code to become compute-bound when trying to measure the texture cache throughput. Also, I wanted to make sure that I only have at most one pending request per thread to be able to study the effect of occupancy on throughput.

So I pre-compute the access order and store the indices inside the array itself, resulting in a big linked list. The code that is timed then just becomes a linked list traversal.
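The idea of storing the access order inside the array can be sketched on the CPU as follows. This is a single-chain, host-only illustration under my own assumptions (indices counted in elements rather than bytes, the last visited element linking back to the first). The names `build_chain` and `chase` are hypothetical and are not taken from the actual kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Stores in next[i] the index to visit after i, so the access order lives
   inside the array itself. The last visited element links back to index 0.
   Elements that the pattern skips are left untouched. */
void build_chain(unsigned *next, size_t n, size_t request_size, size_t stride)
{
    assert(request_size > 0 && stride > 0 && n > request_size);
    size_t prev = 0;
    int first = 1;
    for (size_t y = 0; y + request_size < n; y += stride * request_size) {
        for (size_t x = 0; x < request_size; x++) {
            size_t cur = y + x;
            if (!first)
                next[prev] = (unsigned)cur;   /* link previous element to this one */
            first = 0;
            prev = cur;
        }
    }
    next[prev] = 0;                           /* close the loop */
}

/* The part that would be timed: every load depends on the previous one,
   so there is never more than one outstanding request per chain. */
unsigned chase(const unsigned *next, unsigned start, size_t steps)
{
    unsigned idx = start;
    for (size_t s = 0; s < steps; s++)
        idx = next[idx];
    return idx;
}
```

Because each address comes out of the previous load, the traversal loop needs no index arithmetic at all, which is the point of precomputing the order.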

Regarding question 2: ‘offset’ is the first index we access. Afterwards, we just have to follow the linked list. Because I want to measure non-sequential throughput, I need to take ‘stride’ and ‘requestsize’ into account to compute it.
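As an illustration of why a plain “offset = threadNum” is not enough, here is one possible mapping from a thread number to its starting byte offset. This is an assumption made for the example, not the formula used in throughput_kernel.cu: thread t is simply given the t-th offset in the visiting order of the pseudocode. The name `start_offset` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Starting byte offset for a given thread number, assuming thread t begins
   at the t-th element of the sequential visiting order. */
size_t start_offset(size_t thread_num, size_t request_size, size_t stride)
{
    assert(request_size > 0 && stride > 0);
    size_t block  = thread_num / request_size;   /* which request block */
    size_t within = thread_num % request_size;   /* position inside the block */
    return block * stride * request_size + within;
}
```

Under this mapping the result equals the thread number only when stride is 1. For any larger stride the starting offsets skip ahead at every block boundary, so they depend on both parameters.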

Regarding question 3 (warmup): the array is entirely walked through once before the timing starts. This initializes caches and TLBs and eliminates the initialization overhead from the timing.
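The same warm-up-then-measure structure can be sketched on the host in plain C. This is only an analogy: it uses the standard `clock()` from `<time.h>`, not the device-side counter, and the names `walk` and `timed_walk` are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Follows the chain stored in next[] for `steps` hops. The volatile
   qualifier forces every load to be performed. */
static unsigned walk(const volatile unsigned *next, unsigned start, size_t steps)
{
    unsigned idx = start;
    for (size_t s = 0; s < steps; s++)
        idx = next[idx];
    return idx;
}

/* Walks the chain once untimed, then times a second, identical walk.
   Returns the elapsed CPU time in seconds; *end receives the final index. */
double timed_walk(const unsigned *next, unsigned start, size_t steps, unsigned *end)
{
    assert(next != NULL && end != NULL);
    (void)walk(next, start, steps);        /* warm-up pass, result ignored */
    clock_t t0 = clock();
    *end = walk(next, start, steps);       /* timed pass */
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Only the second pass contributes to the reported time, so first-touch costs such as cache and TLB misses on a cold array are paid in the untimed pass.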

Regarding question 4: yes and no. You can choose to make the accesses coalesced or not by adjusting ‘requestsize’. A requestsize of 128 or 256 will result in coalesced accesses, but will cause severe partition camping at larger strides. A requestsize of 256 * number of memory partitions distributes the load evenly across the memory bus.
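The partition camping effect can be illustrated with a little arithmetic. The sketch below assumes that memory is interleaved across partitions in 256-byte slices, round-robin, which is consistent with the figure of 256 quoted above but is a simplification (real hardware may map addresses differently). The partition count is a parameter, and `partition_histogram` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

#define PARTITION_WIDTH 256   /* assumed bytes per partition slice */

/* Counts, per partition, the 256-byte slices touched by one pass when
   blocks of request_size bytes start stride * request_size bytes apart.
   Assumes request_size divides, or is a multiple of, the slice width. */
void partition_histogram(size_t dataset_size, size_t request_size, size_t stride,
                         size_t num_partitions, size_t *hits)
{
    assert(request_size > 0 && stride > 0 && num_partitions > 0);
    assert(request_size % PARTITION_WIDTH == 0 || PARTITION_WIDTH % request_size == 0);
    for (size_t p = 0; p < num_partitions; p++)
        hits[p] = 0;
    for (size_t y = 0; y + request_size < dataset_size; y += stride * request_size)
        for (size_t x = 0; x < request_size; x += PARTITION_WIDTH)
            hits[((y + x) / PARTITION_WIDTH) % num_partitions]++;
}
```

With 8 partitions, a request size of 256 and a stride of 8, every block lands in partition 0 under this model. With a request size of 256 * 8 = 2048, each block touches all 8 partitions once, whatever the stride.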

Regarding question 5: Mmmh… Because I used to use the “slow” SM clock as a reference, which runs at half the rate of the clock returned by “clock()”. On the other hand, my host code seems to use the clock rate returned by cudaGetDeviceProperties, which refers to the fast clock… This is likely a bug.

But multiplying the results by two does not make sense either (it gives more than the theoretical bandwidth), so there might be another bug that compensates for the error. ;) [Edit: never mind, there is indeed a multiplication by 2 in host code…]
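The two factors of two can be checked with a short calculation. The sketch below is a worked example, not the actual host code: it only shows that a cycle count divided by two (slow-clock cycles) combined with half the fast clock rate gives the same bandwidth as the raw fast-clock count combined with the fast clock rate. The function names are hypothetical. Note also that the `clockRate` field filled in by cudaGetDeviceProperties is expressed in kHz, so it has to be scaled to Hz before being used in a formula like this one.

```c
#include <assert.h>

/* Bytes per second from a byte count, a cycle count, and the frequency (Hz)
   of the clock in which those cycles were counted. */
double bandwidth_bytes_per_s(double bytes, double cycles, double clock_hz)
{
    assert(cycles > 0.0 && clock_hz > 0.0);
    return bytes * clock_hz / cycles;
}

/* Same measurement when the kernel reports cycles already divided by two
   (slow-clock cycles) while the host only knows the fast clock rate. */
double bandwidth_from_halved_cycles(double bytes, double halved_cycles,
                                    double fast_clock_hz)
{
    /* The slow clock runs at fast_clock_hz / 2, so the host-side factor
       of two cancels the division done in the kernel. */
    return bandwidth_bytes_per_s(bytes, halved_cycles, fast_clock_hz / 2.0);
}
```

Forgetting either factor changes the reported bandwidth by exactly 2x in one direction or the other, which is why a result above the theoretical peak is a useful sanity check.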

This is old, dusty code, so take its results with a grain of salt…
