Effective memory bandwidth?

Does “effective memory bandwidth” in https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#effective-bandwidth-calculation mean effective ‘global’ memory bandwidth?

Also, is effective global memory bandwidth the same thing as memory throughput?


The formula given, as well as the description, doesn’t have to apply to global memory. For example, with an appropriately designed test and use of appropriate metrics (covered in the next section there), it could refer to shared memory bandwidth.

Nevertheless, global memory bandwidth is what is commonly being referred to.

Memory throughput and memory bandwidth are often used interchangeably.
This could devolve into a semantics discussion, so I’ll leave it at that.
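
For reference, the formula in that section of the guide is: effective bandwidth in GB/s = ((Br + Bw) / 10^9) / time, where Br and Bw are the bytes read and written during the measured interval and time is in seconds. Here is a minimal sketch of my own (not from the guide; the kernel, names, and sizes are illustrative) applying it to a simple copy kernel:

```
// Minimal sketch (my example): measuring effective bandwidth of a simple
// copy kernel using the guide's formula. Names and sizes are illustrative.
#include <cstdio>

__global__ void copy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 24;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copy<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Br + Bw: each element is read once and written once.
    double bytes = 2.0 * n * sizeof(float);
    printf("effective bandwidth: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));
    return 0;
}
```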

If so, I’m confused about something.


In this picture, does ‘memory’ mean global memory?

And if the ‘throughput’ is about shared memory, do you mean SM to shared memory? Or global memory to shared memory? Or both?

In this picture, is ‘Throughput’ also about shared memory?

Memcpy(HtoD) is Host (DRAM) → Device (global memory) → shared memory, but is ‘Throughput’ only about shared memory?

If I have to consider memory contention, can I consider only shared memory, not global memory?

In that picture, it means device memory, i.e. the memory attached to the GPU. “Global” is properly used as a logical space identifier. The location of global memory is often, but not always, in device memory. Another possible location for it (for example) is system memory (e.g. pinned host memory).
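
To illustrate that last point with a sketch of my own (not from this discussion): a kernel can dereference a pointer through the global logical space while the physical backing is pinned system memory.

```
// Sketch (my example): zero-copy pinned host memory. The kernel accesses it
// through the global logical space, but the data physically resides in
// system memory, not device memory.
#include <cstdio>

__global__ void touch(int *p) { p[0] = 42; }

int main() {
    int *h = nullptr, *d = nullptr;
    cudaHostAlloc(&h, sizeof(int), cudaHostAllocMapped);  // pinned + mapped
    cudaHostGetDevicePointer(&d, h, 0);  // device-visible alias of the host buffer
    touch<<<1, 1>>>(d);
    cudaDeviceSynchronize();
    printf("%d\n", h[0]);  // prints 42, written by the GPU
    cudaFreeHost(h);
    return 0;
}
```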

Yes, when I was talking about shared memory, I was referring to the transfer path from SM to shared memory.

No, the throughput there is not about shared memory.

Memcpy(HtoD) refers to host (DRAM) to device (global memory) only.

OK, thanks.

So, if I want to know one kernel’s effective shared memory bandwidth, which profiler value should I look at?

I don’t recommend the NVIDIA Visual Profiler for use on an RTX 6000. You should use one of the new profilers. For gathering these kinds of metrics, the one to use is Nsight Compute. This blog should help with learning to use Nsight Compute and gather metrics (although it doesn’t cover shared memory specifically).

One possible approach (more or less consistent with the approach laid out in the best practices guide you already linked) would be to gather the metrics that track shared memory activity (loads, stores) and then divide by the timeframe of interest, such as the kernel duration. For example, you might use the metric for shared load transactions:

l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum

and there is a similar one for shared store transactions. The previously linked blog shows how to convert these to bytes. You could then divide by your measured kernel duration. However, looking at that metric table, there are already metrics for shared throughput, for example, for loads:

l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum.per_second

So that is probably easier.
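
To make that concrete (a toy example of mine; the kernel and binary names are hypothetical, and I’m assuming the store metric follows the same pattern with op_st), here is a kernel that stages data through shared memory, with an Nsight Compute command line for collecting its shared load and store throughput shown as a comment:

```
// Toy kernel (my example) that stages data through shared memory. Profile with:
//   ncu --metrics l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum.per_second,l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum.per_second ./staged_copy
__global__ void staged_copy(const float *in, float *out, int n) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];   // counted as shared store wavefronts
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x];  // counted as shared load wavefronts
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    staged_copy<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```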

I suggest asking detailed profiler usage questions on the forum for whichever profiler you are using.

Thank you for your kind reply.

One more thing.

Is there anything for global memory like the effective shared memory bandwidth? Or a metric for it?

I’m not sure exactly what you’re asking. However, did you look at the table I previously linked? There are global memory metrics listed there. I also list global memory metrics in the blog I linked.

I want to know the global memory bandwidth that one kernel uses.

Which value should I look at out of the many metrics?

Perhaps these from this table:

DRAM read throughput:

dram__bytes_read.sum.per_second

DRAM write throughput:

dram__bytes_write.sum.per_second
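
For example (a sketch with hypothetical names), you could collect both in a single Nsight Compute run on a kernel whose DRAM traffic is predictable and sanity-check the result. Note that caching can make the reported DRAM bytes differ somewhat from the nominal figure:

```
// Sanity check (my example): a kernel that nominally reads n*4 bytes and
// writes n*4 bytes of DRAM. Profile with:
//   ncu --metrics dram__bytes_read.sum.per_second,dram__bytes_write.sum.per_second ./scale_test
__global__ void scale(const float *in, float *out, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

int main() {
    const int n = 1 << 24;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(in, out, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```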