Memory bandwidth, datasheet 25.6GB/s

beyondTime · December 26, 2021, 2:42pm

“4GB 64 Bit LPDDR4 25.6GB/s”

What system bus/interface is this bandwidth number from?
(probably no random access, but more likely sequential access to dram modules)
How much protocol overhead is expected within memory bandwidth on system bus levels (for burst or continuous transfer speeds within optimized TX1 SoC, Max-Q design)?
Is there a schematic available¹ for the TX1 (Jetson Nano TM660M) showing the internal system bus (AMBA4/5? For the A57: “It is natively compatible with AMBA5 CHI and AMBA4 ACE”), the AXI interfaces, and the bit widths of the interfaces that connect the A57 cores and the 128-core Maxwell GM20B GPU to the system, peripherals and memory?

Thx

[1) Technical Reference Manual, NVIDIA Tegra X1 Mobile Processor, page 15 (of ~2977 pages, ~86MB PDF)]

Trumany · December 28, 2021, 2:20am

Hi, we are checking it internally, will update once available.

Trumany · December 28, 2021, 3:22am

Jetson Nano Memory Bandwidth:

64 bit * 3200 MT/s / 8 = 25.6 GB/s (3200 MT/s is the data rate: 1600 MHz clock, double data rate)
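
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the theoretical peak; the function name is made up for illustration, and the 16-bit per-channel figure is an assumption taken from the "16bit" channel width used later in this thread.

```python
def peak_bandwidth_gbs(bus_bits, transfers_per_s):
    """Theoretical peak in GB/s: bytes per transfer times transfers per second."""
    return bus_bits / 8 * transfers_per_s / 1e9

# 64-bit bus at 3200 MT/s (1600 MHz clock, two transfers per clock)
print(peak_bandwidth_gbs(64, 3200e6))   # whole interface
# Assumption: one 16-bit channel of the same interface
print(peak_bandwidth_gbs(16, 3200e6))   # single channel
```

The figure does not distinguish reads from writes; it is simply the rate at which the data pins can toggle.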

You can get the memory BW in real time with “tegrastats”.
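
As a rough sketch of how a tegrastats line could be turned into a bandwidth estimate: the code below looks for a field of the form EMC_FREQ x%@y and assumes x is the share of the peak bandwidth in use at an EMC clock of y MHz. The sample line is hypothetical and shortened, and the function name is made up for illustration.

```python
import re

# Matches fields such as "EMC_FREQ 12%@1600"
EMC_RE = re.compile(r"EMC_FREQ (\d+)%@(\d+)")

def emc_bandwidth_gbs(line, bus_bits=64):
    """Estimate used bandwidth in GB/s from one tegrastats line, or None."""
    m = EMC_RE.search(line)
    if m is None:
        return None  # field missing, or printed without the @frequency part
    utilization = int(m.group(1)) / 100.0
    clock_hz = int(m.group(2)) * 1e6
    # Assumption: double data rate, so two transfers per clock on every pin
    peak_bytes_per_s = clock_hz * 2 * bus_bits / 8
    return utilization * peak_bytes_per_s / 1e9

# Hypothetical, shortened sample line (not captured from a real device)
sample = "RAM 1500/3956MB EMC_FREQ 12%@1600 GR3D_FREQ 0%@76"
print(emc_bandwidth_gbs(sample))  # about 3.07 GB/s
```

If the field appears without the @y part, the sketch returns None instead of guessing a clock.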

beyondTime · January 1, 2022, 5:48pm

Hi, thank you for the calculation example, interface details and tool recommendation.

To get a feel for the actual memory bandwidth I used mbw, which showed ~3-3.5GB/s for memory copy (r/w) and ~8.6GB/s for memory fill (w).
It seems mbw does not fully saturate memory bandwidth.
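
For comparison, a single-threaded copy measurement in the spirit of mbw can be sketched in plain Python. This is not mbw itself; the function name and buffer sizes are arbitrary choices. It reports bytes copied per second, so the DRAM traffic behind it is roughly twice that (one read plus one write per byte).

```python
import time

def copy_bandwidth_gbs(size_mib=64, repeats=5):
    """Best-of-N throughput of a bulk buffer copy, in GB/s of bytes copied."""
    n = size_mib * 1024 * 1024
    src = bytearray(b"\x5a") * n   # non-zero fill so the source pages are really allocated
    dst = bytearray(n)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        dst[:] = src               # bulk copy between the two buffers
        best = min(best, time.perf_counter() - t0)
    return n / best / 1e9

if __name__ == "__main__":
    print(f"copy: {copy_bandwidth_gbs():.2f} GB/s")
```

A single CPU thread like this is unlikely to saturate the memory interface, so a result well below the datasheet number is expected.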

Is 25.6GB/s the read or the write bandwidth? What benchmarking tool could verify this number for read or write access to the LPDDR4?

Is it possible to add timestamps to tegrastats output log file?
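
Independent of what tegrastats offers itself, one workaround would be to pipe its output through a small filter that prefixes every line with the current time. A minimal sketch, with a hypothetical sample line standing in for real tegrastats output:

```python
from datetime import datetime

def stamp_lines(lines, clock=datetime.now):
    """Yield each non-empty line prefixed with an ISO-8601 timestamp."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            yield f"{clock().isoformat(timespec='milliseconds')} {line}"

# Hypothetical sample; in practice the lines would come from sys.stdin
sample = ["RAM 1500/3956MB EMC_FREQ 12%@1600\n"]
for stamped in stamp_lines(sample):
    print(stamped, flush=True)
```

Saved as, say, stamp.py (a made-up name) with the loop reading sys.stdin instead of the sample, it could be used as: tegrastats | python3 stamp.py >> tegrastats.log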

Thanks for all contributions

Trumany · January 4, 2022, 10:30am

25.6GB/s is the theoretical value. We can confirm it by measuring the DQ/DQS frequency. The real memory bandwidth depends on the DRAM protocol, the chip system design and the software application design.

T0->Tb4: invalid DQ. The self-refresh time and the command/address time can’t be ignored for READ timing.
Tb4->Tc2: valid DQ. The BW of this time = 25.6GB/s.

If the current memory bandwidth doesn’t meet your requirement, you should optimize the software application design or increase DQ freq.

beyondTime · January 8, 2022, 2:20am

In general, this is first about understanding where memory data is transferred, what the protocol overhead costs, which cores (CPU/GPU/NPU) get memory access, and in what priority order.

Figure 21: Read Timing

How to determine the cycle count for this example of a read access?
T0->T3: RD-1, CAS-2, 3CK

T3->Ta1: RL=AL+CL (read latency = additive latency + CAS latency)
AL (additive latency, defined within MR (Mode Register) register:) 0, CL-1, CL-2
CAS latency (Column-Address-Strobe latency: cycles between internal READ command (and DQSCK latency) and availability of first bit of output data) for 1600MHz DDR4 : CL =~ 14-16 (prob. no half-clock latencies)
… RL =~ 14-31 (?) CK
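
The RL range above can be reproduced by enumerating the combinations. This sketch assumes CL is one of 14, 15 or 16 and AL is one of 0, CL-1 or CL-2, exactly as listed above, and a 1600 MHz clock for the conversion to nanoseconds.

```python
CLOCK_MHZ = 1600                 # assumed clock, as used elsewhere in this post
CL_VALUES = (14, 15, 16)         # assumed CAS latency range from the text above

def additive_latencies(cl):
    """The three AL settings named above: 0, CL-1, CL-2."""
    return (0, cl - 1, cl - 2)

read_latencies = sorted({cl + al for cl in CL_VALUES for al in additive_latencies(cl)})
rl_min, rl_max = read_latencies[0], read_latencies[-1]
ns_per_ck = 1000 / CLOCK_MHZ     # 0.625 ns per clock at 1600 MHz

print(f"RL = {rl_min}-{rl_max} CK = {rl_min * ns_per_ck:.2f}-{rl_max * ns_per_ck:.2f} ns")
```

The upper end only applies when additive latency is actually enabled; with AL = 0 the read latency is just CL.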

Ta1->Tb4: 1-2CK
tLZ(DQS) - DQS Low Impedance Time from CK/CK#
tDQSCK (“is the actual position of a rising strobe edge relative to CK”, DDR4-3200 160ps =~ 1/4CK)
tDQSCK – DQS Output Access Time from CK/CK#
tDQSQ (skew between DQS and DQ)
tREFI (average interval of refresh commands (initiated by MC?) for device ~64ms/8192lines → (100us…) 7.8us (…0.9us) )
DQS_t (data strobe high pulse, true) DQS_c (data strobe low pulse, complement)
RPRE (Read Preamble, training/read leveling data strobe receivers, prog. to 1-2CK cycles, Tb2->Tb4)
RPST (Read Postamble)
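
Two of the numbers in this list can be cross-checked with simple arithmetic. A sketch, assuming a 64 ms refresh window spread over 8192 refresh commands and a 1600 MHz clock:

```python
REFRESH_WINDOW_S = 64e-3     # assumed retention window from the text above
REFRESH_COMMANDS = 8192      # refresh commands per window
CLOCK_HZ = 1600e6            # assumed clock
T_DQSCK_S = 160e-12          # tDQSCK value quoted above for DDR4-3200

t_refi_us = REFRESH_WINDOW_S / REFRESH_COMMANDS * 1e6   # average refresh interval
t_ck_ps = 1 / CLOCK_HZ * 1e12                           # clock period
dqsck_in_ck = T_DQSCK_S * 1e12 / t_ck_ps                # tDQSCK as a fraction of CK

print(f"tREFI = {t_refi_us:.4f} us")
print(f"tCK = {t_ck_ps:.0f} ps, tDQSCK = {dqsck_in_ck:.3f} CK")
```

This gives the 7.8us refresh interval and confirms that 160ps is roughly a quarter of a clock period.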

Tb4->Tc2: 8 CK (data transfer time: ~5ns)

T0->Tc2: ~26-44CK (1600MHz: 16-27ns, 37-62.5MT/s*16bit*16(single ended traces) ~1.2-2GB/s each MC channel)
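
The per-channel estimate above can be written out as a small calculation. It is a sketch of one isolated, non-pipelined read: a 16-bit channel delivering 16 beats (32 bytes) during the 8 CK data window, with the whole T0->Tc2 span of 26-44 CK charged to that single burst. The function name is made up for illustration.

```python
CLOCK_HZ = 1600e6   # assumed clock
DATA_CK = 8         # Tb4->Tc2, the window with valid DQ

def burst_bandwidth_gbs(total_ck, channel_bits=16, beats=16):
    """GB/s of one channel when every burst costs total_ck clock cycles."""
    bytes_per_burst = channel_bits * beats / 8
    seconds_per_burst = total_ck / CLOCK_HZ
    return bytes_per_burst / seconds_per_burst / 1e9

for total_ck in (26, 44):
    bw = burst_bandwidth_gbs(total_ck)
    share = DATA_CK / total_ck   # fraction of the time the data pins carry data
    print(f"{total_ck} CK: {bw:.2f} GB/s per channel, data window {share:.0%} of the time")
```

Real controllers overlap the command and latency phases of back-to-back reads, so sustained bandwidth can be far higher than this worst case.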

Tc2-> : precharge following to burst READ (BC4, BL8) ~4CK
READ (16bit data): ~(26-44)+4CK = ~(1-1.6)GB/s 1channel with precharge
consecutive READ (32bit data): ~(18-36)+16 =~(34-52)+4CK = ~(1.8-2.5)GB/s 1channel, incl. precharge
BC4 (64bit data): ~(18-36)+4x8=~(50-68)+4CK = ~(2.6-3.5)GB/s 1channel, incl. precharge
BL8 (128bit data): ~(18-36)+8x8=~(82-100)+4CK = ~(3.7-4.4)GB/s 1channel, incl precharge

[ BTW: another possibility might be hardware upgrade?
DDR5 Maintains Bandwidth with Increased Core Count, page 3
(non-linear increase of shared bandwidth with faster memory data frequency on multi core systems) ]

Trumany · January 26, 2022, 7:44am

Hi, you can check with the DDR vendor for more detailed information on that.
