Memory bandwidth, datasheet 25.6GB/s

“4GB 64 Bit LPDDR4 25.6GB/s”

What system bus/interface is this bandwidth number from?
(Presumably it refers to sequential rather than random access to the DRAM modules.)
How much protocol overhead is expected within the memory bandwidth at the system bus level (for burst or continuous transfer speeds within the optimized TX1 SoC, Max-Q design)?
Is there a schematic available¹ for the TX1 (Jetson Nano TM660M) showing the internal system bus (AMBA4/5? The A57 “is natively compatible with AMBA5 CHI and AMBA4 ACE”), the AXI interfaces, and the bit widths of the interfaces connecting the system, peripherals and memory to the A57 cores and the 128-core Maxwell GM20B GPU?

Thx

¹ Technical Reference Manual, NVIDIA Tegra X1 Mobile Processor, page 15 (of ~2977 pages; ~86 MB PDF)

Hi, we are checking it internally, will update once available.

Jetson Nano Memory Bandwidth:

64 bit * 3200 MT/s / 8 = 25.6 GB/s
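
As a minimal sketch of the arithmetic above, treating the 3200 figure as the per-pin transfer rate in mega-transfers per second (the function name is made up for illustration):

```python
def peak_bandwidth_gb_s(bus_width_bits: int, transfer_rate_mt_s: float) -> float:
    """Theoretical peak bandwidth in GB/s (decimal, 1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8          # 64 bit -> 8 bytes
    bytes_per_second = bytes_per_transfer * transfer_rate_mt_s * 1e6
    return bytes_per_second / 1e9


# 64-bit LPDDR4 interface at 3200 MT/s, as quoted in the datasheet line
print(f"{peak_bandwidth_gb_s(64, 3200):.1f} GB/s")
```

This is only the upper bound of the data bus; it says nothing about command, refresh or arbitration overhead.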

You can get the memory BW in real time with “tegrastats”.


Hi, thank you for the calculation example, interface details and tool recommendation.

I was trying to understand the memory bandwidth, which is why I used mbw. It showed ~3-3.5GB/s for memory copy (r/w) and ~8.6GB/s for memory fill (w) access.
It seems mbw does not fully saturate memory bandwidth.

Is 25.6GB/s the read or the write bandwidth? Which benchmarking tool could verify this number for read or write access to LPDDR4?

Is it possible to add timestamps to the tegrastats output log file?

Thanks for all contributions

25.6GB/s is the theoretical value. We can confirm it by measuring the DQ/DQS frequency. The real memory bandwidth depends on the DRAM protocol, the chip system design and the software application design.

T0->Tb4: invalid DQ. The self-refresh time and the command/address time can’t be ignored for READ timing.
Tb4->Tc2: valid DQ. The bandwidth during this interval = 25.6GB/s.

If the current memory bandwidth doesn’t meet your requirement, you should optimize the software application design or increase DQ freq.


In general, it is first about understanding where memory data is transferred, at what cost in protocol overhead, which cores (CPU/GPU/NPU) get memory access, and in what priority order.

Figure 21: Read Timing

How to determine the cycle count for this example of a read access?
T0->T3: RD-1, CAS-2, 3CK

T3->Ta1: RL = AL + CL (read latency = additive latency + CAS latency)
AL (additive latency, set in the Mode Register (MR)): 0, CL-1 or CL-2
CL (CAS latency, Column-Address-Strobe latency: cycles between the internal READ command (and DQSCK latency) and availability of the first bit of output data) for 1600MHz DDR4: CL =~ 14-16 (prob. no half-clock latencies)
… RL =~ 14-31 (?) CK

Ta1->Tb4: 1-2CK
tLZ(DQS) - DQS Low Impedance Time from CK/CK#
tDQSCK (“is the actual position of a rising strobe edge relative to CK”, DDR4-3200 160ps =~ 1/4CK)
tDQSCK – DQS Output Access Time from CK/CK#
tDQSQ (skew between DQS and DQ)
tREFI (average interval of refresh commands (initiated by MC?) for device ~64ms/8192lines → (100us…) 7.8us (…0.9us) )
DQS_t (data strobe high pulse, true) DQS_c (data strobe low pulse, complement)
RPRE (Read Preamble, training/read leveling data strobe receivers, prog. to 1-2CK cycles, Tb2->Tb4)
RPST (Read Postamble)

Tb4->Tc2: 8 CK (data transfer time: ~5ns)

T0->Tc2: ~26-44CK (1600MHz: 16-27ns, 37-62.5MT/s*16bit*16(single ended traces) ~1.2-2GB/s each MC channel)
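
A rough sketch of how the per-channel estimate above can be reproduced. It assumes a 1600 MHz clock, a 16-bit channel and 16 transfers per read (8 CK at double data rate), so 32 bytes per read; these are my reading of the line above, and the names are illustrative only.

```python
CK_MHZ = 1600                      # clock frequency used in the estimate above
BURST_BYTES = 16 * 16 // 8         # assumed: 16 transfers x 16-bit channel = 32 bytes


def ck_to_ns(cycles: float) -> float:
    """Convert clock cycles to nanoseconds at CK_MHZ."""
    return cycles * 1e3 / CK_MHZ


def read_bandwidth_gb_s(total_ck: float) -> float:
    """Effective bandwidth of one isolated read; bytes per ns equals GB/s."""
    return BURST_BYTES / ck_to_ns(total_ck)


# Best case (26 CK) and worst case (44 CK) for T0->Tc2
for total_ck in (26, 44):
    print(f"{total_ck} CK = {ck_to_ns(total_ck):.2f} ns "
          f"-> {read_bandwidth_gb_s(total_ck):.2f} GB/s per channel")
```

Only the 8 CK between Tb4 and Tc2 carry data, so an isolated read spends most of its time on latency rather than transfer.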

Tc2-> : precharge following the burst READ (BC4, BL8) ~4CK
READ (16bit data): ~(26-44)+4CK = ~(1-1.6)GB/s 1channel with precharge
consecutive READ (32bit data): ~(18-36)+16 =~(34-52)+4CK = ~(1.8-2.5)GB/s 1channel, incl. precharge
BC4 (64bit data): ~(18-36)+4x8=~(50-68)+4CK = ~(2.6-3.5)GB/s 1channel, incl. precharge
BL8 (128bit data): ~(18-36)+8x8=~(82-100)+4CK = ~(3.7-4.4)GB/s 1channel, incl precharge
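
The four estimates above follow one pattern: a fixed latency overhead, 8 CK of data per burst, and 4 CK of precharge at the end. A small sketch of that model, assuming an overhead of 18-36 CK, 32 bytes per burst and a 1600 MHz clock (0.625 ns per CK). Reporting the result in GiB/s (2^30 bytes) appears to land close to the figures listed; that unit choice is my assumption, not something stated in the post.

```python
CK_NS = 0.625        # one clock period at 1600 MHz
BURST_CK = 8         # data transfer time per burst (Tb4->Tc2)
BURST_BYTES = 32     # assumed: 16 transfers x 16-bit channel
PRECHARGE_CK = 4     # precharge after the last burst


def effective_gib_s(overhead_ck: int, n_bursts: int) -> float:
    """Effective bandwidth for n back-to-back bursts sharing one latency overhead."""
    total_ck = overhead_ck + n_bursts * BURST_CK + PRECHARGE_CK
    seconds = total_ck * CK_NS * 1e-9
    return n_bursts * BURST_BYTES / seconds / 2**30


for n in (1, 2, 4, 8):
    worst = effective_gib_s(36, n)   # longest assumed latency
    best = effective_gib_s(18, n)    # shortest assumed latency
    print(f"{n} burst(s): {worst:.1f}-{best:.1f} GiB/s per channel")
```

The trend is the point here: the more consecutive bursts share one command/latency phase, the closer a channel gets to its peak rate.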

[ BTW: another possibility might be hardware upgrade?
DDR5 Maintains Bandwidth with Increased Core Count, page 3
(non-linear increase of shared bandwidth with faster memory data frequency on multi core systems) ]