GPCDMA memory to memory low performance

Each DMA transfer is 12 MB, repeated for 1000 loops; total cost: 304.715996480 s, i.e. 12 MB × 1000 / 304.7 s ≈ 39.38 MBps?

The snippet is:

#define BUFF_SIZE (12 * 1024 * 1024)

static int do_memcpy_with_dma_iommu(void)
{
        struct dma_device       *dev;
        struct dma_chan *chan = NULL;
        dma_cap_mask_t mask;
        struct device *cdev;
        struct iommu_domain *domain;
        dma_addr_t dst_iova;
        dma_addr_t src_iova;
        struct dma_async_tx_descriptor *tx = NULL;
        dma_cookie_t dma_cookie;
        int i;
        int ret;
        ktime_t k0, k1;

        dma_cap_zero(mask);
        dma_cap_set(DMA_MEMCPY, mask);

        chan = dma_request_channel(mask, NULL, NULL);
        if (!chan) {
                printk("request channel failed\n");
                return -1;
        }

        dev = chan->device;
        cdev = dev->dev;
        domain = iommu_get_domain_for_dev(cdev);

        printk("%s domain %p\n", __func__, domain);
        if (!domain)
                goto out;

        dst_iova = iommu_dma_alloc_iova(cdev, BUFF_SIZE,
                        cdev->coherent_dma_mask);
        if (!dst_iova) {
                dev_err(cdev, "dst iommu_dma_alloc_iova() failed\n");
                goto out;
        }

        dev_info(cdev, "dst IOVA: 0x%08llx\n", dst_iova);
        ret = iommu_map(domain, dst_iova, dst_phys,
                        BUFF_SIZE, IOMMU_READ | IOMMU_WRITE);
        if (ret) {
                dev_err(cdev, "dst iommu_map() failed: %d\n", ret);
                goto out;
        }

        src_iova = iommu_dma_alloc_iova(cdev, BUFF_SIZE,
                        cdev->coherent_dma_mask);
        if (!src_iova) {
                dev_err(cdev, "src iommu_dma_alloc_iova() failed\n");
                goto out;
        }

        dev_info(cdev, "src IOVA: 0x%08llx\n", src_iova);

        ret = iommu_map(domain, src_iova, src_phys,
                        BUFF_SIZE, IOMMU_READ | IOMMU_WRITE);
        if (ret) {
                dev_err(cdev, "src iommu_map() failed: %d\n", ret);
                goto out;
        }

        k0 = ktime_get();
        for (i = 0; i < 1000; i++) {
                dma_finished = 0;
                //tx = dev->device_prep_dma_memcpy(chan, dst_phys, src_phys, BUFF_SIZE, DMA_PREP_INTERRUPT|DMA_CTRL_ACK);
                tx = dev->device_prep_dma_memcpy(chan, dst_iova, src_iova, BUFF_SIZE, DMA_PREP_INTERRUPT|DMA_CTRL_ACK);
                if (!tx) {
                        printk("prep_dma_memcpy failed\n");
                        dma_release_channel(chan);
                        return -1;
                }

                tx->callback = tx_callback;

                dma_cookie = dmaengine_submit(tx);
                if (dma_submit_error(dma_cookie))
                        printk("submit failed\n");

                dma_async_issue_pending(chan);

                wait_event_interruptible(wq, dma_finished);
        }

        k1 = ktime_get();
        printk("%s cost: d1=%lld\n", __func__, (k1.tv64 - k0.tv64));

out:
        dma_release_channel(chan);
        return 0;
}
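
For context, the snippet references wq, dma_finished, tx_callback and the src_phys/dst_phys buffers, which are defined elsewhere in the module; they look roughly like this (a sketch, not the exact definitions):

#include <linux/dmaengine.h>
#include <linux/wait.h>

/* Sketch of the pieces the snippet above relies on but does not show. */
static void *src, *dst;                 /* CPU addresses of the test buffers */
static dma_addr_t src_phys, dst_phys;   /* their bus/physical addresses */
static DECLARE_WAIT_QUEUE_HEAD(wq);
static int dma_finished;

static void tx_callback(void *arg)
{
        dma_finished = 1;
        wake_up_interruptible(&wq);
}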

any ideas? thanks.

There was one related thread: used gpc-dma instead of memcpy for memory copy - Jetson & Embedded Systems / Jetson TX2 - NVIDIA Developer Forums

sudo su
echo 1 > /sys/kernel/debug/bpmp/debug/clk/emc/mrq_rate_locked
cat /sys/kernel/debug/bpmp/debug/clk/emc/max_rate
echo {$max_rate} > /sys/kernel/debug/bpmp/debug/clk/emc/rate

Now the GPCDMA reaches 240 MBps, which is still low performance.

Did you boost system clocks?

sudo nvpmodel -m 0
sudo jetson_clocks

thanks.

yes.

sudo jetson_clocks is equivalent to
echo 1 > /sys/kernel/debug/bpmp/debug/clk/emc/mrq_rate_locked
cat /sys/kernel/debug/bpmp/debug/clk/emc/max_rate
echo {$max_rate} > /sys/kernel/debug/bpmp/debug/clk/emc/rate
right?

I re-tested, after
sudo nvpmodel -m 0
sudo jetson_clocks

the result is the same as 240MBps.
Each DMA transfer is 12 MB, 1000 loops, cost: 49.848582720 s; 12000 MB / 49.848 s ≈ 240 MBps.

Could you please tell me the expected performance of GPCDMA memory-to-memory transfers?

thanks.

Test using /sys/module/dmatest:

root@localhost:/sys/module/dmatest/parameters# sudo nvpmodel -m 0
root@localhost:/sys/module/dmatest/parameters# sudo jetson_clocks
[ 1550.515198] nvgpu: 17000000.gv11b railgate_enable_store:297 [INFO] railgate is disabled.
root@localhost:/sys/module/dmatest/parameters# echo dma0chan18 > channel
root@localhost:/sys/module/dmatest/parameters# echo 50 > iterations
root@localhost:/sys/module/dmatest/parameters#
root@localhost:/sys/module/dmatest/parameters# echo y > run
[ 1579.321872] dmatest: Started 1 threads using dma0chan18
root@localhost:/sys/module/dmatest/parameters# [ 1579.331275] dmatest: dma0chan18-copy: summary 50 tests, 0 failures 8268 iops 64660 KB/s (0)

Hi os.kernel,
What kind of device interface is connected to the Jetson? There are no more transfer parameters that can be set on the Jetson side. Maybe you can check the DMA configuration on your device, such as the data transfer size.

After running jetson_clocks,
looping 100 times:

block_size 10MB: 100 * 10MB/ 4.16s = 240 MBps
block_size 8MB: 100 * 8MB / 3.34s = 239MBps
block_size 4MB: 100 * 4MB / 1.67s = 239MBps
block_size 2MB: 100 * 2MB / 0.83s = 240MBps
block_size 1MB: 100 * 1MB / 0.42s = 238MBps

This is only GPC-DMA memory to memory; do we need to consider the device interface connected to the Jetson?
We use the Jetson developer kit.

thanks.

I notice the GPC-DMA uses the IOMMU.
How can I disable the IOMMU for the GPC-DMA?

Maybe that will achieve better performance; I need to try.

thanks.

need help, thanks.

hi os.kernel,
could you reference the CUDA sample code \samples\1_Utilities\bandwidthTest\ ? It can test the memory bandwidth using CUDA; from my test on NX, the D2D test is about 43.2 GB/s.

Try disabling the IOMMU in
hardware/nvidia/soc/t19x/kernel-dts/tegra194-soc/tegra194-soc-base.dtsi:
gpcdma: dma@2600000 {
compatible = "nvidia,tegra19x-gpcdma", "nvidia,tegra186-gpcdma";
reg = <0x0 0x2600000 0x0 0x210000>;
resets = <&bpmp_resets TEGRA194_RESET_GPCDMA>;
reset-names = "gpcdma";
interrupts = <0 75 0x04
#dma-cells = <1>;
/*iommus = <&smmu TEGRA_SID_GPCDMA_0>;
dma-coherent;
*/
nvidia,start-dma-channel-index = <1>;
dma-channels = <31>;

Then DMA with no IOMMU gives:

[ 110.604892] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0xc4300000, fsynr=0x310002, cb=4, sid=32(0x20 - GPCDMA), pgd=85c816003, pud=85c816003, pmd=0, pte=0
[ 110.605177] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu1, iova=0xc4f00000, fsynr=0x200012, cb=4, sid=32(0x20 - GPCDMA), pgd=85c816003, pud=85c816003, pmd=0, pte=0
[ 110.605519] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0xc4f0a100, fsynr=0x310012, cb=4, sid=32(0x20 - GPCDMA), pgd=85c816003, pud=85c816003, pmd=0, pte=0
[ 110.605766] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu1, iova=0xc43121c0, fsynr=0x200002, cb=4, sid=32(0x20 - GPCDMA), pgd=85c816003, pud=85c816003, pmd=0, pte=0
[ 110.606115] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0xc4f1d540, fsynr=0x310012, cb=4, sid=32(0x20 - GPCDMA), pgd=85c816003, pud=85c816003, pmd=0, pte=0
[ 110.606912] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu1, iova=0xc4f25880, fsynr=0x200012, cb=4, sid=32(0x20 - GPCDMA), pgd=85c816003, pud=85c816003, pmd=0, pte=0
[ 110.609128] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0xc4342ac0, fsynr=0x310002, cb=4, sid=32(0x20 - GPCDMA), pgd=85c816003, pud=85c816003, pmd=0, pte=0
[ 110.624748] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu1, iova=0xc43866c0, fsynr=0x200002, cb=4, sid=32(0x20 - GPCDMA), pgd=85c816003, pud=85c816003, pmd=0, pte=0
[ 110.640855] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0xc5187c40, fsynr=0x310012, cb=4, sid=32(0x20 - GPCDMA), pgd=85c816003, pud=85c816003, pmd=0, pte=0
[ 110.657028] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu1, iova=0xc4791bc0, fsynr=0x200002, cb=4, sid=32(0x20 - GPCDMA), pgd=85c816003, pud=85c816003, pmd=0, pte=0
[ 110.698071] mc-err: unknown mcerr fault, int_status=0x00001040, ch_int_status=0x00000200, hubc_int_status=0x00000000 sbs_int_status=0x00000000, hub_int_status=0x00000000
[ 110.698337] mc-err: unknown mcerr fault, int_status=0x00001040, ch_int_status=0x00000200, hubc_int_status=0x00000000 sbs_int_status=0x00000000, hub_int_status=0x00000000
[ 110.703291] mc-err: unknown mcerr fault, int_status=0x00001040, ch_int_status=0x00000200, hubc_int_status=0x00000000 sbs_int_status=0x00000000, hub_int_status=0x00000000
[ 110.718748] mc-err: unknown mcerr fault, int_status=0x00001040, ch_int_status=0x00000200, hubc_int_status=0x00000000 sbs_int_status=0x00000000, hub_int_status=0x00000000

thanks,

I am not familiar with CUDA.
Can CUDA perform memory-to-memory DMA?

No DMA with CUDA, only cudaMemcpy.
BTW, please copy your CPU mem2mem function here. Do you just copy data, or do you rearrange it, for example from planar to interlaced?

hi os.kernel,

Try this snippet to simply test the Jetson memory bandwidth:

#include <stdio.h>
#include <time.h>
#include <malloc.h>
#include <string.h>
#include <iostream>

#define BUFF_SIZE (12 * 1024 * 1024)
#define MEMCOPY_ITERATIONS 100

using namespace std;

int main()
{
    clock_t start, end;                 // clock_t variables

    size_t buf_size = sizeof(float) * BUFF_SIZE;

    float *dest = (float *)malloc(buf_size);
    float *src  = (float *)malloc(buf_size);

    // calculate bandwidth in MB/s
    float bandwidthInMBs;

    start = clock();                    // start time
    for (int i = 0; i < MEMCOPY_ITERATIONS; i++)
            memcpy(dest, src, buf_size);

    end = clock();                      // end time
    float elapsedTimeInS = (float)(end - start) / CLOCKS_PER_SEC;

    // calculate bandwidth in MB/s
    bandwidthInMBs = (buf_size / (1024 * 1024) * (float)MEMCOPY_ITERATIONS) / elapsedTimeInS;

    cout << "elapsed  : " << elapsedTimeInS << " [s] " << " bandwidth : " << bandwidthInMBs << " MB/s" << endl;

    free(dest);
    free(src);

    return 0;
}


On TX2, I can get about 3 GB/s.

Jeffli, thanks.

Here we are not dealing with the normal memcpy() of malloc()'d memory (that memory is CACHED).

The requirement is to transfer the NVBuffer's data to an allocated DDR address:
src is the NVBuffer
dest is the DDR address, which is later used as the source of a PCIe DMA

We expect to transfer the NVBuffer's data to dest (the DDR address) directly; memory-to-memory DMA is the first thing we considered.

We could also try memcpy directly, but we mmap the dest (DDR address) as NONCACHED.

If we mmap the dest (DDR address) as CACHED, we need to flush the cache before the PCIe DMA (see the sketch below).

That may still give low performance, and regardless of the performance, memcpy occupies the CPU, which is not what we want.
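
For the cached case, the flush before the PCIe DMA is just standard DMA-API cache maintenance. A minimal sketch (illustration only, not our driver code; it assumes dest is an ordinary physically contiguous kernel buffer, and pcie_dev and pcie_dma_prepare_src are made-up names for the PCIe endpoint's struct device and a helper):

#include <linux/dma-mapping.h>

/* Illustration: hand a cached buffer to a PCIe DMA engine that will read it. */
static int pcie_dma_prepare_src(struct device *pcie_dev, void *dest, size_t len,
                                dma_addr_t *handle)
{
        /* A DMA_TO_DEVICE mapping cleans the CPU caches for this range,
         * so the device sees whatever the CPU last wrote. */
        *handle = dma_map_single(pcie_dev, dest, len, DMA_TO_DEVICE);
        if (dma_mapping_error(pcie_dev, *handle))
                return -ENOMEM;

        /* ... program the PCIe DMA with *handle, wait for completion, then
         * dma_unmap_single(pcie_dev, *handle, len, DMA_TO_DEVICE); ... */
        return 0;
}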

Could we clarify what the expected GPC-DMA memory-to-memory performance is?

hi os.kernel,
I remember you said you tested memcpy without DMA first; can I review that code? Actually, we have not tested NvBuffer-to-DDR bandwidth before, so this is case by case.

OK. Here is the memcpy test without DMA, run in kernel context (not user context).
After running jetson_clocks,
looping 1000 times:

block_size 12MB: 1000 * 12MB / 170.72s = 70MBps


static void do_memcpy_no_dma(void)
{
        int i ;

        ktime_t k0, k1;

        k0 = ktime_get();
        memcpy(dst,src,BUFF_SIZE);
        k1 = ktime_get();

        printk("%s one time cost: d1=%lld\n", __func__, (k1.tv64 - k0.tv64));

        k0 = ktime_get();
        for (i = 0; i < 1000; i++)
                memcpy(dst, src, BUFF_SIZE);

        k1 = ktime_get();

        printk("%s cost: d1=%lld\n", __func__, (k1.tv64 - k0.tv64));
}

dst, src comes from:

int memcpy_init(void)
{
        src = dma_alloc_coherent(NULL, BUFF_SIZE, &src_phys, GFP_KERNEL);

        if (!src) {
                printk("err:%s:%d\n", __FILE__, __LINE__);
                goto _FAILED_ALLOC_SRC;
        }

        dst = dma_alloc_coherent(NULL, BUFF_SIZE, &dst_phys, GFP_KERNEL);

        if (!dst) {
                printk("err:%s:%d\n", __FILE__, __LINE__);
                goto _FAILED_ALLOC_DST;
        }

        printk("src %p phys %#llx\n", src, src_phys);
        printk("dst %p phys %#llx\n", dst, dst_phys);

        return 0;
_FAILED_ALLOC_DST:

        dma_free_coherent(NULL, BUFF_SIZE, src, src_phys);
_FAILED_ALLOC_SRC:

        return -1;

}

Hi,
I have measured DMA bandwidth using the dmatest module and got confusing results too.
An old Xavier AGX 16GB with JetPack 4.6 installed on SSD, with the stock kernel and device tree, was used (from dmesg):

[ 0.000000] Linux version 4.9.253-tegra (buildbrain@mobile-u64-5497-d3000) (gcc version 7.3.1 20180425 [linaro-7.3-2018.05 revision d29120a424ecfbc167ef90065c0eeb7f91977701] (Linaro GCC 7.3-2018.05) ) #1 SMP PREEMPT Mon Jul 26 12:19:28 PDT 2021

[ 0.000000] Kernel command line: console=ttyTCU0,115200 video=tegrafb earlycon=tegra_comb_uart,mmio32,0x0c168000 gpt rootfs.slot_suffix= tegra_fbmem=0x800000@0xa06aa000 lut_mem=0x2008@0xa06a4000 usbcore.old_scheme_first=1 tegraid=19.1.2.0.0 maxcpus=8 boot.slot_suffix= boot.ratchetvalues=0.4.2 vpr_resize sdhci_tegra.en_boot_part_access=1 quiet root=/dev/nvme0n1p1 rw rootwait rootfstype=ext4 console=ttyTCU0,115200n8 console=tty0 fbcon=map:0 net.ifnames=0 rootfstype=ext4

[ 0.440990] Tegra Revision: A02 SKU: 0xd0 CPU Process: 0 SoC Process: 0
[ 0.441014] DTS File Name: /dvs/git/dirty/git-master_linux/kernel/kernel-4.9/arch/arm64/boot/dts/…/…/…/…/…/…/hardware/nvidia/platform/t19x/galen/kernel-dts/common/tegra194-p2888-0001-p2822-0000-common.dtsi
[ 0.441026] DTB Build time: Jul 26 2021 12:22:25

Experiment:
dmatest is built into the kernel (BTW to kernel developers - why???), so the following script was used to run the test:

sudo ./run-dmatest.sh

Script run-dmatest.sh:

#!/bin/sh
#modprobe dmatest #not needed, built-in
echo 32 > /sys/module/dmatest/parameters/max_channels
echo 65536 > /sys/module/dmatest/parameters/test_buf_size
echo 4194304 > /sys/module/dmatest/parameters/test_buf_size   # one of these two lines per run (65536 or 4194304 block size)
echo 30000 > /sys/module/dmatest/parameters/timeout
echo 1000 > /sys/module/dmatest/parameters/iterations
echo 1 > /sys/module/dmatest/parameters/sg_buffers
echo 0 > /sys/module/dmatest/parameters/dmatest
echo 1 > /sys/module/dmatest/parameters/threads_per_chan
echo 1 > /sys/module/dmatest/parameters/run
cat /sys/module/dmatest/parameters/wait
grep -H . /sys/module/dmatest/parameters/*

Results read from dmesg with dmesg | grep summary

Block size 65536:
MAXN:
[ 438.768122] dmatest: dma0chan20-copy: summary 1000 tests, 0 failures 2209 iops 69949 KB/s (0)
[ 438.771761] dmatest: dma0chan17-copy: summary 1000 tests, 0 failures 2202 iops 71553 KB/s (0)
[ 438.771989] dmatest: dma0chan16-copy: summary 1000 tests, 0 failures 2208 iops 69567 KB/s (0)
[ 438.775881] dmatest: dma0chan26-copy: summary 1000 tests, 0 failures 2204 iops 68166 KB/s (0)
[ 438.785232] dmatest: dma0chan28-copy: summary 1000 tests, 0 failures 2141 iops 67932 KB/s (0)
[ 438.786662] dmatest: dma0chan27-copy: summary 1000 tests, 0 failures 2140 iops 69640 KB/s (0)
[ 438.787065] dmatest: dma0chan30-copy: summary 1000 tests, 0 failures 2148 iops 68354 KB/s (0)
[ 438.787275] dmatest: dma0chan23-copy: summary 1000 tests, 0 failures 2150 iops 68105 KB/s (0)
[ 438.788206] dmatest: dma0chan22-copy: summary 1000 tests, 0 failures 2097 iops 67662 KB/s (0)
[ 438.790893] dmatest: dma0chan25-copy: summary 1000 tests, 0 failures 2129 iops 69773 KB/s (0)
[ 438.794214] dmatest: dma0chan29-copy: summary 1000 tests, 0 failures 2142 iops 70105 KB/s (0)
[ 438.794776] dmatest: dma0chan21-copy: summary 1000 tests, 0 failures 2128 iops 67470 KB/s (0)
[ 438.795254] dmatest: dma0chan19-copy: summary 1000 tests, 0 failures 2070 iops 67319 KB/s (0)
[ 438.798631] dmatest: dma0chan24-copy: summary 1000 tests, 0 failures 2146 iops 68653 KB/s (0)
[ 438.801568] dmatest: dma0chan18-copy: summary 1000 tests, 0 failures 2125 iops 69192 KB/s (0)

jetson_clocks:
[ 643.924734] dmatest: dma0chan30-copy: summary 1000 tests, 0 failures 2294 iops 71422 KB/s (0)
[ 643.925402] dmatest: dma0chan16-copy: summary 1000 tests, 0 failures 2287 iops 72777 KB/s (0)
[ 643.926088] dmatest: dma0chan17-copy: summary 1000 tests, 0 failures 2292 iops 71762 KB/s (0)
[ 643.931447] dmatest: dma0chan28-copy: summary 1000 tests, 0 failures 2269 iops 73501 KB/s (0)
[ 643.932880] dmatest: dma0chan18-copy: summary 1000 tests, 0 failures 2263 iops 71505 KB/s (0)
[ 643.933867] dmatest: dma0chan19-copy: summary 1000 tests, 0 failures 2260 iops 70835 KB/s (0)
[ 643.934589] dmatest: dma0chan24-copy: summary 1000 tests, 0 failures 2257 iops 71426 KB/s (0)
[ 643.938081] dmatest: dma0chan21-copy: summary 1000 tests, 0 failures 2235 iops 70606 KB/s (0)
[ 643.946116] dmatest: dma0chan23-copy: summary 1000 tests, 0 failures 2237 iops 71277 KB/s (0)
[ 643.947822] dmatest: dma0chan26-copy: summary 1000 tests, 0 failures 2214 iops 69100 KB/s (0)
[ 643.950277] dmatest: dma0chan22-copy: summary 1000 tests, 0 failures 2195 iops 71835 KB/s (0)
[ 643.952924] dmatest: dma0chan29-copy: summary 1000 tests, 0 failures 2200 iops 71083 KB/s (0)
[ 643.954168] dmatest: dma0chan25-copy: summary 1000 tests, 0 failures 2213 iops 69869 KB/s (0)
[ 643.955027] dmatest: dma0chan27-copy: summary 1000 tests, 0 failures 2202 iops 69554 KB/s (0)
[ 643.955504] dmatest: dma0chan20-copy: summary 1000 tests, 0 failures 2241 iops 71702 KB/s (0)

Block size 4194304:
MAXN:
[ 524.270940] dmatest: dma0chan26-copy: summary 1000 tests, 0 failures 55 iops 109295 KB/s (0)
[ 524.308508] dmatest: dma0chan17-copy: summary 1000 tests, 0 failures 54 iops 110534 KB/s (0)
[ 524.334019] dmatest: dma0chan25-copy: summary 1000 tests, 0 failures 54 iops 109157 KB/s (0)
[ 524.348232] dmatest: dma0chan16-copy: summary 1000 tests, 0 failures 54 iops 110186 KB/s (0)
[ 524.455678] dmatest: dma0chan30-copy: summary 1000 tests, 0 failures 54 iops 110281 KB/s (0)
[ 524.479343] dmatest: dma0chan20-copy: summary 1000 tests, 0 failures 54 iops 109603 KB/s (0)
[ 524.605309] dmatest: dma0chan21-copy: summary 1000 tests, 0 failures 53 iops 109927 KB/s (0)
[ 524.636183] dmatest: dma0chan19-copy: summary 1000 tests, 0 failures 54 iops 110936 KB/s (0)
[ 524.718651] dmatest: dma0chan22-copy: summary 1000 tests, 0 failures 53 iops 110223 KB/s (0)
[ 524.752912] dmatest: dma0chan18-copy: summary 1000 tests, 0 failures 53 iops 110593 KB/s (0)
[ 524.762316] dmatest: dma0chan23-copy: summary 1000 tests, 0 failures 53 iops 110701 KB/s (0)
[ 524.826912] dmatest: dma0chan28-copy: summary 1000 tests, 0 failures 53 iops 110239 KB/s (0)
[ 524.862594] dmatest: dma0chan24-copy: summary 1000 tests, 0 failures 53 iops 110686 KB/s (0)
[ 524.876564] dmatest: dma0chan27-copy: summary 1000 tests, 0 failures 53 iops 110608 KB/s (0)
[ 524.947783] dmatest: dma0chan29-copy: summary 1000 tests, 0 failures 53 iops 110680 KB/s (0)

jetson_clocks:
[ 700.472583] dmatest: dma0chan27-copy: summary 1000 tests, 0 failures 55 iops 109378 KB/s (0)
[ 700.637407] dmatest: dma0chan24-copy: summary 1000 tests, 0 failures 54 iops 110444 KB/s (0)
[ 700.647094] dmatest: dma0chan19-copy: summary 1000 tests, 0 failures 54 iops 111367 KB/s (0)
[ 700.757616] dmatest: dma0chan16-copy: summary 1000 tests, 0 failures 54 iops 109272 KB/s (0)
[ 700.837068] dmatest: dma0chan17-copy: summary 1000 tests, 0 failures 54 iops 109808 KB/s (0)
[ 700.842938] dmatest: dma0chan20-copy: summary 1000 tests, 0 failures 53 iops 110722 KB/s (0)
[ 700.892875] dmatest: dma0chan28-copy: summary 1000 tests, 0 failures 54 iops 110583 KB/s (0)
[ 700.907389] dmatest: dma0chan22-copy: summary 1000 tests, 0 failures 54 iops 109977 KB/s (0)
[ 701.000307] dmatest: dma0chan23-copy: summary 1000 tests, 0 failures 53 iops 110369 KB/s (0)
[ 701.016942] dmatest: dma0chan21-copy: summary 1000 tests, 0 failures 53 iops 109994 KB/s (0)
[ 701.129819] dmatest: dma0chan18-copy: summary 1000 tests, 0 failures 53 iops 110798 KB/s (0)
[ 701.149128] dmatest: dma0chan29-copy: summary 1000 tests, 0 failures 53 iops 110440 KB/s (0)
[ 701.379302] dmatest: dma0chan26-copy: summary 1000 tests, 0 failures 52 iops 109679 KB/s (0)
[ 701.396119] dmatest: dma0chan30-copy: summary 1000 tests, 0 failures 52 iops 110056 KB/s (0)
[ 701.666956] dmatest: dma0chan25-copy: summary 1000 tests, 0 failures 52 iops 111213 KB/s (0)

Summary: bandwidth is less than 111000 KB/s for a 4 MB block and is not affected by system power saving. Larger blocks couldn't be tested with dmatest, as it uses 1 page.

Questions:

1. Why so bad? What is the bottleneck in MEM-to-MEM transfers with tegra-gpcdma (2600000.dma)?

Not so related questions:
1.1 EMC is loaded at only 13% while testing, according to jtop.

2. There are 2 DMA devices with 32 channels each, but only dma0chan16…dma0chan31 are used in the test, because those are the ones enumerated by dma_request_channel. So only these channels can be exclusively requested?
3. Which channels can be non-exclusively requested with dma_find_channel? (See the sketch after this list.)
4. Could 2930000.adma (tegra-adma) be used? How does its bandwidth compare to 2600000.dma (tegra-gpcdma)?
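
On questions 2 and 3, a rough sketch of the two dmaengine client styles (generic dmaengine API only, my own illustration; the helper names are made up):

#include <linux/dmaengine.h>

/* Exclusive use: request any available DMA_MEMCPY-capable channel for this
 * client only (this is what the tests above do). */
static struct dma_chan *get_exclusive_memcpy_chan(void)
{
        dma_cap_mask_t mask;

        dma_cap_zero(mask);
        dma_cap_set(DMA_MEMCPY, mask);
        return dma_request_channel(mask, NULL, NULL);  /* NULL if none free */
}

/* Non-exclusive use: dma_find_channel() only returns channels while a
 * dmaengine_get() reference is held, and the channel may be shared with
 * other clients. */
static struct dma_chan *get_shared_memcpy_chan(void)
{
        struct dma_chan *chan;

        dmaengine_get();                        /* take a dmaengine client reference */
        chan = dma_find_channel(DMA_MEMCPY);    /* may be NULL */
        /* ... use the channel; call dmaengine_put() when done ... */
        return chan;
}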

/sys
├── class
│ ├── dma
│ │ ├── dma0chan0 → …/…/devices/2600000.dma/dma/dma0chan0
│ │ ├── dma0chan1 → …/…/devices/2600000.dma/dma/dma0chan1
│ │ ├── dma0chan10 → …/…/devices/2600000.dma/dma/dma0chan10
│ │ ├── dma0chan11 → …/…/devices/2600000.dma/dma/dma0chan11
│ │ ├── dma0chan12 → …/…/devices/2600000.dma/dma/dma0chan12
│ │ ├── dma0chan13 → …/…/devices/2600000.dma/dma/dma0chan13
│ │ ├── dma0chan14 → …/…/devices/2600000.dma/dma/dma0chan14
│ │ ├── dma0chan15 → …/…/devices/2600000.dma/dma/dma0chan15
│ │ ├── dma0chan16 → …/…/devices/2600000.dma/dma/dma0chan16
│ │ ├── dma0chan17 → …/…/devices/2600000.dma/dma/dma0chan17
│ │ ├── dma0chan18 → …/…/devices/2600000.dma/dma/dma0chan18
│ │ ├── dma0chan19 → …/…/devices/2600000.dma/dma/dma0chan19
│ │ ├── dma0chan2 → …/…/devices/2600000.dma/dma/dma0chan2
│ │ ├── dma0chan20 → …/…/devices/2600000.dma/dma/dma0chan20
│ │ ├── dma0chan21 → …/…/devices/2600000.dma/dma/dma0chan21
│ │ ├── dma0chan22 → …/…/devices/2600000.dma/dma/dma0chan22
│ │ ├── dma0chan23 → …/…/devices/2600000.dma/dma/dma0chan23
│ │ ├── dma0chan24 → …/…/devices/2600000.dma/dma/dma0chan24
│ │ ├── dma0chan25 → …/…/devices/2600000.dma/dma/dma0chan25
│ │ ├── dma0chan26 → …/…/devices/2600000.dma/dma/dma0chan26
│ │ ├── dma0chan27 → …/…/devices/2600000.dma/dma/dma0chan27
│ │ ├── dma0chan28 → …/…/devices/2600000.dma/dma/dma0chan28
│ │ ├── dma0chan29 → …/…/devices/2600000.dma/dma/dma0chan29
│ │ ├── dma0chan3 → …/…/devices/2600000.dma/dma/dma0chan3
│ │ ├── dma0chan30 → …/…/devices/2600000.dma/dma/dma0chan30
│ │ ├── dma0chan4 → …/…/devices/2600000.dma/dma/dma0chan4
│ │ ├── dma0chan5 → …/…/devices/2600000.dma/dma/dma0chan5
│ │ ├── dma0chan6 → …/…/devices/2600000.dma/dma/dma0chan6
│ │ ├── dma0chan7 → …/…/devices/2600000.dma/dma/dma0chan7
│ │ ├── dma0chan8 → …/…/devices/2600000.dma/dma/dma0chan8
│ │ ├── dma0chan9 → …/…/devices/2600000.dma/dma/dma0chan9
│ │ ├── dma1chan0 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan0
│ │ ├── dma1chan1 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan1
│ │ ├── dma1chan10 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan10
│ │ ├── dma1chan11 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan11
│ │ ├── dma1chan12 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan12
│ │ ├── dma1chan13 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan13
│ │ ├── dma1chan14 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan14
│ │ ├── dma1chan15 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan15
│ │ ├── dma1chan16 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan16
│ │ ├── dma1chan17 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan17
│ │ ├── dma1chan18 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan18
│ │ ├── dma1chan19 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan19
│ │ ├── dma1chan2 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan2
│ │ ├── dma1chan20 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan20
│ │ ├── dma1chan21 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan21
│ │ ├── dma1chan22 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan22
│ │ ├── dma1chan23 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan23
│ │ ├── dma1chan24 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan24
│ │ ├── dma1chan25 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan25
│ │ ├── dma1chan26 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan26
│ │ ├── dma1chan27 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan27
│ │ ├── dma1chan28 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan28
│ │ ├── dma1chan29 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan29
│ │ ├── dma1chan3 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan3
│ │ ├── dma1chan30 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan30
│ │ ├── dma1chan31 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan31
│ │ ├── dma1chan4 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan4
│ │ ├── dma1chan5 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan5
│ │ ├── dma1chan6 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan6
│ │ ├── dma1chan7 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan7
│ │ ├── dma1chan8 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan8
│ │ └── dma1chan9 → …/…/devices/aconnect@2a41000/2930000.adma/dma/dma1chan9