[orin] nvpva_queue_task_pool_alloc: failed to allocate task_pool->kmem_addr

Hi,
JetPack: 5.1.2
After I boot up my Orin board and init the camera, I use VPI to convert BGRA to BGR and run undistortion, but it crashes in the kernel:

Oct 10 06:27:13 GT2V7-00091 kernel: [86510.757625] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.757626] 14011180 total pagecache pages
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.757628] 0 pages in swap cache
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.757629] Swap cache stats: add 0, delete 0, find 0/0
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.757630] Free swap  = 0kB
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.757631] Total swap = 0kB
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.757632] 16452608 pages RAM
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.757633] 0 pages HighMem/MovableOnly
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.757633] 377466 pages reserved
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.757634] 131072 pages cma reserved
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.757635] 0 pages hwpoisoned
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.757640] pva 16000000.pva0: nvpva_queue_task_pool_alloc: failed to allocate task_pool->kmem_addr
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.766994] BUG: Bad page state in process drivers_camera_  pfn:2af600
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.773819] page:000000007d900dc3 refcount:0 mapcount:0 mapping:0000000000000000 index:0xffffa5400 pfn:0x2af600
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.784239] head:000000007d900dc3 order:9 compound_mapcount:1 compound_pincount:0
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.792021] flags: 0x8000000000090014(uptodate|lru|head|swapbacked)
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.798498] raw: 8000000000090014 ffffffbfd05adb08 ffffffbfd1593188 0000000000000000
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.806475] raw: 0000000ffffa5400 0000000000000000 00000000ffffffff ffff6ff237fce000
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.814542] page dumped because: page still charged to cgroup
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.820477] page->mem_cgroup:ffff6ff237fce000
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.824968] Modules linked in: realtime_util_ko snd_soc_tegra210_ope snd_soc_tegra186_dspk snd_soc_tegra210_iqc snd_soc_tegra186_asrc snd_soc_tegra210_mvc snd_soc_tegra186_arad snd_soc_tegra210_afc snd_soc_tegra210_adx snd_soc_tegra210_dmic snd_soc_tegra210_amx snd_soc_tegra210_mixer snd_soc_tegra210_admaif snd_soc_tegra_pcm snd_soc_tegra210_i2s ramoops snd_soc_tegra210_sfc reed_solomon aes_ce_blk crypto_simd binfmt_misc cryptd aes_ce_cipher ghash_ce sha2_ce sha256_arm64 sha1_ce snd_soc_tegra_machine_driver snd_soc_spdif_tx snd_soc_tegra210_adsp nct1008 snd_soc_tegra_utils snd_soc_simple_card_utils snd_soc_max9867 snd_hda_codec_hdmi nvadsp snd_hda_tegra tegra210_adma snd_hda_codec snd_soc_tegra210_ahub tegra_bpmp_thermal snd_hda_core snd_soc_rt5640 snd_soc_rl6231 ina3221 pwm_fan nvgpu nvmap nfsd
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825024] CPU: 9 PID: 1660161 Comm: drivers_camera_ Not tainted 5.10.104-tegra #28
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825025] Hardware name: Unknown Jetson AGX Orin/Jetson AGX Orin, BIOS 3.1-32827747 03/19/2023
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825026] Call trace:
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825032]  dump_backtrace+0x0/0x1d0
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825034]  show_stack+0x20/0x30
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825039]  dump_stack+0xdc/0x140
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825042]  bad_page+0xe4/0x110
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825044]  check_free_page_bad+0x84/0x90
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825046]  __free_pages_ok+0x290/0x480
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825047]  __free_pages+0xc8/0xe0
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825049]  kfree+0x3d0/0x470
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825053]  nvpva_queue_alloc+0x41c/0x430
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825055]  pva_open+0x64/0x110
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825058]  chrdev_open+0xac/0x1b0
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825059]  do_dentry_open+0x134/0x3a0
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825061]  vfs_open+0x34/0x40
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825062]  path_openat+0x850/0xdd0
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825064]  do_filp_open+0x80/0x110
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825065]  do_sys_openat2+0x1f8/0x2b0
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825066]  do_sys_open+0x60/0xb0
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825068]  __arm64_sys_openat+0x2c/0x40
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825070]  el0_svc_common.constprop.0+0x80/0x1d0
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825072]  do_el0_svc+0x2c/0xa0
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825074]  el0_svc+0x20/0x30
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825075]  el0_sync_handler+0xb0/0xc0
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825076]  el0_sync+0x184/0x1c0
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.825078] Disabling lock debugging due to kernel taint
Oct 10 06:27:13 GT2V7-00091 drivers_camera_main: Module_id 55 Severity 2 : PVAUMD: "Failed to open PVA device node"
Oct 10 06:27:13 GT2V7-00091 drivers_camera_main: Module_id 55 Severity 2 : PVAUMD: "Failed to open PVA device for engine, queue " 0 0
Oct 10 06:27:13 GT2V7-00091 drivers_camera_main: Module_id 54 Severity 2 : PVAINTF: "Failed to Create UMD context"
Oct 10 06:27:13 GT2V7-00091 drivers_camera_main: Module_id 54 Severity 2 : PVAINTF: "Failed to initialize context. error=" 3
Oct 10 06:27:13 GT2V7-00091 drivers_camera_main: Module_id 55 Severity 2 : PVAUMD: "Failed to open PVA device node"
Oct 10 06:27:13 GT2V7-00091 drivers_camera_main: Module_id 55 Severity 2 : PVAUMD: "Failed to open PVA device for engine, queue " 0 0
Oct 10 06:27:13 GT2V7-00091 drivers_camera_main: Module_id 54 Severity 2 : PVAINTF: "Failed to Create UMD context"
Oct 10 06:27:13 GT2V7-00091 drivers_camera_main: Module_id 54 Severity 2 : PVAINTF: "Failed to initialize context. error=" 3
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.832736] pva 16000000.pva0: nvpva_queue_task_pool_alloc: failed to allocate task_pool->kmem_addr
Oct 10 06:27:13 GT2V7-00091 kernel: [86510.849825] pva 16000000.pva0: nvpva_queue_task_pool_alloc: failed to allocate task_pool->kmem_addr

It only recovers after the next reboot.
Please check syslog here:
syslog.tar.gz (274.6 KB)
Is there any way to find out what kind of bug (hardware or software) causes this problem?
Thanks!
BR/Tim

It looks like a memory leak is causing the problem.
Could you monitor the free memory to confirm?
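
One way to do that monitoring from inside the process is to read /proc/meminfo periodically and log a couple of fields. The sketch below is only an illustration: the helper names readTextFile and meminfoKb are made up for this example, and it assumes the usual "Key:   value kB" layout of /proc/meminfo.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: read a whole text file (e.g. /proc/meminfo) into a string.
// Returns an empty string if the file cannot be opened.
std::string readTextFile(const std::string& path) {
  std::ifstream file(path);
  std::stringstream buffer;
  buffer << file.rdbuf();
  return buffer.str();
}

// Hypothetical helper: return the value in kB for a /proc/meminfo key such as
// "MemFree" or "MemAvailable", or -1 if the key is missing or malformed.
long long meminfoKb(const std::string& text, const std::string& key) {
  std::istringstream in(text);
  std::string line;
  const std::string prefix = key + ":";
  while (std::getline(in, line)) {
    if (line.compare(0, prefix.size(), prefix) != 0) continue;
    std::istringstream fields(line.substr(prefix.size()));
    long long kb = 0;
    if (fields >> kb) return kb;
    return -1;
  }
  return -1;
}
```

Calling meminfoKb(readTextFile("/proc/meminfo"), "MemAvailable") every few seconds and logging the result next to a timestamp should show whether free memory trends downward before the failure.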

Hi @ShaneCCC
I see it has almost 9 GB of free memory in the Normal zone here:


There is also no OOM error printed, so I think there should be enough memory.
Thanks for your reply!

Hi,
Please check the quick start in the developer guide and make sure you follow the steps one by one:
https://docs.nvidia.com/jetson/archives/r36.3/DeveloperGuide/IN/QuickStart.html
If the device still cannot be flashed or booted, please refer to this page to get the UART log from the device:
https://elinux.org/Jetson/General_debug
If you are using a custom board, you can compare the UART logs of the developer kit and the custom board to get more information.

Thanks!

Hi,

Please check whether the command below helps (it needs to be run as root):

sync && echo 3 > /proc/sys/vm/drop_caches

Thanks.

Sorry, you’ve confused me. What’s your point? Should I collect the UART log?

Hi @AastaLLL ,
Would you please explain why I need to drop the cache? Did you find a clue?
BR/Tim

Hi,

This is a known issue and is related to large pages rather than to the total amount of free memory.
The kernel runs out of large pages in the slab allocator, so forcing the caches to be released might help:

$ sync && echo 3 > /proc/sys/vm/drop_caches

Another workaround (WAR) is to make the kernel reclaim memory more aggressively, which helps keep more large pages available:

$ sudo bash -c 'echo 100 > /proc/sys/vm/watermark_scale_factor'

Thanks.
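
To check whether large pages really are running short, the per-order free block counts in /proc/buddyinfo can be inspected before and after applying the workarounds. Below is a minimal sketch, assuming the usual "Node N, zone NAME c0 c1 c2 ..." line layout where the k-th count is the number of free blocks of order k. The names BuddyZone, parseBuddyLine and freeBlocksAtOrAbove are hypothetical. The thread does not say which order the PVA driver requests (the bad page in the trace is order 9), so the threshold is left as a parameter.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One zone line from /proc/buddyinfo; freeBlocks[k] = free blocks of order k.
struct BuddyZone {
  std::string zone;
  std::vector<unsigned long> freeBlocks;
};

// Parse a line such as "Node 0, zone   Normal   10 8 6 ...".
// Returns false if the line does not have the expected shape.
bool parseBuddyLine(const std::string& line, BuddyZone& out) {
  std::istringstream in(line);
  std::string nodeWord, nodeId, zoneWord;
  if (!(in >> nodeWord >> nodeId >> zoneWord >> out.zone)) return false;
  if (nodeWord != "Node" || zoneWord != "zone") return false;
  out.freeBlocks.clear();
  unsigned long count = 0;
  while (in >> count) out.freeBlocks.push_back(count);
  return !out.freeBlocks.empty();
}

// Sum the free blocks whose order is at least minOrder.
unsigned long freeBlocksAtOrAbove(const BuddyZone& z, std::size_t minOrder) {
  unsigned long total = 0;
  for (std::size_t order = minOrder; order < z.freeBlocks.size(); ++order) {
    total += z.freeBlocks[order];
  }
  return total;
}
```

If the high-order counts for the Normal zone sit at or near zero while plenty of memory is reported free, that would be consistent with fragmentation rather than a leak.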

Thanks @AastaLLL
But there is still a problem: how can I know that nvpva_queue_task_pool_alloc failed? I need to recover my process automatically.
Right now I just call VPI init, an error is printed in the kernel log, and no error is returned to my process.
BR/Tim

Hi,

The VPI function should fail with an error code such as VPI_ERROR_INTERNAL (cudaErrorDevicesUnavailable).
If you don’t get such a return value, please attach your source so we can check it further.

Thanks.

Hi @AastaLLL
Sorry for the late reply.
I checked my code, and the problem only occurs when I init my undistortion component. It only calls the VPI functions below:

  auto status = vpiWarpMapAllocData(&mapin);
  if (status != VPI_SUCCESS) {
    AERROR << "vpiWarpMapAllocData failed:" << vpiStatusGetName(status);
    return false;
  }
  status = vpiWarpMapGenerateFromFisheyeLensDistortionModel(K, X, K, &distModel,
                                                            &mapin);
  if (VPI_SUCCESS != status) {
    AERROR << "vpiWarpMapGenerateFromFisheyeLensDistortionModel failed: "
           << vpiStatusGetName(status);
    return false;
  }

  // Create the Remap payload for undistortion given the map generated above.
  status = vpiCreateRemap(VPI_BACKEND_VIC, &mapin, &remap_);
  if (VPI_SUCCESS != status) {
    AERROR << "vpiCreateRemap failed: " << vpiStatusGetName(status);
    return false;
  }

  // Now that the remap payload is created, we can destroy the warp mapin.
  vpiWarpMapFreeData(&mapin);
  status = vpiStreamCreate(VPI_BACKEND_VIC, &stream_);
  if (VPI_SUCCESS != status) {
    AERROR << "vpiStreamCreate failed: " << vpiStatusGetName(status);
    return false;
  }

  VPIImageData out_buffer, in_buffer;
  out_buffer.bufferType = VPI_IMAGE_BUFFER_NVBUFFER;
  out_buffer.buffer.fd = out_buf.fd();
  out_pipeline_ = out_buf;
  CHECK_STATUS(vpiImageCreateWrapper(&out_buffer, nullptr, 0, &out_buff_));

  in_buffer.bufferType = VPI_IMAGE_BUFFER_NVBUFFER;
  in_buffer.buffer.fd = in_buf.fd();
  CHECK_STATUS(vpiImageCreateWrapper(&in_buffer, nullptr, 0, &in_buff_));

No error is returned when nvpva_queue_task_pool_alloc fails. I also never use the PVA backend anywhere.
BR/Tim
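
If the VPI calls really return no error, one possible fallback for automatic recovery is to watch the kernel log for the driver message quoted above. This is only a sketch, not a VPI mechanism: the function names are hypothetical, and the matched text is copied from the log in this thread, so it may differ on other driver versions.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>  // std::istringstream, handy for feeding captured log text
#include <string>

// Returns true if a kernel log line contains the PVA task pool allocation
// failure message seen in this thread.
bool isPvaTaskPoolAllocFailure(const std::string& logLine) {
  static const std::string needle =
      "nvpva_queue_task_pool_alloc: failed to allocate";
  return logLine.find(needle) != std::string::npos;
}

// Counts matching lines in any text stream of kernel log output.
std::size_t countPvaAllocFailures(std::istream& log) {
  std::size_t hits = 0;
  std::string line;
  while (std::getline(log, line)) {
    if (isPvaTaskPoolAllocFailure(line)) ++hits;
  }
  return hits;
}
```

The stream could be fed from the output of `dmesg --follow` or `journalctl -k -f` run by a small watchdog. Reading the kernel log may need elevated privileges, depending on the system configuration.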

Hi,

Would you mind sharing the complete source so we can test it internally?
Also, can this issue be reproduced with VPI 3.x (JetPack 6)?

Thanks.