R21.5 on TK1: Strange OutOfMemory killer and low HighMem value

Hello,

We hit this strange OOM trace, with a very low free value for the HighMem zone:

[51381.921936] xmlrpcphone-sta invoked oom-killer: gfp_mask=0x201d2, order=0, oom_score_adj=0
[51381.921945] CPU: 0 PID: 25443 Comm: xmlrpcphone-sta Tainted: G        W  O 3.10.40-gde7aafa-dirty #59
[51381.921961] [<c0016418>] (unwind_backtrace+0x0/0x140) from [<c0012f48>] (show_stack+0x18/0x1c)
[51381.921971] [<c0012f48>] (show_stack+0x18/0x1c) from [<c07f4fe0>] (dump_header.isra.13+0x74/0xb0)
[51381.921980] [<c07f4fe0>] (dump_header.isra.13+0x74/0xb0) from [<c07f5070>] (oom_kill_process.part.15+0x54/0x394)
[51381.921988] [<c07f5070>] (oom_kill_process.part.15+0x54/0x394) from [<c0106f84>] (out_of_memory+0x12c/0x1cc)
[51381.921996] [<c0106f84>] (out_of_memory+0x12c/0x1cc) from [<c010b7f4>] (__alloc_pages_nodemask+0x8e0/0x90c)
[51381.922003] [<c010b7f4>] (__alloc_pages_nodemask+0x8e0/0x90c) from [<c0105cac>] (filemap_fault+0x19c/0x3c8)
[51381.922010] [<c0105cac>] (filemap_fault+0x19c/0x3c8) from [<c0124a28>] (__do_fault+0x88/0x4c4)
[51381.922026] [<c0124a28>] (__do_fault+0x88/0x4c4) from [<c01282ac>] (handle_pte_fault+0xb8/0x1d8)
[51381.922033] [<c01282ac>] (handle_pte_fault+0xb8/0x1d8) from [<c0128430>] (__handle_mm_fault+0x64/0x90)
[51381.922040] [<c0128430>] (__handle_mm_fault+0x64/0x90) from [<c001e628>] (do_page_fault+0xcc/0x318)
[51381.922046] [<c001e628>] (do_page_fault+0xcc/0x318) from [<c0008434>] (do_PrefetchAbort+0x3c/0xa4)
[51381.922053] [<c0008434>] (do_PrefetchAbort+0x3c/0xa4) from [<c000ee14>] (ret_from_exception+0x0/0x10)
[51381.922057] Exception stack(0xd7e87fb0 to 0xd7e87ff8)
[51381.922061] 7fa0:                                     7727a4bc 00000000 7727a4dc 7727a4bc
[51381.922066] 7fc0: b6feee70 7727b050 7727b050 7727aa48 7727ab90 00000000 7727b284 7727a4c4
[51381.922070] 7fe0: b6fef0dc 7727a4b8 b6f28b04 b654a714 60070030 ffffffff
[51381.922073] Mem-info:
[51381.922077] Normal per-cpu:
[51381.922080] CPU    0: hi:  186, btch:  31 usd:   0
[51381.922084] CPU    1: hi:  186, btch:  31 usd:   0
[51381.922087] CPU    2: hi:  186, btch:  31 usd:  30
[51381.922090] CPU    3: hi:  186, btch:  31 usd:  30
[51381.922093] HighMem per-cpu:
[51381.922096] CPU    0: hi:  186, btch:  31 usd:  32
[51381.922099] CPU    1: hi:  186, btch:  31 usd: 105
[51381.922101] CPU    2: hi:  186, btch:  31 usd:   2
[51381.922104] CPU    3: hi:  186, btch:  31 usd:  30
[51381.922111] active_anon:426022 inactive_anon:12256 isolated_anon:0
[51381.922111]  active_file:233 inactive_file:1359 isolated_file:17
[51381.922111]  unevictable:48 dirty:5 writeback:0 unstable:0
[51381.922111]  free:15991 slab_reclaimable:1844 slab_unreclaimable:4940
[51381.922111]  mapped:19162 shmem:21918 pagetables:1133 bounce:0
[51381.922111]  free_cma:4024
[51381.922124] Normal free:63676kB min:2712kB low:3388kB high:4068kB active_anon:343020kB inactive_anon:25460kB active_file:396kB inactive_file:2256kB unevictable:0kB isolated(anon):0kB isolated(file):68kB present:507904kB managed:460160kB mlocked:0kB dirty:16kB writeback:0kB mapped:6792kB shmem:56832kB slab_reclaimable:7376kB slab_unreclaimable:19760kB kernel_stack:1312kB pagetables:4532kB unstable:0kB bounce:0kB free_cma:16096kB writeback_tmp:0kB pages_scanned:3553 all_unreclaimable? yes
[51381.922127] lowmem_reserve[]: 0 11416 11416
[51381.922182] HighMem free:288kB min:512kB low:2664kB high:4816kB active_anon:1361068kB inactive_anon:23564kB active_file:536kB inactive_file:3180kB unevictable:192kB isolated(anon):0kB isolated(file):0kB present:1461248kB managed:1461248kB mlocked:192kB dirty:4kB writeback:0kB mapped:69856kB shmem:30840kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:5728 all_unreclaimable? yes
[51381.922185] lowmem_reserve[]: 0 0 0
[51381.922192] Normal: 844*4kB (UE) 336*8kB (UEM) 185*16kB (UEM) 98*32kB (UEMC) 68*64kB (UEMC) 31*128kB (UEC) 15*256kB (U) 3*512kB (UC) 3*1024kB (UC) 1*2048kB (C) 8*4096kB (MRC) = 63744kB
[51381.922224] HighMem: 23*4kB (M) 22*8kB (UM) 5*16kB (U) 1*32kB (M) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 444kB
[51381.922247] 23645 total pagecache pages
[51381.922250] 0 pages in swap cache
[51381.922253] Swap cache stats: add 0, delete 0, find 0/0
[51381.922256] Free swap  = 0kB
[51381.922258] Total swap = 0kB
[51381.931635] 515840 pages of RAM
[51381.931692] 16904 free pages
[51381.931695] 31265 reserved pages
[51381.931697] 6806 slab pages
[51381.931700] 282708 pages shared
[51381.931703] 0 pages swap cached
[51381.931707] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[51381.931724] [  272]     0   272      656      154       4        0             0 upstart-udev-br
[51381.931729] [  284]     0   284      521       41       4        0             0 rpc.idmapd
[51381.931734] [  292]   103   292      853      125       4        0             0 dbus-daemon
[51381.931739] [  304]     0   304     2361      189       5        0         -1000 systemd-udevd
[51381.931743] [  368]     0   368      841       96       4        0             0 systemd-logind
[51381.931748] [  402]   101   402     7619      153       8        0             0 rsyslogd
[51381.931752] [  933]     0   933      506       78       4        0             0 rpcbind
[51381.931757] [  965]   118   965      544      126       3        0             0 rpc.statd
[51381.931761] [ 1023]     0  1023      726      273       4        0             0 upstart-file-br
[51381.931766] [ 1026]     0  1026      519       79       3        0             0 upstart-socket-
[51381.931770] [ 1163]     0  1163    12699      292      15        0             0 NetworkManager
[51381.931775] [ 1165]     0  1165      964       48       5        0             0 getty
[51381.931779] [ 1167]     0  1167      964       48       4        0             0 getty
[51381.931784] [ 1172]     0  1172      964       48       5        0             0 getty
[51381.931788] [ 1173]     0  1173      964       48       5        0             0 getty
[51381.931792] [ 1176]     0  1176      964       48       5        0             0 getty
[51381.931797] [ 1197]     0  1197     1471      140       5        0         -1000 sshd
[51381.931801] [ 1210]     0  1210      565       67       4        0             0 cron
[51381.931805] [ 1245]     0  1245     8675      153      10        0             0 polkitd
[51381.931810] [ 1246]   119  1246      483       53       3        0             0 dnsproxy
[51381.931814] [ 1425]     0  1425      964       48       5        0             0 getty
[51381.931818] [ 1426]     0  1426     1349      168       5        0             0 login
[51381.931835] [ 1477]  1000  1477     1150       60       4        0             0 bash
[51381.931840] [ 1550]     0  1550     2763      236       7        0             0 sshd
[51381.931844] [ 1593]  1000  1593     2763      214       6        0             0 sshd
[51381.931849] [ 1596]  1000  1596     1149      105       4        0             0 bash
[51381.931853] [ 1666]   117  1666     1069      121       4        0             0 ntpd
[51381.931857] [ 1786]     0  1786     1889      157       5        0             0 sudo
[51381.931861] [ 1793]     0  1793     1777      145       6        0             0 su
[51381.932085] [ 1802]     0  1802     1168      123       5        0             0 bash
[51381.932092] [ 1835]     0  1835   381360   347823     721        0             0 xmlrpcphone-sta
[51381.932096] [ 1852]     0  1852      567       86       3        0             0 tmux
[51381.932100] [ 1854]     0  1854     1409      881       5        0             0 tmux
[51381.932104] [ 1855]     0  1855     1176      133       4        0             0 bash
[51381.932109] [ 2573]     0  2573     1176      131       5        0             0 bash
[51381.932114] [14079]     0 14079     2804      239       8        0             0 sshd
[51381.932118] [14144]  1000 14144     2804      218       7        0             0 sshd
[51381.932123] [14147]  1000 14147     1152      107       4        0             0 bash
[51381.932127] [14159]     0 14159     1889      157       6        0             0 sudo
[51381.932131] [14167]     0 14167     1777      145       6        0             0 su
[51381.932135] [14176]     0 14176     1170      110       5        0             0 bash
[51381.932141] [25193]     0 25193     2797      239       7        0             0 sshd
[51381.932146] [25246]  1000 25246     2797      234       6        0             0 sshd
[51381.932150] [25249]  1000 25249     1149      104       5        0             0 bash
[51381.932154] [25289]     0 25289     1889      157       6        0             0 sudo
[51381.932158] [25296]     0 25296     1777      145       6        0             0 su
[51381.932163] [25306]     0 25306     1167      123       5        0             0 bash
[51381.932167] [25442]     0 25442    82201    81784     163        0             0 perf
[51381.932172] Out of memory: Kill process 1835 (xmlrpcphone-sta) score 697 or sacrifice child
[51381.940857] Killed process 1835 (xmlrpcphone-sta) total-vm:1525440kB, anon-rss:1321632kB, file-rss:69660kB

Our application appears as the “xmlrpcphone-sta” process; it only consumes about 381 MB of RSS and there are almost no other threads running.

How can a 2 GB Linux system hit OOM while only about 500 MB of RAM is in use?

The only clue we have is the 444 kB of free HighMem:

HighMem: 23*4kB (M) 22*8kB (UM) 5*16kB (U) 1*32kB (M) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 444kB

Did we miss some important information?

The system is started with vmalloc=512M at boot time, and our application mmaps about 200 MB of V4L2 buffers.
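
Since “vmalloc=512M” reduces the directly mapped lowmem on a 32-bit ARM kernel, the effective zone split and the vmalloc consumption can be double-checked at runtime with the standard /proc interfaces (a simple sketch):

cat /proc/cmdline                                   # confirm vmalloc=512M is actually applied
grep -E 'LowTotal|HighTotal|Vmalloc' /proc/meminfo  # effective Normal/HighMem split and vmalloc usage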

When running, the system may have the following buddy info:

cat /proc/buddyinfo 
Node 0, zone   Normal     64     34    156     96     71     14     15      6      3      2     17 
Node 0, zone  HighMem      6      0      0      1      0      0      0      1      0      1      0

Again some low values for HighMem.

Is this normal?

You may want to install something like “htop” and watch memory as things run to see if maybe something has a memory leak. There may also be cases where something requires physically contiguous memory and the message is less than explicit about what kind of memory is acceptable. I like htop for this because you can click on particular columns (such as VIRT, RES, SHR, MEM%) and sort by them in real time.

There is no user-land memory leak in our application and no kernel-land memory leak (confirmed using kernel KMEMLEAK).

htop, or even valgrind, which we have been using for a long time, is of no use here since memory does not leak.

Image grabbing from the CSI bus is based on the V4L2 mmap API.
USB device mode, including the webcam feature, is quite intensive in terms of IRQs and memcpy.

Could you please explain the root cause that triggered the OOM described in this ticket?
Is it a low HighMem condition?

There are other consumers of memory beyond your code…htop is suggested so you can view the nature of the memory used as it goes up (e.g., shared memory may go up with total memory, which in turn might be a clue that pinned memory is the limitation…versus virtual…if virtual swap space could help). A memory leak is just one possibility that was mentioned. I don’t know the cause for your case, I was just hoping to narrow down the memory consumer.

One corollary to this for testing would be to add swap space and see if it helps…if not, then once again the clue would be that a specific type of memory is the issue, versus total memory.
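
For example, a temporary swap file is enough for such a test (a rough sketch, assuming an ext4 root filesystem with a couple GB free and CONFIG_SWAP enabled):

sudo fallocate -l 2G /swapfile      # or: dd if=/dev/zero of=/swapfile bs=1M count=2048
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
free -m                             # the Swap line should now be non-zero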

Our application is designed to use static memory allocation for all huge memory blocks.

Our application and USB gadget driver were stable over long-term runs (>30 days) on the Tegra K1 CPU with R21.4.

Our USB gadget driver and application are also stable over long-term runs (>30 days) on an i.MX6 CPU.

So please consider providing deeper, more intensive support.

Could you please explain the root cause that triggered the OOM described in this ticket?
Is it a low HighMem condition?

I am unable to directly answer, I only have suggestions to trace the cause.

If you can attach your dmesg log starting at the first line which contains either “oom” or “total_vm” it would help (hover the mouse over the quote icon in the upper right of an existing post…you’ll see the paper clip icon show up for attaching a file…attach the file with a “.txt” extension to please the spam filters). I don’t have the log to look at, but I’m guessing this will create the file:

dmesg | awk '/(oom|total_vm)/,/^$/' > dmesg.txt

In your kernel perhaps you could try an alternate configuration. Start with the existing “/proc/config.gz”. You will see:

CONFIG_VMSPLIT_3G=y 
# CONFIG_VMSPLIT_2G is not set
# CONFIG_VMSPLIT_1G is not set

You can change this and it may either fix the issue or offer more information (you’ll have to compile a new kernel, this is not a simple module change):

# CONFIG_VMSPLIT_3G is not set
CONFIG_VMSPLIT_2G=y
# CONFIG_VMSPLIT_1G is not set

I would also suggest making sure “CONFIG_SWAP” is enabled and trying with and without a couple GB of swap added…knowing whether swap changes the issue or not will help figure out what contributes to the OOM.
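
As a rough outline of that rebuild (a sketch only; it assumes a native build on the TK1 with the kernel source at an arbitrary path, adjust to your setup):

cd /usr/src/kernel                         # wherever the R21.5 kernel source is unpacked
zcat /proc/config.gz > .config             # start from the running configuration
scripts/config --disable VMSPLIT_3G --enable VMSPLIT_2G
make olddefconfig                          # resolve any dependent options
make -j4 zImage modules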

The attachment already includes all the lines from the OOM. The previous dmesg lines were more than an hour earlier, so it should be possible to diagnose the root cause of the OOM.

We haven’t tried changing the VMSPLIT value yet, but very few people write on the Internet about that specific topic for recent ARM CPUs. We will try changing VMSPLIT next week.

Regarding the very low values in the HighMem zone, do you have any idea of the consequences they could generate?

cat /proc/buddyinfo 
Node 0, zone   Normal     64     34    156     96     71     14     15      6      3      2     17 
Node 0, zone  HighMem      6      0      0      1      0      0      0      1      0      1      0

The OOM trace also shows the same kind of values: 444 kB of free area.

[51381.922224] HighMem: 23*4kB (M) 22*8kB (UM) 5*16kB (U) 1*32kB (M) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 444kB

Do you have any clear documentation of the usage and purpose of each zone (Normal/HighMem) on ARMv7 CPUs?
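
A simple way for us to watch how HighFree evolves over time and correlate it with the OOM (a sketch; the log path and interval are arbitrary):

while sleep 60; do
    date
    grep -E 'HighFree|LowFree|HighTotal|LowTotal' /proc/meminfo
    cat /proc/buddyinfo
done >> /var/log/zone-monitor.log          # run inside tmux and leave it running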

About swap: our application has a real-time requirement of 16 ms latency, with a maximum of 60~100 ms, so using swap would lead to other issues and make debugging more complex.
Anyway, we will try it.

With our USB device webcam driver there are about 5000 kmalloc/kfree calls per second (due to the usb request struct), plus IRQs, for 15 frames per second; that could lead to a “high memory fragmentation” issue under high memory usage conditions (i.e., low remaining memory).
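
One way to quantify this allocation churn is to count the kmem tracepoints with perf (a sketch, assuming the kernel exposes the kmem:kmalloc/kmem:kfree tracepoints):

perf stat -a -e kmem:kmalloc,kmem:kfree sleep 10   # system-wide allocation/free rate over 10 s
perf record -a -g -e kmem:kmalloc sleep 10         # then "perf report" shows who is allocating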

This URL has some info on HighMem:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/4/html/Reference_Guide/s2-proc-buddyinfo.html

I think this URL gives more insight:
https://unix.stackexchange.com/questions/4929/what-are-high-memory-and-low-memory-on-linux

The ARMv7 is a 32-bit system and has the 3G/1G split by default. You should know fairly quickly if HighMem is the issue once you’ve tried the “CONFIG_VMSPLIT_2G=y” option. The reason to test with swap is just to gain an idea of what kind of memory is needed…some types of memory limitations can’t be helped with swap. If swap helps, then it is unlikely the issue is specifically HighMem. Choosing a VM split of 2G when swap helps is probably a bad idea because you’d be limiting where RAM can be used in a way which doesn’t address the OOM.
[EDIT: This may not be technically exact for how VM split works versus what HighMem is, but results of using a different split still involve what the kernel has access to.]

You already know this is what tried to allocate more than the system has, but it doesn’t help solve the problem:

xmlrpcphone-sta invoked oom-killer

…if swap helps, then swap can be added and this process can be marked to not allow swapping out (other programs would swap in that case, leaving more RAM for xmlrpcphone-sta). If 2G/2G split works and user space does not run out of RAM (you’re emptying one type of RAM to fill another type) then your problem is solved. If 2G/2G helps and user space runs out of RAM, then you need both the 2G/2G split and swap (with your app marked for not allowing swapping out). Hopefully the 2G split will do the job because then you won’t have to deal with anything related to swap.

We added 64 GB of SSD swap, and it clearly does not help.

Could you confirm that testing with the “CONFIG_VMSPLIT_2G=y” kernel config is the next thing to try?

Yes, this is the next thing to test. Changing VMSPLIT gives the kernel a larger share of the address space (the kernel/user split changes from 1G/3G to 2G/2G…user space gets less, and the kernel can directly map more memory for drivers that need it)…if this is where memory runs out, then this could fix (or at least improve) the case.

When putting the new kernel in I’d suggest checking “uname -r” before the build (it’s probably “3.10.40-ga7da876”), and setting CONFIG_LOCALVERSION to “-ga7da867_test”. This means modules will be searched for at “/lib/modules/3.10.40-ga7da867_test/”, and thus the original module directory is left alone. Do a full module install to “/lib/modules/3.10.40-ga7da867_test/”. In the original module directory you will find the subdirectory “extra/”; copy it recursively into “/lib/modules/3.10.40-ga7da867_test/”. Add a second entry to extlinux.conf, e.g., something like this, and then select it via serial console at boot time (this also sets USB to USB3; zImage is also renamed):

LABEL ga7da867_test
      MENU LABEL ga7da867_test
      LINUX /boot/zImage-3.10.40-ga7da867_test
      FDT /boot/tegra124-jetson_tk1-pm375-000-c00-00.dtb
      APPEND console=ttyS0,115200n8 console=tty1 no_console_suspend=1 lp0_vec=2064@0xf46ff000 mem=2015M@2048M memtype=255 ddr_die=2048M@2048M section=256M pmuboard=0x0177:0x0000:0x02:0x43:0x00 tsec=32M@3913M otf_key=c75e5bb91eb3bd947560357b64422f85 usbcore.old_scheme_first=1 core_edp_mv=1150 core_edp_ma=4000 tegraid=40.1.1.0.0 debug_uartport=lsport,3 power_supply=Adapter audio_codec=rt5640 modem_id=0 android.kerneltype=normal fbcon=map:1 commchip_id=0 usb_port_owner_info=2 lane_owner_info=6 emc_max_dvfs=0 touch_id=0@0 board_info=0x0177:0x0000:0x02:0x43:0x00 net.ifnames=0 root=/dev/mmcblk0p1 rw rootwait tegraboot=sdmmc gpt

Should this fix the issue you can change the extlinux.conf “DEFAULT” from “primary” to “ga7da867_test” (which will leave the original entry available to test via serial console). Should this not at least cause an OOM behavior change then further debugging would be needed.
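
Roughly, the install sequence would look like this (a sketch; substitute the real version strings from your “uname -r” and your actual kernel source path):

# in the kernel source tree, after setting CONFIG_LOCALVERSION="-ga7da867_test"
make -j4 zImage modules
sudo make modules_install                        # populates /lib/modules/3.10.40-ga7da867_test/
sudo cp -r /lib/modules/$(uname -r)/extra /lib/modules/3.10.40-ga7da867_test/
sudo cp arch/arm/boot/zImage /boot/zImage-3.10.40-ga7da867_test
# then add the extlinux.conf entry shown above and select it from the serial console at boot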

Please consider that our kernel memory usage is not that big:

slabtop -o
 Active / Total Objects (% used)    : 93581 / 99368 (94.2%)
 Active / Total Slabs (% used)      : 4968 / 4968 (100.0%)
 Active / Total Caches (% used)     : 94 / 160 (58.8%)
 Active / Total Size (% used)       : 26695.31K / 27429.75K (97.3%)
 Minimum / Average / Maximum Object : 0.02K / 0.28K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
 16182  16136  99%    0.13K    558       29      2232K dentry                 
 15399  15270  99%    0.06K    261       59      1044K kmalloc-64             
 13334  13175  98%    0.06K    226       59       904K sysfs_dir_cache        
  9724   9687  99%    0.33K    884       11      3536K inode_cache            
  7110   6933  97%    0.12K    237       30       948K kmalloc-128            
  4576   4043  88%    0.09K    104       44       416K vm_area_struct         
  3540   3484  98%    0.06K     60       59       240K buffer_head            
  3390   3012  88%    0.03K     30      113       120K anon_vma_chain         
  2706   2705  99%    0.59K    451        6      1804K ext4_inode_cache       
  2660   1658  62%    0.19K    133       20       532K filp                   
  2486   2432  97%    0.03K     22      113        88K ext4_extent_status     
  2197   2161  98%    0.29K    169       13       676K radix_tree_node        
  2034   1840  90%    0.03K     18      113        72K anon_vma               
  1610   1595  99%    0.36K    161       10       644K proc_inode_cache       
  1469   1461  99%    0.03K     13      113        52K ftrace_event_field     
  1456   1285  88%    0.50K    182        8       728K kmalloc-512            
  1010    991  98%    0.37K    101       10       404K shmem_inode_cache      
   600    473  78%    0.19K     30       20       120K kmalloc-192            
   565    531  93%    0.03K      5      113        20K ftrace_event_file      
   521    521 100%   16.00K    521        1      8336K kmalloc-16384          
   508    508 100%    1.00K    127        4       508K kmalloc-1024           
   480    479  99%    0.25K     32       15       128K kmalloc-256            
   480    472  98%    0.12K     16       30        64K cred_jar               
   354    170  48%    0.06K      6       59        24K pid                    
   254    248  97%    2.00K    127        2       508K kmalloc-2048           
   236    114  48%    0.06K      4       59        16K fs_cache               
   203      2   0%    0.02K      1      203         4K jbd2_revoke_table_s    
   189    189 100%    0.56K     27        7       108K signal_cache           
   184     26  14%    0.04K      2       92         8K eventpoll_pwq          
   180    159  88%    0.12K      6       30        24K kmem_cache             
   180    178  98%    1.25K     60        3       240K task_struct            
   168    151  89%    1.05K     24        7       192K idr_layer_cache        
   160    111  69%    0.38K     16       10        64K skbuff_fclone_cache    
   159    159 100%    1.31K     53        3       212K sighand_cache          
   148    112  75%    0.10K      4       37        16K ext4_groupinfo_4k      
   145      1   0%    0.02K      1      145         4K nsproxy                
   145     14   9%    0.02K      1      145         4K ip_fib_alias           
   145     42  28%    0.02K      1      145         4K jbd2_inode             
   135    114  84%    0.25K      9       15        36K files_cache            
   130     95  73%    0.38K     13       10        52K sock_inode_cache       
   120     26  21%    0.12K      4       30        16K eventpoll_epi          
   118     23  19%    0.06K      2       59         8K blkdev_ioc             
   113     13  11%    0.03K      1      113         4K ip_fib_trie            
   113      2   1%    0.03K      1      113         4K tcp_bind_bucket        
   113      2   1%    0.03K      1      113         4K sd_ext_cdb             
   113      7   6%    0.03K      1      113         4K fib6_nodes             
   106     49  46%    0.07K      2       53         8K inotify_inode_mark     
    99     57  57%    0.44K     11        9        44K mm_struct              
    77     53  68%    0.56K     11        7        44K UNIX                   
    67      1   1%    0.05K      1       67         4K configfs_dir_cache     
    64     64 100%    0.23K      4       16        16K nf_conntrack_c0d26740
free -lm
             total       used       free     shared    buffers     cached
Mem:          1892       1394        498          9         12        263
Low:           721        225        496
High:         1170       1169          1
-/+ buffers/cache:       1117        774
Swap:            0          0          0
/proc/meminfo 
MemTotal:        1938056 kB
MemFree:          514248 kB
Buffers:           13116 kB
Cached:           262392 kB
SwapCached:            0 kB
Active:          1275924 kB
Inactive:          56124 kB
Active(anon):    1057032 kB
Inactive(anon):     1364 kB
Active(file):     218892 kB
Inactive(file):    54760 kB
Unevictable:         672 kB
Mlocked:             672 kB
HighTotal:       1199100 kB
HighFree:           1844 kB
LowTotal:         738956 kB
LowFree:          512404 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 8 kB
Writeback:             0 kB
AnonPages:       1057492 kB
Mapped:            53316 kB
Shmem:              1860 kB
Slab:              28564 kB
SReclaimable:       9364 kB
SUnreclaim:        19200 kB
KernelStack:        1216 kB
PageTables:         2996 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:      969028 kB
Committed_AS:    1212060 kB
VmallocTotal:     253952 kB
VmallocUsed:      211160 kB
VmallocChunk:      12536 kB
NvMapMemFree:          0 kB
NvMapMemUsed:          0 kB

Our main process is not consuming that much memory:

cat /mnt/tmp/qse.status.2 
Name:   xmlrpcphone-sta
State:  S (sleeping)
Tgid:   5200
Pid:    5200
PPid:   1
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 32
Groups: 0 
VmPeak:  1146960 kB
VmSize:  1145568 kB
VmLck:       672 kB
VmPin:         0 kB
VmHWM:   1081584 kB
VmRSS:   1080088 kB
VmData:  1090036 kB
VmStk:       136 kB
VmExe:      3072 kB
VmLib:     10944 kB
VmPTE:      1088 kB
VmSwap:        0 kB
Threads:        8
SigQ:   0/15007
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 0000000180004a02
CapInh: 0000000000000000
CapPrm: 0000001fffffffff
CapEff: 0000001fffffffff
CapBnd: 0000001fffffffff
Cpus_allowed:   f
Cpus_allowed_list:      0-3
voluntary_ctxt_switches:        445
nonvoluntary_ctxt_switches:     6458

But does VMSPLIT change the issue? Not all memory is equal…it isn’t always about totals, sometimes it is about how it is arranged.

We are launching an endurance test tonight with more tracepoints on the syncpt workqueue and the vi kthread workqueue. With this version we should get more detail on the timing of IRQ => syncpt WQ => vi kthread WQ if a syncpt timeout occurs.

We will try VMSPLIT 2G tomorrow.

We found a memory leak that occurs only under specific race conditions in our application process and leads to swap usage when swap is enabled.
We will relaunch the test once this leak has been corrected.

Nevertheless, we know that the syncpt timeout can occur without this leak, with stable RSS usage from the application. We are now focusing on making this issue reproducible.