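Lately the test servers I maintain have been hitting OOM more and more often. Each time I tweaked a few kernel parameters, which seemed to help a little, but that only treats the symptom; the root cause was never found.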
So I first read up on some of the basics. After every OOM, the messages log contains entries like this:

    Jun 18 17:10:23 free-72-222 kernel: oom-killer: gfp_mask=0xd0
    Jun 18 17:10:23 free-72-222 kernel: Mem-info:
    Jun 18 17:10:23 free-72-222 kernel: DMA per-cpu:
    Jun 18 17:10:23 free-72-222 kernel: cpu 0 hot: low 2, high 6, batch 1
    Jun 18 17:10:23 free-72-222 kernel: cpu 0 cold: low 0, high 2, batch 1
    Jun 18 17:10:23 free-72-222 kernel: cpu 1 hot: low 2, high 6, batch 1
    Jun 18 17:10:23 free-72-222 kernel: cpu 1 cold: low 0, high 2, batch 1
    Jun 18 17:10:23 free-72-222 kernel: cpu 2 hot: low 2, high 6, batch 1
    Jun 18 17:10:27 free-72-222 kernel: cpu 2 cold: low 0, high 2, batch 1
    Jun 18 17:10:27 free-72-222 kernel: cpu 3 hot: low 2, high 6, batch 1
    Jun 18 17:10:27 free-72-222 kernel: cpu 3 cold: low 0, high 2, batch 1
    Jun 18 17:10:27 free-72-222 kernel: Normal per-cpu:
    Jun 18 17:10:27 free-72-222 kernel: cpu 0 hot: low 32, high 96, batch 16
    Jun 18 17:10:27 free-72-222 kernel: cpu 0 cold: low 0, high 32, batch 16
    Jun 18 17:10:27 free-72-222 kernel: cpu 1 hot: low 32, high 96, batch 16
    Jun 18 17:10:27 free-72-222 kernel: cpu 1 cold: low 0, high 32, batch 16
    Jun 18 17:10:27 free-72-222 kernel: cpu 2 hot: low 32, high 96, batch 16
    Jun 18 17:10:27 free-72-222 kernel: cpu 2 cold: low 0, high 32, batch 16
    …
    Jun 20 14:46:44 free-72-222 kernel: cpu 2 cold: low 0, high 32, batch 16
    Jun 20 14:46:44 free-72-222 kernel: cpu 1 cold: low 0, high 32, batch 16
    Jun 20 14:46:44 free-72-222 kernel: cpu 2 hot: low 32, high 96, batch 16
    Jun 20 14:46:44 free-72-222 kernel: cpu 2 cold: low 0, high 32, batch 16
    Jun 20 14:46:44 free-72-222 kernel: cpu 3 hot: low 32, high 96, batch 16
    Jun 20 14:46:44 free-72-222 kernel: cpu 3 cold: low 0, high 32, batch 16
    Jun 20 14:46:44 free-72-222 kernel:
    Jun 20 14:46:44 free-72-222 kernel: Free pages: 35748kB (24320kB HighMem)
    Jun 20 14:46:44 free-72-222 kernel: protections[]: 0 0 0
    Jun 20 14:46:44 free-72-222 kernel: protections[]: 0 0 0
    Jun 20 14:46:44 free-72-222 kernel: protections[]: 0 0 0
    Jun 20 14:46:44 free-72-222 kernel: protections[]: 0 0 0
    Jun 20 14:46:44 free-72-222 kernel: Normal free:3304kB min:3336kB low:6672kB high:10008kB active:617956kB inactive:0kB present:729088kB pages_scanned:1293 all_unreclaimable? no
    Jun 20 14:46:44 free-72-222 kernel: protections[]: 0 0 0
    Jun 20 14:46:44 free-72-222 kernel: HighMem free:24320kB min:512kB low:1024kB high:1536kB active:2836904kB inactive:486976kB present:3358720kB pages_scanned:0 all_unreclaimable? no
    Jun 20 14:46:44 free-72-222 kernel: protections[]: 0 0 0
    Jun 20 14:46:44 free-72-222 kernel: DMA: 3*4kB 2*8kB 6*16kB 4*32kB 5*64kB 1*128kB 1*256kB 2*512kB 2*1024kB 2*2048kB 0*4096kB = 8124kB
    Jun 20 14:46:44 free-72-222 kernel: Normal: 0*4kB 1*8kB 0*16kB 1*32kB 1*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3304kB
    Jun 20 14:46:44 free-72-222 kernel: HighMem: 5942*4kB 5*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 24320kB
    Jun 20 14:46:44 free-72-222 kernel: 428935 pagecache pages
    Jun 20 14:46:44 free-72-222 kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0
    Jun 20 14:46:44 free-72-222 kernel: 0 bounce buffer pages
    Jun 20 14:46:44 free-72-222 kernel: Free swap: 0kB
    Jun 20 14:46:44 free-72-222 kernel: 1026048 pages of RAM
    Jun 20 14:46:44 free-72-222 kernel: 839680 pages of HIGHMEM
    Jun 20 14:46:44 free-72-222 kernel: 10594 reserved pages
    Jun 20 14:46:44 free-72-222 kernel: 413640 pages shared
    Jun 20 14:46:44 free-72-222 kernel: 0 pages swap cached
    Jun 20 14:46:44 free-72-222 kernel: Out of Memory: Killed process 19148 (java).

Within this log, the following lines

    Jun 20 14:46:44 free-72-222 kernel: DMA: 3*4kB 2*8kB 6*16kB 4*32kB 5*64kB 1*128kB 1*256kB 2*512kB 2*1024kB 2*2048kB 0*4096kB = 8124kB
    Jun 20 14:46:44 free-72-222 kernel: Normal: 0*4kB 1*8kB 0*16kB 1*32kB 1*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3304kB
    Jun 20 14:46:44 free-72-222 kernel: HighMem: 5942*4kB 5*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 24320kB

show how free memory was distributed at the moment of the OOM: the Normal and HighMem zones had 3304 kB and 24320 kB free, respectively.
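One detail worth pulling out of a dump like this is whether a zone's free memory has dropped below its min watermark; in the log above the Normal zone reports free:3304kB against min:3336kB. The following is only a minimal Python sketch of that check. The function name `zones_below_min` and the regular expression are my own illustration and assume the "free:/min:/low:/high:" line format seen in this particular log.

```python
import re

# Matches per-zone summary lines such as
# "Normal free:3304kB min:3336kB low:6672kB high:10008kB ..."
ZONE_RE = re.compile(r"(\w+) free:(\d+)kB min:(\d+)kB low:(\d+)kB high:(\d+)kB")


def zones_below_min(log_text):
    """Return {zone: (free_kb, min_kb)} for zones whose free value is under min."""
    result = {}
    for zone, free, min_kb, _low, _high in ZONE_RE.findall(log_text):
        if int(free) < int(min_kb):
            result[zone] = (int(free), int(min_kb))
    return result


# Two lines copied from the log above (truncated after the "active" field).
sample = (
    "kernel: Normal free:3304kB min:3336kB low:6672kB high:10008kB active:617956kB\n"
    "kernel: HighMem free:24320kB min:512kB low:1024kB high:1536kB active:2836904kB\n"
)
print(zones_below_min(sample))  # {'Normal': (3304, 3336)}
```

In practice the input would be the relevant part of /var/log/messages rather than a hard-coded string.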
We can also check the system's current memory state with:

    # cat /proc/buddyinfo
    Node 0, zone DMA 3 2 6 4 5 1 1 2 2 2 0
    Node 0, zone Normal 28 1 1 1 2413 918 161 12 1 1 0
    Node 0, zone HighMem 126 8173 18550 1254 22 0 0 1 0 0 0

For an explanation of buddyinfo, see
http://www.centos.org/docs/5/html/5.2/Deployment_Guide/s2-proc-buddyinfo.html
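In short, each column is the number of free blocks of one size:
column 1: 4 kB, column 2: 8 kB, column 3: 16 kB, column 4: 32 kB, column 5: 64 kB, column 6: 128 kB, column 7: 256 kB, column 8: 512 kB, column 9: 1024 kB, column 10: 2048 kB, column 11: 4096 kB
Multiplying each count by its block size and adding up the results gives the free memory in that zone at that moment.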
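The column arithmetic above can be sketched in a few lines of Python. This is an illustration rather than a finished tool: the function name `buddyinfo_free_kb` is mine, and it assumes 4 kB pages and the "Node N, zone NAME counts..." layout shown in the output above.

```python
PAGE_KB = 4  # assumption: 4 kB pages, as in the logs in this post


def buddyinfo_free_kb(line, page_kb=PAGE_KB):
    """Return (zone, free_kb) for one line of /proc/buddyinfo."""
    # Fields look like: ["Node", "0,", "zone", "DMA", "3", "2", ...]
    fields = line.split()
    zone = fields[3]
    counts = [int(x) for x in fields[4:]]
    # Column n (starting at 0) counts free blocks of page_kb * 2**n kB.
    total = sum(n * page_kb * (2 ** order) for order, n in enumerate(counts))
    return zone, total


print(buddyinfo_free_kb("Node 0, zone DMA 3 2 6 4 5 1 1 2 2 2 0"))  # ('DMA', 8124)
```

The DMA result, 8124 kB, agrees with the "DMA: ... = 8124kB" line the kernel printed in the OOM log.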
We can use the command

    echo m > /proc/sysrq-trigger

to make the kernel print the same per-zone free block information that buddyinfo shows into the messages log:

    # cat /proc/buddyinfo ; echo m > /proc/sysrq-trigger
    Node 0, zone DMA 3 2 6 4 5 1 1 2 2 2 0
    Node 0, zone Normal 0 0 0 1 2403 918 161 12 1 1 0
    Node 0, zone HighMem 0 0 18372 1254 22 0 0 1 0 0 0
The free memory values here add up to a total that is consistent with the output of free:

    # free -m
                 total       used       free     shared    buffers     cached
    Mem:          4000       3351        648          0        257       2046
    -/+ buffers/cache:       1047       2952
    Swap:            0          0          0
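Now back to the two kernel parameters that control how memory is allocated to processes.
/proc/sys/vm/overcommit_memory can take three values:
0 (default): as before: guess about how much overcommitment is reasonable,
1: never refuse any malloc(),
2: be precise about the overcommit: never commit a virtual address space larger than swap space plus a fraction overcommit_ratio of the physical memory. Here /proc/sys/vm/overcommit_ratio (by default 50) is another user-settable parameter. It is possible to set overcommit_ratio to values larger than 100.
Put simply: with vm.overcommit_memory = 0, a fairly large amount of memory can be requested, but requests will still fail at some point; with vm.overcommit_memory = 1, every malloc succeeds;
with vm.overcommit_memory = 2, the amount of memory that can be committed is swap space plus physical memory * overcommit_ratio / 100.
/proc/sys/vm/overcommit_ratio defaults to 50. In this mode,
setting the ratio too low wastes memory, because part of it can never be used, while setting it too high defeats the purpose. It has to be tuned to the actual situation.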
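As a worked example of the mode 2 limit (swap plus a fraction overcommit_ratio of physical memory), here is a small Python sketch. The function name `commit_limit_kb` is hypothetical, and the sketch assumes 4 kB pages, no huge pages, and that the division is done on whole pages; treat it as an approximation of what the kernel reports as CommitLimit, not as its exact code.

```python
PAGE_KB = 4  # assumption: 4 kB pages


def commit_limit_kb(mem_total_kb, swap_total_kb, overcommit_ratio=50, page_kb=PAGE_KB):
    """Approximate the mode-2 commit limit: swap + RAM * overcommit_ratio / 100."""
    ram_pages = mem_total_kb // page_kb
    swap_pages = swap_total_kb // page_kb
    # Integer arithmetic on pages (an assumption about how the rounding works).
    allowed_pages = ram_pages * overcommit_ratio // 100 + swap_pages
    return allowed_pages * page_kb


# MemTotal and SwapTotal taken from the /proc/meminfo output in this post.
print(commit_limit_kb(4096132, 0, 50))  # 2048064
```

With MemTotal 4096132 kB, no swap, and the default ratio of 50, the result is 2048064 kB, which matches the CommitLimit value in the /proc/meminfo output shown in this post.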
    [root@test /var/log]# echo m > /proc/sysrq-trigger
    [root@test /var/log]# cat /proc/buddyinfo
    Node 0, zone DMA 3 2 6 4 5 1 1 2 2 2 0
    Node 0, zone Normal 0 0 1 1 2360 918 161 12 1 1 0
    Node 0, zone HighMem 0 0 17900 1254 22 0 0 1 0 0 0
    [root@test /var/log]# grep HighMem: /var/log/messages |tail -n 1
    Jun 20 16:02:37 free-72-222 kernel: HighMem: 0*4kB 0*8kB 17948*16kB 1254*32kB 22*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 329216kB
    [root@test /var/log]# grep Normal: /var/log/messages |tail -n 1
    Jun 20 16:02:37 free-72-222 kernel: Normal: 0*4kB 0*8kB 1*16kB 1*32kB 2371*64kB 918*128kB 161*256kB 12*512kB 1*1024kB 1*2048kB 0*4096kB = 319728kB
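To double-check a logged line like the ones above, the "count*size" terms can be summed and compared with the total the kernel printed. The following Python sketch is only an illustration; the name `check_zone_line` and the regular expressions are my own and assume exactly the "N*SkB ... = TOTALkB" format shown in these log lines.

```python
import re

# One "count*size" term, e.g. "17948*16kB"
TERM_RE = re.compile(r"(\d+)\*(\d+)kB")


def check_zone_line(line):
    """Return (computed_kb, reported_kb) for a 'Zone: N*SkB ... = TOTALkB' line."""
    computed = sum(int(n) * int(size) for n, size in TERM_RE.findall(line))
    reported = int(re.search(r"= (\d+)kB", line).group(1))
    return computed, reported


line = (
    "HighMem: 0*4kB 0*8kB 17948*16kB 1254*32kB 22*64kB 0*128kB "
    "0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 329216kB"
)
print(check_zone_line(line))  # (329216, 329216)
```

Both numbers agree for the HighMem line, so the per-size counts and the reported total describe the same amount of free memory.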
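    [root@test /var/log]# cat /proc/meminfo
    MemTotal: 4096132 kB
    MemFree: 655324 kB
    Buffers: 265820 kB
    Cached: 2101480 kB
    SwapCached: 0 kB
    Active: 2690184 kB
    Inactive: 595584 kB
    HighTotal: 3350528 kB
    HighFree: 328256 kB
    LowTotal: 745604 kB
    LowFree: 327068 kB
    SwapTotal: 0 kB
    SwapFree: 0 kB
    Dirty: 1236 kB
    Writeback: 0 kB
    Mapped: 969720 kB
    Slab: 94556 kB
    CommitLimit: 2048064 kB
    Committed_AS: 4843488 kB
    PageTables: 4580 kB
    VmallocTotal: 114680 kB
    VmallocUsed: 1232 kB
    VmallocChunk: 112956 kB

Then there is the /proc/sys/vm/drop_caches parameter. After running sync, you can try

    echo 3 > /proc/sys/vm/drop_caches

to release both buffers and cache. In practice, though, this also feels like treating the symptom rather than the cause; the real reason still has to be tracked down in the application. In the end the problem turned out to be the JVM parameters: the machine had only 2 GB of memory, but the JVM heap and stack sizes were set too large, which caused the repeated OOMs. On another occasion the cause was a bug in automount, the NFS automounter, which created a huge number of network connections. There was no way around it other than rebooting the machine, and after the reboot everything was OK.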