igb Detected Tx Unit Hang

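A large share of our company's servers use Intel NICs, mainly the Intel I350, 82580 and 82576. Every version of the igb driver we have run has at some point logged the "Detected Tx Unit Hang" error.
Most of these were bugs in older igb releases, and upgrading to a newer driver resolved nearly all of them. Recently, however, we found several quad-port I350 cards that still report the error after the driver upgrade.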
[text]
Detected Tx Unit Hang in Quad Port Adapters http://downloadmirror.intel.com/20927/eng/e1000.htm
In some cases ports 3 and 4 don’t pass traffic and report ‘Detected Tx Unit Hang’ followed by
‘NETDEV WATCHDOG: ethX: transmit timed out’ errors. Ports 1 and 2 don’t show any errors and will pass traffic.
This issue MAY be resolved by updating to the latest kernel and BIOS. The user is encouraged to run an
OS that fully supports MSI interrupts. You can check your system’s BIOS by downloading the Linux
Firmware Developer Kit that can be obtained at http://www.linuxfirmwarekit.org/
[/text]
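Intel's note says that on quad-port adapters ports 3 and 4 may stop passing traffic and report "Detected Tx Unit Hang", and that updating the kernel and BIOS may resolve it. When I looked closely at the dmesg output, though, the ports reporting "Detected Tx Unit Hang" were eth0 and eth1. Today I tried changing a range of settings: I turned off GRO, TSO and similar offloads, increased the TX ring size, and even disabled tx-checksumming. The error still appeared in the end.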
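For reference, the tuning steps described above map onto `ethtool -K` (offload features) and `ethtool -G` (ring sizes). The sketch below only builds and prints the commands. The interface names and the ring size of 4096 are assumptions, since the exact values I used are not recorded here; check `ethtool -g <iface>` for the maximum your card supports.

```python
import subprocess


def build_ethtool_commands(iface, tx_ring=4096):
    """Return the ethtool invocations for the tuning steps described above.

    tx_ring=4096 is an assumed value, not one taken from the original setup.
    """
    return [
        # Disable GRO, TSO and TX checksum offload.
        ["ethtool", "-K", iface, "gro", "off", "tso", "off", "tx", "off"],
        # Enlarge the TX descriptor ring.
        ["ethtool", "-G", iface, "tx", str(tx_ring)],
    ]


def apply(iface, dry_run=True):
    """Print the commands, or run them (requires root) when dry_run is False."""
    for cmd in build_ethtool_commands(iface):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    for iface in ("eth0", "eth1"):
        apply(iface)
```

Run as is, this prints the commands without touching the NIC; pass `dry_run=False` as root to apply them. Settings changed this way do not survive a reboot.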
[text]
[ 3181.339388] igb 0000:02:00.0: Detected Tx Unit Hang
[ 3181.339390] Tx Queue <7>
[ 3181.339390] TDH <28c>
[ 3181.339391] TDT <28c>
[ 3181.339392] next_to_use <28c>
[ 3181.339392] next_to_clean <b2>
[ 3181.339393] buffer_info[next_to_clean]
[ 3181.339394] time_stamp <1002bfb38>
[ 3181.339394] next_to_watch <ffff88183fa00b20>
[ 3181.339395] jiffies <1002c0074>
[ 3181.339396] desc.status <1b8001>
[ 3424.144387] igb 0000:02:00.1: Detected Tx Unit Hang
[ 3424.144388] Tx Queue <7>
[ 3424.144389] TDH <e39>
[ 3424.144389] TDT <e39>
[ 3424.144390] next_to_use <e39>
[ 3424.144391] next_to_clean <8b3>
[ 3424.144391] buffer_info[next_to_clean]
[ 3424.144392] time_stamp <1002fa71c>
[ 3424.144392] next_to_watch <ffff8818226e8b30>
[ 3424.144393] jiffies <1002fb60d>
[ 3424.144394] desc.status <1b8001>
[/text]
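The fields in these dumps are hexadecimal, so a small script makes them easier to compare. The sketch below parses the two reports above and derives three numbers per report: whether TDH equals TDT, how far next_to_clean lags behind next_to_use, and how many jiffies passed since the stuck buffer's time_stamp. The helper names are my own, and reading TDH == TDT as "the hardware has consumed every queued descriptor" is my interpretation, not something the driver prints. The lag calculation also assumes the ring had not wrapped between the two indices.

```python
import re

SAMPLE = """\
[ 3181.339388] igb 0000:02:00.0: Detected Tx Unit Hang
[ 3181.339390] Tx Queue <7>
[ 3181.339390] TDH <28c>
[ 3181.339391] TDT <28c>
[ 3181.339392] next_to_use <28c>
[ 3181.339392] next_to_clean <b2>
[ 3181.339394] time_stamp <1002bfb38>
[ 3181.339395] jiffies <1002c0074>
[ 3424.144387] igb 0000:02:00.1: Detected Tx Unit Hang
[ 3424.144388] Tx Queue <7>
[ 3424.144389] TDH <e39>
[ 3424.144389] TDT <e39>
[ 3424.144390] next_to_use <e39>
[ 3424.144391] next_to_clean <8b3>
[ 3424.144392] time_stamp <1002fa71c>
[ 3424.144393] jiffies <1002fb60d>
"""

HEADER = re.compile(r"igb (\S+): Detected Tx Unit Hang")
FIELD = re.compile(r"\]\s+([\w. ]+?)\s+<([0-9a-fA-F]+)>")


def parse_hangs(text):
    """Split dmesg text into one dict per hang report, with hex fields as ints."""
    hangs = []
    for line in text.splitlines():
        header = HEADER.search(line)
        if header:
            hangs.append({"device": header.group(1)})
            continue
        field = FIELD.search(line)
        if field and hangs:
            hangs[-1][field.group(1)] = int(field.group(2), 16)
    return hangs


def summarize(hang):
    """Derive a few comparable numbers from one parsed report."""
    return {
        "device": hang["device"],
        # Hardware head has caught up with the software tail.
        "hw_idle": hang["TDH"] == hang["TDT"],
        # Descriptors queued but not yet cleaned (assumes no ring wrap).
        "pending": hang["next_to_use"] - hang["next_to_clean"],
        # Age of the oldest uncleaned buffer, in jiffies.
        "age_jiffies": hang["jiffies"] - hang["time_stamp"],
    }


if __name__ == "__main__":
    for hang in parse_hangs(SAMPLE):
        print(summarize(hang))
```

For the two reports above this gives 474 and 1414 uncleaned descriptors, aged 1340 and 3825 jiffies, with TDH equal to TDT in both cases. Converting jiffies to seconds needs the kernel's HZ value, which is not known from the log.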
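Later I asked colleagues on the hardware team. They said they had run into a similar problem in a Hadoop cluster and eventually resolved it by adding the kernel parameter vm.min_free_kbytes=81920.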
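vm.min_free_kbytes sets how much memory the kernel tries to keep free at all times, and 81920 kB is 80 MiB. A minimal sketch for checking a host against that value is below. The helper names are hypothetical, and writing the line to /etc/sysctl.conf is the usual way to persist a sysctl, not something the hardware team specified.

```python
from pathlib import Path

PROC_PATH = Path("/proc/sys/vm/min_free_kbytes")
TARGET_KB = 81920  # value suggested by the hardware team (80 MiB)


def needs_raise(current_text, target_kb=TARGET_KB):
    """True if the current setting (as read from /proc) is below the target."""
    return int(current_text.strip()) < target_kb


def sysctl_conf_line(target_kb=TARGET_KB):
    """The line to add to /etc/sysctl.conf to persist the setting."""
    return f"vm.min_free_kbytes = {target_kb}"


if __name__ == "__main__":
    if PROC_PATH.exists():
        current = PROC_PATH.read_text()
        print("current:", current.strip(), "kB")
        if needs_raise(current):
            print("add to /etc/sysctl.conf:", sysctl_conf_line())
```

The script only reads the current value and prints a suggestion; applying it still means editing the sysctl configuration and reloading it as root.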

Reference:
http://kernel.taobao.org/index.php/Kernel_Documents/mm_sysctl

===
It turns out the problem is still there: setting this kernel parameter had no effect. The cause is still under investigation.
