Hi,
I'm coming back to you (yet again) because I've just had another crash in my pool. As a reminder, since three particular updates (including the HA update), I've been having trouble with my pool of 2 servers using an NFS SR.
Last night the secondary server crashed again, and I'm once more seeing log entries that worry me:
Dec 15 00:26:16 xenserver-2 kernel: [669720.849173] ixgbe 0000:0d:00.0 eth4: Fake Tx hang detected with timeout of 80 seconds
Entries like this have been appearing since yesterday at 18:17, and just before they start there is a kernel warning:
Dec 14 18:18:26 xenserver-2 kernel: [647650.756874] ------------[ cut here ]------------
Dec 14 18:18:26 xenserver-2 kernel: [647650.756893] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x1a4/0x280()
Dec 14 18:18:26 xenserver-2 kernel: [647650.756896] NETDEV WATCHDOG: eth4 (ixgbe): transmit queue 8 timed out
Dec 14 18:18:26 xenserver-2 kernel: [647650.756897] Modules linked in: tun nfsv4 nfs fscache bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc scsi_tgt openvswitch(O) gre libcrc32c 8021q garp mrp stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack iptable_filter dm_multipath coretemp crc32_pclmul dcdbas aesni_intel aes_x86_64 ipmi_devintf ablk_helper cryptd lrw gf128mul glue_helper dm_mod microcode psmouse i7core_edac edac_core sg shpchp lpc_ich mfd_core hed ipmi_si ipmi_msghandler wmi nls_utf8 isofs nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc ip_tables x_tables sr_mod cdrom ata_generic pata_acpi hid_generic usbhid hid sd_mod serio_raw ata_piix libata ehci_pci ehci_hcd uhci_hcd e1000e(O) ixgbe(O) ptp pps_core megaraid_sas(O) bnx2(O) scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh scsi_mod ipv6 autofs4
Dec 14 18:18:26 xenserver-2 kernel: [647650.756983] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 3.10.0+10 #1
Dec 14 18:18:26 xenserver-2 kernel: [647650.756985] Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.9.0 07/29/2013
Dec 14 18:18:26 xenserver-2 kernel: [647650.756988] 0000000000000009 ffff880183203d58 ffffffff81545307 ffff880183203d90
Dec 14 18:18:26 xenserver-2 kernel: [647650.756993] ffffffff81054da1 ffff88017bb80000 0000000000000008 0000000000000000
Dec 14 18:18:26 xenserver-2 kernel: [647650.756997] ffff88017bb75800 ffff88017bb75780 ffff880183203df0 ffffffff81054e0c
Dec 14 18:18:26 xenserver-2 kernel: [647650.757001] Call Trace:
Dec 14 18:18:26 xenserver-2 kernel: [647650.757004] <IRQ> [<ffffffff81545307>] dump_stack+0x19/0x1b
Dec 14 18:18:26 xenserver-2 kernel: [647650.757022] [<ffffffff81054da1>] warn_slowpath_common+0x61/0x80
Dec 14 18:18:26 xenserver-2 kernel: [647650.757025] [<ffffffff81054e0c>] warn_slowpath_fmt+0x4c/0x50
Dec 14 18:18:26 xenserver-2 kernel: [647650.757030] [<ffffffff8149f914>] dev_watchdog+0x1a4/0x280
Dec 14 18:18:26 xenserver-2 kernel: [647650.757034] [<ffffffff8149f770>] ? dev_deactivate_queue.constprop.29+0x60/0x60
Dec 14 18:18:26 xenserver-2 kernel: [647650.757039] [<ffffffff81063cd3>] call_timer_fn+0x53/0x130
Dec 14 18:18:26 xenserver-2 kernel: [647650.757042] [<ffffffff8149f770>] ? dev_deactivate_queue.constprop.29+0x60/0x60
Dec 14 18:18:26 xenserver-2 kernel: [647650.757047] [<ffffffff810658fd>] run_timer_softirq+0x22d/0x290
Dec 14 18:18:26 xenserver-2 kernel: [647650.757054] [<ffffffff8105d48b>] __do_softirq+0xfb/0x240
Dec 14 18:18:26 xenserver-2 kernel: [647650.757059] [<ffffffff8155509c>] call_softirq+0x1c/0x30
Dec 14 18:18:26 xenserver-2 kernel: [647650.757069] [<ffffffff81014203>] do_softirq+0x43/0x80
Dec 14 18:18:26 xenserver-2 kernel: [647650.757072] [<ffffffff8105d6d9>] irq_exit+0x49/0xa0
Dec 14 18:18:26 xenserver-2 kernel: [647650.757081] [<ffffffff81384ca5>] xen_evtchn_do_upcall+0x35/0x50
Dec 14 18:18:26 xenserver-2 kernel: [647650.757084] [<ffffffff815550fe>] xen_do_hypervisor_callback+0x1e/0xa0
Dec 14 18:18:26 xenserver-2 kernel: [647650.757085] <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Dec 14 18:18:26 xenserver-2 kernel: [647650.757093] [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Dec 14 18:18:26 xenserver-2 kernel: [647650.757101] [<ffffffff8100a340>] ? xen_safe_halt+0x10/0x30
Dec 14 18:18:26 xenserver-2 kernel: [647650.757107] [<ffffffff8101a844>] ? default_idle+0x44/0xd0
Dec 14 18:18:26 xenserver-2 kernel: [647650.757110] [<ffffffff8101b038>] ? arch_cpu_idle+0x18/0x30
Dec 14 18:18:26 xenserver-2 kernel: [647650.757117] [<ffffffff810a3532>] ? cpu_startup_entry+0x1c2/0x280
Dec 14 18:18:26 xenserver-2 kernel: [647650.757125] [<ffffffff8152b442>] ? rest_init+0x72/0x80
Dec 14 18:18:26 xenserver-2 kernel: [647650.757134] [<ffffffff81ad6eee>] ? start_kernel+0x404/0x40f
Dec 14 18:18:26 xenserver-2 kernel: [647650.757137] [<ffffffff81ad68f3>] ? repair_env_string+0x5e/0x5e
Dec 14 18:18:26 xenserver-2 kernel: [647650.757140] [<ffffffff81ad65ee>] ? x86_64_start_reservations+0x2a/0x2c
Dec 14 18:18:26 xenserver-2 kernel: [647650.757144] [<ffffffff81ad9b48>] ? xen_start_kernel+0x531/0x53d
Dec 14 18:18:26 xenserver-2 kernel: [647650.757146] ---[ end trace b8f7a697402b44ac ]---
Dec 14 18:18:26 xenserver-2 kernel: [647650.757156] ixgbe 0000:0d:00.0 eth4: Fake Tx hang detected with timeout of 5 seconds
Dec 14 18:18:36 xenserver-2 kernel: [647660.752866] ixgbe 0000:0d:00.0 eth4: Fake Tx hang detected with timeout of 10 seconds
Dec 14 18:18:41 xenserver-2 kernel: [647666.079846] usb 5-2.2: USB disconnect, device number 84
Dec 14 18:18:43 xenserver-2 kernel: [647667.561258] usb 5-2.2: new low-speed USB device number 85 using ehci-pci
Dec 14 18:18:43 xenserver-2 kernel: [647667.664248] input: PixArt USB Optical Mouse as /devices/pci0000:00/0000:00:1a.7/usb5/5-2/5-2.2/5-2.2:1.0/input/input10502
Dec 14 18:18:43 xenserver-2 kernel: [647667.664944] hid-generic 0003:093A:2510.2906: input,hidraw4: USB HID v1.11 Mouse [PixArt USB Optical Mouse] on usb-0000:00:1a.7-2.2/input0
Dec 14 18:18:56 xenserver-2 kernel: [647680.784899] ixgbe 0000:0d:00.0 eth4: Fake Tx hang detected with timeout of 20 seconds
Dec 14 18:19:36 xenserver-2 kernel: [647720.848913] ixgbe 0000:0d:00.0 eth4: Fake Tx hang detected with timeout of 40 seconds
Dec 14 18:19:45 xenserver-2 kernel: [647729.256911] usb 5-2.2: new low-speed USB device number 86 using ehci-pci
Dec 14 18:19:45 xenserver-2 kernel: [647729.356914] input: PixArt USB Optical Mouse as /devices/pci0000:00/0000:00:1a.7/usb5/5-2/5-2.2/5-2.2:1.0/input/input10503
Dec 14 18:19:45 xenserver-2 kernel: [647729.357812] hid-generic 0003:093A:2510.2907: input,hidraw4: USB HID v1.11 Mouse [PixArt USB Optical Mouse] on usb-0000:00:1a.7-2.2/input0
Dec 14 18:20:45 xenserver-2 kernel: [647789.471939] usb 5-2.2: USB disconnect, device number 86
Dec 14 18:20:46 xenserver-2 kernel: [647790.949078] usb 5-2.2: new low-speed USB device number 87 using ehci-pci
Dec 14 18:20:46 xenserver-2 kernel: [647791.048068] input: PixArt USB Optical Mouse as /devices/pci0000:00/0000:00:1a.7/usb5/5-2/5-2.2/5-2.2:1.0/input/input10504
Dec 14 18:20:46 xenserver-2 kernel: [647791.048860] hid-generic 0003:093A:2510.2908: input,hidraw4: USB HID v1.11 Mouse [PixArt USB Optical Mouse] on usb-0000:00:1a.7-2.2/input0
Dec 14 18:20:56 xenserver-2 kernel: [647800.848898] ixgbe 0000:0d:00.0 eth4: Fake Tx hang detected with timeout of 80 seconds
So right at the end of the kernel warning we can clearly see the Fake Tx hang messages begin, with a timeout of 5 seconds, then 10, 20, 40 and finally 80. Could 80 be the last limit before the watchdog reacts?
Anyway, where do you think this could be coming from? The trace mentions CPU 0 at the start; maybe that is just the timed-out data that happened to go through that CPU on its way to the network card? After that the network problem continues, so could it be an issue with the 10G card?
Thanks in advance
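To double-check that progression rather than eyeball it, here is a minimal Python sketch that pulls the "Fake Tx hang" entries out of the log and tests whether each timeout is twice the previous one. The sample lines are copied from the log above; the names `fake_tx_hangs` and `doubles_each_time` are just hypothetical helpers of mine, not part of any XenServer or ixgbe tooling.

```python
import re

# Sample lines copied verbatim from the kernel log quoted above.
LOG = """\
Dec 14 18:18:26 xenserver-2 kernel: [647650.757156] ixgbe 0000:0d:00.0 eth4: Fake Tx hang detected with timeout of 5 seconds
Dec 14 18:18:36 xenserver-2 kernel: [647660.752866] ixgbe 0000:0d:00.0 eth4: Fake Tx hang detected with timeout of 10 seconds
Dec 14 18:18:56 xenserver-2 kernel: [647680.784899] ixgbe 0000:0d:00.0 eth4: Fake Tx hang detected with timeout of 20 seconds
Dec 14 18:19:36 xenserver-2 kernel: [647720.848913] ixgbe 0000:0d:00.0 eth4: Fake Tx hang detected with timeout of 40 seconds
Dec 14 18:20:56 xenserver-2 kernel: [647800.848898] ixgbe 0000:0d:00.0 eth4: Fake Tx hang detected with timeout of 80 seconds
"""

# Matches the kernel uptime stamp, the interface name and the reported timeout.
PATTERN = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+ixgbe\s+\S+\s+(?P<iface>\S+): "
    r"Fake Tx hang detected with timeout of (?P<timeout>\d+) seconds"
)


def fake_tx_hangs(text):
    """Return (uptime_seconds, interface, timeout_seconds) for each matching line."""
    events = []
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:
            events.append(
                (
                    float(match.group("uptime")),
                    match.group("iface"),
                    int(match.group("timeout")),
                )
            )
    return events


def doubles_each_time(values):
    """True if every value is exactly twice the one before it."""
    return all(b == 2 * a for a, b in zip(values, values[1:]))


events = fake_tx_hangs(LOG)
timeouts = [timeout for _, _, timeout in events]
# Seconds elapsed between consecutive messages, from the kernel uptime stamps.
gaps = [later[0] - earlier[0] for earlier, later in zip(events, events[1:])]

print("timeouts:", timeouts)
print("doubling:", doubles_each_time(timeouts))
print("gaps (s):", [round(gap, 1) for gap in gaps])
```

To run it against the real file, replace `LOG` with the contents of the kernel log (for example by reading it with `open(...)`). The path depends on the installation, so I have not hard-coded one.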