"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

tail /var/log/messages
echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 19 05:27:57 web kernel: Call Trace:
May 19 05:27:57 web kernel: [<ffffffff811d10a0>] ? sync_buffer+0x0/0x50
May 19 05:27:57 web kernel: [<ffffffff81549183>] io_schedule+0x73/0xc0
May 19 05:27:57 web kernel: [<ffffffff811d10e0>] sync_buffer+0x40/0x50
May 19 05:27:57 web kernel: [<ffffffff81549c6f>] __wait_on_bit+0x5f/0x90
May 19 05:27:57 web kernel: [<ffffffff811d10a0>] ? sync_buffer+0x0/0x50
May 19 05:27:57 web kernel: [<ffffffff81549d18>] out_of_line_wait_on_bit+0x78/0x90
May 19 05:27:57 web kernel: [<ffffffff810a6920>] ? wake_bit_function+0x0/0x50
May 19 05:27:57 web kernel: [<ffffffff811d1096>] __wait_on_buffer+0x26/0x30
May 19 05:27:57 web kernel: [<ffffffffa00707ef>] jbd2_journal_commit_transaction+0x117f/0x14f0 [jbd2]
May 19 05:27:57 web kernel: [<ffffffff8108fb2b>] ? try_to_del_timer_sync+0x7b/0xe0
May 19 05:27:57 web kernel: [<ffffffffa0075a38>] kjournald2+0xb8/0x220 [jbd2]
May 19 05:27:57 web kernel: [<ffffffff810a68a0>] ? autoremove_wake_function+0x0/0x40
May 19 05:27:57 web kernel: [<ffffffffa0075980>] ? kjournald2+0x0/0x220 [jbd2]
May 19 05:27:57 web kernel: [<ffffffff810a640e>] kthread+0x9e/0xc0
May 19 05:27:57 web kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
May 19 05:27:57 web kernel: [<ffffffff810a6370>] ? kthread+0x0/0xc0
May 19 05:27:57 web kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
May 19 05:28:40 web kernel: end_request: I/O error, dev vdb, sector 564522839
May 19 05:33:42 web kernel: end_request: I/O error, dev vdb, sector 564522839
May 19 05:38:44 web kernel: end_request: I/O error, dev vdb, sector 564522839

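A process waiting on I/O is usually in the D state (TASK_UNINTERRUPTIBLE). A process in this state does not handle signals, so it cannot be killed. A process that stays in the D state for a long time indicates a problem.
There are two likely causes:
1) Hardware on the I/O path has failed, for example a bad disk (this only occasionally causes a long-lived D state; usually an error is returned instead)
2) A bug in the kernel itself
Such problems are hard to diagnose and usually unrecoverable once they occur: the process cannot be killed, and a reboot is typically the only way out.
The kernel provides a hung task detection mechanism for this case.
How it works: the kernel periodically scans for processes in the D state. If a process has been in the D state longer than a set timeout (120s by default, configurable), the kernel prints its stack trace. A proc parameter can also make the kernel panic instead.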
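
The detection described above can be inspected from user space. The following is a minimal sketch, assuming a Linux system with procps ps; the function names filter_d_state and show_hung_task_settings are made up for illustration and are not part of any tool.

```shell
#!/bin/sh
# Keep only processes in uninterruptible sleep (state D).
# Input: lines of "PID STAT WCHAN COMMAND", as printed by
#   ps -eo pid,stat,wchan:32,comm
# The header line is dropped because its second field is "STAT".
filter_d_state() {
    awk '$2 ~ /^D/ { print }'
}

# Print the hung-task detector settings if this kernel exposes them.
show_hung_task_settings() {
    for f in /proc/sys/kernel/hung_task_timeout_secs \
             /proc/sys/kernel/hung_task_panic; do
        if [ -r "$f" ]; then
            printf '%s = %s\n' "$f" "$(cat "$f")"
        fi
    done
}

show_hung_task_settings
ps -eo pid,stat,wchan:32,comm 2>/dev/null | filter_d_state
```

A process that keeps appearing in this list across several runs, with the same wait channel, is a candidate for the hung-task report shown in the log above.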

1. Check for bad blocks (substitute the affected device; the log above reports errors on vdb)
/sbin/badblocks -v /dev/sdc
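
Before scanning a whole disk, it can help to see which devices and sectors the kernel has already complained about. This is a rough sketch that assumes the log lines have the same "I/O error, dev X, sector N" shape as the excerpt above; summarize_io_errors is a hypothetical helper name.

```shell
#!/bin/sh
# Summarize kernel I/O error lines read from stdin.
# Output: one line per (device, sector) pair: "DEVICE SECTOR COUNT".
summarize_io_errors() {
    awk '/I\/O error, dev / {
        dev = ""; sec = ""
        for (i = 1; i < NF; i++) {
            if ($i == "dev")    { dev = $(i + 1); sub(/,$/, "", dev) }
            if ($i == "sector") { sec = $(i + 1) }
        }
        if (dev != "" && sec != "") count[dev " " sec]++
    }
    END { for (k in count) print k, count[k] }' | sort
}

# Run against the system log if it is readable (path taken from the notes above).
if [ -r /var/log/messages ]; then
    summarize_io_errors < /var/log/messages
fi
```

The same sector failing repeatedly, as in the log above, points at one bad region rather than at a transient transport problem.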

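2. Problem analysis
The top of the call trace shows the task sleeping in io_schedule, reached through sync_buffer, which means it is waiting for a buffer I/O to complete: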
May 19 05:27:57 web kernel: [<ffffffff811d10a0>] ? sync_buffer+0x0/0x50
May 19 05:27:57 web kernel: [<ffffffff81549183>] io_schedule+0x73/0xc0
May 19 05:27:57 web kernel: [<ffffffff811d10e0>] sync_buffer+0x40/0x50
May 19 05:27:57 web kernel: [<ffffffff81549c6f>] __wait_on_bit+0x5f/0x90

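3. Temporary workaround
Depending on the application workload, tune the two parameters vm.dirty_ratio and vm.dirty_background_ratio. Values set with sysctl -w take effect immediately and last until the next reboot.
# sysctl -w vm.dirty_ratio=10
# sysctl -w vm.dirty_background_ratio=5

To make the settings permanent, add them to /etc/sysctl.conf.
# vi /etc/sysctl.conf

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

The file is read at boot; to apply it without rebooting, run:
# sysctl -p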
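
To get a feel for what these percentages mean on a given machine, the sketch below converts them to approximate sizes. It assumes MemTotal as a stand-in for the memory the kernel actually counts as dirtyable, so the real thresholds will be somewhat lower; dirty_threshold_kb is a hypothetical helper.

```shell
#!/bin/sh
# Approximate a dirty-page threshold in kB.
#   $1 = memory size in kB, $2 = ratio in percent
# Assumption: the caller passes MemTotal, which overestimates the
# kernel's dirtyable memory.
dirty_threshold_kb() {
    echo $(( $1 * $2 / 100 ))
}

if [ -r /proc/meminfo ]; then
    mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    if [ -n "$mem_kb" ]; then
        # vm.dirty_background_ratio = 5: background writeback starts here.
        echo "background writeback near $(dirty_threshold_kb "$mem_kb" 5) kB dirty"
        # vm.dirty_ratio = 10: writing processes start being throttled here.
        echo "writer throttling near $(dirty_threshold_kb "$mem_kb" 10) kB dirty"
    fi
fi
```

Lower ratios mean less dirty data accumulates before writeback begins, so each flush has less to push to a slow or failing disk.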
http://www.361way.com/kernel-hung-task-analysis/4326.html

/dev/vda1 Inodes that were part of a corrupted orphan linked list found.

/dev/vda1 contains a file system with errors, check forced.
/dev/vda1 Inodes that were part of a corrupted orphan linked list found.
/dev/vda1 UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.

# Check the filesystem (unmount it first)
1. ext4 filesystem
fsck -y /dev/vda1
or
fsck.ext4 -a /dev/vda1

2. xfs filesystem
xfs_repair -n /dev/vda1 # check only: reports damage without modifying the filesystem
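
Both tools have a no-modify mode (-n), which is a safer first step than repairing straight away. The helper below is a small sketch that only prints the matching read-only command for a filesystem type; read_only_check_cmd is a hypothetical name, and it deliberately covers just the two filesystems discussed here.

```shell
#!/bin/sh
# Print the read-only check command for a filesystem type and device.
#   $1 = filesystem type (ext4 or xfs), $2 = device
# Nothing is executed; the command is only printed for review.
read_only_check_cmd() {
    case "$1" in
        ext4) echo "fsck.ext4 -n $2" ;;
        xfs)  echo "xfs_repair -n $2" ;;
        *)    echo "unsupported filesystem type: $1" >&2
              return 1 ;;
    esac
}
```

For example, read_only_check_cmd "$(blkid -o value -s TYPE /dev/vda1)" /dev/vda1 prints the command for whatever type blkid detects on the device.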

fsck.ext4 [-panyrcdfvtDFV] [-b superblock] [-B blocksize]
[-I inode_buffer_blocks] [-P process_inode_size]
[-l|-L bad_blocks_file] [-C fd] [-j external_journal]
[-E extended-options] device

Emergency help:
-p Automatic repair (no questions)
-n Make no changes to the filesystem
-y Assume "yes" to all questions
-c Check for bad blocks and add them to the badblock list
-f Force checking even if filesystem is marked clean
-v Be verbose
-b superblock Use alternative superblock
-B blocksize Force blocksize when looking for superblock
-j external_journal Set location of the external journal
-l bad_blocks_file Add to badblocks list
-L bad_blocks_file Set badblocks list

Usage: xfs_repair [options] device
Options:
-f The device is a file
-L Force log zeroing. Do this as a last resort.
-l logdev Specifies the device where the external log resides.
-m maxmem Maximum amount of memory to be used in megabytes.
-n No modify mode, just checks the filesystem for damage.
-P Disables prefetching.
-r rtdev Specifies the device where the realtime section resides.
-v Verbose output.
-c subopts Change filesystem parameters - use xfs_admin.
-o subopts Override default behaviour, refer to man page.
-t interval Reporting interval in seconds.
-d Repair dangerously.
-V Reports version and exits.
https://support.microsoft.com/en-in/help/3213321/linux-recovery-cannot-ssh-to-linux-vm-due-to-file-system-errors-fsck