
A post-mortem: how a tail in a shell script took down my Tomcats

In the dead of night, the Tomcats on both of my servers went down at the same time, each reporting the following:

Jul 31, 2018 11:40:00 PM com.ex.service.voip.VoipService finishTransactions
INFO: scheduled finish VoIP transactions 2018-07-31T23:40:00.000+08:00
Jul 31, 2018 11:47:00 PM org.apache.coyote.AbstractProtocol pause
INFO: Pausing ProtocolHandler ["http-nio-8866"]
Jul 31, 2018 11:47:00 PM org.apache.coyote.AbstractProtocol pause
INFO: Pausing ProtocolHandler ["ajp-bio-8009"]
Jul 31, 2018 11:47:00 PM org.apache.catalina.core.StandardService stopInternal
INFO: Stopping service Catalina
Jul 31, 2018 11:47:02 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Closed connection [connectionId{localValue:13, serverValue:9334536}] to 172.17.10.116:3717 because the pool has been closed.
Jul 31, 2018 11:47:02 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Closed connection [connectionId{localValue:9, serverValue:9334532}] to 172.17.10.116:3717 because the pool has been closed.
Jul 31, 2018 11:47:02 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Closed connection [connectionId{localValue:6, serverValue:9334529}] to 172.17.10.116:3717 because the pool has been closed.
Jul 31, 2018 11:47:02 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Closed connection [connectionId{localValue:5, serverValue:9334518}] to 172.17.10.116:3717 because the pool has been closed.
Jul 31, 2018 11:47:02 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Closed connection [connectionId{localValue:10, serverValue:9334534}] to 172.17.10.116:3717 because the pool has been closed.
Jul 31, 2018 11:47:02 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Closed connection [connectionId{localValue:14, serverValue:9334537}] to 172.17.10.116:3717 because the pool has been closed.
Jul 31, 2018 11:47:02 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Closed connection [connectionId{localValue:12, serverValue:9334535}] to 172.17.10.116:3717 because the pool has been closed.
Jul 31, 2018 11:47:02 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Closed connection [connectionId{localValue:18, serverValue:9334542}] to 172.17.10.116:3717 because the pool has been closed.
Jul 31, 2018 11:47:02 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Closed connection [connectionId{localValue:8, serverValue:9334531}] to 172.17.10.116:3717 because the pool has been closed.
Jul 31, 2018 11:47:02 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Closed connection [connectionId{localValue:17, serverValue:9334543}] to 172.17.10.116:3717 because the pool has been closed.
Jul 31, 2018 11:47:02 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Closed connection [connectionId{localValue:15, serverValue:9334538}] to 172.17.10.116:3717 because the pool has been closed.
Jul 31, 2018 11:47:02 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Closed connection [connectionId{localValue:7, serverValue:9334530}] to 172.17.10.116:3717 because the pool has been closed.
Jul 31, 2018 11:47:02 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Closed connection [connectionId{localValue:11, serverValue:9334533}] to 172.17.10.116:3717 because the pool has been closed.
Jul 31, 2018 11:47:02 PM com.mongodb.diagnostics.logging.JULLogger log
INFO: Closed connection [connectionId{localValue:16, serverValue:9334541}] to 172.17.10.116:3717 because the pool has been closed.
Jul 31, 2018 11:47:02 PM org.apache.catalina.loader.WebappClassLoaderBase clearReferencesThreads
SEVERE: The web application [] appears to have started a thread named [pool-2-thread-1] but has failed to stop it. This is very likely to create a memory leak.
Jul 31, 2018 11:47:02 PM org.apache.catalina.loader.WebappClassLoaderBase clearReferencesThreads
SEVERE: The web application [] appears to have started a thread named [pool-2-thread-2] but has failed to stop it. This is very likely to create a memory leak.
Jul 31, 2018 11:47:02 PM org.apache.catalina.loader.WebappClassLoaderBase clearReferencesThreads
SEVERE: The web application [] appears to have started a thread named [pool-2-thread-3] but has failed to stop it. This is very likely to create a memory leak.
Jul 31, 2018 11:47:02 PM org.apache.catalina.loader.WebappClassLoaderBase clearReferencesThreads
SEVERE: The web application [] appears to have started a thread named [pool-2-thread-4] but has failed to stop it. This is very likely to create a memory leak.
Jul 31, 2018 11:47:02 PM org.apache.catalina.loader.WebappClassLoaderBase clearReferencesThreads
SEVERE: The web application [] appears to have started a thread named [pool-2-thread-5] but has failed to stop it. This is very likely to create a memory leak.
Jul 31, 2018 11:47:02 PM org.apache.catalina.loader.WebappClassLoaderBase clearReferencesThreads
SEVERE: The web application [] appears to have started a thread named [pool-2-thread-6] but has failed to stop it. This is very likely to create a memory leak.
Jul 31, 2018 11:47:02 PM org.apache.catalina.loader.WebappClassLoaderBase clearReferencesThreads
SEVERE: The web application [] appears to have started a thread named [pool-2-thread-7] but has failed to stop it. This is very likely to create a memory leak.
Jul 31, 2018 11:47:02 PM org.apache.catalina.loader.WebappClassLoaderBase clearReferencesThreads
SEVERE: The web application [] appears to have started a thread named [pool-2-thread-8] but has failed to stop it. This is very likely to create a memory leak.
Jul 31, 2018 11:47:02 PM org.apache.catalina.loader.WebappClassLoaderBase clearReferencesThreads
SEVERE: The web application [] appears to have started a thread named [pool-2-thread-9] but has failed to stop it. This is very likely to create a memory leak.
Jul 31, 2018 11:47:02 PM org.apache.catalina.loader.WebappClassLoaderBase clearReferencesThreads
SEVERE: The web application [] appears to have started a thread named [pool-2-thread-10] but has failed to stop it. This is very likely to create a memory leak.
(more log output followed)

Looking at the log above, my first guess was a problem with the MongoDB connection pool, then a memory leak, but further digging ruled out both. I then figured it was because I had started the Tomcats on both machines at the same time while deploying the war. With that theory in mind, I was extra careful during the next night's deployment: I started the instances one at a time, and after both had come up I even tested that everything was fine. But by the time I got back from a shower, there was an alert that the services were down. What the hell? I brought the services back up first, then took another careful look at the full log and noticed it didn't match what I normally see on a shutdown. Trimmed down, it looked roughly like this:

Aug 03, 2018 12:02:00 AM org.apache.coyote.AbstractProtocol pause
INFO: Pausing ProtocolHandler ["http-nio-8866"]
Aug 03, 2018 12:02:00 AM org.apache.coyote.AbstractProtocol pause
INFO: Pausing ProtocolHandler ["ajp-bio-8009"]
Aug 03, 2018 12:02:00 AM org.apache.catalina.core.StandardService stopInternal
INFO: Stopping service Catalina
Aug 03, 2018 12:02:01 AM org.apache.coyote.AbstractProtocol stop
INFO: Stopping ProtocolHandler ["http-nio-8866"]
Aug 03, 2018 12:02:01 AM org.apache.coyote.AbstractProtocol stop
INFO: Stopping ProtocolHandler ["ajp-bio-8009"]
Aug 03, 2018 12:02:01 AM org.apache.coyote.AbstractProtocol destroy
INFO: Destroying ProtocolHandler ["http-nio-8866"]
Aug 03, 2018 12:02:01 AM org.apache.coyote.AbstractProtocol destroy
INFO: Destroying ProtocolHandler ["ajp-bio-8009"]

Searching for "org.apache.coyote.AbstractProtocol pause" turned up someone with exactly the same problem: https://blog.csdn.net/zhouyannian1988/article/details/53508689 . After reading his post and looking back at my own situation, the root cause turned out to be a flaw in my script.

 

To save time (and hassle) every time I deploy a war package, I wrote a script on my servers to do it for me:

cat deploy_war.sh

#!/bin/bash

tag_id=110  # bump this tag_id for each release, then just run the script
source_warname=server-api-11.0.war  # name of the war file; rarely changes

tag_path=/data/tags  # where the tagged wars live
target_warname=$tag_id-$source_warname  # new war name: the tag_id plus the original name
cp $tag_path/$tag_id/$source_warname ./$target_warname  # copy it into the deploy dir (where this script lives)

dir=${target_warname%.*}  # directory to unzip into (war name without the extension)
unzip -d $dir $target_warname  # unzip into that directory
rsync -avzPO --delete ./$dir/ /data/resources/api-webcontent/  # sync into the Tomcat web content directory
rm -rf ./$dir  # remove the unzipped directory

sleep 5
/data/scripts/tomcat7_api.sh restart  # restart Tomcat
/data/scripts/tomcat7_api.sh log  # then tail the Tomcat log

A note on the Tomcat restart part at the end of the script:

/data/scripts/tomcat7_api.sh restart stops Tomcat and then starts it again.

/data/scripts/tomcat7_api.sh log runs tail -f on catalina.out, so I don't have to type the command myself to check whether startup succeeded or threw errors.

 

I had been using this script for quite a while with no trouble, yet on this particular night it blew up. The reason: after the script runs the Tomcat restart, it then runs log, which calls tail, leaving the script hanging there indefinitely.

At the office, my habit after running ./deploy_war.sh was to leave the ssh terminal window open once everything looked fine, and later, as part of the normal routine, press Ctrl+C to stop the tail and Ctrl+D to log out of ssh. Never a problem, because the office network is stable.

This deployment, however, I did from home. As usual, I left the window open after the script finished, verified everything was fine, and went to take a shower; when I came back, the alert said both Tomcats were down. The cause: my home network is flaky, and it dropped while I was away, killing my ssh session, including the terminal that was still tailing the log from the script.

After some digging, the problem turned out to be exactly that last line of the script, the log call that tails the log. If you don't end it manually with Ctrl+C, and the ssh terminal then dies unexpectedly, Tomcat gets stopped. Here's why: when the ssh terminal is closed, sshd delivers SIGHUP to the bash process; bash passes SIGHUP on to its children, and for its child deploy_war.sh it propagates SIGHUP to every member of deploy_war.sh's process group. The java process is also a member of deploy_war.sh's process group, so it receives the signal and gets killed.

 

The deeper cause:

In non-interactive mode, the shell sets background processes to ignore SIGINT.

Interactive and non-interactive mode differ in their default handling of job control:

in interactive mode the shell does not set background processes to ignore SIGINT, while in non-interactive mode it does.

 

In interactive mode, each background process gets its own process group ID. If a background process kept the parent's process group ID by default, the parent would propagate keyboard-generated signals such as the SIGINT from Ctrl+C to every member of the group; and since job control requires that SIGINT not be ignored, a casual Ctrl+C at the terminal could kill all your background processes, which is clearly unreasonable. So, to avoid that interference, a background process is given its own pgid.

 

In non-interactive mode, job control is usually unnecessary, so it is off by default (though a script can turn it on with set -m). With job control off, the background processes in a script can simply be set to ignore SIGINT, staying out of the way of the parent's group-wide signal propagation, since the signal is meaningless to them.
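The pgid behavior described above can be observed directly. Below is a minimal sketch (assuming a Linux box with bash and a procps-style ps): with job control off, a background child inherits the shell's process group; with set -m, it becomes the leader of its own group.

```shell
#!/bin/bash
# Job control off (the default in a non-interactive shell):
# the background child inherits the shell's process group.
read shell_pgid child_pgid <<< "$(bash -c 'sleep 2 >/dev/null & echo "$(ps -o pgid= -p $$) $(ps -o pgid= -p $!)"')"
echo "no set -m: shell pgid=$shell_pgid, child pgid=$child_pgid"

# Job control on (set -m): the background child gets its own
# process group, whose pgid equals the child's own pid.
read child_pid child_pgid2 <<< "$(bash -c 'set -m; sleep 2 >/dev/null & echo "$! $(ps -o pgid= -p $!)"')"
echo "set -m:    child pid=$child_pid, child pgid=$child_pgid2"
```

In the first case the two pgids printed are equal; in the second, the child's pgid equals its own pid.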

 

Back to my Tomcat:

Tomcat was started in the background by a non-interactive shell, so the shell set the java process to ignore SIGINT. That is why the SIGINT the system sends when I press Ctrl+C to end the deploy_war.sh process has no effect on java.

In non-interactive mode, the shell set the java process to ignore SIGINT and SIGQUIT, but it did not set SIGHUP to be ignored. So a SIGHUP (kill -1) will make the Tomcat process exit.

 

When the ssh terminal is closed directly, sshd delivers SIGHUP to the bash process, bash passes SIGHUP on to its children, and for its child deploy_war.sh it propagates SIGHUP to every member of deploy_war.sh's process group. Because the background java process inherited its pgid from its parent catalina.sh (which in turn inherited it from its parent deploy_war.sh), java is still a member of deploy_war.sh's process group, so it receives SIGHUP and exits.
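The group-wide delivery can be reproduced with kill and a negative pid, which signals a whole process group, effectively what the java process saw when the terminal went away. A small sketch, assuming the util-linux setsid command is available (it stands in for the ssh session, giving us an isolated group to shoot at):

```shell
#!/bin/bash
# A script-like process group: a group leader with a background child.
setsid bash -c 'sleep 30 & wait' &
leader=$!
sleep 0.5   # give it time to spawn the child and reach wait

# Negative pid: deliver SIGHUP to every member of the leader's group.
kill -HUP -- -"$leader"
wait "$leader"
status=$?
echo "group leader exit status: $status"   # 129 = 128 + SIGHUP(1)
```

Both the leader and its background sleep receive the SIGHUP and die, just as catalina.sh's java child did.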

 

Solutions:

Option 1: add "set -m" at the top of the script to turn job control on; the java process will then no longer exit. With job control enabled, the background java process inherits the pgid of its parent catalina.sh, and catalina.sh no longer shares deploy_war.sh's process group but uses its own pid as its pgid. When catalina.sh finishes and exits, the java process is re-parented under init, completely detached from deploy_war.sh, and bash will never send it a signal again.

Option 2: remove the last line that tails the log, so the script runs to completion and exits normally, and check the log by hand instead.
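Option 1 can be sanity-checked with a small experiment. This is only a sketch under a couple of assumptions (Linux, bash, and the util-linux setsid command standing in for the ssh session): with set -m inside the inner script, its background child sits in its own process group, so a SIGHUP aimed at the script's whole group misses it.

```shell
#!/bin/bash
pidfile=$(mktemp)

# Inner "deploy script" with job control on; its background child
# (standing in for the java process) gets its own process group.
setsid bash -c 'set -m; sleep 30 & echo $! > '"$pidfile"'; wait' &
leader=$!
sleep 0.5

kill -HUP -- -"$leader"   # hang up the script's whole process group
sleep 0.5

child=$(cat "$pidfile")
if kill -0 "$child" 2>/dev/null; then verdict=survived; else verdict=killed; fi
echo "background child after group SIGHUP: $verdict"

kill "$child" 2>/dev/null  # cleanup
rm -f "$pidfile"
```

The inner script itself dies on the SIGHUP, but its background child survives, which is exactly what set -m buys you in deploy_war.sh.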

 

 

For more about "set -m", run help set or see the description of the set builtin in man bash.

 

For a more detailed write-up, see this blogger's post; his situation was identical to mine.

https://blog.csdn.net/zhouyannian1988/article/details/53508689

 

For reference, here is my Tomcat control script:

cat tomcat7_api.sh

#!/bin/bash

tomcat_home=/data/app/tomcat7-api
tomcat_name=tomcat7-api
tomcat_port=8080

watchlog=$tomcat_home/logs/catalina.out
date=`date +%Y-%m-%d-%H:%M:%S`


flag=0

start() {
        status
        if [ $flag -eq 1 ]; then
                exit
        elif [ $flag -eq 2 ]; then
                stop
                $tomcat_home/bin/startup.sh
        else
                $tomcat_home/bin/startup.sh
        fi
}

stop() {
        status
        if [ $flag -ne 0 ]; then
                $tomcat_home/bin/shutdown.sh
                sleep 5

                status
                if [ $flag -ne 0 ]; then
                        echo "[$date] tomcat process not dead after shutdown.sh, let's kill it."
                        ps aux|grep $tomcat_name|grep -v grep|awk '{print $2}'|xargs kill -9
                        sleep 5
                        status
                        if [ $flag -eq 0 ]; then
echo "[$date] tomcat                            [ stopped ]"
                        else
                                echo "[$date] tomcat                            [ failed ]"

                        fi
                        echo "[$date] Clean work cache..."
                        rm -rf $tomcat_home/work/Catalina/*
                
                        echo "[$date] Clean cache done."
                fi
        fi
}

status() {
        count_listen=`netstat -natp|grep $tomcat_port|grep "LISTEN"|wc -l`
        count_ps=`ps aux|grep $tomcat_name|grep java|grep -v grep|wc -l`

        if [ $count_listen -eq 0 ] && [ $count_ps -eq 0 ]; then
                echo "[$date] tomcat is stopped."
                flag=0
        elif [ $count_listen -eq 1 ] && [ $count_ps -eq 1 ]; then
                echo "[$date] tomcat is running."
                flag=1
        elif [ $count_listen -eq 0 ] && [ $count_ps -ge 1 ]; then
                echo "[$date] tomcat port is closed, but process still running."
                flag=2
        fi
}

log() {
        echo "log file is: $watchlog"
        sleep 1
        tail -20f $watchlog
}


case "$1" in
        start)
                start
                ;;
        stop)
                stop
                ;;
        restart)
                stop
                start
                ;;
        status)
                status
                ;;
        log)
                log
                ;;
        *)
                echo $"Usage: $0 {start|stop|status|log|restart}"
esac

 
