Post date: Mar 01, 2017 9:8:45 PM
Discovering the problem:
I was recently doing an urgent troubleshoot. At some time during the troubleshoot, I needed to bring down a bunch of related processes on a group of RHEL 6 Linux hosts. Normally, this task is made easy with cluster ssh and a one-line shell script. But, after running initctl stop on all of these services and initctl status to make sure they were stopped: 'initctl status' said every one of them was stopped, but every one of them was still running. If this was a single process on a single host, I'd just send the kill signal and continue on without much further thought, but in this case I needed to manually kill a lot of processes (somewhere around 50 processes total). This led me to believe something was really wrong with the setup, since it was every single one of the processes that upstart failed to stop.
Reproducing the problem:
It turns out, this problem was extremely easy to reproduce on a local Virtual Machine. I installed a fresh CentOS 6 VM and set out to recreate the issue I had in production. First, I created a really simple java program that didn't do much but would run forever as a 'service'.
Here is what this service looks like, it just prints the date every second in an endless loop.
[root@vm6 ~]# cat fake.java
import java.util.*;
public class fake {
public static void main(String args[])
throws InterruptedException {
while (true) {
System.out.println(new Date() + " Testing");
Thread.sleep(1000);
}
}
}
[root@vm6 ~]# javac fake.java
[root@vm6 ~]# java fake
Wed Mar 01 11:24:01 CST 2017 Testing
Wed Mar 01 11:24:02 CST 2017 Testing
Wed Mar 01 11:24:03 CST 2017 Testing
^C
[root@vm6 ~]#
Now, I needed to write a simple upstart script, with a shell script to launch the java process.
Here is the upstart script, I made copies and changed variable PARAM to easily simulate multiple services:
[root@vm6 init]# cat /etc/init/fake1.conf
start on runlevel [345]
stop on runlevel [016]
env PARAM=test1
console none
respawn
exec su - testuser -c "/usr/local/bin/fake.sh $PARAM"
The shell script is quite simple:
[root@vm6 init]# cat /usr/local/bin/fake.sh
cd /usr/local/bin
java fake $1 >/tmp/fake_$1.log
Now to fire these up and see if they do the same thing when stopped.
[root@vm6 init]# start fake1
fake1 start/running, process 4413
[root@vm6 init]# start fake2
fake2 start/running, process 4436
[root@vm6 init]# start fake3
fake3 start/running, process 4459
[root@vm6 init]# ps -fp $(pgrep -f fake)
UID PID PPID C STIME TTY STAT TIME CMD
root 4413 1 0 11:39 ? Ss 0:00 su - testuser -c /usr/local/bin/fake.sh test1
testuser 4414 4413 0 11:39 ? Ss 0:00 -bash -c /usr/local/bin/fake.sh test1
testuser 4425 4414 0 11:39 ? Sl 0:00 java fake test1
root 4436 1 0 11:39 ? Ss 0:00 su - testuser -c /usr/local/bin/fake.sh test2
testuser 4437 4436 0 11:39 ? Ss 0:00 -bash -c /usr/local/bin/fake.sh test2
testuser 4448 4437 0 11:39 ? Sl 0:00 java fake test2
root 4459 1 0 11:39 ? Ss 0:00 su - testuser -c /usr/local/bin/fake.sh test3
testuser 4460 4459 0 11:39 ? Ss 0:00 -bash -c /usr/local/bin/fake.sh test3
testuser 4471 4460 0 11:39 ? Sl 0:00 java fake test3
[root@vm6 init]# stop fake1
fake1 stop/waiting
[root@vm6 init]# stop fake2
fake2 stop/waiting
[root@vm6 init]# stop fake3
fake3 stop/waiting
[root@vm6 init]# status fake1
fake1 stop/waiting
[root@vm6 init]# status fake2
fake2 stop/waiting
[root@vm6 init]# status fake3
fake3 stop/waiting
[root@vm6 init]# ps -fp $(pgrep -f fake)
UID PID PPID C STIME TTY STAT TIME CMD
testuser 4425 1 0 11:39 ? Sl 0:00 java fake test1
testuser 4448 1 0 11:39 ? Sl 0:00 java fake test2
testuser 4471 1 0 11:39 ? Sl 0:00 java fake test3
Well, that reproduced the problem quite nicely! I stopped all three services, and the shell scripts got killed but all three java processes remain running.
Now, for the hard part...
Troubleshooting / debugging
I started by using strace to trace system calls / signals on the three processes for one of the fake "services".
[root@vm6 ~]# start fake1
fake1 start/running, process 4515
[root@vm6 ~]# pgrep -fl fake
4515 su - testuser -c /usr/local/bin/fake.sh test1
4516 -bash -c /usr/local/bin/fake.sh test1
4527 java fake test1
[root@vm6 ~]# strace -tt -ff -p 4515 -p 4516 -p 4527 2>/tmp/trace1 &
[1] 4559
[root@vm6 ~]# stop fake1
fake1 stop/waiting
[root@vm6 ~]# pgrep -fl fake
4527 java fake test1
[root@vm6 ~]# kill 4527
[root@vm6 ~]#
[1]+ Done strace -tt -ff -p 4515 -p 4516 -p 4527 2> /tmp/trace1
Lets take a look at signals sent to our processes
[root@vm6 ~]# grep '\-\- SIG' /tmp/trace1
[pid 4515] 11:52:42.021765 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
[pid 4516] 11:52:42.022852 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=4515, si_uid=0} ---
[pid 4515] 11:52:44.033677 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=4515, si_uid=0} ---
[pid 4527] 11:52:58.028242 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=3931, si_uid=0} ---
So, I can see that init (si_pid=1) sent a kill to the "su" process. In turn, the "su" process (si_pid=4515) sent a kill to the shell script (pid 4516). But, the kill signal to the java process came from my shell (si_pid=3931), because the service's shell didn't propagate the kill signal to the java process. After researching this a little bit, I found that this is expected. If a shell process gets a kill signal, it won't kill sub-processes before exiting.
While messing around with the upstart script, I stumbled onto something. Here is an alternate version of the upstart script
[root@vm6 init]# cat fake1.conf
start on runlevel [345]
stop on runlevel [016]
env PARAM=test1
console none
respawn
exec su - testuser -- /usr/local/bin/fake.sh $PARAM
First, an observation on the change in behavior
[root@vm6 init]# start fake1
fake1 start/running, process 4610
[root@vm6 init]# start fake2
fake2 start/running, process 4633
[root@vm6 init]# start fake3
fake3 start/running, process 4656
[root@vm6 init]# pgrep -fl fake
4610 su - testuser -- /usr/local/bin/fake.sh test1
4611 -bash /usr/local/bin/fake.sh test1
4622 java fake test1
4633 su - testuser -- /usr/local/bin/fake.sh test2
4634 -bash /usr/local/bin/fake.sh test2
4645 java fake test2
4656 su - testuser -- /usr/local/bin/fake.sh test3
4657 -bash /usr/local/bin/fake.sh test3
4668 java fake test3
[root@vm6 init]# stop fake1
fake1 stop/waiting
[root@vm6 init]# stop fake2
fake2 stop/waiting
[root@vm6 init]# stop fake3
fake3 stop/waiting
[root@vm6 init]# pgrep -fl fake
[root@vm6 init]#
So, I have to wonder, why removing the "-c" in favor of "--" fixed the issue I had experienced. Lets do another strace to see what's going on.
[root@vm6 init]# start fake1
fake1 start/running, process 4687
[root@vm6 init]# pgrep -fl fake
4687 su - testuser -- /usr/local/bin/fake.sh test1
4688 -bash /usr/local/bin/fake.sh test1
4699 java fake test1
[root@vm6 init]# strace -tt -ff -p 4687 -p 4688 -p 4699 2>/tmp/trace2 &
[1] 4710
[root@vm6 init]# stop fake1
fake1 stop/waiting
[root@vm6 init]#
[1]+ Done strace -tt -ff -p 4687 -p 4688 -p 4699 2> /tmp/trace2
Let's see those kill signals this time:
[root@vm6 init]# grep '\-\- SIG' /tmp/trace2
[pid 4699] 12:09:07.521458 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
[pid 4688] 12:09:07.521507 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
[pid 4687] 12:09:07.521531 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
12:09:09.538486 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=4687, si_uid=0} ---
Wait, what?!? Upstart sent the kill signal to all three processes. Now I start to feel like I might know how Alice felt after discovering the rabbit hole went much deeper than it first appeared. So, what's going on here? Why would init care if I used "-c" or any other parameter for that matter?
Ok, let's trace that again, including the init process in our trace. I'll have to manually stop the trace, as (I hope) the init process won't exit.
[root@vm6 init]# start fake
start: Unknown job: fake
[root@vm6 init]# start fake1
fake1 start/running, process 4722
[root@vm6 init]# pgrep -fl fake
4722 su - testuser -- /usr/local/bin/fake.sh test1
4723 -bash /usr/local/bin/fake.sh test1
4734 java fake test1
[root@vm6 init]# strace -tt -ff -p 1 -p 4722 -p 4723 -p 4734 2>/tmp/trace3 &
[1] 4745
[root@vm6 init]# stop fake1
fake1 stop/waiting
[root@vm6 init]# fg
strace -tt -ff -p 1 -p 4722 -p 4723 -p 4734 2> /tmp/trace3
^C
[root@vm6 init]#
Let's take a look, including kill syscall and signals in our grep filter this time
[root@vm6 init]# egrep -e 'kill\(|\-\- SIG' /tmp/trace3
[pid 1] 12:14:42.940553 kill(4294962574, SIGTERM <unfinished ...>
[pid 4734] 12:14:42.940958 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
[pid 4723] 12:14:42.941006 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
[pid 4722] 12:14:42.941848 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
[pid 4722] 12:14:42.943613 kill(4723, SIGTERM <unfinished ...>
[pid 1] 12:14:42.963620 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=4734, si_status=143, si_utime=11, si_stime=11} ---
[pid 4722] 12:14:44.958977 kill(4723, SIGKILL) = 0
[pid 4722] 12:14:44.959284 kill(4722, SIGTERM) = 0
[pid 4722] 12:14:44.959349 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=4722, si_uid=0} ---
12:14:44.959954 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=4723, si_status=SIGTERM, si_utime=0, si_stime=2} ---
So, upstart sends a kill to... pid 4294962574. That doesn't seem right. I think this is a display error in strace; my guess is that strace is displaying an unsigned long, and the pid is negative. That's right, a negative pid. From kill's man page we can see what the intent is.
-n where n is larger than 1. All processes in process group n are signaled. When an argument of the
form ‘-n’ is given, and it is meant to denote a process group, either the signal must be specified
first, or the argument must be preceded by a ‘--’ option, otherwise it will be taken as the signal to
send.
OK, so here is the magic. Upstart is sending the kill signal to a process group, not to a pid. That explains why all three of my latest processes got the kill signal. Now, lets back up a step and look at what the difference is between "-c" vs. "--".
One service with "--" and one with "-c"
4949 su - testuser -- /usr/local/bin/fake.sh test1
4950 -bash /usr/local/bin/fake.sh test1
4961 java fake test1
4972 su - testuser -c /usr/local/bin/fake.sh test2
4973 -bash -c /usr/local/bin/fake.sh test2
4984 java fake test2
So, what's different? Let's check process groups using "-O pgrp" with ps
[root@vm6 init]# ps -O pgrp -p $(pgrep -f fake)
PID PGRP S TTY TIME COMMAND
4949 4949 S ? 00:00:00 su - testuser -- /usr/local/bin/fake.sh test1
4950 4949 S ? 00:00:00 -bash /usr/local/bin/fake.sh test1
4961 4949 S ? 00:00:00 java fake test1
4972 4972 S ? 00:00:00 su - testuser -c /usr/local/bin/fake.sh test2
4973 4973 S ? 00:00:00 -bash -c /usr/local/bin/fake.sh test2
4984 4973 S ? 00:00:00 java fake test2
Here is the "Aha!" moment for me. I can see that with "-c", the shell script is in a new process group and with "--" it is in same process group as su. In this case, when we stop fake1, init sends kill to process group 4949, killing all three. When we stop fake2, init sends kill to process group 4972, sending kill to only the "su" process. "su" intercepts the kill signal and sends kill to the shell subprocess, pid 4973. The shell exits, leaving java running. With this knowledge, I've come up with a couple more ways to fix the issue.
The "Fix"
use "--" instead of "-c" to keep same process group across the 3 service processes
use "exec java" in the shell script so that java replaces the shell process and gets killed by "su"
trap kill signal in shell script and send kill to java process before exiting
Now, I have to wonder... upstart sends kill to process group. What about SYSV init (CentOS 5), or systemd (CentOS 7)?
So, option 1 is out for RHEL 7, because systemd sends kill signal to the pid and not to the process group like sysv or upstart.