应用strace排查程序内部问题

隔壁组发来一个SDK程序，需要集成到现有Go项目中。这个SDK目测是C/C++编写的，编译成.so（动态链接库）文件，需要由CGO调用。
然而实际测试过程中发现一个奇怪的现象：程序在虚拟机的Linux系统中运行正常，但在Docker容器中总是报出error exit status 127的错误。由于该项目负责人正在休假，我的工作也无法继续进行了。有没有一种办法可以查看程序内部的详细执行过程信息呢？

strace简介

以下是维基百科的介绍：

strace是Linux的诊断，调试和指导性的用户空间实用程序。它用于监视和篡改进程与Linux内核之间的交互，包括系统调用，信号传递和进程状态更改。
最常见的用法是使用strace启动程序，该程序将打印该程序进行的系统调用的列表。如果程序连续崩溃或无法按预期运行，这将很有用；例如，使用strace可能表明程序正在尝试访问不存在或无法读取的文件。
一个替代的应用程序是使用该-p标志将其附加到正在运行的进程。如果某个进程已停止响应，并且例如在尝试建立网络连接时可能显示该进程正在阻塞，则此功能很有用。

分别在正常运行和不正常环境中执行以下命令

1	strace -o output.txt ./demo

发现output.txt有以下结果

# 正常的
...
waitid(P_PID, 25467, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=25467, si_status=0, si_utime=0, si_stime=0}, WEXITED|WNOWAIT, NULL) = 0
--- SIGCHLD (Child exited) @ 0 (0) ---
rt_sigreturn(0xc000000180)              = 0
wait4(25467, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 354946}, ru_stime={0, 112982}, ...}) = 25467
exit_group(0)

# 不正常的
...
waitid(P_PID, 65, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=65, si_uid=0, si_status=127, si_utime=0, si_stime=0}, WEXITED|WNOWAIT, NULL) = 0
wait4(65, [{WIFEXITED(s) && WEXITSTATUS(s) == 127}], 0, {ru_utime={tv_sec=0, tv_usec=1104}, ru_stime={tv_sec=0, tv_usec=0}, ...}) = 65
write(1, "error exit status 127\n", 22) = 22
write(1, "Aes decrpyt error: exit status 1"..., 35) = 35
write(1, "exit status 127\n", 16)       = 16
exit_group(0)                           = ?
+++ exited with 0 +++

可见，不正常的环境中，程序确实报出了error exit status 127的错误，但是有没有进一步的信息呢？
由于日志中含有wait4...的信息，我们怀疑实际出问题的是子进程，strace -f可以跟踪由fork调用所产生的子进程，执行如下：

1	strace -o -f output.txt ./demo

发现output.txt有以下结果

# 正常的
25326 clone( <unfinished ...>
25330 set_robust_list(0x7f8e6f7f89e0, 0x18) = 0
25330 sigaltstack(NULL, {ss_sp=0, ss_flags=SS_DISABLE, ss_size=0}) = 0
25330 sigaltstack({ss_sp=0xc00006a000, ss_flags=0, ss_size=32768}, NULL) = 0
25330 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
25330 gettid()                          = 25330
...
25331 execve("/bin/sh", ["sh", "-c", "openssl enc -aes-256-cbc -d -K 3"...], [/* 38 vars */] <unfinished ...>
25326 <... clone resumed> child_stack=0, flags=CLONE_VM|CLONE_VFORK|SIGCHLD) = 25331
25326 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
25326 close(9)                          = 0

# 不正常的
104   write(2, "sh: openssl: command not found\n", 31) = 31
100   <... epoll_pwait resumed>[{EPOLLIN, {u32=2830700296, u64=139851260825352}}], 128, -1, NULL, 3) = 1
104   exit_group(127)                   = ?
100   futex(0xb0c270, FUTEX_WAKE_PRIVATE, 1) = 1
99    <... futex resumed>)              = 0
104   +++ exited with 127 +++
100   read(5,  <unfinished ...>
99    --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=104, si_uid=0, si_status=127, si_utime=0, si_stime=1} ---
98    <... waitid resumed>{si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=104, si_uid=0, si_status=127, si_utime=0, si_stime=0}, WEXITED|WNOWAIT, NULL) = 0
100   <... read resumed>"sh: openssl: command not found\n", 512) = 31
99    rt_sigreturn({mask=[]} <unfinished ...>
98    wait4(104,  <unfinished ...>
100   read(5,  <unfinished ...>
99    <... rt_sigreturn resumed>)       = 0
98    <... wait4 resumed>[{WIFEXITED(s) && WEXITSTATUS(s) == 127}], 0, {ru_utime={tv_sec=0, tv_usec=0}, ru_stime={tv_sec=0, tv_usec=16606}, ...}) = 104
100   <... read resumed>"", 1505)       = 0

显而易见，原来是通过命令行调用了openssl程序，而Docker中默认没有安装openssl，于是子进程报出了sh: openssl: command not found的错误，但是主进程并没有把这个错误向上抛出，而是报出一个Aes decrpyt error: exit status 1的错误，让人迷惑。

通过这个案例也可以说明一个问题：将错误信息向上报出而不是一些语焉不详的报错信息，在程序接口设计和团队合作中十分重要。