Introduction
Applications often fail only in certain scenarios, or crash during regression testing. Problems of this kind are difficult to reproduce and debug, and this is where a core dump comes in handy. A core dump is a snapshot of a process at the moment it crashed; normally the kernel takes this snapshot and writes it out as a core file. Several debuggers can analyse a core file, but here we will only look at gdb (the GNU debugger). Among other things, the core dump captures the stack of the crashed process. The stack is the memory used to store local variables and function call frames, which hold:
1) Function parameters
2) Frame pointer (if used)
3) Return address
4) Local variables
In addition to the above, the kernel also dumps the CPU registers, such as the program counter, the stack pointer and (on architectures that have one) the link register, which give detailed information about the dying process. A core is like the black box used to recover the last moments of a crashed plane: once the kernel has taken the stack and register snapshot, gdb can reconstruct detailed information about the crashed process.
Generate core in Linux
The core file size limit must be non-zero (ideally unlimited) for a core to be generated on Linux. To set the limit, execute the command below in the shell:
[yusufOnLinux]$ ulimit -c unlimited
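As a small sketch, the limit can be checked before and after raising it. The function names core_limit and enable_core_dumps are my own and not part of any tool, and the script assumes a POSIX-like shell with the ulimit builtin:

```shell
#!/bin/sh
# Report the soft core-file size limit of the current shell.
# "0" means core dumps are disabled; "unlimited" means no size cap.
core_limit() {
    ulimit -c
}

# Try to raise the soft limit. This fails when the hard limit is lower,
# so the failure is reported instead of aborting the script.
enable_core_dumps() {
    if ulimit -c unlimited 2>/dev/null; then
        echo "core limit is now: $(core_limit)"
    else
        echo "could not raise core limit (hard limit: $(ulimit -H -c))"
    fi
}

enable_core_dumps
```

Note that the limit applies only to the shell where it is set and to the processes started from it, so the crashing binary has to be launched from that same shell.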
Once this is done, the core is generated in the current working directory of the process. The name and path of the core can be changed through /proc/sys/kernel/core_pattern (writing to this file requires root privileges):
[yusufOnLinux]$ echo /home/yusuf/mycore > /proc/sys/kernel/core_pattern
With the above setting the core file is generated at /home/yusuf/mycore.<pid> (the PID suffix is appended when /proc/sys/kernel/core_uses_pid is non-zero). For further details refer to http://linux.die.net/man/5/core
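The pattern can also contain specifiers that the kernel expands when it writes the core; core(5) documents %e (executable name) and %p (PID), among others. The sketch below prints the current pattern and imitates the expansion for a hypothetical pattern. The helper names and the pattern /home/yusuf/core.%e.%p are my own assumptions, and only those two specifiers are handled:

```shell
#!/bin/sh
# Print the pattern the kernel currently uses for core file names.
show_core_pattern() {
    if [ -r /proc/sys/kernel/core_pattern ]; then
        cat /proc/sys/kernel/core_pattern
    else
        echo "core_pattern not readable on this system"
    fi
}

# Imitate the kernel's expansion of %e and %p for a given pattern,
# program name and PID. Other specifiers are left untouched.
expand_core_pattern() {
    pattern=$1
    exe=$2
    pid=$3
    printf '%s\n' "$pattern" | sed -e "s/%e/$exe/g" -e "s/%p/$pid/g"
}

show_core_pattern

# prints /home/yusuf/core.gdb_core.3494
expand_core_pattern '/home/yusuf/core.%e.%p' gdb_core 3494

# To install such a pattern (needs root):
#   echo '/home/yusuf/core.%e.%p' > /proc/sys/kernel/core_pattern
```

If the pattern printed by show_core_pattern starts with a "|" character, the kernel pipes the core to the named program instead of writing a file, which is why on some distributions no core appears in the current directory.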
Compilation option for gdb
A binary or library should be compiled with debugging information to be used with gdb. Debugging information is enabled with the -g compiler option:
[yusufOnLinux]$ gcc -g -o gdb_core gdb_core_app.c -lpthread
The above compilation generates an unstripped binary with debugging information. That the binary is not stripped can be verified with the file command:
[yusufOnLinux]$ file gdb_core
gdb_core: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, not stripped
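When many binaries have to be checked, the same test can be scripted. This is only a sketch: classify_file_output is a name I made up, and it does nothing more than look for the words "stripped" and "not stripped" in a line of file output such as the one above. Keep in mind that "not stripped" only says the symbol table is present; it does not by itself prove the binary was built with -g.

```shell
#!/bin/sh
# Classify one line of `file` output.
# Prints "not stripped", "stripped" or "unknown".
classify_file_output() {
    case "$1" in
        *"not stripped"*) echo "not stripped" ;;
        *stripped*)       echo "stripped" ;;
        *)                echo "unknown" ;;
    esac
}

# Check a binary on disk (default: gdb_core) when it exists
# and the file utility is installed.
binary=${1:-gdb_core}
if [ -f "$binary" ] && command -v file >/dev/null 2>&1; then
    classify_file_output "$(file "$binary")"
fi
```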
Strip and gdb
strip is a utility that removes unneeded sections and debugging information from binaries and object files. This drastically reduces the binary size, and it is mostly used for embedded systems where flash storage is limited. However, strip and gdb do not work well together, because strip removes information that gdb needs to process a core. To debug a binary with gdb it should therefore be unstripped and compiled with the -g option.
In most scenarios, though, a core obtained from the field was generated by a stripped binary, which is difficult to debug. In this case we can re-compile the binaries on the host with the debug option (from the same sources and with otherwise identical compiler options, so that the addresses match) and use gdb with the existing core and the re-compiled debug binaries.
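An alternative to re-compiling later is to keep the debugging information at build time, in a separate file next to the stripped binary. The sketch below follows the objcopy/strip sequence described in the gdb manual for separate debug files; the function name split_debug_info and the ".debug" suffix are my own choices, and it assumes binutils is installed:

```shell
#!/bin/sh
# Split a binary built with -g into a stripped binary plus <binary>.debug.
split_debug_info() {
    binary=$1
    if [ -z "$binary" ] || [ ! -f "$binary" ]; then
        echo "usage: split_debug_info <binary built with -g>" >&2
        return 1
    fi
    if ! command -v objcopy >/dev/null 2>&1 || ! command -v strip >/dev/null 2>&1; then
        echo "objcopy/strip (binutils) not found" >&2
        return 1
    fi
    # 1) copy only the debug sections into <binary>.debug
    objcopy --only-keep-debug "$binary" "$binary.debug" || return 1
    # 2) strip the binary that will be shipped to the target
    strip --strip-debug --strip-unneeded "$binary" || return 1
    # 3) record the name of the debug file inside the stripped binary
    objcopy --add-gnu-debuglink="$binary.debug" "$binary" || return 1
}
```

For example, after "split_debug_info gdb_core" the stripped gdb_core goes to the target and gdb_core.debug stays on the host. A core from the field can then be opened with "gdb -s gdb_core.debug -e gdb_core -c core.3494", where -s, -e and -c name the symbol file, the executable and the core.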
Debugging process crash
I have written a small program with multiple threads accessing the same pointer. The pointer is initialized periodically by one thread and accessed by the other. I have intentionally not used any synchronization, in order to crash the application and get a core dump.
You can download the source code from http://www.fileupyours.com/files/296434/gdb_core_app.c . Execute the binary to generate the core dump as shown below.
[yukhan@bhling20 blog]$ ./gdb_core
Segmentation fault (core dumped)
[yukhan@bhling20 blog]$ ls
core.3494 gdb_core gdb_core_app.c
Once the core is generated we can start debugging with gdb. Remember, if the core was generated from a stripped binary, re-compile the binary with debug options and pass the debug binary and the core as parameters to gdb on the command line, as below:
[yukhan@bhling20 blog]$ gdb gdb_core core.3494
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
Reading symbols from /lib64/libpthread.so.0...done.
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `./gdb_core'.
Program terminated with signal 11, Segmentation fault.
[New process 3496]
[New process 3495]
[New process 3494]
#0 0x0000000000400678 in entry_thread2 (arg=0x0) at gdb_core_app.c:40
(gdb)
Once executed, you should see the gdb prompt along with the core information printed by gdb. In a multi-threaded process there is one stack per thread; the command below dumps the stack trace of every thread:
(gdb) thread apply all bt
The output of the above command shows the stack frames of all the threads in the process, as shown below:
Thread 3 (process 3494):
#0 0x000000304de07655 in pthread_join () from /lib64/libpthread.so.0
#1 0x000000000040061e in main () at gdb_core_app.c:17
Thread 2 (process 3495):
#0 0x000000304d272844 in _int_malloc () from /lib64/libc.so.6
#1 0x000000304d27402a in malloc () from /lib64/libc.so.6
#2 0x000000000040064f in entry_thread1 (arg=0x0) at gdb_core_app.c:27
#3 0x000000304de06367 in start_thread () from /lib64/libpthread.so.0
#4 0x000000304d2d30ad in clone () from /lib64/libc.so.6
Thread 1 (process 3496):
#0 0x0000000000400678 in entry_thread2 (arg=0x0) at gdb_core_app.c:40
#1 0x000000304de06367 in start_thread () from /lib64/libpthread.so.0
#2 0x000000304d2d30ad in clone () from /lib64/libc.so.6
In our example we have three threads: the main thread and the two threads created by us. To switch between threads use the "thread <number>" command:
(gdb) thread 1
This switches to thread 1. After the switch, the stack of the current thread can be dumped with the bt command:
(gdb) bt
#0 0x0000000000400678 in entry_thread2 (arg=0x0) at gdb_core_app.c:40
#1 0x000000304de06367 in start_thread () from /lib64/libpthread.so.0
#2 0x000000304d2d30ad in clone () from /lib64/libc.so.6
Once the backtrace is dumped we can inspect each frame. In this case frame 0 looks interesting, so let's select it:
(gdb) frame 0
#0 0x0000000000400678 in entry_thread2 (arg=0x0) at gdb_core_app.c:40
The "frame <number>" command selects and prints a particular frame of the stack trace; the output above shows frame 0. Line 40 of gdb_core_app.c caused the segmentation fault. Let's look at the source through gdb:
(gdb) list +
The command "list +" prints source lines from the file of the selected frame; once executed we get the output below:
34 void* entry_thread2(void* arg)
35 {
36 int temp;
37
38 while(1)
39 {
40 temp = *glb_ptr;
41 printf("Value got %d\n",temp);
42 }
43
We can inspect any variable here. Let's print the value of glb_ptr at the time of the crash with the command below:
(gdb) print glb_ptr
$1 = (int *) 0x0
(gdb)
The print command at the gdb prompt shows the value of any variable visible in the current context. Here glb_ptr is NULL, which is why dereferencing it on line 40 caused a segmentation fault.
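The interactive steps above can also be replayed without a prompt, which is handy when a core has to be triaged from a script. This is a minimal sketch using gdb's -batch and -ex options; the function name core_report is hypothetical, and "info registers" is added only to show the register snapshot mentioned in the introduction:

```shell
#!/bin/sh
# Print all thread backtraces and the registers of the crashing thread.
# Arguments: <binary> <core file>
core_report() {
    binary=$1
    core=$2
    if ! command -v gdb >/dev/null 2>&1; then
        echo "gdb not found" >&2
        return 1
    fi
    if [ ! -f "$binary" ] || [ ! -f "$core" ]; then
        echo "usage: core_report <binary> <core>" >&2
        return 1
    fi
    # -batch exits after running the -ex commands in order
    gdb -batch \
        -ex "thread apply all bt" \
        -ex "info registers" \
        "$binary" "$core"
}
```

For the example in this section it would be called as "core_report gdb_core core.3494".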
Debugging a hang process
A process normally hangs because of a deadlock caused by a programming error, such as a missing unlock. Problems of this kind are difficult to debug in large systems, but if we can dump the stack traces of the hung process it is easy to find all the threads blocked on a lock. That is enough to hint at which lock caused the deadlock; the rest is code walk-through and analysis. A hung process can be made to generate a core with the kill command. I have written a small program similar to the one above; the only difference is that it uses a mutex and intentionally never unlocks it, which causes the deadlock. The source code can be found at http://www.fileupyours.com/files/296434/gdb_hang_app.c . We compile the code using gcc and execute it in the background as below:
[yusufOnLinux]$ gcc -g -o gdb_hang gdb_hang_app.c -lpthread
[yukhan@bhling20 blog]$ ./gdb_hang &
[1] 22488
The "&" at the end tells the shell to run the binary in the background. We can then look up the process with ps:
[yukhan@bhling20 blog]$ ps -ef |grep yukhan
yukhan 22488 887 0 18:10 pts/43 00:00:00 ./gdb_hang
yukhan 22521 887 0 18:10 pts/43 00:00:00 ps -ef
From the above output we can read the PID (process ID) of the gdb_hang process, here 22488. The process ID is a unique ID the kernel gives to each process and is used to identify it. To generate the core we send signal 11 (SIGSEGV) to the hung process; the kill command takes the signal number and the process ID as arguments.
[yukhan@bhling20 blog]$ kill -11 22488
[yukhan@bhling20 blog]$
[1]+ Segmentation fault (core dumped) ./gdb_hang
[yukhan@bhling20 blog]$
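The signal number used above can be checked from the shell, and the signal can be sent by name instead of by number. In the sketch below a plain "sleep" stands in for the hung process (an assumption for illustration only), so that the effect on the exit status is visible:

```shell
#!/bin/sh
# Start a stand-in for the hung process.
sleep 30 &
pid=$!

# Translate signal number 11 to its name; prints SEGV.
kill -l 11

# Send the signal by name, then collect the exit status of the process.
kill -s SEGV "$pid"
status=0
wait "$pid" || status=$?

# A process killed by signal N is reported by the shell as 128 + N,
# so this prints "exit status: 139".
echo "exit status: $status"
```

Other signals whose default action is to dump core, for example SIGQUIT or SIGABRT, can be used in the same way when a segmentation fault in the logs would be misleading.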
As soon as we send signal 11 to the gdb_hang process, it terminates with a segmentation fault and a core is generated. Once the core is generated it is easy to debug with gdb, as shown in the previous example.
[yukhan@bhling20 blog]$ gdb gdb_hang core.22488
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
Reading symbols from /lib64/libpthread.so.0...done.
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `./gdb_hang'.
Program terminated with signal 11, Segmentation fault.
[New process 22488]
[New process 22506]
[New process 22489]
#0 0x000000304de07655 in pthread_join () from /lib64/libpthread.so.0
Now we can dump the stack traces of all the threads using the "thread apply all bt" command:
(gdb) thread apply all bt
Thread 3 (process 22489):
#0 0x000000304de0ce74 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x000000304de08874 in _L_lock_106 () from /lib64/libpthread.so.0
#2 0x000000304de082e0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00000000004006f3 in entry_thread1 (arg=0x0) at gdb_hang_app.c:29
#4 0x000000304de06367 in start_thread () from /lib64/libpthread.so.0
#5 0x000000304d2d30ad in clone () from /lib64/libc.so.6
Thread 2 (process 22506):
#0 0x000000304de0ce74 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x000000304de08874 in _L_lock_106 () from /lib64/libpthread.so.0
#2 0x000000304de082e0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000400734 in entry_thread2 (arg=0x0) at gdb_hang_app.c:45
#4 0x000000304de06367 in start_thread () from /lib64/libpthread.so.0
#5 0x000000304d2d30ad in clone () from /lib64/libc.so.6
Thread 1 (process 22488):
#0 0x000000304de07655 in pthread_join () from /lib64/libpthread.so.0
#1 0x00000000004006cd in main () at gdb_hang_app.c:21
(gdb)
In this case thread 2 and thread 3 are both waiting on the same lock (see frame 2 of each). Let's switch to thread 3 and look at the source:
(gdb) thread 3
[Switching to thread 3 (process 22489)]#3 0x00000000004006f3 in entry_thread1 (arg=0x0) at gdb_hang_app.c:29
29 pthread_mutex_lock(&foo_mutex);
(gdb) bt
#0 0x000000304de0ce74 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x000000304de08874 in _L_lock_106 () from /lib64/libpthread.so.0
#2 0x000000304de082e0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00000000004006f3 in entry_thread1 (arg=0x0) at gdb_hang_app.c:29
#4 0x000000304de06367 in start_thread () from /lib64/libpthread.so.0
#5 0x000000304d2d30ad in clone () from /lib64/libc.so.6
(gdb) frame 3
#3 0x00000000004006f3 in entry_thread1 (arg=0x0) at gdb_hang_app.c:29
29 pthread_mutex_lock(&foo_mutex);
The dump shows the line of code that locks the mutex, which also hints at which mutex may have caused the deadlock. Digging further with the list command, we find that the unlock has been commented out (intentionally in this case); in a real scenario this can happen for many reasons.
(gdb) list +
24
25 void* entry_thread1(void* arg)
26 {
27 while(1)
28 {
29 pthread_mutex_lock(&foo_mutex);
30 glb_ptr = NULL;
31
32 glb_ptr = (int*) malloc(sizeof(int));
33
(gdb) list +
34 *glb_ptr = 1000;
35 //pthread_mutex_unlock(&foo_mutex);
36 }
37 }
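So line 35, which unlocks the mutex, has been commented out, and this is the reason for the deadlock: the next iteration of entry_thread1 tries to lock foo_mutex while the thread still holds it, so it blocks forever, and any other thread waiting for that mutex blocks with it.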
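When the cause is less obvious than a commented-out unlock, it can help to ask the core which thread holds the mutex. The sketch below assumes glibc, where pthread_mutex_t keeps the kernel thread ID of the owner in the internal field __data.__owner; this is an implementation detail rather than a documented interface, so treat the result as a hint. The function name mutex_owner is my own:

```shell
#!/bin/sh
# Print the owner (kernel thread ID) recorded in a glibc mutex.
# Arguments: <binary> <core file> <name of a global pthread_mutex_t>
mutex_owner() {
    binary=$1
    core=$2
    mutex=$3
    if ! command -v gdb >/dev/null 2>&1; then
        echo "gdb not found" >&2
        return 1
    fi
    if [ ! -f "$binary" ] || [ ! -f "$core" ] || [ -z "$mutex" ]; then
        echo "usage: mutex_owner <binary> <core> <mutex variable>" >&2
        return 1
    fi
    # __data.__owner is glibc-internal and may differ on other C libraries
    gdb -batch -ex "print $mutex.__data.__owner" "$binary" "$core"
}
```

For this example the call would be "mutex_owner gdb_hang core.22488 foo_mutex"; the number printed can be compared with the process numbers that gdb shows next to each thread in the backtrace.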
Conclusion
gdb can analyse a lot more than what is listed here; refer to man gdb for details. This article only gives an idea of how a core can be used to debug different problems. It is one of the most effective ways to debug unexpected scenarios like crashes and hangs, so go ahead and enjoy exploring gdb.