Wednesday, November 24, 2010

Debugging core using gdb

Introduction

Applications often fail only in certain scenarios or crash during regression testing. Problems of this kind are difficult to reproduce and debug, and this is where a core dump comes in handy. A core dump is a snapshot of the crashed process taken at the time of the crash; normally the kernel takes this snapshot and generates the core file. Many debuggers can analyse a core, but here we will only look at gdb (the GNU debugger). The core contains the memory of the crashed process, and the part that interests us most is the stack. The stack is the memory used to store local variables and function call frames, such as

1) Function parameters
2) Frame pointer (if used)
3) Return address
4) Local variables

In addition to the above information, the kernel also dumps the CPU registers, such as the program counter, the stack pointer and (on architectures that have one) the link register, which give detailed information about the dying process. A core is like the black box that is used to recover the last moments of a crashed plane: once the kernel has taken the stack and register snapshot, gdb can reconstruct what the process was doing when it crashed.

Generate core in Linux

The core file size limit must be non-zero for Linux to generate a core, and the simplest choice is to make it unlimited. To set the core file limit execute the command below in the shell

[yusufOnLinux]$ ulimit -c unlimited

Once this is done the core is generated in the current working directory of the process. We can also change the core name and path by writing a new pattern to /proc/sys/kernel/core_pattern (this requires root)

[yusufOnLinux]$ echo /home/yusuf/mycore > /proc/sys/kernel/core_pattern

With the above setting the core file is generated as /home/yusuf/mycore.<pid> (the pid suffix is appended when /proc/sys/kernel/core_uses_pid is non-zero). For further details refer to http://linux.die.net/man/5/core
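A process can also raise its own core limit instead of relying on the shell. The sketch below is only an illustration of that idea: the helper name raise_core_limit is made up for this example, and it raises the soft limit to whatever the hard limit allows rather than forcing "unlimited", because an unprivileged process cannot exceed its hard limit.

```c
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file limit to the hard limit.
 * Roughly what "ulimit -c unlimited" does when the hard limit is unlimited.
 * Returns 0 on success, -1 on failure with errno set. */
int raise_core_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    rl.rlim_cur = rl.rlim_max;   /* the soft limit may go up to the hard limit */
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Calling this early in main() means a crash in the field still leaves a core behind, provided the hard limit is non-zero.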

Compilation option for gdb

The binary or library should be compiled with debugging information to be used with gdb. Debugging information is enabled with the -g compiler option

[yusufOnLinux]$ gcc -g -o gdb_core gdb_core_app.c -lpthread

The above compilation generates an unstripped binary with debugging information. Whether a binary is stripped can be checked with the file command

[yusufOnLinux]$ file gdb_core
gdb_core: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, not stripped

Strip and gdb

strip is a utility that removes unneeded sections and debugging information from binaries and object files. This drastically reduces the binary size, so it is mostly used for embedded systems where flash storage is limited. But strip and gdb do not work well together, because strip removes the information gdb needs to process a core. To debug a binary with gdb it should therefore be unstripped and compiled with the -g option.
In most scenarios, however, the core obtained from the field is generated by a stripped binary, which is difficult to debug. In this case we can re-compile the binaries on the host with the debug option, from the same source and with the same build options, and use gdb with the existing core and the re-compiled debug binaries.

Debugging process crash

I have written a small program with multiple threads accessing the same pointer. The pointer is initialized periodically by one thread and accessed by the other. I have intentionally left out synchronization in order to crash the application and get a core dump. Execute the binary to generate the core dump as shown below.
You can download the source code from http://www.fileupyours.com/files/296434/gdb_core_app.c

[yukhan@bhling20 blog]$ ./gdb_core
Segmentation fault (core dumped)

[yukhan@bhling20 blog]$ ls
core.3494  gdb_core  gdb_core_app.c

Once the core is generated we can start debugging it with gdb. Remember, if the core was generated by a stripped binary, re-compile the binary with debug options first. Pass the debug binary and the core as parameters to gdb on the command line as below.

[yukhan@bhling20 blog]$ gdb gdb_core core.3494
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
Reading symbols from /lib64/libpthread.so.0...done.
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `./gdb_core'.
Program terminated with signal 11, Segmentation fault.
[New process 3496]
[New process 3495]
[New process 3494]
#0  0x0000000000400678 in entry_thread2 (arg=0x0) at gdb_core_app.c:40
(gdb)

Once executed, you should see the gdb prompt along with the core information printed by gdb. In a multi-threaded process there is a separate stack snapshot for each thread. Using the command below we can dump the stack trace of every thread

(gdb) thread apply all bt

The output of the above command shows the stack frames of all the threads in the process, as shown below

Thread 3 (process 3494):
#0  0x000000304de07655 in pthread_join () from /lib64/libpthread.so.0
#1  0x000000000040061e in main () at gdb_core_app.c:17

Thread 2 (process 3495):
#0  0x000000304d272844 in _int_malloc () from /lib64/libc.so.6
#1  0x000000304d27402a in malloc () from /lib64/libc.so.6
#2  0x000000000040064f in entry_thread1 (arg=0x0) at gdb_core_app.c:27
#3  0x000000304de06367 in start_thread () from /lib64/libpthread.so.0
#4  0x000000304d2d30ad in clone () from /lib64/libc.so.6

Thread 1 (process 3496):
#0  0x0000000000400678 in entry_thread2 (arg=0x0) at gdb_core_app.c:40
#1  0x000000304de06367 in start_thread () from /lib64/libpthread.so.0
#2  0x000000304d2d30ad in clone () from /lib64/libc.so.6

In our example we have three threads: the main thread and the two threads created by us. To switch between threads use the "thread <number>" command

(gdb) thread 1

This switches to thread 1. After the switch, the stack of the current thread can be dumped with the bt command

(gdb) bt
#0  0x0000000000400678 in entry_thread2 (arg=0x0) at gdb_core_app.c:40
#1  0x000000304de06367 in start_thread () from /lib64/libpthread.so.0
#2  0x000000304d2d30ad in clone () from /lib64/libc.so.6

Once the back trace is dumped we can inspect each frame. In this case frame 0 looks interesting to us, so let's select it

(gdb) frame 0
#0  0x0000000000400678 in entry_thread2 (arg=0x0) at gdb_core_app.c:40

The "frame <number>" command selects and prints a particular frame of the stack trace; the output above shows frame 0. Line 40 of gdb_core_app.c caused the segmentation fault, so let's look at the source through gdb

(gdb) list +

The command "list +" shows the source around the selected frame. Once executed we get the output below

34      void* entry_thread2(void* arg)
35      {
36          int temp;
37
38          while(1)
39          {
40              temp = *glb_ptr;
41              printf("Value got %d\n",temp);
42          }
43

We can inspect any variable here. Let's print the value of glb_ptr at the time of the crash with the command below

(gdb) print glb_ptr
$1 = (int *) 0x0
(gdb)

The print command shows the value of any variable in the current context. Here glb_ptr is NULL, which is why dereferencing it on line 40 caused a segmentation fault.

Debugging a hung process

Normally a process hangs because of a deadlock caused by a programming error, such as a missing unlock. Problems of this kind are difficult to debug in large systems, but if we can dump the stack trace of the hung process it is easy to find all the threads blocked on a lock. This is enough to hint at the lock that caused the deadlock; the rest is code walk-through and analysis. A hung process can be made to generate a core with the kill command. I have written a small program similar to the one above; the only difference is that it uses a mutex and intentionally does not unlock it, which causes the deadlock. The source code can be found at http://www.fileupyours.com/files/296434/gdb_hang_app.c . We compile the code using gcc and execute it in the background as below

[yusufOnLinux]$ gcc -g -o gdb_hang gdb_hang_app.c -lpthread

[yukhan@bhling20 blog]$ ./gdb_hang &
[1] 22488

The "&" at the end tells the shell to execute the binary in the background.

[yukhan@bhling20 blog]$ ps -ef |grep yukhan
yukhan   22488   887  0 18:10 pts/43   00:00:00 ./gdb_hang
yukhan   22521   887  0 18:10 pts/43   00:00:00 ps -ef

From the above output we can read the pid (process id) of the gdb_hang process. The process id is the unique id the kernel gives to each process, and it is used to identify a particular process. To generate the core we need to send signal 11 (SIGSEGV) to the hung process. The kill command takes the signal number and the process id as arguments.

[yukhan@bhling20 blog]$ kill -11 22488
[yukhan@bhling20 blog]$
[1]+  Segmentation fault      (core dumped) ./gdb_hang
[yukhan@bhling20 blog]$
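The same thing can be done from C with the kill() system call, for example from a watchdog that supervises its own children. The sketch below is only an illustration: the helper names are made up, and it waits with waitpid(), which works only when the hung process is a child of the caller (the shell did that reaping for us in the session above).

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Demo only: fork a child that sleeps forever, standing in for a hung process. */
pid_t spawn_idle_child(void)
{
    pid_t pid = fork();

    if (pid == 0) {
        for (;;)
            pause();             /* child: block until a signal arrives */
    }
    return pid;                  /* parent: child pid, or -1 on error */
}

/* Send `sig` to child `pid` and wait for it to terminate.
 * Returns the signal that killed the child, 0 if it exited normally,
 * or -1 on error. */
int signal_and_reap(pid_t pid, int sig)
{
    int status;

    if (pid <= 0) {              /* refuse process groups and "all processes" */
        errno = EINVAL;
        return -1;
    }
    if (kill(pid, sig) != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Whether a core file is actually written still depends on the core limit and core_pattern settings described earlier.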

As soon as we send signal 11 to the gdb_hang process, it terminates with a segmentation fault and a core is generated. Once the core is generated it is easy to debug with gdb, as shown in the previous example.

[yukhan@bhling20 blog]$ gdb gdb_hang core.22488
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
Reading symbols from /lib64/libpthread.so.0...done.
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `./gdb_hang'.
Program terminated with signal 11, Segmentation fault.
[New process 22488]
[New process 22506]
[New process 22489]
#0  0x000000304de07655 in pthread_join () from /lib64/libpthread.so.0

Now we can dump the stack trace of every thread using the "thread apply all bt" command

(gdb) thread apply all bt

Thread 3 (process 22489):
#0  0x000000304de0ce74 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000000304de08874 in _L_lock_106 () from /lib64/libpthread.so.0
#2  0x000000304de082e0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00000000004006f3 in entry_thread1 (arg=0x0) at gdb_hang_app.c:29
#4  0x000000304de06367 in start_thread () from /lib64/libpthread.so.0
#5  0x000000304d2d30ad in clone () from /lib64/libc.so.6

Thread 2 (process 22506):
#0  0x000000304de0ce74 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000000304de08874 in _L_lock_106 () from /lib64/libpthread.so.0
#2  0x000000304de082e0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000400734 in entry_thread2 (arg=0x0) at gdb_hang_app.c:45
#4  0x000000304de06367 in start_thread () from /lib64/libpthread.so.0
#5  0x000000304d2d30ad in clone () from /lib64/libc.so.6

Thread 1 (process 22488):
#0  0x000000304de07655 in pthread_join () from /lib64/libpthread.so.0
#1  0x00000000004006cd in main () at gdb_hang_app.c:21
(gdb)

In this case thread 2 and thread 3 both wait on the same lock (see frame 2). Let's switch to thread 3 and look at the source

(gdb) thread 3
[Switching to thread 3 (process 22489)]#3  0x00000000004006f3 in entry_thread1 (arg=0x0) at gdb_hang_app.c:29
29              pthread_mutex_lock(&foo_mutex);

(gdb) bt
#0  0x000000304de0ce74 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000000304de08874 in _L_lock_106 () from /lib64/libpthread.so.0
#2  0x000000304de082e0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00000000004006f3 in entry_thread1 (arg=0x0) at gdb_hang_app.c:29
#4  0x000000304de06367 in start_thread () from /lib64/libpthread.so.0
#5  0x000000304d2d30ad in clone () from /lib64/libc.so.6

(gdb) frame 3
#3  0x00000000004006f3 in entry_thread1 (arg=0x0) at gdb_hang_app.c:29
29              pthread_mutex_lock(&foo_mutex);

The dump shows the line of code that locks the mutex, and it also hints at which mutex may have caused the deadlock. Digging further, we find that the unlock is commented out (intentionally in this case), but in a real scenario this can happen for many reasons.

(gdb) list +
24
25      void* entry_thread1(void* arg)
26      {
27          while(1)
28          {
29              pthread_mutex_lock(&foo_mutex);
30              glb_ptr = NULL;
31
32              glb_ptr = (int*) malloc(sizeof(int));
33
(gdb) list +
34              *glb_ptr = 1000;
35              //pthread_mutex_unlock(&foo_mutex);
36          }
37      }

Great, so line 35, which actually unlocks the mutex, has been commented out, and this is the reason for the deadlock.

Conclusion

With gdb you can analyse a lot more than what is listed here; refer to man gdb for details. This article only gives you an idea of how a core can be used to debug different problems. It is one of the most effective ways to debug unexpected scenarios like crashes and hangs, so go ahead and enjoy exploring gdb.



Sunday, November 14, 2010

Raw socket, Packet socket and Zero copy networking in Linux

Introduction

If you are a Linux enthusiast and curious to know how an Ethernet frame is processed, or how to sniff packets even when they are not destined for your computer, then you are at the right place. All you need is the basics of C and networking.
          Linux provides packet sockets to sniff link layer packets from an application. They are often loosely called raw sockets, but I would like to make a distinction here: packet sockets are used to send and receive packets at the data link layer (layer 2), whereas raw sockets are used to send raw packets at layer 3 and can only receive specific protocols, like ICMP, at the application layer. Please refer to the following blog for more detail on raw sockets: rawSocket.
          Let's have a brief introduction to Ethernet LANs before we move to the Linux specific support for sniffing packets. An Ethernet segment is shared by all the hosts connected to the same hub, and a packet sent by one host is delivered to all the hosts on the segment. Each host on the LAN is identified by a unique MAC address, which is 6 bytes long.



As multiple hosts share the same Ethernet segment, collisions can occur. To make sure a collision can be detected and the packet re-transmitted, Ethernet defines a minimum frame size of 64 bytes. There are three types of Ethernet frames
i) Unicast frame
This is received by all the hosts on the hub but processed only by the host whose MAC address matches the destination MAC address in the frame; the other hosts just drop it.
ii) Multicast frame
This is received by all the hosts on the hub but processed only by the hosts whose Ethernet controller is configured for the destination MAC address in the frame; the other hosts just drop it.
iii) Broadcast frame
This is received by all the hosts on the hub and processed by all of them.
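The three frame types can be told apart from the destination MAC address alone. The sketch below relies on the standard Ethernet addressing rules, which are not spelled out above: the broadcast address is ff:ff:ff:ff:ff:ff, and any other address with the lowest bit of the first byte set is a multicast (group) address. The type and function names are made up for this example.

```c
#include <stdint.h>

enum frame_kind { FRAME_UNICAST, FRAME_MULTICAST, FRAME_BROADCAST };

/* Classify a frame by its 6-byte destination MAC address. */
enum frame_kind classify_dest_mac(const uint8_t mac[6])
{
    int i, all_ones = 1;

    for (i = 0; i < 6; i++)
        if (mac[i] != 0xff)
            all_ones = 0;

    if (all_ones)
        return FRAME_BROADCAST;      /* ff:ff:ff:ff:ff:ff */
    if (mac[0] & 0x01)
        return FRAME_MULTICAST;      /* group bit set in the first byte */
    return FRAME_UNICAST;
}
```

For instance, 01:00:5e:00:00:01 has the group bit set and is classified as multicast.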


Linux networking subsystem

Linux has a well defined device model. Everything in Linux is a file; a device is a special kind of file known as a device file, and each device has a major and a minor number associated with it. Network devices are the exception: Linux network interfaces are not based on device files. Network devices are identified by names like eth0...ethX, and inside the kernel each device is identified by an interface index. Almost every device has a driver associated with it, which performs the basic I/O on the device. A network device likewise has a network driver, which receives and transmits packets over the network using the Ethernet controller (MAC and PHY). For high Ethernet rates like a 1000 Mbps link, DMA (Direct Memory Access) is used to transfer data between the Ethernet controller and the kernel buffer. Once a packet is in the kernel buffer, the network driver is notified either by an interrupt or by a polling thread. The driver then fills the data into an sk_buff, the structure in the Linux networking stack that represents a packet together with a lot of meta information, which helps the kernel manage network packets efficiently. From here on we will refer to it as skb.





The networking stack inside the kernel consists of layers, and each layer does well defined protocol processing. The driver receives the packet and passes it to layer 3. Depending on the layer 3 protocol type the corresponding handler is called, and once its processing is done layer 3 passes the packet to the layer 4 protocol handler, which could be UDP or TCP. This is a very high level view of protocol processing in the kernel. There is a lot more involved, like Netfilter hooks and layer 2 ebtables hooks, but we are not going to talk about filters and detailed processing because they deserve a separate article altogether. So let's look at the important information filled in by the driver for further processing. Typically the driver fills the information below into the skb

i)   Network data including the Ethernet header
ii)  Alignment of the IP header to a 16 byte boundary
iii) The protocol, device info and pkt_type. Here the protocol field contains the protocol type taken from the layer 2 header (the EtherType), and pkt_type is set to a macro that says whether the packet is a layer 2 broadcast, multicast or unicast, or is destined for another host (promisc mode)

Once the above information is filled in, the driver calls netif_rx to enqueue the packet for softirq processing, and the packet finally lands in netif_receive_skb, which runs in softirq context. If a packet socket has registered with protocol ETH_P_ALL then all packets are delivered to that socket. This is only true for ETH_P_ALL; when an application registers with a specific protocol like ETH_P_IP, Linux delivers only packets of that protocol to the socket. The code can be found at http://lxr.linux.no/linux+v2.6.36/net/core/dev.c


Packet Socket

Traditionally the protocol specific processing is done inside the kernel and the application only sends and receives data. In this case the application has no access to any of the protocol headers, because they are added and removed by the kernel. An IP packet on the wire contains an Ethernet header, an IP header and data; the final packet is constructed by prepending the header of each layer. The user only needs to put the data into the socket; construction of the headers is done inside the kernel.

A packet socket, on the other hand, is a very powerful feature of Linux. It allows protocols to be implemented completely in the application, including the link layer processing. An application can open a packet socket and read packets from the kernel. No special APIs are required; the normal socket APIs work well with packet sockets.





A packet socket completely bypasses the kernel protocol stack; it receives packets directly from, and sends them directly to, the link layer. The figure above gives a brief overview of the Linux packet socket sub-system. A packet socket is created with the call socket(PF_PACKET, SOCK_RAW, protocol). To open a socket you need to specify the domain, the socket type and the protocol. In this case the domain (family) is PF_PACKET, and the socket type can be SOCK_RAW or SOCK_DGRAM, depending on the application requirement. If SOCK_DGRAM is used then the application receives the packet without the Ethernet header; if SOCK_RAW is used then the application receives the complete frame including the link layer header. Socket types are defined in [link missing]. The protocol is used to receive only specific types of packet: if the protocol is ETH_P_IP then only IP packets are received, and to receive packets of all protocols the application can register with ETH_P_ALL. The protocol argument must be given in network byte order, for example htons(ETH_P_ALL). Protocol ids are defined in http://lxr.linux.no/linux+v2.6.36/include/linux/if_ether.h. If you have multiple interfaces you can also bind the socket to a particular interface to receive and send packets. To receive packets that are not destined for the local host, we need to put the Ethernet interface in promisc mode. Since packet sockets may have serious security implications, only the root user (or a process with the CAP_NET_RAW capability) can create this kind of socket.

We have so far covered the basics of packet sockets, and I think it is also worth mentioning BPF (Berkeley Packet Filter). These are filters used by the kernel to filter packets on user defined criteria. Depending on the protocol the application registered with, it might receive all packets; filters can be used to restrict packet reception and achieve higher performance. There is an easy way to generate the BPF code, thanks to tcpdump, which prints the BPF code in C format for a given filter expression. For example, tcpdump -dd ether proto 0x8100 -i eth0 will display the filter as C code as shown below

{ 0x28, 0, 0, 0x0000000c },
{ 0x15, 0, 1, 0x00008100 },
{ 0x6, 0, 0, 0x00000060 },
{ 0x6, 0, 0, 0x00000000 },

This can be used in our code for filtering.

Socket options

Once the packet socket is created we need to set some socket options to get the desired behaviour, such as putting the Ethernet interface in promisc mode and binding the socket to a specific interface. An Ethernet interface in promisc mode receives all packets, even those not destined for the local host (destination MAC different from the local MAC). Normally an Ethernet interface accepts all broadcast packets and the unicast packets intended for it, and accepts multicast packets only if enabled. If you want to see all the packets on the hub you need to put the interface in promisc mode, and to do so the user needs root permissions. We can enable promisc mode through the socket using setsockopt.
To bind the socket or to put the interface in promisc mode we first need to find the interface index used in the kernel

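struct ifreq ifr;
memset (&ifr, 0, sizeof (ifr));
/* interface is a std::string holding the name, e.g. "eth0" */
strncpy (ifr.ifr_name, interface.c_str (), IFNAMSIZ - 1);
if (ioctl (sockId, SIOCGIFINDEX, &ifr) < 0)
    perror ("SIOCGIFINDEX");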

We use the ifreq structure to pass the interface name and then ioctl to get the interface index; ifreq is the structure used to configure network devices in Linux. Once the interface index is found we can go ahead and attach the socket to the specific interface

struct sockaddr_ll sll;
memset (&sll, 0, sizeof (sll));
sll.sll_family = AF_PACKET;
sll.sll_ifindex = ifr.ifr_ifindex;
sll.sll_protocol = htons (protocol);
if (bind (sockId, (struct sockaddr *) &sll, sizeof (sll)) < 0)
    perror ("bind");

We fill in information such as the socket family, the protocol and the interface index in sockaddr_ll and use the bind call to bind the socket. Then finally we put the interface in promisc mode: we fill in the packet_mreq structure with the interface index and the type PACKET_MR_PROMISC and pass it to the setsockopt call.

struct packet_mreq mr;
memset (&mr, 0, sizeof (mr));
mr.mr_ifindex = ifr.ifr_ifindex;
mr.mr_type = PACKET_MR_PROMISC;
if (setsockopt (sockId, SOL_PACKET, PACKET_ADD_MEMBERSHIP, &mr, sizeof (mr)) < 0)
    perror ("setsockopt");

Once the above settings are done, we are ready to use the packet socket for sending and receiving raw data.

Frame construction


The Ethernet header consists of a 6 byte destination address, a 6 byte source address and a 2 byte Ethernet protocol type. The other parts of the frame, like the start delimiter and the FCS, are added by the Ethernet hardware. The MAC address is the unique hardware address that identifies a host; here the destination MAC address is the receiver's address and the source address is the sender's. The protocol type identifies the layer 3 protocol, like IP or ARP, carried in the Ethernet frame: IP has the value 0x0800 and ARP has 0x0806. They are defined in http://lxr.linux.no/linux+v2.6.36/include/linux/if_ether.h. Note that 4 extra bytes are added after the source MAC address if the packet is VLAN tagged: the first two bytes are 0x8100, denoting a VLAN packet, and the other two bytes carry the VLAN id and the PCP bits. The two bytes immediately after the VLAN tag denote the layer 3 protocol. Let's see how to construct the layer 2 header in C code.

char buf[1522];
struct ethhdr *eth;
eth = (struct ethhdr *) buf;
memcpy (eth->h_dest, dest_mac, ETH_ALEN);
memcpy (eth->h_source, src_mac, ETH_ALEN);
eth->h_proto = htons (ETH_P_IP);   /* h_proto is in network byte order */

The logic above is straightforward. We take a buffer of 1522 bytes because an Ethernet frame, including the VLAN tag, cannot exceed that size. Then we cast the start of the buffer to the Ethernet header, i.e. struct ethhdr (declared in include/linux/if_ether.h), and fill in the destination and source MAC addresses. The source MAC address can also be retrieved from the kernel interface; please find the complete code here. The destination MAC address is normally learned through the ARP protocol, but just to try something out with packet sockets we can send a layer 2 broadcast packet. Finally we fill in the layer 3 protocol type as IP, converted to network byte order with htons.

After the 14 bytes of Ethernet header the layer 3 protocol header starts. The layer 3 header is cast according to the protocol type in the layer 2 header: if it is the IP protocol then the Ethernet payload is cast to struct iphdr.

struct iphdr *ip =  (struct iphdr*) (eth +1);
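The cast above assumes an untagged frame. For a VLAN tagged frame the layer 3 header starts 4 bytes later, as described in the header layout earlier. The sketch below parses the layer 2 header byte by byte so that it works on any host byte order; the struct and function names are made up for this example, and the split of the tag into 3 PCP bits at the top and 12 VLAN id bits at the bottom is the standard 802.1Q layout.

```c
#include <stddef.h>
#include <stdint.h>

struct l2_info {
    uint16_t ethertype;   /* layer 3 protocol, host byte order */
    int      has_vlan;    /* 1 if the frame carries a VLAN tag */
    uint16_t vlan_id;     /* lower 12 bits of the tag */
    uint8_t  pcp;         /* upper 3 bits of the tag */
    size_t   l3_offset;   /* 14 for untagged frames, 18 for tagged ones */
};

/* Parse the Ethernet header at the start of frame.
 * Returns 0 on success, -1 if the frame is too short. */
int parse_l2_header(const uint8_t *frame, size_t len, struct l2_info *out)
{
    uint16_t type, tci;

    if (len < 14)
        return -1;

    type = (uint16_t)((frame[12] << 8) | frame[13]);
    out->has_vlan = 0;
    out->vlan_id = 0;
    out->pcp = 0;
    out->l3_offset = 14;

    if (type == 0x8100) {                        /* VLAN tagged frame */
        if (len < 18)
            return -1;
        tci = (uint16_t)((frame[14] << 8) | frame[15]);
        out->has_vlan = 1;
        out->pcp = (uint8_t)(tci >> 13);
        out->vlan_id = tci & 0x0fff;
        type = (uint16_t)((frame[16] << 8) | frame[17]);
        out->l3_offset = 18;
    }

    out->ethertype = type;
    return 0;
}
```

With this, the IP header of a received frame is at buf + info.l3_offset whenever info.ethertype is 0x0800.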


IP has many fields, like the protocol version (IPv4), the header length, the protocol type within IP (it could be UDP, TCP or ICMP), tos (type of service, for QoS), the total length (including the IP header and the IP data) and a 2 byte checksum. The most important fields of iphdr are the source and destination IP addresses; the IP address is used at layer 3 for routing and also identifies the network. Please refer to the complete code here to understand further. The 2 byte header checksum is defined by the standard and needs to be correct, otherwise the receiving IP stack will not accept the packet.
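The header checksum is the 16 bit one's complement of the one's complement sum of all 16 bit words in the header, computed with the checksum field itself set to zero. A minimal sketch follows; the function name is made up for this example, and it works on raw bytes so the result does not depend on the host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Compute the IP header checksum over len bytes starting at hdr.
 * The checksum field must be zero when building a header.
 * The result is in host byte order; store it with htons(). */
uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    /* Add up the header as big-endian 16-bit words */
    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    if (len & 1)                          /* odd trailing byte, padded with zero */
        sum += (uint32_t)(hdr[len - 1] << 8);

    /* Fold the carries back into the lower 16 bits */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

When building a packet this would be used as ip->check = htons(ip_checksum((uint8_t *) ip, ip->ihl * 4)) after setting ip->check to 0. Running the same function over a received header that includes its checksum gives 0 if the header is intact.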

How to send packet

Once the socket is open and the packet is constructed, sending the packet is just a matter of using the write system call.

write(fd,buf,len);

The write system call takes 3 parameters: the socket descriptor, the buffer pointer and the length of the buffer.

How to receive packet

To receive we can use the read system call. Once the packet is received, the application protocol should process it further.

read(fd,buf,len);

The read system call takes 3 parameters: the socket descriptor, the buffer pointer and the length of the buffer.

Zero copy networking

So far we have covered the normal packet socket, but packet sockets also provide a very powerful feature: Zero copy. The normal flow of packets involves copying each packet from kernel space to user space and vice versa, and the copying and the switching of modes (kernel <-> user) can be very expensive in real-time designs. Unfortunately not much has been written about this feature of packet sockets, and through this article I will try to bridge the gap. The feature allows the kernel to share a buffer with the application: kernel and application both operate on the same buffer without the overhead of copying data, and synchronization is achieved through status flags. To enable this feature, the kernel should be compiled with the configuration below

CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y


In Linux 2.4/2.6, if PACKET_MMAP is not enabled the capture process is very inefficient. It uses very limited buffers and requires one system call to capture each packet, and two if you want the packet's timestamp (as libpcap always does). PACKET_MMAP, on the other hand, is very efficient. It provides a size configurable circular buffer mapped in user space that can be used to either send or receive packets. This way, reading packets just means waiting for them; most of the time there is no need to issue a single system call. For transmission, multiple packets can be sent with one system call to get the highest bandwidth. To use mmap I would suggest libpcap for portability reasons, but it can also be done by hand to understand and learn the low level implementation.

Below are the system calls involved with Zero copy configuration

socket() - Opening of packet socket
setsockopt() - Setting up the size of receive and transmit circular buffer
mmap() - map the kernel buffer in user space
poll() - wait for new incoming packets
send() - send the batch of packets
close() - close the socket

Socket creation and binding of the socket to an interface are the same as shown above. The important step is the allocation of the RX ring buffer and the TX ring buffer. To do this we use the setsockopt call as below

setsockopt(fd , SOL_PACKET , PACKET_RX_RING , (void*)&req , sizeof(req));
setsockopt(fd , SOL_PACKET , PACKET_TX_RING , (void*)&req , sizeof(req));

The most important argument in the above call is the request structure, which defines the ring buffer parameters

struct tpacket_req
{
unsigned int tp_block_size; /* Minimal size of contiguous block */
unsigned int tp_block_nr; /* Number of blocks */
unsigned int tp_frame_size; /* Size of frame */
unsigned int tp_frame_nr; /* Total number of frames */
};

This structure is defined in include/linux/if_packet.h. The above setsockopt call sets up the circular buffer, which is unswappable memory in the kernel. As these buffers are mapped to user space, the application can directly read packets and meta information like the timestamp without any system call. Frames are grouped into blocks; a block is a contiguous region of memory that contains frames. We need to specify the number of blocks, the size of a block, the number of frames spread across the blocks and the size of a frame.

So if we need, let's say, a buffer of 16 frames of 256 bytes each, we can use the configuration below

struct tpacket_req req;
req.tp_block_size = 2048;
req.tp_block_nr   = 2;
req.tp_frame_size = 256;
req.tp_frame_nr   = 16;

The idea above is to split the 16 frames into 2 blocks of 8 frames, so each block contains 8 frames and there are 2 blocks of 2K bytes each. The number of blocks is limited by the array of block pointers in the kernel, and the total size of the blocks depends on the memory available in the kernel. Note that the kernel also requires tp_block_size to be a multiple of the page size, so on a system with 4K pages the same 16 frames would need, for example, a single block of 4096 bytes. For better understanding please refer to http://lxr.linux.no/linux+v2.6.36/Documentation/networking/packet_mmap.txt
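The arithmetic behind these four numbers can be checked before handing the request to the kernel. The sketch below is only an illustration with made-up helper names. It checks that the frame count matches the block layout and computes where a given frame starts inside the ring, assuming frames are laid out back to back within a block and never cross a block boundary. The kernel applies further checks (page size, alignment, minimum frame size) that are not repeated here.

```c
#include <linux/if_packet.h>

/* Returns the number of frames per block, or 0 if the request is
 * inconsistent (frame count does not match blocks * frames per block). */
unsigned int ring_frames_per_block(const struct tpacket_req *req)
{
    unsigned int per_block;

    if (req->tp_frame_size == 0 || req->tp_block_size < req->tp_frame_size)
        return 0;

    per_block = req->tp_block_size / req->tp_frame_size;
    if (per_block * req->tp_block_nr != req->tp_frame_nr)
        return 0;

    return per_block;
}

/* Byte offset of frame `index` from the start of the mapped ring.
 * The request must have passed ring_frames_per_block() first. */
unsigned long ring_frame_offset(const struct tpacket_req *req,
                                unsigned int index)
{
    unsigned int per_block = req->tp_block_size / req->tp_frame_size;

    return (unsigned long)(index / per_block) * req->tp_block_size
         + (unsigned long)(index % per_block) * req->tp_frame_size;
}
```

For the configuration above this gives 8 frames per block, and frame 9 (the second frame of the second block) starts at offset 2048 + 256 = 2304.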

Once the data structure is initialized and the ring buffers are allocated in the kernel, the application can call mmap to map the memory into user space. Once the mmap is done, the user is ready to receive and send packets using the Zero copy socket.

Note: To use the Zero copy, socket should be bound to an interface.
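The allocation and mapping steps can be sketched as one helper. The function name map_rx_ring is made up for this example, and it handles only the RX ring; fd is assumed to be a packet socket that is already open and bound.

```c
#include <linux/if_packet.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/socket.h>

/* Ask the kernel for an RX ring described by req and map it into user space.
 * Returns the start of the ring and stores its size in *ring_len,
 * or returns NULL on failure (errno is set by the failing call). */
void *map_rx_ring(int fd, const struct tpacket_req *req, size_t *ring_len)
{
    void *ring;
    size_t len = (size_t)req->tp_block_size * req->tp_block_nr;

    /* Allocate the circular buffer inside the kernel */
    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req)) < 0)
        return NULL;

    /* Map all blocks as one contiguous region, shared with the kernel */
    ring = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED)
        return NULL;

    *ring_len = len;
    return ring;
}
```

Each frame in the mapped ring starts with a struct tpacket_hdr. The application waits with poll() until tp_status has TP_STATUS_USER set, processes the frame, and then writes TP_STATUS_KERNEL back to tp_status to return the frame to the kernel.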

Conclusion


Packet sockets can be very handy in designing real-time applications like video streaming, audio streaming, Ethernet Service OAM and the SCTP protocol. Applications like video and audio streaming require real-time packet reception in order to provide a noise-free service; these kinds of real-time requirements can be met by combining BPF and Zero copy networking. New protocols like Ethernet OAM and SCTP can be developed at the application layer without any special support from the kernel.

Friday, November 5, 2010

Get thread Id in Linux

The pthread library provides the call pthread_self to get the id of the calling thread, but this id is not the same as the thread id used by Linux. Linux views all threads as LWPs (light weight processes) and identifies each one by a thread id. To get the Linux thread id we need to use the syscall function. glibc versions before 2.30 do not provide a gettid function, so we define our own.

#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

pid_t gettid (void)
{
        return syscall(__NR_gettid);
}