UNIX Programming
The UNIX Systems Programming Interface

Outline
  Motivation
  Levels of Abstraction in the UNIX Environment
  UNIX Overview
  UNIX System Calls
  Library Routines vs. System Calls
  Handling Errors in System Calls
  Process Management
  Process Creation
  Program Invocation
  Process Synchronization
  Process Termination
  Fork, Exec, Wait, Exit Example
  Jobs and Process Groups
  Process Memory Organization
  Exception Handling: setjmp and longjmp
  Basic Input and Output Routines
  Descriptors
  Opening and Creating I/O Devices
  Closing I/O Devices
  Reading and Writing I/O Devices
  Read and Write Example
  Writev Example
  Controlling I/O Devices
  Non-blocking I/O
  Random Access Methods
  lseek Example
  I/O Multiplexing
  Duplicating Descriptors
  Converting Descriptors to FILE Pointers

Motivation
. UNIX is a large and complex platform for developing and executing applications
. There are many ways to learn how to use its services and tools
. Before diving into the details, this set of slides examines the overall OS picture
. eg comparing and contrasting different models and emerging trends

Levels of Abstraction in the UNIX Environment

UNIX Overview
. A typical UNIX `host` is connected to a `network`
. The network connects a group of hosts
. Each host typically has
@ A name (up to 255 chars), set and read via:
    int sethostname (char *s);
    int gethostname (char *buf, int len);
    int uname (struct utsname *name);
@ A 32 bit Internet Protocol (IP) address, obtained via
    struct hostent *gethostbyname (char *name);
. Each host contains a kernel that provides a set of services for application and system programmers

UNIX Overview (cont'd)
. General services in the UNIX kernel:
. `Virtual memory management`
. Transparent/efficient access to `memory hierarchies`
. eg remote file systems, mag tape, CD-ROM, mag disk, solid-state disk, RAM, cache, CPU registers
. `File management`
. Persistent storage devices
. `Process management`
. Abstraction of the CPU(s)
. `I/O device/communication management`
.
Network controllers, terminals, printers, etc.

UNIX overview (cont'd)
. Each host runs a number of independent `processes` that execute in different `protection domains`, eg
. `Kernel Processes`
. Supply a `software veneer' to diverse hardware resources
. Originally single-threaded, now multi-threaded for symmetric multiprocessing (SMP)
. `User Processes`
. Run in `user-space` and execute application programs
. `Daemon Processes`
. Typically started at system boot-time, run in `user-space` in the `background`, and carry out various long-running system services
. eg nfsd, inetd, rwhod, etc.
. May require special privileges

UNIX overview (cont'd)
. eg
    USER       PID %CPU %MEM   SZ  RSS TT STAT  START   TIME COMMAND
    klefstad  5271 23.1  2.9  184  428 p0 R     11:25   0:00 ps -aux
    root         1  0.0  0.0   52    0 ?  iw   Oct 12   0:08 /sbin/init -
    root      3874  0.0  0.0   80    0 ?  iw   Oct 15   2:08 /usr/etc/cron
    root         2  0.0  0.0    0    0 ?  D    Oct 12   0:02 pagedaemon
    root        43  0.0  0.0   68    0 ?  iw   Oct 12   0:45 portmap
    root         0  0.0  0.0    0    0 ?  D    Oct 12   0:53 swapper
    root       145  0.0  0.0   40    0 b  iw   Oct 12   0:00 - std.9600 ttyb (getty)
    root        99  0.0  0.2   16   28 ?  S    Oct 12   1:36 screenblank
    root        64  0.0  0.0   16    0 ?  S <  Oct 12   0:16 (nfsd)
    root        63  0.0  0.0   16    0 ?  S <  Oct 12   0:17 (nfsd)
    bin         46  0.0  0.0   36    0 ?  iw   Oct 12   0:36 ypbind
    root        66  0.0  0.0   16    0 ?  S <  Oct 12   0:18 (nfsd)
    root        65  0.0  0.0   16    0 ?  S <  Oct 12   0:17 (nfsd)
    root       114  0.0  0.0   56    0 ?  iw   Oct 12   2:41 inetd
    root        77  0.0  0.0   56    0 ?  iw   Oct 12   0:31 syslogd
    root       117  0.0  0.0   44    0 ?  iw   Oct 12   0:10 /usr/lib/lpd
    root        84  0.0  0.0   52    0 ?  iw   Oct 12   0:00 rpc.statd
    root     22272  0.0  1.1  116  164 ?  S    Nov 12  17:16 amd -C cse -p -l syslog -x warn
    root        92  0.0  0.0   80    0 ?  iw   Oct 12   0:00 rpc.lockd
    root       105  0.0  0.1   12    8 ?  S    Oct 12 264:53 update
    root     29551  0.0  4.9  712  720 ?  S     Jan 8   0:04 /usr/bin/X11/xterm -d net1.ics.
    root       112  0.0  0.0   48    0 ?  iw   Oct 12  19:22 in.rwhod -r
    root      4551  0.0  0.0   40    0 co iw    14:40   0:00 /usr/etc/getty std.9600 console
    klefstad 29552  0.0  3.5  352  516 p0 S     Jan 8   0:04 -tcsh (tcsh)
    nobody    4520  0.0  0.0   36    0 ?  iw   Oct 16   0:01 selection_svc

UNIX overview (cont'd)
. Each user-level process has
. An ephemeral, host-unique process id (between 1 and ~32,000), obtained via
    int pid = getpid ();
. A virtual address space (may be shared)
. eg stack, heap, data, text
. Timers, signal handlers and masks, profile buffers, etc.
. Descriptors associated with objects
. eg files, devices, terminals, network controllers, access rights
. One or more threads of control and associated run-time state and scheduling information

UNIX overview (cont'd)
. A UNIX process may execute in two different `protection domains' throughout its lifetime
@ `User-Mode`
. Normal mode for executing application programs
. Lacks certain privileges and instruction execution capabilities
. Protects the OS kernel from errors in user-programs
@ `Kernel-Mode`
. Provides secure interface to internal kernel data structures that are not directly accessible to application programs

UNIX overview (cont'd)
. Application programs automatically perform a `mode-switch' between user- and kernel-mode when system calls are made
. Called `trapping into the kernel'
. Note, system calls are `entry points' into the kernel
. Note, the OS also runs in `kernel-mode' when:
@ Handling asynchronous external events
. eg packet arrivals on network controllers
@ Error conditions occur in user applications
. eg `traps' (such as a divide by 0)

UNIX System Calls
. System calls are special subroutines
. They request the operating system kernel (which runs in kernel-mode) to take some action on behalf of application programs (running in `user-space') eg
@ Create a new process
@ Invoke a new program
@ Read/Write file blocks from/to disk
@ Establish a network connection with a remote host

UNIX System Calls (cont'd)
.
UNIX system calls have a C library interface, though they are often implemented using special assembly language instructions
. eg @chmk@ on the VAX
. The steps the OS goes through to invoke a system call include:
. User process invokes a system call using a C language interface
. which is really just a subroutine veneer located in the @/lib/libc.a@ or @/lib/libc.so@ library
. eg
    int gettimeofday (struct timeval *tp, struct timezone *tzp);
. This system call interface (which is often written in assembly code) sets up the arguments appropriately (eg selecting which system call number to use) and generates a `trap instruction' that causes the process to switch from user-mode into kernel-mode

UNIX System Calls (cont'd)
. System call steps (cont'd)
. Once in kernel-mode, control passes to an entry point called @syscall()@, which
@ Takes the system call number passed in as part of the trap instruction
@ Indexes into a vector of pointers to system functions
@ Invokes the appropriate function
. This function actually carries out the system call
. Following this, the entire process unwinds and control is returned to the user process running in user-mode

Library Routines vs. System Calls
. In UNIX, library routine APIs look just like system calls
. However, they often do not use the OS kernel to perform their work
. eg
. Calculating the @sin@ or @sqrt@ function
. Sorting an array of integers
. However, some library routines are `front-ends' for system calls
. eg @printf@, which calls @write@

Library Routines vs. System Calls (cont'd)
. System calls in UNIX require a mode switch (from user- to kernel-mode) and are generally more expensive to use than library calls
. Manual sections reflect this distinction:
@ Section 2 covers system calls
@ Section 3 covers library routines
@ Section 1 covers application programs

Library Routines vs. System Calls (cont'd)
. `Programming Tip`:
.
Since system calls are expensive, design programs to make as few of them as possible, preferring buffered library calls
. Caching is a frequent technique for reducing system call overhead...
. eg there is an `incredible` speed difference between the following programs
    /* Buffered: stdio batches many characters into each read/write */
    #include <stdio.h>
    int main (void)
    {
      int c;
      while ((c = getchar ()) != EOF)
        putchar (c);
      return 0;
    }
. vs
    /* Unbuffered: two system calls for every single character */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    int main (void)
    {
      char c;
      while (read (0, &c, sizeof c) > 0)
        if (write (1, &c, sizeof c) != 1)
          perror ("write"), exit (1);
      return 0;
    }

Handling Errors in System Calls
. System calls `generally` return -1 on failure and >= 0 on success
. If an error occurs, the library wrapper sets a global integer variable (@errno@) before returning
. `Programming Tip`:
. (Almost) `always` check the return value of a system call!!
. However, only check @errno@ if the system call actually failed...
. @perror@ (a library call) may be used to print an appropriate error message explaining why a call failed, eg
    void perror (const char *msg);
    if (pipe (pipefd) == -1)
      perror ("error creating pipe"), exit (1);
    /* "error creating pipe: too many open files" */

Process Management
. The `process' is the basic unit of protection and execution in UNIX
. A process is `a program in execution'
. Consists of a collection of resources, eg
. Process identifier
. CPU
. Real and effective user and group ids
. Processor status word
. Scheduling status (eg running, ready-to-run, blocked, etc.)
. Priority
. Virtual memory
. Run-time stack
. Open I/O devices
. Signal masks and signal handlers
. Timers and profile buffers

Process Management (cont'd)
. Traditional process management in UNIX involves:
@ `Creation`
. eg @fork@ or @vfork@ (makes a logical copy)
@ `Invocation`
. eg the @exec@ family (overlays current process)
@ `Synchronization`
. eg @wait@ (and variants), semaphores, file and record locking
@ `Communication`
. eg various IPC mechanisms like @signal@, @pipes@, @sockets@, @message queues@, etc.
@ `Termination`
. eg @exit@

Process Management (cont'd)
.
Systems like MACH and Solaris distinguish between `processes/tasks` and `threads`
. `Process/Task` - unit of resource allocation, protection, and authorization
. `Thread` - unit of execution within a process or task
. Like any other running program, threads must be scheduled
. This is usually done on a system-wide or process-wide basis

Process Management (cont'd)
. Threads are separate execution paths within a process address space
. Generally, a thread shares the same global address space, open files, signal handlers, etc.
. However, it has a separate stack, registers, instruction pointer, and signal mask
. Objects in the shared address space are not protected, and may be modified at any time by any thread
. Mutual exclusion methods (eg mutexes, condition variables, etc.) must be used to protect such shared objects from unexpected modification
. We will cover threads in detail later in the course...

Process Creation
. Two general reasons for creating processes:
@ A process wants to make a copy of itself so that one copy can handle one set of operations while the other copy performs another task, eg
. Network servers
. eg login daemon, ftp daemon
. Multiprocess applications
@ A process wants to execute another program, eg
. Shell command interpreters
. Multi-pass C compilers
. @inetd@ superserver
. Note, there is generally more overhead for the second case, since an @exec@ is performed

Process Creation (cont'd)
. The only way in UNIX to create a new process is via the @fork@ system call
    int child_pid = fork ();
. Fork makes a `logical' duplicate of the parent process
. Duplicate is called the `child` process
. Duplicating a process may seem like a lot of unnecessary work...
. Optimizations include:
. `Copy-on-write' implementations
. Utilize memory-mapping facilities
. @vfork@
. Basically made obsolete with `copy-on-write'

Process Creation (cont'd)
. Important subtlety: open files in the parent remain open in the child

Process Creation (cont'd)
.
@fork@ is called once (by the parent), but returns twice (once in the parent and once in the child)
. At this point, the two processes are `almost` identical
. ie they possess different PIDs
. However, they `may' change as the result of subsequent operations...
. @fork@ returns 0 in the child, child PID in the parent
. Typical idiom:
    int pid;
    switch (pid = fork ()) {
    case -1:   /* fork failed, no child was created */
      perror ("fork failed"), exit (1);
    case 0:    /* running in the child */
      do_child_work ();
      break;
    default:   /* running in the parent; pid is the child's PID */
      do_parent_work ();
      break;
    }

Program Invocation
. The only way to execute a program in UNIX is to have an existing process issue a system call from the @exec@ family
. @execve@ is generally a system call
. Other APIs are library calls to @execve@
. @exec@ replaces the current process with a new program
. The PID does not change
. But text and data typically do change!
. As with @fork@, any open files remain open
. Unless the `close-on-exec' @fcntl@ is invoked prior to the @exec@
. However, shared memory segments of the calling process will not be attached in the new process

Program Invocation (cont'd)
. @exec@ is generally a more expensive operation than @fork@:
. Some parts of the program must be loaded from disk
. @fork@ often uses `copy-on-write' optimizations
. However, @exec@ may use demand-paging...
. Note that thread implementations typically provide a lightweight variant of @fork@ and @exec@ that combines both creation and invocation operations into a single `spawn new thread' command

Program Invocation (cont'd)
. Signal handling semantics
. Signals set to SIG_DFL or SIG_IGN retain the same disposition in the new program
. The disposition for `caught' signals is reset to the default
. Signals set to be blocked in the calling process remain blocked in the new process
. Note, the difference between `real` and `effective` user IDs is important to understand

Program Invocation (cont'd)
. @exec@ is a system call, but there are multiple interfaces to this call:
@ execl (char *path, char *arg0 [, arg1, ..., argn], (char *) 0);
@ execlp (char *file, char *arg0 [, arg1, ..., argn], (char *) 0);
@ execle (char *path, char *arg0 [, arg1, ..., argn], (char *) 0, char *envp[]);
@ execv (char *path, char *argv[]);
@ execvp (char *file, char *argv[]);
@ execve (char *path, char *argv[], char *envp[]);
. Note, @exec@ can fail if @file@ or @path@ does not exist or appropriate permission bits are not set!
. But `not` if the @exec@'d program fails!

Process Synchronization
. The @wait@ system call is used in conjunction with @fork@
. @wait@ delays its caller until a signal is received or any one of its child processes terminates or stops:
    int child_pid = wait (int *statusp);
. @wait@ implements the following general rules:
@ If any child has already died then return immediately, returning the PID and exit status
@ If there are no remaining children, return is immediate with the value -1
@ If only running or blocked children exist, the calling process is blocked
@ If multiple children exit, then the parent may need to loop to `reap' them all

Program Invocation (cont'd)
. The following tweaking typically happens between @fork@ and @exec@:
    changing process groups
    changing tty process group
    changing session id
    changing controlling terminal
    redirecting I/O
    closing files
    changing signal handlers
    changing uid/gid
    changing limits
    changing priority
    changing working directory
    changing root directory
    changing umask

Process Synchronization
. If @statusp@ != NULL then on return it points to information on `why` the program stopped and `what` the exit status was
. ie 2 pieces of info are stored in a single value!
. Programs may stop due to `tracing' and/or due to exiting
. Note that there are many other variants of @wait@
. eg
    int waitpid (int pid, int *statusp, int options);
    int wait3 (int *statusp, int options, struct rusage *);
    int wait4 (int pid, int *statusp, int options, struct rusage *);
    int waitid (idtype_t idtype, id_t id, siginfo_t *infop, int options);
.
These variants are often important for `reaping zombies' via the WNOHANG option

Process Communication
. Often, separate UNIX processes do not need to communicate
. eg multiple concurrent @ftp@ sessions
. Other times, they do
. eg local and remote interprocess communication (IPC) involves the interaction of two or more processes
. eg provides the basis for peer-to-peer and client/server architectures
. Interprocess communication (IPC) is a rich and complex topic that is explored in-depth later on

Process Termination
. The @exit@ system call terminates the current process, returning an `exit status.'
. Usually 0 or 1...
. Note that @exit@ `never` returns!
    void exit (int status);
. Parent is notified of the cause of child's termination
. Note that there are two exit calls:
@ @_exit@: a `system call` that exits immediately and never returns...
@ @exit@: a `library routine` that flushes stdio buffers, calls any @atexit()@ functions, and then calls @_exit@

Fork, Exec, Wait, Exit Example
. ezshell - a simple shell program
. eg
    prompt..: date
    Sat Jul 11 22:48:26 pdt 1992
    prompt..: ^D
    => sort -r /usr/dict/words
    => who
    aporter  ttyp0 Jul 10 13:38 (festus.cs.umd.ed)
    klefstad ttyp1 Jul  5 17:36 (net6.ics.uci.edu)
    yessayan ttyp4 Jul  6 12:35 (chapelle.ics.uci)
    sam      ttyp5 Jul  6 12:56 (john-bigboote.ic)
    => [lots of sorted words omitted]
    => ^D

Fork, Exec, Wait, Exit Example (cont'd)
. eg
    /* ezshell.c */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>
    void parse (char *, char **);
    int execute (char **, int);
    /* Reap zombie'd children (run in the parent) */
    void child_reaper (int sig)
    {
      int status;
      signal (SIGCHLD, child_reaper);
      /* waitpid returns 0 while children are still running, so stop
         as soon as nothing more can be reaped */
      while (waitpid (-1, &status, WNOHANG) > 0)
        continue;
    }
    int main (int argc, char *argv[])
    {
      char *prompt = argc > 1 ? argv[1] : "command: ";
      int dont_wait = argc > 2 && *argv[2] == 'b' ? 1 : 0;
      signal (SIGCHLD, child_reaper);
      for (;;) {
        char buf[1024];
        char *args[64];
        fputs (prompt, stdout);
        if (fgets (buf, sizeof buf, stdin) == 0)
          printf ("\n"), exit (0);
        parse (buf, args);
        if (execute (args, dont_wait) < 0)
          perror (argv[0]);
      }
    }

Fork, Exec, Wait, Exit Example (cont'd)
. eg
    /* Transforms string into an argv-style vector of strings */
    void parse (char *buf, char *args[])
    {
      while (*buf != '\0') {
        /* Overwrite separators (including the newline kept by fgets) */
        while (*buf == ' ' || *buf == '\t' || *buf == '\n')
          *buf++ = '\0';
        if (*buf == '\0')
          break;   /* only trailing whitespace was left */
        *args++ = buf;
        while (*buf != '\0' && *buf != ' ' && *buf != '\t' && *buf != '\n')
          buf++;
      }
      *args = 0;
    }
    /* Fork, exec, and (maybe) wait */
    int execute (char *args[], int dont_wait)
    {
      int pid;
      if (args[0] == 0)
        return 0;   /* empty command line, nothing to run */
      switch (pid = fork ()) {
      case -1:
        return -1;
      case 0: /* child */
        execvp (args[0], args);
        _exit (1);   /* only reached if execvp failed */
      default: /* parent */
        if (dont_wait == 0 && wait (0) == -1)
          return -1;
      }
      return 0;
    }

Job Control and Process Groups
. Each process is associated with a `process group'
. Process groups are used for `job control.'
. A group of 1 or more related processes is called a `job', eg
    % rwho | awk ' { print $1 } ' | sort -u | wc
. Signals sent to a `process group leader' are delivered to all members of the process group
. eg the `hang up' signal is sent to the login process group when a user logs out
. By default, a new process has a unique PID and it inherits its parent's process group id

Jobs and Process Groups (cont'd)
. @getpgrp@ returns the process group for a specified process
    int pid = getpid ();
    int pgrp = getpgrp ();
. @setpgrp@ sets the process group id for a process to its process ID
    int setpgrp ();
. A privileged process may set its process group and user ID to any value
. eg used by @inetd@

Jobs and Process Groups (cont'd)
. An unprivileged process may set its process group id to its PID
. This allows it to become its own process group leader (and prevents it from receiving signals sent to its parent)
. This is important for network server daemons...
.
@setsid()@ sets the process group ID and session ID of the calling process to the process ID of the calling process, and releases the process's controlling terminal

Process Memory Organization
. Each process has five logical memory segments, ie
@ Text segment
@ Data segment
@ BSS segment
. `Block started by symbol'
@ Stack segment
@ Heap segment
. The text and data segments are physically stored on disk
. Typically `mapped' in at @exec@ time...
. The other segments are created dynamically by the OS at program load-time
. or as part of dynamic linking

Process Memory Organization
@ `Text Segment`
. Fixed-size read-only sharable program instructions (may also contain read-only data)
@ `Data Segment`
. For initialized global and static data
@ `bss Segment`
. For uninitialized global and static data
@ `Heap`
. Memory for dynamically allocated data structures of arbitrary size and arbitrary alloc/dealloc order
. @sbrk@ changes the limit for the heap segment
    caddr_t new_limit = sbrk (int incr);

Process Memory Organization
@ `Stack`
. For subroutine activation records (alloc/dealloc in LIFO order)
. @getrlimit@ and @setrlimit@ are used to change the limit for the stack segment
    struct rlimit rlim;
    getrlimit (RLIMIT_STACK, &rlim);
    rlim.rlim_cur = rlim.rlim_max;
    setrlimit (RLIMIT_STACK, &rlim);
. Note that this doesn't work very well in threaded programs since there are multiple stacks!

Process Memory Organization (cont'd)
. Relationship between process segments and C/C++ variable storage classes

Exception Handling: setjmp and longjmp
. Provides a method for performing non-local gotos
. Mostly used for error handling
. Again, not very useful for threaded programs
. Requires saving machine `state' into a data structure called a @jmp_buf@
. The @jmp_buf@ is used to later unwind the stack back to the original state
.
@jmp_buf@ is an array containing general program registers, the process status word, and the instruction and stack pointers

Exception Handling: setjmp and longjmp (cont'd)
. @setjmp@: saves the machine `state` in a @jmp_buf@ for a subsequent return via @longjmp@
    int setjmp (jmp_buf env);
. @longjmp@: transfers control to the matching @setjmp@ call (it never returns to its own caller)
    void longjmp (jmp_buf env, int value);
. Note, @jmp_buf@ is an array, and is passed by reference
. This allows it to be modified...

Exception Handling: setjmp and longjmp (cont'd)
. @setjmp@ returns a different value for all `returns' other than the first one (to distinguish subsequent @longjmp@s) ie
. The first (direct) call returns 0
. Subsequent returns (triggered by @longjmp@) are != 0
. Note the status of variables on return:
@ The general CPU and floating-point data registers are restored to the values they had at the time that @setjmp()@ was called
@ All memory-bound (ie auto/static/global) data have values as of the time @longjmp()@ was called
. ie only nanoseconds before!

Exception Handling (cont'd)
. eg
    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #define WORD "thunderbird"
    static jmp_buf env; /* Stores current program state */
    static void timeout (int sig)
    {
      signal (SIGALRM, timeout);
      longjmp (env, 1);   /* resume at the setjmp below, returning 1 */
    }
    int main (void)
    {
      int not_saved = 10;
      register int maybe_saved = 11; /* May not be saved */
      char buf[16] = "";
      signal (SIGALRM, timeout);
      if (setjmp (env) == 0) {
        alarm (15);
        not_saved++;
        maybe_saved++;
        printf ("type a word; if you don't in 15 "
                "seconds I'll use \"%s\": ", WORD);
        fgets (buf, sizeof buf, stdin);
        alarm (0);
      }
      if (buf[0] == '\0')
        strcpy (buf, WORD);
      printf ("\nword: %s, not_saved: %d, maybe_saved: %d\n",
              buf, not_saved, maybe_saved);
      exit (0);
    }

Basic Input and Output Routines
. Several basic categories of I/O device routines:
@ `Opening and Closing I/O Devices`
@ `Reading and Writing I/O Devices`
@ `Random access methods`
@ `I/O Multiplexing`
@ `Controlling I/O Devices`
.
Miscellaneous operations
@ `Duplicating descriptors`
@ `Converting descriptors to file pointers`

Descriptors
. Application processes access all I/O devices in UNIX via `descriptors.'
. eg file descriptors and socket descriptors
. Often referred to collectively as `file` descriptors
. We refer to them more accurately as I/O descriptors...
. Descriptors are small non-negative integers that serve as `handles' indexing into a kernel-maintained, per-process I/O descriptor table
. Certain values are predefined by default to support transparent I/O redirection:
@ Standard input has descriptor 0
@ Standard output has descriptor 1
@ Standard error has descriptor 2

Descriptors (cont'd)
. Processes typically start out with a default of 64 descriptors
. The max limit is often around 200 descriptors
. To determine the number of entries in the descriptor table use:
. The @getdtablesize@ system call in BSD
. The @sysconf@ call with the argument @_SC_OPEN_MAX@ in SVR4
. @setrlimit@ and @getrlimit@ are used to change the number of entries, eg
    #include <sys/resource.h>
    /* If @new_limit@ is negative set the limit to the maximum
       allowable. Otherwise, set it to be the value of @new_limit@. */
    int raise_file_limit (int new_limit)
    {
      struct rlimit rl;
      /* Always fetch the current limits first so that rlim_max is
         initialized before it is handed back to setrlimit */
      if (getrlimit (RLIMIT_NOFILE, &rl) == -1)
        return -1;
      if (new_limit < 0)
        rl.rlim_cur = rl.rlim_max;
      else
        rl.rlim_cur = new_limit;
      return setrlimit (RLIMIT_NOFILE, &rl);
    }

Descriptors (cont'd)
@ `Per-Process Descriptor Table`
  One entry for each instance of an open device
@ `Global File Table`
  Allows sharing of `file pointers'
@ `Global vnode Table`
  Cache of information for all open devices

Descriptors (cont'd)
. UNIX devices typically implement a well-defined set of system calls:
. @open@ and @close@
. @read@ and @write@
. @lseek@
. @select@ (or @poll@)
. @ioctl@, @fcntl@, and/or @getsockopt@/@setsockopt@
. This is what leads to the oft-repeated maxim that `in UNIX, everything is a file.'
.
ie all devices are represented via descriptor handles that allow certain basic operations
. Provides a rudimentary form of `object-oriented' programming...
. eg abstract interfaces that reuse existing kernel constructs such as the device switches

Opening and Creating I/O Devices
. The @open@ routine is used to open (or create) an I/O device for reading and/or writing:
    int open (char *path, int flags, mode_t mode);
. The @flags@ argument is composed by bitwise ORing together the following constants (defined in <fcntl.h>):
    O_RDONLY - open for reading only
    O_WRONLY - open for writing only
    O_RDWR - open for reading and writing
    O_APPEND - append when writing
    O_CREAT - create if file does not exist
    O_TRUNC - truncate the file to zero length if opened for writing
    O_EXCL - return error if file is to be created and already exists (used as a locking semaphore)
    O_NDELAY and O_NONBLOCK - do not block for I/O operations

Opening and Creating I/O Devices (cont'd)
. The third argument to @open@ is only used with the O_CREAT option
. It specifies the file protection mode
. eg 0666, 0644, 0755, etc.
. Note, the mode is bitwise-and'd with the complement of the umask value (ie mode & ~umask)
. eg
    int fd1 = open ("/etc/passwd", O_RDONLY);
    int fd2 = open ("/tmp/foo", O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd1 == -1)
      perror ("/etc/passwd");
. @open@ return values:
. -1 on failure
. The lowest available descriptor, starting from 0, on success
. Many programs depend on this behavior, `cf` the @dup@ system call

Closing I/O Devices
. The @close@ call closes an open I/O device:
    int close (int file_descriptor);
. @close@ returns -1 on error and 0 on success
. Note, I/O devices are automatically closed upon program exit, but it is often useful to conserve descriptors by closing unneeded ones during program execution
. Note, this is very important for network server daemons...
. Also, consider the behavior when closing a network connection...
.
eg
    if (close (fd1) == -1 || close (fd2) == -1)
      perror ("close failed");

Reading and Writing I/O Devices
. There are four routines for reading from and writing to a file or socket descriptor:
    int read (int fd, void *ptr, int size);
    int write (int fd, void *ptr, int size);
    int readv (int fd, struct iovec *iov, int iovcnt);
    int writev (int fd, struct iovec *iov, int iovcnt);
    struct iovec {
      char *iov_base;
      int iov_len;
    };
. @readv@ and @writev@ provide a `scatter read` and `gather write` facility, respectively. This cuts down on buffer management and mode switch overhead
. On success, @read@ and @readv@ return the number of bytes read and placed in the buffer. They return -1 on error and 0 on EOF
. On success, @write@ and @writev@ return the number of bytes written. They return -1 on failure
. Remember to always check return values, especially when dealing with network I/O
. Due to `short' @read@s and @write@s

Writev Example
. Consider a function that writes a header and some associated data to a device:
. Without @writev@, we must either make two system calls to @write@, eg
    int write_hdr (int fd, char *header, int hbytes,
                   char *data, int dbytes)
    {
      if (write (fd, header, hbytes) != hbytes)
        return -1;
      if (write (fd, data, dbytes) != dbytes)
        return -1;
      return dbytes + hbytes;
    }
. or else dynamically allocate one buffer, perform several copies, and then do one write, eg
    int write_hdr (int fd, char *header, int hbytes,
                   char *data, int dbytes)
    {
      int total = hbytes + dbytes;
      char *tmp = malloc (total);
      if (tmp == 0)
        return -1;   /* out of memory */
      memcpy (tmp, header, hbytes);
      memcpy (tmp + hbytes, data, dbytes);
      if (write (fd, tmp, total) != total)
        total = -1;
      free (tmp);
      return total;
    }

Writev Example (cont'd)
.
Using @writev@ we can eliminate both overheads, eg
    #include <sys/types.h>
    #include <sys/uio.h>
    int write_hdr (int fd, char *header, int hbytes,
                   char *data, int dbytes)
    {
      struct iovec iov[2];
      int total = hbytes + dbytes;
      iov[0].iov_base = header;
      iov[0].iov_len = hbytes;
      iov[1].iov_base = data;
      iov[1].iov_len = dbytes;
      /* One system call gathers both buffers */
      if (writev (fd, &iov[0], 2) != total)
        return -1;
      return total;
    }
. Note the OS kernel performs the coalescing of disjoint buffers...

Reading and Writing I/O Devices (cont'd)
. There are six more routines for reading from and writing to a socket descriptor:
. Provide an additional socket-specific @flags@ parameter:
    int send (int s, char *msg, int len, int flags);
    int recv (int s, char *buf, int len, int flags);
    #define MSG_OOB /* process out-of-band data */
    #define MSG_PEEK /* peek at incoming message */
    #define MSG_DONTROUTE /* don't use routing tables */
. Exchange datagram messages:
    int sendto (int s, char *msg, int len, int flags,
                struct sockaddr *to, int tolen);
    int recvfrom (int s, char *buf, int len, int flags,
                  struct sockaddr *from, int *fromlen);

Reading and Writing I/O Devices (cont'd)
. Socket routines (cont'd)
. Kitchen sink interface (also allows passing `access rights,' eg open file descriptors):
    #include <sys/socket.h>
    int sendmsg (int s, struct msghdr *msg, int flags);
    int recvmsg (int s, struct msghdr *msg, int flags);
    /* Note, made obsolete in BSD 4.4 */
    struct msghdr {
      char *msg_name; /* optional address */
      int msg_namelen; /* size of address */
      struct iovec *msg_iov; /* scatter/gather array */
      int msg_iovlen; /* # elements in msg_iov */
      /* access rights sent/received */
      char *msg_accrights;
      int msg_accrightslen;
    };

Reading and Writing I/O Devices (cont'd)
. Finally, SVR4 provides four more routines that handle reading from and writing to STREAMS devices
. int putmsg (int fd, struct strbuf *cntl, struct strbuf *data, int flags);
. int getmsg (int fd, struct strbuf *cntl, struct strbuf *data, int *flags);
. int putpmsg (int fd, struct strbuf *cntl, struct strbuf *data, int band, int flags);
. int getpmsg (int fd, struct strbuf *cntl, struct strbuf *data, int *band, int *flags);
. These interfaces provide a message-oriented I/O API

Reading and Writing I/O Devices (cont'd)
. The @putmsg@ and @getmsg@ routines all utilize the struct @strbuf@ structure
    struct strbuf {
      /* no. of bytes in buffer */
      int maxlen;
      /* no. of bytes returned */
      int len;
      /* pointer to data */
      char *buf;
    };

Reading and Writing I/O Devices (cont'd)
. The following example reimplements @write_hdr@ in terms of @putmsg@
    int write_hdr (int fd, char *header, int hbytes,
                   char *data, int dbytes)
    {
      struct strbuf header_buf, data_buf;
      header_buf.buf = header;
      header_buf.len = hbytes;
      data_buf.buf = data;
      data_buf.len = dbytes;
      if (putmsg (fd, &header_buf, &data_buf, 0) == -1)
        return -1;
      return hbytes + dbytes;
    }
. Note that this is not as general as @writev@...

Read and Write Example
. `append`
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    int main (int argc, char *argv[])
    {
      char *progname = argv[0];
      if (argc != 3)
        fprintf (stderr, "Usage: %s"
                 " from-file to-file\n", progname), exit (1);
      else {
        int n, from, to;
        char *input_file = argv[1];
        char *output_file = argv[2];
        char buf[BUFSIZ];
        if ((from = open (input_file, O_RDONLY)) < 0)
          perror (input_file), exit (1);
        if ((to = open (output_file,
                        O_WRONLY | O_CREAT | O_APPEND, 0644)) < 0)
          perror (output_file), exit (1);
        while ((n = read (from, buf, sizeof buf)) > 0)
          if (write (to, buf, n) != n)
            perror (progname), exit (1);
        if (n < 0)   /* read failed (n == 0 means EOF) */
          perror (progname);
        if (close (from) == -1 || close (to) == -1)
          perror (progname), exit (1);
        exit (0);
      }
    }

Controlling I/O Devices
. There are several system calls that control open I/O devices:
. int fcntl (int fd, int cmd, int arg);
. Operate on `open' files and sockets
. int ioctl (int fd, int request, char *arg);
. Operate on arbitrary devices, including terminals, Streams, etc.
. int getsockopt (int s, int level, int optname, char *optval, int *optlen);
.
int setsockopt (int s, int level, int optname, char *optval, int optlen);
  . Operate on sockets only
. These routines generally have `kitchen-sink semantics.'
  . ie they handle a `hodge-podge' of operations that do not fit neatly into any other category
. Note the `polymorphic' interfaces!

Non-blocking I/O
. One common use of I/O control functions is to set a descriptor into `non-blocking' mode, eg
  . To enable non-blocking mode

    fcntl (fd, F_SETFL, fcntl (fd, F_GETFL, 0) | O_NONBLOCK);

  . To disable non-blocking mode

    fcntl (fd, F_SETFL, fcntl (fd, F_GETFL, 0) & ~O_NONBLOCK);

. Note, there are subtle differences between O_NDELAY and O_NONBLOCK
  . On some systems that support both flags (such as SunOS 5.x), using O_NDELAY causes future @read@ calls to return zero if the call would have blocked, whereas using O_NONBLOCK causes @read@ calls to fail with EAGAIN or EWOULDBLOCK

Non-blocking I/O
. The portable way to write non-blocking I/O code is to use conditional compilation so that either O_NDELAY or O_NONBLOCK can be used, depending on which of these flags is available on the platform, eg

    #if ...
    #elif ...
    #elif ...
    #endif

Random Access Methods
. @lseek@ moves the file table pointer to a specific location for I/O devices that support random access
  . eg `block' devices like the file system
  . Note that the @read/write/lseek@ paradigm for random block I/O has been superseded by @mmap@ in many cases...
. Subsequent @read@ and @write@ calls start from the new location
. Note, some devices are incapable of seeking
  . eg pipes and sockets may only be read sequentially

Random Access Methods (cont'd)
. Interface:

    int lseek (int fd, long offset, int whence);

. There are three values of @whence@:

    #define SEEK_SET 0 /* beginning of file */
    #define SEEK_CUR 1 /* current location in file */
    #define SEEK_END 2 /* end of the file */

. Note, negative offsets are allowed!
. @lseek@ returns the seek pointer location as measured in bytes from the beginning of file on success.
On failure it returns -1.

lseek Example
. eg

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    struct record { int uid; char login[9]; };

    char *logins[] = { "user1", "user2", "user3", "user4", "user5" };

    /* Write record r into slot i of the file */
    void putrec (int fd, int i, struct record *r)
    {
      lseek (fd, (long) i * sizeof *r, SEEK_SET);
      write (fd, r, sizeof *r);
    }

    int main (void)
    {
      int i, fd;
      struct record rec;

      if ((fd = open ("datafile", O_TRUNC | O_WRONLY | O_CREAT, 0644)) < 0)
        perror ("datafile"), exit (1);

      /* Process in reverse order... */
      for (i = 4; i >= 0; i--) {
        rec.uid = i;
        strcpy (rec.login, logins[i]);
        putrec (fd, i, &rec);
      }

      if (close (fd) == -1)
        perror ("main"), exit (1);
      exit (0);
    }

I/O Multiplexing
. There are several types of I/O paradigms available in the UNIX programming model
  @ `Blocking on a Single Descriptor`
    . Process waits for completion of I/O on 1 descriptor
    . Often inadequate for complex event-driven applications
  @ `Non-blocking`
    . ie `polling,' process does not wait
    . Often inefficient if input is not immediately available

I/O Multiplexing
. I/O paradigms (cont'd)
  @ `Asynchronous`
    . ie `signal-driven,' process receives the SIGIO or SIGPOLL signal when I/O is available on a socket or terminal
    . Gets complicated if more than 1 descriptor
  @ Separate thread or process per-descriptor
    . May be an inefficient use of resources...
  @ `I/O Multiplexing`
    . Perform a `timed-wait` on multiple I/O descriptors in a `descriptor set`
    . Used in event-driven network servers and window systems

I/O Multiplexing (cont'd)
. There are two ways to perform I/O multiplexing in UNIX:
  . @select@ (BSD)
    . Works on files, sockets, terminals, etc.
  . @poll@ (System V)
    . SVR3 @poll@ does not work on non-STREAMS devices...
    . SVR4 @poll@ works for all devices
. @select@ is the `de facto` standard
. @poll@ is the `de jure` standard

I/O Multiplexing (cont'd)
. @select()@ and @poll()@ provide similar services
. Both allow application processes to wait a user-specified timeout interval for I/O events to occur on multiple descriptors .
Depending on the timeout interval, the call may:
  @ block indefinitely until I/O events occur `or` an interrupt (signal) is raised
  @ return immediately
    . ie a `poll'
  @ wait a `user-specified' time interval for activity to occur on the descriptor set
    . Note, @select()@ also serves as a high-resolution timer!

I/O Multiplexing (cont'd)
. @select()@ and @poll()@ overview (cont'd)
  . Both system calls are very complicated to understand at first glance
    . Due to their `kitchen-sink' interfaces... ;-)
  . However, @poll()@ has a `cleaner,' more general API
    . @poll()@ also supports additional functionality such as detecting whether priority-band data is pending...
    . @select()@ uses a value/result API that requires application-level data copying
  . In general, @poll()@ trades off space to save time, whereas @select()@ trades off time to save space

I/O Multiplexing (cont'd)
. @select()@ and @poll()@ are generally implemented in the following way:
  @ They poll all file descriptors once; if none are ready, the calling process is put to sleep
  @ When a file descriptor being selected on becomes ready for I/O, the driver code responsible for that descriptor calls the internal routine @selwakeup@
  @ @selwakeup@ causes all processes selecting on that address to be woken up
. At that point, the @select()@ or @poll()@ call polls all file descriptors in its masks one more time and returns

I/O Multiplexing (cont'd)
. @select()@ allows application processes to wait a user-specified timeout interval for multiple descriptors to become available for `reading`, `writing`, and/or `exceptional` events
. Interface:

    #include <sys/types.h>
    #include <sys/time.h>

    /* struct timeval { long tv_sec; long tv_usec; }; */

    int select (
      int maxfdp1,         /* Maximum descriptor plus 1 */
      fd_set *readfds,     /* bit-mask of "read" descriptors */
      fd_set *writefds,    /* bit-mask of "write" descriptors */
      fd_set *exceptfds,   /* bit-mask of "exception" descriptors */
      struct timeval *tv   /* Amount of time to wait for events to occur */
    );

I/O Multiplexing (cont'd) .
Parameter semantics:
  . Any of the three @fd_set@ pointer arguments may be NULL
    . Otherwise, they are `value-result' arguments, so be sure to store copies...
  . @maxfdp1@ specifies the number of descriptors to be tested
    . Its value is the maximum descriptor value to be tested, plus 1
  . The timeout is either:
    @ a 0 pointer (blocks indefinitely)
    @ tv_sec == 0 && tv_usec == 0 (poll)
    @ tv_sec > 0 || tv_usec > 0 (timed wait)
  . @select@ returns the number of descriptors that became `enabled'

I/O Multiplexing (cont'd)
. Associated operations on fd_sets

    #include <sys/types.h>

    typedef long fd_mask;
    #define NFDBITS (sizeof (fd_mask) * NBBY)
    #define howmany(x, y) (((x) + ((y) - 1)) / (y))

    typedef struct fd_set {
      fd_mask fds_bits[howmany(FD_SETSIZE, NFDBITS)];
    } fd_set;

    FD_ZERO (fd_set *fdset);          /* clear all bits */
    FD_SET (int fd, fd_set *fdset);   /* turn a bit on */
    FD_CLR (int fd, fd_set *fdset);   /* turn a bit off */
    FD_ISSET (int fd, fd_set *fdset); /* test a bit */

    #define FD_SET(n, p) \
      ((p)->fds_bits[(n)/NFDBITS] \
       |= (1 << ((n) % NFDBITS)))
    #define FD_CLR(n, p) \
      ((p)->fds_bits[(n)/NFDBITS] \
       &= ~(1 << ((n) % NFDBITS)))
    #define FD_ISSET(n, p) \
      ((p)->fds_bits[(n)/NFDBITS] & \
       (1 << ((n) % NFDBITS)))
    #define FD_ZERO(p) bzero ((char *)(p), sizeof (*(p)))

I/O Multiplexing (cont'd)
. Note that it is possible to increase the size of an fd_set, eg

    #define FD_SETSIZE 1024
    #include <sys/types.h>

  . This is only useful if your UNIX kernel supports large numbers of open devices per-process!
. A typical request to @select@ might be:
  . `Return when any of the descriptors in the set [1,4,5] are ready for reading, or if any of the descriptors in the set [2,7] are ready for writing, or if any of the descriptors in the set [1,4] have an exceptional condition pending.`

I/O Multiplexing (cont'd) .
eg

    fd_set rd_fds, wr_fds, ex_fds;
    fd_set cp_rd_fds, cp_wr_fds, cp_ex_fds;
    int width, n;
    struct timeval tv, *tvp = 0;

    FD_ZERO (&rd_fds); FD_ZERO (&wr_fds); FD_ZERO (&ex_fds);
    FD_SET (1, &rd_fds); FD_SET (4, &rd_fds); FD_SET (5, &rd_fds);
    FD_SET (2, &wr_fds); FD_SET (7, &wr_fds);
    FD_SET (1, &ex_fds); FD_SET (4, &ex_fds);

    for (;;) {
      /* select overwrites its arguments, so pass copies */
      cp_rd_fds = rd_fds; cp_wr_fds = wr_fds; cp_ex_fds = ex_fds;

      /* Restart the call if it is interrupted by a signal */
      for (width = 7 + 1;
           (n = select (width, &cp_rd_fds, &cp_wr_fds, &cp_ex_fds, tvp)) == -1
             && errno == EINTR; )
        continue;

      switch (n) {
      case -1: /* handle errors */
        break;
      case 0: /* handle elapsed timeouts (if tvp points to tv...) */
        break;
      default: /* handle enabled descriptors */
        for (int fd = 0; fd < width && n > 0; fd++) {
          if (FD_ISSET (fd, &cp_rd_fds)) /* n--; ... */;
          if (FD_ISSET (fd, &cp_wr_fds)) /* n--; ... */;
          if (FD_ISSET (fd, &cp_ex_fds)) /* n--; ... */;
        }
      }
    }

I/O Multiplexing (cont'd)
. This example rewrites the earlier SIGALRM program using @select@

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/time.h>

    #define WORD "thunderbird"

    int main (int argc, char *argv[])
    {
      char buf[16];
      struct timeval tv = {15L, 0L};
      fd_set rd_fds;

      FD_ZERO (&rd_fds);
      FD_SET (0, &rd_fds);

      printf ("type a word; if you don't in 15 "
              "seconds I'll use \"%s\": ", WORD);
      fflush (stdout); /* make sure the prompt appears before we wait */

      switch (select (1, &rd_fds, 0, 0, &tv)) {
      case -1: perror (argv[0]), exit (1);
      case 0: strcpy (buf, WORD); break;            /* timed out */
      default: fgets (buf, sizeof buf, stdin); break; /* input is ready */
      }
      printf ("\nword: %s\n", buf);
      exit (0);
    }

The poll System Call
. @poll()@ allows application processes to wait a user-specified timeout interval for multiple I/O events to occur on multiple descriptors
. `Programming Tip`
  . Use @poll@ if possible since it provides a much wider variety of events than @select()@
  . Note that SVR4 implements @select@ via @poll@!

The poll System Call (cont'd) .
Interface

    int poll (
      struct pollfd fds[],  /* Descriptors of interest */
      unsigned long nfds,   /* Number of descriptors to check */
      int timeout           /* Length of time to wait (in milliseconds)
                               -1 == block indefinitely
                                0 == check and return immediately
                              > 0 == wait for "timeout" milliseconds or
                                     for an event on the list, whichever
                                     comes first */
    );

. Parameters

    struct pollfd {
      int fd;        /* file descriptor to poll */
      short events;  /* events of interest on fd */
      short revents; /* events that occurred on fd */
    };

The poll System Call (cont'd)
. @events@ values

    #define POLLIN     0x0001 /* fd is readable */
    #define POLLPRI    0x0002 /* high priority info at fd */
    #define POLLOUT    0x0004 /* fd is writeable (won't block) */
    #define POLLRDNORM 0x0040 /* normal data is readable */
    #define POLLWRNORM POLLOUT
    #define POLLRDBAND 0x0080 /* out-of-band data is readable */
    #define POLLWRBAND 0x0100 /* out-of-band data is writeable */
    #define POLLNORM   POLLRDNORM

. @revents@ values

    #define POLLERR  0x0008 /* fd has error condition */
    #define POLLHUP  0x0010 /* fd has been hung up on */
    #define POLLNVAL 0x0020 /* invalid pollfd entry */

The poll System Call (cont'd)
. This example rewrites the earlier SIGALRM program using @poll@

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <poll.h>

    #define WORD "thunderbird"

    int main (int argc, char *argv[])
    {
      char buf[16];
      int tv = 15 * 1000; /* tv in milliseconds */
      struct pollfd p_fd = {0, POLLIN, 0};

      printf ("type a word; if you don't in 15 "
              "seconds I'll use \"%s\": ", WORD);
      fflush (stdout); /* make sure the prompt appears before we wait */

      switch (poll (&p_fd, 1, tv)) {
      case -1: perror (argv[0]), exit (1);
      case 0: strcpy (buf, WORD); break;            /* timed out */
      default: fgets (buf, sizeof buf, stdin); break; /* input is ready */
      }
      printf ("\nword: %s\n", buf);
      exit (0);
    }

Duplicating Descriptors
. Certain applications must have more than one descriptor referring to the same open device
. For example, when the shell forks and execs a new process as part of a pipeline, eg .
% who | wc
  @ Shell creates a pipe
  @ Shell forks twice, once for @who@ and once for @wc@
  @ Shell arranges for @stdout@ of @who@ to be piped to @stdin@ of @wc@
  @ Note, the entire process is transparent to @who@ and @wc@

Duplicating Descriptors
. Descriptors for @who@ and @wc@ pipe

Duplicating Descriptors (cont'd)
. Two system calls duplicate descriptors, @dup@ and @dup2@
. int dup (int fd);
  . @dup@ duplicates an existing descriptor
  . The return value is the lowest numbered descriptor not currently in use by the process
  . Important to perform @dup@s in correct order...
. int dup2 (int fd1, int fd2);
  . @dup2@ uses @fd2@ to specify the desired value of the new descriptor
  . If @fd2@ is already open it is first closed
  . close (0); dup (fd); is equivalent to dup2 (fd, 0);
  . Don't need to keep track of order...

Converting Descriptors to FILE Pointers
. It is often useful to interoperate low-level I/O with the stdio library facilities
. @fdopen@ is a library routine that converts an existing low-level I/O descriptor into a buffered stdio library FILE *:
  . ie FILE *fdopen (int fd, char *mode);
  . @fd@ refers to an open descriptor, @mode@ tells how the descriptor is to be used (eg for reading "r" or writing "w")
  . @fdopen@ returns a valid FILE pointer on success and NULL on failure
. The macro @fileno@ from stdio.h gets a descriptor from a FILE *:

    int fd = open ("/etc/passwd", O_RDONLY);
    FILE *fp = fdopen (fd, "r");

    while (read (fileno (fp), buf, sizeof buf) > 0)
      /* ... */;