Wednesday, 14 January 2015

Read system call trace (specific to Linux kernel 3.16.3)

This post traces the functions called by the read system call. The read system call reads the file synchronously. The EXT3 filesystem has been used as an example. The error condition handling has not been traced in this flow.

The user performs a read operation via the read system call from user space. The system call generates a trap and control switches from user space to kernel space. In kernel space, the call is handled by the read SYSCALL_DEFINE. The parameters passed to the function are fd (the file descriptor of the file to read), the userspace buffer to copy the data into and the size of the data to read. The file object, the descriptor flags and the position to read from are looked up from fd inside the function.


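SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
	struct fd f = fdget_pos(fd);	/* look up the file object for fd */
	ssize_t ret = -EBADF;

	if (f.file) {
		loff_t pos = file_pos_read(f.file);	/* current f_pos */
		ret = vfs_read(f.file, buf, count, &pos);
		if (ret >= 0)
			file_pos_write(f.file, pos);	/* store the new f_pos */
		fdput_pos(f);	/* drop the reference taken by fdget_pos */
	}
	return ret;
}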


The function does the following:
  1. Looks up the file object for the given file descriptor and reads the current position in the file (the f_pos field of the file object holds the position to read from).
  2. Calls the vfs_read function, passing the file object pointer, userspace buffer, size of data to read and the position to read from the file.
  3. If the read did not fail, writes the updated position back to the file object.
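The effect of the file position handling can be observed from user space. The following is a minimal sketch, not kernel code: the helper name offset_after_read and the scratch file name are made up for this illustration. It writes some bytes to a temporary file, reads part of them back with read(), and asks the kernel for the resulting offset with lseek().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Illustration only (user space): write `total` bytes to a scratch file,
 * read `want` bytes from the start, and return the file offset the kernel
 * reports afterwards. Returns -1 on any failure.
 */
long offset_after_read(size_t total, size_t want)
{
    char path[] = "/tmp/read_trace_XXXXXX";   /* hypothetical scratch name */
    char buf[256];
    long result = -1;
    int fd;

    if (total > sizeof(buf) || want > sizeof(buf))
        return -1;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                 /* the file goes away when fd is closed */

    for (size_t i = 0; i < total; i++)
        buf[i] = 'a';

    if (write(fd, buf, total) == (ssize_t)total &&
        lseek(fd, 0, SEEK_SET) == 0) {
        ssize_t n = read(fd, buf, want);   /* the call traced in this post */

        if (n >= 0)
            result = (long)lseek(fd, 0, SEEK_CUR);
    }
    close(fd);
    return result;
}
```

Calling offset_after_read(100, 40) should report an offset of 40, because the position stored in the file object advances by the number of bytes actually read. Asking for more data than the file holds advances the position only to the end of the file.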
vfs_read() : This function takes the file object pointer, userspace buffer, the size of data to read and the position in the file to read from as its input parameters and returns the number of bytes read. It does the following:
  1. Verifies that the file has been opened for reading and that the user space pointer is valid.
  2. Calls the function rw_verify_area(). This function checks that the position to read from is valid, whether any mandatory locks are set on the portion of the file to read, and that the amount of data to read is within the allowed limit.
  3. If the read file operation is set in the file_operations structure of the file object, calls that function. For ext3 in this kernel, this is new_sync_read(). The way this function works is detailed later.
  4. Else, if the aio_read file operation is set, calls do_sync_read(). This function initializes a kernel io control block (kiocb) object (the file object pointer and the current task_struct are stored in it). The position to read from and the number of bytes to read are also stored in the kiocb, and then the file system's asynchronous read function is called. For EXT3, up to the 3.15 kernel, this was generic_file_aio_read(). If the io gets queued, the current process sleeps until the io operation completes.
  5. Else, calls new_sync_read() by default. The way this function works is detailed later.
If the read operation was successful, the fsnotify_access() function is called. Unless the file was opened with the no-notify flag, the parent is notified that the file has been accessed. Control then returns from the vfs_read function.
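The order of the checks and the three possible dispatch targets can be summarised in a small user-space model. This is a sketch under stated assumptions, not the kernel source: struct model_file, MODEL_FMODE_READ, enum model_path and model_vfs_read are hypothetical names invented here, and the NULL test only stands in for the kernel's user pointer validation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's f_mode bit and f_op table. */
#define MODEL_FMODE_READ 0x1u

enum model_path {
    PATH_NONE,          /* request rejected before any dispatch */
    PATH_FOP_READ,      /* file->f_op->read() */
    PATH_DO_SYNC_READ,  /* do_sync_read() wrapping ->aio_read */
    PATH_NEW_SYNC_READ  /* new_sync_read() wrapping ->read_iter */
};

struct model_file {
    unsigned int f_mode;
    int has_read;       /* non-zero if ->read is set */
    int has_aio_read;   /* non-zero if ->aio_read is set */
};

/*
 * Mirrors the order of checks described above. Returns 0 and stores the
 * chosen path, or a negative errno value when the request is rejected.
 */
int model_vfs_read(const struct model_file *file, const void *buf,
                   enum model_path *path)
{
    *path = PATH_NONE;
    if (!(file->f_mode & MODEL_FMODE_READ))
        return -EBADF;          /* not opened for reading */
    if (buf == NULL)
        return -EFAULT;         /* stand-in for the user pointer check */

    if (file->has_read)
        *path = PATH_FOP_READ;
    else if (file->has_aio_read)
        *path = PATH_DO_SYNC_READ;
    else
        *path = PATH_NEW_SYNC_READ;
    return 0;
}
```

In this model a file whose operations table sets read always takes the first branch, regardless of what else is set, which is why the value stored in the read slot decides which of the functions described in this post actually runs.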

EXT3 file_operations structure (fs/ext3/file.c), as it looked up to the 3.15 kernel. In 3.16 the read entry points to new_sync_read and the aio_read entry is replaced by a read_iter entry pointing to generic_file_read_iter:
const struct file_operations ext3_file_operations = {
	.llseek = generic_file_llseek,
	.read = do_sync_read,
	.write = do_sync_write,
	.aio_read = generic_file_aio_read,
	.aio_write = generic_file_aio_write,
	.unlocked_ioctl = ext3_ioctl,
	.compat_ioctl = ext3_compat_ioctl,
	.mmap = generic_file_mmap,
	.open = dquot_file_open,
	.release = ext3_release_file,
	.fsync = ext3_sync_file,
	.splice_read = generic_file_splice_read,
	.splice_write = generic_file_splice_write,
};


ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct kiocb kiocb;
	ssize_t ret;

	init_sync_kiocb(&kiocb, filp);
	kiocb.ki_pos = *ppos;
	kiocb.ki_nbytes = len;

	/* for ext3 (up to 3.15) this calls generic_file_aio_read */
	ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
	if (-EIOCBQUEUED == ret)
		ret = wait_on_sync_kiocb(&kiocb);
	*ppos = kiocb.ki_pos;
	return ret;
}
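The first line of do_sync_read wraps the caller's flat buffer in a one-element iovec so that the vectored read path can be reused for a plain read. The same idea is visible from user space with readv(). The sketch below is only an analogue of that step, and the helper name read_via_single_iovec is made up for this example.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * User-space analogue (not kernel code): describe one flat buffer as a
 * single-element iovec array and read into it with readv().
 */
ssize_t read_via_single_iovec(int fd, void *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    return readv(fd, &iov, 1);
}
```

For a single segment this behaves like read(fd, buf, len): the return value is the number of bytes placed in the buffer, or -1 with errno set.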

new_sync_read() : This function takes the file object pointer, userspace buffer, the size of data to read and the position in the file to read from as its input parameters and returns the number of bytes read. It does the following:
  1. Initializes an iovec structure object. This structure stores the userspace buffer to read into and the length of data to read.
  2. Initializes a kernel io control block (kiocb) object and stores in it the position to read from as well as the length of data to read.
  3. Initializes an iov_iter structure object. This structure stores the type of io operation (read/write), a pointer to the iovec object initialized in step 1 and the length of data to read.
  4. Calls the read_iter file operation for the file, passing the kiocb and iov_iter object pointers. For EXT3, this operation is performed by generic_file_read_iter(), which does the following:
    • If the file is opened in O_DIRECT mode (which means that the io is performed directly, without using the page cache), the following steps are performed:
      • Call the filemap_write_and_wait_range() function. This function first checks whether any pages are allocated for the inode in the page cache. If the inode has pages in the page cache (e.g. because some other process has opened the same file without the O_DIRECT flag), it creates a writeback_control structure object with sync_mode set to WB_SYNC_ALL, the start field set to the start position of the read and the end field set to (start + length of data to read), and calls the writepages operation of the address space mapping for this inode (the address space mapping is a per-inode structure that contains the page cache information for the inode along with an operation table). The generic_writepages() function traverses the dirty pages in the given inode's address space and writes them to the file system device.
      • Once the dirty pages in the range have been written to the file system device, call the direct_IO address space operation for the inode. For EXT3, this is the ext3_direct_IO function. The same function is used for both read and write operations. This function is explained later.
    • If the io is not direct, call the do_generic_file_read() function. This function is explained later.
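To make the role of the iov_iter in step 3 more concrete, here is a simplified user-space model of an iterator over an iovec array. It is a sketch only: struct model_iter, model_iter_init and model_copy_to_iter are hypothetical names, and the kernel's real structure and helpers carry more state than shown here. The point is that whoever produces the data can copy into "the iterator" without caring how many user buffers sit behind it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical, simplified stand-in for the kernel's struct iov_iter. */
struct model_iter {
    const struct iovec *iov;  /* current segment */
    size_t nr_segs;           /* segments left, including the current one */
    size_t iov_offset;        /* bytes already filled in the current segment */
    size_t count;             /* total bytes still wanted */
};

void model_iter_init(struct model_iter *it, const struct iovec *iov,
                     size_t nr_segs, size_t count)
{
    it->iov = iov;
    it->nr_segs = nr_segs;
    it->iov_offset = 0;
    it->count = count;
}

/*
 * Copy up to `bytes` from `src` into the buffers described by the iterator,
 * advancing it as data lands. Returns the number of bytes copied.
 */
size_t model_copy_to_iter(const void *src, size_t bytes, struct model_iter *it)
{
    const char *from = src;
    size_t copied = 0;

    if (bytes > it->count)
        bytes = it->count;      /* never hand out more than was asked for */

    while (copied < bytes && it->nr_segs > 0) {
        size_t room = it->iov->iov_len - it->iov_offset;
        size_t chunk = bytes - copied < room ? bytes - copied : room;

        memcpy((char *)it->iov->iov_base + it->iov_offset,
               from + copied, chunk);
        copied += chunk;
        it->iov_offset += chunk;
        if (it->iov_offset == it->iov->iov_len) {   /* segment is full */
            it->iov++;
            it->nr_segs--;
            it->iov_offset = 0;
        }
    }
    it->count -= copied;
    return copied;
}
```

Because the iterator remembers how far it has advanced, a read that is satisfied in several pieces (for example one page at a time) can call the copy helper repeatedly and the data still ends up in the right place in the user buffers.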
do_generic_file_read : This function uses the page cache to perform the read operation. The tasks performed in this function are:
  1. Get the page index corresponding to the required offset (the start offset the first time) and try to find the page for that index in the radix tree of the inode. If the page is found in the page cache and its readahead flag is set, a readahead algorithm fetches a few more pages of the file. If the page is uptodate (the contents of the page are valid), the contents of the page are copied to the user space buffer. This process repeats for all the required pages until all the requested data has been copied to the user space buffer.
    • If the page is found but is not uptodate, its contents need to be refreshed from the data present on the device. To do this, the readpage address space operation of the inode mapping is called with the file pointer and the page into which the data needs to be copied. For ext3, this is handled by the ext3_readpage() function, which calls mpage_readpage() and, through it, do_mpage_readpage(). The do_mpage_readpage function creates a bio structure to get the data from the disk blocks, and the bio is then submitted for io. This function is explained later. When the bio has been submitted, control returns to do_generic_file_read(), which calls lock_page_killable() and sleeps until the page is unlocked. Once the device completes the io, mpage_end_io() is called as the completion handler (this function is executed in interrupt context). For a read, it marks the pages covered by the bio as Uptodate and calls unlock_page(), which wakes up the task that was waiting on the page. The contents of the page are then copied to the user space buffer and the process of reading data from the remaining pages continues from step 1.
  2. If the page is not found in the radix tree, it is a cache miss. In that case, a new page is allocated from the kernel page pool and is added to the least recently used list and to the radix tree of the inode. The readpage address space operation is then called to read the page content, as explained in the sub-step of step 1 above.
Once all the required data has been read from the page cache, control returns from the function.
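The loop described above can be modelled in user space with an array standing in for the radix tree. This is a rough sketch with made-up names (struct model_cache, model_readpage, model_cached_read), an assumed 4 KiB page size and no readahead, locking or LRU handling. It only shows the index and offset arithmetic and the difference between a page that is already uptodate and one that has to be read in first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SHIFT 12                       /* 4 KiB pages, assumed */
#define MODEL_PAGE_SIZE  ((size_t)1 << MODEL_PAGE_SHIFT)
#define MODEL_NR_PAGES   4                        /* tiny cache for the demo */

struct model_page {
    int present;                          /* page is in the "radix tree" */
    int uptodate;                         /* contents are valid */
    unsigned char data[MODEL_PAGE_SIZE];
};

struct model_cache {
    struct model_page pages[MODEL_NR_PAGES];
    const unsigned char *disk;            /* backing store */
    size_t disk_size;
    unsigned int readpage_calls;          /* how often we went to "disk" */
};

/* Stand-in for ->readpage: fill one page from the backing store. */
static void model_readpage(struct model_cache *c, size_t index)
{
    size_t start = index << MODEL_PAGE_SHIFT;
    size_t n = c->disk_size - start;

    if (n > MODEL_PAGE_SIZE)
        n = MODEL_PAGE_SIZE;
    memset(c->pages[index].data, 0, MODEL_PAGE_SIZE);
    memcpy(c->pages[index].data, c->disk + start, n);
    c->pages[index].uptodate = 1;
    c->readpage_calls++;
}

/* Copy up to `len` bytes starting at `pos` into `dst`, one page at a time. */
size_t model_cached_read(struct model_cache *c, size_t pos,
                         unsigned char *dst, size_t len)
{
    size_t done = 0;

    if (pos >= c->disk_size)
        return 0;
    if (len > c->disk_size - pos)
        len = c->disk_size - pos;         /* do not read past end of file */

    while (done < len) {
        size_t index = (pos + done) >> MODEL_PAGE_SHIFT;
        size_t offset = (pos + done) & (MODEL_PAGE_SIZE - 1);
        size_t chunk = MODEL_PAGE_SIZE - offset;

        assert(index < MODEL_NR_PAGES);
        if (chunk > len - done)
            chunk = len - done;
        if (!c->pages[index].present)     /* cache miss: add a new page */
            c->pages[index].present = 1;
        if (!c->pages[index].uptodate)    /* new or stale: read it in */
            model_readpage(c, index);
        memcpy(dst + done, c->pages[index].data + offset, chunk);
        done += chunk;
    }
    return done;
}
```

A 10-byte read at offset 4090 crosses a page boundary, so the first call goes to the backing store twice (pages 0 and 1). Repeating the same read is served entirely from the cached pages and does not touch the backing store again.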

static const struct address_space_operations ext3_ordered_aops = {
	.readpage = ext3_readpage,
	.readpages = ext3_readpages,
	.writepage = ext3_ordered_writepage,
	.write_begin = ext3_write_begin,
	.write_end = ext3_ordered_write_end,
	.bmap = ext3_bmap,
	.invalidatepage = ext3_invalidatepage,
	.releasepage = ext3_releasepage,
	.direct_IO = ext3_direct_IO,
	.migratepage = buffer_migrate_page,
	.is_partially_uptodate = block_is_partially_uptodate,
	.is_dirty_writeback = buffer_check_dirty_writeback,
	.error_remove_page = generic_error_remove_page,
};


static int ext3_readpage(struct file *file, struct page *page)
{
	return mpage_readpage(page, ext3_get_block);
}


int mpage_readpage(struct page *page, get_block_t get_block)
{
	struct bio *bio = NULL;
	sector_t last_block_in_bio = 0;
	struct buffer_head map_bh;
	unsigned long first_logical_block = 0;

	map_bh.b_state = 0;
	map_bh.b_size = 0;
	bio = do_mpage_readpage(bio, page, 1, &last_block_in_bio,
			&map_bh, &first_logical_block, get_block);
	if (bio)
		mpage_bio_submit(READ, bio);
	return 0;
}


do_mpage_readpage : This function does all the work of mapping the disk blocks and constructs the largest possible bio.
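One way to picture "the largest possible bio" is that blocks which sit next to each other on disk can travel in a single request, while a gap forces a new one. The sketch below is a hypothetical model of that idea only: model_count_requests is a made-up name, and the real function also has to deal with holes, page boundaries and bio size limits, none of which are modelled here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: given the on-disk block number backing each
 * consecutive file block, count how many separate requests are needed if
 * physically contiguous blocks are merged into one request.
 */
size_t model_count_requests(const unsigned long *disk_blocks, size_t n)
{
    size_t requests = 0;

    for (size_t i = 0; i < n; i++) {
        /* Start a new request unless this block directly follows the last. */
        if (i == 0 || disk_blocks[i] != disk_blocks[i - 1] + 1)
            requests++;
    }
    return requests;
}
```

With disk blocks 100, 101, 102, 103 the model needs one request. With 100, 101, 200, 201 it needs two, because the jump from 101 to 200 breaks the contiguous run.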

ext3_direct_IO : This function performs direct IO for the ext3 file system. The parameters passed to this function are a flag indicating whether it is a read or a write IO, the kernel io control block structure object pointer, the iov_iter structure object pointer and the offset in the file at which the io operation needs to be performed. The function is used for both read and write operations, but only the read path is traced here. To perform the read, the function in turn calls do_blockdev_direct_IO(), a generic function used by file systems to perform direct IO. do_blockdev_direct_IO does the following:
  1. Allocates memory for the dio structure object. The dio structure records whether the io is a read or a write, the inode to perform the io on, the size of the io, the completion function pointer, the bio list and the kiocb structure pointer.
  2. In case of a read, calls filemap_write_and_wait_range(). This function writes back all the dirty pages in the range of the read operation and waits for the writeback to finish.
  3. In case of asynchronous io that requires deferred completion (for example an O_DSYNC write), sets up a work queue for deferred direct io completion.
  4. Increases the direct io count for the inode.
  5. Updates the dio_submit structure object: sets blkbits from the file system block size, the first and last block of the file to read, the size of the inode, the function pointer used to get block information, etc.
  6. Initialize the block plug. For more information about block device plugging refer to
  7. Call the do_direct_IO function. For each block of the file system that needs to be read, the function does the following:
    • Calls dio_get_page(). dio_get_page pins the userspace pages (64 at a time) in memory.
    • Maps the required file system blocks. This is performed by the get_more_blocks function. It calculates the first and the last block required and calls the ext3_get_block function to get the disk mapping information of the required blocks. The blocks_available field of the sdio structure object is updated with the number of blocks that have been mapped.
    • Calls the submit_page_section() function. This queues the current page for IO. A new bio is allocated, the current user space page into which the data needs to be read is added to the bio, and submit_bio is called to perform the io operation. On io completion, the dio_bio_end_io() handler function is called. This function adds the completed bio to the list of bios and wakes up the process waiting for the bios once all of them have been received.
    • The steps are repeated until all the required data has been read.
  8. In case of asynchronous io, the function returns -EIOCBQUEUED, which indicates that the io has been queued. Otherwise, for synchronous io, it calls the dio_await_completion() function, which waits for the io to be completed. It calls the dio_await_one() function, which schedules out the current process until the io has completed. When the io completes and dio_bio_end_io() wakes the process up, the completed bios are available on the list.
  9. Return from the function.
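The block bookkeeping in step 5 boils down to shifting byte offsets by blkbits. The sketch below shows that arithmetic in isolation. The names model_block_range and model_map_request are invented for this example, and the choice to reject unaligned requests with -EINVAL is an assumption of the model that reflects the alignment requirements of direct IO; the kernel's own checks are more involved.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical holder for the block range of one direct IO request. */
struct model_block_range {
    uint64_t first_block;   /* first file block touched */
    uint64_t end_block;     /* one past the last file block touched */
};

/*
 * Convert a byte offset and length into file system blocks, where the
 * block size is 1 << blkbits. Requests that are not block aligned are
 * rejected with -EINVAL in this model.
 */
int model_map_request(uint64_t offset, uint64_t len, unsigned int blkbits,
                      struct model_block_range *out)
{
    uint64_t mask = ((uint64_t)1 << blkbits) - 1;

    if ((offset | len) & mask)
        return -EINVAL;             /* offset or length not block aligned */

    out->first_block = offset >> blkbits;
    out->end_block = (offset + len) >> blkbits;
    return 0;
}
```

With 4 KiB blocks (blkbits = 12), a 12288-byte read at offset 8192 covers file blocks 2, 3 and 4, so the model reports first_block = 2 and end_block = 5.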
Note about bio : 
struct bio in Linux contains a table, bi_io_vec, in which each element contains a pointer to a page (bv_page), the length of the data (bv_len) and the offset inside the page (bv_offset). The field bi_vcnt shows how many elements of that type are in the vector, while the current index is kept in bi_idx (since the 3.14 kernel this index lives in the embedded iterator, as bi_iter.bi_idx).
bi_io_vec represents the vectored IO concept, where scattered buffers are submitted as a single scatter-gather IO using DMA.
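As a small illustration of the three per-element fields, the following user-space sketch scatters a contiguous run of "disk" data into several page segments, which is roughly what happens to the data of a read request. struct model_bio_vec and model_scatter are hypothetical names, and plain byte buffers stand in for struct page.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space mirror of the three bio_vec fields named above. */
struct model_bio_vec {
    unsigned char *bv_page;   /* start of the page (a plain buffer here) */
    unsigned int bv_len;      /* bytes of data in this segment */
    unsigned int bv_offset;   /* where the data starts inside the page */
};

/*
 * Scatter `src_len` contiguous bytes into the segments of the vector, in
 * order. Returns the number of bytes placed.
 */
size_t model_scatter(const unsigned char *src, size_t src_len,
                     const struct model_bio_vec *vec, unsigned int vcnt)
{
    size_t placed = 0;

    for (unsigned int idx = 0; idx < vcnt && placed < src_len; idx++) {
        size_t chunk = vec[idx].bv_len;

        if (chunk > src_len - placed)
            chunk = src_len - placed;      /* source ran out mid-segment */
        memcpy(vec[idx].bv_page + vec[idx].bv_offset, src + placed, chunk);
        placed += chunk;
    }
    return placed;
}
```

Here vcnt plays the role of bi_vcnt and the loop variable idx plays the role of the current index.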
