Linux posts: January 2015

As discussed in my previous post, I would like to continue the read system call flow from submit_bio() in this post.

The submit_bio() function calls the generic_make_request() function to submit the bio to the block device layer for I/O.

generic_make_request(): This transfers the bio to the corresponding device driver. The function receives a pointer to the bio structure, which was passed to submit_bio() from the filesystem layer, and uses it to make the I/O request to the block device. Its return type is void because the result of the I/O is reported through the bi_end_io function pointer, whose address is stored in the bio structure.
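The completion-by-callback idea can be sketched in user space. This is only a toy model: the toy_ types and functions are hypothetical stand-ins, and the "device" completes the request immediately, whereas in the kernel the callback fires later, typically from interrupt context.

```c
#include <assert.h>
#include <stddef.h>

struct toy_bio;
typedef void (toy_end_io_t)(struct toy_bio *bio, int error);

/* Hypothetical, heavily reduced stand-in for struct bio. */
struct toy_bio {
    toy_end_io_t *bi_end_io;   /* completion callback */
    void *bi_private;          /* context for the callback */
};

/* What the submitter waits on. */
struct toy_waiter {
    int done;
    int error;
};

/* Completion handler: records the result for whoever submitted the bio. */
static void toy_read_end_io(struct toy_bio *bio, int error)
{
    struct toy_waiter *w = bio->bi_private;

    w->error = error;
    w->done = 1;
}

/*
 * Returns void, like generic_make_request(). The outcome is only visible
 * through the callback. Here the "I/O" succeeds on the spot.
 */
static void toy_make_request(struct toy_bio *bio)
{
    assert(bio != NULL && bio->bi_end_io != NULL);
    bio->bi_end_io(bio, 0);
}
```

A caller would fill in bi_end_io and bi_private, call toy_make_request(), and then look at its toy_waiter rather than at a return value.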

generic_make_request() does the following:

1. Perform some checks on this bio.
2. If current already maintains a list of bios, add the bio to that list and return. (This is a bit confusing: why would there be a queue of bios? Shouldn't the bio be submitted to the SCSI layer before current selects the next bio?)
3. If this is the first bio for the current process, create a list of bios.
4. Do the following for all the bios in the list:
a) Get the request queue of the block device.
b) Call the make_request_fn of the request queue for this bio. make_request_fn is a function pointer which in turn calls the function blk_queue_bio(). The pointer is set in the request_queue structure by the function blk_init_allocated_queue(). Need to look into the initialization step for this function.
c) Take the next bio off the list.
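On the question raised in the list above: as far as I understand it, the per-task list exists because a make_request_fn of a stacked driver (md or dm, for example) may itself call generic_make_request() for a lower device. Queueing those nested submissions turns recursion into iteration. The following user-space sketch models that idea. All toy_ names are hypothetical and the "driver" simply resubmits a child bio if one is attached.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio. */
struct toy_bio {
    int id;
    struct toy_bio *next;    /* link in the pending list */
    struct toy_bio *child;   /* bio a stacked driver would resubmit, or NULL */
};

struct toy_bio_list {
    struct toy_bio *head, *tail;
};

static struct toy_bio_list *current_bio_list; /* models current->bio_list */
static int order[16];                         /* ids in handling order */
static int handled;

static void toy_bio_list_add(struct toy_bio_list *bl, struct toy_bio *bio)
{
    bio->next = NULL;
    if (bl->tail)
        bl->tail->next = bio;
    else
        bl->head = bio;
    bl->tail = bio;
}

static struct toy_bio *toy_bio_list_pop(struct toy_bio_list *bl)
{
    struct toy_bio *bio = bl->head;

    if (bio) {
        bl->head = bio->next;
        if (!bl->head)
            bl->tail = NULL;
        bio->next = NULL;
    }
    return bio;
}

static void toy_generic_make_request(struct toy_bio *bio);

/* Models a stacked driver's make_request_fn: it may submit another bio. */
static void toy_make_request_fn(struct toy_bio *bio)
{
    assert(handled < 16);
    order[handled++] = bio->id;
    if (bio->child)
        toy_generic_make_request(bio->child); /* gets queued, no recursion */
}

static void toy_generic_make_request(struct toy_bio *bio)
{
    struct toy_bio_list on_stack = { NULL, NULL };

    if (current_bio_list) {
        /* A submission is already active in this task: queue and return. */
        toy_bio_list_add(current_bio_list, bio);
        return;
    }

    current_bio_list = &on_stack;
    do {
        toy_make_request_fn(bio);
        bio = toy_bio_list_pop(current_bio_list);
    } while (bio);
    current_bio_list = NULL;
}
```

However deep the stack of devices, toy_make_request_fn() is never more than one level below toy_generic_make_request(), which is the property that keeps kernel stack usage bounded.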

blk_queue_bio(request_queue, bio) -- The make request function

1. Call blk_queue_bounce(). Need to investigate what this does.
2. Check for bio integrity. If it is enabled, call the bio_integrity_prep() function. Need to investigate this further.
3. If this is a flush request (the REQ_FLUSH flag is set) or the REQ_FUA flag is set, the insertion position for the request is ELEVATOR_INSERT_FLUSH and the merge steps below are skipped.
4. If merging is not disabled on the queue (QUEUE_FLAG_NOMERGES is not set), call the blk_attempt_plug_merge() function. This function tries to merge the bio into a request on the current task's plug list, which is part of block device plugging. Read the doc (http://lwn.net/Articles/438256/) for further information.
5. Call the elv_merge() function. This function decides whether and where to merge this bio with the requests waiting to be dispatched.
a) First check whether merging is allowed. If it is not (the QUEUE_FLAG_NOMERGES flag is set on the request_queue), return.
b) Try the next level of merging. Perform some checks (in the function elv_rq_merge_ok()) to verify that the bio can be merged with the request. If the tests pass, call the function blk_try_merge(). This decides where to merge this bio: at the front, at the back, or not at all. The decision is based on the sectors the request currently covers and the current bio's sector.
c) If the bio cannot be merged with the request, then return with the address of the request structure object pointer assigned to request_queue (I guess this will be the case for the first bio).
d) If the bio can be merged, call the elevator_merge_fn. The function can be cfq_merge() or deadline_merge() based on the io scheduler used. The function returns whether to merge the bio at the end or at the front of the request.
6. Based on the value returned from the elv_merge() function, decide where to merge the bio.
a) If the bio needs to be merged at the back, call the bio_attempt_back_merge() function. This first calls ll_back_merge_fn(), which checks whether the merged request would go out of bounds (need to check more on this function). After this, add the bio to the tail of the request's list of bios and update the data_len and priority of the request. On a successful merge of the bio into the request, call the elv_bio_merged() function, which in turn calls the io scheduler specific elevator_bio_merged_fn function. This is only defined for cfq, where it updates statistics.
b) When the bio merge has been performed, call the attempt_back_merge() function. This function checks whether a following request is present in the rb tree (with the help of the elv_latter_request() function). If such a request exists, call the attempt_merge() function. This tries to merge the existing request with the request being worked on. To do this, some checks are performed and the bios in the request obtained from the tree are merged into this request.
After all this, the job is finished. Release the lock and leave the blk_queue_bio() function.
7. If the bio could not be merged, get a free request from the queue for the bio using the get_request() function.
a) The function first gets the request list for the queue. If the list is not there in the request_queue, create one using the function blkg_lookup_create() (the list is per block cgroup if that is enabled).
8. Call init_request_from_bio(). This function adds the bio to the new request once the bio has been found to be non-mergeable in the steps above. This includes setting the various fields of the request structure and setting the bio and biotail fields of the request to the current bio.
9. If a plug list is maintained for the current task, the request is added to the plug list and is dispatched later, when the task unplugs.
10. If no plug list is maintained for the current task, take the queue lock and add the request to the request_queue using the add_acct_request() function. Code like list_add(&rq->queuelist, &q->queue_head); adds the request (rq) to the request_queue (q). Then call the __blk_run_queue() function. This function calls the request_fn pointer. For SCSI, it is scsi_request_fn.
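The front/back decision made by blk_try_merge() boils down to comparing sector ranges. A minimal sketch of that comparison follows, assuming a request and a bio are each described only by a start sector and a sector count. The toy_ types are hypothetical, and the real function is reached only after the elv_rq_merge_ok() checks mentioned above.

```c
#include <assert.h>

typedef unsigned long long toy_sector_t;

enum toy_merge { TOY_NO_MERGE, TOY_FRONT_MERGE, TOY_BACK_MERGE };

/* Hypothetical reduced views of a request and a bio. */
struct toy_request {
    toy_sector_t pos;       /* first sector covered by the request */
    unsigned int sectors;   /* number of sectors covered */
};

struct toy_bio_range {
    toy_sector_t sector;    /* first sector of the bio */
    unsigned int sectors;   /* number of sectors in the bio */
};

/* Decide whether the bio is adjacent to the request, and on which side. */
static enum toy_merge toy_try_merge(const struct toy_request *rq,
                                    const struct toy_bio_range *bio)
{
    assert(rq != 0 && bio != 0);
    if (rq->pos + rq->sectors == bio->sector)
        return TOY_BACK_MERGE;   /* bio starts where the request ends */
    if (bio->sector + bio->sectors == rq->pos)
        return TOY_FRONT_MERGE;  /* bio ends where the request starts */
    return TOY_NO_MERGE;
}

/* Grow the request to cover the bio as well. */
static void toy_apply_merge(struct toy_request *rq,
                            const struct toy_bio_range *bio,
                            enum toy_merge how)
{
    if (how == TOY_NO_MERGE)
        return;
    if (how == TOY_FRONT_MERGE)
        rq->pos = bio->sector;
    rq->sectors += bio->sectors;
}
```

For example, a request covering sectors 100 to 107 back-merges with a bio starting at sector 108 and front-merges with an 8-sector bio starting at sector 92.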

Will discuss the scsi_request_fn and the HBA interactions in the next blog. Stay tuned !!!

This post traces the functions called by the read system call. The read system call reads the file synchronously. The EXT3 filesystem has been used as an example. The error condition handling has not been traced in this flow.

The user can perform a read operation from user space via the read system call. The system call generates a trap and control switches from user space to kernel space. In kernel space, the call is handled by the read SYSCALL_DEFINE. The parameters passed to the function are fd (the file descriptor, from which the kernel looks up the file object, its flags and the position to read from), the userspace buffer to copy the data into and the size of data to read.

fs/read_write.c :

SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    struct fd f = fdget_pos(fd);
    ssize_t ret = -EBADF;

    if (f.file) {
        loff_t pos = file_pos_read(f.file);
        ret = vfs_read(f.file, buf, count, &pos);
        if (ret >= 0)
            file_pos_write(f.file, pos);
        fdput_pos(f);
    }
    return ret;
}

The function does the following:

1. Gets the file object for the file descriptor and the current position to read from (the f_pos field of the file object holds this position).
2. Calls the vfs_read() function, passing the file object pointer, userspace buffer, size of data to read and the position to read from the file.
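The effect of file_pos_read() and file_pos_write() is visible from user space: after a successful read, the file offset has advanced by the number of bytes returned. A small sketch using only POSIX calls follows. The helper name offset_after_read and the scratch file under /tmp are choices made for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Writes `len` bytes to a scratch file, rewinds, reads up to `want` bytes
 * with read(2) and returns the file offset afterwards (-1 on any failure).
 */
static off_t offset_after_read(const char *data, size_t len, size_t want)
{
    char path[] = "/tmp/read_trace_XXXXXX";
    char buf[64];
    off_t pos = -1;
    int fd;

    assert(want <= sizeof(buf));
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);   /* the file goes away once fd is closed */

    if (write(fd, data, len) == (ssize_t)len &&
        lseek(fd, 0, SEEK_SET) == 0) {
        ssize_t n = read(fd, buf, want);

        if (n >= 0)
            pos = lseek(fd, 0, SEEK_CUR);   /* offset the kernel stored */
    }
    close(fd);
    return pos;
}
```

Reading 5 bytes from an 11-byte file leaves the offset at 5. Asking for more bytes than the file holds leaves it at the file size, because read() returns only what was available.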

vfs_read() : This function gets the file object pointer, userspace buffer, the size of data to read and the position from the file to read as its input parameters and returns the number of bytes read as its return value. It does the following stuff:

1. Verify that the file has been opened in read mode and that the user space pointer is valid.
2. Call the function rw_verify_area(). This function essentially checks that the position to read from is valid, whether any locks are set on the portion of the file to read, and that the amount of data to read is within the allowed limit.
3. If the read file operation in the file object structure is set for the filesystem, call that function. In case of ext3, this is performed by the function new_sync_read(). The way this function works is detailed later.
4. Else, if the aio_read file operation is set, the do_sync_read() function is called. This function initializes a kernel io control block (kiocb) object (the file object pointer and the current task_struct are stored in it). The position to read from and the number of bytes to read are also stored in the kiocb object, and then the file system's asynchronous read function is called. In case of EXT3, up to the 3.15 kernel, this was handled by generic_file_aio_read(). If the io gets queued, the current process is scheduled out until the io operation completes.
5. Else the new_sync_read() function is called by default. The way this function works is detailed later.

If the read operation was successful, the fsnotify_access() function is called. Unless the file was opened with the no-notify flag, the parent is notified about the file being accessed. Control then returns from the vfs_read() function.
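The order in which vfs_read() picks a read path, as described in the steps above, can be modelled with a reduced operations table. This is a sketch only: the toy_ structures are hypothetical, the three callbacks share one simplified signature, and the return codes merely name the chosen path instead of performing a read.

```c
#include <assert.h>
#include <stddef.h>

struct toy_file;
typedef long (*toy_read_fn)(struct toy_file *f, char *buf, size_t count);

/* Hypothetical, reduced version of struct file_operations. */
struct toy_file_operations {
    toy_read_fn read;
    toy_read_fn aio_read;
    toy_read_fn read_iter;
};

struct toy_file {
    const struct toy_file_operations *f_op;
    int can_read;   /* models the "opened for reading" check */
};

enum toy_path {
    TOY_REJECTED = 0,        /* not opened for reading, or no method */
    TOY_VIA_READ,            /* f_op->read is called directly */
    TOY_VIA_DO_SYNC_READ,    /* do_sync_read() wraps f_op->aio_read */
    TOY_VIA_NEW_SYNC_READ    /* new_sync_read() wraps f_op->read_iter */
};

/* Placeholder callback so the tables below have something to point at. */
static long toy_stub_read(struct toy_file *f, char *buf, size_t count)
{
    (void)f;
    (void)buf;
    return (long)count;
}

/* Mirrors the if / else if / else chain described in the text. */
static enum toy_path toy_vfs_read_path(const struct toy_file *file)
{
    assert(file != NULL && file->f_op != NULL);
    if (!file->can_read)
        return TOY_REJECTED;
    if (file->f_op->read)
        return TOY_VIA_READ;
    if (file->f_op->aio_read)
        return TOY_VIA_DO_SYNC_READ;
    if (file->f_op->read_iter)
        return TOY_VIA_NEW_SYNC_READ;
    return TOY_REJECTED;
}
```

A table that sets read wins even if aio_read is also set, which is why a filesystem that wants the generic wrappers points read at the wrapper itself.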

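EXT3 file_operations structure (fs/ext3/file.c). Note that this listing shows the older do_sync_read/aio_read form, i.e. the path used up to the 3.15 kernel as mentioned above: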

const struct file_operations ext3_file_operations = {
    .llseek = generic_file_llseek,
    .read = do_sync_read,
    .write = do_sync_write,
    .aio_read = generic_file_aio_read,
    .aio_write = generic_file_aio_write,
    .unlocked_ioctl = ext3_ioctl,
#ifdef CONFIG_COMPAT
    .compat_ioctl = ext3_compat_ioctl,
#endif
    .mmap = generic_file_mmap,
    .open = dquot_file_open,
    .release = ext3_release_file,
    .fsync = ext3_sync_file,
    .splice_read = generic_file_splice_read,
    .splice_write = generic_file_splice_write,
};

ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct kiocb kiocb;
    ssize_t ret;

    init_sync_kiocb(&kiocb, filp);
    kiocb.ki_pos = *ppos;
    kiocb.ki_nbytes = len;

    /* for ext3 this is generic_file_aio_read */
    ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
    if (-EIOCBQUEUED == ret)
        ret = wait_on_sync_kiocb(&kiocb);
    *ppos = kiocb.ki_pos;
    return ret;
}
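do_sync_read() wraps the single user buffer in a one-element iovec before handing it to the filesystem. The same structure is what user space passes to readv(2), so a short user-space sketch may help to show what an iovec array describes. The helper name scatter_read and the use of a pipe as the data source are choices made for this example only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Pushes `msg` through a pipe and reads it back with one readv() call that
 * fills `first` and then `second`. Returns the byte count, or -1 on failure.
 */
static ssize_t scatter_read(const char *msg, char *first, size_t first_len,
                            char *second, size_t second_len)
{
    struct iovec iov[2] = {
        { .iov_base = first,  .iov_len = first_len },
        { .iov_base = second, .iov_len = second_len },
    };
    size_t len = strlen(msg);
    ssize_t n = -1;
    int fds[2];

    assert(len <= 512);   /* small enough that the pipe write cannot block */
    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], msg, len) == (ssize_t)len)
        n = readv(fds[0], iov, 2);   /* buffers are filled in array order */
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

With a 5-byte first buffer, reading "helloworld" puts "hello" in the first buffer and "world" at the start of the second.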

new_sync_read() : The function gets the file object pointer, userspace buffer, the size of data to read and the position from the file to read as its input parameters and returns the number of bytes read as its return value. It does the following stuff:

1. The iovec structure object is initialized. This structure stores the userspace buffer to read into and the length of data to read.
2. It initializes a kernel io control block (kiocb) object and stores the position to read from as well as the length of data to read in the kiocb object.
3. The iov_iter structure object is initialized. This structure stores the type of io operation (read/write), the pointer to the iovec structure object initialized in step 1 above and the length of data to read.
4. The read_iter file operation function for the file is called, passing the kiocb and iov_iter object pointers. In case of EXT3, this operation is performed by the function generic_file_read_iter(). The function does the following operations:

a) If the file is opened in O_DIRECT mode (which means that the io will be performed directly without using the page cache), the following steps are performed:

Call the filemap_write_and_wait_range() function. This function first checks whether there are pages allocated for the inode in the page cache. If the inode has some pages in the page cache (e.g. if some other process has opened the same file without the O_DIRECT flag), create a writeback_control structure object with sync_mode set to WB_SYNC_ALL, the start field set to the start position of the read and the end field set to (start + data to read), and call the writepages operation function of the address space mapping specific to this inode (the address space mapping is a structure for each inode and contains the page cache information for the inode along with an operation table). The generic_writepages() function traverses the dirty pages of the given inode's address space and writes them to the file system device.

Once all the pages of the inode have been written to the filesystem device, call the direct_IO address space operation function for the inode. In case of EXT3, the ext3_direct_IO function will be called. The same function is used for both read and write operations. This function is explained later.

b) If the io is not direct, call the do_generic_file_read() function. This function is explained later.

do_generic_file_read : This uses the page cache to perform the read operation. The tasks performed in this function are:

1. Get the page index corresponding to the required offset (the start offset the first time) and try to find the page for that index in the radix tree of the inode. If the page is found in the page cache and its readahead flag is set, the readahead algorithm fetches a few more pages of the file. If the page is uptodate (the contents of the page are valid), the contents of the page are copied to the user space buffer. This process keeps repeating for all the required pages until all the required data has been copied to the user space buffer.

2. If the page is found but is not uptodate, its contents need to be refreshed from the data present on the device. To do this, call the readpage address space operation function from the inode mapping, passing it the file pointer and the page into which the data needs to be copied. In case of ext3, this is handled by the ext3_readpage() function, which in turn (through mpage_readpage()) calls the do_mpage_readpage() function. The do_mpage_readpage function essentially creates a bio structure to get the data from the disk blocks and submits it for io. This function is explained in detail later. When the bio has been submitted, control returns to the do_generic_file_read() function and then the lock_page_killable() function is called, which waits for the page lock (the page stays locked until the io completes). Once the io is completed by the device, mpage_end_io() is called as the completion handler (this function is executed in interrupt context). In case of a read, this marks the pages for which the bio was sent as Uptodate and calls the unlock_page() function. This wakes up the process that was waiting for the page to become uptodate. The contents of the page are then copied to the user space buffer and the process of reading data from the remaining pages continues from step 1.

3. If the page is not found in the radix tree, it is a cache miss. In that case, a new page is allocated from the kernel page pool and is added to the least recently used list and to the radix tree of the inode. The readpage address space operation function is called next to read the page content, as explained in step 2 above.

4. Once all the required data has been read from the page cache, control returns from the function.
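The first step in the list above, turning a file offset into a page index plus an offset inside the page, is plain arithmetic. The sketch below splits a read of count bytes at position pos into per-page chunks, assuming 4 KiB pages. The toy_ names are hypothetical and no page cache is involved, only the index calculation.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12                        /* assumes 4 KiB pages */
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

struct toy_chunk {
    unsigned long index;    /* page index within the file */
    unsigned long offset;   /* byte offset inside that page */
    unsigned long nr;       /* bytes to copy from that page */
};

/*
 * Splits the byte range [pos, pos + count) into per-page chunks.
 * Returns the number of chunks written to `out` (at most `max`).
 */
static size_t toy_split_read(unsigned long long pos, size_t count,
                             struct toy_chunk *out, size_t max)
{
    size_t n = 0;

    assert(out != NULL);
    while (count > 0 && n < max) {
        unsigned long offset = (unsigned long)(pos & (TOY_PAGE_SIZE - 1));
        unsigned long nr = TOY_PAGE_SIZE - offset;   /* room left in page */

        if (nr > count)
            nr = (unsigned long)count;
        out[n].index = (unsigned long)(pos >> TOY_PAGE_SHIFT);
        out[n].offset = offset;
        out[n].nr = nr;
        n++;
        pos += nr;
        count -= nr;
    }
    return n;
}
```

A 5000-byte read starting at offset 4000 touches three pages: 96 bytes from the end of page 0, all 4096 bytes of page 1 and the first 808 bytes of page 2.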

fs/ext3/inode.c
static const struct address_space_operations ext3_ordered_aops = {
    .readpage = ext3_readpage,
    .readpages = ext3_readpages,
    .writepage = ext3_ordered_writepage,
    .write_begin = ext3_write_begin,
    .write_end = ext3_ordered_write_end,
    .bmap = ext3_bmap,
    .invalidatepage = ext3_invalidatepage,
    .releasepage = ext3_releasepage,
    .direct_IO = ext3_direct_IO,
    .migratepage = buffer_migrate_page,
    .is_partially_uptodate = block_is_partially_uptodate,
    .is_dirty_writeback = buffer_check_dirty_writeback,
    .error_remove_page = generic_error_remove_page,
};

static int ext3_readpage(struct file *file, struct page *page)
{
    trace_ext3_readpage(page);
    return mpage_readpage(page, ext3_get_block);
}

int mpage_readpage(struct page *page, get_block_t get_block)
{
    struct bio *bio = NULL;
    sector_t last_block_in_bio = 0;
    struct buffer_head map_bh;
    unsigned long first_logical_block = 0;

    map_bh.b_state = 0;
    map_bh.b_size = 0;
    bio = do_mpage_readpage(bio, page, 1, &last_block_in_bio,
                            &map_bh, &first_logical_block, get_block);
    if (bio)
        mpage_bio_submit(READ, bio);
    return 0;
}

EXPORT_SYMBOL(mpage_readpage);

do_mpage_readpage : This function does all the work of mapping the disk blocks and constructs the largest possible bio.

ext3_direct_IO : This function performs direct IO for the ext3 file system. The parameters passed to this function are the flag indicating whether it is a read or a write IO, the kernel io control block (kiocb) object pointer, the iov_iter structure object pointer and the offset in the file at which the io operation needs to be performed. This function is used for both read and write operations, but only the function trace for the read operation is explained here. To perform the read operation, the function in turn calls the do_blockdev_direct_IO() function, which is a generic function used by filesystems to perform direct IO. do_blockdev_direct_IO() does the following:

1. Allocate memory for the dio structure object. The dio structure records whether the io is a read or a write, the inode to perform io on, the size of the io, the completion function pointer, the bio list and the kiocb structure pointer.
2. In case of a read, it calls filemap_write_and_wait_range(). This function starts the writeback process for all the pages in the range of the read operation.
3. In case of asynchronous io, it creates a work queue for deferred direct io and returns.
4. Else, increase the direct io count for the inode.
5. Update the dio_submit structure object. Set blkbits from the file system block size, and set the first and last block to read from the file, the size of the inode, the function pointer used to get block information, etc.
6. Initialize the block plug. For more information about block device plugging refer to http://lwn.net/Articles/438256/
7. Call the do_direct_IO function. For each file system block that needs to be read, the function:

a) Calls dio_get_page(). dio_get_page() pins the userspace pages (64 at a time) in memory.
b) Maps the required file system blocks. This is performed by the get_more_blocks() function. It calculates the first and the last block required and calls the ext3_get_block function to get the disk mapping information of the required blocks. The blocks_available field of the sdio structure object is updated with the number of blocks that have been mapped.
c) Calls the submit_page_section() function. This queues the current page for IO. A new bio is allocated, the current user space page into which the data needs to be read is added to the bio, and submit_bio is called to perform the io operation. On io completion, the dio_bio_end_io() handler function is called. This function adds the completed bio to the list of bios and, when all the bios have been received, wakes up the process waiting for them.
d) The steps are repeated until all the required data has been read.

8. In case of asynchronous io, the function returns -EIOCBQUEUED, which indicates that the io has been queued. Else, in case of synchronous io, call the dio_await_completion() function. This function waits for the io to be completed. It calls the dio_await_one() function, which schedules out the current process and waits until the io has been completed. When the io has been completed and the process is woken up from the dio_bio_end_io() function, it will have the list of bios.
9. Return from the function.
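The first and last block mentioned for the dio_submit structure follow from the file offset, the length and blkbits. The sketch below shows that arithmetic together with the kind of alignment check direct IO depends on. The toy_ names are hypothetical, and the real code also considers the block device's logical block size, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_block_range {
    unsigned long long first;   /* first file system block touched */
    unsigned long long last;    /* last file system block touched */
};

/* True when both offset and length are multiples of the block size. */
static bool toy_dio_aligned(unsigned long long offset,
                            unsigned long long len, unsigned int blkbits)
{
    unsigned long long mask = (1ULL << blkbits) - 1;

    return ((offset | len) & mask) == 0;
}

/* Blocks covered by the byte range [offset, offset + len). */
static struct toy_block_range toy_blocks_for(unsigned long long offset,
                                             unsigned long long len,
                                             unsigned int blkbits)
{
    struct toy_block_range r;

    assert(len > 0 && blkbits < 32);
    r.first = offset >> blkbits;
    r.last = (offset + len - 1) >> blkbits;
    return r;
}
```

With 4 KiB blocks (blkbits of 12), a 12288-byte read at offset 8192 is aligned and covers blocks 2 to 4.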

Note about bio :

struct bio in Linux contains a table, bi_io_vec, where each element contains a pointer to a page (bv_page), the length of the data (bv_len) and the offset inside the page (bv_offset). The field bi_vcnt shows how many such elements are in the vector, while the current index is kept in bi_idx.

bi_io_vec represents the vectored IO concept: data scattered across several memory pages is submitted as one IO and transferred using scatter-gather DMA.
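A reduced model of that table may make the fields easier to picture. The structures below are hypothetical stand-ins that keep only the fields named in the note, and the helper sums the bytes still to be transferred from the current index onwards.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder for a memory page, assuming 4 KiB pages. */
struct toy_page {
    unsigned char data[4096];
};

/* One segment: a piece of one page. */
struct toy_bio_vec {
    struct toy_page *bv_page;
    unsigned int bv_len;
    unsigned int bv_offset;
};

/* Only the fields discussed in the note above. */
struct toy_bio {
    struct toy_bio_vec *bi_io_vec;   /* the segment table */
    unsigned short bi_vcnt;          /* number of segments in the table */
    unsigned short bi_idx;           /* current segment */
};

/* Bytes described by the segments from the current index to the end. */
static unsigned int toy_bio_remaining(const struct toy_bio *bio)
{
    unsigned int total = 0;
    unsigned short i;

    assert(bio != NULL && bio->bi_idx <= bio->bi_vcnt);
    for (i = bio->bi_idx; i < bio->bi_vcnt; i++)
        total += bio->bi_io_vec[i].bv_len;
    return total;
}
```

Each segment can point into a different page, which is what lets one bio describe a buffer that is contiguous on disk but scattered in memory.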

Articles in this blog

Tuesday, 27 January 2015

Read system call trace (contd.)

Wednesday, 14 January 2015

Read system call trace (specific to linux kernel 3.16.3)