
Wednesday, 29 August 2018

Linux system call implementation on x86_64


How userspace invokes a system call:

x86_64 user programs invoke a system call by putting the system call number (0 for read) into the RAX register and the other parameters into specific registers (RDI, RSI and RDX for the first three parameters), then issuing the x86_64 syscall instruction.

From the calling-convention table in http://man7.org/linux/man-pages/man2/syscall.2.html:

Arch/ABI    Instruction           System call #    Ret val    Error    Notes
x86-64      syscall               rax              rax        -        [5]
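
As a minimal illustration, here is a hedged userspace sketch (not from the kernel sources; the raw_read wrapper is hypothetical and just does by hand what glibc's read() does for us):

/* Invoke read(2) directly with the syscall instruction on x86_64.
 * rax = 0 (read), rdi = fd, rsi = buf, rdx = count; the return
 * value comes back in rax. syscall clobbers rcx and r11. */
static long raw_read(int fd, void *buf, unsigned long count)
{
    long ret;

    asm volatile("syscall"
                 : "=a"(ret)
                 : "a"(0L), "D"((long)fd), "S"(buf), "d"(count)
                 : "rcx", "r11", "memory");
    return ret;
}

int main(void)
{
    char c;
    return raw_read(0, &c, 1) == 1 ? 0 : 1;   /* read one byte from stdin */
}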

This instruction causes the processor to transition to ring 0 and invoke the function referenced by the MSR_LSTAR model-specific register.

The MSR_LSTAR model-specific register is initialized at kernel boot, in syscall_init:


/* May not be marked __init: used by software suspend */
void syscall_init(void)
{
    extern char _entry_trampoline[];
    extern char entry_SYSCALL_64_trampoline[];

    int cpu = smp_processor_id();
    unsigned long SYSCALL64_entry_trampoline =
        (unsigned long)get_cpu_entry_area(cpu)->entry_trampoline +
        (entry_SYSCALL_64_trampoline - _entry_trampoline);

    wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
    if (static_cpu_has(X86_FEATURE_PTI))
        wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
    else
        wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);

#ifdef CONFIG_IA32_EMULATION
    wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
    /*
     * This only works on Intel CPUs.
     * On AMD CPUs these MSRs are 32-bit, CPU truncates MSR_IA32_SYSENTER_EIP.
     * This does not cause SYSENTER to jump to the wrong location, because
     * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
     */
    wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
    wrmsrl_safe(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_entry_stack(cpu) + 1));
    wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
#else
    wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
    wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
    wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
    wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
#endif

    /* Flags to clear on syscall */
    wrmsrl(MSR_SYSCALL_MASK,
           X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
           X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
}

system_call is the function that gets called for each system call (newer kernels name this entry point entry_SYSCALL_64). It is defined in the entry_64.S file.
The system_call code pushes the registers onto the kernel stack and calls the function pointer at index RAX in the sys_call_table:

ENTRY(system_call)
    ...
    movq %r10,%rcx                  /* fixup for C */
    call *sys_call_table(,%rax,8)
    movq %rax,RAX-ARGOFFSET(%rsp)
    ...
END(system_call)
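
In recent kernels the table dispatch itself happens in C; a condensed sketch of do_syscall_64 from arch/x86/entry/common.c (simplified here, with checks and tracing omitted):

__visible void do_syscall_64(struct pt_regs *regs)
{
    unsigned long nr = regs->orig_ax;

    if (likely(nr < NR_syscalls)) {
        /* the index comes from RAX; the arguments come from the
         * registers saved at syscall entry */
        regs->ax = sys_call_table[nr](regs->di, regs->si, regs->dx,
                                      regs->r10, regs->r8, regs->r9);
    }

    syscall_return_slowpath(regs);
}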


Apart from transferring control, SYSCALL also changes the privilege level:

Privilege levels (CPL, DPL, RPL):

x86 systems have a notion of privilege levels. These restrict memory access, I/O port access and the ability to execute certain machine instructions. For the kernel this privilege level is 0 and for user space programs it is 3. Executing code cannot change its own privilege level directly; transitions are made through the lcall, int, lret and iret instructions. The privilege level is raised by the lcall and int instructions and lowered by the lret and iret instructions. This explains the int 0x80 used in the legacy system call path: it is this int instruction that elevates the privilege level from user (ring 3) to kernel (ring 0).
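
The current privilege level can be observed directly: it lives in the low two bits of the CS selector. A minimal userspace sketch (my addition, not from the original post):

#include <stdio.h>

int main(void)
{
    unsigned short cs;

    /* CS cannot be written from ring 3, but it can be read */
    asm volatile("mov %%cs, %0" : "=r"(cs));
    printf("CPL = %d\n", cs & 3);   /* prints 3 in userspace */
    return 0;
}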

The GDTR, LDTR and IDTR are also reloaded on these transitions; the DPL is defined by the segment descriptors in the GDT/LDT, and the RPL by the low bits of the segment selectors.

Tuesday, 14 August 2018

Linux internals: segmentation fault generation


A Linux process has many sections; loosely speaking, these are data, text, stack, heap, etc.
When a Linux process is created, these sections are set up and virtual memory is allocated to them.

How these sections appear in the Linux kernel:

Each of these sections is represented by a vm_area_struct.

The Linux task_struct refers to the process address space through its mm_struct, and the mm_struct holds a vm_area_struct for each of these sections.

All of these sections are accessed through virtual addresses.

The MMU diagram at https://en.wikipedia.org/wiki/Memory_management_unit shows how the CPU/MMU translates virtual addresses to physical addresses.

The MMU part of the CPU first checks whether the virtual-to-physical translation is present in the TLB cache.
On a TLB miss the page tables are walked; if no valid translation exists there either, a page fault exception is raised.

As a result of the page fault exception, the page fault handler is called.

In Linux the page fault handler is the do_page_fault function. There the faulting address is taken from the CR2 register: read_cr2().

__do_page_fault is then called. This function checks whether a vm_area_struct covers the faulting address.

If the faulting address falls within a VMA, the good_area handling is invoked; otherwise the bad_area path is taken, as sketched below:
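
A condensed sketch of that check in __do_page_fault (arch/x86/mm/fault.c, simplified, error paths abbreviated):

    vma = find_vma(mm, address);       /* first VMA ending above address */
    if (unlikely(!vma)) {
        bad_area(regs, error_code, address);
        return;
    }
    if (likely(vma->vm_start <= address))
        goto good_area;                /* address falls inside this VMA */
    if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
        /* in a gap between VMAs, and not below a stack VMA */
        bad_area(regs, error_code, address);
        return;
    }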

bad_area path:
Eventually a SIGSEGV is generated in the __bad_area_nosemaphore function:

    tsk->thread.cr2 = address;
    tsk->thread.error_code = error_code;
    tsk->thread.trap_nr = X86_TRAP_PF;

    force_sig_info_fault(SIGSEGV, si_code, address, tsk, 0);


good_area handling:
If the faulting address turns out to be valid, the handle_mm_fault function is called to allocate a new page frame, similar to the demand paging explained in the Memory Initialization topic.
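
Condensed, the good_area side looks roughly like this (simplified from arch/x86/mm/fault.c):

good_area:
    if (unlikely(access_error(error_code, vma))) {
        /* e.g. a write to a read-only mapping: also a SIGSEGV */
        bad_area_access_error(regs, error_code, address, vma);
        return;
    }

    /* fault the page in on demand */
    fault = handle_mm_fault(vma, address, flags);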

Sunday, 5 August 2018

Linux Kernel NVMe driver


In this post we will go through the Linux NVMe kernel driver.

The NVMe kernel driver has an nvme_id_table. As with other PCI drivers, this table lists the vendor and device IDs the driver supports.
When the driver is loaded, the nvme_init function registers this id_table with the PCI subsystem.
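
In outline it looks like this (condensed from drivers/nvme/host/pci.c; the real table also carries per-device quirk entries):

static const struct pci_device_id nvme_id_table[] = {
    /* match any controller exposing the NVMe PCI class code */
    { PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
    { 0, }
};
MODULE_DEVICE_TABLE(pci, nvme_id_table);

static struct pci_driver nvme_driver = {
    .name     = "nvme",
    .id_table = nvme_id_table,
    .probe    = nvme_probe,
    .remove   = nvme_remove,
};

static int __init nvme_init(void)
{
    return pci_register_driver(&nvme_driver);
}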

When a matching device is found, the driver's probe function is called.

nvme_probe is the driver probe function.

nvme_dev is allocated along with its queues. Although NVMe supports up to 64K queues, only as many queues are created as there are CPUs in the system.

nvme_dev_map ioremaps the controller registers (up to the doorbell region, NVME_REG_DBS + 4096):

enum {
    NVME_REG_CAP    = 0x0000,  /* Controller Capabilities */
    NVME_REG_VS     = 0x0008,  /* Version */
    NVME_REG_INTMS  = 0x000c,  /* Interrupt Mask Set */
    NVME_REG_INTMC  = 0x0010,  /* Interrupt Mask Clear */
    NVME_REG_CC     = 0x0014,  /* Controller Configuration */
    NVME_REG_CSTS   = 0x001c,  /* Controller Status */
    NVME_REG_NSSR   = 0x0020,  /* NVM Subsystem Reset */
    NVME_REG_AQA    = 0x0024,  /* Admin Queue Attributes */
    NVME_REG_ASQ    = 0x0028,  /* Admin SQ Base Address */
    NVME_REG_ACQ    = 0x0030,  /* Admin CQ Base Address */
    NVME_REG_CMBLOC = 0x0038,  /* Controller Memory Buffer Location */
    NVME_REG_CMBSZ  = 0x003c,  /* Controller Memory Buffer Size */
    NVME_REG_DBS    = 0x1000,  /* SQ 0 Tail Doorbell */
};


Here we can see the admin queue registers (attributes and the SQ/CQ base addresses) and the doorbell register of the NVMe device.

Now the NVMe driver creates DMA pools of Physical Region Pages. The prp_page_pool and prp_small_pool are the pools used for I/O; they are created in the nvme_setup_prp_pools function. From LDD3: "A DMA pool is an allocation mechanism for small, coherent DMA mappings. Mappings obtained from dma_alloc_coherent may have a minimum size of one page. If your device needs smaller DMA areas than that, you should probably be using a DMA pool."
nvme_setup_prp_pools creates the pools using dma_pool_create; memory is later allocated from them with dma_pool_alloc in nvme_setup_prps while doing I/O.
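
Roughly (condensed from nvme_setup_prp_pools; error handling abbreviated):

/* one pool for full PRP-list pages, one for small 256-byte lists */
dev->prp_page_pool = dma_pool_create("prp list page", dev->dev,
                                     PAGE_SIZE, PAGE_SIZE, 0);
dev->prp_small_pool = dma_pool_create("prp list 256", dev->dev,
                                      256, 256, 0);
if (!dev->prp_page_pool || !dev->prp_small_pool)
    return -ENOMEM;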

Next the NVMe driver moves to the nvme_init_ctrl function to initialize the controller structures. Three work items are initialized:

    INIT_WORK(&ctrl->scan_work, nvme_scan_work);
    INIT_WORK(&ctrl->async_event_work, nvme_async_event_work);
    INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work);


device_create_with_groups creates the sysfs entries.

Finally, nvme_probe queues reset_work.

====================================================
nvme_reset_work function : 
====================================================
Calls nvme_pci_enable, which:
    calls pci_enable_device_mem --> enables the device memory
    calls pci_set_master --> makes the PCI device bus master
    reads the controller capabilities (NVME_REG_CAP) and determines io_queue_depth
    checks whether a CMB is present (controller version from NVME_REG_VS) and, if so, ioremaps the CMB
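
For example, the queue depth comes from the MQES field of the capabilities register (condensed from nvme_pci_enable):

u64 cap = lo_hi_readq(dev->bar + NVME_REG_CAP);

/* MQES is zero-based; clamp to the io_queue_depth module parameter */
dev->q_depth = min_t(int, NVME_CAP_MQES(cap) + 1, io_queue_depth);
dev->db_stride = 1 << NVME_CAP_STRIDE(cap);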

Note about CMB :
The Controller Memory Buffer (CMB) is a region of general purpose read/write memory on the controller that may be used for a variety of purposes. The controller indicates which purposes the memory may be used for by setting support flags in the CMBSZ register.
Submission Queues in host memory require the controller to perform a PCI Express read from host memory in order to fetch the queue entries. Submission Queues in controller memory enable host software to directly write the entire Submission Queue Entry to the controller's internal memory space, avoiding one read from the controller to the host.

Calls nvme_pci_configure_admin_queue, which:
    calls nvme_alloc_queue --> does the allocation for the admin queue; the completion and submission queues are allocated using dma_zalloc_coherent etc.
    calls nvme_init_queue to initialize the submission queue tail and completion queue head; it also initializes the doorbell registers
    configures the completion queue interrupt using queue_request_irq

Calls nvme_alloc_admin_tags, which:
    allocates the blk-mq tag set; the tag set is used to force completion on the same CPU where the request was generated
    also initializes admin_q using blk_mq_init_queue:
        dev->ctrl.admin_q = blk_mq_init_queue(&dev->admin_tagset);
    So a request_queue is initialized and blk_mq_make_request etc. are attached to it.
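
A condensed sketch of the tag-set setup (field values as in the 4.x driver; some fields omitted):

dev->admin_tagset.ops = &nvme_mq_admin_ops;
dev->admin_tagset.nr_hw_queues = 1;            /* one admin hw queue */
dev->admin_tagset.queue_depth = NVME_AQ_BLK_MQ_DEPTH;
dev->admin_tagset.timeout = ADMIN_TIMEOUT;
dev->admin_tagset.numa_node = dev_to_node(dev->dev);
dev->admin_tagset.driver_data = dev;

if (blk_mq_alloc_tag_set(&dev->admin_tagset))
    return -ENOMEM;
dev->ctrl.admin_q = blk_mq_init_queue(&dev->admin_tagset);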

Calls nvme_init_identify, which:
    calls nvme_identify_ctrl --> sends the nvme_admin_identify (0x06) command to the controller
    the command is submitted to the admin queue using __nvme_submit_sync_cmd
    this function allocates a request and calls blk_execute_rq to wait for the command to complete
    based on the returned data, nvme_init_subnqn initializes the NQN

Calls nvme_setup_io_queues, which:
    sets the number of queues to the number of CPUs present
    calls nvme_create_io_queues to create the queues
    nvme_alloc_queue allocates a CQ and SQ for each CPU (same as was done for the admin queue)
    nvme_create_queue --> creates the CQ and SQ and registers the IRQ for each queue ("queue_request_irq")

Calls nvme_start_ctrl
    Starts nvme_queue_scan --> launches scan_work --> calls nvme_scan_work
    nvme_scan_work first calls nvme_identify_ctrl, which sends:
        c.identify.opcode = nvme_admin_identify;
        c.identify.cns = NVME_ID_CNS_CTRL;
    Based on the namespace count returned by identify, a validating or sequential scan is triggered:
        calls nvme_scan_ns_sequential
        calls nvme_validate_ns
        calls nvme_alloc_ns --> does blk_mq_init_queue --> sends the identify command
        forms a disk name and creates a disk using device_add_disk --> this creates the /dev/ entries
    The disk name is formed like:
        sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->instance);
    which yields names like nvme0n1.
    nvme_identify_ns sends the following command:
        c.identify.opcode = nvme_admin_identify;
        c.identify.nsid = cpu_to_le32(nsid);
        c.identify.cns = NVME_ID_CNS_NS;
    nvme_report_ns_ids fills in the ns_id, eui, nguid and uuid.



Let's see how the I/O path looks once the driver setup is done.

A stack trace from a probe on the nvme_setup_cmd function:

[105617.151466]  ? sched_clock_cpu+0x11/0xb0
[105617.155491]  ? __lock_acquire.isra.34+0x259/0xa90
[105617.160307]  blk_mq_try_issue_directly+0xbb/0x110
[105617.165118]  ? blk_mq_make_request+0x382/0x700
[105617.169675]  blk_mq_make_request+0x3b0/0x700
[105617.174045]  ? blk_mq_make_request+0x382/0x700
[105617.178593]  ? blk_queue_enter+0x6a/0x1f0
[105617.182714]  generic_make_request+0x11e/0x2f0
[105617.187176]  ? __lock_acquire.isra.34+0x259/0xa90
[105617.191988]  submit_bio+0x73/0x150
[105617.195494]  ? sched_clock+0x9/0x10
[105617.199083]  ? submit_bio+0x73/0x150
[105617.202757]  ? sched_clock_cpu+0x11/0xb0
[105617.206783]  do_mpage_readpage+0x489/0x7d0
[105617.210983]  ? I_BDEV+0x20/0x20
[105617.214235]  mpage_readpages+0x127/0x1b0
[105617.218260]  ? I_BDEV+0x20/0x20
[105617.221503]  ? I_BDEV+0x20/0x20
[105617.224750]  ? sched_clock_cpu+0x11/0xb0
[105617.228782]  blkdev_readpages+0x1d/0x20
[105617.232718]  __do_page_cache_readahead+0x223/0x2f0
[105617.237625]  ? find_get_entry+0xaf/0x120
[105617.241656]  force_page_cache_readahead+0x8e/0x100
[105617.246548]  ? force_page_cache_readahead+0x8e/0x100
[105617.251622]  page_cache_sync_readahead+0x42/0x50
[105617.256345]  generic_file_read_iter+0x646/0x800
[105617.260988]  ? sched_clock+0x9/0x10
[105617.264577]  ? sched_clock_cpu+0x11/0xb0
[105617.268605]  blkdev_read_iter+0x37/0x40
[105617.272542]  __vfs_read+0xe2/0x140
[105617.276054]  vfs_read+0x96/0x140
[105617.279393]  SyS_read+0x58/0xc0
[105617.282644]  do_syscall_64+0x5a/0x190
[105617.286412]  entry_SYSCALL64_slow_path+0x25/0x25



The corresponding ftrace function graph looks like:
 0)               |      submit_bio() {
 0)               |        generic_make_request() {
 0)               |          generic_make_request_checks() {
 0)               |            _cond_resched() {
 0)   0.306 us    |              rcu_note_context_switch();
 0)   0.116 us    |              _raw_spin_lock();
 0)   0.130 us    |              update_rq_clock();
 0)               |              pick_next_task_fair() {
 ...
 ...
 ...
 0) * 72635.26 us |          }
 0) * 72636.55 us |        }
 0) * 72637.80 us |      }
 0)               |      blk_mq_make_request() {
 0)   0.343 us    |        blk_queue_bounce();
 0)   9.507 us    |        blk_queue_split();
 0)   0.637 us    |        bio_integrity_prep();
 0)               |        blk_attempt_plug_merge() {
 0)               |          blk_rq_merge_ok() {
 0)   0.316 us    |            blk_integrity_merge_bio();
 0)   1.817 us    |          }
 0)   0.153 us    |          blk_try_merge();
 0)               |          bio_attempt_back_merge() {
 0)   3.396 us    |            ll_back_merge_fn();
 0)   4.443 us    |          }
 0)   9.640 us    |        }
 0)               |        __blk_mq_sched_bio_merge() {
 0)   0.110 us    |          _raw_spin_lock();
 0)   1.883 us    |        }
 0)   1.193 us    |        wbt_wait();
 0)               |        blk_mq_get_request() {
 0)               |          blk_mq_get_tag() {
 0)   0.837 us    |            __blk_mq_get_tag();
 0)   2.130 us    |          }
 0)   0.130 us    |          __blk_mq_tag_busy();
 0)   5.217 us    |        }
 0)               |        blk_init_request_from_bio() {
 0)   0.346 us    |          blk_rq_bio_prep();
 0)   1.843 us    |        }
 0)               |        blk_account_io_start() {
 0)   0.640 us    |          disk_map_sector_rcu();
 0)               |          part_round_stats() {
 0)   0.160 us    |            part_round_stats_single();
 0)   1.060 us    |          }
 0)   3.720 us    |        }
 0)               |        blk_flush_plug_list() {
 0)               |          blk_mq_flush_plug_list() {
 0)   0.133 us    |            plug_ctx_cmp();
 0)               |            blk_mq_sched_insert_requests() {
 0)               |              blk_mq_insert_requests() {
 0)   0.113 us    |                _raw_spin_lock();
 0)   0.370 us    |                blk_mq_hctx_mark_pending.isra.29();
 0)   2.426 us    |              }
 0)               |              blk_mq_run_hw_queue() {
 0)               |                __blk_mq_delay_run_hw_queue() {
 0)               |                  __blk_mq_run_hw_queue() {
 0) ! 175.500 us  |                    blk_mq_sched_dispatch_requests();
 0) ! 177.577 us  |                  }
 0) ! 179.030 us  |                }
 0) ! 180.086 us  |              }
 0) ! 184.673 us  |            }
 0)               |            blk_mq_sched_insert_requests() {
 0)               |              blk_mq_insert_requests() {
 0)   0.113 us    |                _raw_spin_lock();
 0)   0.346 us    |                blk_mq_hctx_mark_pending.isra.29();
 0)   2.257 us    |              }
 0)               |              blk_mq_run_hw_queue() {
 0)               |                __blk_mq_delay_run_hw_queue() {
 0)               |                  __blk_mq_run_hw_queue() {
 0) ! 585.063 us  |                    blk_mq_sched_dispatch_requests();
 0) ! 587.033 us  |                  }
 0) ! 588.216 us  |                }
 0) ! 589.040 us  |              }
 0) ! 593.647 us  |            }
 0) ! 783.930 us  |          }
 0) ! 785.350 us  |        }
 0) ! 829.913 us  |      }
 0) @ 100015.7 us |    }

This is the flow of events after blk_mq_make_request:
    blk_mq_try_issue_directly is called to issue the I/O; it calls nvme_queue_rq to submit the request
    nvme_queue_rq calls nvme_setup_cmd and nvme_setup_rw to set up the read/write; these set the command type, namespace ID, LBA number etc. (sketched below)
    nvme_map_data does the DMA mapping using:
        blk_rq_map_sg
        nvme_setup_prps
    After these are done, __nvme_submit_cmd writes the command to the submission queue tail
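
The fields nvme_setup_rw fills in look roughly like this (condensed; field names differ slightly between kernel versions):

cmnd->rw.opcode = rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read;
cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
/* starting LBA, and a zero-based count of logical blocks */
cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);

And __nvme_submit_cmd itself: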


static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
                              struct nvme_command *cmd)
{
    u16 tail = nvmeq->sq_tail;

    /* copy the command into the SQ slot at the tail */
    if (nvmeq->sq_cmds_io)
        memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
    else
        memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));

    /* advance the tail, wrapping around the ring */
    if (++tail == nvmeq->q_depth)
        tail = 0;
    /* ring the SQ tail doorbell, unless the shadow doorbell
     * buffer says the controller does not need the write */
    if (nvme_dbbuf_update_and_check_event(tail, nvmeq->dbbuf_sq_db,
                                          nvmeq->dbbuf_sq_ei))
        writel(tail, nvmeq->q_db);
    nvmeq->sq_tail = tail;
}



After the submission is done:
nvme_process_cq watches for the completion of commands. It reads the completion queue, handles each completion entry, and rings the CQ doorbell.

static void nvme_process_cq(struct nvme_queue *nvmeq)
{
    struct nvme_completion cqe;
    int consumed = 0;

    while (nvme_read_cqe(nvmeq, &cqe)) {
        nvme_handle_cqe(nvmeq, &cqe);
        consumed++;
    }

    if (consumed)
        nvme_ring_cq_doorbell(nvmeq);
}
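
How does nvme_read_cqe know an entry is new? Completion entries carry a phase bit that the controller flips on each pass through the ring; an entry is valid when its phase matches the queue's current phase (from the 4.x driver):

static inline bool nvme_cqe_valid(struct nvme_queue *nvmeq,
                                  u16 head, u16 phase)
{
    /* bit 0 of the CQE status field is the phase tag */
    return (le16_to_cpu(nvmeq->cqes[head].status) & 1) == phase;
}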


If the completion is not picked up inline on the direct-issue path, nvme_irq processes the completion queue when the interrupt fires:

static irqreturn_t nvme_irq(int irq, void *data)
{
    irqreturn_t result;
    struct nvme_queue *nvmeq = data;

    spin_lock(&nvmeq->q_lock);
    nvme_process_cq(nvmeq);
    result = nvmeq->cqe_seen ? IRQ_HANDLED : IRQ_NONE;
    nvmeq->cqe_seen = 0;
    spin_unlock(&nvmeq->q_lock);
    return result;
}


Error handling:

When a request completes, nvme_complete_rq either retries it or ends it with a block-layer status translated from the NVMe status code:

void nvme_complete_rq(struct request *req)
{
    if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
        nvme_req(req)->retries++;
        blk_mq_requeue_request(req, true);
        return;
    }

    blk_mq_end_request(req, nvme_error_status(req));
}
EXPORT_SYMBOL_GPL(nvme_complete_rq);

static blk_status_t nvme_error_status(struct request *req)
{
    switch (nvme_req(req)->status & 0x7ff) {
    case NVME_SC_SUCCESS:
        return BLK_STS_OK;
    case NVME_SC_CAP_EXCEEDED:
        return BLK_STS_NOSPC;
    case NVME_SC_ONCS_NOT_SUPPORTED:
        return BLK_STS_NOTSUPP;
    case NVME_SC_WRITE_FAULT:
    case NVME_SC_READ_ERROR:
    case NVME_SC_UNWRITTEN_BLOCK:
    case NVME_SC_ACCESS_DENIED:
    case NVME_SC_READ_ONLY:
        return BLK_STS_MEDIUM;
    case NVME_SC_GUARD_CHECK:
    case NVME_SC_APPTAG_CHECK:
    case NVME_SC_REFTAG_CHECK:
    case NVME_SC_INVALID_PI:
        return BLK_STS_PROTECTION;
    case NVME_SC_RESERVATION_CONFLICT:
        return BLK_STS_NEXUS;
    default:
        return BLK_STS_IOERR;
    }
}