Sunday 5 August 2018

Linux Kernel NVMe driver


In this blog we will go through the Linux NVMe kernel driver.

The NVMe kernel driver has a PCI device ID table, nvme_id_table. As with any PCI driver, this table lists the Vendor and Device IDs (and device class) the driver supports.
When the driver module is loaded, the nvme_init function registers this id_table with the PCI subsystem.
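
A rough sketch of that registration (simplified; the real table also carries several specific vendor/device entries and quirk flags):

/* Simplified sketch of the PCI registration, not the full table. */
static const struct pci_device_id nvme_id_table[] = {
	/* plus specific vendor/device entries; this matches on the NVMe class code */
	{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
	{ 0, }
};
MODULE_DEVICE_TABLE(pci, nvme_id_table);

static struct pci_driver nvme_driver = {
	.name		= "nvme",
	.id_table	= nvme_id_table,
	.probe		= nvme_probe,
	.remove		= nvme_remove,
};

static int __init nvme_init(void)
{
	return pci_register_driver(&nvme_driver);
}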

When a matching device is detected, the driver's probe function, nvme_probe, is called.

nvme_dev is allocated along with its queue array. Though NVMe supports up to 64K queues, the driver creates only as many queues as there are CPUs in the system.

nvme_dev_map ioremaps the controller's register space, i.e. the registers listed below plus 4096 bytes for the doorbells (NVME_REG_DBS + 4096):

enum {
	NVME_REG_CAP	= 0x0000,	/* Controller Capabilities */
	NVME_REG_VS	= 0x0008,	/* Version */
	NVME_REG_INTMS	= 0x000c,	/* Interrupt Mask Set */
	NVME_REG_INTMC	= 0x0010,	/* Interrupt Mask Clear */
	NVME_REG_CC	= 0x0014,	/* Controller Configuration */
	NVME_REG_CSTS	= 0x001c,	/* Controller Status */
	NVME_REG_NSSR	= 0x0020,	/* NVM Subsystem Reset */
	NVME_REG_AQA	= 0x0024,	/* Admin Queue Attributes */
	NVME_REG_ASQ	= 0x0028,	/* Admin SQ Base Address */
	NVME_REG_ACQ	= 0x0030,	/* Admin CQ Base Address */
	NVME_REG_CMBLOC	= 0x0038,	/* Controller Memory Buffer Location */
	NVME_REG_CMBSZ	= 0x003c,	/* Controller Memory Buffer Size */
	NVME_REG_DBS	= 0x1000,	/* SQ 0 Tail Doorbell */
};


Here we can see the admin queue registers (the queue attributes plus the admin submission and completion queue base addresses) and the doorbell registers of the NVMe device.
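
In code terms, the mapping and the first register reads boil down to something like this (a simplified sketch of what nvme_dev_map / nvme_pci_enable do, not the exact driver code):

/* Simplified sketch: map BAR0 (registers + doorbells) and read a few registers. */
struct pci_dev *pdev = to_pci_dev(dev->dev);
u64 cap;
u32 vs;

dev->bar = ioremap(pci_resource_start(pdev, 0), NVME_REG_DBS + 4096);

cap = lo_hi_readq(dev->bar + NVME_REG_CAP);	/* controller capabilities */
vs  = readl(dev->bar + NVME_REG_VS);		/* controller version */
dev->db_stride = 1 << NVME_CAP_STRIDE(cap);	/* doorbell stride */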

Now the NVMe driver creates DMA pools of Physical Region Pages (PRPs). The prp_page_pool and prp_small_pool are the pools created for the IOs; they are set up in the nvme_setup_prp_pools function. From LDD3: "A DMA pool is an allocation mechanism for small, coherent DMA mappings. Mappings obtained from dma_alloc_coherent may have a minimum size of one page. If your device needs smaller DMA areas than that, you should probably be using a DMA pool."
Function : nvme_setup_prp_pools
creates the pools using dma_pool_create
These pools are then used (via dma_pool_alloc) to allocate PRP lists in the nvme_setup_prps function while doing IOs.
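
The DMA pool pattern looks roughly like this (a simplified sketch, not the exact driver code):

/* Sketch of the DMA-pool pattern used for PRP lists (simplified). */
dev->prp_page_pool = dma_pool_create("prp list page", dev->dev,
				     PAGE_SIZE, PAGE_SIZE, 0);

/* later, in the IO path, a PRP list is carved out of the pool */
dma_addr_t prp_dma;
__le64 *prp_list = dma_pool_alloc(dev->prp_page_pool, GFP_ATOMIC, &prp_dma);
/* ... fill prp_list[] with the physical addresses of the data pages ... */
dma_pool_free(dev->prp_page_pool, prp_list, prp_dma);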

Next the NVMe driver calls nvme_init_ctrl to initialize the controller structures.
Three work items are initialized:
INIT_WORK(&ctrl->scan_work, nvme_scan_work);
INIT_WORK(&ctrl->async_event_work, nvme_async_event_work);
INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work);


device_create_with_groups creates the sysfs entries.

Finally, nvme_probe kicks off reset_work (the nvme_reset_work function).
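
The hand-off itself is just work-queue plumbing; roughly (the exact form varies across kernel versions, nvme_wq being the driver's global workqueue):

INIT_WORK(&dev->ctrl.reset_work, nvme_reset_work);	/* earlier in nvme_probe */
queue_work(nvme_wq, &dev->ctrl.reset_work);		/* kick off nvme_reset_work */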

====================================================
nvme_reset_work function : 
====================================================
Calls nvme_pci_enable
    calls pci_enable_device_mem --> enables the device's memory resources
    pci_set_master --> makes the PCI device the bus master
    Reads the controller capabilities (NVME_REG_CAP) and determines io_queue_depth
    Checks whether a CMB is present (controller version read from NVME_REG_VS, then CMBSZ/CMBLOC) and ioremaps the CMB

Note about CMB :
The Controller Memory Buffer (CMB) is a region of general purpose read/write memory on the controller that may be used for a variety of purposes. The controller indicates which purposes the memory may be used for by setting support flags in the CMBSZ register.
Submission Queues in host memory require the controller to perform a PCI Express read from host memory in order to fetch the queue entries. Submission Queues in controller memory enable host software to directly write the entire Submission Queue Entry to the controller's internal memory space, avoiding one read from the controller to the host.

Calls nvme_pci_configure_admin_queue
    calls nvme_alloc_queue --> allocates the admin queue: the completion queue and the submission queue are allocated using dma_zalloc_coherent etc.
    calls nvme_init_queue to initialize the submission queue tail and the completion queue head, and to set up the doorbell registers
    Configures the completion queue interrupt using queue_request_irq
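
In between, the admin queue is pointed out to the controller by programming the AQA/ASQ/ACQ registers; roughly (a sketch simplified from the 4.x code):

/* Simplified sketch of how nvme_pci_configure_admin_queue programs
 * the admin queue registers. */
u32 aqa = nvmeq->q_depth - 1;		/* zero-based queue size */
aqa |= aqa << 16;			/* same size for admin SQ and CQ */

writel(aqa, dev->bar + NVME_REG_AQA);
lo_hi_writeq(nvmeq->sq_dma_addr, dev->bar + NVME_REG_ASQ);
lo_hi_writeq(nvmeq->cq_dma_addr, dev->bar + NVME_REG_ACQ);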

Calls nvme_alloc_admin_tags
    Allocates the blk-mq tag set for the admin queue. The tag set is used so that completions run on the same CPU where the request was generated.
    It also initializes admin_q using blk_mq_init_queue:
    dev->ctrl.admin_q = blk_mq_init_queue(&dev->admin_tagset);
    So a request_queue is initialized, with blk_mq_make_request etc. hooked up to it.
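
A sketch of the admin tag set setup (simplified, field values approximate, following the 4.x driver):

/* Sketch of nvme_alloc_admin_tags (simplified). */
dev->admin_tagset.ops = &nvme_mq_admin_ops;	/* .queue_rq = nvme_queue_rq, etc. */
dev->admin_tagset.nr_hw_queues = 1;		/* single admin hardware queue */
dev->admin_tagset.queue_depth = NVME_AQ_DEPTH - 1;
dev->admin_tagset.timeout = ADMIN_TIMEOUT;
dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
dev->admin_tagset.driver_data = dev;

if (blk_mq_alloc_tag_set(&dev->admin_tagset))
	return -ENOMEM;

dev->ctrl.admin_q = blk_mq_init_queue(&dev->admin_tagset);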

Calls nvme_init_identify
    calls nvme_identify_ctrl --> sends the Identify command (nvme_admin_identify = 0x06) to the controller
    the command is submitted to the admin queue using __nvme_submit_sync_cmd
    this function allocates a request and calls blk_execute_rq to wait for the command to complete
    Based on the returned data, nvme_init_subnqn initializes the NQN.
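
nvme_identify_ctrl itself is short; roughly (a sketch simplified from the 4.x core code):

/* Sketch of nvme_identify_ctrl (simplified). */
static int nvme_identify_ctrl(struct nvme_ctrl *ctrl, struct nvme_id_ctrl **id)
{
	struct nvme_command c = { };
	int error;

	c.identify.opcode = nvme_admin_identify;
	c.identify.cns = NVME_ID_CNS_CTRL;	/* identify the controller */

	*id = kmalloc(sizeof(struct nvme_id_ctrl), GFP_KERNEL);
	if (!*id)
		return -ENOMEM;

	/* synchronous admin command; the identify data lands in *id */
	error = nvme_submit_sync_cmd(ctrl->admin_q, &c, *id,
				     sizeof(struct nvme_id_ctrl));
	if (error)
		kfree(*id);
	return error;
}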

Calls nvme_setup_io_queues
    Sets the number of queues to the number of CPUs present
    calls nvme_create_io_queues to create the queues
    nvme_alloc_queue allocates a CQ and an SQ per CPU (same as was done for the admin queue)
    nvme_create_queue --> creates the CQ and SQ on the controller and registers the IRQ for each queue ("queue_request_irq")
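
Creating a queue on the controller is itself done with admin commands; completion queue creation, for example, looks roughly like this (a sketch of adapter_alloc_cq, simplified from the 4.x driver):

/* Sketch: create an I/O completion queue on the controller via an admin command. */
static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
			    struct nvme_queue *nvmeq)
{
	struct nvme_command c;
	int flags = NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED;

	memset(&c, 0, sizeof(c));
	c.create_cq.opcode = nvme_admin_create_cq;
	c.create_cq.prp1 = cpu_to_le64(nvmeq->cq_dma_addr);	/* CQ memory */
	c.create_cq.cqid = cpu_to_le16(qid);
	c.create_cq.qsize = cpu_to_le16(nvmeq->q_depth - 1);
	c.create_cq.cq_flags = cpu_to_le16(flags);
	c.create_cq.irq_vector = cpu_to_le16(nvmeq->cq_vector);

	return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
}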

Calls nvme_start_ctrl
Starts nvme_queue_scan --> queues scan_work --> calls nvme_scan_work
   nvme_scan_work first calls nvme_identify_ctrl :
    sends :
c.identify.opcode = nvme_admin_identify;
c.identify.cns = NVME_ID_CNS_CTRL;
   based on the namespace count (and controller version) returned by Identify, either a namespace-list scan or a sequential scan is triggered
calls nvme_scan_ns_sequential
calls nvme_validate_ns
calls nvme_alloc_ns --> does blk_mq_init_queue --> sends the Identify Namespace command
forms a disk name and creates the disk using device_add_disk --> this creates the /dev/ entries
The disk name is formed like:
sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->instance)
so it comes out as nvme0n1 etc.
nvme_identify_ns sends the following command:
c.identify.opcode = nvme_admin_identify;
c.identify.nsid = cpu_to_le32(nsid);
c.identify.cns = NVME_ID_CNS_NS;
nvme_report_ns_ids fills in the ns_id, eui, nguid and uuid



Let's see how the IO path looks once the driver has finished this setup:

A probe placed on the nvme_setup_cmd function shows the following call stack:

[105617.151466]  ? sched_clock_cpu+0x11/0xb0
[105617.155491]  ? __lock_acquire.isra.34+0x259/0xa90
[105617.160307]  blk_mq_try_issue_directly+0xbb/0x110
[105617.165118]  ? blk_mq_make_request+0x382/0x700
[105617.169675]  blk_mq_make_request+0x3b0/0x700
[105617.174045]  ? blk_mq_make_request+0x382/0x700
[105617.178593]  ? blk_queue_enter+0x6a/0x1f0
[105617.182714]  generic_make_request+0x11e/0x2f0
[105617.187176]  ? __lock_acquire.isra.34+0x259/0xa90
[105617.191988]  submit_bio+0x73/0x150
[105617.195494]  ? sched_clock+0x9/0x10
[105617.199083]  ? submit_bio+0x73/0x150
[105617.202757]  ? sched_clock_cpu+0x11/0xb0
[105617.206783]  do_mpage_readpage+0x489/0x7d0
[105617.210983]  ? I_BDEV+0x20/0x20
[105617.214235]  mpage_readpages+0x127/0x1b0
[105617.218260]  ? I_BDEV+0x20/0x20
[105617.221503]  ? I_BDEV+0x20/0x20
[105617.224750]  ? sched_clock_cpu+0x11/0xb0
[105617.228782]  blkdev_readpages+0x1d/0x20
[105617.232718]  __do_page_cache_readahead+0x223/0x2f0
[105617.237625]  ? find_get_entry+0xaf/0x120
[105617.241656]  force_page_cache_readahead+0x8e/0x100
[105617.246548]  ? force_page_cache_readahead+0x8e/0x100
[105617.251622]  page_cache_sync_readahead+0x42/0x50
[105617.256345]  generic_file_read_iter+0x646/0x800
[105617.260988]  ? sched_clock+0x9/0x10
[105617.264577]  ? sched_clock_cpu+0x11/0xb0
[105617.268605]  blkdev_read_iter+0x37/0x40
[105617.272542]  __vfs_read+0xe2/0x140
[105617.276054]  vfs_read+0x96/0x140
[105617.279393]  SyS_read+0x58/0xc0
[105617.282644]  do_syscall_64+0x5a/0x190
[105617.286412]  entry_SYSCALL64_slow_path+0x25/0x25



An ftrace function-graph trace of the same path looks like:
 0)               |      submit_bio() {
 0)               |        generic_make_request() {
 0)               |          generic_make_request_checks() {
 0)               |            _cond_resched() {
 0)   0.306 us    |              rcu_note_context_switch();
 0)   0.116 us    |              _raw_spin_lock();
 0)   0.130 us    |              update_rq_clock();
 0)               |              pick_next_task_fair() {
 ...
 ...
 ...
 0) * 72635.26 us |          }
 0) * 72636.55 us |        }
 0) * 72637.80 us |      }
 0)               |      blk_mq_make_request() {
 0)   0.343 us    |        blk_queue_bounce();
 0)   9.507 us    |        blk_queue_split();
 0)   0.637 us    |        bio_integrity_prep();
 0)               |        blk_attempt_plug_merge() {
 0)               |          blk_rq_merge_ok() {
 0)   0.316 us    |            blk_integrity_merge_bio();
 0)   1.817 us    |          }
 0)   0.153 us    |          blk_try_merge();
 0)               |          bio_attempt_back_merge() {
 0)   3.396 us    |            ll_back_merge_fn();
 0)   4.443 us    |          }
 0)   9.640 us    |        }
 0)               |        __blk_mq_sched_bio_merge() {
 0)   0.110 us    |          _raw_spin_lock();
 0)   1.883 us    |        }
 0)   1.193 us    |        wbt_wait();
 0)               |        blk_mq_get_request() {
 0)               |          blk_mq_get_tag() {
 0)   0.837 us    |            __blk_mq_get_tag();
 0)   2.130 us    |          }
 0)   0.130 us    |          __blk_mq_tag_busy();
 0)   5.217 us    |        }
 0)               |        blk_init_request_from_bio() {
 0)   0.346 us    |          blk_rq_bio_prep();
 0)   1.843 us    |        }
 0)               |        blk_account_io_start() {
 0)   0.640 us    |          disk_map_sector_rcu();
 0)               |          part_round_stats() {
 0)   0.160 us    |            part_round_stats_single();
 0)   1.060 us    |          }
 0)   3.720 us    |        }
 0)               |        blk_flush_plug_list() {
 0)               |          blk_mq_flush_plug_list() {
 0)   0.133 us    |            plug_ctx_cmp();
 0)               |            blk_mq_sched_insert_requests() {
 0)               |              blk_mq_insert_requests() {
 0)   0.113 us    |                _raw_spin_lock();
 0)   0.370 us    |                blk_mq_hctx_mark_pending.isra.29();
 0)   2.426 us    |              }
 0)               |              blk_mq_run_hw_queue() {
 0)               |                __blk_mq_delay_run_hw_queue() {
 0)               |                  __blk_mq_run_hw_queue() {
 0) ! 175.500 us  |                    blk_mq_sched_dispatch_requests();
 0) ! 177.577 us  |                  }
 0) ! 179.030 us  |                }
 0) ! 180.086 us  |              }
 0) ! 184.673 us  |            }
 0)               |            blk_mq_sched_insert_requests() {
 0)               |              blk_mq_insert_requests() {
 0)   0.113 us    |                _raw_spin_lock();
 0)   0.346 us    |                blk_mq_hctx_mark_pending.isra.29();
 0)   2.257 us    |              }
 0)               |              blk_mq_run_hw_queue() {
 0)               |                __blk_mq_delay_run_hw_queue() {
 0)               |                  __blk_mq_run_hw_queue() {
 0) ! 585.063 us  |                    blk_mq_sched_dispatch_requests();
 0) ! 587.033 us  |                  }
 0) ! 588.216 us  |                }
 0) ! 589.040 us  |              }
 0) ! 593.647 us  |            }
 0) ! 783.930 us  |          }
 0) ! 785.350 us  |        }
 0) ! 829.913 us  |      }
 0) @ 100015.7 us |    }

This is the flow of events after blk_mq_make_request:
Calls blk_mq_try_issue_directly to issue the IO; it calls nvme_queue_rq to submit the request
nvme_queue_rq calls nvme_setup_cmd and nvme_setup_rw to set up the read/write command
These set the command opcode, the namespace ID, the LBA number etc. (a simplified sketch of nvme_setup_rw follows the listing below)
nvme_map_data does the DMA mapping using:
blk_rq_map_sg
nvme_setup_prps
After these are done, __nvme_submit_cmd writes the command to the submission queue tail:


static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
		struct nvme_command *cmd)
{
	u16 tail = nvmeq->sq_tail;

	/* copy the command into the SQ slot; CMB-resident SQs need memcpy_toio */
	if (nvmeq->sq_cmds_io)
		memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
	else
		memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));

	/* advance (and wrap) the tail, then ring the SQ doorbell */
	if (++tail == nvmeq->q_depth)
		tail = 0;
	if (nvme_dbbuf_update_and_check_event(tail, nvmeq->dbbuf_sq_db,
					      nvmeq->dbbuf_sq_ei))
		writel(tail, nvmeq->q_db);
	nvmeq->sq_tail = tail;
}
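
For completeness, here is roughly what nvme_setup_rw (mentioned above) fills into the command; a simplified sketch of the 4.x-era code, omitting metadata/protection-information handling:

/* Simplified sketch of nvme_setup_rw (4.x era); PI/metadata handling omitted. */
static blk_status_t nvme_setup_rw(struct nvme_ns *ns, struct request *req,
		struct nvme_command *cmnd)
{
	memset(cmnd, 0, sizeof(*cmnd));
	cmnd->rw.opcode = rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read;
	cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
	/* starting LBA, converted from 512-byte sectors to the namespace LBA size */
	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
	/* number of LBAs, zero-based */
	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
	return BLK_STS_OK;
}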



After the submission is done:
nvme_process_cq looks for completed commands. It reads the completion queue entries, handles the command status, and rings the completion queue doorbell.

static void nvme_process_cq(struct nvme_queue *nvmeq)
{
	struct nvme_completion cqe;
	int consumed = 0;

	/* drain all new entries from the completion queue */
	while (nvme_read_cqe(nvmeq, &cqe)) {
		nvme_handle_cqe(nvmeq, &cqe);
		consumed++;
	}

	/* tell the controller how far we have consumed */
	if (consumed)
		nvme_ring_cq_doorbell(nvmeq);
}


When the completion is not picked up directly in the submission path, the interrupt handler nvme_irq processes the completion queue:

static irqreturn_t nvme_irq(int irq, void *data)
{
	irqreturn_t result;
	struct nvme_queue *nvmeq = data;

	spin_lock(&nvmeq->q_lock);
	nvme_process_cq(nvmeq);
	result = nvmeq->cqe_seen ? IRQ_HANDLED : IRQ_NONE;
	nvmeq->cqe_seen = 0;
	spin_unlock(&nvmeq->q_lock);
	return result;
}


Error handling :

When a command completes, nvme_complete_rq either retries the request or ends it with a block-layer status that nvme_error_status maps from the NVMe status code:

void nvme_complete_rq(struct request *req)
{
	if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
		nvme_req(req)->retries++;
		blk_mq_requeue_request(req, true);
		return;
	}

	blk_mq_end_request(req, nvme_error_status(req));
}
EXPORT_SYMBOL_GPL(nvme_complete_rq);

static blk_status_t nvme_error_status(struct request *req)
{
	switch (nvme_req(req)->status & 0x7ff) {
	case NVME_SC_SUCCESS:
		return BLK_STS_OK;
	case NVME_SC_CAP_EXCEEDED:
		return BLK_STS_NOSPC;
	case NVME_SC_ONCS_NOT_SUPPORTED:
		return BLK_STS_NOTSUPP;
	case NVME_SC_WRITE_FAULT:
	case NVME_SC_READ_ERROR:
	case NVME_SC_UNWRITTEN_BLOCK:
	case NVME_SC_ACCESS_DENIED:
	case NVME_SC_READ_ONLY:
		return BLK_STS_MEDIUM;
	case NVME_SC_GUARD_CHECK:
	case NVME_SC_APPTAG_CHECK:
	case NVME_SC_REFTAG_CHECK:
	case NVME_SC_INVALID_PI:
		return BLK_STS_PROTECTION;
	case NVME_SC_RESERVATION_CONFLICT:
		return BLK_STS_NEXUS;
	default:
		return BLK_STS_IOERR;
	}
}

