
Friday, 14 September 2018

entry_64.S

entry_64.S is a collection of functions written in assembly. The file contains many routines that are used for interrupts, system calls, etc.
Let us go through the routines present in this file one by one.

Functions present in this file :

A) save_rest : Handles the saving of extra registers

B) ret_from_fork : This function is executed by the child process right after its creation through a fork.
Called from :
1. copy_thread
 p->thread.ip = (unsigned long) ret_from_fork;
This is done to set the new thread's instruction pointer.

2. /* Save restore flags to clear handle leaking NT */
#define switch_to(prev, next, last) \
...
     "jnz   ret_from_fork\n\t"   \
...
From switch_to, control reaches ret_from_fork if the task has never run before.


C) system_call : This is the entry point of every system call.

MSR_LSTAR is initialised with this function at startup.
This function calls the system call handler from the sys_call_table :
call *sys_call_table(,%rax,8)  # XXX: rip relative

The entry to and the return from a system call are also handled here.
ret_from_sys_call :
On return from a system call, process scheduling is invoked :
...
 SCHEDULE_USER
...
# define SCHEDULE_USER call schedule


D) The various stub functions are usually entries in the sys_call_table :

stub_execve, stub_rt_sigreturn, stub_x32_execve
0 syscalls_64.h       60 __SYSCALL_64(59, stub_execve, stub_execve)
1 sys_call_table_64.c 33 #define stub_execve sys_execve

0 syscalls_64.h       16 __SYSCALL_64(15, stub_rt_sigreturn, stub_rt_sigreturn)
1 sys_call_table_64.c 34 #define stub_rt_sigreturn sys_rt_sigreturn

0 syscalls_64.h 309 __SYSCALL_X32(513, stub_x32_rt_sigreturn, stub_x32_rt_sigreturn)

0 syscalls_64.h 316 __SYSCALL_X32(520, stub_x32_execve, stub_x32_execve)


E) interrupt and "irq_entries_start"
ENTRY(interrupt)
ENTRY(irq_entries_start)
In native_init_IRQ the interrupt array is used to initialize the IDT with defaults :
set_intr_gate(i, interrupt[i - FIRST_EXTERNAL_VECTOR]);
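
A minimal sketch of how such an initialization loop could look (the loop bounds and the used_vectors bitmap follow older kernel sources; treat the exact form as an assumption):

int i;

/* Sketch: fill the IDT with default external-interrupt entries.
 * interrupt[] is the stub array from entry_64.S; vectors already
 * claimed elsewhere (e.g. the system call vector) are skipped. */
for (i = FIRST_EXTERNAL_VECTOR; i < NR_VECTORS; i++) {
	if (!test_bit(i, used_vectors))
		set_intr_gate(i, interrupt[i - FIRST_EXTERNAL_VECTOR]);
}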

F) Interrupt entry/exit
Now come the interrupt entry and exit functions. Under the common_interrupt label we see that the do_IRQ function is called.

If the interrupt arrived while we were in kernel mode, control jumps to retint_kernel, which invokes scheduling:

retint_kernel :
 preempt_schedule_irq
 void __sched preempt_schedule_irq(void)
  ...
  local_irq_enable();
  __schedule();
  local_irq_disable();
  ...

If the interrupt arrived while we were in USER mode, control jumps to retint_careful, which invokes scheduling:
retint_careful:
...
SCHEDULE_USER
...

G) ENTRY(call_softirq)
Calls softirqs on the interrupt stack. This function is called from do_softirq()

Wednesday, 29 August 2018

Linux system call implementation x86_64


Userspace invocation of a system call:

x86_64 user programs invoke a system call by putting the system call number (0 for read) into the RAX register and the other parameters into specific registers (RDI, RSI, RDX for the first 3 parameters), then issuing the x86_64 syscall instruction.
http://man7.org/linux/man-pages/man2/syscall.2.html
x86-64      syscall               rax        rax     -        [5]

This instruction causes the processor to transition to ring 0 and invoke the function referenced by the MSR_LSTAR model-specific register.
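
For illustration, a minimal userspace sketch that issues write(2) through the raw syscall instruction, following the register convention above (note that the instruction clobbers RCX and R11):

int main(void)
{
	const char msg[] = "hello via raw syscall\n";
	long ret;

	/* RAX = syscall number (1 = write), RDI/RSI/RDX = args 1-3 */
	asm volatile("syscall"
		     : "=a"(ret)
		     : "a"(1L), "D"(1L), "S"(msg), "d"(sizeof(msg) - 1)
		     : "rcx", "r11", "memory");
	return ret < 0;
}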

The MSR_LSTAR model-specific register is initialized at kernel boot:


/* May not be marked __init: used by software suspend */
void syscall_init(void)
{
	extern char _entry_trampoline[];
	extern char entry_SYSCALL_64_trampoline[];

	int cpu = smp_processor_id();
	unsigned long SYSCALL64_entry_trampoline =
		(unsigned long)get_cpu_entry_area(cpu)->entry_trampoline +
		(entry_SYSCALL_64_trampoline - _entry_trampoline);

	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
	if (static_cpu_has(X86_FEATURE_PTI))
		wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
	else
		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);

#ifdef CONFIG_IA32_EMULATION
	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
	/*
	 * This only works on Intel CPUs.
	 * On AMD CPUs these MSRs are 32-bit, CPU truncates MSR_IA32_SYSENTER_EIP.
	 * This does not cause SYSENTER to jump to the wrong location, because
	 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
	 */
	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_entry_stack(cpu) + 1));
	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
#else
	wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
#endif

	/* Flags to clear on syscall */
	wrmsrl(MSR_SYSCALL_MASK,
	       X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
	       X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
}

system_call is the function which gets called for each system call.
The entry_64.S file has the definition of the system_call function.
The system_call code pushes the registers onto the kernel stack and calls the function pointer at index RAX in the sys_call_table.

ENTRY(system_call)
...
movq %r10,%rcx /* fixup for C */
call *sys_call_table(,%rax,8)
movq %rax,RAX-ARGOFFSET(%rsp)
...
END(system_call)
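
In C terms, that indexed call amounts to the sketch below; the six-argument function-pointer type and the do_one_syscall wrapper are illustrative assumptions, not kernel code:

typedef long (*sys_call_ptr_t)(long, long, long, long, long, long);
extern const sys_call_ptr_t sys_call_table[];

/* What `call *sys_call_table(,%rax,8)` does: index an array of
 * 8-byte function pointers by the syscall number held in RAX. */
long do_one_syscall(unsigned long nr, long a1, long a2, long a3,
		    long a4, long a5, long a6)
{
	return sys_call_table[nr](a1, a2, a3, a4, a5, a6);
}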


Apart from the control transfer done by SYSCALL, the privilege level also changes:

Privilege levels (CPL DPL RPL):

x86 systems have a feature called privilege levels. It restricts memory access, I/O port access and the ability to execute certain machine instructions. For the kernel this privilege level is 0 and for user space programs it is 3. Executing code cannot change its own privilege level directly; a change in privilege level is done through the lcall, int, lret and iret instructions. The privilege level is raised by the lcall and int instructions and lowered by the lret and iret instructions. This explains the use of int 0x80 while executing a system call: it is this int instruction which elevates the privilege level from user space (ring 3) to kernel space (ring 0).

The LDTR, GDTR and IDTR are also switched; the RPL and DPL are defined by the segment selectors and descriptors these registers refer to.
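
The current privilege level (CPL) is just the low two bits of the CS selector. A small userspace sketch (x86 inline asm, assuming GCC) to confirm we run at ring 3:

#include <stdio.h>

int main(void)
{
	unsigned short cs;

	/* the low 2 bits of the CS selector are the current privilege level */
	asm("mov %%cs, %0" : "=r"(cs));
	printf("CPL = %d\n", cs & 3);   /* prints 3 in user space */
	return 0;
}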

Tuesday, 14 August 2018

Linux internals segmentation fault generation


A Linux process has many sections; broadly speaking these are text, data, stack, heap, etc.
When a Linux process is created, these sections are formed and virtual memory is allocated to them.

How these sections are represented in the Linux kernel:

Each of these sections is represented by a vm_area_struct in Linux.

The Linux task_struct holds the memory descriptor mm_struct. The mm_struct has a vm_area_struct for each of these sections.

All of these sections are accessed through virtual addresses.

The MMU diagram at https://en.wikipedia.org/wiki/Memory_management_unit shows how the CPU/MMU translates virtual memory to physical memory.

The MMU part of the CPU checks whether the virtual-to-physical translation is present in the TLB cache.
If no translation exists in the TLB, the page tables are walked; if no valid mapping is found there either, a page fault exception is raised.

As a result of the page fault exception, the page fault handler is called.

In Linux the page fault handler is the do_page_fault function. In do_page_fault the faulting address is read from the CR2 register: read_cr2();

__do_page_fault is then called. This function checks whether a vm_area_struct is present for the faulting address.

If the faulting address falls within a VMA, the good_area handling is invoked; otherwise control goes to the bad_area path:

bad_area path :
Eventually a SIGSEGV is generated in the __bad_area_nosemaphore function:
tsk->thread.cr2 = address;
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = X86_TRAP_PF;

force_sig_info_fault(SIGSEGV, si_code, address, tsk, 0);


good_area handling :
If the faulting address turns out to be valid, the handle_mm_fault function is called to allocate a new page frame, similar to the demand paging explained in the Memory Initialization topic.
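
From userspace, the SIGSEGV delivered by this path carries the faulting address (the value the kernel saved from CR2) in siginfo. A minimal sketch to observe it:

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void segv_handler(int sig, siginfo_t *si, void *ctx)
{
	/* si->si_addr is the faulting address recorded by the kernel */
	fprintf(stderr, "SIGSEGV at address %p\n", si->si_addr);
	_exit(1);
}

int main(void)
{
	struct sigaction sa = { 0 };

	sa.sa_flags = SA_SIGINFO;
	sa.sa_sigaction = segv_handler;
	sigaction(SIGSEGV, &sa, NULL);

	/* an address with no VMA: do_page_fault -> bad_area -> SIGSEGV */
	*(volatile char *)0xdeadbeef = 1;
	return 0;
}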

Sunday, 5 August 2018

Linux Kernel NVMe driver


In this blog we will go through the Linux NVMe kernel driver.

The NVMe kernel driver has an nvme_id_table. As with other PCI devices, this table lists the Vendor and Device IDs this driver supports.
When the driver is inserted, the nvme_init function registers this id_table with the PCI subsystem.

When a matching device is found, the probe function of this driver is called.

nvme_probe is the driver probe function.
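
The registration follows the standard pci_driver pattern. A minimal sketch, with demo_* placeholder names rather than the actual nvme.c symbols:

#include <linux/module.h>
#include <linux/pci.h>

static int demo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	/* map BARs, allocate queues, etc. -- what nvme_probe does */
	return 0;
}

static void demo_remove(struct pci_dev *pdev)
{
}

static const struct pci_device_id demo_id_table[] = {
	/* match any device with the NVMe class/programming interface */
	{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
	{ 0, }
};
MODULE_DEVICE_TABLE(pci, demo_id_table);

static struct pci_driver demo_driver = {
	.name     = "demo_nvme",
	.id_table = demo_id_table,
	.probe    = demo_probe,    /* called when a matching device is found */
	.remove   = demo_remove,
};
module_pci_driver(demo_driver);

MODULE_LICENSE("GPL");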

nvme_dev is allocated along with its queues. Though NVMe supports up to 64K queues, only as many queues are created as there are CPUs in the system.

nvme_dev_map does the ioremap of the controller registers (base + 4096):

enum {
	NVME_REG_CAP    = 0x0000, /* Controller Capabilities */
	NVME_REG_VS     = 0x0008, /* Version */
	NVME_REG_INTMS  = 0x000c, /* Interrupt Mask Set */
	NVME_REG_INTMC  = 0x0010, /* Interrupt Mask Clear */
	NVME_REG_CC     = 0x0014, /* Controller Configuration */
	NVME_REG_CSTS   = 0x001c, /* Controller Status */
	NVME_REG_NSSR   = 0x0020, /* NVM Subsystem Reset */
	NVME_REG_AQA    = 0x0024, /* Admin Queue Attributes */
	NVME_REG_ASQ    = 0x0028, /* Admin SQ Base Address */
	NVME_REG_ACQ    = 0x0030, /* Admin CQ Base Address */
	NVME_REG_CMBLOC = 0x0038, /* Controller Memory Buffer Location */
	NVME_REG_CMBSZ  = 0x003c, /* Controller Memory Buffer Size */
	NVME_REG_DBS    = 0x1000, /* SQ 0 Tail Doorbell */
};


Here we can see the registers describing the admin queue of the NVMe device: its attributes and the admin submission and completion queue base addresses.

Now the NVMe driver creates DMA pools of Physical Region Pages. The prp_page_pool and prp_small_pool are the pools created for the I/Os; this is done in the nvme_setup_prp_pools function. From LDD3: "A DMA pool is an allocation mechanism for small, coherent DMA mappings. Mappings obtained from dma_alloc_coherent may have a minimum size of one page. If your device needs smaller DMA areas than that, you should probably be using a DMA pool."
Function : nvme_setup_prp_pools
creates the pools using dma_pool_create
These DMA pools are then used to allocate memory via dma_pool_alloc while doing I/Os in the nvme_setup_prps function.
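
A hedged sketch of the DMA pool API in play here; the pool name and sizes are illustrative, not the exact values used by nvme_setup_prp_pools:

#include <linux/dmapool.h>

/* sketch: allocate and free one PRP-list buffer from a DMA pool */
static int demo_prp_pool(struct device *dev)
{
	struct dma_pool *pool;
	dma_addr_t dma;
	void *prp_list;

	/* a pool of page-sized, page-aligned coherent buffers */
	pool = dma_pool_create("prp list page", dev, PAGE_SIZE, PAGE_SIZE, 0);
	if (!pool)
		return -ENOMEM;

	/* one buffer for a PRP list; dma receives the device-visible address */
	prp_list = dma_pool_alloc(pool, GFP_KERNEL, &dma);
	if (prp_list)
		dma_pool_free(pool, prp_list, dma);

	dma_pool_destroy(pool);
	return 0;
}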

Next the NVMe driver moves to the nvme_init_ctrl function to initialize the controller structures.
Three work items are initialized :
INIT_WORK(&ctrl->scan_work, nvme_scan_work);
INIT_WORK(&ctrl->async_event_work, nvme_async_event_work);
INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work);
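
When such a work item later runs, the handler recovers its controller with container_of. A generic sketch of the pattern (the demo_* names are placeholders):

#include <linux/workqueue.h>

struct demo_ctrl {
	struct work_struct scan_work;
	/* ... other controller state ... */
};

static void demo_scan_work(struct work_struct *work)
{
	/* recover the controller that owns this work item */
	struct demo_ctrl *ctrl = container_of(work, struct demo_ctrl, scan_work);

	/* ... scan namespaces for ctrl ... */
}

/* setup, e.g. in probe:   INIT_WORK(&ctrl->scan_work, demo_scan_work); */
/* later, to trigger work: queue_work(system_wq, &ctrl->scan_work);    */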


device_create_with_groups creates the sysfs entries.

At the end, nvme_probe schedules reset_work.

====================================================
nvme_reset_work function : 
====================================================
Calls nvme_pci_enable
calls pci_enable_device_mem --> enables the device memory
pci_set_master --> makes the PCI device the bus master
Reads the PCI device capability register (NVME_REG_CAP) and determines the io_queue_depth
Checks if the CMB is enabled by reading NVME_REG_VS and ioremaps the CMB

Note about CMB :
The Controller Memory Buffer (CMB) is a region of general purpose read/write memory on the controller that may be used for a variety of purposes. The controller indicates which purposes the memory may be used for by setting support flags in the CMBSZ register.
Submission Queues in host memory require the controller to perform a PCI Express read from host memory in order to fetch the queue entries. Submission Queues in controller memory enable host software to directly write the entire Submission Queue Entry to the controller's internal memory space, avoiding one read from the controller to the host.

Calls nvme_pci_configure_admin_queue
calls nvme_alloc_queue --> does the allocation for the admin queue; allocates the completion queue and the submission queue using dma_zalloc_coherent etc.
calls nvme_init_queue to initialize the submission queue tail and completion queue head. Also initializes the doorbell registers.
Configures the completion queue interrupt using queue_request_irq

Calls nvme_alloc_admin_tags
Allocates the block-mq tag set. The tag sets are used to force completion on the same CPU where the request was generated.
Also initializes admin_q using blk_mq_init_queue:
dev->ctrl.admin_q = blk_mq_init_queue
So a request_queue is initialized and blk_mq_make_request etc. are assigned to it.

Calls nvme_init_identify
calls nvme_identify_ctrl --> sends the nvme_admin_identify = 0x06 command to the controller -->
the command is submitted to the admin queue using __nvme_submit_sync_cmd
this function allocates a request and calls blk_execute_rq to wait for the command to return.
Based on the returned data, nvme_init_subnqn initializes the NQN.

Calls nvme_setup_io_queues
Sets the number of queues to the number of CPUs present
calls nvme_create_io_queues to create the queues
nvme_alloc_queue allocates a CQ and SQ for each CPU (same as done for the admin queue)
nvme_create_queue --> creates the CQ and SQ and registers the IRQ for each queue via queue_request_irq
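
queue_request_irq essentially registers a per-queue interrupt handler. A hedged sketch with assumed names (demo_*, and the vector lookup via pci_irq_vector):

#include <linux/interrupt.h>
#include <linux/pci.h>

/* sketch: attach an interrupt handler to one NVMe queue's MSI-X vector */
static int demo_queue_request_irq(struct pci_dev *pdev, void *queue,
				  irq_handler_t handler, int vector_idx)
{
	int irq = pci_irq_vector(pdev, vector_idx);

	/* handler runs on CQ interrupts; `queue` comes back as dev_id */
	return request_irq(irq, handler, IRQF_SHARED, "demo_nvme", queue);
}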

Calls nvme_start_ctrl
Starts nvme_queue_scan --> launches scan_work --> calls nvme_scan_work
   nvme_scan_work will first call nvme_identify_ctrl :
    it sends : 
c.identify.opcode = nvme_admin_identify;
c.identify.cns = NVME_ID_CNS_CTRL;
   based on the namespace count received from identify, a validate or a sequential scan is triggered
calls nvme_scan_ns_sequential
calls nvme_validate_ns
calls nvme_alloc_ns --> does blk_mq_init_queue --> sends the identify command
forms a disk name and creates a disk using device_add_disk --> this will create the /dev/ entries
The disk name is formed like :
sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->instance)
so it comes out like nvme0n1 etc.
nvme_identify_ns sends the following command :
c.identify.opcode = nvme_admin_identify;
c.identify.nsid = cpu_to_le32(nsid);
c.identify.cns = NVME_ID_CNS_NS;
nvme_report_ns_ids fills in the ns_id, eui, nguid and uuid



Let's see how the I/O path looks after all this is done by the Linux driver:

A stack trace captured with a probe at the nvme_setup_cmd function:

[105617.151466]  ? sched_clock_cpu+0x11/0xb0
[105617.155491]  ? __lock_acquire.isra.34+0x259/0xa90
[105617.160307]  blk_mq_try_issue_directly+0xbb/0x110
[105617.165118]  ? blk_mq_make_request+0x382/0x700
[105617.169675]  blk_mq_make_request+0x3b0/0x700
[105617.174045]  ? blk_mq_make_request+0x382/0x700
[105617.178593]  ? blk_queue_enter+0x6a/0x1f0
[105617.182714]  generic_make_request+0x11e/0x2f0
[105617.187176]  ? __lock_acquire.isra.34+0x259/0xa90
[105617.191988]  submit_bio+0x73/0x150
[105617.195494]  ? sched_clock+0x9/0x10
[105617.199083]  ? submit_bio+0x73/0x150
[105617.202757]  ? sched_clock_cpu+0x11/0xb0
[105617.206783]  do_mpage_readpage+0x489/0x7d0
[105617.210983]  ? I_BDEV+0x20/0x20
[105617.214235]  mpage_readpages+0x127/0x1b0
[105617.218260]  ? I_BDEV+0x20/0x20
[105617.221503]  ? I_BDEV+0x20/0x20
[105617.224750]  ? sched_clock_cpu+0x11/0xb0
[105617.228782]  blkdev_readpages+0x1d/0x20
[105617.232718]  __do_page_cache_readahead+0x223/0x2f0
[105617.237625]  ? find_get_entry+0xaf/0x120
[105617.241656]  force_page_cache_readahead+0x8e/0x100
[105617.246548]  ? force_page_cache_readahead+0x8e/0x100
[105617.251622]  page_cache_sync_readahead+0x42/0x50
[105617.256345]  generic_file_read_iter+0x646/0x800
[105617.260988]  ? sched_clock+0x9/0x10
[105617.264577]  ? sched_clock_cpu+0x11/0xb0
[105617.268605]  blkdev_read_iter+0x37/0x40
[105617.272542]  __vfs_read+0xe2/0x140
[105617.276054]  vfs_read+0x96/0x140
[105617.279393]  SyS_read+0x58/0xc0
[105617.282644]  do_syscall_64+0x5a/0x190
[105617.286412]  entry_SYSCALL64_slow_path+0x25/0x25



The ftrace output looks like:
 0)               |      submit_bio() {
 0)               |        generic_make_request() {
 0)               |          generic_make_request_checks() {
 0)               |            _cond_resched() {
 0)   0.306 us    |              rcu_note_context_switch();
 0)   0.116 us    |              _raw_spin_lock();
 0)   0.130 us    |              update_rq_clock();
 0)               |              pick_next_task_fair() {
 ...
 ...
 ...
 0) * 72635.26 us |          }
 0) * 72636.55 us |        }
 0) * 72637.80 us |      }
 0)               |      blk_mq_make_request() {
 0)   0.343 us    |        blk_queue_bounce();
 0)   9.507 us    |        blk_queue_split();
 0)   0.637 us    |        bio_integrity_prep();
 0)               |        blk_attempt_plug_merge() {
 0)               |          blk_rq_merge_ok() {
 0)   0.316 us    |            blk_integrity_merge_bio();
 0)   1.817 us    |          }
 0)   0.153 us    |          blk_try_merge();
 0)               |          bio_attempt_back_merge() {
 0)   3.396 us    |            ll_back_merge_fn();
 0)   4.443 us    |          }
 0)   9.640 us    |        }
 0)               |        __blk_mq_sched_bio_merge() {
 0)   0.110 us    |          _raw_spin_lock();
 0)   1.883 us    |        }
 0)   1.193 us    |        wbt_wait();
 0)               |        blk_mq_get_request() {
 0)               |          blk_mq_get_tag() {
 0)   0.837 us    |            __blk_mq_get_tag();
 0)   2.130 us    |          }
 0)   0.130 us    |          __blk_mq_tag_busy();
 0)   5.217 us    |        }
 0)               |        blk_init_request_from_bio() {
 0)   0.346 us    |          blk_rq_bio_prep();
 0)   1.843 us    |        }
 0)               |        blk_account_io_start() {
 0)   0.640 us    |          disk_map_sector_rcu();
 0)               |          part_round_stats() {
 0)   0.160 us    |            part_round_stats_single();
 0)   1.060 us    |          }
 0)   3.720 us    |        }
 0)               |        blk_flush_plug_list() {
 0)               |          blk_mq_flush_plug_list() {
 0)   0.133 us    |            plug_ctx_cmp();
 0)               |            blk_mq_sched_insert_requests() {
 0)               |              blk_mq_insert_requests() {
 0)   0.113 us    |                _raw_spin_lock();
 0)   0.370 us    |                blk_mq_hctx_mark_pending.isra.29();
 0)   2.426 us    |              }
 0)               |              blk_mq_run_hw_queue() {
 0)               |                __blk_mq_delay_run_hw_queue() {
 0)               |                  __blk_mq_run_hw_queue() {
 0) ! 175.500 us  |                    blk_mq_sched_dispatch_requests();
 0) ! 177.577 us  |                  }
 0) ! 179.030 us  |                }
 0) ! 180.086 us  |              }
 0) ! 184.673 us  |            }
 0)               |            blk_mq_sched_insert_requests() {
 0)               |              blk_mq_insert_requests() {
 0)   0.113 us    |                _raw_spin_lock();
 0)   0.346 us    |                blk_mq_hctx_mark_pending.isra.29();
 0)   2.257 us    |              }
 0)               |              blk_mq_run_hw_queue() {
 0)               |                __blk_mq_delay_run_hw_queue() {
 0)               |                  __blk_mq_run_hw_queue() {
 0) ! 585.063 us  |                    blk_mq_sched_dispatch_requests();
 0) ! 587.033 us  |                  }
 0) ! 588.216 us  |                }
 0) ! 589.040 us  |              }
 0) ! 593.647 us  |            }
 0) ! 783.930 us  |          }
 0) ! 785.350 us  |        }
 0) ! 829.913 us  |      }
 0) @ 100015.7 us |    }

This is the flow of events after blk_mq_make_request:
It calls blk_mq_try_issue_directly to issue the I/O, which calls nvme_queue_rq to submit the request.
The nvme_queue_rq function calls nvme_setup_cmd and nvme_setup_rw to set up the read/write.
This sets the command type, namespace ID, LBA number etc.
nvme_map_data does the DMA mapping using :
blk_rq_map_sg
nvme_setup_prps
After these are done, __nvme_submit_cmd writes the command to the submission queue tail:

static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
			      struct nvme_command *cmd)
{
	u16 tail = nvmeq->sq_tail;

	if (nvmeq->sq_cmds_io)
		memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
	else
		memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));

	if (++tail == nvmeq->q_depth)
		tail = 0;
	if (nvme_dbbuf_update_and_check_event(tail, nvmeq->dbbuf_sq_db,
					      nvmeq->dbbuf_sq_ei))
		writel(tail, nvmeq->q_db);
	nvmeq->sq_tail = tail;
}



After the submission is done :
nvme_process_cq watches for the completion of the command. It reads the completion queue, fetches the command status and rings the doorbell.

static void nvme_process_cq(struct nvme_queue *nvmeq)
{
	struct nvme_completion cqe;
	int consumed = 0;

	while (nvme_read_cqe(nvmeq, &cqe)) {
		nvme_handle_cqe(nvmeq, &cqe);
		consumed++;
	}

	if (consumed)
		nvme_ring_cq_doorbell(nvmeq);
}


If direct issue is not used for the submission, the nvme_irq interrupt handler processes the completion queue instead:

static irqreturn_t nvme_irq(int irq, void *data)
{
	irqreturn_t result;
	struct nvme_queue *nvmeq = data;

	spin_lock(&nvmeq->q_lock);
	nvme_process_cq(nvmeq);
	result = nvmeq->cqe_seen ? IRQ_HANDLED : IRQ_NONE;
	nvmeq->cqe_seen = 0;
	spin_unlock(&nvmeq->q_lock);
	return result;
}


Error handling : 

void nvme_complete_rq(struct request *req)
{
	if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
		nvme_req(req)->retries++;
		blk_mq_requeue_request(req, true);
		return;
	}

	blk_mq_end_request(req, nvme_error_status(req));
}
EXPORT_SYMBOL_GPL(nvme_complete_rq);

static blk_status_t nvme_error_status(struct request *req)
{
	switch (nvme_req(req)->status & 0x7ff) {
	case NVME_SC_SUCCESS:
		return BLK_STS_OK;
	case NVME_SC_CAP_EXCEEDED:
		return BLK_STS_NOSPC;
	case NVME_SC_ONCS_NOT_SUPPORTED:
		return BLK_STS_NOTSUPP;
	case NVME_SC_WRITE_FAULT:
	case NVME_SC_READ_ERROR:
	case NVME_SC_UNWRITTEN_BLOCK:
	case NVME_SC_ACCESS_DENIED:
	case NVME_SC_READ_ONLY:
		return BLK_STS_MEDIUM;
	case NVME_SC_GUARD_CHECK:
	case NVME_SC_APPTAG_CHECK:
	case NVME_SC_REFTAG_CHECK:
	case NVME_SC_INVALID_PI:
		return BLK_STS_PROTECTION;
	case NVME_SC_RESERVATION_CONFLICT:
		return BLK_STS_NEXUS;
	default:
		return BLK_STS_IOERR;
	}
}


Monday, 25 June 2018

Linux hugepages internals


The Linux kernel documentation talks about hugepages:
https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt

Using the mount command we can see hugetlbfs mounted :
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,seclabel)
The hugetlbfs filesystem is mounted by Linux at boot.

The available hugepages can be seen using :
cat /proc/meminfo
...
...
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
...
...

We can add the hugepages using the following command :
echo 32 > /proc/sys/vm/nr_hugepages
OR sysctl -w vm.nr_hugepages=xx
...
...
HugePages_Total:      32
HugePages_Free:       31
HugePages_Rsvd:        1
HugePages_Surp:        0
Hugepagesize:       2048 kB
...
...

After adding them, we can use the hugepages with the following C program:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>

int main ()
{
        int fd = -1, i;
        char *buffer = NULL;
        int buffer_size = 1024;

        /* files created inside the hugetlbfs mount are backed by hugepages */
        fd = open("/dev/hugepages/my_map", O_CREAT | O_RDWR, 0755);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* the mapping length is rounded up to the hugepage size (2 MB here) */
        buffer = mmap(0, buffer_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (buffer == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        printf("mmapped the buffer = %p\n", buffer);

        /* touching the mapping triggers hugetlb_fault, allocating a hugepage */
        for (i = 0; i < 1000; i++)
                buffer[i] = 'S';
        return 0;
}

Now let's see what happens internally during these steps :

========================================================================
Adding the huge pages : 
========================================================================
We can add the hugepages using the following command :
echo 32 > /proc/sys/vm/nr_hugepages
sysctl -w vm.nr_hugepages=xx

As a result of these steps, let's see how the huge pages are allocated :

The hugetlb_sysctl_handler function is called when we echo to /proc/sys/vm/nr_hugepages

cd /sys/kernel/debug/tracing
cat /dev/null > trace
echo hugetlb_sysctl_handler > set_graph_function
echo 10 > max_graph_depth
echo function_graph > current_tracer
echo 1 > tracing_on
echo 32 > /proc/sys/vm/nr_hugepages
sleep 1
cp trace ~/trace_nr_hugepages_f_graph
echo 0 > tracing_on
echo > set_graph_function
echo 0 > max_graph_depth
cat /dev/null >  trace

hugetlb_sysctl_handler calls hugetlb_sysctl_handler_common,
which eventually calls set_max_huge_pages.
set_max_huge_pages calls alloc_fresh_huge_page 32 times.
alloc_fresh_huge_page_node uses alloc_pages to allocate memory from the buddy allocator
and also updates internal stats using prep_new_huge_page.


 2)               |  hugetlb_sysctl_handler() {
 2)               |    hugetlb_sysctl_handler_common() {
 ...
 ...
 2)               |      set_max_huge_pages() {
 2)   0.056 us    |        _raw_spin_lock();
 2)   0.044 us    |        _cond_resched();
 2)               |        alloc_fresh_huge_page() {
 2)               |          hstate_next_node_to_alloc.isra.47() {
 2)   0.038 us    |            get_valid_node_allowed();
 2)   0.187 us    |            next_node_allowed();
 2)   1.272 us    |          }
 2)               |          __alloc_pages_nodemask() {
 2)   0.038 us    |            _cond_resched();
 ...
 ...
 2)               |          prep_new_huge_page() {
 2)   0.039 us    |            _raw_spin_lock();
 2)               |            put_page() {
...
...

========================================================================
MMAP of huge page
========================================================================
fs/hugetlbfs/inode.c :

const struct file_operations hugetlbfs_file_operations = {
	.read              = hugetlbfs_read,
	.mmap              = hugetlbfs_file_mmap,
	.fsync             = noop_fsync,
	.get_unmapped_area = hugetlb_get_unmapped_area,
	.llseek            = default_llseek,
	.fallocate         = hugetlbfs_fallocate,
};

When mmap of a file on hugetlbfs is called, the hugetlbfs_file_mmap function is invoked.
Any mmap creates a VMA.

While the VMA is created :
vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND;
vma->vm_ops = &hugetlb_vm_ops;
and the length of the VMA is fetched :
len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
 3)               |  hugetlbfs_file_mmap() {
 3)               |    mutex_lock() {
 3)   0.242 us    |      _cond_resched();
 3)   0.773 us    |    }
 3)               |    touch_atime() {
 3)               |      current_fs_time() {
 3)   0.043 us    |        current_kernel_time();
 3)   0.409 us    |      }
 3)   0.904 us    |    }
 3)               |    hugetlb_reserve_pages() {
 3)               |      region_chg() {
 3)   0.100 us    |        _raw_spin_lock();
 3)               |        kmem_cache_alloc_trace() {
 3)   0.035 us    |          _cond_resched();
 3)   0.369 us    |        }
 3)   0.034 us    |        _raw_spin_lock();
 3)   1.643 us    |      }
 3)   0.104 us    |      hugepage_subpool_get_pages();
 3)               |      hugetlb_acct_memory() {
 3)   0.035 us    |        _raw_spin_lock();
 3)   0.650 us    |      }
 3)               |      region_add() {
 3)   0.035 us    |        _raw_spin_lock();
 3)   0.373 us    |      }
 3)   4.172 us    |    }
 3)   0.046 us    |    mutex_unlock();
 3)   7.342 us    |  }



========================================================================
Fault handling of hugepage:
========================================================================
When we write to the mmapped area, the fault handler is called. If the fault is on a hugetlb page, hugetlb_fault is called.

__handle_mm_fault calls :
if (unlikely(is_vm_hugetlb_page(vma)))
return hugetlb_fault(mm, vma, address, flags);

hugetlb_fault uses huge_pte_alloc to allocate a page table entry; internally this goes through __pud_alloc and pmd_alloc.
It then moves ahead and calls hugetlb_no_page, which calls alloc_huge_page.

alloc_huge_page may call __alloc_buddy_huge_page_with_mpol for further allocation; __hugetlb_alloc_buddy_huge_page calls alloc_pages_node
to allocate pages from the normal buddy allocator.

The page allocated is returned and passed to make_huge_pte. Here pte_mkhuge sets the _PAGE_PSE flag in the PTE.
For more about the Page Size Extension, please go through the following link:
https://en.wikipedia.org/wiki/Page_Size_Extension


 
cd /sys/kernel/debug/tracing
cat /dev/null > trace
echo hugetlb_fault > set_graph_function
echo 10 > max_graph_depth
echo function_graph > current_tracer
echo 1 > tracing_on
~/./a.out
sleep 1
cp trace ~/trace_hugetlb_fault_graph
echo 0 > tracing_on
echo > set_graph_function
echo 0 > max_graph_depth
cat /dev/null >  trace

 3)               |  hugetlb_fault() {
 3)   0.127 us    |    huge_pte_offset();
 3)               |    huge_pte_alloc() {
 3)               |      __pud_alloc() {
 3)               |        get_zeroed_page() {
 3)               |          __get_free_pages() {
 3)               |            alloc_pages_current() {
...
 3)   5.428 us    |        }
 3)   0.035 us    |        _raw_spin_lock();
 3)   6.151 us    |      }
 3)               |      huge_pmd_share() {
 3)   0.052 us    |        find_vma();
 3)               |        __pmd_alloc() {
 3)               |          alloc_pages_current() {
...
...
 3)   5.012 us    |        }
 3)   5.887 us    |      }
 3) + 14.981 us   |    }
 3)   0.149 us    |    hugetlb_fault_mutex_hash();
 3)               |    mutex_lock() {
 3)   0.035 us    |      _cond_resched();
 3)   0.483 us    |    }
 3)               |    find_lock_page() {
 3)               |      __find_lock_page() {
 3)   0.080 us    |        __find_get_page();
 3)   0.383 us    |      }
 3)   0.675 us    |    }
 3)               |    alloc_huge_page() {
 3)               |      __vma_reservation_common() {
 3)               |        region_chg() {
 3)   0.055 us    |          _raw_spin_lock();
 3)   0.035 us    |          kfree();
 3)   0.738 us    |        }
 3)   1.240 us    |      {
...
...
 3)               |        region_add() {
 3)   0.042 us    |          _raw_spin_lock();
 3)   0.418 us    |        }
 3)   0.767 us    |      }
 3) + 11.440 us   |    }
 3)               |    clear_huge_page() {
 3)               |    huge_add_to_page_cache() {
 3)               |      add_to_page_cache_locked() {
 3)               |        __add_to_page_cache_locked() {
...
...
 3)   0.035 us    |      page_waitqueue();
 3)   0.045 us    |      __wake_up_bit();
 3)   1.258 us    |    }
 3)   0.109 us    |    mutex_unlock();
 3) ! 1542.838 us |  }

Thursday, 14 June 2018

Linux native multipath internals

Linux native multipath provides multipathing capability in Linux. It gives the benefits of increased throughput and path failover.

Many details about it are given in https://www.kernel.org/doc/ols/2005/ols2005v1-pages-155-176.pdf

In this document we will go through the various user/kernel components of Linux native multipath.

The major user-space components are the libdevmapper shared library and the multipathd daemon. The kernel modules are dm-mod and dm-multipath. Below them, the lower-level device drivers libiscsi and scsi_transport_iscsi handle submission to the wire.


MULTIPATHD :

Let's first look at multipathd.

multipathd is installed on Linux systems as a service; starting the service launches the daemon (/etc/multipathd).


Like any C program, the main function of multipathd is invoked :
main --> calls child
--> init_checkers
--> add_checker adds the DIRECTIO checker
--> init_prio adds DEFAULT_PRIO
--> signal_init
--> set_oom_adj -- adjusts the OOM score

Then it spawns 4 pthreads :
1. ueventloop --- listens for UDEV events in the uevent_listen function and adds the udev events to a queue
2. uxlsnrloop --- uxsock_listen waits for commands from users
3. checkerloop --- does check_path for all the available paths
4. uevqloop --- dequeues the events pushed onto the queue

Basically, ueventloop listens for the events and uevqloop processes them.

uevqloop processes the UDEV event and eventually calls :
uev_trigger
--> uev_add_path is called for any ADD event
--> calls ev_add_path
--> calls domap

For a new device creation, the domap function calls dm_addmap_create
dm_addmap(DM_DEVICE_CREATE, TGT_MPATH, mpp, params, 1, ro);
which will call dm_task_create, a libdevmapper function. This is similar to running a dmsetup create command.
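
A hedged userspace sketch of the libdevmapper sequence behind such a create (the table values are shortened from the dmsetup example later in this post; error handling is minimal):

#include <libdevmapper.h>

/* sketch: create a device-mapper device, as dmsetup create would */
int demo_dm_create(const char *name, const char *params)
{
	struct dm_task *dmt;
	int ret = 0;

	dmt = dm_task_create(DM_DEVICE_CREATE);
	if (!dmt)
		return 0;

	/* device name plus one multipath target covering the whole range */
	if (dm_task_set_name(dmt, name) &&
	    dm_task_add_target(dmt, 0, 14680064, "multipath", params))
		ret = dm_task_run(dmt); /* fires the ioctl to /dev/mapper/control */

	dm_task_destroy(dmt);
	return ret;
}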

This creation triggers an ioctl into kernel space, issued to the /dev/mapper/control device.


Life of the device creation in the Linux kernel :

The Linux kernel has the following misc driver set up for this ioctl communication:

#ifdef CONFIG_COMPAT
static long dm_compat_ctl_ioctl(struct file *file, uint command, ulong u)
{
	return (long)dm_ctl_ioctl(file, command, (ulong) compat_ptr(u));
}
#else
#define dm_compat_ctl_ioctl NULL
#endif

static const struct file_operations _ctl_fops = {
	.open           = nonseekable_open,
	.unlocked_ioctl = dm_ctl_ioctl,
	.compat_ioctl   = dm_compat_ctl_ioctl,
	.owner          = THIS_MODULE,
	.llseek         = noop_llseek,
};

static struct miscdevice _dm_misc = {
	.minor    = MAPPER_CTRL_MINOR,
	.name     = DM_NAME,
	.nodename = DM_DIR "/" DM_CONTROL_NODE,
	.fops     = &_ctl_fops
};


dm_ctl_ioctl calls ctl_ioctl :
--> lookup_ioctl -- fetches the function corresponding to the device creation:
{DM_DEV_CREATE_CMD, IOCTL_FLAGS_NO_PARAMS, dev_create}
dev_create calls dm_create
--> alloc_dev
--> dm_init_md_queue(md);
--> md->disk->fops = &dm_blk_dops;

static const struct block_device_operations dm_blk_dops = {
	.open          = dm_blk_open,
	.release       = dm_blk_close,
	.ioctl         = dm_blk_ioctl,
	.direct_access = dm_blk_direct_access,
	.getgeo        = dm_blk_getgeo,
	.pr_ops        = &dm_pr_ops,
	.owner         = THIS_MODULE
};

Creation of the /dev/dm-X device :
Let's trace the creation of the /dev/dm-X device.
Our favourite commands for ftrace :)

cd /sys/kernel/debug/tracing
cat /dev/null >  trace
echo dev_create > set_graph_function
echo 10 > max_graph_depth
echo function_graph > current_tracer
echo 1 > tracing_on
dmsetup create my_dev --table "0 14680064 multipath 2 queue_if_no_path retain_attached_hw_handler 1 alua 2 1 service-time 0 1 2 65:176 1 1 service-time 0 1 2 65:192 1 1"
cp trace ~/create_dev_depth10
echo 0 > tracing_on
echo > set_graph_function
echo 0 > max_graph_depth
cat /dev/null >  trace

Here is the trace :
 1)               |  dev_create [dm_mod]() {
 1)               |    dm_create [dm_mod]() {
 1)   0.035 us    |      dm_use_blk_mq_default [dm_mod]();
 1)   0.041 us    |      __mutex_init();
 1)   0.040 us    |      __mutex_init();
 1)   0.037 us    |      __mutex_init();
 1)               |      blk_alloc_queue_node() {
 1)               |        blkcg_init_queue() {
 1)   0.037 us    |          _cond_resched();
 1)               |          blk_throtl_init() {
 1)               |            kmem_cache_alloc_node_trace() {
 1)   0.042 us    |              _cond_resched();
 1)   0.482 us    |            }
 1)   0.036 us    |            init_timer_key();
 1)               |            blkcg_activate_policy() {
 1)   0.038 us    |              blkcg_policy_enabled();
 1)               |              blkg_alloc() {
 1)               |      alloc_disk_node() {
  1)               |        device_initialize() {
 1)   0.035 us    |          __mutex_init();
 1)               |          device_pm_sleep_init() {
 1)   0.034 us    |            __init_waitqueue_head();
 1)               |            complete_all() {
 1)   0.036 us    |              _raw_spin_lock_irqsave();
 1)   0.045 us    |              __wake_up_common();
 1)   0.046 us    |              _raw_spin_unlock_irqrestore();
 1)   0.980 us    |            }
 1)   1.645 us    |          }
 1)               |          pm_runtime_init() {
 1)   0.035 us    |            init_timer_key();
 1)   0.035 us    |            __init_waitqueue_head();
 1)   0.732 us    |          }
 1)   9.326 us    |        }
 1) + 17.547 us   |      }
 1)   0.035 us    |      __init_waitqueue_head();
 1)   0.034 us    |      __init_waitqueue_head();
 1)   0.035 us    |      __init_waitqueue_head();
 1)               |      add_disk() { ---------- This will send UDEV of KOBJ_ADD for creation of /dev/dm-X device
 1)   0.044 us    |        blk_alloc_devt();
 1)               |        bdi_register_dev() {
 1)               |      bdget_disk() {
 1)               |        disk_get_part() {
 1)   0.075 us    |          get_device();
 1)   0.575 us    |        }
 1)               |        bdget() {
 1)               |          iget5_locked() {
 1)   0.130 us    |            _raw_spin_lock();
 1)   0.285 us    |            find_inode();
 1)               |            alloc_inode() {
 1)               |      dm_stats_init [dm_mod]() {
 1)   0.035 us    |        __mutex_init();
 1)               |        __alloc_percpu() {
 1)               |          pcpu_alloc() {
 1)               |      dm_sysfs_init [dm_mod]() {
 1)   0.035 us    |        dm_disk [dm_mod]();
 1)   0.036 us    |        dm_kobject [dm_mod]();


{DM_TABLE_LOAD_CMD, 0, table_load}
The table_load function is called for initialising the device:
r = dm_setup_md_queue(md, t);
dm_setup_md_queue calls dm_old_init_request_queue.
Here the request_fn for the device-mapper device is completely initialized:
blk_init_allocated_queue assigns dm_request_fn
q->request_fn = rfn;
make_request_fn is initialised to blk_queue_bio

Now the function dm_init_normal_md_queue is called. This function initialises the softirq_done_fn with dm_softirq_done.
Also prep_rq_fn is initialized to dm_old_prep_fn.
At last this function calls elv_register_queue.

Yeah, why just read the code flow when we can ftrace it!! Here we go with the ftrace calls for table_load.

 1)               |  table_load [dm_mod]() {
 1)               |    find_device [dm_mod]() {
 1)               |      down_read() {
 1)   0.073 us    |        _cond_resched();
 1)   0.583 us    |      }
 1)               |      __find_device_hash_cell [dm_mod]() {
 1)               |        __get_name_cell [dm_mod]() {
 1)   0.123 us    |          dm_get [dm_mod]();
 1)   0.757 us    |        }
 1)   1.399 us    |      }
 1)   0.045 us    |      up_read();
 1)   3.060 us    |    }
 1)               |    dm_table_create [dm_mod]() {
 1)               |      kmem_cache_alloc_trace() {
 1)   0.040 us    |        _cond_resched();
 1)   0.499 us    |      }
 1)               |    dm_table_add_target [dm_mod]() {
 1)               |      dm_get_target_type [dm_mod]() {
 1)               |        get_target_type [dm_mod]() {
 1)               |          down_read() {
 1)   0.042 us    |            _cond_resched();
 1)   0.521 us    |          }
 1)   0.310 us    |          try_module_get();
 1)   0.045 us    |          up_read();
 1)   3.042 us    |        }
 1)   3.575 us    |      }
 1)               |      dm_split_args [dm_mod]() {
 1)               |        realloc_argv [dm_mod]() {
 1)               |          __kmalloc() {
 1)   0.058 us    |            kmalloc_slab();
 1)   0.040 us    |            _cond_resched();
 1)   0.776 us    |          }
 1)   0.051 us    |          kfree();
 1)   1.684 us    |        }
 1)   4.387 us    |      }
 1)               |      multipath_ctr [dm_multipath]() {
 1)   0.121 us    |        dm_get_reserved_rq_based_ios [dm_mod]();
 1)               |        kmem_cache_alloc_trace() {
 1)   0.041 us    |          _cond_resched();
 1)   0.416 us    |        }
 1)               |        ql_create [dm_queue_length]() {
 1)               |          kmem_cache_alloc_trace() {
 1)   0.040 us    |            _cond_resched();
 1)   0.475 us    |          }
 1)   0.807 us    |        }
 1)               |        ql_add_path [dm_queue_length]() {
 1)               |          kmem_cache_alloc_trace() {
 1)   0.042 us    |            _cond_resched();
 1)   0.391 us    |          }
 1)   0.951 us    |        }
 1)   0.041 us    |        dm_consume_args [dm_mod]();
 1) ! 539.424 us  |      }
 1)   0.134 us    |      kfree();
 1) ! 549.508 us  |    }
 0)   0.066 us    |    dm_get_immutable_target_type [dm_mod]();
 0)   0.093 us    |    dm_get_md_type [dm_mod]();
 0)   0.049 us    |    dm_table_get_type [dm_mod]();
 0)   0.044 us    |    dm_set_md_type [dm_mod]();
 0)               |    dm_setup_md_queue [dm_mod]() {
 0)               |      blk_init_allocated_queue() {
 0)               |        kmem_cache_alloc_trace() {
 0)   0.042 us    |          _cond_resched();
 0)   0.571 us    |        }
 0)               |        blk_init_rl() {
 0)               |        elevator_init() {
 0)               |          elevator_get() {
 0)   0.087 us    |            _raw_spin_lock();
 0)   0.391 us    |            elevator_find();
 0)   0.051 us    |            try_module_get();
 0)   0.040 us    |            _raw_spin_unlock();
 0)   1.862 us    |          }
 0)               |          deadline_init_queue() {
 0)               |            elevator_alloc() {
 0)               |              kmem_cache_alloc_node_trace() {
 0)   0.041 us    |                _cond_resched();
 0)   0.648 us    |              }
 0)   0.042 us    |              __mutex_init();
 0)   1.568 us    |            }
 0)               |            kmem_cache_alloc_node_trace() {
 0)   0.041 us    |              _cond_resched();
 0)   0.525 us    |            }
 0)   0.044 us    |            _raw_spin_lock_irq();
 0)   3.691 us    |          }
 0) + 45.638 us   |        }
 0)   0.047 us    |        mutex_unlock();
 0) + 62.761 us   |      }
 0)               |      dm_init_md_queue [dm_mod]() {
 0)               |        blk_queue_make_request() {
 0)   0.042 us    |          blk_queue_congestion_threshold();
 0)   0.044 us    |          blk_queue_bounce_limit();
 0)   0.721 us    |        }
 0)   0.044 us    |        blk_queue_bounce_limit();
 0)   0.042 us    |        blk_queue_merge_bvec();
 0)   1.827 us    |      }
 0)   0.043 us    |      blk_queue_softirq_done();
 0)   0.042 us    |      blk_queue_prep_rq();
 0)   0.042 us    |      blk_queue_lld_busy();
 0)               |      elv_register_queue() {
 0)               |    down_write() {
 0)   0.040 us    |      _cond_resched();
 0)   0.407 us    |    }
 0)   0.040 us    |    dm_get_mdptr [dm_mod]();
 0)   0.047 us    |    up_write();
 0)               |    __dev_status [dm_mod]() {
 0)   0.040 us    |      dm_disk [dm_mod]();
 0)   0.040 us    |      dm_suspended_md [dm_mod]();
 0)   0.040 us    |      dm_test_deferred_remove_flag [dm_mod]();
 0)   0.042 us    |      dm_open_count [dm_mod]();
 0)   0.047 us    |      dm_get_event_nr [dm_mod]();
 0)               |      dm_get_live_table [dm_mod]() {
 0)   0.134 us    |        __srcu_read_lock();
 0)   0.478 us    |      }
 0)               |      dm_put_live_table [dm_mod]() {
 0)   0.069 us    |        __srcu_read_unlock();
 0)   0.423 us    |      }
 0)   3.391 us    |    }
 0)   0.045 us    |    dm_put [dm_mod]();
 0) ! 3115.589 us |  } /* table_load [dm_mod] */


IO Flow:
Why are we doing all this? Yes, for the I/O flow.

Let's see how the I/O flows through our devmapper device. Now I am tracing dm_request_fn.

 0)               |  dm_request_fn [dm_mod]() {
 0)   0.240 us    |    __srcu_read_lock();
 0)               |    blk_peek_request() {
 0)               |      noop_dispatch() {
 0)   0.150 us    |        elv_dispatch_sort();
 0)   0.786 us    |      }
 0)               |      dm_prep_fn [dm_mod]() {
 0)               |        mempool_alloc() {
 0)               |          mempool_alloc_slab() {
 0)   0.412 us    |            kmem_cache_alloc();
 0)   1.019 us    |          }
 0)   1.589 us    |        }
 0)               |        blk_rq_prep_clone() {
 0)   0.280 us    |          blk_rq_init();
  0)   0.160 us    |    dm_table_find_target [dm_mod]();
 0)               |    multipath_busy [dm_multipath]() {
 0)   0.157 us    |      _raw_spin_lock_irqsave();
 0)               |      dm_underlying_device_busy [dm_mod]() {
 0)               |        blk_lld_busy() {
 0)   0.193 us    |          scsi_lld_busy();
 0)   1.056 us    |        }
 0)   1.905 us    |      }
 0)   0.096 us    |      _raw_spin_unlock_irqrestore();
 0)   5.042 us    |    }
 0)               |    dm_start_request [dm_mod]() {
 0)               |      blk_start_request() {
 0)   0.160 us    |        blk_dequeue_request();
 0)               |        blk_add_timer() {
 0)   0.180 us    |          __blk_add_timer();
 0)   1.032 us    |        }
 0)   2.910 us    |      }
 0)   0.150 us    |      dm_get [dm_mod]();
 0)   4.795 us    |    }
 0)   0.110 us    |    _raw_spin_unlock();
 0)               |    multipath_map [dm_multipath]() {
 0)               |      mempool_alloc() {
 0)               |        mempool_alloc_slab() {
 0)   0.193 us    |          kmem_cache_alloc();
 0)   1.152 us    |        }
 0)   2.015 us    |      }
 0)               |      map_io [dm_multipath]() {
 0)   0.107 us    |        _raw_spin_lock_irqsave();
 0)   0.136 us    |        ql_start_io [dm_queue_length]();
 0)   0.140 us    |        _raw_spin_unlock_irqrestore();
 0)   3.407 us    |      }
 0)   7.233 us    |    }
 0)               |    dm_dispatch_request [dm_mod]() {
 0)               |      blk_insert_cloned_request() {
 0)   0.120 us    |        blk_rq_check_limits();
 0)   0.103 us    |        _raw_spin_lock_irqsave();
 0)               |        blk_account_io_start() {
 0)   0.197 us    |          disk_map_sector_rcu();
 0)               |          part_round_stats() {
 0)   0.230 us    |            part_round_stats_single();
 0)   1.072 us    |          }
 0)   2.854 us    |        }
 0)               |        __elv_add_request() {
 0)               |          elv_drain_elevator() {
 0)   0.183 us    |            deadline_dispatch_requests();
 0)   1.003 us    |          }
 0)               |          __blk_run_queue() {
 0)               |            scsi_request_fn() {
 0)   0.277 us    |              get_device();
 0)               |              blk_peek_request() {
 0)               |                scsi_prep_fn() {
 0)               |                  scsi_get_command() {
 0)   0.100 us    |                    get_device();
 0)               |                    __scsi_get_command() {
 0)               |                      scsi_host_alloc_command() {
 0)               |                        scsi_pool_alloc_command() {
 0)   0.399 us    |                          kmem_cache_alloc();
 0)               |                          kmem_cache_alloc() {
 0)   0.922 us    |                            __slab_alloc();
 0)   1.871 us    |                          }
 0)   3.830 us    |                        }
 0)   4.778 us    |                      }
 0)   5.584 us    |                    }
 0)   0.163 us    |                    init_timer_key();
 0)   0.107 us    |                    _raw_spin_lock_irqsave();
 0)   0.110 us    |                    _raw_spin_unlock_irqrestore();
 0)   9.678 us    |                  }
 0)               |                  sd_init_command [sd_mod]() {
 0)               |                    scsi_setup_fs_cmnd() {
 0)   0.124 us    |                      alua_prep_fn();
 0)               |                      scsi_init_io() {
 0)               |                        scsi_init_sgtable() {
 0)               |                          scsi_alloc_sgtable() {
 0)               |                            scsi_sg_alloc() {
 0)               |                              mempool_alloc() {
 0)               |                                mempool_alloc_slab() {
 0)   0.300 us    |                                  kmem_cache_alloc();
 0)   1.112 us    |                                }
 0)   2.161 us    |                              }
 0)   2.964 us    |                            }
 0)   4.269 us    |                          }
0)               |              blk_start_request() {
 0)   0.210 us    |                blk_dequeue_request();
 0)               |                blk_add_timer() {
 0)               |                  __blk_add_timer() {
 0)   0.113 us    |                    round_jiffies_up();
 0)   1.482 us    |                  }
 0)   2.410 us    |                }
 0)   4.226 us    |              }
 0)   0.093 us    |              _raw_spin_unlock();
 0)   0.087 us    |              _raw_spin_lock();
 0)   0.403 us    |              scsi_init_cmd_errh();
 0)               |              scsi_dispatch_cmd() {
 0)   0.230 us    |                scsi_log_send();
 0)               |                iscsi_queuecommand [libiscsi]() {
 0)               |                  _raw_spin_lock_bh() {
 0)   0.074 us    |                    local_bh_disable();
 0)   0.702 us    |                  }
 0)               |                  iscsi_session_chkready [scsi_transport_iscsi]() {
 0)   0.086 us    |                    _raw_spin_lock_irqsave();
 0)   0.187 us    |                    _raw_spin_unlock_irqrestore();
 0)   1.562 us    |                  }
 0)   0.123 us    |                  queue_work_on();
 0)               |                  _raw_spin_unlock_bh() {
 0)   0.246 us    |                    local_bh_enable_ip();
 0)   0.839 us    |                  }
 0) + 17.190 us   |                }
 0) + 19.251 us   |              }
 0)   0.077 us    |              _raw_spin_lock_irq();
 0)               |              blk_peek_request() {
 0)   0.179 us    |                deadline_dispatch_requests();
 0)   0.873 us    |              }
 0)   0.130 us    |              put_device();
 0)   0.080 us    |              _raw_spin_lock_irq();
 0) + 87.287 us   |            }
 0) + 87.941 us   |          }
 0) + 90.672 us   |        }
 0)   0.096 us    |        _raw_spin_unlock_irqrestore();
 0) + 97.425 us   |      }
 0) + 98.507 us   |    }
 0)   0.077 us    |    _raw_spin_lock();
 0)               |    blk_peek_request() {
 0)   0.259 us    |      noop_dispatch();
 0)   1.063 us    |    }
 0)               |    blk_delay_queue() {
 0)   0.080 us    |      msecs_to_jiffies();
 0)   0.163 us    |      queue_delayed_work_on();
 0)   1.305 us    |    }
 0)   0.169 us    |    __srcu_read_unlock();
 0) ! 290.409 us  |  }


Wednesday, 13 June 2018

Linux kernel crash dump analysis

In this blog we will see how to analyse a deadlock in the Linux kernel using the crash tool.

We have written a simple kernel module to generate a deadlock:
#include<linux/module.h>
#include<linux/version.h>
#include<linux/kernel.h>
#include<linux/semaphore.h>

#include<linux/kthread.h>
#include<linux/sched.h>
#include<linux/delay.h>
#include<linux/slab.h>

struct semaphore sync1;
struct semaphore sync2;
struct task_struct *task1;
struct task_struct *task2;
int shared_var;
int data;

int thread_function_one(void *data)
{
    int ret = 10;
    int cpu1;
    printk(KERN_INFO "IN THREAD FUNCTION 1 \n");

    while(!kthread_should_stop()){
       down(&sync1);
       cpu1 = get_cpu();
       put_cpu();
       printk("t1 cpu = %d shared_var = %d\n",cpu1,shared_var);
       msleep(1000);
       down(&sync2);
       up(&sync2);
       up(&sync1);
    }
    printk(KERN_INFO "EXIT from thread function 1\n");
    return ret;
}

int thread_function_two(void *data)
{
    int ret = 10;
    int cpu2;
    printk(KERN_INFO "IN THREAD FUNCTION 2 \n");

    while(!kthread_should_stop()){
        down(&sync2);
        cpu2 = get_cpu();
        put_cpu();
        printk("t2 cpu = %d shared_var = %d\n",cpu2,shared_var);
        msleep(2000);
        down(&sync1);
        up(&sync1);
        up(&sync2);
    }
    printk(KERN_INFO "EXIT from thread function 2\n");
    return ret;
}

static int kernel_init(void)
{
   int cpu3;
   sema_init(&sync1, 1);
   sema_init(&sync2, 1);
   printk(KERN_INFO "module_init address of sync1 = %p sync2 = %p\n",&sync1, &sync2);

   cpu3 = get_cpu();
   put_cpu();
   printk("main thread cpu = %d \n",cpu3);

   shared_var = 0;
   task1 = kthread_create(&thread_function_one,(void *)&data,"one");
   kthread_bind(task1, cpu3);
   wake_up_process(task1);

   cpu3 = 3;
   task2 = kthread_create(&thread_function_two,(void *)&data,"two");
   kthread_bind(task2, cpu3);
   wake_up_process(task2);

   return 0;
}

static void kernel_exit(void)
{
   kthread_stop(task1);
   kthread_stop(task2);
   printk(KERN_INFO "module_exit\n");
}

module_init(kernel_init);
module_exit(kernel_exit);

MODULE_AUTHOR("K_K");
MODULE_DESCRIPTION("SIMPLE MODULE");
MODULE_LICENSE("GPL");


Here we launch 2 kernel threads and make them take the 2 semaphores:

In thread "one" the semaphores are taken in the order sync1 --> sync2 and released in the order sync2 --> sync1:
       down(&sync1);
       cpu1 = get_cpu();
...
...
       down(&sync2);
       up(&sync2);
       up(&sync1);

In thread "two" we take the semaphores in the opposite order, sync2 --> sync1, and release them in the order sync1 --> sync2:
        down(&sync2);
        cpu2 = get_cpu();
...
...
        down(&sync1);
        up(&sync1);
        up(&sync2);

If we load this module we see that these 2 threads get entangled in a deadlock. This is a classic case of an ABBA deadlock.
(The solution for this is correct lock ordering.)
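
For illustration, a sketch of the fix: make thread two acquire the semaphores in the same global order as thread one (only the acquisition order matters; the release order is harmless):

    /* sketch of the corrected loop in thread two: both threads now
     * take sync1 before sync2, so the ABBA cycle cannot form */
    while(!kthread_should_stop()){
        down(&sync1);          /* same acquisition order as thread one */
        down(&sync2);
        cpu2 = get_cpu();
        put_cpu();
        printk("t2 cpu = %d shared_var = %d\n",cpu2,shared_var);
        msleep(2000);
        up(&sync2);
        up(&sync1);
    }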

The output from dmesg shows the following :

Dec 12 02:49:25 localhost kernel: device-mapper: multipath: kundan message: error getting device 8:0
Dec 12 02:49:25 localhost kernel: threads: module verification failed: signature and/or required key missing - tainting kernel
Dec 12 02:49:25 localhost kernel: module_init address of sync1 = ffffffffa0536280 sync2 = ffffffffa0536260
Dec 12 02:49:25 localhost kernel: main thread cpu = 1
Dec 12 02:49:25 localhost kernel: IN THREAD FUNCTION 1
Dec 12 02:49:25 localhost kernel: t1 cpu = 1 shared_var = 0
Dec 12 02:49:25 localhost kernel: IN THREAD FUNCTION 2
Dec 12 02:49:25 localhost kernel: t2 cpu = 3 shared_var = 0

The ps command also shows the 2 kernel threads in UNINTERRUPTIBLE SLEEP :
ps :
   3580      2   1  ffff88007b1aa220  UN   0.0       0      0  [one]
   3581      2   3  ffff88007b1ac440  UN   0.0       0      0  [two]

We can get the backtraces of these threads with the following command :
echo t > /proc/sysrq-trigger

It generates a lot of output in the kernel logs: it dumps the stack of every kernel process there.

For our 2 threads we get the following call stacks :
Dec 12 03:21:59 localhost kernel: one             D ffff88007fc94580     0  3580      2 0x00000080
Dec 12 03:21:59 localhost kernel: ffff880079573d50 0000000000000046 ffff880079573fd8 0000000000014580
Dec 12 03:21:59 localhost kernel: ffff880079573fd8 0000000000014580 ffff88007b1aa220 ffffffffa0536260
Dec 12 03:21:59 localhost kernel: 7fffffffffffffff ffff88007b1aa220 0000000000000002 0000000000000000
Dec 12 03:21:59 localhost kernel: Call Trace:
Dec 12 03:21:59 localhost kernel: [<ffffffff815f1a39>] schedule+0x29/0x70
Dec 12 03:21:59 localhost kernel: [<ffffffff815ef939>] schedule_timeout+0x209/0x2d0
Dec 12 03:21:59 localhost kernel: [<ffffffff8106e37b>] ? lock_timer_base.isra.35+0x2b/0x50
Dec 12 03:21:59 localhost kernel: [<ffffffff8106f06e>] ? try_to_del_timer_sync+0x5e/0x90
Dec 12 03:21:59 localhost kernel: [<ffffffff815f0f32>] __down_common+0xd2/0x14a
Dec 12 03:21:59 localhost kernel: [<ffffffffa0534000>] ? 0xffffffffa0533fff
Dec 12 03:21:59 localhost kernel: [<ffffffff815f0fc7>] __down+0x1d/0x1f
Dec 12 03:21:59 localhost kernel: [<ffffffff8108b8c1>] down+0x41/0x50
Dec 12 03:21:59 localhost kernel: [<ffffffffa0534073>] thread_function_one+0x73/0x90 [threads]
Dec 12 03:21:59 localhost kernel: [<ffffffff81085b0f>] kthread+0xcf/0xe0
Dec 12 03:21:59 localhost kernel: [<ffffffff81085a40>] ? kthread_create_on_node+0x140/0x140
Dec 12 03:21:59 localhost kernel: [<ffffffff815fc8ec>] ret_from_fork+0x7c/0xb0
Dec 12 03:21:59 localhost kernel: [<ffffffff81085a40>] ? kthread_create_on_node+0x140/0x140
Dec 12 03:21:59 localhost kernel: two             D ffff88007fd94580     0  3581      2 0x00000080
Dec 12 03:21:59 localhost kernel: ffff880034835d50 0000000000000046 ffff880034835fd8 0000000000014580
Dec 12 03:21:59 localhost kernel: ffff880034835fd8 0000000000014580 ffff88007b1ac440 ffffffffa0536280
Dec 12 03:21:59 localhost kernel: 7fffffffffffffff ffff88007b1ac440 0000000000000002 0000000000000000
Dec 12 03:21:59 localhost kernel: Call Trace:
Dec 12 03:21:59 localhost kernel: [<ffffffff815f1a39>] schedule+0x29/0x70
Dec 12 03:21:59 localhost kernel: [<ffffffff815ef939>] schedule_timeout+0x209/0x2d0
Dec 12 03:21:59 localhost kernel: [<ffffffff8106e37b>] ? lock_timer_base.isra.35+0x2b/0x50
Dec 12 03:21:59 localhost kernel: [<ffffffff8106f06e>] ? try_to_del_timer_sync+0x5e/0x90
Dec 12 03:21:59 localhost kernel: [<ffffffff815f0f32>] __down_common+0xd2/0x14a
Dec 12 03:21:59 localhost kernel: [<ffffffffa0534090>] ? thread_function_one+0x90/0x90 [threads]
Dec 12 03:21:59 localhost kernel: [<ffffffff815f0fc7>] __down+0x1d/0x1f
Dec 12 03:21:59 localhost kernel: [<ffffffff8108b8c1>] down+0x41/0x50
Dec 12 03:21:59 localhost kernel: [<ffffffffa0534103>] thread_function_two+0x73/0x90 [threads]
Dec 12 03:21:59 localhost kernel: [<ffffffff81085b0f>] kthread+0xcf/0xe0
Dec 12 03:21:59 localhost kernel: [<ffffffff81085a40>] ? kthread_create_on_node+0x140/0x140
Dec 12 03:21:59 localhost kernel: [<ffffffff815fc8ec>] ret_from_fork+0x7c/0xb0
Dec 12 03:21:59 localhost kernel: [<ffffffff81085a40>] ? kthread_create_on_node+0x140/0x140


We see that they are both waiting in __down_common, which is the semaphore-acquiring function.


Now we can use the Linux crash tool to view the same. The crash tool has the advantage of letting us see the values of stack variables.

A live system can be debugged using the crash utility. Following are the commands for the same :
1. crash /usr/lib/debug/lib/modules/3.15.X.XX-XXX.XX.X.el7.x86_64/vmlinux
If we want to debug a core generated because of a kernel crash, we give the same command with the generated core:
2. crash /usr/lib/debug/lib/modules/3.15.X.XX-XXX.XX.X.el7.x86_64/vmlinux /path/to/vmcore

The first command will give a prompt to start live debugging of the kernel.
1. ps command : We can see the current threads run by the Linux kernel using the crash ps command. This is similar to the shell ps command but also gives the task_struct address for the running threads.

2. set command : We can set the context of any thread using the set command

crash> set ffff88007b1aa220
    PID: 3580
COMMAND: "one"
   TASK: ffff88007b1aa220  [THREAD_INFO: ffff880079572000]
    CPU: 1
  STATE: TASK_UNINTERRUPTIBLE

crash> set ffff88007b1ac440
    PID: 3581
COMMAND: "two"
   TASK: ffff88007b1ac440  [THREAD_INFO: ffff880034834000]
    CPU: 3
  STATE: TASK_UNINTERRUPTIBLE

3. bt command : For getting the thread stack of a context we can use the "bt" command. bt --> backtrace
crash> bt
PID: 3580   TASK: ffff88007b1aa220  CPU: 1   COMMAND: "one"
 #0 [ffff880079573cf0] __schedule at ffffffff815f14fd
 #1 [ffff880079573d58] schedule at ffffffff815f1a39
 #2 [ffff880079573d68] schedule_timeout at ffffffff815ef939
 #3 [ffff880079573e10] __down_common at ffffffff815f0f32
 #4 [ffff880079573e88] __down at ffffffff815f0fc7
 #5 [ffff880079573e98] down at ffffffff8108b8c1
 #6 [ffff880079573eb8] thread_function_one at ffffffffa0534073 [threads]
 #7 [ffff880079573ec8] kthread at ffffffff81085b0f
 #8 [ffff880079573f50] ret_from_fork at ffffffff815fc8ec
crash> bt
PID: 3581   TASK: ffff88007b1ac440  CPU: 3   COMMAND: "two"
 #0 [ffff880034835cf0] __schedule at ffffffff815f14fd
 #1 [ffff880034835d58] schedule at ffffffff815f1a39
 #2 [ffff880034835d68] schedule_timeout at ffffffff815ef939
 #3 [ffff880034835e10] __down_common at ffffffff815f0f32
 #4 [ffff880034835e88] __down at ffffffff815f0fc7
 #5 [ffff880034835e98] down at ffffffff8108b8c1
 #6 [ffff880034835eb8] thread_function_two at ffffffffa0534103 [threads]
 #7 [ffff880034835ec8] kthread at ffffffff81085b0f
 #8 [ffff880034835f50] ret_from_fork at ffffffff815fc8ec

4. bt -f : We might also need the values of local variables in the current context. The addresses of the variables can be fetched using the "bt -f" command.
Let's see its output for our hung thread:

crash> bt -f
PID: 3581   TASK: ffff88007b1ac440  CPU: 3   COMMAND: "two"
 #0 [ffff880034835cf0] __schedule at ffffffff815f14fd
    ffff880034835cf8: 0000000000000046 ffff880034835fd8
    ffff880034835d08: 0000000000014580 ffff880034835fd8
    ffff880034835d18: 0000000000014580 ffff88007b1ac440
    ffff880034835d28: ffffffffa0536280 7fffffffffffffff
    ffff880034835d38: ffff88007b1ac440 0000000000000002
    ffff880034835d48: 0000000000000000 ffff880034835d60
    ffff880034835d58: ffffffff815f1a39
 #1 [ffff880034835d58] schedule at ffffffff815f1a39
    ffff880034835d60: ffff880034835e08 ffffffff815ef939
 #2 [ffff880034835d68] schedule_timeout at ffffffff815ef939
    ffff880034835d70: 7fffffffffffffff ffff880034835da8
    ffff880034835d80: ffffffff8106e37b ffff880034835e10
    ffff880034835d90: ffff88007c514000 0000000000000286
    ffff880034835da0: 0000000000000000 ffff880034835dd8
    ffff880034835db0: ffffffff8106f06e 0000000000000286
    ffff880034835dc0: 000000001daef83c ffff880034835e10
    ffff880034835dd0: ffff880034835e10 000000001daef83c
    ffff880034835de0: ffffffffa0536280 7fffffffffffffff
    ffff880034835df0: ffff88007b1ac440 0000000000000002
    ffff880034835e00: 0000000000000000 ffff880034835e80
    ffff880034835e10: ffffffff815f0f32
 #3 [ffff880034835e10] __down_common at ffffffff815f0f32
    ffff880034835e18: 0000000000000000 ffffffffa0536288
    ffff880034835e28: ffffffffa0536288 ffff88007b1ac440
    ffff880034835e38: ffff88007b1ac400 ffffffffffffffff
    ffff880034835e48: 000000001daef83c 0000000000000000
    ffff880034835e58: ffffffffa0536280 ffffffffa0536240
    ffff880034835e68: ffffffffa0534090 0000000000000000
    ffff880034835e78: 0000000000000000 ffff880034835e90
    ffff880034835e88: ffffffff815f0fc7
 #4 [ffff880034835e88] __down at ffffffff815f0fc7
    ffff880034835e90: ffff880034835eb0 ffffffff8108b8c1
 #5 [ffff880034835e98] down at ffffffff8108b8c1
    ffff880034835ea0: 0000000000000286 ffff880079ac7cb8
    ffff880034835eb0: ffff880034835ec0 ffffffffa0534103
 #6 [ffff880034835eb8] thread_function_two at ffffffffa0534103 [threads]
    ffff880034835ec0: ffff880034835f48 ffffffff81085b0f
 #7 [ffff880034835ec8] kthread at ffffffff81085b0f
    ffff880034835ed0: 0000000000000000 0000000000000000
    ffff880034835ee0: ffffffffa0536240 0000000000000000
    ffff880034835ef0: 0000000000000000 ffff880034835ef8
    ffff880034835f00: ffff880034835ef8 ffff880000000000
    ffff880034835f10: ffff880000000000 ffff880034835f18
    ffff880034835f20: ffff880034835f18 000000001daef83c
    ffff880034835f30: ffffffff81085a40 0000000000000000
    ffff880034835f40: 0000000000000000 ffff880079ac7cb8
    ffff880034835f50: ffffffff815fc8ec
 #8 [ffff880034835f50] ret_from_fork at ffffffff815fc8ec

Here the addresses point to the local variables. That is a hell of a lot of addresses; how do we trace our variable? Generally, because of optimizations, these variables appear at different places in the stack.
So what I did at the start of the module is print the addresses of these variables :
Dec 12 02:49:25 localhost kernel: module_init address of sync1 = ffffffffa0536280 sync2 = ffffffffa0536260


5. Fetching the values of variables :
Once we are sure of the variable's address, we can issue commands in the following fashion to fetch its value.

crash> struct semaphore ffffffffa0536280
struct semaphore {
  lock = {
    raw_lock = {
      {
        head_tail = 262148,
        tickets = {
          head = 4,
          tail = 4
        }
      }
    }
  },
  count = 0,
  wait_list = {
    next = 0xffff880034835e20,
    prev = 0xffff880034835e20
  }
}

crash> struct semaphore ffffffffa0536260
struct semaphore {
  lock = {
    raw_lock = {
      {
        head_tail = 262148,
        tickets = {
          head = 4,
          tail = 4
        }
      }
    }
  },
  count = 0,
  wait_list = {
    next = 0xffff880079573e20,
    prev = 0xffff880079573e20
  }
}

In both semaphores count is 0 and the wait_list is non-empty: each semaphore is held while the other thread sits on its wait list. Note the wait_list pointers: sync1 (ffffffffa0536280) has a waiter on the stack of thread "two" (ffff880034835e20), and sync2 (ffffffffa0536260) has a waiter on the stack of thread "one" (ffff880079573e20). This confirms the ABBA deadlock.