Sunday 7 May 2017

Linux memory initialization (contd.)


In the last blog we saw the construction of the e820 map.

setup_arch then calls :
memblock_x86_fill();
This function walks the memory map provided by e820 and adds the usable RAM regions (E820_RAM, plus E820_RESERVED_KERN) to memblock with the memblock_add function.
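A trimmed sketch of what memblock_x86_fill does (based on arch/x86/kernel/e820.c in 3.x/4.x kernels; minor checks and field details may differ by version):

void __init memblock_x86_fill(void)
{
	int i;
	u64 end;

	for (i = 0; i < e820.nr_map; i++) {
		struct e820entry *ei = &e820.map[i];

		end = ei->addr + ei->size;
		if (end != (resource_size_t)end)
			continue;

		/* only usable RAM (and kernel-reserved RAM) goes into memblock */
		if (ei->type != E820_RAM && ei->type != E820_RESERVED_KERN)
			continue;

		memblock_add(ei->addr, ei->size);
	}

	/* throw away partial pages at the ends of the regions */
	memblock_trim_memory(PAGE_SIZE);
	memblock_dump_all();
}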


setup_arch then calls :
         init_mem_mapping 
         init_mem_mapping calls memory_map_bottom_up or memory_map_top_down

Both of these call init_range_memory_mapping, which in turn calls init_memory_mapping.
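A trimmed sketch of the direction choice in init_mem_mapping (based on arch/x86/mm/init.c in recent 3.x/4.x kernels; 32-bit/highmem handling and other details trimmed):

void __init init_mem_mapping(void)
{
	unsigned long end;

	probe_page_size_mask();

#ifdef CONFIG_X86_64
	end = max_pfn << PAGE_SHIFT;
#else
	end = max_low_pfn << PAGE_SHIFT;
#endif

	/* the ISA range is always mapped regardless of memory holes */
	init_memory_mapping(0, ISA_END_ADDRESS);

	if (memblock_bottom_up()) {
		unsigned long kernel_end = __pa_symbol(_end);

		/* map the range above the kernel image first, then below it */
		memory_map_bottom_up(kernel_end, end);
		memory_map_bottom_up(ISA_END_ADDRESS, kernel_end);
	} else {
		memory_map_top_down(ISA_END_ADDRESS, end);
	}
}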

init_memory_mapping sets up the direct mapping of physical memory at PAGE_OFFSET (0xffff880000000000 on x86_64). From here kernel_physical_mapping_init is called.

It maps physical memory into the kernel virtual address space, a total of max_low_pfn pages, by creating page tables starting from address PAGE_OFFSET.

The required PGD and PUD entries are created as needed:
pgd_populate(&init_mm, pgd, pud);
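A simplified sketch of the top-level loop in kernel_physical_mapping_init (based on arch/x86/mm/init_64.c; the lower PUD/PMD/PTE levels and return-value handling are trimmed):

	/* start/end arrive as physical addresses and are converted to direct-map virtual addresses */
	start = (unsigned long)__va(start);
	end = (unsigned long)__va(end);

	for (; start < end; start = next) {
		pgd_t *pgd = pgd_offset_k(start);
		pud_t *pud;

		next = (start & PGDIR_MASK) + PGDIR_SIZE;

		if (pgd_val(*pgd)) {
			/* PGD entry already exists, just fill the PUD level below it */
			pud = (pud_t *)pgd_page_vaddr(*pgd);
			phys_pud_init(pud, __pa(start), __pa(end), page_size_mask);
			continue;
		}

		/* allocate a fresh PUD page, fill it, and hook it into the PGD */
		pud = alloc_low_page();
		phys_pud_init(pud, __pa(start), __pa(end), page_size_mask);

		spin_lock(&init_mm.page_table_lock);
		pgd_populate(&init_mm, pgd, pud);
		spin_unlock(&init_mm.page_table_lock);
	}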

For 32-bit systems, init_mem_mapping also calls early_ioremap_page_table_range_init to build a proper pagetable for the kernel mappings; up to this point we have been running on the pagetables constructed by the boot process (on native hardware, the ones built in arch/x86/kernel/head_32.S).

setup_arch then calls : 
initmem_init();
initmem_init calls : x86_numa_init
As CONFIG_ACPI_NUMA is y, it calls numa_init with the init function x86_acpi_numa_init.
numa_init first calls x86_acpi_numa_init.
x86_acpi_numa_init calls acpi_numa_init.
acpi_numa_init parses the SRAT (Static Resource Affinity Table) and
sets the x2apic affinity, processor affinity and memory affinity via
acpi_numa_x2apic_affinity_init, acpi_numa_processor_affinity_init and acpi_numa_memory_affinity_init.
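A trimmed sketch of acpi_numa_memory_affinity_init (based on arch/x86/mm/srat.c around 3.10; error handling, SRAT revision quirks and hotplug details removed):

static int __init
acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
{
	u64 start, end;
	int node, pxm;

	if ((ma->flags & ACPI_SRAT_MEM_ENABLED) == 0)
		return -1;

	start = ma->base_address;
	end = start + ma->length;
	pxm = ma->proximity_domain;

	/* map the ACPI proximity domain to a NUMA node id */
	node = setup_node(pxm);
	if (node < 0)
		return -1;

	/* record this physical range as belonging to 'node' */
	if (numa_add_memblk(node, start, end) < 0)
		return -1;

	node_set(node, numa_nodes_parsed);

	printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]\n",
	       node, pxm,
	       (unsigned long long) start, (unsigned long long) end - 1);
	return 0;
}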
numa_init then calls : numa_register_memblks
All the NUMA memblocks are registered, and then setup_node_data is called for each node.
setup_node_data : allocates and initializes NODE_DATA for a node, on that node's local memory if possible.
For each node a line like the following is printed:
Mar  7 02:51:29 localhost kernel: Initmem setup node 0 [mem 0x00000000-0x7fffffff]
The physical address of the node's NODE_DATA is also printed:
printk(KERN_INFO "  NODE_DATA [mem %#010Lx-%#010Lx]\n", nd_pa, nd_pa + nd_size - 1);
The virtual address is stored with node_data[nid] = nd; node_data[] is an array of struct pglist_data pointers, one per node.
The macro NODE_DATA(nid) simply indexes this array of node-specific data structures.
Two fields of the pglist_data are initialized here:
unsigned long node_start_pfn;
unsigned long node_spanned_pages; /* total size of physical page range, including holes */
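For reference, the relevant declarations and the tail of setup_node_data (from arch/x86/include/asm/mmzone_64.h and arch/x86/mm/numa.c, trimmed):

extern struct pglist_data *node_data[];
#define NODE_DATA(nid)		(node_data[nid])

	/* nd is the virtual address of the freshly allocated pg_data_t */
	node_data[nid] = nd;
	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
	NODE_DATA(nid)->node_id = nid;
	NODE_DATA(nid)->node_start_pfn = start >> PAGE_SHIFT;
	NODE_DATA(nid)->node_spanned_pages = (end - start) >> PAGE_SHIFT;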
numa_init then calls numa_init_array to initialize the cpu-to-node mapping array.


Now the pagetable_init function is called:
  x86_init.paging.pagetable_init(); this is a function pointer which, on 64-bit, resolves to paging_init in arch/x86/mm/init_64.c:
.paging = {
.pagetable_init = native_pagetable_init,
},

#ifdef CONFIG_X86_32
extern void native_pagetable_init(void);
#else
#define native_pagetable_init        paging_init
#endif

 paging_init calls : 
 zone_sizes_init()
In this function the maximum page frame numbers of the 4 zones are initialized:
ZONE_DMA - MAX_DMA_PFN
/* 16MB ISA DMA zone */
#define MAX_DMA_PFN   ((16 * 1024 * 1024) >> PAGE_SHIFT)

ZONE_DMA32 -  MAX_DMA32_PFN

/* 4GB broken PCI/AGP hardware bus master zone */
#define MAX_DMA32_PFN ((4UL * 1024 * 1024 * 1024) >> PAGE_SHIFT)

ZONE_NORMAL - max_low_pfn

ZONE_HIGHMEM - max_pfn
This array of maximum zone PFNs is then passed to the free_area_init_nodes function, as shown in the sketch below.
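A sketch of zone_sizes_init, roughly as it appears in arch/x86/mm/init.c (config ifdefs kept, since ZONE_HIGHMEM only exists on 32-bit):

void __init zone_sizes_init(void)
{
	unsigned long max_zone_pfns[MAX_NR_ZONES];

	memset(max_zone_pfns, 0, sizeof(max_zone_pfns));

#ifdef CONFIG_ZONE_DMA
	max_zone_pfns[ZONE_DMA]		= MAX_DMA_PFN;
#endif
#ifdef CONFIG_ZONE_DMA32
	max_zone_pfns[ZONE_DMA32]	= MAX_DMA32_PFN;
#endif
	max_zone_pfns[ZONE_NORMAL]	= max_low_pfn;
#ifdef CONFIG_HIGHMEM
	max_zone_pfns[ZONE_HIGHMEM]	= max_pfn;
#endif

	free_area_init_nodes(max_zone_pfns);
}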
free_area_init_nodes
This will call free_area_init_node() for each active node in the system. Using the page ranges provided by add_active_range(), the size of each zone in each node and their holes is calculated. If the maximum PFN between two adjacent zones match, it is assumed that the zone is empty.
It finds the active regions from the memblock regions and prints the early memory node ranges:
Mar  7 02:51:29 localhost kernel: Early memory node ranges
Mar  7 02:51:29 localhost kernel:  node   0: [mem 0x00001000-0x0009efff]
Mar  7 02:51:29 localhost kernel:  node   0: [mem 0x00100000-0x7feeffff]
Mar  7 02:51:29 localhost kernel:  node   0: [mem 0x7ff00000-0x7fffffff]
free_area_init_node
This function first fetches pg_data_t *pgdat = NODE_DATA(nid);
and fills in the pgdat structure :
pgdat->node_id = nid;
pgdat->node_start_pfn = node_start_pfn;
It calculates the total pages in the node using calculate_node_totalpages :
pgdat->node_present_pages = realtotalpages;
It then allocates the node-specific mem_map using alloc_node_mem_map, which uses boot memory (memblock) for the allocation.
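A trimmed sketch of alloc_node_mem_map for the FLAT_NODE_MEM_MAP case (based on mm/page_alloc.c; the exact boot allocator call varies with kernel version):

static void __init alloc_node_mem_map(struct pglist_data *pgdat)
{
	/* Skip empty nodes */
	if (!pgdat->node_spanned_pages)
		return;

#ifdef CONFIG_FLAT_NODE_MEM_MAP
	if (!pgdat->node_mem_map) {
		unsigned long size, start, end;
		struct page *map;

		/* align the range to MAX_ORDER boundaries so the buddy allocator works correctly */
		start = pgdat->node_start_pfn & ~(MAX_ORDER_NR_PAGES - 1);
		end = pgdat_end_pfn(pgdat);
		end = ALIGN(end, MAX_ORDER_NR_PAGES);
		size = (end - start) * sizeof(struct page);

		/* one struct page per page frame, allocated from boot memory */
		map = alloc_bootmem_node_nopanic(pgdat, size);
		pgdat->node_mem_map = map + (pgdat->node_start_pfn - start);
	}
#endif
}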

Kernel Memory allocators
 Now let's go back to the start_kernel function, which called our setup_arch.
 It then calls build_all_zonelists, which builds the per-node zone fallback lists.
 After that it calls mm_init.
 This function sets up the kernel memory allocators :

 static void __init mm_init(void)
 {
  /*
   * page_cgroup requires contiguous pages,
   * bigger than MAX_ORDER unless SPARSEMEM.
   */
  page_cgroup_init_flatmem();
  mem_init();
  kmem_cache_init();
  percpu_init_late();
  pgtable_init();
  vmalloc_init();

 }
Good info on Stack Overflow:
https://stackoverflow.com/questions/4528568/how-does-the-linux-kernel-manage-less-than-1gb-physical-memory
https://stackoverflow.com/questions/33557691/kernel-space-and-user-space-layout-in-page-table
https://stackoverflow.com/questions/27604089/why-does-kernel-add-kernel-master-page-table-to-processs-page-table
https://stackoverflow.com/questions/27222060/while-forking-a-process-why-does-linux-kernel-copy-the-content-of-kernel-page

User space memory allocation : 
Kernel 3.10 code is used for this illustration.
Just to recollect: a program initially only accesses a virtual memory location.
If the virtual address has no corresponding physical page, the MMU generates a fault.
This fault is handled by the do_page_fault handler. The handler checks whether the faulting address lies within the
start-end range of some VM area. If yes, it is a "good" fault and a physical RAM page is allocated for it (demand paging).
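As a quick user-space illustration (a small hypothetical test program, not taken from the kernel), an mmap of anonymous memory only creates a VMA; the physical pages are faulted in on first touch:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 16UL * 1024 * 1024;	/* 16 MB of virtual address space */

	/* Only a VMA is created here; no physical pages are allocated yet. */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Each first write below faults; do_anonymous_page allocates the page. */
	memset(p, 0xaa, len);

	printf("touched %zu bytes\n", len);
	munmap(p, len);
	return 0;
}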

do_page_fault is called for any page fault. The faulting address is fetched from the CR2 register:
when a page fault occurs, the CPU stores the linear address the program attempted to access
(the Page Fault Linear Address, PFLA) in CR2.

unsigned long address = read_cr2();

Now do_page_fault calls __do_page_fault.
__do_page_fault looks up the VMA covering the address with find_vma and checks that the address lies at or after vma->vm_start.
It also does several other checks and, on failure, raises SIGSEGV with SEGV_MAPERR or SEGV_ACCERR, as sketched below.
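A simplified sketch of that check (based on arch/x86/mm/fault.c around 3.10; kernel-mode faults, vmalloc faults and several special cases are omitted):

	vma = find_vma(mm, address);
	if (unlikely(!vma)) {
		bad_area(regs, error_code, address);		/* SIGSEGV, SEGV_MAPERR */
		return;
	}
	if (likely(vma->vm_start <= address))
		goto good_area;
	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
		bad_area(regs, error_code, address);		/* not a growable stack VMA */
		return;
	}
	if (unlikely(expand_stack(vma, address))) {
		bad_area(regs, error_code, address);
		return;
	}

good_area:
	if (unlikely(access_error(error_code, vma))) {
		bad_area_access_error(regs, error_code, address);	/* SIGSEGV, SEGV_ACCERR */
		return;
	}

	fault = handle_mm_fault(mm, vma, address, flags);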

If it is a good area then it calls handle_mm_fault.
handle_mm_fault (via __handle_mm_fault in later kernels) finds or allocates the PGD, PUD and PMD entries and then calls handle_pte_fault.
The relevant handle_pte_fault code looks like :

	if (!pte_present(entry)) {
		if (pte_none(entry)) {
			if (!vma_is_anonymous(vma))
				return do_linear_fault(mm, vma, address,
						pte, pmd, flags, entry);
			return do_anonymous_page(mm, vma, address,
						pte, pmd, flags);
		}
		if (pte_file(entry))
			return do_nonlinear_fault(mm, vma, address,
					pte, pmd, flags, entry);
		return do_swap_page(mm, vma, address,
					pte, pmd, flags, entry);
	}

For anonymous pages, do_anonymous_page gets called.

For writes it allocates a new page using alloc_zeroed_user_highpage_movable :

	/* Allocate our own private page. */
	if (unlikely(anon_vma_prepare(vma)))
		goto oom;
	page = alloc_zeroed_user_highpage_movable(vma, address);
	if (!page)
		goto oom;
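After the page is allocated, do_anonymous_page builds the PTE and installs it (trimmed from mm/memory.c around 3.10; cgroup charging and error paths omitted):

	__SetPageUptodate(page);

	entry = mk_pte(page, vma->vm_page_prot);
	if (vma->vm_flags & VM_WRITE)
		entry = pte_mkwrite(pte_mkdirty(entry));

	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
	if (!pte_none(*page_table))
		goto release;		/* someone else already installed a PTE here */

	inc_mm_counter_fast(mm, MM_ANONPAGES);
	page_add_new_anon_rmap(page, vma, address);

	set_pte_at(mm, address, page_table, entry);

	/* No need to invalidate - it was non-present before */
	update_mmu_cache(vma, address, page_table);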

Eventually this calls down to alloc_pages and allocates a page from the buddy allocator:
static inline struct page *
alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
					unsigned long vaddr)
{
	return __alloc_zeroed_user_highpage(__GFP_MOVABLE, vma, vaddr);
}

#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
	alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)

#define alloc_page_vma(gfp_mask, vma, addr) \
	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)

#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
	alloc_pages(gfp_mask, order)
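From alloc_pages the request reaches __alloc_pages_nodemask in mm/page_alloc.c, which the kernel source itself describes as the heart of the zoned buddy allocator. (In the NUMA-enabled case the alloc_pages_vma path additionally consults the task's mempolicy before getting there.)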
