Sunday, 7 May 2017

Linux memory initialization (contd.)


In the last blog we saw the construction of the e820 map.

setup_arch then calls:
memblock_x86_fill();
This function goes through the memory map provided by e820 and adds each usable region to the memblock allocator with the memblock_add function.


setup_arch then calls:
         init_mem_mapping
init_mem_mapping calls memory_map_bottom_up or memory_map_top_down.

Either path calls init_range_memory_mapping, which in turn calls init_memory_mapping.

init_memory_mapping: sets up the direct mapping of physical memory at PAGE_OFFSET (0xffff880000000000). From here kernel_physical_mapping_init maps
physical memory into the kernel virtual address space, a total of max_low_pfn pages, by creating page tables starting from address PAGE_OFFSET.

The required pgd and pud entries are created as needed:
pgd_populate(&init_mm, pgd, pud);

For 32-bit systems, init_mem_mapping then calls early_ioremap_page_table_range_init to build a proper page table for the kernel mappings. If we're booting on native hardware, this extends the page table constructed in arch/x86/kernel/head_32.S.

setup_arch then calls : 
initmem_init();
initmem_init calls x86_numa_init.
Since CONFIG_ACPI_NUMA=y, it calls numa_init with the init function x86_acpi_numa_init.
numa_init first calls x86_acpi_numa_init.
x86_acpi_numa_init calls acpi_numa_init.
acpi_numa_init parses the System Resource Affinity Table (SRAT) and
sets the x2apic affinity, processor affinity and memory affinity (the latter via
acpi_numa_memory_affinity_init).
numa_init then calls numa_register_memblks:
all the memblocks are registered with their NUMA nodes, and then setup_node_data is called.
setup_node_data: initializes NODE_DATA for a node on the node-local memory.
Mar  7 02:51:29 localhost kernel: Initmem setup node 0 [mem 0x00000000-0x7fffffff]
is printed for each node, along with the node's physical addresses:
printk(KERN_INFO "  NODE_DATA [mem %#010Lx-%#010Lx]\n",    nd_pa, nd_pa + nd_size - 1);
The virtual address is stored via node_data[nid] = nd; node_data is an array of struct pglist_data.
The NODE_DATA macro now points into this array of node-specific data structures.
Two of its fields are initialized here:
unsigned long node_start_pfn; and unsigned long node_spanned_pages; /* total size of physical page range, including holes */
numa_init calls numa_init_array to initialize the cpu-to-node array.


Now the pagetable_init function is called:
  x86_init.paging.pagetable_init(); this is a function pointer which, on 64-bit, resolves to paging_init in arch/x86/mm/init_64.c:
.paging = {
.pagetable_init = native_pagetable_init,
},

#ifdef CONFIG_X86_32
extern void native_pagetable_init(void);
#else
#define native_pagetable_init        paging_init
#endif

 paging_init calls:
 zone_sizes_init()
In this function the maximum page frame numbers are initialized for the 4 zones:
ZONE_DMA - MAX_DMA_PFN
/* 16MB ISA DMA zone */
#define MAX_DMA_PFN   ((16 * 1024 * 1024) >> PAGE_SHIFT)

ZONE_DMA32 -  MAX_DMA32_PFN

/* 4GB broken PCI/AGP hardware bus master zone */
#define MAX_DMA32_PFN ((4UL * 1024 * 1024 * 1024) >> PAGE_SHIFT)

ZONE_NORMAL - max_low_pfn

ZONE_HIGHMEM - max_pfn
Then this array is passed to the free_area_init_nodes function.
free_area_init_nodes
This will call free_area_init_node() for each active node in the system. Using the page ranges provided by add_active_range(), the size of each zone in each node and their holes is calculated. If the maximum PFN between two adjacent zones match, it is assumed that the zone is empty.
This tries to find the active regions from the memblock regions and
prints the early memory node ranges from the memblocks:
Mar  7 02:51:29 localhost kernel: Early memory node ranges
Mar  7 02:51:29 localhost kernel:  node   0: [mem 0x00001000-0x0009efff]
Mar  7 02:51:29 localhost kernel:  node   0: [mem 0x00100000-0x7feeffff]
Mar  7 02:51:29 localhost kernel:  node   0: [mem 0x7ff00000-0x7fffffff]
free_area_init_node
This function first fetches pg_data_t *pgdat = NODE_DATA(nid);
and fills the pgdat structure:
pgdat->node_id = nid;
pgdat->node_start_pfn = node_start_pfn;
It calculates the total pages in the node using calculate_node_totalpages:
pgdat->node_present_pages = realtotalpages;
Finally it allocates the node-specific mem_map from boot memory using alloc_node_mem_map.

Kernel Memory allocators
 Now let's go back to the start_kernel function, which called our setup_arch.
 It then calls build_all_zonelists,
 followed by mm_init.
 mm_init sets up the kernel memory allocators:

 static void __init mm_init(void)
 {
  /*
   * page_cgroup requires contiguous pages,
   * bigger than MAX_ORDER unless SPARSEMEM.
   */
  page_cgroup_init_flatmem();
  mem_init();
  kmem_cache_init();
  percpu_init_late();
  pgtable_init();
  vmalloc_init();

 }
Good info on Stack Overflow:
https://stackoverflow.com/questions/4528568/how-does-the-linux-kernel-manage-less-than-1gb-physical-memory
https://stackoverflow.com/questions/33557691/kernel-space-and-user-space-layout-in-page-table
https://stackoverflow.com/questions/27604089/why-does-kernel-add-kernel-master-page-table-to-processs-page-table
https://stackoverflow.com/questions/27222060/while-forking-a-process-why-does-linux-kernel-copy-the-content-of-kernel-page

User space memory allocation : 
I used kernel code 3.10 for this illustration.
Recall that a program initially just accesses a virtual memory location.
If the virtual location has no corresponding physical location, the MMU generates a fault.
This fault is captured by the do_page_fault handler. The handler checks whether the faulting address lies within the
start-end range of some vm area. If yes, it is a good fault and a physical RAM page shall be allocated for it (demand paging).

do_page_fault is called for any page fault. The page fault address is fetched from the CR2 register.
CR2 register contains a value called Page Fault Linear Address (PFLA).
When a page fault occurs, the address the program attempted to access is stored in the CR2 register.

unsigned long address = read_cr2();

Now do_page_fault calls __do_page_fault.
__do_page_fault checks the validity of the address, e.g. whether it lies after vma->vm_start.
It also does many other checks and raises SEGV_MAPERR or SEGV_ACCERR on failure.

If it is a good area, it calls handle_mm_fault.
__handle_mm_fault finds/allocates the PGD, PUD and PMD entries and calls handle_pte_fault, whose code looks like:

if (!pte_present(entry)) {
        if (pte_none(entry)) {
                if (!vma_is_anonymous(vma))
                        return do_linear_fault(mm, vma, address,
                                        pte, pmd, flags, entry);
                return do_anonymous_page(mm, vma, address,
                                        pte, pmd, flags);
        }
        if (pte_file(entry))
                return do_nonlinear_fault(mm, vma, address,
                                        pte, pmd, flags, entry);
        return do_swap_page(mm, vma, address,
                                pte, pmd, flags, entry);
}

For anonymous pages, do_anonymous_page gets called.

For writes it allocates a new page using alloc_zeroed_user_highpage_movable:
/* Allocate our own private page. */
if (unlikely(anon_vma_prepare(vma)))
        goto oom;
page = alloc_zeroed_user_highpage_movable(vma, address);
if (!page)
        goto oom;

Eventually this calls alloc_pages and allocates a page from the buddy allocator:
static inline struct page *
alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
                                   unsigned long vaddr)
{
        return __alloc_zeroed_user_highpage(__GFP_MOVABLE, vma, vaddr);
}

#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
        alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)

#define alloc_page_vma(gfp_mask, vma, addr) \
        alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)

/* and, without CONFIG_NUMA: */
#define alloc_pages_vma(gfp_mask, order, vma, addr, node, hugepage) \
        alloc_pages(gfp_mask, order)

Linux Memory initialization

Linux memory init:


start_kernel calls setup_arch. setup_arch is a big function where many of the early boot memory allocations and initializations happen.

Since most of the kernel memory allocation APIs are not yet available this early in boot, the kernel uses the memblock APIs to do the needed allocations.

One more mode of early allocation is the early ioremap functionality:

EARLY IOREMAP initialization : 

early_ioremap_init:
for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
        slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i);
This loop fills the slot_virt array with the virtual addresses of the early fixmaps.

static unsigned long slot_virt[FIX_BTMAPS_SLOTS] __initdata;
static void __iomem *prev_map[FIX_BTMAPS_SLOTS] __initdata;
static unsigned long prev_size[FIX_BTMAPS_SLOTS] __initdata;

The early_ioremap_init function fills the slot_virt array with the virtual addresses of the early fixmaps. There are 8 slots with 64 boot-time mappings each, 512 mappings altogether.
It then fetches the pmd for FIX_BTMAP_BEGIN and calls pmd_populate_kernel.

pmd_populate_kernel populates the page middle directory (pmd) provided as an argument with the given page table entries (bm_pte).

As soon as early ioremap has been setup successfully, we can use it. It provides two functions:

early_ioremap
early_iounmap

More details here :
https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md#use-of-early-ioremap

MEMBLOCKS :
Linux memblock initialization : 
Initially we have a static array of 128 regions for memory and another 128 regions for reserved regions.
memblock_add adds a region to the memory regions; memblock_reserve adds one to the reserved regions.

static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;

struct memblock memblock __initdata_memblock = {
.memory.regions = memblock_memory_init_regions,
.memory.cnt = 1, /* empty dummy entry */
.memory.max = INIT_MEMBLOCK_REGIONS,

.reserved.regions = memblock_reserved_init_regions,
.reserved.cnt = 1, /* empty dummy entry */
.reserved.max = INIT_MEMBLOCK_REGIONS,

.bottom_up = false,

.current_limit = MEMBLOCK_ALLOC_ANYWHERE,
};

#define INIT_MEMBLOCK_REGIONS 128

Memblock APIs :

memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0);
This function takes a physical base address and the size of the memory region as arguments and adds them to the given memblock type.



SETUP ARCH : 

setup_arch reserves memory in the form of memblocks for the kernel's _text, _data and _bss using:
memblock_reserve(__pa_symbol(_text),
                 (unsigned long)__bss_stop - (unsigned long)_text);


Then it does the memblock reserve for initrd :
early_reserve_initrd

SETUP MEMORY MAP : 
Next memory related function is setup_memory_map. To understand this function we need to first go through the basics of e820 map.

e820 is used by the BIOS to report the memory map to Linux.
The memory map is built in function setup_arch which is called by start_kernel(). It is accessed via the int 15h call, by setting the AX register to value E820 in hexadecimal. Source (https://en.wikipedia.org/wiki/E820)

arch/x86/kernel/setup.c

start_kernel()
setup_arch()
setup_memory_map()
default_machine_specific_memory_setup()
Here the entries are taken from boot_params e820_map. Regions are sanitized and saved to struct e820map e820;

struct e820map {
__u32 nr_map;
struct e820entry map[E820_X_MAX];
};

append_e820_map copies all the BIOS entries into a safe place i.e struct e820map e820;
The same map is then printed on the Linux kernel messages:

Jul 13 21:42:42 localhost kernel: e820: BIOS-provided physical RAM map:
Jul 13 21:42:42 localhost kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009f7ff] usable
Jul 13 21:42:42 localhost kernel: BIOS-e820: [mem 0x000000000009f800-0x000000000009ffff] reserved
Jul 13 21:42:42 localhost kernel: BIOS-e820: [mem 0x00000000000dc000-0x00000000000fffff] reserved
Jul 13 21:42:42 localhost kernel: BIOS-e820: [mem 0x0000000000100000-0x000000007feeffff] usable
Jul 13 21:42:42 localhost kernel: BIOS-e820: [mem 0x000000007fef0000-0x000000007fefefff] ACPI data
Jul 13 21:42:42 localhost kernel: BIOS-e820: [mem 0x000000007feff000-0x000000007fefffff] ACPI NVS
Jul 13 21:42:42 localhost kernel: BIOS-e820: [mem 0x000000007ff00000-0x000000007fffffff] usable
Jul 13 21:42:42 localhost kernel: BIOS-e820: [mem 0x00000000f0000000-0x00000000f7ffffff] reserved
Jul 13 21:42:42 localhost kernel: BIOS-e820: [mem 0x00000000fec00000-0x00000000fec0ffff] reserved
Jul 13 21:42:42 localhost kernel: BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
Jul 13 21:42:42 localhost kernel: BIOS-e820: [mem 0x00000000fffe0000-0x00000000ffffffff] reserved

Let's see how many MBs are covered by these regions (end addresses are inclusive, so size = end - start + 1):
region 1: 0x9f800 bytes -- 0.62 MB
region 2: 0x800 bytes -- 0.002 MB
region 3: 0x24000 bytes -- 0.14 MB
region 4: 0x7fdf0000 bytes -- 2,145,386,496 bytes, ~2 GB
region 5: 0xf000 bytes -- 0.06 MB
region 6: 0x1000 bytes -- 0.004 MB
region 7: 0x100000 bytes -- 1 MB
region 8: 0x8000000 bytes -- 128 MB
region 9: 0x10000 bytes -- 0.06 MB
region 10: 0x1000 bytes -- 0.004 MB
region 11: 0x20000 bytes -- 0.13 MB

The usable regions add up to roughly 2 GB, which seems OK as I have 2048 MB assigned to this VM.

Following are the types of memory in e820 map :
01h memory, available to OS
02h reserved, not available (e.g. system ROM, memory-mapped device)
03h ACPI Reclaim Memory (usable by OS after reading ACPI tables)
04h ACPI NVS Memory (OS is required to save this memory between NVS sessions)

The extended map is parsed and printed using parse_e820_ext.

The setup_arch function then assigns the following variables
init_mm.start_code = (unsigned long) _text;
init_mm.end_code = (unsigned long) _etext;
init_mm.end_data = (unsigned long) _edata;
init_mm.brk = _brk_end;

May  7 20:55:51 localhost kernel: _text = 0xffffffff81000000
May  7 20:55:51 localhost kernel: _etext = 0xffffffff81600f65
May  7 20:55:51 localhost kernel: _edata = 0xffffffff81a00800
code_resource.start = 1000000
code_resource.end = 1600f64
data_resource.start = 1600f65
data_resource.end = 1a007ff
bss_resource.start = 1b95000
bss_resource.end = 1e2cfff

e820_add_kernel_range() adds the kernel range.
Apr  1 20:10:35 localhost kernel: e820: last_pfn = 0x80000 max_arch_pfn = 0x400000000
last_pfn = 0x80000 corresponds to 2 GB (0x80000 pages of 4 KB) and max_arch_pfn = 0x400000000 to 64 TB.