Wednesday 29 August 2018

Linux system call implementation x86_64


Userspace call of the systemcall:

x86_64 user programs invoke a system call by putting the system call number (0 for read) into the RAX register, and the other parameters into specific registers (RDI, RSI, RDX for the first 3 parameters), then issue the x86_64 syscall instruction.
http://man7.org/linux/man-pages/man2/syscall.2.html
x86-64      syscall               rax        rax     -        [5]

This instruction causes the processor to transition to ring 0 and invoke the function referenced by the MSR_LSTAR model-specific register.

The MSR_LSTAR model-specific register is initialized at the kernel bootup:


/* May not be marked __init: used by software suspend */
void syscall_init(void)
{
extern char _entry_trampoline[];
extern char entry_SYSCALL_64_trampoline[];

int cpu = smp_processor_id();
unsigned long SYSCALL64_entry_trampoline =
(unsigned long)get_cpu_entry_area(cpu)->entry_trampoline +
(entry_SYSCALL_64_trampoline - _entry_trampoline);

wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
if (static_cpu_has(X86_FEATURE_PTI))
wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
else
wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);

#ifdef CONFIG_IA32_EMULATION
wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
/*
* This only works on Intel CPUs.
* On AMD CPUs these MSRs are 32-bit, CPU truncates MSR_IA32_SYSENTER_EIP.
* This does not cause SYSENTER to jump to the wrong location, because
* AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
*/
wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_entry_stack(cpu) + 1));
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
#else
wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
#endif

/* Flags to clear on syscall */
wrmsrl(MSR_SYSCALL_MASK,
       X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
       X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
}

The system_call is the function which gets called for each system call.
entry_64.S file has the definition of the system_call function.
The system_call code pushes the registers onto the kernel stack, and calls the function pointer at entry RAX in the sys_call_table table.

ENTRY(system_call)
...
movq %r10,%rcx /* fixup for C */
call *sys_call_table(,%rax,8)
movq %rax,RAX-ARGOFFSET(%rsp)
...
END(system_call)


Apart from the change via the SYSCALL, we also change the privilege : 

Privilege levels (CPL DPL RPL):

X86 systems has a feature of privilege levels. This restricts the memory access, IO ports access and ability to execute certain machine instructions. For kernel this privilege level is 0 and for user space programs it is 3. A code executing cannot change its privilege level itself. Change in privilege level can be done using lcall, int, lret and iret instructions. The raise of privilege level is done by lcall and iret instructions and lowering by lret and iret instructions. This explains the use of int 0x80 done while executing any system call. It is this int instruction which elevates the privilege level from user(0) to kernel(3) space.

The change in LDTR, GDTR and IDTR is also done. The RPL DPL are defined from these only. 

No comments:

Post a Comment