Wednesday, 29 August 2018

Linux system call implementation on x86_64

Invoking a system call from userspace:

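An x86_64 user program invokes a system call by placing the system call number (0 for read) in the RAX register and the arguments in specific registers (RDI, RSI and RDX for the first three; R10, R8 and R9 for the rest), then executing the syscall instruction. The result comes back in RAX. The syscall(2) man page summarizes this in its calling-convention table:

arch/ABI    instruction           syscall #  retval  error    Notes
x86-64      syscall               rax        rax     -        [5]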

This instruction switches the processor to ring 0 and jumps to the address stored in the MSR_LSTAR model-specific register. It also saves the user return address in RCX and the user flags in R11.

MSR_LSTAR is initialized during kernel boot by syscall_init():

/* May not be marked __init: used by software suspend */
void syscall_init(void)
{
    extern char _entry_trampoline[];
    extern char entry_SYSCALL_64_trampoline[];

    int cpu = smp_processor_id();
    unsigned long SYSCALL64_entry_trampoline =
        (unsigned long)get_cpu_entry_area(cpu)->entry_trampoline +
        (entry_SYSCALL_64_trampoline - _entry_trampoline);

    wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
    /* With page table isolation, SYSCALL must enter through the trampoline. */
    if (static_cpu_has(X86_FEATURE_PTI))
        wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
    else
        wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);

#ifdef CONFIG_IA32_EMULATION
    wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
    /*
     * This only works on Intel CPUs.
     * On AMD CPUs these MSRs are 32-bit, CPU truncates MSR_IA32_SYSENTER_EIP.
     * This does not cause SYSENTER to jump to the wrong location, because
     * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
     */
    wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
    wrmsrl_safe(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_entry_stack(cpu) + 1));
    wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
#else
    wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
    wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
    wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
    wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
#endif

    /* Flags to clear on syscall */
    wrmsrl(MSR_SYSCALL_MASK,
           X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
           X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
}

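MSR_LSTAR therefore points at the entry code that runs for every system call. In the kernel version shown above this is entry_SYSCALL_64, defined in arch/x86/entry/entry_64.S; older kernels named it system_call.
The entry code saves the user registers on the kernel stack and calls the function pointer at index RAX in sys_call_table. The excerpt below comes from the older system_call code, which dispatched directly in assembly; newer kernels perform the same table lookup in the C function do_syscall_64().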

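movq %r10,%rcx                  /* fixup for C: 4th argument arrives in R10 because syscall overwrites RCX */
call *sys_call_table(,%rax,8)   /* call the handler at sys_call_table[RAX]; each entry is 8 bytes */
movq %rax,RAX-ARGOFFSET(%rsp)   /* store the return value in the saved RAX slot on the stack */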

