Tuesday 8 March 2016

Linux Scheduling Internals

Lets start with Linux scheduling :

Linux Scheduling Classes :
In Linux scheduling is determined by the scheduling class to which the process belong.

The sched_class data structure can be found in include/linux/sched.h:
All existing scheduling classes in the kernel are in a list.
stop_sched_class → rt_sched_class → fair_sched_class → idle_sched_class → NULL

Stop and Idle are special scheduling classes. Stop is used to schedule the per-cpu stop task. It pre-empts everything and can be pre-empted by nothing, and Idle is used to schedule the per-cpu idle task (also called swapper task) which is run if no other task is runnable. The other two are for real time and normal tasks.

fair_sched_class
kernel\sched\fair.c implements the CFS scheduler described above.
rt_sched_class
kernel\sched\rt.c implements SCHED_FIFO and SCHED_RR semantics

Initialisation of task_struct : 
We can see the elements related to scheduling in task_struct
struct task_struct {
..
int prio, static_prio, normal_prio;
unsigned int rt_priority;
const struct sched_class *sched_class;
struct sched_entity se;
struct sched_rt_entity rt;
#ifdef CONFIG_CGROUP_SCHED
struct task_group *sched_task_group;
#endif
struct sched_dl_entity dl;
...

The task struct is initialized with sched_class when a process is forked
sched_fork :
} else if (rt_prio(p->prio)) {
p->sched_class = &rt_sched_class;
} else {
p->sched_class = &fair_sched_class;
}


Scheduler Policies : 
The POSIX standard specifies three scheduling policies, one of which is the “usual” or normal policy and is always the default. The other two are (soft) realtime scheduling policies. They are:
○ SCHED_NORMAL (or SCHED_OTHER) ← Default scheduling policy
○ SCHED_RR
○ SCHED_FIFO


SCHED_NORMAL(Complete fair scheduling - CFS) and Process Vruntime :
In CFS the virtual runtime is expressed and tracked via the per-task p->se.vruntime (nanosec-unit) value. This way, it's possible to accurately timestamp and measure the "expected CPU time" a task should have gotten.
CFS's task picking logic is based on this p->se.vruntime value and it is thus very simple, it always tries to run the task with the smallest p->se.vruntime value

CFS maintains a time-ordered rbtree, where all runnable tasks are sorted by the p->se.vruntime key. This key is updated in function entity_tick->update_curr


Runqueues
The basic data structure in the scheduler is the runqueue. The runqueue is defined in kernel/sched.c as struct rq. The runqueue is the list of runnable processes on a given processor; there is one runqueue per processor.

runqueue data structures for fair and real time scheduling classes
struct cfs_rq cfs;
struct rt_rq rt;


TIF_NEED_RESCHED :
The timer interrupt sets the need_resched flag of the task_struct indicating the schedule function to be called.

How exactly (and where in the kernel codebase) is the TIF_NEED_RESCHED flag set?

tick_setup_device
tick_setup_periodic
tick_set_periodic_handler
  dev->event_handler = tick_handle_periodic;
tick_handle_periodic(struct clock_event_device *dev)
tick_periodic
update_process_times(int user_tick)
scheduler_tick();

scheduler_tick calls task_tick
..
curr->sched_class->task_tick(rq, curr, 0);
..

For CFS task_tick function is task_tick_fair which calls entity_tick

entity_tick(cfs_rq, se, queued);
update_curr(cfs_rq); -- Update the current task's runtime statistics and calls resched_task
if (queued) {
resched_task(rq_of(cfs_rq)->curr);
return;
}
TIF_NEED_RESCHED getting set :
void resched_task(struct task_struct *p)
..
set_tsk_need_resched(p);
..

hrtimer way of updating process time and setting TIF_NEED_RESCHED
run_timer_softirq
hrtimer_run_pending
hrtimer_switch_to_hres
tick_setup_sched_timer
ts->sched_timer.function = tick_sched_timer;
-- tick_sched_timer
--tick_sched_handle
update_process_times(user_mode(regs));

TIF_NEED_RESCHED flag is checked on interrupt and userspace return. If this flag is set then the current process is scheduled out and a call to __schedule is made.

Scheduler Entry points :

1. Based on the TIF_NEED_RESCHED flag scheduling function schedule() is called from these places :
A) upon returning to user-space (system call return path). If it is set, the kernel invokes the scheduler before continuing.
Snippet from entry_64.S
ret_from_sys_call
..
sysret_careful:
bt $TIF_NEED_RESCHED,%edx
jnc sysret_signal
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_NONE)
pushq_cfi %rdi
SCHEDULE_USER

#ifdef CONFIG_CONTEXT_TRACKING
# define SCHEDULE_USER call schedule_user
#else
# define SCHEDULE_USER call schedule
#endif

B)  upon returning from a hardware interrupt, the need_resched flag is checked. If it is set And preempt_count is zero (meaning we're in a preemtible region of the kernel and no locks are held), the kernel invokes the scheduler before continuing
Schedule() function getting called :
While returning from interrupt in entry_64.S function :
ENTRY(retint_kernel)
cmpl $0,PER_CPU_VAR(__preempt_count)
jnz  retint_restore_args
bt   $9,EFLAGS-ARGOFFSET(%rsp) /* interrupts off? */
jnc  retint_restore_args
call preempt_schedule_irq
jmp exit_intr

preempt_schedule_irq

void __sched preempt_schedule_irq(void)
...
local_irq_enable();
__schedule();
local_irq_disable();
...
2. Schedule is called when currently running task goes to sleep
/* ‘q’ is the wait queue we wish to sleep on */
DEFINE_WAIT(wait);
add_wait_queue(q, &wait);
while (!condition) { /* condition is the event that we are waiting for */
prepare_to_wait(&q, &wait, TASK_INTERRUPTIBLE);
if (signal_pending(current))
/* handle signal */
schedule();
} finish_wait(&q, &
wait);


3. Sleeping task wakes up
The code that causes the event the sleeping task is waiting for typically calls wake_up() on the corresponding wait queue which eventually ends up in the scheduler function try_to_wake_up()

try_to_wake_up ->
ttwu_queue ->
ttwu_do_activate- >
ttwu_activate ->
activate_task->
enqueue_task->
p->sched_class->enqueue_task(rq, p, flags);



No comments:

Post a Comment