进程切换

__schedule是进程切换(调度)的核心函数,它先调用pick_next_task函数算出需要运行哪个进程,然后调用context_switch函数完成切换。

驱动调度器并进入__schedule这个核心调度函数有多种途径,比如任务主动阻塞或显式调用schedule、中断/系统调用返回时发现TIF_NEED_RESCHED标志、preempt_enable等抢占点触发抢占等。

调用__schedule时必须禁用抢占(preemption disabled),这一点通常由它的调用者来保证,比如__schedule_loop就会先禁用抢占再调用__schedule。
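
以较新内核里的__schedule_loop为例(摘自kernel/sched/core.c,不同版本细节可能略有差异),它在循环里先禁用抢占、调用__schedule,并在need_resched仍置位时重复这一过程:

static __always_inline void __schedule_loop(int sched_mode)
{
	do {
		preempt_disable();
		__schedule(sched_mode);
		sched_preempt_enable_no_resched();
	} while (need_resched());
}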

__schedule函数的参数sched_mode指明了以何种模式进入的调度器,它的取值可以有:

/*
 * Constants for the sched_mode argument of __schedule().
 *
 * The mode argument allows RT enabled kernels to differentiate a
 * preemption from blocking on an 'sleeping' spin/rwlock.
 */
#define SM_IDLE			(-1)
#define SM_NONE			0
#define SM_PREEMPT		1
#define SM_RTLOCK_WAIT		2

内核调度器在调度空闲任务时会使用SM_IDLE模式,表示当前没有可运行的普通任务,CPU可能会进入低功耗模式。

SM_NONE是普通调度模式,不涉及特殊的抢占或锁等待情况。这是最常见的调度模式,表示当前任务只是正常调度,而不是因为抢占或等待锁。

SM_PREEMPT表示本次进入调度器是因为任务被抢占(preempted)。当更高优先级的任务就绪时,内核可以抢占当前任务,比如在抢占式内核(Preemptible Kernel)或实时内核中普通任务被抢占,或者实时策略(SCHED_FIFO、SCHED_RR)的任务就绪时抢占低优先级任务,都会使用此模式。

SM_RTLOCK_WAIT表示任务正在等待RT(实时)锁。实时内核(PREEMPT_RT)提供了一种机制,使spinlock和rwlock变为可睡眠的锁(普通内核的自旋锁是不会睡眠的),当任务因为获取这类实时锁(RT lock)而阻塞时,就会使用这个模式。

总的来说,这些调度模式主要用于区分不同的调度情况,特别是在抢占式调度和RT内核下:普通任务切换使用SM_NONE,表示任务因为时间片耗尽或显式调度而切换;抢占(Preemption)使用SM_PREEMPT,表示任务因为更高优先级任务到来而被抢占;空闲调度使用SM_IDLE,表示CPU进入空闲模式;实时锁等待使用SM_RTLOCK_WAIT,表示任务在RT互斥锁(如rtmutex)上睡眠等待。

比如preempt_schedule_common函数调用__schedule时就指明了参数SM_PREEMPT,表明本次进入调度器是因为抢占。

1. __schedule

以下是__schedule的实现:

static void __sched notrace __schedule(int sched_mode)
{
	struct task_struct *prev, *next;
	/*
	 * On PREEMPT_RT kernel, SM_RTLOCK_WAIT is noted
	 * as a preemption by schedule_debug() and RCU.
	 */
	bool preempt = sched_mode > SM_NONE;
	unsigned long *switch_count;
	unsigned long prev_state;
	struct rq_flags rf;
	struct rq *rq;
	int cpu;

	cpu = smp_processor_id();
	rq = cpu_rq(cpu);
	prev = rq->curr;

	schedule_debug(prev, preempt);

	if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
		hrtick_clear(rq);

	local_irq_disable();
	rcu_note_context_switch(preempt);

	/*
	 * Make sure that signal_pending_state()->signal_pending() below
	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
	 * done by the caller to avoid the race with signal_wake_up():
	 *
	 * __set_current_state(@state)		signal_wake_up()
	 * schedule()				  set_tsk_thread_flag(p, TIF_SIGPENDING)
	 *					  wake_up_state(p, state)
	 *   LOCK rq->lock			    LOCK p->pi_state
	 *   smp_mb__after_spinlock()		    smp_mb__after_spinlock()
	 *     if (signal_pending_state())	    if (p->state & @state)
	 *
	 * Also, the membarrier system call requires a full memory barrier
	 * after coming from user-space, before storing to rq->curr; this
	 * barrier matches a full barrier in the proximity of the membarrier
	 * system call exit.
	 */
	rq_lock(rq, &rf);
	smp_mb__after_spinlock();

	/* Promote REQ to ACT */
	rq->clock_update_flags <<= 1;
	update_rq_clock(rq);
	rq->clock_update_flags = RQCF_UPDATED;

	switch_count = &prev->nivcsw;

	/* Task state changes only considers SM_PREEMPT as preemption */
	preempt = sched_mode == SM_PREEMPT;

	/*
	 * We must load prev->state once (task_struct::state is volatile), such
	 * that we form a control dependency vs deactivate_task() below.
	 */
	prev_state = READ_ONCE(prev->__state);
	if (sched_mode == SM_IDLE) {
		/* SCX must consult the BPF scheduler to tell if rq is empty */
		if (!rq->nr_running && !scx_enabled()) {
			next = prev;
			goto picked;
		}
	} else if (!preempt && prev_state) {
		try_to_block_task(rq, prev, prev_state);
		switch_count = &prev->nvcsw;
	}

	next = pick_next_task(rq, prev, &rf);
	rq_set_donor(rq, next);
picked:
	clear_tsk_need_resched(prev);
	clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
	rq->last_seen_need_resched_ns = 0;
#endif

	if (likely(prev != next)) {
		rq->nr_switches++;
		/*
		 * RCU users of rcu_dereference(rq->curr) may not see
		 * changes to task_struct made by pick_next_task().
		 */
		RCU_INIT_POINTER(rq->curr, next);
		/*
		 * The membarrier system call requires each architecture
		 * to have a full memory barrier after updating
		 * rq->curr, before returning to user-space.
		 *
		 * Here are the schemes providing that barrier on the
		 * various architectures:
		 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC,
		 *   RISC-V.  switch_mm() relies on membarrier_arch_switch_mm()
		 *   on PowerPC and on RISC-V.
		 * - finish_lock_switch() for weakly-ordered
		 *   architectures where spin_unlock is a full barrier,
		 * - switch_to() for arm64 (weakly-ordered, spin_unlock
		 *   is a RELEASE barrier),
		 *
		 * The barrier matches a full barrier in the proximity of
		 * the membarrier system call entry.
		 *
		 * On RISC-V, this barrier pairing is also needed for the
		 * SYNC_CORE command when switching between processes, cf.
		 * the inline comments in membarrier_arch_switch_mm().
		 */
		++*switch_count;

		migrate_disable_switch(rq, prev);
		psi_account_irqtime(rq, prev, next);
		psi_sched_switch(prev, next, !task_on_rq_queued(prev) ||
					     prev->se.sched_delayed);

		trace_sched_switch(preempt, prev, next, prev_state);

		/* Also unlocks the rq: */
		rq = context_switch(rq, prev, next, &rf);
	} else {
		rq_unpin_lock(rq, &rf);
		__balance_callbacks(rq);
		raw_spin_rq_unlock_irq(rq);
	}
}

该函数本身比较简单。每个cpu都有一个关联的运行队列(percpu变量runqueues),首先通过smp_processor_id获取当前运行cpu的编号:

# define smp_processor_id() __smp_processor_id()

不同架构有不同的__smp_processor_id实现,对于x86架构来说,每个cpu都维护有一个pcpu_hot结构体,里面有cpu_number成员记录了当前运行的cpu号,cpu_number在初始化的时候通过start_kernel->setup_per_cpu_areas去设置:

for_each_possible_cpu(cpu) {
        ...
        per_cpu(pcpu_hot.cpu_number, cpu) = cpu;
        ...
}

而对于其它架构比如arm64,则是在当前运行线程中的一个成员进行记录:

#define raw_smp_processor_id() (current_thread_info()->cpu)

有了cpu号,就可以通过cpu_rq获得对应当前运行cpu的rq运行队列了:

DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
#define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))

DECLARE_PER_CPU_SHARED_ALIGNED只是声明,配合对应的DEFINE_PER_CPU_SHARED_ALIGNED定义,每个cpu都有了自己的rq空间,这样per_cpu就可以依据cpu号取到对应的rq运行队列。
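
下面是一个访问当前cpu对应rq的最小示意(示例函数名为笔者虚构,内核里对应的封装是this_rq()宏):

/* 仅为示意per-cpu的访问方式,非内核原文 */
static inline struct rq *example_this_rq(void)
{
	int cpu = smp_processor_id();	/* 当前运行cpu的编号 */

	return cpu_rq(cpu);		/* 即 &per_cpu(runqueues, cpu) */
}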

sched_debug

继续分析schedule_debug:

/*
 * Various schedule()-time debugging checks and statistics:
 */
static inline void schedule_debug(struct task_struct *prev, bool preempt)
{
#ifdef CONFIG_SCHED_STACK_END_CHECK
	if (task_stack_end_corrupted(prev))
		panic("corrupted stack end detected inside scheduler\n");

	if (task_scs_end_corrupted(prev))
		panic("corrupted shadow stack detected inside scheduler\n");
#endif

#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
	if (!preempt && READ_ONCE(prev->__state) && prev->non_block_count) {
		printk(KERN_ERR "BUG: scheduling in a non-blocking section: %s/%d/%i\n",
			prev->comm, prev->pid, prev->non_block_count);
		dump_stack();
		add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
	}
#endif

	if (unlikely(in_atomic_preempt_off())) {
		__schedule_bug(prev);
		preempt_count_set(PREEMPT_DISABLED);
	}
	rcu_sleep_check();
	SCHED_WARN_ON(ct_state() == CT_STATE_USER);

	profile_hit(SCHED_PROFILING, __builtin_return_address(0));

	schedstat_inc(this_rq()->sched_count);
}

该函数主要是做一些调度时的debug检查。第一个检查是看即将被切换出去的prev其内核栈末端(低地址处)的magic值是否被破坏了(笔者环境开了CONFIG_SCHED_STACK_END_CHECK):

#ifdef CONFIG_SCHED_STACK_END_CHECK
	if (task_stack_end_corrupted(prev))
		panic("corrupted stack end detected inside scheduler\n");

	if (task_scs_end_corrupted(prev))
		panic("corrupted shadow stack detected inside scheduler\n");
#endif
#define task_stack_end_corrupted(task) \
		(*(end_of_stack(task)) != STACK_END_MAGIC)

对于开启了CONFIG_THREAD_INFO_IN_TASK配置的end_of_stack实现如下:

static __always_inline unsigned long *end_of_stack(const struct task_struct *task)
{
#ifdef CONFIG_STACK_GROWSUP
	return (unsigned long *)((unsigned long)task->stack + THREAD_SIZE) - 1;
#else
	return task->stack;
#endif
}

可以看到这里返回了task内核栈的低地址端(栈底)。x86架构上内核栈自顶(大地址处)向下(小地址处)生长,而task->stack通过如下代码分配得到的就是这块栈内存的起始低地址,也就是end_of_stack返回的内核栈结束的地方:

static int alloc_thread_stack_node(struct task_struct *tsk, int node)
{
	unsigned long *stack;
	stack = kmem_cache_alloc_node(thread_stack_cache, THREADINFO_GFP, node);
	stack = kasan_reset_tag(stack);
	tsk->stack = stack;
	return stack ? 0 : -ENOMEM;
}

alloc_thread_stack_node被dup_task_struct函数在创建进程的时候调用。对于配置了CONFIG_THREAD_INFO_IN_TASK,进程的thread_info结构体就在task_struct里,而不是传统的放到内核栈task->stack处,这样可以简化内核栈的管理,不用在栈上去处理thread_info的逻辑(比如加偏移取相应的thread_info里的成员)。

task_stack_end_corrupted主要就是检查栈底处的值是不是STACK_END_MAGIC,该值同样通过dup_task_struct->set_task_stack_end_magic去设置:

void set_task_stack_end_magic(struct task_struct *tsk)
{
	unsigned long *stackend;

	stackend = end_of_stack(tsk);
	*stackend = STACK_END_MAGIC;	/* for overflow detection */
}

对于逐渐向低地址越界的写(栈溢出),一旦覆盖到栈末端的magic值,就会在该进程被调度出去时检测到并panic。调度出去是一个恰当的检测时机,这样可以避免下次调度到该进程时才面对一个已被破坏的栈。
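
下面用一段用户态代码示意这种“栈末端magic值”检测的原理(假设性示例,仅为说明思路,与内核实现无关):

#include <stdio.h>

#define STACK_END_MAGIC	0x57AC6E9DUL

static unsigned long fake_stack[1024];	/* 模拟一块向低地址生长的内核栈 */

int main(void)
{
	fake_stack[0] = STACK_END_MAGIC;	/* 栈末端(低地址处)写入magic */

	/* ...若栈向下溢出,就会覆盖掉fake_stack[0]处的magic... */

	if (fake_stack[0] != STACK_END_MAGIC)
		printf("corrupted stack end detected\n");

	return 0;
}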

task_scs_end_corrupted是类似的原理,不再详细介绍。

sched_debug

再往下看schedule_debug的实现:

if (unlikely(in_atomic_preempt_off())) {
	__schedule_bug(prev);
	preempt_count_set(PREEMPT_DISABLED);
}

这个条件大概率是不会满足的,这里主要想分析这个条件是在判断什么:

/*
 * Check whether we were atomic before we did preempt_disable():
 * (used by the scheduler)
 */
#define in_atomic_preempt_off() (preempt_count() != PREEMPT_DISABLE_OFFSET)
static __always_inline int preempt_count(void)
{
	return raw_cpu_read_4(pcpu_hot.preempt_count) & ~PREEMPT_NEED_RESCHED;
}
/*
 * The preempt_count offset after preempt_disable();
 */
#if defined(CONFIG_PREEMPT_COUNT)
# define PREEMPT_DISABLE_OFFSET	PREEMPT_OFFSET
#else
# define PREEMPT_DISABLE_OFFSET	0
#endif
#define PREEMPT_OFFSET	(1UL << PREEMPT_SHIFT)
#define PREEMPT_SHIFT	0

对于开启了CONFIG_PREEMPT_COUNT的内核,在x86上这里就是检查preempt_count是否等于1:等于1时in_atomic_preempt_off()不成立,不会进入__schedule_bug。这其实是说,运行到调度器这段代码时,调用者已经调用过一次preempt_disable,而在那之前preempt_count必须为0,也就是既不处于原子上下文,也没有多余嵌套的preempt_disable;换言之,原子上下文(比如软硬中断中)不允许调度发生。

继续往下看一个warn判断:

SCHED_WARN_ON(ct_state() == CT_STATE_USER);

ct_state函数在启用CONFIG_CONTEXT_TRACKING_USER配置时会返回有意义的值,它主要用来追踪cpu的当前上下文,可能的状态有:

enum ctx_state {
	CT_STATE_DISABLED	= -1,	/* returned by ct_state() if unknown */
	CT_STATE_KERNEL		= 0,
	CT_STATE_IDLE		= 1,
	CT_STATE_USER		= 2,
	CT_STATE_GUEST		= 3,
	CT_STATE_MAX		= 4,
};

这些状态会在相应的上下文切换代码流程里被显式更新。RCU可以利用这些状态确认当前CPU是否处于用户态,以便判断是否可以将该CPU视为“非活跃”,从而推进RCU回收。另外在某些配置(如NO_HZ_FULL模式)下,内核会在CPU运行于用户态时停掉周期性的定时器tick以减少开销,但这样会导致cputime统计变得不准确,因为CPU进入用户态后不再有定时器中断来更新时间统计;CONTEXT_TRACKING_USER通过显式追踪进入/退出用户态的时间点,使CPU时间统计在NO_HZ_FULL模式下仍然保持准确。

至于本warn判断本身,调度器(schedule)运行时一定是在内核态,如果ct_state() == CT_STATE_USER,说明context tracking机制出了问题,所以这里加了SCHED_WARN_ON作为一个调试检查。

继续往下看是增加一个profile计数:

profile_hit(SCHED_PROFILING, __builtin_return_address(0));
/*
 * Single profiler hit:
 */
static inline void profile_hit(int type, void *ip)
{
	/*
	 * Speedup for the common (no profiling enabled) case:
	 */
	if (unlikely(prof_on == type))
		profile_hits(type, ip, 1);
}

可以进行profiling的包括三个模块:

#define CPU_PROFILING	1
#define SCHED_PROFILING	2
#define KVM_PROFILING	4

prof_on在启动时依据profile=启动参数经由profile_setup被设置为上面三个类型中的一个。__builtin_return_address是编译器内置函数,可以返回当前函数的返回地址。profile_hits最终调用do_profile_hits,其实现如下:

static void do_profile_hits(int type, void *__pc, unsigned int nr_hits)
{
	unsigned long pc;
	pc = ((unsigned long)__pc - (unsigned long)_stext) >> prof_shift;
	if (pc < prof_len)
		atomic_add(nr_hits, &prof_buffer[pc]);
}

prof_buffer是在初始化函数profile_init里分配的空间。可以看到do_profile_hits就是记录内核text段某个pc被采样命中的次数,是一种性能统计功能,它实现在CONFIG_PROFILING配置下,统计结果可以通过/proc下对应的文件访问。

最后是schedstat_inc自增rq的sched_count,当然这需要开启CONFIG_SCHEDSTATS配置,这样/proc/schedstat就可以反应一些调度统计信息了:

schedstat_inc(this_rq()->sched_count);
#define   schedstat_inc(var)		do { if (schedstat_enabled()) { var++; } } while (0)

注意这里是宏展开,不是函数调用,展开就是文本替换,在预编译阶段完成:var被直接替换为表达式this_rq()->sched_count,是否真正自增仍由schedstat_enabled()条件决定:

#define   schedstat_inc(var)		do { if (schedstat_enabled()) { this_rq()->sched_count++; } } while (0)

回到__schedule,继续往下看:

if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
    hrtick_clear(rq);

HRTICK是高精度调度定时器功能,默认情况下这个功能是关闭的,但是sched_feat宏本身是有定义的,尤其是笔者的环境开了CONFIG_SCHED_DEBUG以及CONFIG_JUMP_LABEL两个配置,sched_feat使用静态分支判断的方法:

#define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))

通过sched_feat这里主要是想介绍上面提到的两个配置的功能,至于HRTICK这个调度器feature本身,默认是不开启的:

SCHED_FEAT(HRTICK, false)

CONFIG_SCHED_DEBUG开启了调度器的调试信息,这样就会在/sys/kernel/debug/sched/下展示很多关于调度器的信息,开启这个代价很小,所以笔者环境默认开启了,比如/sys/kernel/debug/sched/features可以查看当前scheduler开启哪些feature:

PLACE_LAG PLACE_DEADLINE_INITIAL PLACE_REL_DEADLINE RUN_TO_PARITY PREEMPT_SHORT NO_NEXT_BUDDY PICK_BUDDY CACHE_HOT_BUDDY DELAY_DEQUEUE DELAY_ZERO WAKEUP_PREEMPTION NO_HRTICK NO_HRTICK_DL NO_DOUBLE_TICK NONTASK_CAPACITY TTWU_QUEUE SIS_UTIL NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI NO_RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS UTIL_EST NO_LATENCY_WARN

前面带NO_前缀的就是没有开启的feature,比如NO_HRTICK,当然运行时可以动态修改这个文件以开启某个feature。另外要介绍的一个配置是CONFIG_JUMP_LABEL,它优化了almost-always-true和almost-always-false这样的分支。传统上这类分支还是由cmp指令比较后决定是否跳转,依赖硬件分支预测,一旦预测错误,面临的性能损失比较大;jump label的做法是把这样的分支位置编译成nop指令,然后在运行时通过static_key_enable/disable这样的接口一路向下到text_poke,把nop动态改写为jmp跳到对应分支(或改回nop,表示条件不满足)。

由于上面的条件通常不满足,这里就不再深入分析了。

接着后面是关闭本地cpu的中断响应,local_irq_disable在x86架构上使用的是cli指令,它只会禁止对可屏蔽的外部中断的响应,而异常和NMI(不可屏蔽中断)还是会响应的。

然后通过rq_lock获取操作runqueues的自旋锁,也就是操作rq需要有锁保护,防止并发操作带来的数据一致性问题。

继续看对smp_mb__after_spinlock的调用,它主要是针对arm64这类弱内存序架构的同步操作。在调用__schedule前,调用者通常已经通过__set_current_state把进程状态设置为比如TASK_INTERRUPTIBLE,而另外的独立路径可能正在唤醒该进程并设置它有信号需要处理(比如通过set_tsk_thread_flag(p, TIF_SIGPENDING));在当前这条调用__schedule的路径上,稍后也会调用signal_pending_state去检查进程是否有待处理的pending信号:

static inline int signal_pending_state(unsigned int state, struct task_struct *p)
{
	if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
		return 0;
	if (!signal_pending(p))
		return 0;

	return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
}
static inline int task_sigpending(struct task_struct *p)
{
	return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
}

static inline int signal_pending(struct task_struct *p)
{
	/*
	 * TIF_NOTIFY_SIGNAL isn't really a signal, but it requires the same
	 * behavior in terms of ensuring that we break out of wait loops
	 * so that notify signal callbacks can be processed.
	 */
	if (unlikely(test_tsk_thread_flag(p, TIF_NOTIFY_SIGNAL)))
		return 1;
	return task_sigpending(p);
}

可以看到在能够检查另外独立路径设置的TIF_SIGPENDING前(通过signal_pending->task_sigpending),signal_pending_state需要先检查进程状态是否为TASK_INTERRUPTIBLE,如果进程不处于可中断的状态,根本不会进一步检查是否有pending的信号,这样独立路径设置的需要处理的TIF_SIGPENDING标志,就因为TASK_INTERRUPTIBLE没有及时写入而错过了处理。

先检查进程为TASK_INTERRUPTIBLE是因为只有可中断状态,才有意义可以接受待处理的信号,而TASK_UNINTERRUPTIBLE是不响应信号的。

独立路径写入TIF_SIGPENDING只是可能引起这个问题表现的条件/场景,不是根本原因,根本原因在于TASK_INTERRUPTIBLE的写入与TIF_SIGPENDING读取之间的并发问题,导致signal_pending_state可能在错误的时间点返回错误的结果。

update_rq_clock

调用update_rq_clock前对clock_update_flags左移一位,clock_update_flags有三种取值可能:

/*
 * rq::clock_update_flags bits
 *
 * %RQCF_REQ_SKIP - will request skipping of clock update on the next
 *  call to __schedule(). This is an optimisation to avoid
 *  neighbouring rq clock updates.
 *
 * %RQCF_ACT_SKIP - is set from inside of __schedule() when skipping is
 *  in effect and calls to update_rq_clock() are being ignored.
 *
 * %RQCF_UPDATED - is a debug flag that indicates whether a call has been
 *  made to update_rq_clock() since the last time rq::lock was pinned.
 *
 * If inside of __schedule(), clock_update_flags will have been
 * shifted left (a left shift is a cheap operation for the fast path
 * to promote %RQCF_REQ_SKIP to %RQCF_ACT_SKIP), so you must use,
 *
 *	if (rq-clock_update_flags >= RQCF_UPDATED)
 *
 * to check if %RQCF_UPDATED is set. It'll never be shifted more than
 * one position though, because the next rq_unpin_lock() will shift it
 * back.
 */
#define RQCF_REQ_SKIP		0x01
#define RQCF_ACT_SKIP		0x02
#define RQCF_UPDATED		0x04

其中取值为RQCF_REQ_SKIP时,左移一位就变成RQCF_ACT_SKIP,这个值会让update_rq_clock直接返回。也就是说,内核其它地方可以把clock_update_flags设置为RQCF_REQ_SKIP,以指示__schedule里对update_rq_clock的调用实际不做调度时钟的更新,省去一部分开销;当然,在设置RQCF_REQ_SKIP的路径上必然已经调用过update_rq_clock对时钟做了更新。

接下来分析下update_rq_clock函数:

void update_rq_clock(struct rq *rq)
{
	s64 delta;
	u64 clock;

	lockdep_assert_rq_held(rq);

	if (rq->clock_update_flags & RQCF_ACT_SKIP)
		return;

#ifdef CONFIG_SCHED_DEBUG
	if (sched_feat(WARN_DOUBLE_CLOCK))
		SCHED_WARN_ON(rq->clock_update_flags & RQCF_UPDATED);
	rq->clock_update_flags |= RQCF_UPDATED;
#endif
	clock = sched_clock_cpu(cpu_of(rq));
	scx_rq_clock_update(rq, clock);

	delta = clock - rq->clock;
	if (delta < 0)
		return;
	rq->clock += delta;

	update_rq_clock_task(rq, delta);
}

这里可以看到如果clock_update_flags为RQCF_ACT_SKIP,那么就直接返回了,而不更新rq里的对应时钟,然后本次更新了时钟,就将clock_update_flags设置为RQCF_UPDATED。

update_rq_clock函数通过sched_clock_cpu获得当前的时间戳(在x86架构上一般就是rdtsc指令),和上次记录的clock相减后得到本次增加的delta时间,关于sched_clock_cpu的实现涉及内核clock模块,参见笔者其它文章介绍。

在这里简单介绍下cpu_of的实现:

static inline int cpu_of(struct rq *rq)
{
#ifdef CONFIG_SMP
	return rq->cpu;
#else
	return 0;
#endif
}

调度初始化函数sched_init里完成了rq->cpu的设置:

for_each_possible_cpu(i) {
	struct rq *rq;

	rq = cpu_rq(i);
	...
	rq->cpu = i;
	...
}

随后的scx_rq_clock_update只在配置了CONFIG_SCHED_CLASS_EXT时才有有效实现。该配置提供一个基于BPF(Berkeley Packet Filter)的可扩展调度框架,旨在让开发者能够快速编写、部署和实验新的调度策略,而不需要修改Linux内核的核心代码:可以像开发普通BPF程序一样调整调度逻辑,为应用定义专属的CPU调度策略,并且可以在不重启系统、不重新编译内核的情况下切换调度策略。具体的实现细节可以参考笔者其它文章。

rq->clock记录的是真实的物理时间,所以直接累加delta即可,但update_rq_clock_task里更新的rq成员需要对delta进行调整:

static void update_rq_clock_task(struct rq *rq, s64 delta)
{
/*
 * In theory, the compile should just see 0 here, and optimize out the call
 * to sched_rt_avg_update. But I don't trust it...
 */
	s64 __maybe_unused steal = 0, irq_delta = 0;

#ifdef CONFIG_IRQ_TIME_ACCOUNTING
	if (irqtime_enabled()) {
		irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;

		/*
		 * Since irq_time is only updated on {soft,}irq_exit, we might run into
		 * this case when a previous update_rq_clock() happened inside a
		 * {soft,}IRQ region.
		 *
		 * When this happens, we stop ->clock_task and only update the
		 * prev_irq_time stamp to account for the part that fit, so that a next
		 * update will consume the rest. This ensures ->clock_task is
		 * monotonic.
		 *
		 * It does however cause some slight miss-attribution of {soft,}IRQ
		 * time, a more accurate solution would be to update the irq_time using
		 * the current rq->clock timestamp, except that would require using
		 * atomic ops.
		 */
		if (irq_delta > delta)
			irq_delta = delta;

		rq->prev_irq_time += irq_delta;
		delta -= irq_delta;
		delayacct_irq(rq->curr, irq_delta);
	}
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
	if (static_key_false((&paravirt_steal_rq_enabled))) {
		u64 prev_steal;

		steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
		steal -= rq->prev_steal_time_rq;

		if (unlikely(steal > delta))
			steal = delta;

		rq->prev_steal_time_rq = prev_steal;
		delta -= steal;
	}
#endif

	rq->clock_task += delta;

#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
	if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
		update_irq_load_avg(rq, irq_delta + steal);
#endif
	update_rq_clock_pelt(rq, delta);
}

CONFIG_IRQ_TIME_ACCOUNTING配置控制的代码主要是从任务的运行时间里扣除其用于中断的时间;delayacct_irq还可以把这段花在中断上的时间(也就是由irq引起的delay)统计到task_struct::delays::irq_delay,前提是开启了CONFIG_TASK_DELAY_ACCT。开启这个配置能统计到的delay不只有irq,还有比如blkio、swap等。

一个问题是如果这个中断就是任务引起的,是不是也应当属于任务自己的时间。

CONFIG_PARAVIRT_TIME_ACCOUNTING主要针对虚拟化环境下,如果一个VCPU被Hypervisor抢占,那么需要从运行的delta时间里扣除这部分,因为实际上这部分时间任务并没有运行,使得进程的CPU使用率更加准确。

CONFIG_HAVE_SCHED_AVG_IRQ在笔者环境一般没有配置。

最后update_rq_clock_task->update_rq_clock_pelt里会按cpu的算力以及频率对delta物理时间进行缩放,也就是经过相同的delta时间,算力强频率高的cpu实际负载更大:

/*
 * The clock_pelt scales the time to reflect the effective amount of
 * computation done during the running delta time but then sync back to
 * clock_task when rq is idle.
 *
 *
 * absolute time   | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16
 * @ max capacity  ------******---------------******---------------
 * @ half capacity ------************---------************---------
 * clock pelt      | 1| 2|    3|    4| 7| 8| 9|   10|   11|14|15|16
 *
 */
static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
{
	if (unlikely(is_idle_task(rq->curr))) {
		_update_idle_rq_clock_pelt(rq);
		return;
	}

	/*
	 * When a rq runs at a lower compute capacity, it will need
	 * more time to do the same amount of work than at max
	 * capacity. In order to be invariant, we scale the delta to
	 * reflect how much work has been really done.
	 * Running longer results in stealing idle time that will
	 * disturb the load signal compared to max capacity. This
	 * stolen idle time will be automatically reflected when the
	 * rq will be idle and the clock will be synced with
	 * rq_clock_task.
	 */

	/*
	 * Scale the elapsed time to reflect the real amount of
	 * computation
	 */
	delta = cap_scale(delta, arch_scale_cpu_capacity(cpu_of(rq)));
	delta = cap_scale(delta, arch_scale_freq_capacity(cpu_of(rq)));

	rq->clock_pelt += delta;
}
#define cap_scale(v, s)		((v)*(s) >> SCHED_CAPACITY_SHIFT)

update_rq_clock_pelt按频率以及算力缩放的值放在rq::clock_pelt量里。


arch_scale_cpu_capacity用于获取cpu的计算能力:

unsigned long arch_scale_cpu_capacity(int cpu)
{
	if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
		return READ_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity);

	return SCHED_CAPACITY_SCALE;
}

hybrid一般就是针对arm架构开启了big.LITTLE,而针对Intel x86就是P-core/E-core,针对笔者的配置,就是返回默认的SCHED_CAPACITY_SCALE,也就是没有区分不同cpu的算力:

# define SCHED_FIXEDPOINT_SHIFT		10
# define SCHED_CAPACITY_SHIFT		SCHED_FIXEDPOINT_SHIFT
# define SCHED_CAPACITY_SCALE		(1L << SCHED_CAPACITY_SHIFT)

这里可以看到,固定的CPU算力就是1024。第一次cap_scale先把delta乘以算力1024,再右移SCHED_CAPACITY_SHIFT(10)位,两者正好抵消,相当于经过多少物理时间就累计多少负载,因为各cpu算力相同。

第二个arch_scale_freq_capacity是根据频率来缩放delta时间。这个宏在x86架构上就是读取percpu变量arch_freq_scale,其值由scale_freq_tick函数经this_cpu_write写入:本质上是分别对MSR_IA32_APERF和MSR_IA32_MPERF这两个msr寄存器求差值delta,再把两个差值相除,把得到的商写入arch_freq_scale,作为当前频率需要缩放的比例。MSR_IA32_APERF(Actual Performance Frequency)记录的是实际工作的时钟周期数,而MSR_IA32_MPERF(Maximum Performance Frequency)是按最大(基准)频率计数的理论工作周期数,比值大于1代表当前处于turbo模式睿频运行,小于1代表没有满频率运行。

注意scale_freq_tick里先通过check_shl_overflow把acnt(APERF差值)左移20位(乘以2^20)进行放大,再通过check_mul_overflow把mcnt(MPERF差值)放大约2^10倍(假设没有开启hybrid异构算力架构),最终效果是把delta(acnt)与delta(mcnt)这个本来是小数的比值乘以了1024:1024代表比值1,小于1024代表小数,大于1024代表比值大于1,cpu频率处于turbo状态。这是内核处理小数运算的常见手段,把小数乘以1024这样的整数,避免内核做浮点运算,当然最小精度只有1/1024;也可以换成2048等更大的数以表示更小的小数,或者做截断处理,当发现小数小于1/1024时直接记为0。
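
按照上面的简化(忽略溢出检查、turbo比例修正和hybrid情况),可以用一段用户态代码直观感受这种定点比值的算法:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t acnt = 3000000;	/* delta(MSR_IA32_APERF),假设性数值 */
	uint64_t mcnt = 2000000;	/* delta(MSR_IA32_MPERF),假设性数值 */

	/* (acnt << 20) / (mcnt << 10) == (acnt / mcnt) * 1024,1024代表比值1 */
	uint64_t scale = (acnt << 20) / (mcnt << 10);

	/* 输出1536,大于1024,表示处于turbo */
	printf("freq scale = %llu\n", (unsigned long long)scale);
	return 0;
}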

最后,rq::clock_pelt累加的就是经过算力和频率两次缩放后的delta。

回到__schedule继续分析。switch_count先指向prev->nivcsw,这个计数表示非自愿上下文切换(non-voluntary context switch),“非自愿”的意思是任务被抢占或被强制切换。参数sched_mode表示调度模式,随后preempt被重新赋值为是否为SM_PREEMPT抢占模式。prev->__state是任务的状态,比如TASK_RUNNING、TASK_INTERRUPTIBLE等。

如果是以空闲调度模式(SM_IDLE)进入调度器,表示这时可能没有其它任务要执行了;若rq运行队列里确实没有可运行任务,并且没有启用BPF调度器(SCX),那么实际不需要切换任务,直接next = prev并goto picked,省去pick_next_task挑选下一个任务的过程。而如果不是抢占切换且prev_state非0,代表当前任务主动让出cpu想要阻塞自己,那么调用try_to_block_task将当前任务标记为阻塞状态并从运行队列里移除;这种情况下要把计数指针切换为nvcsw,表示自愿上下文切换,因为后面会自增这个计数,到底自增哪个计数需要根据情况选择。

2. pick_next_task

pick_next_task依据是否开启CONFIG_SCHED_CORE配置有不同的实现,该配置提供core scheduling功能,用于控制SMT(比如Intel的Hyper-Threading)兄弟线程上可以同时运行哪些任务,一般没有开启,此时它的实现如下:

static struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	return __pick_next_task(rq, prev, rf);
}

/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	const struct sched_class *class;
	struct task_struct *p;

	rq->dl_server = NULL;

	if (scx_enabled())
		goto restart;

	/*
	 * Optimization: we know that if all tasks are in the fair class we can
	 * call that function directly, but only if the @prev task wasn't of a
	 * higher scheduling class, because otherwise those lose the
	 * opportunity to pull in more work from other CPUs.
	 */
	if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
		   rq->nr_running == rq->cfs.h_nr_queued)) {

		p = pick_next_task_fair(rq, prev, rf);
		if (unlikely(p == RETRY_TASK))
			goto restart;

		/* Assume the next prioritized class is idle_sched_class */
		if (!p) {
			p = pick_task_idle(rq);
			put_prev_set_next_task(rq, prev, p);
		}

		return p;
	}

restart:
	prev_balance(rq, prev, rf);

	for_each_active_class(class) {
		if (class->pick_next_task) {
			p = class->pick_next_task(rq, prev);
			if (p)
				return p;
		} else {
			p = class->pick_task(rq);
			if (p) {
				put_prev_set_next_task(rq, prev, p);
				return p;
			}
		}
	}

	BUG(); /* The idle class should always have a runnable task. */
}

该函数主要分两部分。restart前的部分是针对CFS的优化:当rq里可运行进程的总数(nr_running)等于CFS队列里排队的进程数量(cfs.h_nr_queued),并且即将被换出的prev所属调度类优先级不高于fair_sched_class时,说明当前cpu的rq里只有CFS的进程,那么直接调用pick_next_task_fair选出合适的进程来运行即可,不用每次都遍历所有调度类。

如果不能满足上述条件,那么就会走restart标签的代码去遍历所有调度类找到一个合适的进程去运行,下面针对这两部分分别详细分析。

先看下sched_class_above的实现:

#define sched_class_above(_a, _b)	((_a) < (_b))

sched_class_above就是判断即将要调度出去的prev所属的调度类prev->sched_class的优先级是否高于fair_sched_class的优先级,若是就返回true,否则返回false。所以调度类的优先级高低,实际就是看调度类变量的地址小的优先级高,为什么由调度类变量的地址的大小就能确定一个调度类的优先级呢?这实际跟调度类的初始化代码以及链接器角度有关。

以fair_sched_class的定义为例:

/*
 * All the scheduling class methods:
 */
DEFINE_SCHED_CLASS(fair) = {

	.enqueue_task		= enqueue_task_fair,
	.dequeue_task		= dequeue_task_fair,
	.yield_task		= yield_task_fair,
	.yield_to_task		= yield_to_task_fair,

      ...
};
/*
 * Helper to define a sched_class instance; each one is placed in a separate
 * section which is ordered by the linker script:
 *
 *   include/asm-generic/vmlinux.lds.h
 *
 * *CAREFUL* they are laid out in *REVERSE* order!!!
 *
 * Also enforce alignment on the instance, not the type, to guarantee layout.
 */
#define DEFINE_SCHED_CLASS(name) \
const struct sched_class name##_sched_class \
	__aligned(__alignof__(struct sched_class)) \
	__section("__" #name "_sched_class")

以上定义会使得fair_sched_class这个量被放到__fair_sched_class这个section里,其它调度类也是类似的定义方式:

DEFINE_SCHED_CLASS(rt)
DEFINE_SCHED_CLASS(idle)
DEFINE_SCHED_CLASS(stop)
DEFINE_SCHED_CLASS(dl)
DEFINE_SCHED_CLASS(ext)

这些调度类会被链接脚本按序放置:

/*
 * The order of the sched class addresses are important, as they are
 * used to determine the order of the priority of each sched class in
 * relation to each other.
 */
#define SCHED_DATA				\
	STRUCT_ALIGN();				\
	__sched_class_highest = .;		\
	*(__stop_sched_class)			\
	*(__dl_sched_class)			\
	*(__rt_sched_class)			\
	*(__fair_sched_class)			\
	*(__ext_sched_class)			\
	*(__idle_sched_class)			\
	__sched_class_lowest = .;

比如,*(__dl_sched_class)表示把所有放入__dl_sched_class段的对象,链接到当前地址,链接地址按每行依次增大,这样具有较小地址的调度类拥有更高的优先级,这样sched_class_above采用直接比较地址大小的办法来确定不同调度类的优先级就有了根据。
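
举个例子(仅为示意):rt_sched_class被放入__rt_sched_class段,fair_sched_class被放入__fair_sched_class段,而链接脚本中rt段排在fair段之前,所以&rt_sched_class < &fair_sched_class,于是sched_class_above(&rt_sched_class, &fair_sched_class)为真,即rt调度类的优先级高于fair调度类。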

现在简单分析下另一个条件就是:

rq->nr_running == rq->cfs.h_nr_queued

条件本身是简单明晰的,就是通用rq运行队列里可运行进程的总数等于cfs队列里排队的进程数量。各个调度类在入队/出队时都会操作rq::nr_running计数,比如stop调度类出队进程时:

dequeue_task_stop->sub_nr_running

stop调度类入队时:

enqueue_task_stop->add_nr_running

对于fair CFS调度类来说也是一样,比如出队时:

dequeue_entities->sub_nr_running

入队时:

enqueue_task_fair->add_nr_running

同时,如果CFS操作了rq::nr_running,那么也会同步操作cfs_rq::h_nr_queued,比如在dequeue_entities里先会对cfs_rq::h_nr_queued进行递减,才会调用sub_nr_running去递减通用rq::nr_running计数,所以如果上述的条件满足,就意味着没有其它调度类的进程被enqueue到通用rq队列里,这样就可以直接调用CFS类的pick_next_task_fair函数了。如果从CFS队列里没能选出进程,那么就可以调用pick_task_idle选一个空闲进程来运行了,当然前提是fair调度类的下一个优先级的调度类就是idle类。

restart标签下的代码,先做了下负载均衡,prev_balance会从prev所属的调度类往下的优先级去遍历调度类,然后调用这些调度类的balance函数(如果有的话),当然restart下最重要的逻辑还是从最高优先级的调度类往最低优先级的调度类遍历,去寻找一个可以运行的进程:

#define for_each_active_class(class)						\
	for_active_class_range(class, __sched_class_highest, __sched_class_lowest)
#define for_active_class_range(class, _from, _to)				\
	for (class = (_from); class != (_to); class = next_active_class(class))
/*
 * Iterate only active classes. SCX can take over all fair tasks or be
 * completely disabled. If the former, skip fair. If the latter, skip SCX.
 */
static inline const struct sched_class *next_active_class(const struct sched_class *class)
{
	class++;
#ifdef CONFIG_SCHED_CLASS_EXT
	if (scx_switched_all() && class == &fair_sched_class)
		class++;
	if (!scx_enabled() && class == &ext_sched_class)
		class++;
#endif
	return class;
}

有了前面关于用调度类地址反映调度优先级的原理说明,现在看这个调度类遍历宏的实现就简单多了,就不必过多介绍了。这里只再提下sched_class::pick_next_task和sched_class::pick_task在这里的不同调用方式:前者一般只有CFS调度类实现了,使用pick_next_task时,调度核心框架(kernel/sched/core.c)就无需再去调用put_prev_set_next_task函数了。

最后,idle调度类总能提供一个可运行的进程(idle进程),所以正常情况下不会走到末尾的BUG()。

下面开始分析pick_next_task_fair函数:

struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	struct sched_entity *se;
	struct task_struct *p;
	int new_tasks;

again:
	p = pick_task_fair(rq);
	if (!p)
		goto idle;
	se = &p->se;

#ifdef CONFIG_FAIR_GROUP_SCHED
	if (prev->sched_class != &fair_sched_class)
		goto simple;

	__put_prev_set_next_dl_server(rq, prev, p);

	/*
	 * Because of the set_next_buddy() in dequeue_task_fair() it is rather
	 * likely that a next task is from the same cgroup as the current.
	 *
	 * Therefore attempt to avoid putting and setting the entire cgroup
	 * hierarchy, only change the part that actually changes.
	 *
	 * Since we haven't yet done put_prev_entity and if the selected task
	 * is a different task than we started out with, try and touch the
	 * least amount of cfs_rqs.
	 */
	if (prev != p) {
		struct sched_entity *pse = &prev->se;
		struct cfs_rq *cfs_rq;

		while (!(cfs_rq = is_same_group(se, pse))) {
			int se_depth = se->depth;
			int pse_depth = pse->depth;

			if (se_depth <= pse_depth) {
				put_prev_entity(cfs_rq_of(pse), pse);
				pse = parent_entity(pse);
			}
			if (se_depth >= pse_depth) {
				set_next_entity(cfs_rq_of(se), se);
				se = parent_entity(se);
			}
		}

		put_prev_entity(cfs_rq, pse);
		set_next_entity(cfs_rq, se);

		__set_next_task_fair(rq, p, true);
	}

	return p;

simple:
#endif
	put_prev_set_next_task(rq, prev, p);
	return p;

idle:
	if (!rf)
		return NULL;

	new_tasks = sched_balance_newidle(rq, rf);

	/*
	 * Because sched_balance_newidle() releases (and re-acquires) rq->lock, it is
	 * possible for any higher priority task to appear. In that case we
	 * must re-start the pick_next_entity() loop.
	 */
	if (new_tasks < 0)
		return RETRY_TASK;

	if (new_tasks > 0)
		goto again;

	/*
	 * rq is about to be idle, check if we need to update the
	 * lost_idle_time of clock_pelt
	 */
	update_idle_rq_clock_pelt(rq);

	return NULL;
}

该函数主体逻辑有三方面,一是通过pick_task_fair函数去选下一个运行的进程,二是处理组调度的逻辑,三是如果没有可运行的进程,还需要调用sched_balance_newidle去拉取其它rq(cpu)上的任务。

pick_task_fair留后面分析。如果这个函数选不出进程了,也就是p为空,就会跳到idle标签,其下的逻辑就是通过sched_balance_newidle去拉取其它cpu上任务,如果能拉取成功(返回值大于0),就会跳到again通过pick_task_fair再次挑选任务运行。

在开启了组调度并且要被切换的prev进程所属的调度类是CFS时,就会有一些逻辑处理组调度,如果条件不满足就是跳到simple标签,也就是只需要简单的对prev进行put操作就行。

处理组调度的逻辑相对复杂一些。首先组调度有一个depth层级的概念,所有调度实体组成一棵调度树,叶子节点是可以运行的进程实体,而中间层的组实体代表了其下任务的整体行为。对于挑选出来即将运行的p和即将切换出去的prev,如果它们不同,就会沿各自的路径循环向上:对prev路径上的实体进行put操作,对p路径上的实体进行set操作;每次迭代中,哪个更深(depth更大),就通过parent_entity找到它的父调度实体,实质上就是sched_entity::depth减小。is_same_group用来判断迭代路径上的两个sched_entity是否已经属于同一个cfs_rq:

/* Do the two (enqueued) entities belong to the same group ? */
static inline struct cfs_rq *
is_same_group(struct sched_entity *se, struct sched_entity *pse)
{
	if (se->cfs_rq == pse->cfs_rq)
		return se->cfs_rq;

	return NULL;
}

如果迭代到两个sched_entity属于同一个cfs rq时,这时put/set操作就可以停下来了,这算是一种优化,不用操作整个调度树。
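
举个简化的例子(仅为示意):假设prev是/cgA/cgB组下的任务(其se的depth为2),p是/cgA/cgC组下的任务(depth同为2)。第一轮循环里两者深度相同,于是对pse(prev的se)在cgB的cfs_rq上做put_prev_entity、对se(p的se)在cgC的cfs_rq上做set_next_entity,并分别上移到cgB、cgC这两个组实体(depth为1);它们都挂在cgA的cfs_rq上,is_same_group成立,循环退出,最后再在cgA这个公共cfs_rq上做一次put/set即可,cgA以上的层级完全不用动。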

前面提到了set/put操作,具体的函数实现分别就是set_next_entity和put_prev_entity,下面分析下这两个函数。首先是set_next_entity:

static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	clear_buddies(cfs_rq, se);

	/* 'current' is not kept within the tree. */
	if (se->on_rq) {
		/*
		 * Any task has to be enqueued before it get to execute on
		 * a CPU. So account for the time it spent waiting on the
		 * runqueue.
		 */
		update_stats_wait_end_fair(cfs_rq, se);
		__dequeue_entity(cfs_rq, se);
		update_load_avg(cfs_rq, se, UPDATE_TG);

		set_protect_slice(se);
	}

	update_stats_curr_start(cfs_rq, se);
	WARN_ON_ONCE(cfs_rq->curr);
	cfs_rq->curr = se;

	/*
	 * Track our maximum slice length, if the CPU's load is at
	 * least twice that of our own weight (i.e. don't track it
	 * when there are only lesser-weight tasks around):
	 */
	if (schedstat_enabled() &&
	    rq_of(cfs_rq)->cfs.load.weight >= 2*se->load.weight) {
		struct sched_statistics *stats;

		stats = __schedstats_from_se(se);
		__schedstat_set(stats->slice_max,
				max((u64)stats->slice_max,
				    se->sum_exec_runtime - se->prev_sum_exec_runtime));
	}

	se->prev_sum_exec_runtime = se->sum_exec_runtime;
}

内核调度器在pick出一个进程运行前,都需要先对它进行set操作。所谓set操作主要包括:如果进程还在就绪队列上(on_rq被设置,即还在红黑树上就绪等待),那么需要先把它从红黑树上出队,因为一个马上就要运行的进程不会同时还挂在红黑就绪树上等待。这里可能会有点歧义,“出队”反而代表进程要获得cpu运行了;正确的理解是,每个进程在运行前都需要enqueue到rq(红黑就绪树)排队,在要获得cpu运行时再dequeue,这里的enqueue/dequeue是针对红黑就绪树说的,而不是针对上cpu运行说的。一个典型的流程是:

唤醒 -> enqueue_entity -> 等待红黑树调度 -> 被选中 -> dequeue_entity -> 上CPU

总结起来就是,在CFS中,“出队(dequeue)”不是表示进程失去了运行权,而恰恰相反,是调度器确认它将获得运行权的标志性动作。enqueue/dequeue操作本质上是对红黑树的数据结构维护,而非进程运行与否的直接体现。

set操作还有就是更新即将运行进程的一些统计状态,最后一个比较重要的动作是设置cfs_rq::curr为当前选择出来即将运行的sched_entity,curr代表当前在cfs_rq上运行的实体,如果没有运行的进程就设置为NULL。

下面详细分析下set_next_entity的细节。clear_buddies用于清除sched_entity所在cfs_rq的buddy信息。

接下来如果sched_entity::on_rq非零,代表调度实体还在红黑就绪树上,那么这时需要通过update_stats_wait_end_fair函数更新它花在rq上的等待时间,随后__dequeue_entity出队rq,update_load_avg更新一下负载,set_protect_slice设置保护时间片,依次分析。

static inline void
update_stats_wait_end_fair(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	struct sched_statistics *stats;
	struct task_struct *p = NULL;

	if (!schedstat_enabled())
		return;

	stats = __schedstats_from_se(se);

	/*
	 * When the sched_schedstat changes from 0 to 1, some sched se
	 * maybe already in the runqueue, the se->statistics.wait_start
	 * will be 0.So it will let the delta wrong. We need to avoid this
	 * scenario.
	 */
	if (unlikely(!schedstat_val(stats->wait_start)))
		return;

	if (entity_is_task(se))
		p = task_of(se);

	__update_stats_wait_end(rq_of(cfs_rq), p, stats);
}

schedstat_enabled在没有开启CONFIG_SCHEDSTATS时都是0,否则就是sched_schedstats分支变量:

#define   schedstat_enabled()		static_branch_unlikely(&sched_schedstats)

笔者的环境开启了CONFIG_SCHEDSTATS配置,但是sched_schedstats默认是0的:

DEFINE_STATIC_KEY_FALSE(sched_schedstats);

所以通常情况下,update_stats_wait_end_fair在schedstat_enabled判断就该返回了,也就是cat /proc/schedstat默认是看不到wait_max,wait_count以及wait_sum这些成员的。

分析sched_schedstats这个分支变量,可以通过set_schedstats这个函数开关它:

static void set_schedstats(bool enabled)
{
	if (enabled)
		static_branch_enable(&sched_schedstats);
	else
		static_branch_disable(&sched_schedstats);
}

调用set_schedstats(也就是设置sched_schedstats)有两种手段:一是在启动命令行添加schedstats=enable;二是把/proc/sys/kernel/sched_schedstats修改为1。

假设现在开启了sched_schedstats,那么往下就会通过__schedstats_from_se去获得sched_statistics调度统计结构体:

static inline struct sched_statistics *
__schedstats_from_se(struct sched_entity *se)
{
#ifdef CONFIG_FAIR_GROUP_SCHED
	if (!entity_is_task(se))
		return &container_of(se, struct sched_entity_stats, se)->stats;
#endif
	return &task_of(se)->stats;
}

该函数的实现分两种情况,一是配置了组调度并且当前实体并不是一个实际的任务,那么sched_statistics结构体实际是内嵌在了sched_entity_stats结构体,所以可以用container_of这种结构体偏移的办法获得调度统计结构体。
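
sched_entity_stats的布局大致如下(取自kernel/sched/fair.c,不同版本可能略有差异),可以看到stats紧跟在se之后,所以用container_of即可反推出来:

struct sched_entity_stats {
	struct sched_entity     se;
	struct sched_statistics stats;
} __no_randomize_layout;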

这里判断一个实体是否是实际任务的办法很简单(当然是针对开启组调度CONFIG_FAIR_GROUP_SCHED而说的,没有开启这个配置时,任何调度实体sched_entity都是可上CPU运行的):

#define entity_is_task(se)	(!se->my_q)

my_q其实就是在开启组调度时(非空),该调度实体下拥有的cfs调度队列。

否则(没有开启组调度,或者该实体本身就是一个任务),sched_statistics就是内嵌在task_struct里的,__schedstats_from_se直接返回&task_of(se)->stats。

拿到sched_statistics后,有一个防御性检查:wait_start必须非0。wait_start是可能为0的,比如在运行时才通过上述sysctl方式开启该功能时,之前已经在队列上的调度实体其wait_start仍然是0,这样直接计算等待运行时间就会得到一个很大(离谱)的值。

再往后的代码取出了对应调度实体的task_struct结构体(如果该实体是一个任务)。

最后是真正的更新统计动作:

void __update_stats_wait_end(struct rq *rq, struct task_struct *p,
			     struct sched_statistics *stats)
{
	u64 delta = rq_clock(rq) - schedstat_val(stats->wait_start);

	if (p) {
		if (task_on_rq_migrating(p)) {
			/*
			 * Preserve migrating task's wait time so wait_start
			 * time stamp can be adjusted to accumulate wait time
			 * prior to migration.
			 */
			__schedstat_set(stats->wait_start, delta);

			return;
		}

		trace_sched_stat_wait(p, delta);
	}

	__schedstat_set(stats->wait_max,
			max(schedstat_val(stats->wait_max), delta));
	__schedstat_inc(stats->wait_count);
	__schedstat_add(stats->wait_sum, delta);
	__schedstat_set(stats->wait_start, 0);
}

这个函数的逻辑是显而易见的了,用rq_clock获得的clock减去之前记录的开始等待的时间wait_start(入队时会设置),就是当前调度实体等待了多久才获得cpu,同时设置新的可能变化的wait_max,也即最大等待时间,同时wait_count自增,wait_sum累加,由于__update_stats_wait_end结束后调度实体马上会上CPU运行,所以要设置wait_start为0。

这里需要说明的是,task_struct::on_rq其实有三种状态:为0代表不在任何运行队列上(比如已出队去睡眠),为1(TASK_ON_RQ_QUEUED)代表在rq上(包括正在cpu上运行的任务),为2(TASK_ON_RQ_MIGRATING)代表正处在迁移到另一个rq的过程中:

/* task_struct::on_rq states: */
#define TASK_ON_RQ_QUEUED	1
#define TASK_ON_RQ_MIGRATING	2

task_on_rq_migrating定义如下:

static inline int task_on_rq_migrating(struct task_struct *p)
{
	return READ_ONCE(p->on_rq) == TASK_ON_RQ_MIGRATING;
}

处理处于迁移状态进程的等待时间是一个边界情况:进程在cpuA的rqA上等待了一段时间delta_A,还没上cpuA运行就被迁移到了另一个cpuB,之后又在新的rqB上等待了delta_B,那么总的等待时间应该是delta_A + delta_B。迁移发生时,__update_stats_wait_end只是把delta_A写回了wait_start,也就是说此时wait_start的语义发生了变化,不再是一个时间戳,而是一段已等待的时间长度,然后就返回了。而在__update_stats_wait_start的实现里,设置新的开始等待时间戳时,会用当前时间减去这个已经等待的时间prev_wait_start(也就是delta_A):

void __update_stats_wait_start(struct rq *rq, struct task_struct *p,
			       struct sched_statistics *stats)
{
	u64 wait_start, prev_wait_start;

	wait_start = rq_clock(rq);
	prev_wait_start = schedstat_val(stats->wait_start);

	if (p && likely(wait_start > prev_wait_start))
		wait_start -= prev_wait_start;

	__schedstat_set(stats->wait_start, wait_start);
}

这样相当于把开始等待的时间往前拨了些,以达到累计等待时间的效果。对__update_stats_wait_start的调用可以是:

update_stats_enqueue_fair->update_stats_wait_start_fair->__update_stats_wait_start

也可以是:

put_prev_entity->update_stats_wait_start_fair->__update_stats_wait_start

前者是调度实体入队的时候,后者是从CPU上撤下重新put到rq时,这些时机都需要记录开始等待的时间戳。
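
举个简化的数值例子(仅为示意):任务在rqA上入队时rqA的clock为100,wait_start记为100;迁移发生时rqA的clock为130,__update_stats_wait_end算出delta_A=30并把wait_start改写为30(此刻它表示“已等待时长”);随后任务在rqB上入队,rqB的clock为500,__update_stats_wait_start把wait_start设为500-30=470;最终任务在rqB的clock为520时被选中运行,__update_stats_wait_end得到delta=520-470=50,正好等于delta_A(30)加上在rqB上等待的20。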

set调度实体的下一个动作是将调度实体dequeue出rq红黑树队列:

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
				  &min_vruntime_cb);
	avg_vruntime_sub(cfs_rq, se);
}
static __always_inline void
rb_erase_augmented_cached(struct rb_node *node, struct rb_root_cached *root,
			  const struct rb_augment_callbacks *augment)
{
	if (root->rb_leftmost == node)
		root->rb_leftmost = rb_next(node);
	rb_erase_augmented(node, &root->rb_root, augment);
}

cfs_rq里的tasks_timeline成员就是大名鼎鼎的任务挂载的红黑树,其定义如下:

struct rb_root_cached {
	struct rb_root rb_root;
	struct rb_node *rb_leftmost;
};

rb_leftmost缓存的是整棵树里下一个即将运行的进程,而rb_root是所有进程形成的树的根节点,sched_entity::run_node代表调度实体挂载到树上的节点。如果当前出队的进程刚好就是rb_leftmost,那么rb_leftmost需要更新,因为当前进程出队了。rb_erase_augmented里是真正删除节点的操作,里面还会维护红黑树的性质,具体细节不在本文主题范围内,就不展开了。由于当前调度实体出队了,一些负载要从当前的cfs_rq里扣除,这就是avg_vruntime_sub做的事情:

static void
avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	unsigned long weight = scale_load_down(se->load.weight);
	s64 key = entity_key(cfs_rq, se);

	cfs_rq->avg_vruntime -= key * weight;
	cfs_rq->avg_load -= weight;
}

进程的weight一般通过set_load_weight去设置。

下一个动作是通过update_load_avg进行负载更新。进程的负载计算本篇不打算详细介绍,会另开PELT算法主题单独分析。总的来说,该函数会把负载按周期进行衰减,这里的周期是1024us(约1ms):当前周期对负载的贡献是L,过去第1个周期的贡献是L * y^1,过去第2个周期是L * y^2,依次类推,其中y^32 = 0.5。
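
下面用一段用户态的近似代码直观感受这种几何衰减(仅为示意,内核实际用定点整数乘法加查表实现):

#include <stdio.h>
#include <math.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* y^32 = 0.5 */
	double load = 0.0;
	int n;

	/* 假设过去每个1024us周期的贡献都是L = 1024 */
	for (n = 0; n < 345; n++)
		load += 1024.0 * pow(y, n);

	/* 几何级数收敛于1024/(1 - y),约四万七千多,即PELT负载累计值的上限量级 */
	printf("accumulated load = %.0f\n", load);
	return 0;
}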

update_stats_curr_start更新了现在调度实体开始运行的时间戳:

/*
 * We are picking a new current task - update its stats:
 */
static inline void
update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	/*
	 * We are starting a new run period:
	 */
	se->exec_start = rq_clock_task(rq_of(cfs_rq));
}

cfs_rq当前运行的实体curr也要改成当前的se。

最后一段代码记录当前se的最大运行时间片slice_max:用当前的sum_exec_runtime减去上次记录的prev_sum_exec_runtime,并和之前的slice_max取较大者。不过这个更新是有条件的,那就是当前rq上cfs的总权重要足够大(不小于该se的load.weight的两倍)。这样set动作就介绍完了。

下面是介绍put动作:

static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
	/*
	 * If still on the runqueue then deactivate_task()
	 * was not called and update_curr() has to be done:
	 */
	if (prev->on_rq)
		update_curr(cfs_rq);

	/* throttle cfs_rqs exceeding runtime */
	check_cfs_rq_runtime(cfs_rq);

	if (prev->on_rq) {
		update_stats_wait_start_fair(cfs_rq, prev);
		/* Put 'current' back into the tree. */
		__enqueue_entity(cfs_rq, prev);
		/* in !on_rq case, update occurred at dequeue */
		update_load_avg(cfs_rq, prev, 0);
	}
	WARN_ON_ONCE(cfs_rq->curr != prev);
	cfs_rq->curr = NULL;
}

如果prev还在rq上,就需要先更新当前sched_entity的一些运行时信息,最重要的当属sched_entity::vruntime(较新的内核采用EEVDF调度算法,红黑树的排序键改为sched_entity::deadline),因为它会用来确定进程插入红黑树的位置,而插入的位置又决定接下来调度哪个进程上CPU运行:

/*
 * Update the current task's runtime statistics.
 */
static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	struct rq *rq = rq_of(cfs_rq);
	s64 delta_exec;
	bool resched;

	if (unlikely(!curr))
		return;

	delta_exec = update_curr_se(rq, curr);
	if (unlikely(delta_exec <= 0))
		return;

	curr->vruntime += calc_delta_fair(delta_exec, curr);
	resched = update_deadline(cfs_rq, curr);
	update_min_vruntime(cfs_rq);

	if (entity_is_task(curr)) {
		struct task_struct *p = task_of(curr);

		update_curr_task(p, delta_exec);

		/*
		 * If the fair_server is active, we need to account for the
		 * fair_server time whether or not the task is running on
		 * behalf of fair_server or not:
		 *  - If the task is running on behalf of fair_server, we need
		 *    to limit its time based on the assigned runtime.
		 *  - Fair task that runs outside of fair_server should account
		 *    against fair_server such that it can account for this time
		 *    and possibly avoid running this period.
		 */
		if (dl_server_active(&rq->fair_server))
			dl_server_update(&rq->fair_server, delta_exec);
	}

	account_cfs_rq_runtime(cfs_rq, delta_exec);

	if (cfs_rq->nr_queued == 1)
		return;

	if (resched || did_preempt_short(cfs_rq, curr)) {
		resched_curr_lazy(rq);
		clear_buddies(cfs_rq, curr);
	}
}

先通过update_curr_se算出delta_exec:

static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
{
	u64 now = rq_clock_task(rq);
	s64 delta_exec;

	delta_exec = now - curr->exec_start;
	if (unlikely(delta_exec <= 0))
		return delta_exec;

	curr->exec_start = now;
	curr->sum_exec_runtime += delta_exec;

	if (schedstat_enabled()) {
		struct sched_statistics *stats;

		stats = __schedstats_from_se(curr);
		__schedstat_set(stats->exec_max,
				max(delta_exec, stats->exec_max));
	}

	return delta_exec;
}

delta_exec就是进程自上次更新以来运行了多少时间。同时这个函数还把exec_start更新为当前时间,累加了进程总运行时间sum_exec_runtime,单次运行的最大时间exec_max也可能被更新。

接下来的calc_delta_fair里的逻辑是体现哪个进程优先调度的关键逻辑:

/*
 * delta /= w
 */
static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
	if (unlikely(se->load.weight != NICE_0_LOAD))
		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

	return delta;
}
/*
 * delta_exec * weight / lw.weight
 *   OR
 * (delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT
 *
 * Either weight := NICE_0_LOAD and lw \e sched_prio_to_wmult[], in which case
 * we're guaranteed shift stays positive because inv_weight is guaranteed to
 * fit 32 bits, and NICE_0_LOAD gives another 10 bits; therefore shift >= 22.
 *
 * Or, weight =< lw.weight (because lw.weight is the runqueue weight), thus
 * weight/lw.weight <= 1, and therefore our shift will also be positive.
 */
static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
	u64 fact = scale_load_down(weight);
	u32 fact_hi = (u32)(fact >> 32);
	int shift = WMULT_SHIFT;
	int fs;

	__update_inv_weight(lw);

	if (unlikely(fact_hi)) {
		fs = fls(fact_hi);
		shift -= fs;
		fact >>= fs;
	}

	fact = mul_u32_u32(fact, lw->inv_weight);

	fact_hi = (u32)(fact >> 32);
	if (fact_hi) {
		fs = fls(fact_hi);
		shift -= fs;
		fact >>= fs;
	}

	return mul_u64_u32_shr(delta_exec, fact, shift);
}

从这些逻辑可以看到,就是把delta_exec进行了缩放,缩放比例是NICE_0_LOAD/weight(该se的权重),而sched_entity::load.weight经由函数set_load_weight设置:

void set_load_weight(struct task_struct *p, bool update_load)
{
	int prio = p->static_prio - MAX_RT_PRIO;
	struct load_weight lw;

	if (task_has_idle_policy(p)) {
		lw.weight = scale_load(WEIGHT_IDLEPRIO);
		lw.inv_weight = WMULT_IDLEPRIO;
	} else {
		lw.weight = scale_load(sched_prio_to_weight[prio]);
		lw.inv_weight = sched_prio_to_wmult[prio];
	}

	/*
	 * SCHED_OTHER tasks have to update their load when changing their
	 * weight
	 */
	if (update_load && p->sched_class->reweight_task)
		p->sched_class->reweight_task(task_rq(p), p, &lw);
	else
		p->se.load = lw;
}

这里比较关键的就是有个sched_prio_to_weight数组:

/*
 * Nice levels are multiplicative, with a gentle 10% change for every
 * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
 * nice 1, it will get ~10% less CPU time than another CPU-bound task
 * that remained on nice 0.
 *
 * The "10% effect" is relative and cumulative: from _any_ nice level,
 * if you go up 1 level, it's -10% CPU usage, if you go down 1 level
 * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
 * If a task goes up by ~10% and another task goes down by ~10% then
 * the relative distance between them is ~25%.)
 */
const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

该数组按nice值填充,相邻nice值的权重相差约1.25倍,对应约10%的CPU使用率差距,nice为0的权重定义为1024。与之对应的是预先算好的2^32/weight数组sched_prio_to_wmult:因为对weight做除法效率较低,可以改为乘以这个数组里的值再右移32位,抵消预先乘上的2^32:

/*
 * Inverse (2^32/x) values of the sched_prio_to_weight[] array, pre-calculated.
 *
 * In cases where the weight does not change often, we can use the
 * pre-calculated inverse to speed up arithmetics by turning divisions
 * into multiplications:
 */
const u32 sched_prio_to_wmult[40] = {
 /* -20 */     48388,     59856,     76040,     92818,    118348,
 /* -15 */    147320,    184698,    229616,    287308,    360437,
 /* -10 */    449829,    563644,    704093,    875809,   1099582,
 /*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,
 /*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,
 /*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,
 /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
 /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};
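
结合这两个数组可以做一个小的数值演练,直观感受“乘以逆元再右移32位”等价于除以weight的效果(用户态示意代码,忽略__calc_delta里的溢出处理):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t delta_exec = 1000000;	/* 实际运行了1ms(单位ns) */

	/* nice=0: weight=1024, inv_weight=4194304(= 2^32/1024) */
	uint64_t v0 = (delta_exec * 1024 * 4194304ULL) >> 32;

	/* nice=5: weight=335, inv_weight=12820798(约 2^32/335) */
	uint64_t v5 = (delta_exec * 1024 * 12820798ULL) >> 32;

	/* v0 == delta_exec,而v5约为delta_exec的3倍:权重越低,vruntime涨得越快 */
	printf("nice0: %llu, nice5: %llu\n",
	       (unsigned long long)v0, (unsigned long long)v5);
	return 0;
}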

具体__calc_delta里的计算方式还有一些防溢出方面的优化,就不详细分析了。__calc_delta最后返回值的效果是:权重越大(nice值越低),NICE_0_LOAD/weight这个比率就越小(趋近于0),delta_exec被缩放得就越小,那么update_curr里的语句:

curr->vruntime += calc_delta_fair(delta_exec, curr);

往vruntime里累加的值就越小,而往进程红黑树里插入进程时比较的key值就是这里的vruntime(较新的调度算法是sched_entity::deadline,但原理类似),越小的vruntime越容易出现在进程调度红黑树的左半部分而选择被调度运行。

update_curr接下来调用update_deadline更新sched_entity::deadline的值:

/*
 * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
 * this is probably good enough.
 */
static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	if ((s64)(se->vruntime - se->deadline) < 0)
		return false;

	/*
	 * For EEVDF the virtual time slope is determined by w_i (iow.
	 * nice) while the request time r_i is determined by
	 * sysctl_sched_base_slice.
	 */
	if (!se->custom_slice)
		se->slice = sysctl_sched_base_slice;

	/*
	 * EEVDF: vd_i = ve_i + r_i / w_i
	 */
	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);

	/*
	 * The task has consumed its request, reschedule.
	 */
	return true;
}

这里可以看到,新的EEVDF算法会更新sched_entity::deadline,它等于vruntime加上经calc_delta_fair缩放后的slice。vruntime前面介绍过,权重越大(nice越低)增长越慢;同样地,更高权重的任务在申请同样的运行时间(slice)时,缩放后的虚拟时长更短,会得到一个“更近的deadline”,即期望更快被调度到。
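
举个简化例子(仅为示意,slice的默认值随内核版本不同而有差异):假设slice为0.75ms,对nice=0(weight=1024)的任务,deadline = vruntime + 0.75ms对应的虚拟时间;而对nice=-5(weight=3121)的任务,calc_delta_fair会把slice缩放为0.75ms * 1024/3121,约0.25ms,于是它的deadline离当前vruntime更近,在红黑树中更靠左,更容易被优先选中。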

update_curr后面调用update_min_vruntime:

static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
{
	u64 min_vruntime = cfs_rq->min_vruntime;
	/*
	 * open coded max_vruntime() to allow updating avg_vruntime
	 */
	s64 delta = (s64)(vruntime - min_vruntime);
	if (delta > 0) {
		avg_vruntime_update(cfs_rq, delta);
		min_vruntime = vruntime;
	}
	return min_vruntime;
}

static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
	struct sched_entity *se = __pick_root_entity(cfs_rq);
	struct sched_entity *curr = cfs_rq->curr;
	u64 vruntime = cfs_rq->min_vruntime;

	if (curr) {
		if (curr->on_rq)
			vruntime = curr->vruntime;
		else
			curr = NULL;
	}

	if (se) {
		if (!curr)
			vruntime = se->min_vruntime;
		else
			vruntime = min_vruntime(vruntime, se->min_vruntime);
	}

	/* ensure we never gain time by being placed backwards. */
	cfs_rq->min_vruntime = __update_min_vruntime(cfs_rq, vruntime);
}

这里要提下,更新cfs_rq::min_vruntime可不是delta < 0才更新,而是delta > 0,也就是新传进来的vruntime较大时,反而更新成这个vruntime,看起来不符合min_vruntime的字面意义,最小值不是应该越来越小吗?但实际上:这个vruntime是已经挑选好、即将要执行的任务的vruntime(cfs_rq::curr),也就是说,整个运行队列中,已经没有比它更小的vruntime了(小的都已经调度走了)。因此,这时可以安全地推进min_vruntime,它仍然代表“当前所有活跃任务中(在cfs_rq中的)最小的vruntime”。
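
举个数值例子(仅为示意):假设cfs_rq::min_vruntime当前为100,curr(且on_rq)的vruntime为130,红黑树根实体的min_vruntime为120,则先取min(130, 120) = 120,再与100比较,delta = 20 > 0,于是min_vruntime被推进到120;反之若算出的值不大于100,则保持100不变,从而保证min_vruntime单调不减。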

如果cfs_rq::curr是一个任务的话,就调用update_curr_task->account_group_exec_runtime,把运行时间累加到所在线程组的CPU时间统计里:

static inline void update_curr_task(struct task_struct *p, s64 delta_exec)
{
	trace_sched_stat_runtime(p, delta_exec);
	account_group_exec_runtime(p, delta_exec);
	cgroup_account_cputime(p, delta_exec);
}
/**
 * account_group_exec_runtime - Maintain exec runtime for a thread group.
 *
 * @tsk:	Pointer to task structure.
 * @ns:		Time value by which to increment the sum_exec_runtime field
 *		of the thread_group_cputime structure.
 *
 * If thread group time is being maintained, get the structure for the
 * running CPU and update the sum_exec_runtime field there.
 */
static inline void account_group_exec_runtime(struct task_struct *tsk,
					      unsigned long long ns)
{
	struct thread_group_cputimer *cputimer = get_running_cputimer(tsk);

	if (!cputimer)
		return;

	atomic64_add(ns, &cputimer->cputime_atomic.sum_exec_runtime);
}

而随后的account_cfs_rq_runtime会从cfs_rq::runtime_remaining里减去运行掉的delta_exec,这个功能跟CONFIG_CFS_BANDWIDTH(CFS带宽控制)有关。

如果update_deadline返回的resched为true,也就是需要重新调度当前进程,那么就会resched_curr_lazy设置需要重新调度的标志TIF_NEED_RESCHED:

void resched_curr_lazy(struct rq *rq)
{
	__resched_curr(rq, get_lazy_tif_bit());
}
static __always_inline int get_lazy_tif_bit(void)
{
	if (dynamic_preempt_lazy())
		return TIF_NEED_RESCHED_LAZY;

	return TIF_NEED_RESCHED;
}
/*
 * resched_curr - mark rq's current task 'to be rescheduled now'.
 *
 * On UP this means the setting of the need_resched flag, on SMP it
 * might also involve a cross-CPU call to trigger the scheduler on
 * the target CPU.
 */
static void __resched_curr(struct rq *rq, int tif)
{
	struct task_struct *curr = rq->curr;
	struct thread_info *cti = task_thread_info(curr);
	int cpu;

	lockdep_assert_rq_held(rq);

	/*
	 * Always immediately preempt the idle task; no point in delaying doing
	 * actual work.
	 */
	if (is_idle_task(curr) && tif == TIF_NEED_RESCHED_LAZY)
		tif = TIF_NEED_RESCHED;

	if (cti->flags & ((1 << tif) | _TIF_NEED_RESCHED))
		return;

	cpu = cpu_of(rq);

	if (cpu == smp_processor_id()) {
		set_ti_thread_flag(cti, tif);
		if (tif == TIF_NEED_RESCHED)
			set_preempt_need_resched();
		return;
	}

	if (set_nr_and_not_polling(cti, tif)) {
		if (tif == TIF_NEED_RESCHED)
			smp_send_reschedule(cpu);
	} else {
		trace_sched_wake_idle_without_ipi(cpu);
	}
}

一旦设置,中断返回、抢占点、定时器中断等都会检测这个标志,内核会尽快调用schedule进行上下文切换,如果是跨cpu,会通过IPI使其他CPU尽快进入调度流程。

这样update_curr介绍完了,回到put_prev_entity,如果sched_entity::on_rq非0,注意为非0时(on_rq = 1),任务可能在等待(红黑树中),也可能在运行(当前CPU上)。这段时间都算作任务仍在runqueue中,只有cfs_rq::curr指向的任务当前才正在CPU上运行,所以回顾update_curr是要先取cfs_rq::curr,然后更新其统计运行信息。

这样sched_entity::on_rq为非0时,就需要调用__enqueue_entity重新入下红黑树:

/*
 * Enqueue an entity into the rb-tree:
 */
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	avg_vruntime_add(cfs_rq, se);
	se->min_vruntime = se->vruntime;
	se->min_slice = se->slice;
	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
				__entity_less, &min_vruntime_cb);
}
static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
{
	return entity_before(__node_2_se(a), __node_2_se(b));
}
static inline bool entity_before(const struct sched_entity *a,
				 const struct sched_entity *b)
{
	/*
	 * Tiebreak on vruntime seems unnecessary since it can
	 * hardly happen.
	 */
	return (s64)(a->deadline - b->deadline) < 0;
}

到这里就可以很清楚地看到,插入rq红黑树的比较依据就是deadline(较老的调度算法只看vruntime),deadline小的在树的左半部分,会被优先调度到,这就是进程调度相对核心的逻辑;具体的树结构tasks_timeline前面已经涉及到了。

最后cfs_rq::curr被设置为NULL,代表当前cfs_rq上没有进程运行。

这样pick_next_task_fair里的set/put动作都介绍完了。

往下再看下通过sched_balance_newidle从别的rq上拉取任务的逻辑:

/*
 * sched_balance_newidle is called by schedule() if this_cpu is about to become
 * idle. Attempts to pull tasks from other CPUs.
 *
 * Returns:
 *   < 0 - we released the lock and there are !fair tasks present
 *     0 - failed, no new tasks
 *   > 0 - success, new (fair) tasks present
 */
static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
{
	unsigned long next_balance = jiffies + HZ;
	int this_cpu = this_rq->cpu;
	int continue_balancing = 1;
	u64 t0, t1, curr_cost = 0;
	struct sched_domain *sd;
	int pulled_task = 0;

	update_misfit_status(NULL, this_rq);

	/*
	 * There is a task waiting to run. No need to search for one.
	 * Return 0; the task will be enqueued when switching to idle.
	 */
	if (this_rq->ttwu_pending)
		return 0;

	/*
	 * We must set idle_stamp _before_ calling sched_balance_rq()
	 * for CPU_NEWLY_IDLE, such that we measure the this duration
	 * as idle time.
	 */
	this_rq->idle_stamp = rq_clock(this_rq);

	/*
	 * Do not pull tasks towards !active CPUs...
	 */
	if (!cpu_active(this_cpu))
		return 0;

	/*
	 * This is OK, because current is on_cpu, which avoids it being picked
	 * for load-balance and preemption/IRQs are still disabled avoiding
	 * further scheduler activity on it and we're being very careful to
	 * re-start the picking loop.
	 */
	rq_unpin_lock(this_rq, rf);

	rcu_read_lock();
	sd = rcu_dereference_check_sched_domain(this_rq->sd);

	if (!get_rd_overloaded(this_rq->rd) ||
	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

		if (sd)
			update_next_balance(sd, &next_balance);
		rcu_read_unlock();

		goto out;
	}
	rcu_read_unlock();

	raw_spin_rq_unlock(this_rq);

	t0 = sched_clock_cpu(this_cpu);
	sched_balance_update_blocked_averages(this_cpu);

	rcu_read_lock();
	for_each_domain(this_cpu, sd) {
		u64 domain_cost;

		update_next_balance(sd, &next_balance);

		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
			break;

		if (sd->flags & SD_BALANCE_NEWIDLE) {

			pulled_task = sched_balance_rq(this_cpu, this_rq,
						   sd, CPU_NEWLY_IDLE,
						   &continue_balancing);

			t1 = sched_clock_cpu(this_cpu);
			domain_cost = t1 - t0;
			update_newidle_cost(sd, domain_cost);

			curr_cost += domain_cost;
			t0 = t1;
		}

		/*
		 * Stop searching for tasks to pull if there are
		 * now runnable tasks on this rq.
		 */
		if (pulled_task || !continue_balancing)
			break;
	}
	rcu_read_unlock();

	raw_spin_rq_lock(this_rq);

	if (curr_cost > this_rq->max_idle_balance_cost)
		this_rq->max_idle_balance_cost = curr_cost;

	/*
	 * While browsing the domains, we released the rq lock, a task could
	 * have been enqueued in the meantime. Since we're not going idle,
	 * pretend we pulled a task.
	 */
	if (this_rq->cfs.h_nr_queued && !pulled_task)
		pulled_task = 1;

	/* Is there a task of a high priority class? */
	if (this_rq->nr_running != this_rq->cfs.h_nr_queued)
		pulled_task = -1;

out:
	/* Move the next balance forward */
	if (time_after(this_rq->next_balance, next_balance))
		this_rq->next_balance = next_balance;

	if (pulled_task)
		this_rq->idle_stamp = 0;
	else
		nohz_newidle_balance(this_rq);

	rq_repin_lock(this_rq, rf);

	return pulled_task;
}

pelt负载计算以及上面的负载均衡都可以另开主题,本篇主题主要是进程调度,就不详细分析了。

以上对pick_next_task的介绍整体上就差不多了。下面回到__schedule函数,picked标签之后的主要逻辑是:在prev和选出来要运行的next不是同一个进程时,通过context_switch完成包括mm在内的上下文切换,这留给下一节去分析。

3. context_switch

context_switch是完成进程切换至关重要的函数。在context_switch->switch_to里面,CPU执行流其实已经切换到了另一个线程,而那个线程当初也是在context_switch里的switch_to处被切换出去的;将来当前线程再次被调度回来时,也是从这里的context_switch->switch_to之后继续执行,进而从__schedule返回。也就是说,本次执行到switch_to时__schedule并不会立刻返回,CPU已经去执行另一个线程的逻辑了。

以下是context_switch的代码:

/*
 * context_switch - switch to the new MM and the new thread's register state.
 */
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next, struct rq_flags *rf)
{
	prepare_task_switch(rq, prev, next);

	/*
	 * For paravirt, this is coupled with an exit in switch_to to
	 * combine the page table reload and the switch backend into
	 * one hypercall.
	 */
	arch_start_context_switch(prev);

	/*
	 * kernel -> kernel   lazy + transfer active
	 *   user -> kernel   lazy + mmgrab_lazy_tlb() active
	 *
	 * kernel ->   user   switch + mmdrop_lazy_tlb() active
	 *   user ->   user   switch
	 *
	 * switch_mm_cid() needs to be updated if the barriers provided
	 * by context_switch() are modified.
	 */
	if (!next->mm) {                                // to kernel
		enter_lazy_tlb(prev->active_mm, next);

		next->active_mm = prev->active_mm;
		if (prev->mm)                           // from user
			mmgrab_lazy_tlb(prev->active_mm);
		else
			prev->active_mm = NULL;
	} else {                                        // to user
		membarrier_switch_mm(rq, prev->active_mm, next->mm);
		/*
		 * sys_membarrier() requires an smp_mb() between setting
		 * rq->curr / membarrier_switch_mm() and returning to userspace.
		 *
		 * The below provides this either through switch_mm(), or in
		 * case 'prev->active_mm == next->mm' through
		 * finish_task_switch()'s mmdrop().
		 */
		switch_mm_irqs_off(prev->active_mm, next->mm, next);
		lru_gen_use_mm(next->mm);

		if (!prev->mm) {                        // from kernel
			/* will mmdrop_lazy_tlb() in finish_task_switch(). */
			rq->prev_mm = prev->active_mm;
			prev->active_mm = NULL;
		}
	}

	/* switch_mm_cid() requires the memory barriers above. */
	switch_mm_cid(rq, prev, next);

	prepare_lock_switch(rq, next, rf);

	/* Here we just switch the register state and the stack. */
	switch_to(prev, next, prev);
	barrier();

	return finish_task_switch(prev);
}

从这个函数可以看到,有几种情况可以延迟做tlb的刷新操作:切换到内核线程时不必切换mm,因为内核线程没有自己的mm,一般也不会访问用户空间,只需通过enter_lazy_tlb设置一个标志,表示延迟做tlb刷新操作:

/*
 * Please ignore the name of this function.  It should be called
 * switch_to_kernel_thread().
 *
 * enter_lazy_tlb() is a hint from the scheduler that we are entering a
 * kernel thread or other context without an mm.  Acceptable implementations
 * include doing nothing whatsoever, switching to init_mm, or various clever
 * lazy tricks to try to minimize TLB flushes.
 *
 * The scheduler reserves the right to call enter_lazy_tlb() several times
 * in a row.  It will notify us that we're going back to a real mm by
 * calling switch_mm_irqs_off().
 */
void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
		return;

	this_cpu_write(cpu_tlbstate_shared.is_lazy, true);
}

先看prepare_task_switch的实现:

/**
 * prepare_task_switch - prepare to switch tasks
 * @rq: the runqueue preparing to switch
 * @prev: the current task that is being switched out
 * @next: the task we are going to switch to.
 *
 * This is called with the rq lock held and interrupts off. It must
 * be paired with a subsequent finish_task_switch after the context
 * switch.
 *
 * prepare_task_switch sets up locking and calls architecture specific
 * hooks.
 */
static inline void
prepare_task_switch(struct rq *rq, struct task_struct *prev,
		    struct task_struct *next)
{
	kcov_prepare_switch(prev);
	sched_info_switch(rq, prev, next);
	perf_event_task_sched_out(prev, next);
	rseq_preempt(prev);
	fire_sched_out_preempt_notifiers(prev, next);
	kmap_local_sched_out();
	prepare_task(next);
	prepare_arch_switch(next);
}

kcov_prepare_switch主要是代码覆盖率统计方面的逻辑,对应的CONFIG_KCOV配置一般不会开启。接着看sched_info_switch:

/*
 * Called when tasks are switched involuntarily due, typically, to expiring
 * their time slice.  (This may also be called when switching to or from
 * the idle task.)  We are only called when prev != next.
 */
static inline void
sched_info_switch(struct rq *rq, struct task_struct *prev, struct task_struct *next)
{
	/*
	 * prev now departs the CPU.  It's not interesting to record
	 * stats about how efficient we were at scheduling the idle
	 * process, however.
	 */
	if (prev != rq->idle)
		sched_info_depart(rq, prev);

	if (next != rq->idle)
		sched_info_arrive(rq, next);
}

SCHED_INFO可以导出很多调度方面的信息,所以一般是开启的。这里分两种情况统计调度信息:即将被换出的prev和即将被换入的next,只要不是idle任务,就分别调用对应的函数进行统计。先看sched_info_depart函数:

/*
 * Called when a process ceases being the active-running process involuntarily
 * due, typically, to expiring its time slice (this may also be called when
 * switching to the idle task).  Now we can calculate how long we ran.
 * Also, if the process is still in the TASK_RUNNING state, call
 * sched_info_enqueue() to mark that it has now again started waiting on
 * the runqueue.
 */
static inline void sched_info_depart(struct rq *rq, struct task_struct *t)
{
	unsigned long long delta = rq_clock(rq) - t->sched_info.last_arrival;

	rq_sched_info_depart(rq, delta);

	if (task_is_running(t))
		sched_info_enqueue(rq, t);
}

last_arrival是上次prev作为换入进程通过sched_info_arrive设置的,待会还会看到这个函数的分析。在sched_info_depart里,拿到现在的clock减去上次的last_arrival,就是本轮运行的delta,以这个参数调用rq_sched_info_depart:

/*
 * Expects runqueue lock to be held for atomicity of update
 */
static inline void
rq_sched_info_depart(struct rq *rq, unsigned long long delta)
{
	if (rq)
		rq->rq_cpu_time += delta;
}

可以看到该函数就是往该rq的rq_cpu_time上累加了delta时间,这个成员可以认为是该rq总共运行了多少cpu时间。task_is_running判断任务状态还是TASK_RUNNING的话,就会调用sched_info_enqueue重新记录排队时间戳:

#define task_is_running(task)		(READ_ONCE((task)->__state) == TASK_RUNNING)
/*
 * This function is only called from enqueue_task(), but also only updates
 * the timestamp if it is already not set.  It's assumed that
 * sched_info_dequeue() will clear that stamp when appropriate.
 */
static inline void sched_info_enqueue(struct rq *rq, struct task_struct *t)
{
	if (!t->sched_info.last_queued)
		t->sched_info.last_queued = rq_clock(rq);
}

也就是说进程只是下了物理CPU,但并没有离开rq运行队列,所以last_queued记录的是它离开物理CPU、开始在rq上等待的时刻。

继续看sched_info_arrive,该函数处理将要调度上CPU的进程的统计信息:

/*
 * Called when a task finally hits the CPU.  We can now calculate how
 * long it was waiting to run.  We also note when it began so that we
 * can keep stats on how long its time-slice is.
 */
static void sched_info_arrive(struct rq *rq, struct task_struct *t)
{
	unsigned long long now, delta = 0;

	if (!t->sched_info.last_queued)
		return;

	now = rq_clock(rq);
	delta = now - t->sched_info.last_queued;
	t->sched_info.last_queued = 0;
	t->sched_info.run_delay += delta;
	t->sched_info.last_arrival = now;
	t->sched_info.pcount++;
	if (delta > t->sched_info.max_run_delay)
		t->sched_info.max_run_delay = delta;
	if (delta && (!t->sched_info.min_run_delay || delta < t->sched_info.min_run_delay))
		t->sched_info.min_run_delay = delta;

	rq_sched_info_arrive(rq, delta);
}

该函数用现在的时间戳now减去之前排队的时间戳last_queued,得到本次的等待时间delta;同时把last_queued清零,表示任务不再在rq的红黑树上等待,而是即将获得CPU运行。run_delay记录总的等待延迟,max_run_delay记录最大的一次等待延迟(每次都要比较更新),min_run_delay同理。
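顺带一提,sched_info统计出的run_delay和pcount等信息,用户态可以直接从/proc/<pid>/schedstat读到(依赖CONFIG_SCHED_INFO)。下面是一个读取并打印这些字段的小例子,仅作演示:

#include <stdio.h>

int main(void)
{
	unsigned long long exec_ns = 0, run_delay_ns = 0;
	unsigned long pcount = 0;
	FILE *f = fopen("/proc/self/schedstat", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* 三个字段依次为:累计运行时间(ns)、run_delay(ns)、pcount */
	if (fscanf(f, "%llu %llu %lu", &exec_ns, &run_delay_ns, &pcount) == 3)
		printf("exec=%lluns run_delay=%lluns pcount=%lu\n",
		       exec_ns, run_delay_ns, pcount);
	fclose(f);
	return 0;
}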

rq_sched_info_arrive主要是统计rq调度队列层面的信息(之前都是讲的进程task_struct层面的信息):

static inline void
rq_sched_info_arrive(struct rq *rq, unsigned long long delta)
{
	if (rq) {
		rq->rq_sched_info.run_delay += delta;
		rq->rq_sched_info.pcount++;
	}
}

随后的perf_event_task_sched_out是CONFIG_PERF_EVENTS的功能,perf事件包括软硬件事件,软件事件通过内建机制或通用tracepoints(跟踪点)来支持。大多数现代CPU支持通过性能计数器寄存器来收集性能事件。这些寄存器可以统计某些类型的硬件事件的数量,例如:执行的指令数、遭遇的缓存未命中(cache miss)、分支预测失败(分支错误预测)等,并且不会对内核或应用程序造成性能下降。这些寄存器还可以在事件次数达到某个阈值时触发中断,因此可以用于对运行在该CPU上的代码进行性能分析(profiling)。Linux 的性能事件子系统(Performance Event Subsystem)对这些软件和硬件事件能力进行了抽象,通过系统调用向用户空间提供服务,供tools/perf/目录下的perf工具使用。它提供了基于任务(task)和基于CPU的计数器,并在此基础上提供事件统计功能。
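作为补充,下面给出一个使用perf_event_open系统调用统计本进程执行指令数的最小用户态示例(基本沿用man page里的常见写法,仅为演示,错误处理做了简化):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
			    int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
	struct perf_event_attr attr;
	uint64_t count;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HARDWARE;
	attr.size = sizeof(attr);
	attr.config = PERF_COUNT_HW_INSTRUCTIONS;
	attr.disabled = 1;
	attr.exclude_kernel = 1;
	attr.exclude_hv = 1;

	/* pid=0, cpu=-1:统计当前进程在任意CPU上的事件 */
	fd = perf_event_open(&attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	ioctl(fd, PERF_EVENT_IOC_RESET, 0);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

	for (volatile int i = 0; i < 1000000; i++)
		;

	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	read(fd, &count, sizeof(count));
	printf("instructions: %llu\n", (unsigned long long)count);

	close(fd);
	return 0;
}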

fire_sched_out_preempt_notifiers是调用注册在task_struct::preempt_notifiers上的通知链回调函数:现在prev要被调度出去了,就需要逐一调用注册在这条通知链上的回调。

下一个比较关键的工作是把next任务的on_cpu设置为1,这通过prepare_task完成:

static inline void prepare_task(struct task_struct *next)
{
#ifdef CONFIG_SMP
	/*
	 * Claim the task as running, we do this before switching to it
	 * such that any running task will have this set.
	 *
	 * See the smp_load_acquire(&p->on_cpu) case in ttwu() and
	 * its ordering comment.
	 */
	WRITE_ONCE(next->on_cpu, 1);
#endif
}

回到context_switch函数:

/*
 * kernel -> kernel   lazy + transfer active
 *   user -> kernel   lazy + mmgrab_lazy_tlb() active
 *
 * kernel ->   user   switch + mmdrop_lazy_tlb() active
 *   user ->   user   switch
 *
 * switch_mm_cid() needs to be updated if the barriers provided
 * by context_switch() are modified.
 */
if (!next->mm) {                                // to kernel
	enter_lazy_tlb(prev->active_mm, next);

	next->active_mm = prev->active_mm;
	if (prev->mm)                           // from user
		mmgrab_lazy_tlb(prev->active_mm);
	else
		prev->active_mm = NULL;
} else {                                        // to user
	membarrier_switch_mm(rq, prev->active_mm, next->mm);
	/*
	 * sys_membarrier() requires an smp_mb() between setting
	 * rq->curr / membarrier_switch_mm() and returning to userspace.
	 *
	 * The below provides this either through switch_mm(), or in
	 * case 'prev->active_mm == next->mm' through
	 * finish_task_switch()'s mmdrop().
	 */
	switch_mm_irqs_off(prev->active_mm, next->mm, next);
	lru_gen_use_mm(next->mm);

	if (!prev->mm) {                        // from kernel
		/* will mmdrop_lazy_tlb() in finish_task_switch(). */
		rq->prev_mm = prev->active_mm;
		prev->active_mm = NULL;
	}
}

如果next是一个内核线程,并不会立即flush tlb,只是通过enter_lazy_tlb设置一个标记;next->active_mm直接借用prev的active_mm(通常就是之前用户进程的mm)。

else分支是真正会马上进行切换mm的情况,具体调用函数就是switch_mm_irqs_off。

下面分析switch_mm_irqs_off函数:

/*
 * This optimizes when not actually switching mm's.  Some architectures use the
 * 'unused' argument for this optimization, but x86 must use
 * 'cpu_tlbstate.loaded_mm' instead because it does not always keep
 * 'current->active_mm' up to date.
 */
void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
			struct task_struct *tsk)
{
	struct mm_struct *prev = this_cpu_read(cpu_tlbstate.loaded_mm);
	u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
	bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
	unsigned cpu = smp_processor_id();
	unsigned long new_lam;
	struct new_asid ns;
	u64 next_tlb_gen;


	/* We don't want flush_tlb_func() to run concurrently with us. */
	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
		WARN_ON_ONCE(!irqs_disabled());

	/*
	 * Verify that CR3 is what we think it is.  This will catch
	 * hypothetical buggy code that directly switches to swapper_pg_dir
	 * without going through leave_mm() / switch_mm_irqs_off() or that
	 * does something like write_cr3(read_cr3_pa()).
	 *
	 * Only do this check if CONFIG_DEBUG_VM=y because __read_cr3()
	 * isn't free.
	 */
#ifdef CONFIG_DEBUG_VM
	if (WARN_ON_ONCE(__read_cr3() != build_cr3(prev->pgd, prev_asid,
						   tlbstate_lam_cr3_mask()))) {
		/*
		 * If we were to BUG here, we'd be very likely to kill
		 * the system so hard that we don't see the call trace.
		 * Try to recover instead by ignoring the error and doing
		 * a global flush to minimize the chance of corruption.
		 *
		 * (This is far from being a fully correct recovery.
		 *  Architecturally, the CPU could prefetch something
		 *  back into an incorrect ASID slot and leave it there
		 *  to cause trouble down the road.  It's better than
		 *  nothing, though.)
		 */
		__flush_tlb_all();
	}
#endif
	if (was_lazy)
		this_cpu_write(cpu_tlbstate_shared.is_lazy, false);

	/*
	 * The membarrier system call requires a full memory barrier and
	 * core serialization before returning to user-space, after
	 * storing to rq->curr, when changing mm.  This is because
	 * membarrier() sends IPIs to all CPUs that are in the target mm
	 * to make them issue memory barriers.  However, if another CPU
	 * switches to/from the target mm concurrently with
	 * membarrier(), it can cause that CPU not to receive an IPI
	 * when it really should issue a memory barrier.  Writing to CR3
	 * provides that full memory barrier and core serializing
	 * instruction.
	 */
	if (prev == next) {
		/* Not actually switching mm's */
		VM_WARN_ON(is_dyn_asid(prev_asid) &&
			   this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
			   next->context.ctx_id);

		/*
		 * If this races with another thread that enables lam, 'new_lam'
		 * might not match tlbstate_lam_cr3_mask().
		 */

		/*
		 * Even in lazy TLB mode, the CPU should stay set in the
		 * mm_cpumask. The TLB shootdown code can figure out from
		 * cpu_tlbstate_shared.is_lazy whether or not to send an IPI.
		 */
		if (IS_ENABLED(CONFIG_DEBUG_VM) &&
		    WARN_ON_ONCE(prev != &init_mm && !is_notrack_mm(prev) &&
				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
			cpumask_set_cpu(cpu, mm_cpumask(next));

		/* Check if the current mm is transitioning to a global ASID */
		if (mm_needs_global_asid(next, prev_asid)) {
			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
			ns = choose_new_asid(next, next_tlb_gen);
			goto reload_tlb;
		}

		/*
		 * Broadcast TLB invalidation keeps this ASID up to date
		 * all the time.
		 */
		if (is_global_asid(prev_asid))
			return;

		/*
		 * If the CPU is not in lazy TLB mode, we are just switching
		 * from one thread in a process to another thread in the same
		 * process. No TLB flush required.
		 */
		if (!was_lazy)
			return;

		/*
		 * Read the tlb_gen to check whether a flush is needed.
		 * If the TLB is up to date, just use it.
		 * The barrier synchronizes with the tlb_gen increment in
		 * the TLB shootdown code.
		 */
		smp_mb();
		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
				next_tlb_gen)
			return;

		/*
		 * TLB contents went out of date while we were in lazy
		 * mode. Fall through to the TLB switching code below.
		 */
		ns.asid = prev_asid;
		ns.need_flush = true;
	} else {
		/*
		 * Apply process to process speculation vulnerability
		 * mitigations if applicable.
		 */
		cond_mitigation(tsk);

		/*
		 * Indicate that CR3 is about to change. nmi_uaccess_okay()
		 * and others are sensitive to the window where mm_cpumask(),
		 * CR3 and cpu_tlbstate.loaded_mm are not all in sync.
		 */
		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
		barrier();

		/* Start receiving IPIs and then read tlb_gen (and LAM below) */
		if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))
			cpumask_set_cpu(cpu, mm_cpumask(next));
		next_tlb_gen = atomic64_read(&next->context.tlb_gen);

		ns = choose_new_asid(next, next_tlb_gen);
	}

reload_tlb:
	new_lam = mm_lam_cr3_mask(next);
	if (ns.need_flush) {
		VM_WARN_ON_ONCE(is_global_asid(ns.asid));
		this_cpu_write(cpu_tlbstate.ctxs[ns.asid].ctx_id, next->context.ctx_id);
		this_cpu_write(cpu_tlbstate.ctxs[ns.asid].tlb_gen, next_tlb_gen);
		load_new_mm_cr3(next->pgd, ns.asid, new_lam, true);

		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
	} else {
		/* The new ASID is already up to date. */
		load_new_mm_cr3(next->pgd, ns.asid, new_lam, false);

		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, 0);
	}

	/* Make sure we write CR3 before loaded_mm. */
	barrier();

	this_cpu_write(cpu_tlbstate.loaded_mm, next);
	this_cpu_write(cpu_tlbstate.loaded_mm_asid, ns.asid);
	cpu_tlbstate_update_lam(new_lam, mm_untag_mask(next));

	if (next != prev) {
		cr4_update_pce_mm(next);
		switch_ldt(prev, next);
	}
}

如果之前处于lazy tlb模式,这时就要清除is_lazy标志了,因为switch_mm_irqs_off本身就是要切换mm空间,不再是lazy模式。在Linux的lazy TLB机制中,当CPU切换到一个没有自己mm的内核线程时,不立即切换页表,也不刷新TLB,只切换寄存器上下文;CPU处于一种"懒惰"状态:上一个用户进程的页表仍然加载在CR3里,其TLB条目也继续有效,直到真正有需要(比如切换到另一个不同的mm)时才切换或刷新。这种机制避免了不必要的TLB flush和CR3写入,优化了上下文切换的性能。这个状态由per-CPU变量cpu_tlbstate_shared.is_lazy记录。

随后的逻辑分两个大的部分:一是prev和next是同一个mm_struct(同一个地址空间)的情况,这时很可能不做真正的切换动作,提前return;二是prev和next不一样(else分支),那就肯定要切换mm。所谓切换mm主要有两件事:一是把next进程的页表基地址作为构成CR3的一部分写入CR3,二是视条件决定是否刷新tlb。下面继续分析,首先是prev == next满足的情况,这个分支下先有一个WARN判断:

VM_WARN_ON(is_dyn_asid(prev_asid) &&
	   this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
	   next->context.ctx_id);

context是内嵌在mm_struct里的一个mm_context_t类型成员,也就是说mm_struct::context::ctx_id可以唯一标识一个mm_struct,在init_new_context函数里会初始化这个ctx_id:

mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);

last_mm_ctx_id是一个64位、只增不回退的全局计数器。每当有新的mm_struct被创建出来,就自增这个计数并赋给ctx_id,用以唯一确定一个mm_struct。而在switch_mm_irqs_off稍后的逻辑里,真正切换了mm_struct时才会更新cpu_tlbstate.loaded_mm_asid,以保证per-cpu的tlb_state结构体里的loaded_mm_asid准确反映当前使用的是哪个ASID槽。所谓ASID槽,指的就是tlb_state结构体里的ctxs数组,它的大小为TLB_NR_DYN_ASIDS,x86下是6。tlb_state::ctxs[]共计6个槽:当前正在用的槽由tlb_state::loaded_mm_asid指明,而这个槽对应的是哪个mm_struct,由tlb_state::ctxs[loaded_mm_asid].ctx_id指明。
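为了更直观,下面用一个简化的结构体模型示意这些per-cpu字段之间的关系。注意这只是笔者为说明概念写的示意,字段做了删减,并非内核里tlb_state的真实定义(真实定义见arch/x86/include/asm/tlbflush.h):

#define TLB_NR_DYN_ASIDS	6

/* 每个ASID槽:记录该槽缓存的是哪个mm(ctx_id)以及上次同步时的页表代数 */
struct tlb_ctx_slot {
	unsigned long long ctx_id;	/* 对应 mm->context.ctx_id */
	unsigned long long tlb_gen;	/* 对应 mm->context.tlb_gen 的快照 */
};

/* 每个CPU一份的TLB状态(简化模型) */
struct tlb_state_model {
	void *loaded_mm;			/* 当前CR3对应的mm_struct */
	unsigned short loaded_mm_asid;		/* 当前正在使用的槽号 */
	unsigned short next_asid;		/* 下次分配新槽的候选编号 */
	struct tlb_ctx_slot ctxs[TLB_NR_DYN_ASIDS];
};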

有了以上认知再来继续深入后面X86 tlb管理的代码就容易一些。

回到前面的WARN判断。它是在prev == next满足的情况下做的,正常情况下,per-cpu上正在使用的那个槽的tlb_context::ctx_id就应该等于mm_struct::context::ctx_id,因为在随后的逻辑里(当然是在刷新tlb时)就会设置per-cpu里的ctx_id:

this_cpu_write(cpu_tlbstate.ctxs[ns.asid].ctx_id, next->context.ctx_id);

再往下看:

/*
 * Even in lazy TLB mode, the CPU should stay set in the
 * mm_cpumask. The TLB shootdown code can figure out from
 * cpu_tlbstate_shared.is_lazy whether or not to send an IPI.
 */
if (IS_ENABLED(CONFIG_DEBUG_VM) &&
    WARN_ON_ONCE(prev != &init_mm && !is_notrack_mm(prev) &&
		 !cpumask_test_cpu(cpu, mm_cpumask(next))))
	cpumask_set_cpu(cpu, mm_cpumask(next));

正常来说CONFIG_DEBUG_VM不会开启,因为它会影响性能:

Enable this to turn on extended checks in the virtual-memory system
that may impact performance.	 

即使开启了,后面的WARN_ON_ONCE的条件一般也不会成立。这些检查和下面的cpumask_set_cpu是兜底逻辑,保证一个mm_struct在某个cpu上使用时,cpu号已经记录在mm_struct::cpu_bitmap里;这样将来其它CPU修改了这个mm_struct的页表时,就可以通过类似下面这些路径向本CPU发送IPI通知:

unmap_region->tlb_finish_mmu->tlb_flush_mmu->tlb_flush_mmu_tlbonly->tlb_flush->flush_tlb_mm->flush_tlb_mm_range->flush_tlb_multi->__flush_tlb_multi->native_flush_tlb_multi
try_to_unmap->rmap_walk->rmap_walk_anon->rmap_one(try_to_unmap_one)->flush_tlb_mm_range->flush_tlb_multi->__flush_tlb_multi->native_flush_tlb_multi
exit_mmap->unmap_vmas->unmap_single_vma->unmap_page_range->zap_p4d_range->zap_pud_range->zap_pmd_range->zap_pte_range->tlb_flush_mmu_tlbonly->tlb_flush->flush_tlb_mm->flush_tlb_mm_range->flush_tlb_multi->__flush_tlb_multi->native_flush_tlb_multi

在native_flush_tlb_multi里就有针对cpumask里每个cpu去调用flush_tlb_func:

on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);

可以看到,unmap时就会刷新tlb,并可能通知到其它cpu。tlb本质上就是虚拟地址到物理地址映射的缓存,所以页表变了就必须刷新tlb,以反映新的虚拟地址到物理地址的映射。

前面提到了,上面检查cpu是否在mm_struct::cpu_bitmap里的逻辑一般不会执行,那么往mm_struct::cpu_bitmap里设置cpu号是在哪里做的呢?实际就在switch_mm_irqs_off稍后的逻辑里:当prev == next不满足、确实要切换到不同的mm_struct时,就会把当前cpu号设置到next的cpu_bitmap里:

/* Start receiving IPIs and then read tlb_gen (and LAM below) */
if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))
	cpumask_set_cpu(cpu, mm_cpumask(next));

这样假如将来需要在next这个mm_struct上执行flush_tlb_func,就可以通过mm_cpumask找到是哪些CPU需要执行这个tlb刷新函数,tlb是每个CPU私有的per-CPU硬件资源,内容不会自动在核间同步,需要显式失效机制来保持一致性,也就是做:

flush_tlb_func->flush_tlb_one_user->__flush_tlb_one_user->native_flush_tlb_one_user->invpcid_flush_one->__invpcid

最后其实就是invpcid指令,关于它的详细介绍可以参考Intel SDM3 Chapter 4.10。

继续往下分析。mm_needs_global_asid->mm_global_asid主要在AMD平台上使用,因为mm_global_asid只在CONFIG_BROADCAST_TLB_FLUSH配置下才有实际实现,而这个配置:

config BROADCAST_TLB_FLUSH
	def_bool y
	depends on CPU_SUP_AMD && 64BIT

可以看到只会在64位的AMD平台上启用。

往下的代码:

/*
 * Broadcast TLB invalidation keeps this ASID up to date
 * all the time.
 */
if (is_global_asid(prev_asid))
	return;

也主要针对AMD平台:is_global_asid把不小于TLB_NR_DYN_ASIDS(6)的asid都识别为global的,这套机制主要依靠AMD平台的invlpgb指令。

随后的代码也是提前返回,不做实际的tlb reload:

/*
 * If the CPU is not in lazy TLB mode, we are just switching
 * from one thread in a process to another thread in the same
 * process. No TLB flush required.
 */
if (!was_lazy)
	return;

也就是说,CPU之前并没有处于lazy TLB模式(不是先切去内核线程又切回来),并且上面的代码是在prev == next满足的情况下,也就是在同一个进程的多个线程间切换;多个线程共享同一个mm_struct,自然也不需要reload tlb(mm_struct没有改变,页表也就没有改变)。

再往下分析:

/*
 * Read the tlb_gen to check whether a flush is needed.
 * If the TLB is up to date, just use it.
 * The barrier synchronizes with the tlb_gen increment in
 * the TLB shootdown code.
 */
smp_mb();
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
		next_tlb_gen)
	return;

tlb_gen指示了进程的页表是否发生过改变。tlb的本质是缓存虚拟地址到物理地址的映射,如果内存里的页表改变了,tlb就不再是up to date的,需要重新同步。注意这里prev_asid来自loaded_mm_asid:

u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);

并且之前某次切换流程里,那时prev作为next,假如是真的进行了tlb flush,也就是ns.need_flush是true,在那时cpu_tlbstate里的tlb_gen会更新成next里的:

this_cpu_write(cpu_tlbstate.ctxs[ns.asid].tlb_gen, next_tlb_gen);

自那次之后到本次进程切换,如果next_tlb_gen发生了变化,上面的相等判断就不会成立,说明这期间进程改过自己的页表;此时再切换到prev(也即next,因为整个大条件是prev == next满足),就需要刷新tlb,让tlb部件和内存里的页表重新同步。那么什么情况下mm_struct::context::tlb_gen会增加呢?其实就是前面列举的那些IPI路径,flush_tlb_mm_range都会调用inc_mm_tlb_gen:

static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
{
	/*
	 * Bump the generation count.  This also serves as a full barrier
	 * that synchronizes with switch_mm(): callers are required to order
	 * their read of mm_cpumask after their writes to the paging
	 * structures.
	 */
	return atomic64_inc_return(&mm->context.tlb_gen);
}

那么再往下的代码就比较容易理解了:

/*
 * TLB contents went out of date while we were in lazy
 * mode. Fall through to the TLB switching code below.
 */
ns.asid = prev_asid;
ns.need_flush = true;

上面的代码仍然在prev == next满足的情况下,但这次没有提前返回,需要进行实质的reload tlb动作。能走到这里,说明前面几个可以跳过reload tlb的条件都不成立:虽然是在同一个地址空间的多个线程间切换,但期间页表发生了改变(tlb_gen增加了),所以要把need_flush设为true,进而fall through到后面真正的tlb reload操作。总结起来就是:某个进程的页表改变时,可能并不会在每个CPU上立即刷新tlb,而只是递增tlb_gen计数,等下次某个CPU真正要切换到这个进程时,检测到计数不一致再刷新。这是一种惰性策略:如果每次更新页表都立即刷新所有相关CPU的tlb,而这些CPU下次又不一定切换到做了页表更新的进程,之前的刷新就白白浪费了。这种比较"代数计数"的方法正是实现惰性刷新的手段。

在分析reload_tlb分支前,先分析前面if的else分支,也就是说prev == next没有满足的代码,这时需要切换到另一个不同的进程的mm_struct。

cond_mitigation主要是安全方面的措施,和进程切换的流程关系不大,本文跳过。

else分支的主要逻辑是调用choose_new_asid,为next在当前cpu上选择一个可用的tlb_state::ctxs[]槽。稍前一点的cpumask_set_cpu前面已经提到,是把当前cpu号标记到mm_struct::cpu_bitmap里,以便将来其它cpu更新了这个地址空间的页表时,可以通过IPI通知本CPU刷新tlb。注意init_mm表示内核自身的页表映射,它被所有CPU共享,而且不会动态更改(不会换页表,也不会取消映射),所以不需要参与这种TLB shootdown。

这里先把要切换到的next进程的tlb_gen读到next_tlb_gen,再以它为参数调用choose_new_asid,下面分析这个函数:

static struct new_asid choose_new_asid(struct mm_struct *next, u64 next_tlb_gen)
{
	struct new_asid ns;
	u16 asid;

	if (!static_cpu_has(X86_FEATURE_PCID)) {
		ns.asid = 0;
		ns.need_flush = 1;
		return ns;
	}

	/*
	 * TLB consistency for global ASIDs is maintained with hardware assisted
	 * remote TLB flushing. Global ASIDs are always up to date.
	 */
	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
		u16 global_asid = mm_global_asid(next);

		if (global_asid) {
			ns.asid = global_asid;
			ns.need_flush = 0;
			return ns;
		}
	}

	if (this_cpu_read(cpu_tlbstate.invalidate_other))
		clear_asid_other();

	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
		if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
		    next->context.ctx_id)
			continue;

		ns.asid = asid;
		ns.need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) < next_tlb_gen);
		return ns;
	}

	/*
	 * We don't currently own an ASID slot on this CPU.
	 * Allocate a slot.
	 */
	ns.asid = this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;
	if (ns.asid >= TLB_NR_DYN_ASIDS) {
		ns.asid = 0;
		this_cpu_write(cpu_tlbstate.next_asid, 1);
	}
	ns.need_flush = true;

	return ns;
}

该函数的逻辑分为几个部分。X86_FEATURE_PCID判定CPU是否支持硬件PCID功能。前面的分析已经提到了pcid,那么到底什么是pcid呢?这在Intel SDM3 Chapter 4.10.1有介绍:它其实是CR3寄存器低12位(bit 0:11)的值。当CPU用虚拟地址到物理地址的映射填充tlb条目时,会把这个条目关联上当前的pcid值(CR3[11:0]);而当CPU查询tlb时,也只会命中当前pcid对应的那些条目。这样在切换进程(mm_struct)时,属于其它pcid的tlb条目可以继续留在tlb里,不必刷新整个tlb,也不会影响其它进程已缓存的映射;需要单独失效某个pcid的条目时,可以用invpcid指令以type 1(单个context)方式执行(Intel SDM3 4.10.4.1)。这样可以提高性能。
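按这个描述,CR3的值大致就是"顶级页表物理地址 | PCID",最高位(bit 63)则是NOFLUSH标志。下面是一个示意性的拼装函数,帮助理解这个布局;内核真实实现是build_cr3/build_cr3_noflush,并且会用kern_pcid()对asid做换算,这里做了简化:

#include <stdint.h>
#include <stdio.h>

#define CR3_NOFLUSH	(1ULL << 63)	/* 置位:写CR3时保留该PCID已有的TLB条目 */

/* 示意:CR3 = 4K对齐的顶级页表物理地址 | PCID(低12位) */
static uint64_t make_cr3(uint64_t pgd_pa, uint16_t pcid, int noflush)
{
	uint64_t cr3 = (pgd_pa & ~0xfffULL) | (pcid & 0xfff);

	if (noflush)
		cr3 |= CR3_NOFLUSH;
	return cr3;
}

int main(void)
{
	printf("cr3 = %#llx\n", (unsigned long long)make_cr3(0x1234000, 2, 1));
	return 0;
}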

现代CPU一般都支持PCID功能,而随后的invlpgb功能主要在AMD平台上使用,本文不做介绍。

下面invalidate_other的逻辑主要是通过clear_asid_other函数将ctx_id设置为0,ctxs::ctx_id一般是用于指明mm_struct,但为0的ctx_id有特别的用途:

/*
 * We get here when we do something requiring a TLB invalidation
 * but could not go invalidate all of the contexts.  We do the
 * necessary invalidation by clearing out the 'ctx_id' which
 * forces a TLB flush when the context is loaded.
 */
static void clear_asid_other(void)
{
	u16 asid;

	/*
	 * This is only expected to be set if we have disabled
	 * kernel _PAGE_GLOBAL pages.
	 */
	if (!static_cpu_has(X86_FEATURE_PTI)) {
		WARN_ON_ONCE(1);
		return;
	}

	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
		/* Do not need to flush the current asid */
		if (asid == this_cpu_read(cpu_tlbstate.loaded_mm_asid))
			continue;
		/*
		 * Make sure the next time we go to switch to
		 * this asid, we do a flush:
		 */
		this_cpu_write(cpu_tlbstate.ctxs[asid].ctx_id, 0);
	}
	this_cpu_write(cpu_tlbstate.invalidate_other, false);
}

下面的for循环针对本cpu的6个tlb_context槽,看其ctx_id是否和即将要运行的next->context.ctx_id相等。如果有相等的,证明这个mm之前在本cpu上运行过且已经分配过一个槽,本次就复用上次的asid;当然这期间tlb_gen可能已经增加(即next对应的页表改过),这时就需要刷新tlb,need_flush为true。如果本cpu上所有6个tlb_context::ctx_id都和next->context.ctx_id不一样,代表之前没有运行过(也包括前面提到的clear_asid_other把per-cpu里的ctx_id清零的情况),那就走到choose_new_asid最后的逻辑:need_flush设为true,同时为next这个mm_struct在本cpu上分配一个新的asid槽,编号达到TLB_NR_DYN_ASIDS时回绕到0。注意next_asid记录的是下次要用的asid号,所以这里先加1写回next_asid,再减1才是本次使用的asid号。
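为了把这段槽位选择和惰性刷新的逻辑串起来,下面给出一个用户态的简化模拟(仅为示意,省略了PCID能力检测、global ASID等分支,函数名pick_slot也是笔者自拟的),可以直观看到"复用且页表没变则不刷"、"页表代数增加则刷"、"没有槽则分配新槽并刷"这三种结果:

#include <stdio.h>
#include <stdbool.h>

#define NR_SLOTS	6

struct slot { unsigned long long ctx_id, tlb_gen; };

static struct slot ctxs[NR_SLOTS];
static unsigned short next_asid;

/* 返回选中的槽号,*need_flush指示该槽内容是否过期、需要刷新TLB */
static unsigned short pick_slot(unsigned long long ctx_id,
				unsigned long long mm_tlb_gen, bool *need_flush)
{
	for (unsigned short asid = 0; asid < NR_SLOTS; asid++) {
		if (ctxs[asid].ctx_id != ctx_id)
			continue;
		/* 命中已有槽:只有页表代数落后时才需要flush */
		*need_flush = ctxs[asid].tlb_gen < mm_tlb_gen;
		return asid;
	}

	/* 未命中:round-robin分配一个新槽,必须flush */
	unsigned short asid = next_asid++;

	if (next_asid >= NR_SLOTS)
		next_asid = 0;
	*need_flush = true;
	return asid;
}

int main(void)
{
	bool flush;
	unsigned short a;

	a = pick_slot(100, 1, &flush);		/* 第一次见到该mm:分配新槽,flush */
	ctxs[a].ctx_id = 100;
	ctxs[a].tlb_gen = 1;
	printf("asid=%u flush=%d\n", a, flush);

	a = pick_slot(100, 1, &flush);		/* 页表没变:复用槽,不flush */
	printf("asid=%u flush=%d\n", a, flush);

	a = pick_slot(100, 2, &flush);		/* 页表代数增加:复用槽,但要flush */
	printf("asid=%u flush=%d\n", a, flush);
	return 0;
}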

现在终于到reload_tlb的逻辑了,其实这部分前面已经有所涉及。如果need_flush为true,代表需要真正刷新tlb:此时per-cpu上对应asid槽的ctx_id要更新为next这个mm_struct的ctx_id,tlb_gen也要更新为next的next_tlb_gen。随后per-cpu的cpu_tlbstate.loaded_mm要更新为next,cpu_tlbstate::loaded_mm_asid更新为前面新选出来的ns.asid。真正写CR3的动作在load_new_mm_cr3里,该函数调用build_cr3或build_cr3_noflush来构建要写入CR3的值,noflush版本会把CR3的最高位置1(通过CR3_NOFLUSH宏),表示写CR3时不刷新tlb。至于CR3本身的值,现在回头看就很简单了:低12位是由asid换算出来的PCID,高位部分则是mm_struct对应的顶级页表的物理地址。

继续往下看lru_gen_use_mm。mm_struct::lru_gen::bitmap用于内存回收的代码,这里切换到某个进程运行时,就会通过lru_gen_use_mm将其置为全1(-1):

static inline void lru_gen_use_mm(struct mm_struct *mm)
{
	/*
	 * When the bitmap is set, page reclaim knows this mm_struct has been
	 * used since the last time it cleared the bitmap. So it might be worth
	 * walking the page tables of this mm_struct to clear the accessed bit.
	 */
	WRITE_ONCE(mm->lru_gen.bitmap, -1);
}

这里既然马上要切换到next运行了,自然需要将其bitmap设置为-1,代表这个mm_struct访问过,那么在vmscan(内存回收)的代码里,就会判断这个bitmap是否设置了,如果设置了,就会clear掉bitmap里的标志为0,并且返回这个mm_struct,代表需要内存回收代码进一步进行处理,这由linux/mm/vmscan.c里的get_next_mm函数体现。关于内存回收后面有文章分析。

继续往下如果之前是在内核线程里运行,那么需要将prev->active_mm保存到rq->prev_mm。

switch_mm_cid在CONFIG_SCHED_MM_CID配置下起效,它为共享同一个mm_struct的各个线程维护紧凑的并发ID(concurrency id),主要供rseq等机制区分当前是哪个线程在运行,本文不介绍这个主题。

分析完switch_mm_irqs_off后,再来看switch_to。switch_to完成执行流的切换,当它返回时,切换就正式完成了。它的实现是架构相关的,因为涉及上下文的保存和恢复,这里主要针对x86-64架构来分析,其实现如下:

#define switch_to(prev, next, last)					\
do {									\
	((last) = __switch_to_asm((prev), (next)));			\
} while (0)

__switch_to_asm的实现在x86架构上分为32位和64位两个版本,分别在arch/x86/entry/entry_32.S和arch/x86/entry/entry_64.S两个文件里,通过命令:

find . -name "*Makefile*" -exec grep -Rn --color=auto "entry_" {} +

可以知道编译哪个版本由BITS宏决定,在arch/x86/entry/Makefile里:

obj-y				:= entry.o entry_$(BITS).o syscall_$(BITS).o

那么BITS在哪里定义,依据什么条件定义成多少呢?使用命令:

find . -name "*Makefile*" -exec grep -Rn --color=auto "BITS" {} +

在arch/x86/Makefile里有:

ifeq ($(CONFIG_X86_32),y)
        BITS := 32
        UTS_MACHINE := i386
        CHECKFLAGS += -D__i386__

        KBUILD_AFLAGS += -m32
        KBUILD_CFLAGS += -m32

        KBUILD_CFLAGS += -msoft-float -mregparm=3 -freg-struct-return

        # Never want PIC in a 32-bit kernel, prevent breakage with GCC built
        # with nonstandard options
        KBUILD_CFLAGS += -fno-pic

        # Align the stack to the register width instead of using the default
        # alignment of 16 bytes. This reduces stack usage and the number of
        # alignment instructions.
        KBUILD_CFLAGS += $(cc_stack_align4)

        # CPU-specific tuning. Anything which can be shared with UML should go here.
        include $(srctree)/arch/x86/Makefile_32.cpu
        KBUILD_CFLAGS += $(cflags-y)

    ifneq ($(call clang-min-version, 160000),y)
        # https://github.com/llvm/llvm-project/issues/53645
        KBUILD_CFLAGS += -ffreestanding
    endif

        percpu_seg := fs
else
        BITS := 64

也就是说,只要没有定义CONFIG_X86_32,BITS就是64,因此这里分析entry_64.S里__switch_to_asm的实现:

/*
 * %rdi: prev task
 * %rsi: next task
 */
.pushsection .text, "ax"
SYM_FUNC_START(__switch_to_asm)
	ANNOTATE_NOENDBR
	/*
	 * Save callee-saved registers
	 * This must match the order in inactive_task_frame
	 */
	pushq	%rbp
	pushq	%rbx
	pushq	%r12
	pushq	%r13
	pushq	%r14
	pushq	%r15

	/* switch stack */
	movq	%rsp, TASK_threadsp(%rdi)
	movq	TASK_threadsp(%rsi), %rsp

#ifdef CONFIG_STACKPROTECTOR
	movq	TASK_stack_canary(%rsi), %rbx
	movq	%rbx, PER_CPU_VAR(__stack_chk_guard)
#endif

	/*
	 * When switching from a shallower to a deeper call stack
	 * the RSB may either underflow or use entries populated
	 * with userspace addresses. On CPUs where those concerns
	 * exist, overwrite the RSB with entries which capture
	 * speculative execution to prevent attack.
	 */
	FILL_RETURN_BUFFER %r12, RSB_CLEAR_LOOPS, X86_FEATURE_RSB_CTXSW

	/* restore callee-saved registers */
	popq	%r15
	popq	%r14
	popq	%r13
	popq	%r12
	popq	%rbx
	popq	%rbp

	jmp	__switch_to
SYM_FUNC_END(__switch_to_asm)
.popsection

调用__switch_to_asm函数时,rdi寄存器是prev任务,而rsi是next任务,先解释下:

.pushsection .text, "ax"

的作用。pushsection是GNU汇编器(GAS)的伪指令,它把后续指令和数据放到指定的section(段)里,同时保存当前的section。具体地说,.text表示代码段(通常含可执行指令),"ax"是section的属性:a表示allocatable(链接时会分配虚拟地址),x表示executable(可执行)。它基本等价于:

.section .text, "ax"

但不同的是,pushsection会把当前section压栈保存,后面popsection时再恢复。.section只是切换过去,不自动恢复。所以.pushsection/.popsection成对出现,用来临时切换到某个section添加指令或数据,再恢复原先的section。

ANNOTATE_NOENDBR和安全功能(IBT)相关:endbr指令用于间接调用(indirect call,比如函数指针调用)的目标,被间接调用的函数开头第一条指令应该是endbr;而__switch_to_asm只会被直接调用,没有必要以endbr开头,这个注解就是标注这里有意不放endbr。

接下来的几条pushq指令,是把callee-saved寄存器保存到当前进程的内核栈上;随后的两条movq指令很关键,它们真正完成了栈的切换。switch_to(进而context_switch)被内联进了__schedule,所以对x86-64反汇编__schedule函数,可以看到对__switch_to_asm的调用:

ffffffff81cc7277:       e8 64 0e 34 ff          call   ffffffff810080e0 <__switch_to_asm>
ffffffff81cc727c:       48 89 c7                mov    %rax,%rdi
ffffffff81cc727f:       e8 2c 23 44 ff          call   ffffffff811095b0 <finish_task_switch.isra.0>

那么call指令一旦执行,现在rsp栈顶就存着call下一条指令的地址:ffffffff81cc727c。而在__switch_to_asm的反汇编里有:

ffffffff810080ee:       48 89 a7 18 15 00 00    mov    %rsp,0x1518(%rdi)
ffffffff810080f5:       48 8b a6 18 15 00 00    mov    0x1518(%rsi),%rsp

在这里可以看到,当前的rsp先被保存到prev的task_struct里偏移TASK_threadsp的位置,随后rsp又被换成next之前保存在同样位置的值。最后在__switch_to的末尾,经由__x86_return_thunk执行ret指令,从新栈的栈顶弹出之前保存的返回地址(即ffffffff81cc727c处)继续运行。这样只要被切换过来的进程之前也是通过__schedule切出去的,它就会从finish_task_switch前后的代码继续执行:finish_task_switch是context_switch的最后一个函数,而context_switch又是__schedule的最后一个函数,于是顺着调用链就离开了调度器代码(或者最终返回用户态),继续运行进程自己的逻辑。调用__schedule的路径有很多,比如schedule->__schedule_loop->__schedule。这就是x86任务切换的底层原理。
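如果想在用户态直观感受这种"在A的执行流里调用、却在B的执行流里返回"的效果,可以用ucontext做一个类比实验。这只是原理上的类比:glibc的swapcontext同样是保存/恢复栈指针与callee-saved寄存器,但与内核的switch_to没有实现上的关系:

#include <stdio.h>
#include <ucontext.h>

static ucontext_t ctx_main, ctx_worker;
static char worker_stack[64 * 1024];

static void worker(void)
{
	printf("worker: 第一次获得执行流\n");
	/* 类比switch_to:切回main,下次再切过来会从这行之后继续 */
	swapcontext(&ctx_worker, &ctx_main);
	printf("worker: 再次被切回,从上次swapcontext之后继续\n");
}

int main(void)
{
	getcontext(&ctx_worker);
	ctx_worker.uc_stack.ss_sp = worker_stack;
	ctx_worker.uc_stack.ss_size = sizeof(worker_stack);
	ctx_worker.uc_link = &ctx_main;		/* worker函数返回后回到main */
	makecontext(&ctx_worker, worker, 0);

	printf("main: 切换到worker\n");
	swapcontext(&ctx_main, &ctx_worker);	/* 去worker的执行流 */
	printf("main: 从worker切回来了\n");
	swapcontext(&ctx_main, &ctx_worker);	/* 再切过去,worker从它的swapcontext后继续 */
	printf("main: 全部结束\n");
	return 0;
}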

CONFIG_STACKPROTECTOR通过栈canary来检测栈缓冲区溢出、防御篡改栈上返回地址这类攻击;上面那段代码就是在切换时把next任务的canary写入per-cpu的__stack_chk_guard。

在rsp寄存器switch成新的进程的栈后,最后的几个popq指令实际是恢复新进程之前通过pushq存储的callee-saved的寄存器。

这里context switch只保存了callee-saved的寄存器;至于rax等caller-saved寄存器,由编译器按ABI要求在调用方生成的代码里决定是否在调用前保存。

最后跳转到__switch_to这个C函数。也就是说截止__switch_to_asm结束,CPU执行流还没有完全交到next进程手里,__switch_to里还有一些处理切换的逻辑,直到它最后通过return语句返回时,才从next栈顶弹出保存的返回地址、真正回到next的执行流,所以后面还要继续分析__switch_to这个C函数:

/*
 *	switch_to(x,y) should switch tasks from x to y.
 *
 * This could still be optimized:
 * - fold all the options into a flag word and test it with a single test.
 * - could test fs/gs bitsliced
 *
 * Kprobes not supported here. Set the probe on schedule instead.
 * Function graph tracer not supported too.
 */
__no_kmsan_checks
__visible __notrace_funcgraph struct task_struct *
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
	struct thread_struct *prev = &prev_p->thread;
	struct thread_struct *next = &next_p->thread;
	int cpu = smp_processor_id();

	WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
		     this_cpu_read(hardirq_stack_inuse));

	switch_fpu(prev_p, cpu);

	/* We must save %fs and %gs before load_TLS() because
	 * %fs and %gs may be cleared by load_TLS().
	 *
	 * (e.g. xen_load_tls())
	 */
	save_fsgs(prev_p);

	/*
	 * Load TLS before restoring any segments so that segment loads
	 * reference the correct GDT entries.
	 */
	load_TLS(next, cpu);

	/*
	 * Leave lazy mode, flushing any hypercalls made here.  This
	 * must be done after loading TLS entries in the GDT but before
	 * loading segments that might reference them.
	 */
	arch_end_context_switch(next_p);

	/* Switch DS and ES.
	 *
	 * Reading them only returns the selectors, but writing them (if
	 * nonzero) loads the full descriptor from the GDT or LDT.  The
	 * LDT for next is loaded in switch_mm, and the GDT is loaded
	 * above.
	 *
	 * We therefore need to write new values to the segment
	 * registers on every context switch unless both the new and old
	 * values are zero.
	 *
	 * Note that we don't need to do anything for CS and SS, as
	 * those are saved and restored as part of pt_regs.
	 */
	savesegment(es, prev->es);
	if (unlikely(next->es | prev->es))
		loadsegment(es, next->es);

	savesegment(ds, prev->ds);
	if (unlikely(next->ds | prev->ds))
		loadsegment(ds, next->ds);

	x86_fsgsbase_load(prev, next);

	x86_pkru_load(prev, next);

	/*
	 * Switch the PDA and FPU contexts.
	 */
	raw_cpu_write(current_task, next_p);
	raw_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));

	/* Reload sp0. */
	update_task_stack(next_p);

	switch_to_extra(prev_p, next_p);

	if (static_cpu_has_bug(X86_BUG_SYSRET_SS_ATTRS)) {
		/*
		 * AMD CPUs have a misfeature: SYSRET sets the SS selector but
		 * does not update the cached descriptor.  As a result, if we
		 * do SYSRET while SS is NULL, we'll end up in user mode with
		 * SS apparently equal to __USER_DS but actually unusable.
		 *
		 * The straightforward workaround would be to fix it up just
		 * before SYSRET, but that would slow down the system call
		 * fast paths.  Instead, we ensure that SS is never NULL in
		 * system call context.  We do this by replacing NULL SS
		 * selectors at every context switch.  SYSCALL sets up a valid
		 * SS, so the only way to get NULL is to re-enter the kernel
		 * from CPL 3 through an interrupt.  Since that can't happen
		 * in the same task as a running syscall, we are guaranteed to
		 * context switch between every interrupt vector entry and a
		 * subsequent SYSRET.
		 *
		 * We read SS first because SS reads are much faster than
		 * writes.  Out of caution, we force SS to __KERNEL_DS even if
		 * it previously had a different non-NULL value.
		 */
		unsigned short ss_sel;
		savesegment(ss, ss_sel);
		if (ss_sel != __KERNEL_DS)
			loadsegment(ss, __KERNEL_DS);
	}

	/* Load the Intel cache allocation PQR MSR. */
	resctrl_arch_sched_in(next_p);

	return prev_p;
}

switch_fpu函数用于把当前cpu的fpu寄存器状态保存到内存里的fpu结构体。在详细介绍这个函数前,先引入fpu这个结构体:它是架构相关的结构体,主要在x86架构下使用,每个fpu紧跟在task_struct结构体之后。内核开启CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT选项时,表示task_struct占用的内存大小可以动态变化;但要注意,动态变化的部分主要是架构自己需要的内存(比如保存fpu信息),task_struct结构体本身的定义一般不变:

#ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
extern int arch_task_struct_size __read_mostly;
#else
# define arch_task_struct_size (sizeof(struct task_struct))
#endif

x86架构一般都定义了CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT选项,所以在linux/arch/x86/kernel/fpu/init.c里的fpu__init_task_struct_size函数就会给arch_task_struct_size赋值,这个值需要加上fpu等结构体的大小:

/*
 * We append the 'struct fpu' to the task_struct:
 */
static void __init fpu__init_task_struct_size(void)
{
	int task_size = sizeof(struct task_struct);

	task_size += sizeof(struct fpu);

	/*
	 * Subtract off the static size of the register state.
	 * It potentially has a bunch of padding.
	 */
	task_size -= sizeof(union fpregs_state);

	/*
	 * Add back the dynamically-calculated register state
	 * size.
	 */
	task_size += fpu_kernel_cfg.default_size;

	/*
	 * We dynamically size 'struct fpu', so we require that
	 * 'state' be at the end of 'it:
	 */
	CHECK_MEMBER_AT_END_OF(struct fpu, __fpstate);

	arch_task_struct_size = task_size;
}

这样linux/kernel/fork.c里的fork_init函数就可以用kmem_cache_create_usercopy创建一个arch_task_struct_size大小的缓存,用来分配task_struct,只不过这个结构体后面append了架构相关的结构体比如fpu:

void __init fork_init(void)
{
      ...
	task_struct_cachep = kmem_cache_create_usercopy("task_struct",
			arch_task_struct_size, align,
			SLAB_PANIC|SLAB_ACCOUNT,
			useroffset, usersize, NULL);
      ...
}

回到switch_fpu函数:

/*
 * FPU state switching for scheduling.
 *
 * switch_fpu() saves the old state and sets TIF_NEED_FPU_LOAD if
 * TIF_NEED_FPU_LOAD is not set.  This is done within the context
 * of the old process.
 *
 * Once TIF_NEED_FPU_LOAD is set, it is required to load the
 * registers before returning to userland or using the content
 * otherwise.
 *
 * The FPU context is only stored/restored for a user task and
 * PF_KTHREAD is used to distinguish between kernel and user threads.
 */
static inline void switch_fpu(struct task_struct *old, int cpu)
{
	if (!test_tsk_thread_flag(old, TIF_NEED_FPU_LOAD) &&
	    cpu_feature_enabled(X86_FEATURE_FPU) &&
	    !(old->flags & (PF_KTHREAD | PF_USER_WORKER))) {
		struct fpu *old_fpu = x86_task_fpu(old);

		set_tsk_thread_flag(old, TIF_NEED_FPU_LOAD);
		save_fpregs_to_fpstate(old_fpu);
		/*
		 * The save operation preserved register state, so the
		 * fpu_fpregs_owner_ctx is still @old_fpu. Store the
		 * current CPU number in @old_fpu, so the next return
		 * to user space can avoid the FPU register restore
		 * when is returns on the same CPU and still owns the
		 * context. See fpregs_restore_userregs().
		 */
		old_fpu->last_cpu = cpu;

		trace_x86_fpu_regs_deactivated(old_fpu);
	}
}

其定义在linux/arch/x86/include/asm/fpu/sched.h里。x86_task_fpu的实现如下:

#ifdef CONFIG_X86_DEBUG_FPU
extern struct fpu *x86_task_fpu(struct task_struct *task);
#else
# define x86_task_fpu(task)	((struct fpu *)((void *)(task) + sizeof(*(task))))
#endif

也就是通过task_struct的起始地址加上固定偏移,得到紧跟其后的fpu结构体的地址。save_fpregs_to_fpstate则真正执行xsave/fxsave等指令,把当前cpu的浮点相关状态保存到fpu::fpstate::regs::xsave或fpu::fpstate::regs::fxsave内存区域。最后把当前cpu号记录到fpu的last_cpu里。

往下的save_fsgs/load_TLS函数主要涉及x86架构的分段机制。在64位long mode下,分段已经很少用于地址变换了,不过cs/ss等段寄存器仍然承担权限级别检查(选择子低位的RPL与描述符里的DPL),fs/gs则主要用于TLS等用途。

先看save_fsgs:

static __always_inline void save_fsgs(struct task_struct *task)
{
	savesegment(fs, task->thread.fsindex);
	savesegment(gs, task->thread.gsindex);
	if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
		/*
		 * If FSGSBASE is enabled, we can't make any useful guesses
		 * about the base, and user code expects us to save the current
		 * value.  Fortunately, reading the base directly is efficient.
		 */
		task->thread.fsbase = rdfsbase();
		task->thread.gsbase = __rdgsbase_inactive();
	} else {
		save_base_legacy(task, task->thread.fsindex, FS);
		save_base_legacy(task, task->thread.gsindex, GS);
	}
}

首先是保存fs/gs段寄存器的值到task_struct::thread里相应的成员里,因为后面可能会加载新的值到这些寄存器,传统上这些寄存器里存的是一个段选择子,用来在gdt/ldt里选择一个段描述符,这个段描述符里记录了某个段的起始地址、限长以及权限相关的信息。但是针对支持fsbase/gsbase的更现代的X86-64架构,基址信息记录在msr寄存器里了,所以fs/gs段选择子寄存器里的索引部分就没什么用了,它原来主要是在gdt/ldt中索引段描述符的。

savesegment实现如下:

/*
 * Save a segment register away:
 */
#define savesegment(seg, value)				\
	asm("mov %%" #seg ",%0":"=r" (value) : : "memory")

rdfsbase函数封装的就是rdfsbase指令,直接把FS base(等价于MSR_FS_BASE的内容)读出来,随后保存到task_struct::thread::fsbase里。
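在x86-64用户态可以通过arch_prctl(ARCH_GET_FS)观察到这个FS base,它通常指向glibc为当前线程布置的TLS区域,线程局部变量的地址也落在它附近。下面是一个小验证程序(仅在x86-64上有效,输出的具体数值与libc实现有关):

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/prctl.h>

static __thread int tls_var;		/* 线程局部变量,经FS段基址寻址 */

int main(void)
{
	unsigned long fsbase = 0;

	/* ARCH_GET_FS:读出当前线程的FS base */
	if (syscall(SYS_arch_prctl, ARCH_GET_FS, &fsbase) != 0) {
		perror("arch_prctl");
		return 1;
	}
	printf("FS base  = %#lx\n", fsbase);
	printf("&tls_var = %p\n", (void *)&tls_var);
	return 0;
}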

往下看load_TLS:

/*
 * Load TLS before restoring any segments so that segment loads
 * reference the correct GDT entries.
 */
load_TLS(next, cpu);
static inline void native_load_tls(struct thread_struct *t, unsigned int cpu)
{
	struct desc_struct *gdt = get_cpu_gdt_rw(cpu);
	unsigned int i;

	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++)
		gdt[GDT_ENTRY_TLS_MIN + i] = t->tls_array[i];
}

可以看到,这里是把每个线程自己的tls数据,也就是tls_array数组(里面是8字节的段描述符),加载到per-cpu的desc_struct数组里(这同样是一个段描述符数组)。从目前的代码看,gdt还只是一片内存,那么为什么把段描述符写到这个地址,cpu访问thread线程变量时就会使用这里描述的段信息呢?先看get_cpu_gdt_rw的实现:

/* Provide the original GDT */
static inline struct desc_struct *get_cpu_gdt_rw(unsigned int cpu)
{
	return per_cpu(gdt_page, cpu).gdt;
}

可以看到就是获取per-cpu变量gdt_page,该变量定义如下:

DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
#ifdef CONFIG_X86_64
	/*
	 * We need valid kernel segments for data and code in long mode too
	 * IRET will check the segment types  kkeil 2000/10/28
	 * Also sysret mandates a special GDT layout
	 *
	 * TLS descriptors are currently at a different place compared to i386.
	 * Hopefully nobody expects them at a fixed place (Wine?)
	 */
	[GDT_ENTRY_KERNEL32_CS]		= GDT_ENTRY_INIT(DESC_CODE32, 0, 0xfffff),
	[GDT_ENTRY_KERNEL_CS]		= GDT_ENTRY_INIT(DESC_CODE64, 0, 0xfffff),
	[GDT_ENTRY_KERNEL_DS]		= GDT_ENTRY_INIT(DESC_DATA64, 0, 0xfffff),
	[GDT_ENTRY_DEFAULT_USER32_CS]	= GDT_ENTRY_INIT(DESC_CODE32 | DESC_USER, 0, 0xfffff),
	[GDT_ENTRY_DEFAULT_USER_DS]	= GDT_ENTRY_INIT(DESC_DATA64 | DESC_USER, 0, 0xfffff),
	[GDT_ENTRY_DEFAULT_USER_CS]	= GDT_ENTRY_INIT(DESC_CODE64 | DESC_USER, 0, 0xfffff),
#else
	[GDT_ENTRY_KERNEL_CS]		= GDT_ENTRY_INIT(DESC_CODE32, 0, 0xfffff),
	[GDT_ENTRY_KERNEL_DS]		= GDT_ENTRY_INIT(DESC_DATA32, 0, 0xfffff),
	[GDT_ENTRY_DEFAULT_USER_CS]	= GDT_ENTRY_INIT(DESC_CODE32 | DESC_USER, 0, 0xfffff),
	[GDT_ENTRY_DEFAULT_USER_DS]	= GDT_ENTRY_INIT(DESC_DATA32 | DESC_USER, 0, 0xfffff),
	/*
	 * Segments used for calling PnP BIOS have byte granularity.
	 * They code segments and data segments have fixed 64k limits,
	 * the transfer segment sizes are set at run time.
	 */
	[GDT_ENTRY_PNPBIOS_CS32]	= GDT_ENTRY_INIT(DESC_CODE32_BIOS, 0, 0xffff),
	[GDT_ENTRY_PNPBIOS_CS16]	= GDT_ENTRY_INIT(DESC_CODE16, 0, 0xffff),
	[GDT_ENTRY_PNPBIOS_DS]		= GDT_ENTRY_INIT(DESC_DATA16, 0, 0xffff),
	[GDT_ENTRY_PNPBIOS_TS1]		= GDT_ENTRY_INIT(DESC_DATA16, 0, 0),
	[GDT_ENTRY_PNPBIOS_TS2]		= GDT_ENTRY_INIT(DESC_DATA16, 0, 0),
	/*
	 * The APM segments have byte granularity and their bases
	 * are set at run time.  All have 64k limits.
	 */
	[GDT_ENTRY_APMBIOS_BASE]	= GDT_ENTRY_INIT(DESC_CODE32_BIOS, 0, 0xffff),
	[GDT_ENTRY_APMBIOS_BASE+1]	= GDT_ENTRY_INIT(DESC_CODE16, 0, 0xffff),
	[GDT_ENTRY_APMBIOS_BASE+2]	= GDT_ENTRY_INIT(DESC_DATA32_BIOS, 0, 0xffff),

	[GDT_ENTRY_ESPFIX_SS]		= GDT_ENTRY_INIT(DESC_DATA32, 0, 0xfffff),
	[GDT_ENTRY_PERCPU]		= GDT_ENTRY_INIT(DESC_DATA32, 0, 0xfffff),
#endif
} };
EXPORT_PER_CPU_SYMBOL_GPL(gdt_page);

可以看到这里初始化了一些段描述符,而tls这种段描述符又在刚展示的native_load_tls里进行切换重新设置。
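为了理解tls_array和gdt里这些8字节描述符的布局,下面用一个小函数示意段描述符的编码方式(base、limit、访问字节与flags在64位中的摆放位置)。这只是笔者的演示代码,内核里对应的是GDT_ENTRY_INIT、fill_ldt等宏和函数:

#include <stdint.h>
#include <stdio.h>

/* 按x86段描述符布局把base/limit/access/flags编码成8字节(示意用) */
static uint64_t encode_desc(uint32_t base, uint32_t limit,
			    uint8_t access, uint8_t flags)
{
	uint64_t d = 0;

	d |= (uint64_t)(limit & 0xffff);		/* limit[15:0]  -> bit 0..15  */
	d |= (uint64_t)(base & 0xffffff) << 16;		/* base[23:0]   -> bit 16..39 */
	d |= (uint64_t)access << 40;			/* P/DPL/S/type -> bit 40..47 */
	d |= (uint64_t)((limit >> 16) & 0xf) << 48;	/* limit[19:16] -> bit 48..51 */
	d |= (uint64_t)(flags & 0xf) << 52;		/* G/D/L/AVL    -> bit 52..55 */
	d |= (uint64_t)((base >> 24) & 0xff) << 56;	/* base[31:24]  -> bit 56..63 */
	return d;
}

int main(void)
{
	/* 一个base=0、limit=0xfffff、P=1/DPL=0/可执行、G=1/L=1的64位代码段 */
	printf("%#llx\n", (unsigned long long)encode_desc(0, 0xfffff, 0x9a, 0xa));
	return 0;
}

输出与经典的"平坦"64位代码段描述符布局一致,可以和上面gdt_page里的各个表项对照着看。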

但是目前为止,这些段描述符都还是在内存里,而在arch/x86/kernel/head_64.S里有如下一段汇编:

leaq	gdt_page(%rdx), %rax
movq	%rax, 2(%rsp)
lgdt	(%rsp)

在这里可以看到,gdt_page的内存地址通过lgdt指令被装载到了gdtr寄存器里,而gdtr里存放的正是cpu按分段方式访存时需要查询的gdt段描述符表。lgdt执行完毕后,这张设置在内存里的gdt表才真正生效;后续内核代码可以按需修改表里的描述符,但gdtr一直指向这张表。

另外,tls段描述符的来源tls_array又是哪里来的呢?TLS是线程局部变量的存储空间,可以想见tls的段描述符是由每个进程自己设置的,比如系统调用set_thread_area:

do_set_thread_area->set_tls_desc->fill_ldt

这样在fill_ldt里填充desc_struct描述符:

static void set_tls_desc(struct task_struct *p, int idx,
			 const struct user_desc *info, int n)
{
	struct thread_struct *t = &p->thread;
	struct desc_struct *desc = &t->tls_array[idx - GDT_ENTRY_TLS_MIN];
	int cpu;

	/*
	 * We must not get preempted while modifying the TLS.
	 */
	cpu = get_cpu();

	while (n-- > 0) {
		if (LDT_empty(info) || LDT_zero(info))
			memset(desc, 0, sizeof(*desc));
		else
			fill_ldt(desc, info);
		++info;
		++desc;
	}

	if (t == &current->thread)
		load_TLS(t, cpu);

	put_cpu();
}

这里user_desc就是来自用户态传递的描述符结构体。

再往下的代码就是保存以及恢复es和ds段寄存器了,其代码实现前面都有涉及了,需要注意的是读取段寄存器(savesegment)仅返回段寄存器的内容(段选择子),但是写段寄存器的话,除了写入段选择子,还需要从gdt/ldt表里加载对应选择子的描述符到段寄存器的隐藏部分。

pkru是一个控制内存页访问权限的寄存器,允许应用程序通过pkey_mprotect系统调用快速更改对某些内存区域的读写权限,而不需要修改页表,在上下文切换时通过x86_pkru_load去做pkru值的保存和恢复。

再往下是更新两个x86架构下的per-cpu变量current_task和cpu_current_top_of_stack,这里主要想分析下task_top_of_stack的实现:

#define task_top_of_stack(task) ((unsigned long)(task_pt_regs(task) + 1))
#define task_pt_regs(task) \
({									\
	unsigned long __ptr = (unsigned long)task_stack_page(task);	\
	__ptr += THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;		\
	((struct pt_regs *)__ptr) - 1;					\
})

这里加减1从效果上看是多余的,但是主要是为了复用task_pt_regs这个接口的实现,THREAD_SIZE的实现各架构不同,对于X86-64来说,没有配置CONFIG_KASAN时就是4个page,页面大小为4KB时,就是16KB:

#ifdef CONFIG_KASAN
#define KASAN_STACK_ORDER 1
#else
#define KASAN_STACK_ORDER 0
#endif

#define THREAD_SIZE_ORDER	(2 + KASAN_STACK_ORDER)
#define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
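可以用一个小程序把这套宏的算术验证一下(假设页面大小4KB、未开KASAN;x86-64上TOP_OF_KERNEL_STACK_PADDING为0,这里忽略;栈底地址是随便假设的,仅为演示):

#include <stdio.h>

#define PAGE_SIZE		4096UL
#define KASAN_STACK_ORDER	0
#define THREAD_SIZE_ORDER	(2 + KASAN_STACK_ORDER)
#define THREAD_SIZE		(PAGE_SIZE << THREAD_SIZE_ORDER)

int main(void)
{
	unsigned long stack = 0xffffc90000000000UL;	/* 假设的task->stack地址 */

	printf("THREAD_SIZE = %lu KB\n", THREAD_SIZE / 1024);
	printf("top of stack = %#lx\n", stack + THREAD_SIZE);	/* 栈顶 = 栈底 + THREAD_SIZE */
	return 0;
}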

task_stack_page实现如下:

/*
 * When accessing the stack of a non-current task that might exit, use
 * try_get_task_stack() instead.  task_stack_page will return a pointer
 * that could get freed out from under you.
 */
static __always_inline void *task_stack_page(const struct task_struct *task)
{
	return task->stack;
}

task的stack指针在创建task_struct结构体时,通过dup_task_struct->alloc_thread_stack_node分配空间。

__switch_to这个C函数的最后一条语句是return prev_p。一旦这个return执行完毕,实际上就切换到另外一个线程的执行流上了:之前__switch_to_asm已经把rsp换成了那个线程当初被切走时的栈,栈顶保存着它的返回地址,return(ret指令)执行时就会弹出这个地址继续执行。另一方面,prev_p会作为返回值写给switch_to的第三个参数last,这样即使栈已经发生变化,也能在新栈上拿到之前的prev。

最后分析下finish_task_switch函数:

/**
 * finish_task_switch - clean up after a task-switch
 * @prev: the thread we just switched away from.
 *
 * finish_task_switch must be called after the context switch, paired
 * with a prepare_task_switch call before the context switch.
 * finish_task_switch will reconcile locking set up by prepare_task_switch,
 * and do any other architecture-specific cleanup actions.
 *
 * Note that we may have delayed dropping an mm in context_switch(). If
 * so, we finish that here outside of the runqueue lock. (Doing it
 * with the lock held can cause deadlocks; see schedule() for
 * details.)
 *
 * The context switch have flipped the stack from under us and restored the
 * local variables which were saved when this task called schedule() in the
 * past. 'prev == current' is still correct but we need to recalculate this_rq
 * because prev may have moved to another CPU.
 */
static struct rq *finish_task_switch(struct task_struct *prev)
	__releases(rq->lock)
{
	struct rq *rq = this_rq();
	struct mm_struct *mm = rq->prev_mm;
	unsigned int prev_state;

	/*
	 * The previous task will have left us with a preempt_count of 2
	 * because it left us after:
	 *
	 *	schedule()
	 *	  preempt_disable();			// 1
	 *	  __schedule()
	 *	    raw_spin_lock_irq(&rq->lock)	// 2
	 *
	 * Also, see FORK_PREEMPT_COUNT.
	 */
	if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
		      "corrupted preempt_count: %s/%d/0x%x\n",
		      current->comm, current->pid, preempt_count()))
		preempt_count_set(FORK_PREEMPT_COUNT);

	rq->prev_mm = NULL;

	/*
	 * A task struct has one reference for the use as "current".
	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
	 * schedule one last time. The schedule call will never return, and
	 * the scheduled task must drop that reference.
	 *
	 * We must observe prev->state before clearing prev->on_cpu (in
	 * finish_task), otherwise a concurrent wakeup can get prev
	 * running on another CPU and we could race with its RUNNING -> DEAD
	 * transition, resulting in a double drop.
	 */
	prev_state = READ_ONCE(prev->__state);
	vtime_task_switch(prev);
	perf_event_task_sched_in(prev, current);
	finish_task(prev);
	tick_nohz_task_switch();
	finish_lock_switch(rq);
	finish_arch_post_lock_switch();
	kcov_finish_switch(current);
	/*
	 * kmap_local_sched_out() is invoked with rq::lock held and
	 * interrupts disabled. There is no requirement for that, but the
	 * sched out code does not have an interrupt enabled section.
	 * Restoring the maps on sched in does not require interrupts being
	 * disabled either.
	 */
	kmap_local_sched_in();

	fire_sched_in_preempt_notifiers(current);
	/*
	 * When switching through a kernel thread, the loop in
	 * membarrier_{private,global}_expedited() may have observed that
	 * kernel thread and not issued an IPI. It is therefore possible to
	 * schedule between user->kernel->user threads without passing though
	 * switch_mm(). Membarrier requires a barrier after storing to
	 * rq->curr, before returning to userspace, so provide them here:
	 *
	 * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
	 *   provided by mmdrop_lazy_tlb(),
	 * - a sync_core for SYNC_CORE.
	 */
	if (mm) {
		membarrier_mm_sync_core_before_usermode(mm);
		mmdrop_lazy_tlb_sched(mm);
	}

	if (unlikely(prev_state == TASK_DEAD)) {
		if (prev->sched_class->task_dead)
			prev->sched_class->task_dead(prev);

		/* Task is done with its stack. */
		put_task_stack(prev);

		put_task_struct_rcu_user(prev);
	}

	return rq;
}

finish_task_switch的执行其实已经在新的next进程的逻辑流里了,因为switch_to->__switch_to_asm->__switch_to里最后的:

return prev_p;

语句是在已经切换为新的next进程栈的情况下执行ret指令的,这会导致两件事:一是从新栈的栈顶弹出next之前作为prev被切走时保存的返回地址,也就是回到调用__switch_to_asm之后、即将执行finish_task_switch的地方;二是switch_to还把上面返回的prev_p写给了last:

#define switch_to(prev, next, last)					\
do {									\
	((last) = __switch_to_asm((prev), (next)));			\
} while (0)

last参数就是prev。注意这里往last写值时,其实已经完成了栈的切换,写的last(prev)已经是新进程next栈上的那个prev局部变量了:next之前作为prev时也是跑过context_switch把自己切走的,那时就在自己的栈上留好了prev等局部变量。所以switch_to一旦返回,CPU的执行逻辑流就已经把原来的prev切走了,finish_task_switch是在新的next进程的栈和寄存器环境上执行的,相当于在next的逻辑流上帮忙做一些切换完成后针对prev的清理工作。接下来详细分析finish_task_switch时,要时刻记住CPU的执行环境/逻辑已经来到了新的next进程。

开始的WARN_ONCE实际是检测抢占计数到现在这个点的正确性,因为schedule->preempt_disable会增加一次抢占计数,而raw_spin_lock_irq抢自旋锁的逻辑里又会关一次抢占。

vtime_task_switch主要是在全动态节拍(full dynticks)的情况下统计CPU的时间消耗。动态节拍又叫nohz:传统上定时器中断会周期性到来,而CONFIG_NO_HZ_FULL是Linux内核中的一个选项,用于启用完全动态的tickless系统(Full Dynticks)。启用后,某些CPU在运行用户态任务时可以完全关闭周期性的调度tick,以减少中断、降低抖动,适用于实时和高性能计算场景。在传统内核中,内核会为每个CPU设置一个周期性的定时器tick(通常为100Hz、250Hz或1000Hz,对应每10ms、4ms或1ms触发一次),每次tick中断会执行任务调度、更新负载、统计CPU时间等,即使CPU上运行的是用户态任务,也会被tick打断。这样做的优点是简单、调度器行为可预测;缺点是中断频繁、功耗高、用户任务被频繁打断、实时jitter增大。开启CONFIG_NO_HZ_FULL后,运行用户态任务的CPU在绝大多数时间不再收到周期性调度tick中断,只在必要时(例如进入内核、处理中断)才重新启用tick,从而实现真正tickless的用户态执行环境。要使用这个功能,除了编译内核时启用CONFIG_NO_HZ_FULL,还需要通过boot参数指定具体启用的CPU,例如nohz_full=1-3表示CPU1到CPU3启用full dynticks模式。

目标CPU上最好只运行一个用户态任务,且该任务很少进入内核。通常会配合isolcpus=1-3和rcu_nocbs=1-3使用,以获得更好的隔离和实时性。

在启用了NO_HZ_FULL并正确设置参数的CPU上,调度不再依赖周期性tick。调度改为事件驱动方式触发,例如系统调用返回后检查need_resched标志、进程显式调用schedule()、或其他CPU发出的IPI中断。也就是说,tick不再驱动调度器,而是调度器根据事件和状态自行决定是否调度。

为了保证系统正常运行,还需要启用一些相关机制,例如RCU_NOCB_CPU用于支持非中断上下文下的RCU回调处理,VIRT_CPU_ACCOUNTING_GEN用于在无tick情况下精确地记录CPU使用时间,IRQ_WORK用于中断上下文中的异步任务处理。

而CONFIG_NO_HZ_IDLE适用于大多数系统,它只在CPU idle时关闭tick;CONFIG_NO_HZ_FULL则适用于对jitter、实时性或能效有更高要求的场景,在用户态运行期间也能关闭tick,调度机制也从周期性tick驱动转变为完全事件驱动。

所以vtime_task_switch正是针对full dynticks开启的情况,精确统计CPU时间是花在user space还是kernel space:因为这时已经没有周期性的tick来更新时间消耗了,而进程切换正好是一个可以更新统计的时间点。

vtime_task_switch实现如下:

static inline void vtime_task_switch(struct task_struct *prev)
{
	if (vtime_accounting_enabled_this_cpu())
		vtime_task_switch_generic(prev);
}

先要判断一些配置是否开启:

static inline bool vtime_accounting_enabled_this_cpu(void)
{
	return context_tracking_enabled_this_cpu();
}
static __always_inline bool context_tracking_enabled_this_cpu(void)
{
	return context_tracking_enabled() && __this_cpu_read(context_tracking.active);
}

而context_tracking_enabled的实现依赖CONFIG_CONTEXT_TRACKING_USER配置,CONTEXT_TRACKING_USER是一个内核配置项,用于支持“用户态上下文追踪”,其核心目的是为了让内核更好地知道CPU当前是否处于用户态、内核态或空闲态,这个信息对一些子系统非常关键,主要包括RCU需要知道CPU是否处于所谓的“扩展静默状态(extended quiescent state)”,这意味着该CPU处于用户态或空闲态,可以安全地执行某些延迟操作,比如释放内存等。启用CONTEXT_TRACKING_USER后,内核会在进入/退出用户态时显式通知RCU,从而提高精确性,尤其是在完全无时钟(NO_HZ_FULL)模式下。另外需要这个功能的就是Tickless cputime accounting(无时钟CPU时间统计)。传统的CPU时间统计依赖于定时器周期性地中断(时钟 tick)。而当使用NO_HZ_FULL模式关闭定时器tick后,系统需要通过别的方式知道任务在用户态/内核态运行了多长时间,这就需要context tracking的辅助,追踪上下文切换。通常在构建面向低延迟、节能或实时性强的内核(比如 NO_HZ_FULL)时会启用该选项。

在笔者的环境下开启了这个配置,所以context_tracking_enabled实现如下:

static __always_inline bool context_tracking_enabled_this_cpu(void)
{
	return context_tracking_enabled() && __this_cpu_read(context_tracking.active);
}
static __always_inline bool context_tracking_enabled(void)
{
	return static_branch_unlikely(&context_tracking_key);
}

context_tracking_key默认通过DEFINE_STATIC_KEY_FALSE_RO声明为FALSE,在开启了nohz的系统里,通过tick_nohz_init->ct_cpu_track_user开启这个key:

void __init tick_nohz_init(void)
{
    ...
    for_each_cpu(cpu, tick_nohz_full_mask)
		ct_cpu_track_user(cpu);
    ...
}

从这段代码也可以看到nohz模式可以针对每个cpu来配置(tick_nohz_full_mask),ct_cpu_track_user实现如下:

void __init ct_cpu_track_user(int cpu)
{
	static __initdata bool initialized = false;

	if (!per_cpu(context_tracking.active, cpu)) {
		per_cpu(context_tracking.active, cpu) = true;
		static_branch_inc(&context_tracking_key);
	}

这里会把context_tracking.active设置为true,这样context_tracking_enabled_this_cpu就能返回true,进而可以调用到vtime_task_switch_generic函数了:

void vtime_task_switch_generic(struct task_struct *prev)
{
	struct vtime *vtime = &prev->vtime;

	write_seqcount_begin(&vtime->seqcount);
	if (vtime->state == VTIME_IDLE)
		vtime_account_idle(prev);
	else
		__vtime_account_kernel(prev, vtime);
	vtime->state = VTIME_INACTIVE;
	vtime->cpu = -1;
	write_seqcount_end(&vtime->seqcount);

	vtime = &current->vtime;

	write_seqcount_begin(&vtime->seqcount);
	if (is_idle_task(current))
		vtime->state = VTIME_IDLE;
	else if (current->flags & PF_VCPU)
		vtime->state = VTIME_GUEST;
	else
		vtime->state = VTIME_SYS;
	vtime->starttime = sched_clock();
	vtime->cpu = smp_processor_id();
	write_seqcount_end(&vtime->seqcount);
}

这里先介绍下write_seqcount_begin的实现。seqcount属于序列锁(seqlock)机制,这是Linux内核中专为写少读多场景设计的一种同步机制:读者不加锁地访问共享数据,写者在修改数据前后更新一个序号(sequence counter)。其核心思想是:读者在读取数据前后各记录一次sequence值,若两次的值不同,说明期间有写者修改过数据,读者需要重试;读之前如果发现序号是奇数,说明当前正有写者在写,需要等它变回偶数。写者则在修改前把序号加1变成奇数、修改完再加1变回偶数,以此告诉读者"我正在/已经修改数据"。以下是write_seqcount_begin的实现:

#define write_seqcount_begin(s)						\
do {									\
	seqprop_assert(s);						\
									\
	if (seqprop_preemptible(s))					\
		preempt_disable();					\
									\
	do_write_seqcount_begin(seqprop_ptr(s));			\
} while (0)

do_write_seqcount_begin实现如下:

static inline void do_write_seqcount_begin(seqcount_t *s)
{
	do_write_seqcount_begin_nested(s, 0);
}
static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
{
	seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_);
	do_raw_write_seqcount_begin(s);
}
static inline void do_raw_write_seqcount_begin(seqcount_t *s)
{
	kcsan_nestable_atomic_begin();
	s->sequence++;
	smp_wmb();
}

这里可以看到就是对一个计数进行增加,加完后变成奇数,而在write_seqcount_end还会再次自增这个计数:

/**
 * write_seqcount_end() - end a seqcount_t write side critical section
 * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
 *
 * Context: Preemption will be automatically re-enabled if and only if
 * the seqcount write serialization lock is associated, and preemptible.
 */
#define write_seqcount_end(s)						\
do {									\
	do_write_seqcount_end(seqprop_ptr(s));				\
									\
	if (seqprop_preemptible(s))					\
		preempt_enable();					\
} while (0)
static inline void do_write_seqcount_end(seqcount_t *s)
{
	seqcount_release(&s->dep_map, _RET_IP_);
	do_raw_write_seqcount_end(s);
}

这样这个sequence变成偶数,这种奇偶关系的变化会直接影响读端的接口,比如read_seqbegin的实现:

/**
 * read_seqbegin() - start a seqlock_t read side critical section
 * @sl: Pointer to seqlock_t
 *
 * Return: count, to be passed to read_seqretry()
 */
static inline unsigned read_seqbegin(const seqlock_t *sl)
{
	return read_seqcount_begin(&sl->seqcount);
}
/**
 * read_seqcount_begin() - begin a seqcount_t read critical section
 * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
 *
 * Return: count to be passed to read_seqcount_retry()
 */
#define read_seqcount_begin(s)						\
({									\
	seqcount_lockdep_reader_access(seqprop_const_ptr(s));		\
	raw_read_seqcount_begin(s);					\
})
#define raw_read_seqcount_begin(s) __read_seqcount_begin(s)
/**
 * __read_seqcount_begin() - begin a seqcount_t read section
 * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
 *
 * Return: count to be passed to read_seqcount_retry()
 */
#define __read_seqcount_begin(s)					\
({									\
	unsigned __seq;							\
									\
	while (unlikely((__seq = seqprop_sequence(s)) & 1))		\
		cpu_relax();						\
									\
	kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX);			\
	__seq;								\
})

从这里可以看到,读端要是遇到写者(sequence计数为奇数),就循环执行cpu_relax(pause)降低功耗,直到写者写完把它改回偶数,读者才退出while循环继续往下读。
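把上面读写两端的奇偶协议抽出来,可以在用户态写一个最小的示意版本(只演示序号的奇偶协议和重试逻辑;真实并发下数据访问本身还需要配合恰当的内存屏障,内核实现里的smp_wmb/smp_rmb这里用acquire/release粗略代替):

#include <stdatomic.h>
#include <stdio.h>

static atomic_uint seq;			/* 偶数:无写者;奇数:写者在临界区内 */
static unsigned long long data_a, data_b;

static void write_begin(void) { atomic_fetch_add_explicit(&seq, 1, memory_order_release); }
static void write_end(void)   { atomic_fetch_add_explicit(&seq, 1, memory_order_release); }

static unsigned read_begin(void)
{
	unsigned s;

	/* 序号为奇数说明有写者,自旋等待它变回偶数 */
	while ((s = atomic_load_explicit(&seq, memory_order_acquire)) & 1)
		;
	return s;
}

static int read_retry(unsigned s)
{
	/* 读完后序号变了,说明期间发生过写入,需要重读 */
	return atomic_load_explicit(&seq, memory_order_acquire) != s;
}

int main(void)
{
	unsigned long long a, b;
	unsigned s;

	write_begin();
	data_a = 1;
	data_b = 2;
	write_end();

	do {
		s = read_begin();
		a = data_a;
		b = data_b;
	} while (read_retry(s));

	printf("a=%llu b=%llu\n", a, b);
	return 0;
}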

现在回过头来看vtime_task_switch_generic的主体逻辑,其实就是统计CPU处于某种状态下的时间长短,比如vtime_account_idle实现如下:

void vtime_account_idle(struct task_struct *tsk)
{
	account_idle_time(get_vtime_delta(&tsk->vtime));
}
/*
 * Account for idle time.
 * @cputime: the CPU time spent in idle wait
 */
void account_idle_time(u64 cputime)
{
	u64 *cpustat = kcpustat_this_cpu->cpustat;
	struct rq *rq = this_rq();

	if (atomic_read(&rq->nr_iowait) > 0)
		cpustat[CPUTIME_IOWAIT] += cputime;
	else
		cpustat[CPUTIME_IDLE] += cputime;
}
static u64 get_vtime_delta(struct vtime *vtime)
{
	u64 delta = vtime_delta(vtime);
	u64 other;

	/*
	 * Unlike tick based timing, vtime based timing never has lost
	 * ticks, and no need for steal time accounting to make up for
	 * lost ticks. Vtime accounts a rounded version of actual
	 * elapsed time. Limit account_other_time to prevent rounding
	 * errors from causing elapsed vtime to go negative.
	 */
	other = account_other_time(delta);
	WARN_ON_ONCE(vtime->state == VTIME_INACTIVE);
	vtime->starttime += delta;

	return delta - other;
}

可以看到idle的情况是把delta时间累加到per-cpu的cpustat数组里;而starttime则是在vtime_task_switch_generic稍后、处理current的vtime时写入的。stime等其它时间的记录方式不太一样,比如vtime_task_switch_generic->__vtime_account_kernel->vtime_account_system:

static void vtime_account_system(struct task_struct *tsk,
				 struct vtime *vtime)
{
	vtime->stime += get_vtime_delta(vtime);
	if (vtime->stime >= TICK_NSEC) {
		account_system_time(tsk, irq_count(), vtime->stime);
		vtime->stime = 0;
	}
}

可以看到stime是直接累加在vtime结构体里的。vtime_task_switch_generic前半部分逻辑处理prev,主要就是前面介绍的记录相应的时间delta;后半部分则为当前进程(current)记录状态和时间戳,以便它下次以prev身份进入vtime_task_switch_generic时,前半部分能够算出消耗的delta时间。

perf_event_task_sched_in是perf子系统在任务被调度进CPU时的钩子,用于切换/使能与该任务相关的perf事件上下文,与调度器本身的逻辑关系不大。

finish_task是最后一次引用prev任务了:

static inline void finish_task(struct task_struct *prev)
{
#ifdef CONFIG_SMP
	/*
	 * This must be the very last reference to @prev from this CPU. After
	 * p->on_cpu is cleared, the task can be moved to a different CPU. We
	 * must ensure this doesn't happen until the switch is completely
	 * finished.
	 *
	 * In particular, the load of prev->state in finish_task_switch() must
	 * happen before this.
	 *
	 * Pairs with the smp_cond_load_acquire() in try_to_wake_up().
	 */
	smp_store_release(&prev->on_cpu, 0);
#endif
}

将prev->on_cpu设置成0,表示prev已经不在这个cpu上运行了。

tick_nohz_task_switch的功能前面有所介绍,这里主要想提一点:重新开启tick可以通过tick_sched_to_timer函数。

finish_lock_switch->raw_spin_rq_unlock_irq里主要是释放锁以及调用local_irq_enable开启中断。

以上就是进程切换的所有分析。

Author: Cauchy(pqy7172@gmail.com)

Created: 2025-08-02 Sat 11:17
