Re: PID_NS unshare VS synchronize_rcu_tasks() (was: Re: [Syzkaller & bisect] There is task hung in "synchronize_rcu" in v6.1-rc5 kernel)
From: Pengfei Xu
Date: Wed Nov 23 2022 - 10:45:51 EST
Hi Frederic Weisbecker,
On 2022-11-23 at 15:37:58 +0100, Frederic Weisbecker wrote:
> On Mon, Nov 21, 2022 at 01:37:06PM +0800, Pengfei Xu wrote:
> > Hi Frederic Weisbecker and kernel developers,
> >
> > Greetings!
> > There is a task hung in "synchronize_rcu" in the v6.1-rc5 kernel.
> >
> > I bisected the issue on Raptor and on a server (big cores only, no Atom
> > small cores); both platforms' bisect results show that the
> > first bad commit is c597bfddc9e9e8a63817252b67c3ca0e544ace26:
> > "sched: Provide Kconfig support for default dynamic preempt mode"
> >
> > [ 300.097166] INFO: task rcu_tasks_kthre:11 blocked for more than 147 seconds.
> > [ 300.097455] Not tainted 6.1.0-rc5-094226ad94f4 #1
> > [ 300.097641] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 300.097922] task:rcu_tasks_kthre state:D stack:0 pid:11 ppid:2 flags:0x00004000
> > [ 300.098230] Call Trace:
> > [ 300.098325] <TASK>
> > [ 300.098410] __schedule+0x2de/0x8f0
> > [ 300.098562] schedule+0x5b/0xe0
> > [ 300.098693] schedule_timeout+0x3f1/0x4b0
> > [ 300.098849] ? __sanitizer_cov_trace_pc+0x25/0x60
> > [ 300.099032] ? queue_delayed_work_on+0x82/0xc0
> > [ 300.099206] wait_for_completion+0x81/0x140
> > [ 300.099373] __synchronize_srcu.part.23+0x83/0xb0
> > [ 300.099558] ? __bpf_trace_rcu_stall_warning+0x20/0x20
> > [ 300.099757] synchronize_srcu+0xd6/0x100
> > [ 300.099913] rcu_tasks_postscan+0x19/0x20
> > [ 300.100070] rcu_tasks_wait_gp+0x108/0x290
> > [ 300.100230] ? _raw_spin_unlock+0x1d/0x40
> > [ 300.100389] rcu_tasks_one_gp+0x27f/0x370
> > [ 300.100546] ? rcu_tasks_postscan+0x20/0x20
> > [ 300.100709] rcu_tasks_kthread+0x37/0x50
> > [ 300.100863] kthread+0x14d/0x190
> > [ 300.100998] ? kthread_complete_and_exit+0x40/0x40
> > [ 300.101199] ret_from_fork+0x1f/0x30
> > [ 300.101347] </TASK>
>
> Thanks for reporting this. Fortunately I managed to reproduce and debug it.
> It took me a few days to understand the complicated circular dependency
> involved.
>
> So here is a summary:
>
> 1) TASK A calls unshare(CLONE_NEWPID), this creates a new PID namespace
> that every subsequent child of TASK A will belong to. But TASK A doesn't
> itself belong to that new PID namespace.
>
> 2) TASK A forks() and creates TASK B (it is a new threadgroup so it is a
> thread group leader). TASK A stays attached to its PID namespace (let's say PID_NS1)
> and TASK B is the first task belonging to the new PID namespace created by
> unshare() (let's call it PID_NS2).
>
> 3) Since TASK B is the first task attached to PID_NS2, it becomes the PID_NS2
> child reaper.
>
> 4) TASK A forks() again and creates TASK C which gets attached to PID_NS2.
> Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has
> TASK B (belonging to PID_NS2) as a pid_namespace child_reaper.
>
> 5) TASK B exits and since it is the child reaper for PID_NS2, it has to
> kill all other tasks attached to PID_NS2, and wait for all of them to die
> before reaping itself (zap_pid_ns_processes()). Note it seems to make a
> misleading assumption here, trusting that all tasks in PID_NS2 either
> get reaped by a parent belonging to the same namespace or by TASK B.
> And it is confident that since it deactivated the SIGCHLD handler, all
> the remaining tasks ultimately autoreap. And it waits for that to happen.
> However TASK C escapes that rule because it will get reaped by its parent
> TASK A belonging to PID_NS1.
>
> 6) TASK A calls synchronize_rcu_tasks() which leads to
> synchronize_srcu(&tasks_rcu_exit_srcu).
>
> 7) TASK B is waiting for TASK C to get reaped (wrongly assuming it autoreaps).
> But TASK B is under a tasks_rcu_exit_srcu SRCU critical section
> (exit_notify() is between exit_tasks_rcu_start() and
> exit_tasks_rcu_finish()), blocking TASK A.
>
> 8) TASK C exits and, since TASK A is its parent, waits for TASK A to reap it.
> But TASK A can't, because TASK A waits for TASK B, which waits for TASK C.
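
The namespace membership described in the steps above can be modelled in a few lines of Python. This is only an illustrative sketch, not kernel code: the `Task` class and the `escapes_reaper()` helper are hypothetical names, and the rule it encodes (a non-reaper task in the namespace whose parent lives in another namespace is reaped by that outside parent) is taken directly from the summary above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    name: str
    pid_ns: str                      # namespace the task itself belongs to
    parent: Optional["Task"] = None  # real parent, which does the reaping


def escapes_reaper(tasks, ns, reaper):
    """Names of tasks in `ns`, other than the child reaper, whose parent
    belongs to a different PID namespace (assumed rule from the summary)."""
    return [t.name for t in tasks
            if t is not reaper
            and t.pid_ns == ns
            and t.parent is not None
            and t.parent.pid_ns != ns]


# TASK A stays in PID_NS1; both of its children land in PID_NS2.
task_a = Task("TASK A", "PID_NS1")
task_b = Task("TASK B", "PID_NS2", parent=task_a)  # first task: child reaper
task_c = Task("TASK C", "PID_NS2", parent=task_a)

print(escapes_reaper([task_a, task_b, task_c], "PID_NS2", task_b))
```

In this model only TASK C is listed: it is in PID_NS2, but the task that has to reap it is TASK A in PID_NS1.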
>
> So there is a circular dependency:
>
> _ TASK A waits for TASK B to get out of tasks_rcu_exit_srcu SRCU critical
> section
> _ TASK B waits for TASK C to get reaped
> _ TASK C waits for TASK A to reap it.
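
The three wait relations above form a cycle in a wait-for graph. A minimal sketch that checks this follows, assuming each task waits on at most one other task; `find_cycle()` and the `waits_for` dictionary are hypothetical names used only for illustration, not anything from the kernel.

```python
def find_cycle(waits_for, start):
    """Follow wait-for edges from `start`; return the cycle as a list of
    task names (first name repeated at the end), or None if the chain ends."""
    path = []
    seen = {}  # task name -> index in path
    node = start
    while node in waits_for:
        if node in seen:
            return path[seen[node]:] + [node]
        seen[node] = len(path)
        path.append(node)
        node = waits_for[node]
    return None


# Edges taken from the summary above.
waits_for = {
    "TASK A": "TASK B",  # synchronize_srcu(&tasks_rcu_exit_srcu)
    "TASK B": "TASK C",  # child reaper waits for TASK C to be reaped
    "TASK C": "TASK A",  # zombie waits for its parent to reap it
}

cycle = find_cycle(waits_for, "TASK A")
print(" -> ".join(cycle))  # TASK A -> TASK B -> TASK C -> TASK A
```

Removing any one edge, for example if TASK B did not wait for TASK C inside the SRCU read-side critical section, makes `find_cycle()` return None in this model.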
>
> I have no idea how to solve the situation without violating the pid_namespace
> rules and unshare() semantics (although I wish unshare(CLONE_NEWPID) had a less
> error-prone behaviour than allowing the creation of more than one task
> belonging to the same namespace).
>
> So having an SRCU read-side critical section within exit_notify() is probably
> not a good idea. Is there a solution to work around that for RCU tasks?
>
Thanks for the analysis!
One more piece of information: I tried to revert only this commit on top of
v6.1-rc5 mainline with a script, but the kernel build then failed. Since the
revert step could not be verified, I cannot confirm that the bisect result is
100% accurate. I have provided all the information I could.
This issue is too difficult for me.
If I find more clues, I will follow up on this email thread.
Thanks!
BR.
> Thanks.