PID_NS unshare VS synchronize_rcu_tasks() (was: Re: [Syzkaller & bisect] There is task hung in "synchronize_rcu" in v6.1-rc5 kernel)

From: Frederic Weisbecker
Date: Wed Nov 23 2022 - 09:38:12 EST


On Mon, Nov 21, 2022 at 01:37:06PM +0800, Pengfei Xu wrote:
> Hi Frederic Weisbecker and kernel developers,
>
> Greeting!
> There is task hung in "synchronize_rcu" in v6.1-rc5 kernel.
>
> Bisected the issue on Raptor and server(No atom small core, big core only),
> both platforms bisected results show that:
> first bad commit is c597bfddc9e9e8a63817252b67c3ca0e544ace26:
> "sched: Provide Kconfig support for default dynamic preempt mode"
>
> [ 300.097166] INFO: task rcu_tasks_kthre:11 blocked for more than 147 seconds.
> [ 300.097455] Not tainted 6.1.0-rc5-094226ad94f4 #1
> [ 300.097641] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 300.097922] task:rcu_tasks_kthre state:D stack:0 pid:11 ppid:2 flags:0x00004000
> [ 300.098230] Call Trace:
> [ 300.098325] <TASK>
> [ 300.098410] __schedule+0x2de/0x8f0
> [ 300.098562] schedule+0x5b/0xe0
> [ 300.098693] schedule_timeout+0x3f1/0x4b0
> [ 300.098849] ? __sanitizer_cov_trace_pc+0x25/0x60
> [ 300.099032] ? queue_delayed_work_on+0x82/0xc0
> [ 300.099206] wait_for_completion+0x81/0x140
> [ 300.099373] __synchronize_srcu.part.23+0x83/0xb0
> [ 300.099558] ? __bpf_trace_rcu_stall_warning+0x20/0x20
> [ 300.099757] synchronize_srcu+0xd6/0x100
> [ 300.099913] rcu_tasks_postscan+0x19/0x20
> [ 300.100070] rcu_tasks_wait_gp+0x108/0x290
> [ 300.100230] ? _raw_spin_unlock+0x1d/0x40
> [ 300.100389] rcu_tasks_one_gp+0x27f/0x370
> [ 300.100546] ? rcu_tasks_postscan+0x20/0x20
> [ 300.100709] rcu_tasks_kthread+0x37/0x50
> [ 300.100863] kthread+0x14d/0x190
> [ 300.100998] ? kthread_complete_and_exit+0x40/0x40
> [ 300.101199] ret_from_fork+0x1f/0x30
> [ 300.101347] </TASK>

Thanks for reporting this. Fortunately I managed to reproduce and debug it.
It took me a few days to understand the complicated circular dependency
involved.

So here is a summary:

1) TASK A calls unshare(CLONE_NEWPID), this creates a new PID namespace
that every subsequent child of TASK A will belong to. But TASK A doesn't
itself belong to that new PID namespace.

2) TASK A fork()s and creates TASK B (it is a new thread group so TASK B is
its thread group leader). TASK A stays attached to its PID namespace (let's
say PID_NS1) and TASK B is the first task belonging to the new PID namespace
created by unshare() (let's call it PID_NS2).

3) Since TASK B is the first task attached to PID_NS2, it becomes the PID_NS2
child reaper.

4) TASK A fork()s again and creates TASK C which gets attached to PID_NS2.
Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has
TASK B (belonging to PID_NS2) as a pid_namespace child_reaper.

5) TASK B exits and since it is the child reaper for PID_NS2, it has to
kill all other tasks attached to PID_NS2, and wait for all of them to die
before reaping itself (zap_pid_ns_processes()). Note it seems to make a
misleading assumption here, trusting that all tasks in PID_NS2 either
get reaped by a parent belonging to the same namespace or by TASK B.
And it is confident that since it now ignores SIGCHLD, all
the remaining tasks ultimately autoreap. And it waits for that to happen.
However TASK C escapes that rule because it will get reaped by its parent
TASK A belonging to PID_NS1.

6) TASK A calls synchronize_rcu_tasks() which leads to
synchronize_srcu(&tasks_rcu_exit_srcu).

7) TASK B is waiting for TASK C to get reaped (wrongly assuming it autoreaps).
But TASK B is under a tasks_rcu_exit_srcu SRCU critical section
(exit_notify() is between exit_tasks_rcu_start() and
exit_tasks_rcu_finish()), blocking TASK A.

8) TASK C exits and, since TASK A is its parent, waits for TASK A to reap it.
But TASK A can't, because TASK A waits for TASK B, which waits for TASK C.

So there is a circular dependency:

_ TASK A waits for TASK B to get out of tasks_rcu_exit_srcu SRCU critical
section
_ TASK B waits for TASK C to get reaped
_ TASK C waits for TASK A to reap it.
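
To make the cycle explicit, here is a minimal sketch in plain userspace C (not
kernel code) that models the three waits above as a wait-for graph and walks
it. The enum labels, the waits_for[] table and task_on_wait_cycle() are names
made up for this illustration; none of them exist in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the three tasks of the summary. */
enum task { TASK_A, TASK_B, TASK_C, NR_TASKS };

/*
 * waits_for[t] is the task that t is blocked on, or -1 if t can make
 * progress. This table only restates the three bullet points above.
 */
static const int waits_for[NR_TASKS] = {
	[TASK_A] = TASK_B, /* synchronize_srcu(&tasks_rcu_exit_srcu) */
	[TASK_B] = TASK_C, /* waits for TASK C to get reaped */
	[TASK_C] = TASK_A, /* zombie, only its parent TASK A reaps it */
};

/*
 * Follow the wait chain from @start for at most @n hops. Return true
 * if the chain comes back to @start, i.e. @start sits on a cycle.
 */
bool task_on_wait_cycle(const int *graph, size_t n, int start)
{
	int cur = start;

	for (size_t hops = 0; hops < n; hops++) {
		cur = graph[cur];
		if (cur < 0)
			return false; /* chain ends on a runnable task */
		if (cur == start)
			return true;  /* back where we started */
	}
	return false;
}
```

task_on_wait_cycle(waits_for, NR_TASKS, TASK_A) returns true for the table
above. Setting any one of the three entries to -1 (for instance if TASK B
were not inside the SRCU read side critical section while waiting) breaks
the chain and makes it return false.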

I have no idea how to solve the situation without violating the pid_namespace
rules and unshare() semantics (although I wish unshare(CLONE_NEWPID) had a less
error-prone behaviour than letting the caller create more than one task
belonging to the same new namespace).

So having an SRCU read side critical section within exit_notify() is probably
not a good idea. Is there a solution to work around that for RCU tasks?

Thanks.