On 2022/11/24 04:23, Waiman Long wrote:
On 11/23/22 03:21, haifeng.xu wrote:Hi, Longman.
When change the 'cpuset.mems' under some cgroup, system will hungCould you try the attached test patch to see if it can fix your problem?
for a long time. From the dmesg, many processes or theads are
stuck in fork/exit. The reason is show as follows.
thread A:
cpuset_write_resmask /* takes cpuset_rwsem */
...
update_tasks_nodemask
mpol_rebind_mm /* waits mmap_lock */
thread B:
worker_thread
...
cpuset_migrate_mm_workfn
do_migrate_pages /* takes mmap_lock */
thread C:
cgroup_procs_write /* takes cgroup_mutex and cgroup_threadgroup_rwsem */
...
cpuset_can_attach
percpu_down_write /* waits cpuset_rwsem */
Once update the nodemasks of cpuset, thread A wakes up thread B to
migrate mm. But when thread A iterates through all tasks, including
child threads and group leader, it has to wait the mmap_lock which
has been take by thread B. Unfortunately, thread C wants to migrate
tasks into cgroup at this moment, it must wait thread A to release
cpuset_rwsem. If thread B spends much time to migrate mm, the
fork/exit which acquire cgroup_threadgroup_rwsem also need to
wait for a long time.
There is no need to migrate the mm of child threads which is
shared with group leader. Just iterate through the group
leader only.
Signed-off-by: haifeng.xu <haifeng.xu@xxxxxxxxxx>
---
kernel/cgroup/cpuset.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 589827ccda8b..43cbd09546d0 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1968,6 +1968,9 @@ static void update_tasks_nodemask(struct cpuset
*cs)
cpuset_change_task_nodemask(task, &newmems);
+ if (!thread_group_leader(task))
+ continue;
+
mm = get_task_mm(task);
if (!mm)
continue;
Something along the line of this patch will be more acceptable.
Thanks,
Longman
Thanks for your patch, but there are still some problems.
1)
(group leader, node: 0,1)
cgroup0
/ \
/ \
cgroup1 cgroup2
(threads) (threads)
If set node 0 in cgroup1 and node 1 in cgroup2, both of them will update
the mm. And the nodemask of mm depends on who set the node last.
2)
(process, node: 0,1)
cgroup0
/ \
/ \
cgroup1 cgroup2
(node: 0) (node: 1)
If migrate thread from cgroup0 to cgroup1 or cgroup2, cpuset_attach
won't update the mm. So the nodemask of thread, including mems_allowed
and mempolicy(updated in cpuset_change_task_nodemask), is different from
the vm_policy in vma(updated in mpol_rebind_mm).
In a word, if threads have different cpusets with different nodemask, it
will cause inconsistent memory behavior.