Re: [PATCH] sched/fair: Introduce priority load balance for CFS

From: Vincent Guittot
Date: Wed Nov 16 2022 - 09:38:31 EST


On Wed, 16 Nov 2022 at 08:37, Song Zhang <zhangsong34@xxxxxxxxxx> wrote:
>
>
>
> On 2022/11/15 15:18, Vincent Guittot wrote:
> > On Mon, 14 Nov 2022 at 17:42, Vincent Guittot
> > <vincent.guittot@xxxxxxxxxx> wrote:
> >>
> >> On Sat, 12 Nov 2022 at 03:51, Song Zhang <zhangsong34@xxxxxxxxxx> wrote:
> >>>
> >>> Hi, Vincent
> >>>
> >>> On 2022/11/3 17:22, Vincent Guittot wrote:
> >>>> On Thu, 3 Nov 2022 at 10:20, Song Zhang <zhangsong34@xxxxxxxxxx> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2022/11/3 16:33, Vincent Guittot wrote:
> >>>>>> On Thu, 3 Nov 2022 at 04:01, Song Zhang <zhangsong34@xxxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> Thanks for your reply!
> >>>>>>>
> >>>>>>> On 2022/11/3 2:01, Vincent Guittot wrote:
> >>>>>>>> On Wed, 2 Nov 2022 at 04:54, Song Zhang <zhangsong34@xxxxxxxxxx> wrote:
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> This really looks like a v3 of
> >>>>>>>> https://lore.kernel.org/all/20220810015636.3865248-1-zhangsong34@xxxxxxxxxx/
> >>>>>>>>
> >>>>>>>> Please keep versioning.
> >>>>>>>>
> >>>>>>>>> Add a new sysctl interface:
> >>>>>>>>> /proc/sys/kernel/sched_prio_load_balance_enabled
> >>>>>>>>
> >>>>>>>> We don't want to add more sysctl knobs for the scheduler, we even
> >>>>>>>> removed some. Knob usually means that you want to fix your use case
> >>>>>>>> but the solution doesn't make sense for all cases.
> >>>>>>>>
> >>>>>>>
> >>>>>>> OK, I will remove this knobs later.
> >>>>>>>
> >>>>>>>>>
> >>>>>>>>> 0: default behavior
> >>>>>>>>> 1: enable priority load balance for CFS
> >>>>>>>>>
> >>>>>>>>> For co-location with idle and non-idle tasks, when CFS do load balance,
> >>>>>>>>> it is reasonable to prefer migrating non-idle tasks and migrating idle
> >>>>>>>>> tasks lastly. This will reduce the interference by SCHED_IDLE tasks
> >>>>>>>>> as much as possible.
> >>>>>>>>
> >>>>>>>> I don't agree that it's always the best choice to migrate a non-idle task 1st.
> >>>>>>>>
> >>>>>>>> CPU0 has 1 non idle task and CPU1 has 1 non idle task and hundreds of
> >>>>>>>> idle task and there is an imbalance between the 2 CPUS: migrating the
> >>>>>>>> non idle task from CPU1 to CPU0 is not the best choice
> >>>>>>>>
> >>>>>>>
> >>>>>>> If the non idle task on CPU1 is running or cache hot, it cannot be
> >>>>>>> migrated and idle tasks can also be migrated from CPU1 to CPU0. So I
> >>>>>>> think it does not matter.
> >>>>>>
> >>>>>> What I mean is that migrating non idle tasks first is not a universal
> >>>>>> win and not always what we want.
> >>>>>>
> >>>>>
> >>>>> But migrating online tasks first is mostly a trade-off that
> >>>>> non-idle(Latency Sensitive) tasks can obtain more CPU time and minimize
> >>>>> the interference caused by IDLE tasks. I think this makes sense in most
> >>>>> cases, or you can point out what else I need to think about it ?
> >>>>>
> >>>>> Best regards.
> >>>>>
> >>>>>>>
> >>>>>>>>>
> >>>>>>>>> Testcase:
> >>>>>>>>> - Spawn large number of idle(SCHED_IDLE) tasks occupy CPUs
> >>>>>>>>
> >>>>>>>> What do you mean by a large number ?
> >>>>>>>>
> >>>>>>>>> - Let non-idle tasks compete with idle tasks for CPU time.
> >>>>>>>>>
> >>>>>>>>> Using schbench to test non-idle tasks latency:
> >>>>>>>>> $ ./schbench -m 1 -t 10 -r 30 -R 200
> >>>>>>>>
> >>>>>>>> How many CPUs do you have ?
> >>>>>>>>
> >>>>>>>
> >>>>>>> OK, some details may not be mentioned.
> >>>>>>> My virtual machine has 8 CPUs running with a schbench process and 5000
> >>>>>>> idle tasks. The idle task is a while dead loop process below:
> >>>>>>
> >>>>>> How can you care about latency when you start 10 workers on 8 vCPUs
> >>>>>> with 5000 non idle threads ?
> >>>>>>
> >>>>>
> >>>>> No no no... spawn 5000 idle(SCHED_IDLE) processes not 5000 non-idle
> >>>>> threads, and with 10 non-idle schbench workers on 8 vCPUs.
> >>>>
> >>>> yes spawn 5000 idle tasks but my point remains the same
> >>>>
> >>>
> >>> I am so sorry that I have not received your reply for a long time, and I
> >>> am still waiting for it anxiously. In fact, migrating non-idle tasks 1st
> >>> works well in most scenarios, so it maybe possible to add a
> >>> sched_feat(LB_PRIO) to enable or disable that. Finally, I really hope
> >>> you can give me some better advice.
> >>
> >> I have seen that you posted a v4 5 days ago which is on my list to be reviewed.
> >>
> >> My concern here remains that selecting non idle task 1st is not always
> >> the best choices as for example when you have 1 non idle task per cpu
> >> and thousands of idle tasks moving around. Then regarding your use
> >> case, the weight of the 5000 idle threads is around twice more than
> >> the weight of your non idle bench: sum weight of idle threads is 15k
> >> whereas the weight of your bench is around 6k IIUC how RPS run. This
> >> also means that the idle threads will take a significant times of the
> >> system: 5000 / 7000 ticks. I don't understand how you can care about
> >> latency in such extreme case and I'm interested to get the real use
> >> case where you can have such situation.
> >>
> >> All that to say that idle task remains cfs task with a small but not
> >> null weight and we should not make them special other than by not
> >> preempting at wakeup.
> >
> > Also, as mentioned for a previous version, a task with nice prio 19
> > has a weight of 15 so if you replace the 5k idle threads with 1k cfs
> > w/ nice prio 19 threads, you will face a similar problem. So you can't
> > really care only on the idle property of a task
> >
>
> Well, my original idea was to consider interference between tasks of
> different priorities when doing CFS load balancing to ensure that
> non-idle tasks get more CPU scheduler time without changing the native
> CFS load balancing policy.
>
> Consider a simple scenario. Assume that CPU 0 has two non-idle tasks
> whose weight is 1024 * 2 = 2048, also CPU 0 has 1000 idle tasks whose
> weight is 1K x 15 = 15K. CPU 1 is idle. Therefore, IDLE load balance is

weight of cfs idle thread is 3, the weight of cfs nice 19 thread is 15

> triggered. CPU 1 needs to pull a certain number of tasks from CPU 0. If
> we do not considerate task priorities and interference between tasks,
> more than 600 idle tasks on CPU 0 may be migrated to CPU 1. As a result,
> two non-idle tasks still compete on CPU 0. However, CPU 1 is running
> with all idle but not non-idle tasks.
>
> Let's calculate the percentage of CPU time gained by non-idle tasks in a
> scheduling period:
>
> CPU 1: time_percent(non-idle tasks) = 0
> CPU 0: time_percent(non-idle tasks) = 2048 * 2 / (2048 + 15000) = 24%

2 cfs task nice 0 with 1000 cfs idle tasks on 2 CPUs. The weight of
the system is:

2*1024 + 1000*3 = 5048 or 2524 per CPU

This means that the cfs nice 0 task should get 1024/(5048) = 20% of
system time which means 40% of CPUs time.

This also means that the 2 cfs tasks on CPU0 is a valid configuration
as they will both have their 40% of CPUs

cfs idle threads have a small weight to be negligible compared to
"normal" threads so they can't normally balance a system by themself
but by spawning 1000+ cfs idle threads, you make them not negligible
anymore. That's the root of your problem. A CPU with only cfs idle
tasks should be seen unbalanced compared to other CPUs with non idle
tasks and this is what is happening with small/normal number of cfs
idle threads

>
> On the other hand, if we consider the interference between different
> task priorities, we change the migration policy to firstly migrate an
> non-idle task on CPU 0 to CPU 1. Migrating idle tasks on CPU 0 maybe
> interfered with the non-idle task on CPU 1. So we decide to migrate idle
> tasks on CPU 0 after non-idle tasks on CPU 1 are completed or exited.
>
> Now the percentage of the CPU time obtained by the non-idle tasks in a
> scheduling period is as follows:
>
> CPU 1: time_percent(non-idle tasks) = 1024 / 1024 = 100%
> CPU 0: time_percent(non-idle tasks) = 1024 / (1024 + 15000) = 6.4%

But this is unfair for one cfs nice 0 thread and all cfs idle threads

>
> Obviously, if load balance migration tasks prefer migrate non-idle tasks
> and suppress the interference of idle tasks migration on non-idle tasks,
> the latency of non-idle tasks can be significantly reduced. Although
> this will cause some idle tasks imbalance between different CPUs and
> reduce throughput of idle tasks., I think this strategy is feasible in
> some real-time business scenarios for latency tasks.

But idle cfs ask remains cfs task and we keep cfs fairness for all threads

Have you tried to :
- Increase nice priority of the non idle cfs task so the sum of the
weight of idle tasks remain a small portion of the total weight ?
- to put your thousands idle tasks in a cgroup and set cpu.idle for
this cgroup. This should also ensure that the weight of idle threads
remains negligible compared to others.

I have tried both setup in my local system and I have 1 non idle task per CPU

Regards,
Vincent

>
> >>
> >>>
> >>> Best regards.
> >>>
> >>> Song Zhang
> > .