Re: [PATCH v2 8/9] sched/fair: Detect capacity inversion
From: Qais Yousef
Date: Sun Nov 20 2022 - 16:30:26 EST
On 11/16/22 18:45, Dietmar Eggemann wrote:
> On 12/11/2022 20:35, Qais Yousef wrote:
> > On 11/09/22 11:42, Dietmar Eggemann wrote:
> >
>
> [...]
>
> >>> +			 * thermal pressure. We record it as capacity
> >>> +			 * inversion.
> >>> +			 */
> >>> +			if (capacity_orig == pd_cap_orig) {
> >>> +				pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
> >>> +
> >>> +				if (pd_cap > inv_cap) {
> >>> +					rq->cpu_capacity_inverted = inv_cap;
> >>> +					break;
> >>> +				}
> >>
> >> In case `capacity_orig == pd_cap_orig` and cpumask_test_cpu(cpu_of(rq),
> >> pd_span) the code can set rq->cpu_capacity_inverted = inv_cap
> >> erroneously since thermal_load_avg(rq) can return different values for
> >> inv_cap and pd_cap.
> >
> > Good catch!
> >
> >>
> >> So even on a classical big.LITTLE system, this condition can set
> >> rq->cpu_capacity_inverted for a CPU in the little or the big cluster.
> >>
> >> thermal_load_avg(rq) would have to stay constant for all CPUs within the
> >> PD to avoid this.
> >>
> >> This is one example of how `thermal pressure` is per PD (or Frequency
> >> Domain) in Thermal but per-CPU in the task scheduler.
> >
> > Only compile tested so far - does this patch address all your points? I
> > should get hardware soon to run some tests and send the patch. I might
> > re-write it to avoid using pds; it seems cleaner this way, but then we
> > miss CAS support.
> >
> > Thoughts?
>
> I still don't think that the `CPU capacity inversion` implementation,
> which uses `cap_orig' = cap_orig - thermal load avg (2)` instead of
> `cap_orig'' = cap_orig - thermal pressure (1)` for inverted CPUs (i.e.
> another PD exists w/ cap_orig > cap_orig'), is the right answer, quite
> apart from the EAS vs. CAS coverage.
>
> The basic question for me is why do we have to switch between (1) and
> (2)? IIRC we introduced (1) in feec() to cater for the CPUfreq policy
> min/max capping between schedutil and the CPUfreq driver
> __resolve_freq() [drivers/cpufreq/cpufreq.c] (3).
>
> The policy caps are set together with thermal pressure in
> cpufreq_set_cur_state() [drivers/thermal/cpufreq_cooling.c] via
> freq_qos_update_request().
>
> I think we should only use (2) in the task scheduler even though the
> EAS-schedutil machinery would not be 100% in sync in some cases due to (3).
> Thermal load avg has similar properties to all the other EWMA-based
> signals we use, and we have to live with a certain degree of inaccuracy
> anyway (e.g. also because of lock-less CPU statistic fetching when
> selecting a CPU).
>
> And in this case we wouldn't have to have infrastructure to switch
> between (1) and (2) at all.
>
> To illustrate the problem I ran 2 workloads (hackbench/sleep) on a H960
> board with the step-wise thermal governor, tracing thermal load_avg
> (sched_pelt_thermal), thermal pressure (thermal_pressure_update) and CPU
> capacity (sched_cpu_capacity). Would we really gain anything substantial
> and reliable from knowing the diff between (1) and (2)?
>
> https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/thermal_pressure.ipynb
>
So what you're asking for is to switch to this?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b01854984994..989f1947bd34 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8862,7 +8862,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 	rcu_read_lock();
-	inv_cap = capacity_orig - thermal_load_avg(rq);
+	inv_cap = capacity_orig - arch_scale_thermal_pressure(cpu_of(rq));
 	pd = rcu_dereference(rq->rd->pd);
 	rq->cpu_capacity_inverted = 0;
@@ -8887,7 +8887,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 				 * inversion.
 				 */
 				if (capacity_orig == pd_cap_orig) {
-					pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
+					pd_cap = pd_cap_orig - arch_scale_thermal_pressure(cpu);
 
 					if (pd_cap > inv_cap) {
 						rq->cpu_capacity_inverted = inv_cap;
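
For reference, the two accessors differ in what they track. A trimmed
sketch from my reading of kernel/sched/pelt.h and
include/linux/arch_topology.h (signatures could differ across trees):

	/* PELT (geometric series) average of thermal pressure, per-rq */
	static inline u64 thermal_load_avg(struct rq *rq)
	{
		return READ_ONCE(rq->avg_thermal.load_avg);
	}

	/* instantaneous capping as set by cpufreq_cooling, per-CPU */
	static inline unsigned long topology_get_thermal_pressure(int cpu)
	{
		return per_cpu(thermal_pressure, cpu);
	}

arm/arm64 define arch_scale_thermal_pressure() to the latter, IIUC.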

My main worry is that rq->cpu_capacity, which is updated in the same
location, uses thermal_load_avg(); keeping the two consistent was
important IMO. Besides, I think we need good certainty that the inversion
is really there - we don't want to be oscillating.

Say the big core's thermal pressure is increasing and it is entering
capacity inversion. If we don't use the average, we'd avoid the CPU on
one tick but place something that drives the frequency high on it the
next tick. This ping-pong could end up not giving the big cores any
breathing room to cool down and settle into one state, no?
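
As a toy illustration of why the averaged signal is less twitchy (a
userspace sketch, not kernel code; the decay factor just approximates
the PELT 32ms half-life over 1ms periods):

	#include <stdio.h>

	int main(void)
	{
		double pressure = 300.0; /* instantaneous capping steps up at t=0 */
		double avg = 0.0;        /* PELT-like running average */
		double y = 0.97857;      /* ~0.5^(1/32) => 32ms half-life */
		int ms;

		for (ms = 1; ms <= 8; ms++) {
			/* EWMA: converge geometrically towards the new value */
			avg = avg * y + pressure * (1.0 - y);
			printf("t=%dms: instantaneous=%.0f avg=%.1f\n",
			       ms, pressure, avg);
		}
		return 0;
	}

The instantaneous signal flips the inversion decision the moment the
capping is applied or lifted; the average takes many ticks to move,
which is the breathing room I mean above.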

I think Lukasz's patch [1] is very important in controlling this aspect.
It might also help make the code more consistent, by enabling us to
switch all users to thermal_load_avg() if we can speed up its reaction
time sufficiently.
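
(Rough numbers, assuming the default 32ms PELT half-life: after a step
change the average still carries 0.5^(t/32ms) of the old value, so it
takes ~4-5 half-lives - well over 100ms - to mostly converge. If
something like [1] lets us shrink the half-life to e.g. 8ms, that drops
to ~35ms.)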

That said, I don't have a bulletproof answer or a very strong opinion
about it. Whichever direction we take, I think we'll have room for
improvement. This seemed the safer, more sensible option to me at this
stage.
[1] https://lore.kernel.org/lkml/20220429091245.12423-1-lukasz.luba@xxxxxxx/
Thanks!
--
Qais Yousef