Re: [PATCH v2 8/9] sched/fair: Detect capacity inversion

From: Dietmar Eggemann
Date: Wed Nov 16 2022 - 12:46:10 EST

Next message: Sean Christopherson: "Re: [V4 PATCH 2/3] KVM: selftests: Add arch specific initialization"
Previous message: Jason Gunthorpe: "Re: [patch 09/39] powerpc/pseries/msi: Use msi_domain_ops::msi_post_free()"
Next in thread: Qais Yousef: "Re: [PATCH v2 8/9] sched/fair: Detect capacity inversion"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 12/11/2022 20:35, Qais Yousef wrote:
> On 11/09/22 11:42, Dietmar Eggemann wrote:
>

[...]

>>> + * thermal pressure. We record it as capacity
>>> + * inversion.
>>> + */
>>> + if (capacity_orig == pd_cap_orig) {
>>> + pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
>>> +
>>> + if (pd_cap > inv_cap) {
>>> + rq->cpu_capacity_inverted = inv_cap;
>>> + break;
>>> + }
>>
>> In case `capacity_orig == pd_cap_orig` and cpumask_test_cpu(cpu_of(rq),
>> pd_span) the code can set rq->cpu_capacity_inverted = inv_cap
>> erroneously since thermal_load_avg(rq) can return different values for
>> inv_cap and pd_cap.
>
> Good catch!
>
>>
>> So even on a classical big little system, this condition can set
>> rq->cpu_capacity_inverted for a CPU in the little or big cluster.
>>
>> thermal_load_avg(rq) would have to stay constant for all CPUs within the
>> PD to avoid this.
>>
>> This is one example of the `thermal pressure` is per PD (or Frequency
>> Domain) in Thermal but per-CPU in the task scheduler.
>
> Only compile tested so far, does this patch address all your points? I should
> get hardware soon to run some tests and send the patch. I might re-write it to
> avoid using pds; though it seems cleaner this way but we miss CAS support.
>
> Thoughts?

I still don't think that the `CPU capacity inversion` implementation
which uses `cap_orig' = cap_orig - thermal load avg (2)` instead of
`cap_orig'' = cap_orig - thermal pressure (1)` for inverted CPUs (i.e.
other PD exists w/ cap_orig > cap_orig') is the right answer, besides
the EAS vs. CAS coverage.

The basic question for me is why do we have to switch between (1) and
(2)? IIRC we introduced (1) in feec() to cater for the CPUfreq policy
min/max capping between schedutil and the CPUfreq driver
__resolve_freq() [drivers/cpufreq/cpufreq.c] (3).

The policy caps are set together with thermal pressure in
cpufreq_set_cur_state() [drivers/thermal/cpufreq_cooling.c] via
freq_qos_update_request().

I think we should only use (2) in the task scheduler even though the
EAS-schedutil machinery would be not 100% in sync in some cases due to (3).
Thermal load avg has similar properties like all the other EWMA-based
signals we use and we have to live with a certain degree of inaccuracy
anyway (e.g. also because of lock-less CPU statistic fetching when
selecting CPU).

And in this case we wouldn't have to have infrastructure to switch
between (1) and (2) at all.

To illustrate the problem I ran 2 workloads (hackbench/sleep) on a H960
board with step-wise thermal governor tracing thermal load_avg
(sched_pelt_thermal), thermal pressure (thermal_pressure_update) and CPU
capacity (sched_cpu_capacity). Would we really gain something
substantial reliably when we would know the diff between (1) and (2)?

https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/thermal_pressure.ipynb

Next message: Sean Christopherson: "Re: [V4 PATCH 2/3] KVM: selftests: Add arch specific initialization"
Previous message: Jason Gunthorpe: "Re: [patch 09/39] powerpc/pseries/msi: Use msi_domain_ops::msi_post_free()"
Next in thread: Qais Yousef: "Re: [PATCH v2 8/9] sched/fair: Detect capacity inversion"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]