Re: [PATCH] thermal: intel: hfi: ACK HFI for the same timestamp

From: Rafael J. Wysocki
Date: Wed Nov 23 2022 - 14:15:15 EST


On Thu, Nov 17, 2022 at 12:15 AM Srinivas Pandruvada
<srinivas.pandruvada@xxxxxxxxxxxxxxx> wrote:
>
> Some processors issue more than one HFI interrupt with the same
> timestamp. Each interrupt must be acknowledged to let the hardware issue
> new HFI interrupts. But this can't be done without some additional flow
> modification in the existing interrupt handling.
>
> For background, the HFI interrupt is a package level thermal interrupt
> delivered via a LVT. This LVT is common for both the CPU and package
> level interrupts. Hence, all CPUs receive the HFI interrupts. But only
> one CPU should process interrupt and others simply exit by issuing EOI
> to LAPIC.
>
> The current HFI interrupt processing flow:
>
> 1. Receive Thermal interrupt
> 2. Check if there is an active HFI status in MSR_IA32_THERM_STATUS
> 3. Try and get spinlock, one CPU will enter spinlock and others
> will simply return from here to issue EOI.
> (Let's assume CPU 4 is processing interrupt)
> 4. Check the stored time-stamp from the HFI memory time-stamp
> 5. if same
> 6. ignore interrupt, unlock and return
> 7. Copy the HFI message to local buffer
> 8. unlock spinlock
> 9. ACK HFI interrupt
> 10. Queue the message for processing in a work-queue
>
> It is tempting to simply acknowledge all the interrupts even if they
> have the same timestamp. This may cause some interrupts to not be
> processed.
>
> Let's say CPU5 is slightly late and reaches step 4 while CPU4 is
> between steps 8 and 9.
>
> Currently we simply ignore interrupts with the same timestamp. No
> issue here for CPU5. When CPU4 acknowledges the interrupt, the next
> HFI interrupt can be delivered.
>
> If we acknowledge interrupts with the same timestamp (at step 6), there
> is a race condition. Under the same scenario, CPU 5 will acknowledge
> the HFI interrupt. This lets hardware generate another HFI interrupt,
> before CPU 4 start executing step 9. Once CPU 4 complete step 9, it
> will acknowledge the newly arrived HFI interrupt, without actually
> processing it.
>
> Acknowledge the interrupt when holding the spinlock. This avoids
> contention of the interrupt acknowledgment.
>
> Updated flow:
>
> 1. Receive HFI Thermal interrupt
> 2. Check if there is an active HFI status in MSR_IA32_THERM_STATUS
> 3. Try and get spin-lock
> Let's assume CPU 4 is processing interrupt
> 4.1 Read MSR_IA32_PACKAGE_THERM_STATUS and check HFI status bit
> 4.2 If hfi status is 0
> 4.3 unlock spinlock
> 4.4 return
> 4.5 Check the stored time-stamp from the HFI memory time-stamp
> 5. if same
> 6.1 ACK HFI Interrupt,
> 6.2 unlock spinlock
> 6.3 return
> 7. Copy the HFI message to local buffer
> 8. ACK HFI interrupt
> 9. unlock spinlock
> 10. Queue the message for processing in a work-queue
>
> To avoid taking the lock unnecessarily, intel_hfi_process_event() checks
> the status of the HFI interrupt before taking the lock. If CPU5 is late,
> when it starts processing the interrupt there are two scenarios:
>
> a) CPU4 acknowledged the HFI interrupt before CPU5 read
> MSR_IA32_THERM_STATUS. CPU5 exits.
>
> b) CPU5 reads MSR_IA32_THERM_STATUS before CPU4 has acknowledged the
> interrupt. CPU5 will take the lock if CPU4 has released it. It then
> re-reads MSR_IA32_THERM_STATUS. If there is not a new interrupt, the HFI
> status bit is clear and CPU5 exits. If a new HFI interrupt was generated
> it will find that the status bit is set and it will continue to process
> the interrupt. In this case even if timestamp is not changed, the ACK
> can be issued as this is a new interrupt.
>
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@xxxxxxxxxxxxxxx>
> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@xxxxxxxxxxxxxxx>
> Tested-by: Arshad, Adeel<adeel.arshad@xxxxxxxxx>
> ---
> This patch depends on two other patches posted before:
> [PATCH RESEND 1/2] thermal: intel: Prevent accidental clearing of HFI status
> [PATCH RESEND 2/2] thermal: intel: Protect clearing of thermal status bits
>
> drivers/thermal/intel/intel_hfi.c | 22 +++++++++++++++++-----
> 1 file changed, 17 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> index c9e0827c9ebe..c543a967cd50 100644
> --- a/drivers/thermal/intel/intel_hfi.c
> +++ b/drivers/thermal/intel/intel_hfi.c
> @@ -250,7 +250,7 @@ void intel_hfi_process_event(__u64 pkg_therm_status_msr_val)
> struct hfi_instance *hfi_instance;
> int cpu = smp_processor_id();
> struct hfi_cpu_info *info;
> - u64 new_timestamp;
> + u64 new_timestamp, msr, hfi;
>
> if (!pkg_therm_status_msr_val)
> return;
> @@ -279,9 +279,21 @@ void intel_hfi_process_event(__u64 pkg_therm_status_msr_val)
> if (!raw_spin_trylock(&hfi_instance->event_lock))
> return;
>
> - /* Skip duplicated updates. */
> + rdmsrl(MSR_IA32_PACKAGE_THERM_STATUS, msr);
> + hfi = msr & PACKAGE_THERM_STATUS_HFI_UPDATED;
> + if (!hfi) {
> + raw_spin_unlock(&hfi_instance->event_lock);
> + return;
> + }
> +
> + /*
> + * Ack duplicate update. Since there is an active HFI
> + * status from HW, it must be a new event, not a case
> + * where a lagging CPU entered the locked region.
> + */
> new_timestamp = *(u64 *)hfi_instance->hw_table;
> if (*hfi_instance->timestamp == new_timestamp) {
> + thermal_clear_package_intr_status(PACKAGE_LEVEL, PACKAGE_THERM_STATUS_HFI_UPDATED);
> raw_spin_unlock(&hfi_instance->event_lock);
> return;
> }
> @@ -295,15 +307,15 @@ void intel_hfi_process_event(__u64 pkg_therm_status_msr_val)
> memcpy(hfi_instance->local_table, hfi_instance->hw_table,
> hfi_features.nr_table_pages << PAGE_SHIFT);
>
> - raw_spin_unlock(&hfi_instance->table_lock);
> - raw_spin_unlock(&hfi_instance->event_lock);
> -
> /*
> * Let hardware know that we are done reading the HFI table and it is
> * free to update it again.
> */
> thermal_clear_package_intr_status(PACKAGE_LEVEL, PACKAGE_THERM_STATUS_HFI_UPDATED);
>
> + raw_spin_unlock(&hfi_instance->table_lock);
> + raw_spin_unlock(&hfi_instance->event_lock);
> +
> queue_delayed_work(hfi_updates_wq, &hfi_instance->update_work,
> HFI_UPDATE_INTERVAL);
> }
> --

Applied as 6.2 material, thanks!