Re: [PATCH v7 5/6] doc: Document CONFIG_RCU_CPU_STALL_CPUTIME=y stall information

From: Frederic Weisbecker
Date: Wed Nov 16 2022 - 17:55:18 EST

Next message: Saeed Mahameed: "Re: [PATCH v1] net/ethtool/ioctl: ensure that we have phy ops before using them"
Previous message: Thomas Gleixner: "Re: [patch 13/39] PCI/MSI: Use msi_domain_info::bus_token"
Next in thread: Leizhen (ThunderTown): "Re: [PATCH v7 5/6] doc: Document CONFIG_RCU_CPU_STALL_CPUTIME=y stall information"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, Nov 11, 2022 at 09:07:08PM +0800, Zhen Lei wrote:
> +1. A CPU looping with interrupts disabled.::
> +
> +     rcu:          hardirqs   softirqs   csw/system
> +     rcu:  number:        0          0            0
> +     rcu: cputime:        0          0            0   ==> 2500(ms)
> +
> + Because interrupts have been disabled throughout the measurement
> + interval, there are no interrupts and no context switches.
> + Furthermore, because CPU time consumption was measured using interrupt
> + handlers, the system CPU consumption is misleadingly measured as zero.
> + This scenario will normally also have "(0 ticks this GP)" printed on
> + this CPU's summary line.
> +
> +2. A CPU looping with bottom halves disabled.
> +
> + This is similar to the previous example, but with non-zero number of
> + and CPU time consumed by hard interrupts, along with non-zero CPU
> + time consumed by in-kernel execution.::
> +
> +     rcu:          hardirqs   softirqs   csw/system
> +     rcu:  number:      624          0            0
> +     rcu: cputime:       49          0         2446   ==> 2500(ms)
> +
> + The fact that there are zero softirqs gives a hint that these were
> + disabled, perhaps via local_bh_disable(). It is of course possible
> + that there were no softirqs, perhaps because all events that would
> + result in softirq execution are confined to other CPUs. In this case,
> + the diagnosis should continue as shown in the next example.
> +
> +3. A CPU looping with preemption disabled.
> +
> + Here, only the number of context switches is zero.::
> +
> +     rcu:          hardirqs   softirqs   csw/system
> +     rcu:  number:      624         45            0
> +     rcu: cputime:       69          1         2425   ==> 2500(ms)
> +
> + This situation hints that the stalled CPU was looping with preemption
> + disabled.
> +
> +4. No looping, but massive hard and soft interrupts.::
> +
> +     rcu:          hardirqs   softirqs   csw/system
> +     rcu:  number:       xx         xx            0
> +     rcu: cputime:       xx         xx            0   ==> 2500(ms)
> +
> + Here, the number and CPU time of hard interrupts are all non-zero,
> + but the number of context switches and the in-kernel CPU time consumed
> + are zero. The number and cputime of soft interrupts will usually be
> + non-zero, but could be zero, for example, if the CPU was spinning
> + within a single hard interrupt handler.
> +
> + If this type of RCU CPU stall warning can be reproduced, you can
> + narrow it down by looking at /proc/interrupts or by writing code to
> + trace each interrupt, for example, by referring to show_interrupts().
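
For reference, the four quoted cases come down to which counters are zero
over the sampling window. Below is a rough sketch of that decision order,
not part of the patch and not kernel code. The StallCounters type, the
classify() helper and the returned labels are hypothetical names chosen
for illustration, and the interrupt-storm numbers in the usage note are
made up because the patch only shows "xx" there.

```python
from typing import NamedTuple


class StallCounters(NamedTuple):
    """Hypothetical container for one stall report's counters."""
    hardirqs: int    # number of hard interrupts in the window
    softirqs: int    # number of soft interrupts in the window
    csw: int         # number of context switches in the window
    hardirq_ms: int  # CPU time spent in hard interrupts (ms)
    softirq_ms: int  # CPU time spent in soft interrupts (ms)
    system_ms: int   # in-kernel CPU time (ms)


def classify(c: StallCounters) -> str:
    """Map counters to one of the four quoted cases, or "other".

    The result is only a hint: zero softirqs may simply mean that no
    softirq was raised on this CPU, as the quoted text points out.
    """
    if c.csw != 0:
        return "other"  # all four quoted cases show zero context switches
    if c.hardirqs == 0:
        # Case 1: nothing was counted at all, not even system time.
        if c.softirqs == 0 and c.system_ms == 0:
            return "irqs-disabled"
        return "other"
    if c.system_ms == 0:
        return "interrupt-storm"   # case 4: interrupts but no system time
    if c.softirqs == 0:
        return "bh-disabled"       # case 2: hardirqs only
    return "preempt-disabled"      # case 3: only csw is zero
```

With the numbers from the quoted examples, classify(StallCounters(624, 0, 0,
49, 0, 2446)) gives "bh-disabled" and classify(StallCounters(624, 45, 0, 69,
1, 2425)) gives "preempt-disabled".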

One last question. Usually all of this information can be deduced just by
looking at the stack trace that comes along with an RCU stall report. So in
which kinds of situations is the stack trace not enough?

Thanks.
