Re: [PATCH v7 5/6] doc: Document CONFIG_RCU_CPU_STALL_CPUTIME=y stall information
From: Leizhen (ThunderTown)
Date: Wed Nov 16 2022 - 21:03:28 EST
On 2022/11/17 6:55, Frederic Weisbecker wrote:
> On Fri, Nov 11, 2022 at 09:07:08PM +0800, Zhen Lei wrote:
>> +1. A CPU looping with interrupts disabled.::
>> +
>> +   rcu:            hardirqs   softirqs   csw/system
>> +   rcu: number:           0          0            0
>> +   rcu: cputime:          0          0            0   ==> 2500(ms)
>> +
>> + Because interrupts have been disabled throughout the measurement
>> + interval, there are no interrupts and no context switches.
>> + Furthermore, because CPU time consumption was measured using interrupt
>> + handlers, the system CPU consumption is misleadingly measured as zero.
>> + This scenario will normally also have "(0 ticks this GP)" printed on
>> + this CPU's summary line.
>> +
>> +2. A CPU looping with bottom halves disabled.
>> +
>> + This is similar to the previous example, but with a non-zero number of
>> + hard interrupts and non-zero CPU time consumed by them, along with
>> + non-zero CPU time consumed by in-kernel execution.::
>> +
>> +   rcu:            hardirqs   softirqs   csw/system
>> +   rcu: number:         624          0            0
>> +   rcu: cputime:         49          0         2446   ==> 2500(ms)
>> +
>> + The fact that there are zero softirqs gives a hint that these were
>> + disabled, perhaps via local_bh_disable(). It is of course possible
>> + that there were no softirqs, perhaps because all events that would
>> + result in softirq execution are confined to other CPUs. In this case,
>> + the diagnosis should continue as shown in the next example.
>> +
>> +3. A CPU looping with preemption disabled.
>> +
>> + Here, only the number of context switches is zero.::
>> +
>> +   rcu:            hardirqs   softirqs   csw/system
>> +   rcu: number:         624         45            0
>> +   rcu: cputime:         69          1         2425   ==> 2500(ms)
>> +
>> + This situation hints that the stalled CPU was looping with preemption
>> + disabled.
>> +
>> +4. No looping, but massive hard and soft interrupts.::
>> +
>> +   rcu:            hardirqs   softirqs   csw/system
>> +   rcu: number:          xx         xx            0
>> +   rcu: cputime:         xx         xx            0   ==> 2500(ms)
>> +
>> + Here, the number and CPU time of hard interrupts are both non-zero,
>> + but the number of context switches and the in-kernel CPU time consumed
>> + are zero. The number and cputime of soft interrupts will usually be
>> + non-zero, but could be zero, for example, if the CPU was spinning
>> + within a single hard interrupt handler.
>> +
>> + If this type of RCU CPU stall warning can be reproduced, you can
>> + narrow it down by looking at /proc/interrupts or by writing code to
>> + trace each interrupt, for example, by referring to show_interrupts().
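
The four patterns above can be read as a small decision procedure. The
following is only a sketch in Python, not anything the kernel provides:
StallStats and classify() are hypothetical names made up for illustration,
and the order of the checks is an assumption read off the four examples.

```python
from typing import NamedTuple


class StallStats(NamedTuple):
    """Values from the "number:" and "cputime:" rows (hypothetical container)."""
    hardirqs: int    # number of hard interrupts
    softirqs: int    # number of soft interrupts
    csw: int         # number of context switches
    hardirq_ms: int  # CPU time consumed by hard interrupts
    softirq_ms: int  # CPU time consumed by soft interrupts
    system_ms: int   # CPU time consumed by in-kernel execution


def classify(s: StallStats) -> str:
    """Map one stall report onto the four documented patterns."""
    if s.csw != 0:
        # All four examples have zero context switches.
        return "context switches occurred; none of the four patterns applies"
    if s.hardirqs == 0:
        # Example 1: nothing was counted at all.
        return "looping with interrupts disabled"
    if s.system_ms == 0:
        # Example 4: interrupts seen, but no in-kernel time outside them.
        return "no looping, but massive hard and soft interrupts"
    if s.softirqs == 0:
        # Example 2: only a hint, softirqs may simply not have been raised.
        return "possibly looping with bottom halves disabled"
    # Example 3: only the context-switch count is zero.
    return "looping with preemption disabled"
```

For instance, classify(StallStats(624, 45, 0, 69, 1, 2425)) lands on the
preemption-disabled case, matching example 3.
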
>
> One last question I have. Usually all this information can be deduced by
> just looking at the stack trace that comes along with an RCU stall report. So in
> which kind of situation is the stack trace not enough?
Interrupt storm.
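
If such a storm can be reproduced, one rough way to narrow it down from user
space is to take two snapshots of /proc/interrupts and compare the counts for
the stalled CPU. A minimal sketch in Python, assuming the usual layout (a
header row of CPU names, then one row per interrupt source);
parse_interrupts() and busiest() are hypothetical helper names, not existing
tools.

```python
from typing import Dict, List, Tuple


def parse_interrupts(text: str) -> Dict[str, List[int]]:
    """Parse /proc/interrupts text into {irq label: per-CPU counts}."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    counts: Dict[str, List[int]] = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        per_cpu: List[int] = []
        for field in rest.split()[:ncpu]:
            if not field.isdigit():
                break  # reached the description columns
            per_cpu.append(int(field))
        counts[label.strip()] = per_cpu
    return counts


def busiest(before: Dict[str, List[int]], after: Dict[str, List[int]],
            cpu: int, top: int = 3) -> List[Tuple[str, int]]:
    """Return the interrupts whose count grew most on one CPU."""
    deltas = []
    for irq, new in after.items():
        if cpu >= len(new):
            continue  # rows such as ERR carry a single global count
        old = before.get(irq, [])
        prev = old[cpu] if cpu < len(old) else 0
        deltas.append((new[cpu] - prev, irq))
    deltas.sort(reverse=True)
    return [(irq, delta) for delta, irq in deltas[:top] if delta > 0]
```

Reading /proc/interrupts twice, a second or so apart, and passing both
parsed snapshots to busiest() with the stalled CPU's number should point at
the interrupt line responsible for most of the load.
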
>
> Thanks.
>
--
Regards,
Zhen Lei