Re: BUG: BISECTED: in squashfs_xz_uncompress() (Was: RCU stalls in squashfs_readahead())

From: Mirsad Goran Todorovac
Date: Sun Nov 20 2022 - 10:49:53 EST


On 18. 11. 2022. 16:55, Paul E. McKenney wrote:
On Fri, Nov 18, 2022 at 06:11:16AM +0000, Phillip Lougher wrote:
On 17/11/2022 23:05, Mirsad Goran Todorovac wrote:
Hi,

While trying to bisect, I've found another bug that predated the
introduction of squashfs_readahead(), but it has
a common denominator in squashfs_decompress() and squashfs_xz_uncompress().

Wrong, the stall is happening in the XZ decompressor library, which
is *not* in Squashfs.

This reported stall in the decompressor code is likely a symptom of you
deliberately thrashing your system. When the system thrashes everything
starts to happen very slowly, and the system will spend a lot of
its time doing page I/O, and the CPU will spend a lot of time in
any CPU intensive code like the XZ decompressor library.

So the fact the stall is being hit here is a symptom and not
a cause. The decompressor code is likely running slowly due to
thrashing and waiting on paged-out buffers. This is not indicative
of any bug, merely a system running slowly due to overload.

As I said, this is not a Squashfs issue, because the code when the
stall takes place isn't in Squashfs.

The people responsible for the rcu code should have a lot more insight
about what happens when the system is thrashing, and how this will
throw up false positives. In this I believe this is an instance of
perfectly correct code running slowly due to thrashing incorrectly
being flagged as looping.

CC'ing Paul E. McKenney <paulmck@xxxxxxxxxx>

Frankly, I need an advice on how to handle a situation like this.

Obviously, the best place to test the RCU problem with
squashfs_readahead() is where it is introduced + Phillip's patches
[012]/3.

Thanks,
Mirsad

P.S.

This is excerpt from dmesg log attached:

[   25.338169] rcu: INFO: rcu_preempt detected expedited stalls on
CPUs/tasks: { 4-... } 6 jiffies s: 373 root: 0x10/.

The "6 jiffies" leads me to believe that this kernel was built with
CONFIG_ANDROID=y, but also with the short expedited grace-period commit,
that is, with CONFIG_RCU_EXP_CPU_STALL_TIMEOUT set to 20 milliseconds
and with HZ set to something like 250. This means that the kernel
will complain if an RCU read-side critical section exceeeds about 20
milliseconds. This is a fine thing if you are actually running on an
Android, but a bit extreme for general usage.

If my guesses are correct, please apply this commit from mainline:

1045a06724f3 ("remove CONFIG_ANDROID")

With this commit in place, CONFIG_RCU_EXP_CPU_STALL_TIMEOUT will default
to 21000 (that is, 21 seconds) and things will be much better.

Thanx, Paul

Dear Paul,

Actually, I was able to reproduce and bisect the commit that started causing
described stalls (I have already explained how 5.10 didn't show any under the
exactly same saved widows and tabs set.)

# bad: [1e57930e9f4083ad5854ab6eadffe790a8167fb4] Merge tag 'rcu.2022.05.19a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
git bisect bad 1e57930e9f4083ad5854ab6eadffe790a8167fb4
# good: [f993aed406eaf968ba3867a76bb46c95336a33d0] Merge tag 'block-5.18-2022-05-18' of git://git.kernel.dk/linux-block
git bisect good f993aed406eaf968ba3867a76bb46c95336a33d0
# good: [bf2431021c8cfe52c866d7bf640ced03a35ebe91] Merge tag 'efi-next-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
git bisect good bf2431021c8cfe52c866d7bf640ced03a35ebe91
# good: [be05ee54378d451e1d60196921566d6087f3079f] Merge branches 'docs.2022.04.20a', 'fixes.2022.04.20a', 'nocb.2022.04.11b', 'rcu-tasks.2022.04.11b', 'srcu.2022.
05.03a', 'torture.2022.04.11b', 'torture-tasks.2022.04.20a' and 'torturescript.2022.04.20a' into HEAD
git bisect good be05ee54378d451e1d60196921566d6087f3079f
# good: [023033fe343cdf2ba83ab762f8de69241c7fc086] tools/nolibc/types: define PATH_MAX and MAXPATHLEN
git bisect good 023033fe343cdf2ba83ab762f8de69241c7fc086
# good: [96980b833a21c6dc29d0dfdc8f211fb8a10256a7] tools/nolibc/string: do not use __builtin_strlen() at -O0
git bisect good 96980b833a21c6dc29d0dfdc8f211fb8a10256a7
# good: [0e0ff638400be8f497a35b51a4751fd823f6bd6a] tools/nolibc/stdlib: Implement `malloc()`, `calloc()`, `realloc()` and `free()`
git bisect good 0e0ff638400be8f497a35b51a4751fd823f6bd6a
# good: [b2f02e9cdaade3ce565e6973ac313d2f814447f2] Merge tag 'lkmm.2022.05.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
git bisect good b2f02e9cdaade3ce565e6973ac313d2f814447f2
# good: [9621fbee44df940e2e1b94b0676460a538dffefa] rcu: Move expedited grace period (GP) work to RT kthread_worker
git bisect good 9621fbee44df940e2e1b94b0676460a538dffefa
# bad: [ce13389053a347aa9f8ffbfda2238352536e15c9] Merge branch 'exp.2022.05.11a' into HEAD
git bisect bad ce13389053a347aa9f8ffbfda2238352536e15c9
# first bad commit: [ce13389053a347aa9f8ffbfda2238352536e15c9] Merge branch 'exp.2022.05.11a' into HEAD

The bug looks like this:

[ 57.876820] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 0-... } 6 jiffies s: 545 root: 0x1/.
[ 57.876854] rcu: blocking rcu_node structures (internal RCU debug):
[ 57.876862] Task dump for CPU 0:
[ 57.876867] task:snap-store state:R running task stack: 0 pid: 2678 ppid: 2230 flags:0x00004008
[ 57.876883] Call Trace:
[ 57.876888] <TASK>
[ 57.876893] ? _raw_spin_lock_irqsave+0x93/0xf0
[ 57.876909] ? sysvec_apic_timer_interrupt+0x89/0x90
[ 57.876920] ? asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 57.876930] ? sysvec_apic_timer_interrupt+0x89/0x90
[ 57.876939] ? sysvec_apic_timer_interrupt+0x89/0x90
[ 57.876948] ? asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 57.876959] ? lzma_len+0x1fb/0x3b0
[ 57.876971] ? dict_repeat+0xb3/0x120
[ 57.876981] ? lzma_main+0x943/0x1260
[ 57.876995] ? lzma2_lzma+0x2a4/0x3b0
[ 57.877007] ? xz_dec_lzma2_run+0x11f/0xb90
[ 57.877018] ? crc32_le_base+0x186/0x250
[ 57.877032] ? xz_dec_run+0x346/0x11e0
[ 57.877046] ? squashfs_xz_uncompress+0x135/0x330
[ 57.877060] ? lzo_init+0xd0/0xd0
[ 57.877069] ? squashfs_decompress+0x80/0xb0
[ 57.877079] ? squashfs_read_data+0x1eb/0x920
[ 57.877090] ? __filemap_get_folio+0x37e/0x540
[ 57.877104] ? squashfs_bio_read.isra.2+0x2a0/0x2a0
[ 57.877118] ? squashfs_readpage_block+0x493/0x950
[ 57.877131] ? squashfs_readpage+0xa52/0x1140
[ 57.877143] ? xas_nomem+0x2f/0x100
[ 57.877154] ? squashfs_copy_cache+0x360/0x360
[ 57.877164] ? folio_wait_bit+0x3c0/0x3c0
[ 57.877175] ? __kasan_check_read+0x11/0x20
[ 57.877186] ? pagevec_add_and_need_flush+0x8c/0xa0
[ 57.877197] ? folio_add_lru+0x42/0x70
[ 57.877206] ? filemap_add_folio+0xd4/0x130
[ 57.877216] ? add_to_page_cache_locked+0xa0/0xa0
[ 57.877228] ? filemap_read_folio.isra.62+0xbe/0x4d0
[ 57.877238] ? __kasan_check_write+0x14/0x20
[ 57.877248] ? filemap_page_mkwrite+0x6a0/0x6a0
[ 57.877260] ? filemap_add_folio+0x130/0x130
[ 57.877269] ? try_charge_memcg+0x4cc/0xa90
[ 57.877281] ? filemap_fault+0xc9a/0xe50
[ 57.877293] ? generic_file_read_iter+0x230/0x230
[ 57.877305] ? __rcu_read_unlock+0x55/0x260
[ 57.877317] ? __do_fault+0x7c/0x1b0
[ 57.877329] ? __handle_mm_fault+0x173b/0x1970
[ 57.877340] ? copy_page_range+0x1890/0x1890
[ 57.877356] ? handle_mm_fault+0x11b/0x380
[ 57.877366] ? do_user_addr_fault+0x258/0x810
[ 57.877380] ? exc_page_fault+0x60/0xe0
[ 57.877389] ? asm_exc_page_fault+0x8/0x30
[ 57.877398] ? asm_exc_page_fault+0x1e/0x30
[ 57.877410] </TASK>

Please find complete dmesg, my lshw and bisect attached.

I am not competent to estimate how this commit causes this kind of problems,
and whether stalls up to 20 jiffies are worth eradicating.

Hope this helps.

Thank you,
Mirsad

--
Mirsad Goran Todorovac
Sistem inženjer
Grafički fakultet | Akademija likovnih umjetnosti
Sveučilište u Zagrebu
--
System engineer
Faculty of Graphic Arts | Academy of Fine Arts
University of Zagreb, Republic of Croatia
The European Union

Attachment: config-6.1.0-rc5.xz
Description: application/xz

Attachment: dmesg-5.18.0-rc1-31a-ce13389053a3.log.xz
Description: application/xz

Attachment: dmesg-5.18.0-rc1-31a-ce13389053a3-full.log.xz
Description: application/xz

Attachment: lshw.txt.xz
Description: application/xz

Attachment: squashfs-stall-bisect-full.log.xz
Description: application/xz