Re: [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

From: David Hildenbrand
Date: Wed Nov 23 2022 - 14:32:24 EST


On 23.11.22 19:56, Peter Xu wrote:
On Wed, Nov 23, 2022 at 10:21:30AM -0800, Mike Kravetz wrote:
On 11/23/22 10:09, Peter Xu wrote:
On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
Let me try understand the basic problem first:

hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
don't grab the page table locks. That's very hugetlb specific handling and I
assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
concurrent page fault s... but that's no news. hugetlb is weird in many ways
:)

So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
we use some very basic locking for that?

Yes we can in most cases. Please refer to above paragraph [1] where I
referred Mike's recent work on vma lock. That's the basic locking we need
so far to protect pmd unsharing. I'll attach the link too in the next
post, which is here:

https://lore.kernel.org/r/20220914221810.95771-1-mike.kravetz@xxxxxxxxxx


Using RCU / disabling local irqs seems a bit excessive because we *are*
holding the mmap lock and only care about concurrent unsharing

The series wanted to address where the vma lock is not easy to take. It
originates from when I was reading Mike's other patch, I forgot why I did
that but I just noticed there's some code path that we may not want to take
a sleepable lock, e.g. in follow page code.

Yes, it was the patch suggested by David,

https://lore.kernel.org/linux-mm/20221030225825.40872-1-mike.kravetz@xxxxxxxxxx/

The issue was that FOLL_NOWAIT could be passed into follow_page_mask. If so,
then we do not want potentially sleep on the mutex.

Since you both are on this thread, I thought of/noticed a related issue. In
follow_hugetlb_page, it looks like we can call hugetlb_fault if FOLL_NOWAIT
is set. hugetlb_fault certainly has the potential for sleeping. Is this also
a similar issue?

Yeah maybe the clean way to do this is when FAULT_FLAG_RETRY_NOWAIT is set
we should always try to not sleep at all.

hva_to_pfn_slow() that sets FOLL_NOWAIT calls get_user_pages_unlocked(), which will just do a straight mmap_read_lock().

The interpretation of FOLL_NOWAIT should not be "don't take any sleepable locks" but instead more like "don't wait for a page to get swapped in".

#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO


I did not read the full replies yet (sorry, busy hacking :) ) but *any* code path that already takes the mmap_read_lock() can just take whatever other lock we want -- IMHO. No need to over-complicate our code trying to avoid locks in that case.

--
Thanks,

David / dhildenb