Re: [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries

From: Peter Xu
Date: Thu Nov 17 2022 - 11:29:05 EST


On Wed, Nov 16, 2022 at 05:00:08PM -0800, James Houghton wrote:
> On Wed, Nov 16, 2022 at 2:18 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > On Fri, Oct 21, 2022 at 04:36:26PM +0000, James Houghton wrote:
> > > +struct hugetlb_pte {
> > > + pte_t *ptep;
> > > + unsigned int shift;
> > > + enum hugetlb_level level;
> > > + spinlock_t *ptl;
> > > +};
> >
> > Do we need both shift + level? Maybe it's only meaningful for ARM where
> > the shift may not be directly calculcated from level?
> >
> > I'm wondering whether we can just maintain "shift" then we calculate
> > "level" realtime. It just reads a bit weird to have these two fields, also
> > a burden to most of the call sites where shift and level exactly match..
>
> My main concern is interaction with folded levels. For example, if
> PUD_SIZE and PMD_SIZE are the same, we want to do something like this:
>
> pud = pud_offset(p4d, addr)
> pmd = pmd_offset(pud, addr) /* this is just pmd = (pmd_t *) pud */
> pte = pte_offset(pmd, addr)
>
> and I think we should avoid quietly skipping the folded level, which
> could happen:
>
> pud = pud_offset(p4d, addr)
> /* Each time, we go back to pte_t *, so if we stored PUD_SHIFT here,
> it is impossible to know that `pud` came from `pud_offset` and not
> `pmd_offset`. We must assume the deeper level so that we don't get
> stuck in a loop. */
> pte = pte_offset(pud, addr) /* pud is cast from (pud_t * -> pte_t * ->
> pmd_t *) */
>
> Quietly dropping p*d_offset for folded levels is safe; it's just a
> cast that we're doing anyway. If you think this is fine, then I can
> remove `level`. It might also be that this is a non-issue and that
> there will never be a folded level underneath a hugepage level.
>
> We could also change `ptep` to a union eventually (to clean up
> "hugetlb casts everything to pte_t *" messiness), and having an
> explicit `level` as a tag for the union would be nice help. In the
> same way: I like having `level` explicitly so that we know for sure
> where `ptep` came from.

Makes sense.

>
> I can try to reduce the burden at the callsite while keeping `level`:
> hpage_size_to_level() is really annoying to have everywhere.

Yeah this would be nice.

Per what I see most callers are having paired level/shift, so maybe we can
make hugetlb_hgm_iter_init() to only take one of them and calculate the
other. Then there can also be the __hugetlb_hgm_iter_init() which takes
both, only used in the few places where necessary to have explicit
level/shift. hugetlb_hgm_iter_init() calls __hugetlb_hgm_iter_init().

Thanks,

--
Peter Xu