Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
From: Huang, Kai
Date: Wed Nov 23 2022 - 20:04:26 EST
On Tue, 2022-11-22 at 16:21 -0800, Dave Hansen wrote:
> On 11/20/22 16:26, Kai Huang wrote:
> > TDX reports a list of "Convertible Memory Region" (CMR) to indicate all
> > memory regions that can possibly be used by the TDX module, but they are
> > not automatically usable to the TDX module. As a step of initializing
> > the TDX module, the kernel needs to choose a list of memory regions (out
> > from convertible memory regions) that the TDX module can use and pass
> > those regions to the TDX module. Once this is done, those "TDX-usable"
> > memory regions are fixed during module's lifetime. No more TDX-usable
> > memory can be added to the TDX module after that.
> >
> > The initial support of TDX guests will only allocate TDX guest memory
> > from the global page allocator. To keep things simple, this initial
> > implementation simply guarantees all pages in the page allocator are TDX
> > memory. To achieve this, use all system memory in the core-mm at the
> > time of initializing the TDX module as TDX memory, and at the meantime,
> > refuse to add any non-TDX-memory in the memory hotplug.
> >
> > Specifically, walk through all memory regions managed by memblock and
> > add them to a global list of "TDX-usable" memory regions, which is a
> > fixed list after the module initialization (or empty if initialization
> > fails). To reject non-TDX-memory in memory hotplug, add an additional
> > check in arch_add_memory() to check whether the new region is covered by
> > any region in the "TDX-usable" memory region list.
> >
> > Note this requires all memory regions in memblock are TDX convertible
> > memory when initializing the TDX module. This is true in practice if no
> > new memory has been hot-added before initializing the TDX module, since
> > in practice all boot-time present DIMM is TDX convertible memory. If
> > any new memory has been hot-added, then initializing the TDX module will
> > fail due to that memory region is not covered by CMR.
> >
> > This can be enhanced in the future, i.e. by allowing adding non-TDX
> > memory to a separate NUMA node. In this case, the "TDX-capable" nodes
> > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> > needs to guarantee memory pages for TDX guests are always allocated from
> > the "TDX-capable" nodes.
> >
> > Note TDX assumes convertible memory is always physically present during
> > machine's runtime. A non-buggy BIOS should never support hot-removal of
> > any convertible memory. This implementation doesn't handle ACPI memory
> > removal but depends on the BIOS to behave correctly.
>
> My eyes glazed over about halfway through that. Can you try to trim it
> down a bit, or at least try to summarize it better up front?
Will do.
>
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index dd333b46fafb..b36129183035 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
> > depends on X86_64
> > depends on KVM_INTEL
> > depends on X86_X2APIC
> > + select ARCH_KEEP_MEMBLOCK
> > help
> > Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> > host and certain physical attacks. This option enables necessary TDX
> > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > index d688228f3151..71169ecefabf 100644
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -111,9 +111,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
> > #ifdef CONFIG_INTEL_TDX_HOST
> > bool platform_tdx_enabled(void);
> > int tdx_enable(void);
> > +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn);
> > #else /* !CONFIG_INTEL_TDX_HOST */
> > static inline bool platform_tdx_enabled(void) { return false; }
> > static inline int tdx_enable(void) { return -ENODEV; }
> > +static inline bool tdx_cc_memory_compatible(unsigned long start_pfn,
> > + unsigned long end_pfn) { return true; }
> > #endif /* CONFIG_INTEL_TDX_HOST */
> >
> > #endif /* !__ASSEMBLY__ */
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index 3f040c6e5d13..900341333d7e 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -55,6 +55,7 @@
> > #include <asm/uv/uv.h>
> > #include <asm/setup.h>
> > #include <asm/ftrace.h>
> > +#include <asm/tdx.h>
> >
> > #include "mm_internal.h"
> >
> > @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
> > unsigned long start_pfn = start >> PAGE_SHIFT;
> > unsigned long nr_pages = size >> PAGE_SHIFT;
> >
> > + /*
> > + * For now if TDX is enabled, all pages in the page allocator
>
> s/For now//
Will do.
>
> > + * must be TDX memory, which is a fixed set of memory regions
> > + * that are passed to the TDX module. Reject the new region
> > + * if it is not TDX memory to guarantee above is true.
> > + */
> > + if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
> > + return -EINVAL;
>
> There's a real art to making a right-size comment. I don't think this
> needs to be any more than:
>
> /*
> * Not all memory is compatible with TDX. Reject
> * the addition of any incomatible memory.
> */
Thanks.
>
> If you want to write a treatise, do it in Documentation or at the
> tdx_cc_memory_compatible() definition.
>
> > init_memory_mapping(start, start + size, params->pgprot);
> >
> > return add_pages(nid, start_pfn, nr_pages, params);
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 43227af25e44..32af86e31c47 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -16,6 +16,11 @@
> > #include <linux/smp.h>
> > #include <linux/atomic.h>
> > #include <linux/align.h>
> > +#include <linux/list.h>
> > +#include <linux/slab.h>
> > +#include <linux/memblock.h>
> > +#include <linux/minmax.h>
> > +#include <linux/sizes.h>
> > #include <asm/msr-index.h>
> > #include <asm/msr.h>
> > #include <asm/apic.h>
> > @@ -34,6 +39,13 @@ enum tdx_module_status_t {
> > TDX_MODULE_SHUTDOWN,
> > };
> >
> > +struct tdx_memblock {
> > + struct list_head list;
> > + unsigned long start_pfn;
> > + unsigned long end_pfn;
> > + int nid;
> > +};
>
> Why does the nid matter?
It is used to find the node for the PAMT allocation for a given TDMR.
>
> > static u32 tdx_keyid_start __ro_after_init;
> > static u32 tdx_keyid_num __ro_after_init;
> >
> > @@ -46,6 +58,9 @@ static struct tdsysinfo_struct tdx_sysinfo;
> > static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
> > static int tdx_cmr_num;
> >
> > +/* All TDX-usable memory regions */
> > +static LIST_HEAD(tdx_memlist);
> > +
> > /*
> > * Detect TDX private KeyIDs to see whether TDX has been enabled by the
> > * BIOS. Both initializing the TDX module and running TDX guest require
> > @@ -329,6 +344,107 @@ static int tdx_get_sysinfo(void)
> > return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num);
> > }
> >
> > +/* Check whether the given pfn range is covered by any CMR or not. */
> > +static bool pfn_range_covered_by_cmr(unsigned long start_pfn,
> > + unsigned long end_pfn)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < tdx_cmr_num; i++) {
> > + struct cmr_info *cmr = &tdx_cmr_array[i];
> > + unsigned long cmr_start_pfn;
> > + unsigned long cmr_end_pfn;
> > +
> > + cmr_start_pfn = cmr->base >> PAGE_SHIFT;
> > + cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT;
> > +
> > + if (start_pfn >= cmr_start_pfn && end_pfn <= cmr_end_pfn)
> > + return true;
> > + }
>
> What if the pfn range overlaps two CMRs? It will never pass any
> individual overlap test and will return false.
We can only return true if the two CMRs are contiguous.
I cannot think out a reason that a reasonable BIOS could generate contiguous
CMRs. Perhaps one reason is two contiguous NUMA nodes? For this case, memblock
has made sure no memory region could cross NUMA nodes, so the start_pfn/end_pfn
here should always be within one node. Perhaps we can add a comment for this
case?
Anyway I am not sure whether it is worth to consider "contiguous CMRs" case.
>
> > + return false;
> > +}
> > +
> > +/*
> > + * Add a memory region on a given node as a TDX memory block. The caller
> > + * to make sure all memory regions are added in address ascending order
>
> s/to/must/
Thanks.
>
> > + * and don't overlap.
> > + */
> > +static int add_tdx_memblock(unsigned long start_pfn, unsigned long end_pfn,
> > + int nid)
> > +{
> > + struct tdx_memblock *tmb;
> > +
> > + tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
> > + if (!tmb)
> > + return -ENOMEM;
> > +
> > + INIT_LIST_HEAD(&tmb->list);
> > + tmb->start_pfn = start_pfn;
> > + tmb->end_pfn = end_pfn;
> > + tmb->nid = nid;
> > +
> > + list_add_tail(&tmb->list, &tdx_memlist);
> > + return 0;
> > +}
> > +
> > +static void free_tdx_memory(void)
>
> This is named a bit too generically. How about free_tdx_memlist() or
> something?
Will use free_tdx_memlist(). Do you want to also change build_tdx_memory() to
build_tdx_memlist()?
>
> > +{
> > + while (!list_empty(&tdx_memlist)) {
> > + struct tdx_memblock *tmb = list_first_entry(&tdx_memlist,
> > + struct tdx_memblock, list);
> > +
> > + list_del(&tmb->list);
> > + kfree(tmb);
> > + }
> > +}
> > +
> > +/*
> > + * Add all memblock memory regions to the @tdx_memlist as TDX memory.
> > + * Must be called when get_online_mems() is called by the caller.
> > + */
>
> Again, this explains the "what", but not the "why".
>
> /*
> * Ensure that all memblock memory regions are convertible to TDX
> * memory. Once this has been established, stash the memblock
> * ranges off in a secondary structure because $REASONS.
> */
>
> Which makes me wonder: Why do you even need a secondary structure here?
> What's wrong with the memblocks themselves?
One reason is the new region has already been added to memblock before calling
arch_add_memory(), so we cannot compare the new region against memblock.
The other reason is memblock is updated in memory hotplug so it really isn't a
set of *fixed* memory regions, which TDX requires. Having TDX's own tdx_memlist
can support such case: after module initialization, some memory can be hot-
removed and then hot-added again, because the hot-added range will be covered by
@tdx_memlist.
>
> > +static int build_tdx_memory(void)
> > +{
> > + unsigned long start_pfn, end_pfn;
> > + int i, nid, ret;
> > +
> > + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> > + /*
> > + * The first 1MB may not be reported as TDX convertible
> > + * memory. Manually exclude them as TDX memory.
>
> I don't like the "may not" here very much.
>
> > + * This is fine as the first 1MB is already reserved in
> > + * reserve_real_mode() and won't end up to ZONE_DMA as
> > + * free page anyway.
>
> ^ free pages
>
> > + */
>
> This way too wishy washy. The TDX module may or may not... Then, it
> doesn't matter since reserve_real_mode() does it anyway...
>
> Then it goes and adds code to skip it!
>
> > + start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
> > + if (start_pfn >= end_pfn)
> > + continue;
>
>
> Please just put a dang stake in the ground. If the other code deals
> with this, then explain *why* more is needed here.
How about adding below before the 'for_each_mem_pfn_range()' loop:
/*
* Some reserved pages in memblock (i.e. kernel init code/data) are
* freed to the page allocator directly. Use for_each_mem_pfn_range()
* instead of for_each_free_mem_range() to make sure all pages in the
* page allocator are covered as TDX memory.
*/
It explains why to use for_each_mem_pfn_range().
And here before skipping first 1MB, we add below:
/*
* The first 1MB is not reported as TDX covertible memory.
* Although the first 1MB is always reserved and won't end up
* to the page allocator, it is still in memblock's memory
* regions. Skip them manually to exclude them as TDX memory.
*/
>
> > + /* Verify memory is truly TDX convertible memory */
> > + if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) {
> > + pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memorry.\n",
> > + start_pfn << PAGE_SHIFT,
> > + end_pfn << PAGE_SHIFT);
> > + return -EINVAL;
>
> ... no 'goto err'? This leaks all the previous add_tdx_memblock()
> structures, right?
Right. It's a leftover from the old code. Will fix.
>
> > + }
> > +
> > + /*
> > + * Add the memory regions as TDX memory. The regions in
> > + * memblock has already guaranteed they are in address
> > + * ascending order and don't overlap.
> > + */
> > + ret = add_tdx_memblock(start_pfn, end_pfn, nid);
> > + if (ret)
> > + goto err;
> > + }
> > +
> > + return 0;
> > +err:
> > + free_tdx_memory();
> > + return ret;
> > +}
> > +
> > /*
> > * Detect and initialize the TDX module.
> > *
> > @@ -357,12 +473,56 @@ static int init_tdx_module(void)
> > if (ret)
> > goto out;
> >
> > + /*
> > + * All memory regions that can be used by the TDX module must be
> > + * passed to the TDX module during the module initialization.
> > + * Once this is done, all "TDX-usable" memory regions are fixed
> > + * during module's runtime.
> > + *
> > + * The initial support of TDX guests only allocates memory from
> > + * the global page allocator. To keep things simple, for now
> > + * just make sure all pages in the page allocator are TDX memory.
> > + *
> > + * To achieve this, use all system memory in the core-mm at the
> > + * time of initializing the TDX module as TDX memory, and at the
> > + * meantime, reject any new memory in memory hot-add.
> > + *
> > + * This works as in practice, all boot-time present DIMM is TDX
> > + * convertible memory. However if any new memory is hot-added
> > + * before initializing the TDX module, the initialization will
> > + * fail due to that memory is not covered by CMR.
> > + *
> > + * This can be enhanced in the future, i.e. by allowing adding or
> > + * onlining non-TDX memory to a separate node, in which case the
> > + * "TDX-capable" nodes and the "non-TDX-capable" nodes can exist
> > + * together -- the userspace/kernel just needs to make sure pages
> > + * for TDX guests must come from those "TDX-capable" nodes.
> > + *
> > + * Build the list of TDX memory regions as mentioned above so
> > + * they can be passed to the TDX module later.
> > + */
>
> This is msotly Documentation/, not a code comment. Please clean it up.
Will try to clean up.
>
> > + get_online_mems();
> > +
> > + ret = build_tdx_memory();
> > + if (ret)
> > + goto out;
> > /*
> > * Return -EINVAL until all steps of TDX module initialization
> > * process are done.
> > */
> > ret = -EINVAL;
> > out:
> > + /*
> > + * Memory hotplug checks the hot-added memory region against the
> > + * @tdx_memlist to see if the region is TDX memory.
> > + *
> > + * Do put_online_mems() here to make sure any modification to
> > + * @tdx_memlist is done while holding the memory hotplug read
> > + * lock, so that the memory hotplug path can just check the
> > + * @tdx_memlist w/o holding the @tdx_module_lock which may cause
> > + * deadlock.
> > + */
>
> I'm honestly not following any of that.
How about:
/*
* Make sure tdx_cc_memory_compatible() either sees a fixed set of
* memory regions in @tdx_memlist, or an empty list.
*/
>
> > + put_online_mems();
> > return ret;
> > }
> >
> > @@ -485,3 +645,26 @@ int tdx_enable(void)
> > return ret;
> > }
> > EXPORT_SYMBOL_GPL(tdx_enable);
> > +
> > +/*
> > + * Check whether the given range is TDX memory. Must be called between
> > + * mem_hotplug_begin()/mem_hotplug_done().
> > + */
> > +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
> > +{
> > + struct tdx_memblock *tmb;
> > +
> > + /* Empty list means TDX isn't enabled successfully */
> > + if (list_empty(&tdx_memlist))
> > + return true;
> > +
> > + list_for_each_entry(tmb, &tdx_memlist, list) {
> > + /*
> > + * The new range is TDX memory if it is fully covered
> > + * by any TDX memory block.
> > + */
> > + if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> > + return true;
>
> Same bug. What if the start/end_pfn range is covered by more than one
> tdx_memblock?
We may want to return true if tdx_memblocks are contiguous.
However I don't think this will happen?
tdx_memblock is from memblock, and when two memory regions in memblock are
contiguous, they must have different node, or flags.
My understanding is the hot-added memory region here cannot across NUMA nodes,
nor have different flags, correct?
>
> > + }
> > + return false;
> > +}
>