Re: [PATCH -next] iommu/dma: avoid expensive indirect calls for sync operations

From: Michal Kubiak
Date: Tue Nov 22 2022 - 14:18:46 EST


On Sat, Nov 12, 2022 at 04:04:52AM +0000, Eric Dumazet wrote:
> Quite often, NIC devices do not need dma_sync operations
> on x86_64 at least.
>
> Indeed, when dev_is_dma_coherent(dev) is true and
> dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu()
> and friends do nothing.
>
> However, indirectly calling them when CONFIG_RETPOLINE=y
> consumes about 10% of cycles on a cpu receiving packets
> from softirq at ~100Gbit rate, as shown in [1]
>
> Even if/when CONFIG_RETPOLINE is not set, there
> is a cost of about 3%.
>
> This patch adds a copy of iommu_dma_ops structure,
> where sync_single_for_cpu, sync_single_for_device,
> sync_sg_for_cpu and sync_sg_for_device are unset.


Larysa from our team has found out this patch introduces also a
functional improvement for batch allocation in AF_XDP while iommmu is
turned on.
In 'xp_alloc_batch()' function there is a check if DMA needs a
synchronization. If so, batch allocation is not supported and we can
allocate only one buffer at a time.

The flag 'dma_need_sync' is being set according to the value returned by
the function 'dma_need_sync()' (from '/kernel/dma/mapping.c').
That function only checks if at least one of two DMA ops is defined:
'ops->sync_single_for_cpu' or 'ops->sync_single_for_device'.

> +static const struct dma_map_ops iommu_nosync_dma_ops = {
> + iommu_dma_ops_common_fields
> +
> + .sync_single_for_cpu = NULL,
> + .sync_single_for_device = NULL,
> + .sync_sg_for_cpu = NULL,
> + .sync_sg_for_device = NULL,
> +};
> +#undef iommu_dma_ops_common_fields
> +
> /*
> * The IOMMU core code allocates the default DMA domain, which the underlying
> * IOMMU driver needs to support via the dma-iommu layer.
> @@ -1586,7 +1612,8 @@ void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit)
> if (iommu_is_dma_domain(domain)) {
> if (iommu_dma_init_domain(domain, dma_base, dma_limit, dev))
> goto out_err;
> - dev->dma_ops = &iommu_dma_ops;
> + dev->dma_ops = dev_is_dma_sync_needed(dev) ?
> + &iommu_dma_ops : &iommu_nosync_dma_ops;
> }
>
> return;

This code removes defining 'sync_*' DMA ops if they are not actually
used. Thanks to that improvement the function 'dma_need_sync()' will
always return more meaningful information if any DMA synchronization is
actually needed for iommu.

Together with Larysa we have applied that patch and we can confirm it
helps for batch buffer allocation in AF_XDP ('xsk_buff_alloc_batch()'
call) when iommu is enabled.