Re: [syzbot] KASAN: use-after-free Read in task_work_run (2)

From: Peter Zijlstra
Date: Wed Nov 23 2022 - 11:27:45 EST


On Wed, Nov 23, 2022 at 03:55:31PM +0100, Marco Elver wrote:

> > > --- x/kernel/events/core.c
> > > +++ c/kernel/events/core.c
> > > @@ -2291,6 +2291,7 @@ event_sched_out(struct perf_event *event
> > > !event->pending_work) {
> > > event->pending_work = 1;
> > > dec = false;
> > > + atomic_long_inc(&event->refcount);
> > > task_work_add(current, &event->pending_task, TWA_RESUME);
> > > }
> > > if (dec)
> > > @@ -6561,6 +6562,8 @@ static void perf_pending_task(struct cal
> > > struct perf_event *event = container_of(head, struct perf_event, pending_task);
> > > int rctx;
> > >
> > > + if (event->state == PERF_EVENT_STATE_DEAD)
> > > + goto out;
> > > /*
> > > * If we 'fail' here, that's OK, it means recursion is already disabled
> > > * and we won't recurse 'further'.
> > > @@ -6577,6 +6580,8 @@ static void perf_pending_task(struct cal
> > > if (rctx >= 0)
> > > perf_swevent_put_recursion_context(rctx);
> > > preempt_enable_notrace();
> > > +out:
> > > + put_event(event);
> > > }
> > >
> > > #ifdef CONFIG_GUEST_PERF_EVENTS

This is broken and will corrupt ctx->nr_pending.


> My guess is that the __fput task work is in the same task as the perf
> task work, and so if we tried to cancel the task work from within
> __fput, it won't actually cancel it if task_work_run() already exchanged
> the 'task_works' list.

That seems very likely indeed.

> So it looks like prolonging the perf events lifetime is the only option
> right now?

Depends a bit on how complicated all this is; at the very least
perf_event_release_kernel() will schedule out the event if it still
running. It does this before switching the state to DEAD (it has to)
which means it can raise perf_pending_task() at this point in time, even
though we're tearing down the event.

This can be avoided by a patch like this...

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9ab0eb073bd5..e9ad1ff7a9f8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2287,6 +2287,7 @@ group_sched_out(struct perf_event *group_event, struct perf_event_context *ctx)

#define DETACH_GROUP 0x01UL
#define DETACH_CHILD 0x02UL
+#define DETACH_DEAD 0x04UL

/*
* Cross CPU call to remove a performance event
@@ -2308,12 +2309,20 @@ __perf_remove_from_context(struct perf_event *event,
update_cgrp_time_from_cpuctx(cpuctx, false);
}

+ /*
+ * Ensure event_sched_out() switches to OFF, at the very least
+ * this avoids raising perf_pending_task() at this time.
+ */
+ if (flags & DETACH_DEAD)
+ event->pending_disable = 1;
event_sched_out(event, ctx);
if (flags & DETACH_GROUP)
perf_group_detach(event);
if (flags & DETACH_CHILD)
perf_child_detach(event);
list_del_event(event, ctx);
+ if (flags & DETACH_DEAD)
+ event->state = PERF_EVENT_STATE_DEAD;

if (!pmu_ctx->nr_events) {
pmu_ctx->rotate_necessary = 0;
@@ -5299,9 +5308,7 @@ int perf_event_release_kernel(struct perf_event *event)

ctx = perf_event_ctx_lock(event);
WARN_ON_ONCE(ctx->parent_ctx);
- perf_remove_from_context(event, DETACH_GROUP);

- raw_spin_lock_irq(&ctx->lock);
/*
* Mark this event as STATE_DEAD, there is no external reference to it
* anymore.
@@ -5313,8 +5320,7 @@ int perf_event_release_kernel(struct perf_event *event)
* Thus this guarantees that we will in fact observe and kill _ALL_
* child events.
*/
- event->state = PERF_EVENT_STATE_DEAD;
- raw_spin_unlock_irq(&ctx->lock);
+ perf_remove_from_context(event, DETACH_GROUP|DETACH_DEAD);

perf_event_ctx_unlock(event, ctx);

---

However; I don't think that actually helps, because in this case the new
task_work would actually still be on the ->task_works list and
task_work_cancel() should've worked.

The other possibility seems to be that the sample happens and we
schedule before close() can terminate the event, which means we've
already got perf_pending_task() queued by the time we get to
perf_remove_from_context().

This means the perf_pending_task() queue happened before the fput()
queue, and it is thus ran later (due to FILO ordering -- also see commit
c82199061009 ("task_work: remove fifo ordering guarantee")).

And I can't really see a way out of that other than doing refcount games
indeed.

There is the straight forward way, similar to what Hillf attempted, and
a really nasty one that avoids the atomics in the common case and is
really only targeted at this case -- given the overhead of signals I'm
thinking simple is better.

---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9ab0eb073bd5..0228ea090b98 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2248,6 +2248,7 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
!event->pending_work) {
event->pending_work = 1;
dec = false;
+ WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
task_work_add(current, &event->pending_task, TWA_RESUME);
}
if (dec)
@@ -6755,6 +6762,8 @@ static void perf_pending_task(struct callback_head *head)
if (rctx >= 0)
perf_swevent_put_recursion_context(rctx);
preempt_enable_notrace();
+
+ put_event(event);
}

#ifdef CONFIG_GUEST_PERF_EVENTS


So perhaps both the above..

Does that actually work?