Re: [PATCH] error-injection: Add prompt for function error injection
From: Chris Mason
Date: Tue Nov 22 2022 - 14:51:38 EST

On 11/22/22 1:29 PM, Steven Rostedt wrote:
> On Tue, 22 Nov 2022 12:42:33 -0500
> Chris Mason <clm@xxxxxxxx> wrote:
>
>> On 11/22/22 5:39 AM, Borislav Petkov wrote:
>>> On Mon, Nov 21, 2022 at 03:36:08PM -0800, Alexei Starovoitov wrote:
>>>> The commit log is bogus and the lack of understanding what
>>>
>>> You mean that:
>>>
>>> Documentation/fault-injection/fault-injection.rst
>>>
>>> ?
>>>
>>> I don't want any of that possible in production setups. And until you
>>> give me a sane argument why it is good to have in production setups
>>> generically, this is end of story.
>>>
>>
>> I think there are a few different sides to this:
>>
>> - it makes total sense that we all have wildly different ideas about
>> which tools should be available in prod. Making this decision more fine
>> grained seems reasonable.
>>
>> - fault injection for testing: we have a stage of qualification that
>> does error injection against the prod kernel. It helps to have this
>> against the debug kernel too, but that misses some races etc. I always
>> just assumed distros and partners did some fault injection tests against
>> the prod kernel builds?
>>
>> - fault injection for debugging: it doesn't happen often but at some
>> point we run out of ideas and start making different functions fail in
>> prod to figure out why we're not prodding.
>
> As you have stated, we have different ideas for production. Your POV is
> cloud based (as is with other parts of my company). But my POV is
> Chromebooks where production means what's on a user's device. There's no
> reason to ever have fault injection enabled in such cases. I would assume
> that distributions are the same. But having kprobes for visibility can also
> be useful for debugging purposes, even in the field.
>

Yeah, I definitely don't have opinions on the right way to build a
Chromebook and, replying to Boris, I'm only slightly better on distros.
Josef's original intent was that this be easy to turn off.

>>
>> - overriding return values for security fixes: also not a common thing,
>> but it's a tool we've used. There are usually better long term fixes,
>> but it happens.
>>
>> Stepping back to the big picture of debugging systems with bpf in use, I
>> love hearing (and telling) stories of debugging difficult problems. As
>> far as I know, BPF telling lies hasn't really been a problem for us, so
>> even though it's a huge tangent, if you have specific examples of
>> problems you've seen, I'm really interested in hearing more.
>>
>> When I talk about production, both overall stability and validating new
>> kernels, if I compare the BPF subsystem with MM, filesystems, cgroups,
>> the scheduler, networking, and all things Jens, the systems BPF
>> developers put in place are working really well for me.
>>
>> If I expand the discussion to the BPF programs themselves, there have
>> been rare issues. Still completely on par with the rest of the kernel
>> subsystems and within the noise in comparison with hardware failures.
>>
>> In other words, I really do care about the concerns you're expressing
>> here, and I'm usually first in line to complain when random people make
>> my job harder. I'm just not seeing these issues with BPF, and I see
>> them actively trying to increase safety over time.
>
> I'm sure you are not seeing these issues with BPF, as the main developers
> and you have the same focus areas.
>
> I have no problem with the concept of BPF. My concern is mostly the
> development side of it. As you can basically attach functionality to
> arbitrary points in the kernel via BPF programs, the perception is that
> anything that is available is fair game. BPF tends to expand features
> beyond their intended usage. Heck, look at the name itself. "extended
> Berkeley Packet Filter", where eBPF has nothing to do with packet filtering
> anymore. Perhaps it should be renamed to CUST (Compiled User Space
> Trampoline) ;-)

Developers in general tend to stretch interfaces a lot. At some point
the friction of using the interface is worse than the friction of
changing it, and things get redone. At the end of the day, BPF
developers are still kernel developers and we end up with relatively
sane feedback loops.

>
> Alexei said it's "sad" about my expression of BPF and error injection. If
> it has to do with security, then I would like to see more collaboration
> with the security folks and perhaps have BPF integrate with their
> infrastructure.

Now is a great time to grab KP and hear all about BPF LSM.

> But the usual response is "that's not fast enough for me"
> and then something is done from scratch without working with that
> subsystem to make it fast enough. Yes, it takes more time to collaborate
> than just doing it on your own. But that's the nature of an open source
> *community*.

One of the awkward and wonderful parts of our community is that none of
us have the same goals or needs. Going back to the original thread, ARM
has either one or two different live patching subsystems in use in the
industry, and neither is upstream.

One reason you end up having these arguments often with BPF is because
they stick around and work with the community to upstream their work.
The tradeoffs, compromises and decisions aren't always what you want,
but we all show up every day and keep engaging.

-chris