Re: [PATCH 3/4] cachestat: implement cachestat syscall
From: Johannes Weiner
Date: Mon Nov 21 2022 - 10:55:27 EST
On Mon, Nov 21, 2022 at 09:45:49AM -0500, Brian Foster wrote:
> On Tue, Nov 15, 2022 at 10:29:00AM -0800, Nhat Pham wrote:
> > Implement a new syscall that queries cache state of a file and
> > summarizes the number of cached pages, number of dirty pages, number of
> > pages marked for writeback, number of (recently) evicted pages, etc. in
> > a given range.
> >
> > NAME
> > cachestat - query the page cache status of a file.
> >
> > SYNOPSIS
> > #include <sys/mman.h>
> >
> > struct cachestat {
> > unsigned long nr_cache;
> > unsigned long nr_dirty;
> > unsigned long nr_writeback;
> > unsigned long nr_evicted;
> > unsigned long nr_recently_evicted;
> > };
> >
> > int cachestat(unsigned int fd, off_t off, size_t len,
> > struct cachestat *cstat);
> >
>
> Do you have a strong use case for a user specified range vs. just
> checking the entire file? If not, have you considered whether it might
> be worth expanding statx() to include this data? That call is already
> designed to include "extended" file status and avoids the need for a new
> syscall. For example, the fields could be added individually with
> multiple flags, or the entire struct tied to a new STATX_CACHE flag or
> some such.
Whole-file stats are only useful when the data is laid out as separate
files in a directory tree. They don't help with files that have internal
structure. For example, you may want to understand (and subsequently
advise/influence) the readahead and dirty flushing in certain sections
of a larger database file.
Fadvise/madvise/sync_file_range etc. give the user the ability to
influence cache behavior in sub-ranges, so it makes sense to also
allow querying at that granularity.
> > DESCRIPTION
> > cachestat() queries the number of cached pages, number of dirty
> > pages, number of pages marked for writeback, number of (recently)
> > evicted pages, in the byte range given by `off` and `len`.
> >
> > These values are returned in a cachestat struct, whose address is
> > given by the `cstat` argument.
> >
> > The `off` argument must be a non-negative integer. If `off` + `len`
> > >= `off`, the queried range is [`off`, `off` + `len`]. Otherwise, we
> > will query in the range from `off` to the end of the file.
> >
>
> (off + len < off) is an error condition on some (most?) other syscalls.
> At least some calls (e.g. fadvise(), sync_file_range()) use len == 0 to
> explicitly specify "to EOF."
Good point, it would make sense to stick to that precedent.
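
To make that concrete, the range resolution could look something like
the following with the fadvise()-style convention. This is only a
sketch: resolve_range() is a made-up name, not something in the patch,
and it takes unsigned 64-bit values (rather than off_t/size_t) so that
the wraparound check is well-defined C.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper, not from the patch: turn (off, len) into a
 * half-open byte range [*start, *end).
 *
 *   len == 0         -> from off to end of file
 *   off + len wraps  -> -EINVAL
 */
static int resolve_range(uint64_t off, uint64_t len, uint64_t file_size,
			 uint64_t *start, uint64_t *end)
{
	if (len == 0) {
		*start = off;
		/* An offset at or past EOF yields an empty range. */
		*end = file_size > off ? file_size : off;
		return 0;
	}

	/* Unsigned addition wraps on overflow, so this detects it. */
	if (off + len < off)
		return -EINVAL;

	*start = off;
	*end = off + len;
	return 0;
}
```

With that, (off=4096, len=0) on a 16384-byte file resolves to
[4096, 16384), and an overflowing off + len is rejected instead of
silently meaning "to EOF".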