Introduction to CAP_BPF

A primer to what the kernel enforces on eBPF program loading

Let's dive in

Note: The kernel changes all the time, especially when it comes to eBPF. This article is based on v5.18 of the Linux kernel.

If you look for information on CAP_BPF in the current man pages, you won't find much:

CAP_BPF (since Linux 5.8)
       Employ privileged BPF operations; see bpf(2) and
       bpf-helpers(7).

       This capability was added in Linux 5.8 to separate out BPF
       functionality from the overloaded CAP_SYS_ADMIN
       capability.

This is probably because it was added recently (2020) and because the rules for when it must be paired with other capabilities are complex. The patch that introduced it gives a better introduction: Introduce CAP_BPF

The user process has to have
- CAP_BPF to create maps, do other sys_bpf() commands and load SK_REUSEPORT progs.
  Note: dev_map, sock_hash, sock_map map types still require CAP_NET_ADMIN.
  That could be relaxed in the future.
- CAP_BPF and CAP_PERFMON to load tracing programs.
- CAP_BPF and CAP_NET_ADMIN to load networking programs.
(or CAP_SYS_ADMIN for backward compatibility).

CAP_BPF solves three main goals:
1. provides isolation to user space processes that drop CAP_SYS_ADMIN and switch to CAP_BPF.
   More on this below. This is the major difference vs v4 set back from Sep 2019.
2. makes networking BPF progs more secure, since CAP_BPF + CAP_NET_ADMIN
   prevents pointer leaks and arbitrary kernel memory access.
3. enables fuzzers to exercise all of the verifier logic. Eventually finding bugs
   and making BPF infra more secure. Currently fuzzers run in unpriv.
   They will be able to run with CAP_BPF.

There is also useful information tucked away in the kernel headers:

/*
 * CAP_BPF allows the following BPF operations:
 * - Creating all types of BPF maps
 * - Advanced verifier features
 *   - Indirect variable access
 *   - Bounded loops
 *   - BPF to BPF function calls
 *   - Scalar precision tracking
 *   - Larger complexity limits
 *   - Dead code elimination
 *   - And potentially other features
 * - Loading BPF Type Format (BTF) data
 * - Retrieve xlated and JITed code of BPF programs
 * - Use bpf_spin_lock() helper
 *
 * CAP_PERFMON relaxes the verifier checks further:
 * - BPF progs can use of pointer-to-integer conversions
 * - speculation attack hardening measures are bypassed
 * - bpf_probe_read to read arbitrary kernel memory is allowed
 * - bpf_trace_printk to print kernel memory is allowed
 *
 * CAP_SYS_ADMIN is required to use bpf_probe_write_user.
 *
 * CAP_SYS_ADMIN is required to iterate system wide loaded
 * programs, maps, links, BTFs and convert their IDs to file descriptors.
 *
 * CAP_PERFMON and CAP_BPF are required to load tracing programs.
 * CAP_NET_ADMIN and CAP_BPF are required to load networking programs.
 */

Small intro to capabilities

A process's capability set isn't a singular, constant thing. A process actually contains multiple capability sets (currently 5 as of v4.3) and some may change based on file-set bits, execve, user changes and whether the process is capability-aware. You can see a snapshot of these values (written in hex) through /proc/{pid}/status:

CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000

To be even more accurate (and add complexity), capability sets belong to threads rather than to the process as a whole, and you can see each thread's sets through procfs at /proc/{pid}/task/{tid}/status. This means that in a multithreaded process, each thread can have different capability sets.

The most relevant for this article is a thread's effective capability set but you can read more about the other set types in this blog post. The effective set is what the kernel actually checks against.
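To make those hex masks less opaque, here is a minimal sketch that parses a single line from the status file and tests individual capability bits. The capability numbers are the ones defined in include/uapi/linux/capability.h; the function names parse_cap_line and cap_in_mask are made up for this example and are not part of any library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Capability numbers from include/uapi/linux/capability.h. */
#define CAP_NET_ADMIN 12
#define CAP_SYS_ADMIN 21
#define CAP_PERFMON   38
#define CAP_BPF       39

/*
 * Parse one line of /proc/<pid>/status such as "CapEff:\t000001ffffffffff".
 * Returns true and stores the mask when the line starts with `key` and is
 * followed by a hex number. (Hypothetical helper for illustration.)
 */
bool parse_cap_line(const char *line, const char *key, uint64_t *mask)
{
	size_t klen = strlen(key);
	char *end;
	unsigned long long value;

	if (strncmp(line, key, klen) != 0)
		return false;

	/* strtoull skips the tab that separates the key from the value. */
	value = strtoull(line + klen, &end, 16);
	if (end == line + klen)
		return false; /* no hex digits found */

	*mask = value;
	return true;
}

/* Test whether capability number `cap` is set in a 64-bit mask. */
bool cap_in_mask(uint64_t mask, int cap)
{
	assert(cap >= 0 && cap < 64);
	return (mask >> cap) & 1;
}
```

Feeding it the CapEff line from the snapshot above yields a mask with bits 0 through 40 set, so cap_in_mask(mask, CAP_BPF) is true. In practice you would read the lines from /proc/self/status with fgets and pass each one to parse_cap_line.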

Pro tip: You can see a live stream of the capabilities your kernel is checking with capable-bpfcc.

How CAP_BPF works

The general design rule here is that if a process is able to obtain a file descriptor to an eBPF object, then it's assumed that future operations on that object, through the descriptor, are allowed. Successfully loading an eBPF program returns a file descriptor for it, so load time is when many of the checks occur.

CAP_BPF will let a process load its own eBPF programs and maps, but for specific program types you might need to pair it with another capability: CAP_PERFMON for loading tracing programs and CAP_NET_ADMIN for loading networking programs.

Some network programs are allowed to be loaded by unprivileged users (as long as the kernel.unprivileged_bpf_disabled sysctl permits unprivileged BPF); as an example, you can load but not attach a BPF_PROG_TYPE_CGROUP_SKB program without CAP_NET_ADMIN.

If you want to see & manage the entire world of eBPF objects on your host, which is a requirement for a tool such as bpftool, you'll need CAP_SYS_ADMIN.

Aside: The fd design implies that a process that obtains an fd to an eBPF object through other means (e.g. Unix sockets or bpffs) can also "manage" that object. More recently, this fd-based design has been taken further with the removal of some early CAP_BPF checks for bpf subcommands that don't create objects.

Let's walk through the kernel code where the program loading occurs.

v5.18

static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
{
	/* ... */
	if (attr->insn_cnt == 0 ||
	    attr->insn_cnt > (bpf_capable() ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS))
		return -E2BIG;
	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
	    type != BPF_PROG_TYPE_CGROUP_SKB &&
	    !bpf_capable())
		return -EPERM;

	if (is_net_admin_prog_type(type) && !capable(CAP_NET_ADMIN) && !capable(CAP_SYS_ADMIN))
		return -EPERM;
	if (is_perfmon_prog_type(type) && !perfmon_capable())
		return -EPERM;
	/* ... */
}

This snippet actually gives us a lot of insight! bpf_capable() is just a wrapper that verifies CAP_BPF is enabled (CAP_SYS_ADMIN is also accepted for backwards compatibility):

static inline bool bpf_capable(void)
{
	return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
}

What can we glean from this? A program needs CAP_BPF to load more than 4096 instructions (BPF_MAXINSNS). Every program type except BPF_PROG_TYPE_SOCKET_FILTER and BPF_PROG_TYPE_CGROUP_SKB needs CAP_BPF for loading. And the kernel checks for additional capabilities (described below) for specific groupings of program types. Sometimes kernel code is approachable!
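To play with these rules outside the kernel, here is a user-space model of the checks in the snippet above. It is a sketch, not kernel code: struct load_request, eff_has and model_prog_load_check are hypothetical names, the effective set is passed in as a plain 64-bit mask, and the model ignores everything bpf_prog_load does outside the quoted lines (for example the unprivileged_bpf_disabled sysctl and LSM hooks). The two instruction limits are the values I believe the v5.18 headers define.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Capability numbers from include/uapi/linux/capability.h. */
#define CAP_NET_ADMIN 12
#define CAP_SYS_ADMIN 21
#define CAP_PERFMON   38
#define CAP_BPF       39

/* Instruction limits (assumed to match the v5.18 headers). */
#define BPF_MAXINSNS               4096
#define BPF_COMPLEXITY_LIMIT_INSNS 1000000

/* Hypothetical, simplified description of a load request. */
struct load_request {
	uint32_t insn_cnt;
	bool always_unpriv;  /* SOCKET_FILTER or CGROUP_SKB */
	bool net_admin_type; /* is_net_admin_prog_type() would be true */
	bool perfmon_type;   /* is_perfmon_prog_type() would be true */
};

/* Test one capability bit in an effective-set mask. */
bool eff_has(uint64_t eff, int cap)
{
	assert(cap >= 0 && cap < 64);
	return (eff >> cap) & 1;
}

/* Mirrors the order of the checks quoted from bpf_prog_load(). */
int model_prog_load_check(const struct load_request *req, uint64_t eff)
{
	bool sys_admin = eff_has(eff, CAP_SYS_ADMIN);
	bool bpf_ok = eff_has(eff, CAP_BPF) || sys_admin;
	bool net_ok = eff_has(eff, CAP_NET_ADMIN) || sys_admin;
	bool perfmon_ok = eff_has(eff, CAP_PERFMON) || sys_admin;
	uint32_t limit = bpf_ok ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS;

	if (req->insn_cnt == 0 || req->insn_cnt > limit)
		return -E2BIG;
	if (!req->always_unpriv && !bpf_ok)
		return -EPERM;
	if (req->net_admin_type && !net_ok)
		return -EPERM;
	if (req->perfmon_type && !perfmon_ok)
		return -EPERM;
	return 0;
}
```

For instance, a request with net_admin_type set is rejected with -EPERM when the mask holds only CAP_BPF and accepted once CAP_NET_ADMIN is added, while a 5000-instruction socket filter fails with -E2BIG until the loader has CAP_BPF.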

With CAP_NET_ADMIN

Here is the list of program types that require CAP_NET_ADMIN:

v5.18

static bool is_net_admin_prog_type(enum bpf_prog_type prog_type)
{
	switch (prog_type) {
	case BPF_PROG_TYPE_SCHED_CLS:
	case BPF_PROG_TYPE_SCHED_ACT:
	case BPF_PROG_TYPE_XDP:
	case BPF_PROG_TYPE_LWT_IN:
	case BPF_PROG_TYPE_LWT_OUT:
	case BPF_PROG_TYPE_LWT_XMIT:
	case BPF_PROG_TYPE_LWT_SEG6LOCAL:
	case BPF_PROG_TYPE_SK_SKB:
	case BPF_PROG_TYPE_SK_MSG:
	case BPF_PROG_TYPE_LIRC_MODE2:
	case BPF_PROG_TYPE_FLOW_DISSECTOR:
	case BPF_PROG_TYPE_CGROUP_DEVICE:
	case BPF_PROG_TYPE_CGROUP_SOCK:
	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
	case BPF_PROG_TYPE_CGROUP_SYSCTL:
	case BPF_PROG_TYPE_SOCK_OPS:
	case BPF_PROG_TYPE_EXT: /* extends any prog */
		return true;
	case BPF_PROG_TYPE_CGROUP_SKB:
		/* always unpriv */
	case BPF_PROG_TYPE_SK_REUSEPORT:
		/* equivalent to SOCKET_FILTER. need CAP_BPF only */
	default:
		return false;
	}
}

You can see the comments there special-casing BPF_PROG_TYPE_CGROUP_SKB and BPF_PROG_TYPE_SK_REUSEPORT as programs not requiring CAP_NET_ADMIN.

CAP_NET_ADMIN is similar to CAP_SYS_ADMIN in the sense that it allows a huge list of operations and is overloaded, but its domain is networking. Tasks that this capability enables include configuring interfaces, modifying routing tables, firewall administration, etc. So it makes sense that loading a networking eBPF program is gated behind this capability.

With CAP_PERFMON

Here is the list of program types that require CAP_PERFMON:

v5.18

static bool is_perfmon_prog_type(enum bpf_prog_type prog_type)
{
	switch (prog_type) {
	case BPF_PROG_TYPE_KPROBE:
	case BPF_PROG_TYPE_TRACEPOINT:
	case BPF_PROG_TYPE_PERF_EVENT:
	case BPF_PROG_TYPE_RAW_TRACEPOINT:
	case BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE:
	case BPF_PROG_TYPE_TRACING:
	case BPF_PROG_TYPE_LSM:
	case BPF_PROG_TYPE_STRUCT_OPS: /* has access to struct sock */
	case BPF_PROG_TYPE_EXT: /* extends any prog */
		return true;
	default:
		return false;
	}
}

CAP_PERFMON, introduced in 5.8, allows performance-monitoring and observability operations. You'll need it for a lot of the fun tracing program types, since the whole point of them is to have free access to kernel internals.

Outside of loading specific program types, you'll also need CAP_PERFMON to enable specific helpers:

const struct bpf_func_proto *
bpf_base_func_proto(enum bpf_func_id func_id)
{
	/* ... */
	if (!perfmon_capable())
		return NULL;

	switch (func_id) {
	case BPF_FUNC_trace_printk:
		return bpf_get_trace_printk_proto();
	case BPF_FUNC_get_current_task:
		return &bpf_get_current_task_proto;
	case BPF_FUNC_get_current_task_btf:
		return &bpf_get_current_task_btf_proto;
	case BPF_FUNC_probe_read_user:
		return &bpf_probe_read_user_proto;
	case BPF_FUNC_probe_read_kernel:
		return security_locked_down(LOCKDOWN_BPF_READ_KERNEL) < 0 ?
		       NULL : &bpf_probe_read_kernel_proto;
	case BPF_FUNC_probe_read_user_str:
		return &bpf_probe_read_user_str_proto;
	case BPF_FUNC_probe_read_kernel_str:
		return security_locked_down(LOCKDOWN_BPF_READ_KERNEL) < 0 ?
		       NULL : &bpf_probe_read_kernel_str_proto;
	case BPF_FUNC_snprintf_btf:
		return &bpf_snprintf_btf_proto;
	case BPF_FUNC_snprintf:
		return &bpf_snprintf_proto;
	case BPF_FUNC_task_pt_regs:
		return &bpf_task_pt_regs_proto;
	case BPF_FUNC_trace_vprintk:
		return bpf_get_trace_vprintk_proto();
	default:
		return NULL;
	}
}

You can see from the list that bpf_trace_printk is there, which is a popular one for debugging.

And sometimes all three

There are cases where you'll need the whole band: CAP_BPF, CAP_NET_ADMIN and CAP_PERFMON. Obviously, this depends on what you're doing, but as an example, you need CAP_PERFMON for the bpf_trace_printk and bpf_snprintf helpers regardless of the program type, which means networking programs that need to dump to the tracing buffer will require all three.
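As a rough way to reason about this, the sketch below computes which capabilities a loader would need for a program family and whether it calls one of the CAP_PERFMON-gated helpers. Everything here is illustrative: enum prog_family, caps_needed and caps_missing are names I made up, only two families are modeled, and the always-unprivileged program types are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Capability numbers from include/uapi/linux/capability.h. */
#define CAP_NET_ADMIN 12
#define CAP_SYS_ADMIN 21
#define CAP_PERFMON   38
#define CAP_BPF       39

#define CAP_BIT(cap) (1ULL << (cap))

/* Hypothetical grouping of program types, for illustration only. */
enum prog_family {
	FAMILY_NETWORKING, /* types listed in is_net_admin_prog_type() */
	FAMILY_TRACING,    /* types listed in is_perfmon_prog_type() */
};

/* Capabilities a loader needs when it does not hold CAP_SYS_ADMIN. */
uint64_t caps_needed(enum prog_family family, bool uses_perfmon_helper)
{
	uint64_t need = CAP_BIT(CAP_BPF);

	assert(family == FAMILY_NETWORKING || family == FAMILY_TRACING);

	if (family == FAMILY_NETWORKING)
		need |= CAP_BIT(CAP_NET_ADMIN);
	else
		need |= CAP_BIT(CAP_PERFMON);

	/* Helpers such as bpf_trace_printk sit behind perfmon_capable(). */
	if (uses_perfmon_helper)
		need |= CAP_BIT(CAP_PERFMON);

	return need;
}

/* Which of the needed capabilities are absent from the effective set? */
uint64_t caps_missing(uint64_t need, uint64_t eff)
{
	/* CAP_SYS_ADMIN satisfies each of these checks for compatibility. */
	if (eff & CAP_BIT(CAP_SYS_ADMIN))
		return 0;
	return need & ~eff;
}
```

A networking program that calls bpf_trace_printk comes out needing all three bits, and a loader holding only CAP_BPF and CAP_NET_ADMIN is reported as missing CAP_PERFMON.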

Verifier checks tied to capabilities

The verifier, our frenemy in eBPF development, changes its behavior based on the capabilities of the process loading the program (mostly on whether CAP_PERFMON is enabled).

Digging into this deserves another blog post (or possibly several) of its own. I'll set aside some time to get to it. If you want to dig deeper yourself now, here's the snippet where the verifier behavior is determined:

v5.18:kernel/bpf/verifier.c

int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr) {
	/* ... */

	env->allow_ptr_leaks = bpf_allow_ptr_leaks();
	env->allow_uninit_stack = bpf_allow_uninit_stack();
	env->allow_ptr_to_map_access = bpf_allow_ptr_to_map_access();
	env->bypass_spec_v1 = bpf_bypass_spec_v1();
	env->bypass_spec_v4 = bpf_bypass_spec_v4();
	env->bpf_capable = bpf_capable();

	/* ... */
}

bpf_check is basically the int main() of the verifier.
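To show how those env flags relate to capabilities, here is a small user-space model. It rests on an assumption you should verify against your own tree: as far as I can tell from include/linux/bpf.h in v5.18, each of the bpf_allow_* and bpf_bypass_spec_* helpers is a thin wrapper around perfmon_capable(), which accepts CAP_PERFMON or CAP_SYS_ADMIN. The struct and function names (model_verifier_env, env_has_cap, model_env_from_caps) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Capability numbers from include/uapi/linux/capability.h. */
#define CAP_SYS_ADMIN 21
#define CAP_PERFMON   38
#define CAP_BPF       39

/* Hypothetical mirror of the capability-derived verifier flags. */
struct model_verifier_env {
	bool allow_ptr_leaks;
	bool allow_uninit_stack;
	bool allow_ptr_to_map_access;
	bool bypass_spec_v1;
	bool bypass_spec_v4;
	bool bpf_capable;
};

/* Test one capability bit in an effective-set mask. */
bool env_has_cap(uint64_t eff, int cap)
{
	assert(cap >= 0 && cap < 64);
	return (eff >> cap) & 1;
}

/*
 * Derive the flags from an effective capability mask.
 * Assumption: in v5.18 every flag except bpf_capable follows
 * perfmon_capable(); later kernels may differ.
 */
struct model_verifier_env model_env_from_caps(uint64_t eff)
{
	bool sys_admin = env_has_cap(eff, CAP_SYS_ADMIN);
	bool perfmon = env_has_cap(eff, CAP_PERFMON) || sys_admin;
	struct model_verifier_env env = {
		.allow_ptr_leaks = perfmon,
		.allow_uninit_stack = perfmon,
		.allow_ptr_to_map_access = perfmon,
		.bypass_spec_v1 = perfmon,
		.bypass_spec_v4 = perfmon,
		.bpf_capable = env_has_cap(eff, CAP_BPF) || sys_admin,
	};

	return env;
}
```

Under that assumption, a loader with only CAP_BPF gets bpf_capable set while pointer leaks stay forbidden and the speculation hardening stays on; adding CAP_PERFMON flips the remaining flags.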

Update (11/2022): An interesting story by Cloudflare on how capabilities can change the way the verifier works: eBPF can't count?!

What about maps?

eBPF maps, for the most part, follow the general rule: if you can get a file descriptor to a map, you can operate on it. CAP_BPF lets a process obtain fds for the maps it creates, while CAP_SYS_ADMIN allows it to grab fds for a host's entire map set.

With that being said, there are sprinkles of CAP_BPF checks that we don't mention here so you might want to refer to the kernel code if you run into issues (or just wait until I inevitably write about it). As an example, I've found that you will be required to have CAP_SYS_ADMIN if you want to use the BPF_F_ZERO_SEED flag on hash & lru maps (and their per-cpu variants).

Can we get rid of CAP_SYS_ADMIN then?

CAP_SYS_ADMIN, the almighty capability that can do just about anything. It isn't technically the same as root but it might as well be.

The hard reality is that even if you want your processes to have more granular capabilities than just slapping a CAP_SYS_ADMIN on it, it might not be possible. There are situations where it still is required.

As mentioned before, if you want your process to have a greater view of the bpf objects loaded on the system, you'll need CAP_SYS_ADMIN. There are also particular map options and helper functions that require CAP_SYS_ADMIN. Also, hardware offload requires CAP_SYS_ADMIN.

My take on this is that most programs will not end up needing this and that they can be broken down with CAP_BPF. If you can't, then that's enough reason to jump into the kernel code, figure out why and understand the risks involved.

Conclusion

CAP_BPF is still new (in terms of release adoption), so it's likely that most of your eBPF programs are still being loaded with CAP_SYS_ADMIN and that a lot of eBPF-based processes still haven't migrated to CAP_BPF, especially considering that the checks aren't completely straightforward. eBPF is hard!

Hopefully this article helps with understanding why developers should consider CAP_BPF if they're deploying on recent enough kernels. The kernel team has obviously been thinking about the relationship between capabilities and eBPF. The design is meant to be simple, yet effective.

Even so, what happens when software teams need more visibility or nuance when deploying these programs across several hosts?

  • Which programs were loaded with CAP_SYS_ADMIN that can migrate to CAP_BPF?
  • What if we want to further lock down the types of eBPF programs that can run on the host?
  • Or have it be based on other criteria such as number of eBPF programs or a per-type quota? What about filtering out allowable eBPF helpers?

This possibly falls into the concern of an LSM such as AppArmor and SELinux (or even BPF LSM itself). We're researching the answers to these questions at bpfdeploy.io.

Written by