path: root/fs
Commit message (Author, Date, Files, Lines)
* Backport mac80211 from 3.4 kernel (Wolfgang Wiedmeyer, 2019-07-24, 1 file, -0/+7)

    The ath9k_htc driver depends on mac80211, but mac80211 can't be built. The
    reason is that net/wireless is almost completely backported from a 3.4
    kernel. To follow suit, mac80211 is also backported from 3.4, more
    precisely from 3.4.113. This makes mac80211 build.

    Signed-off-by: Wolfgang Wiedmeyer <wolfgit@wiedmeyer.de>

* block: fix use-after-free in sys_ioprio_get() (Omar Sandoval, 2016-11-11, 1 file, -0/+2)

    get_task_ioprio() accesses the task->io_context without holding the task
    lock and thus can race with exit_io_context(), leading to a
    use-after-free. The reproducer below hits this within a few seconds on my
    4-core QEMU VM:

    int main(int argc, char **argv)
    {
        pid_t pid, child;
        long nproc, i;

        /* ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)); */
        syscall(SYS_ioprio_set, 1, 0, 0x6000);

        nproc = sysconf(_SC_NPROCESSORS_ONLN);

        for (i = 0; i < nproc; i++) {
            pid = fork();
            assert(pid != -1);
            if (pid == 0) {
                for (;;) {
                    pid = fork();
                    assert(pid != -1);
                    if (pid == 0) {
                        _exit(0);
                    } else {
                        child = wait(NULL);
                        assert(child == pid);
                    }
                }
            }

            pid = fork();
            assert(pid != -1);
            if (pid == 0) {
                for (;;) {
                    /* ioprio_get(IOPRIO_WHO_PGRP, 0); */
                    syscall(SYS_ioprio_get, 2, 0);
                }
            }
        }

        for (;;) {
            /* ioprio_get(IOPRIO_WHO_PGRP, 0); */
            syscall(SYS_ioprio_get, 2, 0);
        }

        return 0;
    }

    This gets us KASAN dumps like this:

    [ 35.526914] ==================================================================
    [ 35.530009] BUG: KASAN: out-of-bounds in get_task_ioprio+0x7b/0x90 at addr ffff880066f34e6c
    [ 35.530009] Read of size 2 by task ioprio-gpf/363
    [ 35.530009] =============================================================================
    [ 35.530009] BUG blkdev_ioc (Not tainted): kasan: bad access detected
    [ 35.530009] -----------------------------------------------------------------------------
    [ 35.530009] Disabling lock debugging due to kernel taint
    [ 35.530009] INFO: Allocated in create_task_io_context+0x2b/0x370 age=0 cpu=0 pid=360
    [ 35.530009]  ___slab_alloc+0x55d/0x5a0
    [ 35.530009]  __slab_alloc.isra.20+0x2b/0x40
    [ 35.530009]  kmem_cache_alloc_node+0x84/0x200
    [ 35.530009]  create_task_io_context+0x2b/0x370
    [ 35.530009]  get_task_io_context+0x92/0xb0
    [ 35.530009]  copy_process.part.8+0x5029/0x5660
    [ 35.530009]  _do_fork+0x155/0x7e0
    [ 35.530009]  SyS_clone+0x19/0x20
    [ 35.530009]  do_syscall_64+0x195/0x3a0
    [ 35.530009]  return_from_SYSCALL_64+0x0/0x6a
    [ 35.530009] INFO: Freed in put_io_context+0xe7/0x120 age=0 cpu=0 pid=1060
    [ 35.530009]  __slab_free+0x27b/0x3d0
    [ 35.530009]  kmem_cache_free+0x1fb/0x220
    [ 35.530009]  put_io_context+0xe7/0x120
    [ 35.530009]  put_io_context_active+0x238/0x380
    [ 35.530009]  exit_io_context+0x66/0x80
    [ 35.530009]  do_exit+0x158e/0x2b90
    [ 35.530009]  do_group_exit+0xe5/0x2b0
    [ 35.530009]  SyS_exit_group+0x1d/0x20
    [ 35.530009]  entry_SYSCALL_64_fastpath+0x1a/0xa4
    [ 35.530009] INFO: Slab 0xffffea00019bcd00 objects=20 used=4 fp=0xffff880066f34ff0 flags=0x1fffe0000004080
    [ 35.530009] INFO: Object 0xffff880066f34e58 @offset=3672 fp=0x0000000000000001
    [ 35.530009] ==================================================================

    Fix it by grabbing the task lock while we poke at the io_context.

    Change-Id: I4261aaf076fab943a80a45b0a77e023aa4ecbbd8
    Cc: stable@vger.kernel.org
    Reported-by: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>

* fs: ext4: disable support for fallocate FALLOC_FL_PUNCH_HOLE (Nick Desaulniers, 2016-10-19, 1 file, -0/+7)

    Bug: 28760453
    Change-Id: I019c2de559db9e4b95860ab852211b456d78c4ca
    Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>

* mm: add a field to store names for private anonymous memory (Colin Cross, 2016-08-23, 1 file, -0/+63)

    Userspace processes often have multiple allocators that each do anonymous
    mmaps to get memory. When examining memory usage of individual processes
    or systems as a whole, it is useful to be able to break down the various
    heaps that were allocated by each layer and examine their size, RSS, and
    physical memory usage.

    This patch adds a user pointer to the shared union in vm_area_struct that
    points to a null-terminated string inside the user process containing a
    name for the vma. vmas that point to the same address will be merged, but
    vmas that point to equivalent strings at different addresses will not be
    merged.

    Userspace can set the name for a region of memory by calling

        prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);

    Setting the name to NULL clears it.

    The names of named anonymous vmas are shown in /proc/pid/maps as
    [anon:<name>] and in /proc/pid/smaps in a new "Name" field that is only
    present for named vmas. If the userspace pointer is no longer valid, all
    or part of the name will be replaced with "<fault>".

    The idea to store a userspace pointer to reduce the complexity within mm
    (at the expense of the complexity of reading /proc/pid/mem) came from Dave
    Hansen. This results in no runtime overhead in the mm subsystem other than
    comparing the anon_name pointers when considering vma merging. The pointer
    is stored in a union with fields that are only used on file-backed
    mappings, so it does not increase memory usage.

    Change-Id: I53b093d98dc24f41377824f34e076edced4a6f07

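    A minimal userspace sketch of the interface described above. The
    PR_SET_VMA / PR_SET_VMA_ANON_NAME constants are taken from this patch
    series and defined locally because stock libc headers of that era do not
    provide them (treat them as assumptions); on a kernel without the patch
    the prctl() simply fails:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_VMA
    #define PR_SET_VMA           0x53564d41  /* value used by the patch series */
    #define PR_SET_VMA_ANON_NAME 0
    #endif

    int main(void)
    {
        size_t len = 1 << 20;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* The kernel stores this user pointer, not a copy, so the string
         * must stay valid for the lifetime of the mapping. */
        static const char name[] = "myheap";
        if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                  (unsigned long)p, len, (unsigned long)name))
            perror("prctl");    /* e.g. EINVAL on unpatched kernels */

        /* Pause so the mapping can be inspected from another shell. */
        getchar();
        return 0;
    }

    While the program waits on getchar(), a patched kernel should show the
    region as [anon:myheap] in /proc/<pid>/maps.
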
* mnt: Fail collect_mounts when applied to unmounted mounts (Eric W. Biederman, 2016-06-30, 1 file, -1/+7)

    The only users of collect_mounts are in audit_tree.c

    In audit_trim_trees and audit_add_tree_rule the path passed into
    collect_mounts is generated from kern_path passed an audit_tree pathname,
    which is guaranteed to be an absolute path. In those cases collect_mounts
    is obviously intended to work on mounted paths, and if a race results in
    paths that are unmounted when collect_mounts is called, it is reasonable
    to fail early.

    The paths passed into audit_tag_tree don't have the absolute path check,
    but are used to play with fsnotify and otherwise interact with the
    audit_trees, so again operating only on mounted paths appears reasonable.

    Avoid having to worry about what happens when we try to audit unmounted
    filesystems by restricting collect_mounts to mounts that appear in the
    mount tree.

    Change-Id: I2edfee6d6951a2179ce8f53785b65ddb1eb95629
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

* staging: android: lowmemorykiller: implement task's adj rbtree (Hong-Mei Li, 2016-06-13, 2 files, -0/+6)

    With the current LMK implementation, LMK has to scan all processes to
    select the correct task to kill during low memory. The basic idea of the
    optimization is to queue all tasks by oom_score_adj priority, so that LMK
    just selects the proper task to kill from the queue (rbtree).

    Performance improvement:
      current implementation:   average time to find a task to kill: 1004us
      optimized implementation: average time to find a task to kill:   43us

    Change-Id: I4dbbdd5673314dbbdabb71c3eff0dc229ce4ea91
    Signed-off-by: Hong-Mei Li <a21834@motorola.com>
    Reviewed-on: http://gerrit.pcs.mot.com/548917
    SLT-Approved: Slta Waiver <sltawvr@motorola.com>
    Tested-by: Jira Key <jirakey@motorola.com>
    Reviewed-by: Yi-Wei Zhao <gbjc64@motorola.com>
    Submit-Approved: Jira Key <jirakey@motorola.com>
    Signed-off-by: D. Andrei Măceș <dmaces@nd.edu>

    Conflicts:
        drivers/staging/android/Kconfig
        drivers/staging/android/lowmemorykiller.c
        fs/proc/base.c
        mm/oom_kill.c

    Conflicts:
        drivers/staging/android/lowmemorykiller.c
        mm/oom_kill.c

    Conflicts:
        mm/oom_kill.c

    Conflicts:
        drivers/staging/android/lowmemorykiller.c
        mm/oom_kill.c

* fs: avoid adding non-thread-group task to LMK rbtree (Hong-Mei Li, 2016-06-13, 1 file, -0/+9)

    To maintain the task adj RB tree, we add a task to the RB tree on fork and
    delete it on exit. The place is exactly the same as for the linear
    p->tasks list, that is, only when the task is the thread_group_leader.
    When the task group_leader is changing, we make sure to add the new leader
    into the RB tree after its leader flag, task->exit_signal, is set.

    Cherry-picked from (CR): http://gerrit.mot.com/753419/

    Change-Id: I8da47998510e531188feb067b491e92306be9414
    Signed-off-by: Hong-Mei Li <a21834@motorola.com>
    Reviewed-on: http://gerrit.mot.com/753419
    SLTApproved: Slta Waiver <sltawvr@motorola.com>
    SME-Granted: SME Approvals Granted
    Tested-by: Jira Key <jirakey@motorola.com>
    Reviewed-by: Zhi-Ming Yuan <a14194@motorola.com>
    Reviewed-by: Yi-Wei Zhao <gbjc64@motorola.com>
    Submit-Approved: Jira Key <jirakey@motorola.com>
    Reviewed-on: http://gerrit.mot.com/766106
    Reviewed-by: Sudharsan Yettapu <sudharsan.yettapu@motorola.com>
    Reviewed-by: Ravikumar Vembu <raviv@motorola.com>
    (cherry picked from commit e9e92d64142625981490dd5c323aa08467d349e8)
    Reviewed-on: http://gerrit.mot.com/768301

    Conflicts:
        fs/exec.c

* pipe: limit the per-user amount of pages allocated in pipes (Willy Tarreau, 2016-05-03, 1 file, -2/+45)

    On not-so-small systems, it is possible for a single process to cause an
    OOM condition by filling large pipes with data that are never read. A
    typical process filling 4000 pipes with 1 MB of data will use 4 GB of
    memory. On small systems it may be tricky to set the pipe max size to
    prevent this from happening.

    This patch makes it possible to enforce a per-user soft limit above which
    new pipes will be limited to a single page, effectively limiting them to
    4 kB each, as well as a hard limit above which no new pipes may be created
    for this user. This has the effect of protecting the system against memory
    abuse without hurting other users, and still allowing pipes to work
    correctly, though with less data at once.

    The limits are controlled by two new sysctls: pipe-user-pages-soft and
    pipe-user-pages-hard. Both may be disabled by setting them to zero. The
    default soft limit allows the default number of FDs per process (1024) to
    create pipes of the default size (64kB), thus reaching a limit of 64MB
    before starting to create only smaller pipes. With 256 processes limited
    to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
    1084 MB of memory allocated for a user. The hard limit is disabled by
    default to avoid breaking existing applications that make intensive use of
    pipes (eg: for splicing).

    Reported-by: socketpair@gmail.com
    Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Mitigates: CVE-2013-4312 (Linux 2.0+)
    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Willy Tarreau <w@1wt.eu>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

    Conflicts:
        Documentation/sysctl/fs.txt
        fs/pipe.c
        include/linux/sched.h

    Change-Id: Ic7c678af18129943e16715fdaa64a97a7f0854be

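    A small sketch for inspecting the two knobs at run time. The
    /proc/sys/fs/ locations are inferred from the sysctl names and the
    Documentation/sysctl/fs.txt conflict listed above, so treat the paths as
    an assumption:

    #include <stdio.h>

    /* Read one numeric sysctl; returns -1 if the file is absent
     * (i.e. the running kernel does not carry this patch). */
    static long read_limit(const char *path)
    {
        long val = -1;
        FILE *f = fopen(path, "r");

        if (f) {
            if (fscanf(f, "%ld", &val) != 1)
                val = -1;
            fclose(f);
        }
        return val;
    }

    int main(void)
    {
        /* A value of 0 means the corresponding limit is disabled. */
        printf("pipe-user-pages-soft: %ld\n",
               read_limit("/proc/sys/fs/pipe-user-pages-soft"));
        printf("pipe-user-pages-hard: %ld\n",
               read_limit("/proc/sys/fs/pipe-user-pages-hard"));
        return 0;
    }
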
* pipe: Fix buffer offset after partially failed read (Ben Hutchings, 2016-04-05, 1 file, -1/+4)

    Quoting the RHEL advisory:

    > It was found that the fix for CVE-2015-1805 incorrectly kept buffer
    > offset and buffer length in sync on a failed atomic read, potentially
    > resulting in a pipe buffer state corruption. A local, unprivileged user
    > could use this flaw to crash the system or leak kernel memory to user
    > space. (CVE-2016-0774, Moderate)

    The same flawed fix was applied to stable branches from 2.6.32.y to 3.14.y
    inclusive, and I was able to reproduce the issue on 3.2.y.

    We need to give pipe_iov_copy_to_user() a separate offset variable and
    only update the buffer offset if it succeeds.

    Change-Id: I988802f38acf40c7671fa0978880928b02d29b56
    References: https://rhn.redhat.com/errata/RHSA-2016-0103.html
    Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    (cherry picked from commit feae3ca2e5e1a8f44aa6290255d3d9709985d0b2)

* pipe: iovec: Fix memory corruption when retrying atomic copy as non-atomic (Ben Hutchings, 2016-03-19, 1 file, -23/+32)

    pipe_iov_copy_{from,to}_user() may be tried twice with the same iovec, the
    first time atomically and the second time not. The second attempt needs to
    continue from the iovec position, pipe buffer offset and remaining length
    where the first attempt failed, but currently the pipe buffer offset and
    remaining length are reset. This will corrupt the piped data (possibly
    also leading to an information leak between processes) and may also
    corrupt kernel memory.

    This was fixed upstream by commits f0d1bec9d58d ("new helper:
    copy_page_from_iter()") and 637b58c2887e ("switch pipe_read() to
    copy_page_to_iter()"), but those aren't suitable for stable. This fix for
    older kernel versions was made by Seth Jennings for RHEL and I have
    extracted it from their update.

    CVE-2015-1805

    Bug: 27275324
    Change-Id: I459adb9076fcd50ff1f1c557089c4e421b036ec4
    References: https://bugzilla.redhat.com/show_bug.cgi?id=1202855
    Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    (cherry picked from commit 85c34d007116f8a8aafb173966a605fb03532f45)

* fuse: break infinite loop in fuse_fill_write_pages() (Roman Gushchin, 2016-03-15, 1 file, -1/+1)

    I got a report about an unkillable task eating CPU. Further investigation
    showed that the problem is in the fuse_fill_write_pages() function. If
    iov's first segment has zero length, we get an infinite loop, because we
    never reach the iov_iter_advance() call.

    Fix this by calling iov_iter_advance() before repeating an attempt to copy
    data from userspace.

    A similar problem is described in 124d3b7041f ("fix writev regression: pan
    hanging unkillable and un-straceable"). If a zero-length segment is
    followed by a segment with an invalid address,
    iov_iter_fault_in_readable() checks only the first segment (zero-length),
    iov_iter_copy_from_user_atomic() skips it, fails at the second and returns
    zero -> goto again without skipping the zero-length segment.

    The patch calls iov_iter_advance() before goto again: we'll skip the
    zero-length segment at the second iteration and
    iov_iter_fault_in_readable() will detect the invalid address.

    Special thanks to Konstantin Khlebnikov, who helped a lot with the commit
    description.

    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Maxim Patlasov <mpatlasov@parallels.com>
    Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
    Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
    Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
    Fixes: ea9b9907b82a ("fuse: implement perform_write")
    Cc: <stable@vger.kernel.org>

    Conflicts:
        fs/fuse/file.c

    Change-Id: Id37193373294dd43191469389cfe68ca1736a54b

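    For illustration only, a userspace sketch of the trigger described above:
    a zero-length first iovec segment followed by a segment at an unmapped
    address, written to a file assumed to live on a FUSE mount (the path is a
    placeholder). A patched kernel fails the copy with EFAULT; the message
    implies an unpatched one can spin in fuse_fill_write_pages():

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void)
    {
        char dummy;
        struct iovec iov[2] = {
            { .iov_base = &dummy,       .iov_len = 0    },  /* zero-length segment */
            { .iov_base = (void *)0x10, .iov_len = 4096 },  /* unmapped address */
        };

        /* Placeholder path: assumed to be on a FUSE-backed filesystem. */
        int fd = open("/mnt/fuse/testfile", O_WRONLY | O_CREAT, 0600);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        if (writev(fd, iov, 2) < 0)
            perror("writev");   /* expected: EFAULT on a patched kernel */

        close(fd);
        return 0;
    }
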
* vfs: more mnt_parent cleanups (Al Viro, 2016-03-15, 4 files, -55/+29)

    a) mount --move is checking that ->mnt_parent is non-NULL before looking
       if that parent happens to be shared; ->mnt_parent is never NULL and
       it's not even a misspelled !mnt_has_parent()
    b) pivot_root open-codes is_path_reachable(), poorly.
    c) so does path_is_under(), while we are at it.

    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    (backported from commit afac7cba7ed31968a95e181dc25e204e45009ea8)
    CVE-2014-7970
    BugLink: http://bugs.launchpad.net/bugs/1383356
    Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
    Acked-by: Stefan Bader <stefan.bader@canonical.com>
    Acked-by: Andy Whitcroft <apw@canonical.com>
    Signed-off-by: Andy Whitcroft <apw@canonical.com>
    Change-Id: I6b2297f46388f135c1b760a37d45efc0e33542db

* vfs: new internal helper: mnt_has_parent(mnt) (Al Viro, 2016-03-15, 5 files, -12/+18)

    vfsmounts have ->mnt_parent pointing either to a different vfsmount or to
    itself; it's never NULL, and the termination condition in loops traversing
    the tree towards the root is mnt == mnt->mnt_parent.

    At least one place (see the next patch) is confused about what's going on;
    let's add an explicit helper checking it the right way and use it in all
    places where we need it. Not that there had been too many, but...

    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    (cherry picked from commit b2dba1af3c4157040303a76d25216b1713d333d0)
    CVE-2014-7970
    BugLink: http://bugs.launchpad.net/bugs/1383356
    Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
    Acked-by: Stefan Bader <stefan.bader@canonical.com>
    Acked-by: Andy Whitcroft <apw@canonical.com>
    Signed-off-by: Andy Whitcroft <apw@canonical.com>
    Change-Id: Iaa5ab510804f3b17fe71197b8919d663a416bf05

* fs: take i_mutex during prepare_binprm for set[ug]id executables (Jann Horn, 2016-03-15, 1 file, -25/+40)

    This prevents a race between chown() and execve(), where chowning a
    setuid-user binary to root would momentarily make the binary setuid root.

    This patch was mostly written by Linus Torvalds.

    Signed-off-by: Jann Horn <jann@thejh.net>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Conflicts:
        fs/exec.c

    Change-Id: Iecebf23d07e299689e4ba4fd74ea8821ef96e72b

* vfs: read file_handle only once in handle_to_path (Sasha Levin, 2016-03-15, 1 file, -2/+3)

    We used to read file_handle twice: once to get the amount of extra bytes,
    and once to fetch the entire structure. This may be problematic since we
    do size verifications only after the first read, so if the number of extra
    bytes changes in userspace between the first and second calls, we'll have
    an incoherent view of file_handle.

    Instead, read the constant size once, and copy that over to the final
    structure without having to re-read it.

    Change-Id: Ib05e5129629e27d5a05953098c5bc470fae40d2a
    Signed-off-by: Sasha Levin <sasha.levin@oracle.com>

* mnt: Prevent pivot_root from creating a loop in the mount tree (Eric W. Biederman, 2016-03-15, 1 file, -0/+3)

    Andy Lutomirski recently demonstrated that when chroot is used to set the
    root path below the path for the new ``root'' passed to pivot_root, the
    pivot_root system call succeeds and leaks mounts.

    In examining the code I see that starting with a new root that is below
    the current root in the mount tree will result in a loop in the mount tree
    after the mounts are detached and then reattached to one another,
    resulting in all kinds of ugliness including a leak of the mounts involved
    in the mount loop.

    Prevent this problem by ensuring that the new mount is reachable from the
    current root of the mount tree.

    [Added stable cc. Fixes CVE-2014-7970. --Andy]

    Cc: stable@vger.kernel.org
    Reported-by: Andy Lutomirski <luto@amacapital.net>
    Reviewed-by: Andy Lutomirski <luto@amacapital.net>
    Link: http://lkml.kernel.org/r/87bnpmihks.fsf@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: Andy Lutomirski <luto@amacapital.net>
    (backported from commit 0d0826019e529f21c84687521d03f60cd241ca7d)
    CVE-2014-7970
    BugLink: http://bugs.launchpad.net/bugs/1383356
    Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
    Acked-by: Stefan Bader <stefan.bader@canonical.com>
    Acked-by: Andy Whitcroft <apw@canonical.com>
    Signed-off-by: Andy Whitcroft <apw@canonical.com>
    Change-Id: I0fe1d090eeb4765cc49401784e44a430f9585498

* mnt: Only change user settable mount flags in remount (Eric W. Biederman, 2016-03-15, 1 file, -1/+1)

    commit a6138db815df5ee542d848318e5dae681590fccd upstream.

    Kenton Varda <kenton@sandstorm.io> discovered that by remounting a
    read-only bind mount read-only in a user namespace, the MNT_LOCK_READONLY
    bit would be cleared, allowing an unprivileged user to remount a read-only
    mount read-write.

    Correct this by replacing the mask of mount flags to preserve with a mask
    of mount flags that may be changed, and preserve all others. This ensures
    that any future bugs with this mask and remount will fail in an easy to
    detect way where new mount flags simply won't change.

    Change-Id: I8ab8bda03a14b9b43e78f1dc6c818bbec048e986
    Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Francis Moreau <francis.moro@gmail.com>
    Signed-off-by: Zefan Li <lizefan@huawei.com>

* eCryptfs: Remove buggy and unnecessary write in file name decode routine (Michael Halcrow, 2016-03-15, 1 file, -1/+0)

    Dmitry Chernenkov used KASAN to discover that eCryptfs writes past the end
    of the allocated buffer during encrypted filename decoding. This fix
    corrects the issue by getting rid of the unnecessary 0 write when the
    current bit offset is 2.

    Change-Id: Id8e04a580e550495c46cd36fec430a1ec4342940
    Signed-off-by: Michael Halcrow <mhalcrow@google.com>
    Reported-by: Dmitry Chernenkov <dmitryc@google.com>
    Suggested-by: Kees Cook <keescook@chromium.org>
    Cc: stable@vger.kernel.org # v2.6.29+: 51ca58d eCryptfs: Filename Encryption: Encoding and encryption functions
    Signed-off-by: Tyler Hicks <tyhicks@canonical.com>

* ext4: make orphan functions be no-op in no-journal mode (Anatol Pomozov, 2016-03-12, 1 file, -4/+3)

    Instead of checking whether the handle is valid, we check if journal is
    enabled. This avoids taking the s_orphan_lock mutex in all cases when
    there is no journal in use, including the error paths where
    ext4_orphan_del() is called with a handle set to NULL.

    Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

    Conflicts:
        fs/ext4/namei.c

    Change-Id: I734ccb8069fceb12b864e7b9dceb37e27ab94c61

* BACKPORT: pagemap: do not leak physical addresses to non-privileged userspace (Kirill A. Shutemov, 2016-03-10, 1 file, -0/+9)

    (cherry picked from commit ab676b7d6fbf4b294bf198fb27ade5b0e865c7ce)

    As pointed out by a recent post[1] on exploiting DRAM physical
    imperfection, /proc/PID/pagemap exposes sensitive information which can be
    used to do attacks. This disallows anybody without CAP_SYS_ADMIN from
    reading the pagemap.

    [1] http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html

    [ Eventually we might want to do something more finegrained, but for now
      this is the simple model. - Linus ]

    Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
    Acked-by: Andy Lutomirski <luto@amacapital.net>
    Cc: Pavel Emelyanov <xemul@parallels.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Mark Seaborn <mseaborn@chromium.org>
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mark Salyzyn <salyzyn@google.com>
    Bug: 26038811
    Change-Id: Icd68075a32ef6c9be1ae00ae9cf5a68bbe7f4e4f

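    A userspace sketch showing what this backport cuts off for unprivileged
    callers. The pagemap entry layout (PFN in bits 0-54, "present" in bit 63)
    follows Documentation/vm/pagemap.txt; with this patch the open() is
    expected to fail with EPERM unless the caller has CAP_SYS_ADMIN:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        long page_size = sysconf(_SC_PAGESIZE);
        char *buf = malloc(page_size);
        buf[0] = 1;                       /* fault the page in */

        int fd = open("/proc/self/pagemap", O_RDONLY);
        if (fd < 0) {
            perror("open");               /* EPERM here without CAP_SYS_ADMIN */
            return 1;
        }

        /* One 64-bit entry per page, indexed by virtual page number. */
        uint64_t entry;
        off_t off = ((uintptr_t)buf / page_size) * sizeof(entry);
        if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry)) {
            perror("pread");
            return 1;
        }

        printf("present=%d pfn=0x%llx\n",
               (int)((entry >> 63) & 1),
               (unsigned long long)(entry & ((1ULL << 55) - 1)));
        close(fd);
        free(buf);
        return 0;
    }
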
* block: separate priority boosting from REQ_META (Christoph Hellwig, 2016-02-16, 8 files, -13/+15)

    Add a new REQ_PRIO to let requests preempt others in the cfq I/O
    scheduler, and leave REQ_META purely for marking requests as metadata in
    blktrace. All existing callers of REQ_META except for XFS are updated to
    also set REQ_PRIO for now.

    Backported to 3.0.x by Ketut Putu Kumajaya <ketut.kumajaya@gmail.com>

    Change-Id: Iad5ba7a105438776f74788c0aedaf85210c613f9

* f2fs: drop obsolete node page when it is truncated (Jaegeuk Kim, 2016-02-13, 1 file, -0/+4)

    If a node page is truncated, we'd better drop the page in the
    node_inode's page cache for a better memory footprint.

    Change-Id: I3c7187676b5899c211857b9b94fad4f2a412e34d
    Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

* f2fs: introduce NODE_MAPPING for code consistency (Jaegeuk Kim, 2016-02-13, 4 files, -31/+30)

    This patch adds NODE_MAPPING, which is similar to the META_MAPPING
    introduced by Gu Zheng.

    Change-Id: I535e5fc3573f408245c4630934a308db06e1f999
    Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

* f2fs: remove the orphan block page array (Gu Zheng, 2016-02-13, 1 file, -3/+4)

    As orphan_blocks may be as large as 504, it is neither safe nor rigorous
    to store such a large array on the kernel stack, as Dan Carpenter said. In
    fact, grab_meta_page has locked the page in the page cache, and we can use
    find_get_page() to fetch the page safely downstream, so we can remove the
    page array directly.

    Change-Id: Ie73897038bc06f6a6342d6fbdab8f8ff36b2cd96
    Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

* f2fs: add helper function META_MAPPING (Gu Zheng, 2016-02-13, 5 files, -8/+13)

    Introduce the helper function META_MAPPING() to get the cached meta
    blocks' address space.

    Change-Id: I5dcfb055444bbdd6160c3b5b631878f135c767ad
    Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

* f2fs: move a branch for code readability (Jaegeuk Kim, 2016-02-13, 1 file, -5/+4)

    This patch moves a function in f2fs_delete_entry for code readability.

    Change-Id: I00419bcbd9d353a95491af6d8ec8e418b45171f3
    Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

* f2fs: call mark_inode_dirty to flush dirty pages (Jaegeuk Kim, 2016-02-13, 3 files, -5/+8)

    If a dentry page is updated, we should call mark_inode_dirty to add the
    inode to the dirty list, so that its dentry pages are flushed to disk.
    Otherwise, the inode can be evicted without a flush.

    Change-Id: Ia31244451976a13a69e99342c5c4f5e5238a7b4f
    Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

* f2fs: clean checkpatch warnings (Chris Fries, 2016-02-13, 11 files, -30/+40)

    Fixed a variety of trivial checkpatch warnings. The only delta should be
    some minor formatting on log strings that were split / too long.

    Change-Id: I1477b17f7d4e2acd2d5370bb389f0bd064834059
    Signed-off-by: Chris Fries <cfries@motorola.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

* f2fs: missing REQ_META and REQ_PRIO when sync_meta_pages(META_FLUSH) (Changman Lee, 2016-02-13, 1 file, -1/+1)

    When doing sync_meta_pages with META_FLUSH at checkpoint time, we override
    rw using WRITE_FLUSH_FUA. At this time, we should also set
    REQ_META|REQ_PRIO.

    Change-Id: I079dc24c8d03451c942b150df7650c3967bb3ea1
    Signed-off-by: Changman Lee <cm224.lee@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

* f2fs: avoid f2fs_balance_fs call during pageout (Jaegeuk Kim, 2016-02-13, 1 file, -1/+3)

    This patch should resolve the following bug.

    =========================================================
    [ INFO: possible irq lock inversion dependency detected ]
    3.13.0-rc5.f2fs+ #6 Not tainted
    ---------------------------------------------------------
    kswapd0/41 just changed the state of lock:
     (&sbi->gc_mutex){+.+.-.}, at: [<ffffffffa030503e>] f2fs_balance_fs+0xae/0xd0 [f2fs]
    but this lock took another, RECLAIM_FS-READ-unsafe lock in the past:
     (&sbi->cp_rwsem){++++.?}

    and interrupts could create inverse lock ordering between them.

    other info that might help us debug this:
    Chain exists of:
      &sbi->gc_mutex --> &sbi->cp_mutex --> &sbi->cp_rwsem

     Possible interrupt unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&sbi->cp_rwsem);
                                   local_irq_disable();
                                   lock(&sbi->gc_mutex);
                                   lock(&sbi->cp_mutex);
      <Interrupt>
        lock(&sbi->gc_mutex);

     *** DEADLOCK ***

    This bug is due to the f2fs_balance_fs call in f2fs_write_data_page. If
    f2fs_write_data_page is triggered by wbc->for_reclaim via kswapd, it
    should not call f2fs_balance_fs, which tries to get a mutex grabbed by the
    original syscall flow.

    Change-Id: I795c071696885b5d048750fb10ae53594f896a2f
    Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

* f2fs: add delimiter to separate name and value in debug phrase (Changman Lee, 2016-02-13, 1 file, -4/+4)

    Support for f2fs-tools/tools/f2stat to monitor
    /sys/kernel/debug/f2fs/status.

    Change-Id: Ic97f5b9da15129a24959d0fbaeb91e1c84c7f1ac
    Signed-off-by: Changman Lee <cm224.lee@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

* f2fs: use spinlock rather than mutex for better speed (Gu Zheng, 2016-02-13, 2 files, -13/+13)

    With the 2 previous changes, all the long-running operations are moved out
    of the protection region, so here we can use a spinlock rather than a
    mutex (orphan_inode_mutex) for lower overhead.

    Change-Id: Iafe2320add741da9e119de9f30b0f90799d68b27
    Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

* Fix merge conflict (rogersb11, 2016-02-13, 1 file, -7/+9)

    Change-Id: I311a4ca284237c557bdbfbc37d1929c905d3c059

* f2fs: move alloc new orphan node out of lock protection region (Gu Zheng, 2016-02-13, 1 file, -6/+9)

    Move the allocation of a new orphan node out of the lock protection
    region.

    Change-Id: Iea8bae6e1561e1a0644416ae07177c8f165e5393
    Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

* f2fs: remove the needless parameter of f2fs_wait_on_page_writeback (Yuan Zhong, 2016-02-13, 6 files, -7/+7)

    The 'bool sync' parameter is never referenced in
    f2fs_wait_on_page_writeback, so we should remove this parameter.

    Change-Id: I14f0cb0eba62c04a53181674d254753b6cdc0f7f
    Signed-off-by: Yuan Zhong <yuan.mark.zhong@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

* f2fs: add a sysfs entry to control max_victim_search (Jaegeuk Kim, 2016-02-13, 4 files, -3/+12)

    Previously, during SSR and GC, the maximum number of retrials to find a
    victim segment was hard-coded by MAX_VICTIM_SEARCH, 4096 by default.

    This number affects IO locality when SSR mode is activated, which results
    in performance fluctuation on some low-end devices. If
    max_victim_search = 4, the victim will be searched like below. ("D"
    represents a dirty segment, and "*" indicates a selected victim segment.)

    D1 D2 D3 D4 D5 D6 D7 D8 D9
    [ * ] [ * ] [ * ] [ ....]

    This patch adds a sysfs entry to control the number dynamically through:
    /sys/fs/f2fs/$dev/max_victim_search

    Change-Id: I0f98609540351fd7451e3c23bc65f8d165ba13b6
    Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

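    Usage sketch for the new knob. The device component of the sysfs path
    varies per system ("mmcblk0p25" below is a placeholder; check
    /sys/fs/f2fs/ for the real one), and the value 64 is just an example:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Placeholder device name; adjust to match /sys/fs/f2fs/. */
        const char *node = "/sys/fs/f2fs/mmcblk0p25/max_victim_search";
        const char *val  = "64\n";

        int fd = open(node, O_WRONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (write(fd, val, strlen(val)) < 0)
            perror("write");
        close(fd);
        return 0;
    }
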
* f2fs: improve write performance under frequent fsync calls (Jaegeuk Kim, 2016-02-13, 4 files, -10/+11)

    When considering a bunch of data writes with very frequent fsync calls, we
    can observe the following performance regression.

    N: Node IO, D: Data IO, IO scheduler: cfq

    Issue    pending IOs
             D1 D2 D3 D4
    D1       D2 D3 D4 N1
    D2       D3 D4 N1 N2
    N1       D3 D4 N2 D1

    --> N1 can be selected by cfq because of the same priority of N and D.
    Then D3 and D4 would be delayed, resulting in performance degradation.

    So, when processing the fsync call, it'd be better to give higher priority
    to data IOs than node IOs by assigning WRITE and WRITE_SYNC respectively.
    This patch improves the random write performance with frequent fsync calls
    by up to 10%.

    Change-Id: I4812ac05db179d83914c7bb0942fa6738280e1ea
    Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

* f2fs: support 3.0 (arter97, 2016-02-13, 10 files, -44/+120)

    Initial backporting done by nowcomputing
    (https://github.com/nowcomputing/f2fs-backports.git).

    Additional patches required by upstream jaegeuk/f2fs.git/linux-3.4 done by
    arter97.

    Change-Id: Ibbd3a608857338482f974fa4b1a8d3c02c267d9f
    Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>

* f2fs: support 3.4 (Jaegeuk Kim, 2016-02-13, 11 files, -56/+49)

    Change-Id: I20b6a2877e072a4b9639001fa0198837cfa74aff
    Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

* fs: import f2fs from git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git (branch dev) (arter97, 2016-02-13, 26 files, -0/+14822)

    Up-to-date with git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git
    @04a17fb17fafada39f96bfb41ceb2dc1c11b2af6 (f2fs: avoid to read inline data
    except first page)

    Change-Id: I1fc76a61defd530c4e97587980ba43e98db6119e
    Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>

* ext3: ignore ext4-option nomblk_io_submit (Michael Gernoth, 2015-12-02, 1 file, -1/+6)

    Change-Id: I7b85e62f61aafbb5d46f8a049ffbeea021346353

* Merge remote-tracking branch 'korg/linux-3.0.y' into cm-13.0 (rogersb11, 2015-11-10, 74 files, -405/+544)

    Conflicts:
        crypto/algapi.c
        drivers/gpu/drm/i915/i915_debugfs.c
        drivers/gpu/drm/i915/intel_display.c
        drivers/video/fbmem.c
        include/linux/nls.h
        kernel/cgroup.c
        kernel/signal.c
        kernel/timeconst.pl
        net/ipv4/ping.c

    Change-Id: I1f532925d1743df74d66bcdd6fc92f05c72ee0dd

| * ext4: fix memory leak in xattr (Dave Jones, 2013-10-22, 1 file, -0/+2)

    commit 6e4ea8e33b2057b85d75175dd89b93f5e26de3bc upstream.

    If we take the 2nd retry path in ext4_expand_extra_isize_ea, we
    potentially return from the function without having freed these
    allocations. If we don't do the return, we over-write the previous
    allocation pointers, so we leak either way.

    Spotted with Coverity.

    [ Fixed by tytso to set is and bs to NULL after freeing these pointers, in
      case in the retry loop we later end up triggering an error causing a
      jump to cleanup, at which point we could have a double free bug. -- Ted ]

    Signed-off-by: Dave Jones <davej@fedoraproject.org>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Reviewed-by: Eric Sandeen <sandeen@redhat.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

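    A generic C sketch (not the ext4 code itself) of the pattern Ted's note
    describes: in a retry loop that shares a cleanup label, free the buffers
    before retrying so they are not leaked, and NULL the pointers so that any
    path that reaches cleanup before they are reassigned cannot free them
    twice. The names is/bs mirror the commit message; everything else is
    illustrative:

    #include <stdlib.h>

    struct xattr_buf { char data[64]; };

    int do_with_retry(int (*attempt)(struct xattr_buf *, struct xattr_buf *))
    {
        struct xattr_buf *is = NULL, *bs = NULL;
        int err = -1, tries = 2;

    retry:
        is = calloc(1, sizeof(*is));
        bs = calloc(1, sizeof(*bs));
        if (!is || !bs)
            goto cleanup;           /* free(NULL) is a no-op, so this is safe */

        err = attempt(is, bs);
        if (err && --tries) {
            /* Free before retrying (fixes the leak) and NULL the pointers
             * so a later jump to cleanup cannot see stale values (the
             * double-free hazard Ted mentions in the real function, where
             * more code sits between here and the reallocation). */
            free(is);
            free(bs);
            is = NULL;
            bs = NULL;
            goto retry;
        }

    cleanup:
        free(is);
        free(bs);
        return err;
    }
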
| * vfs: allow O_PATH file descriptors for fstatfs() (Linus Torvalds, 2013-10-22, 1 file, -1/+1)

    commit 9d05746e7b16d8565dddbe3200faa1e669d23bbf upstream.

    Olga reported that file descriptors opened with O_PATH do not work with
    fstatfs(), found during further development of ksh93's thread support.

    There is no reason to not allow O_PATH file descriptors here (fstatfs is
    very much a path operation), so use "fdget_raw()".

    See commit 55815f70147d ("vfs: make O_PATH file descriptors usable for
    'fstat()'") for a very similar issue reported for fstat() by the same
    team.

    Reported-and-tested-by: ольга крыжановская <olga.kryzhanovska@gmail.com>
    Acked-by: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

| * ext4: avoid hang when mounting non-journal filesystems with orphan list (Theodore Ts'o, 2013-10-13, 1 file, -1/+2)

    commit 0e9a9a1ad619e7e987815d20262d36a2f95717ca upstream.

    When trying to mount a file system which does not contain a journal, but
    which does have an orphan list containing an inode which needs to be
    truncated, the mount call will hang forever in ext4_orphan_cleanup()
    because ext4_orphan_del() will return immediately without removing the
    inode from the orphan list, leading to an uninterruptible loop in kernel
    code which will busy out one of the CPU's on the system.

    This can be trivially reproduced by trying to mount the file system found
    in tests/f_orphan_extents_inode/image.gz from the e2fsprogs source tree.
    If a malicious user were to put this on a USB stick, and mount it on a
    Linux desktop which has automatic mounts enabled, this could be considered
    a potential denial of service attack. (Not a big deal in practice, but
    professional paranoids worry about such things, and have even been known
    to allocate CVE numbers for such problems.)

    -js: This is a fix for CVE-2013-2015.

    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
    Acked-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

| * Btrfs: change how we queue blocks for backref checking (Josef Bacik, 2013-10-13, 1 file, -7/+7)

    commit b6c60c8018c4e9beb2f83fc82c09f9d033766571 upstream.

    Previously we only added blocks to the list to have their backrefs checked
    if the level of the block is right above the one we are searching for.
    This is because we want to make sure we don't add the entire path up to
    the root to the lists to make sure we process things one at a time. This
    assumes that if any blocks in the path to the root are going to be not
    checked (shared, in other words) then they will be in the level right
    above the current block on up. This isn't quite right though, since we can
    have blocks higher up the list that are shared because they are attached
    to a reloc root. But we won't add this block to be checked and then later
    on we will BUG_ON(!upper->checked).

    So instead keep track of whether or not we've queued a block to be checked
    in this current search, and if we haven't, go ahead and queue it to be
    checked. This patch fixed the panic I was seeing where we
    BUG_ON(!upper->checked). Thanks,

    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

| * splice: fix racy pipe->buffers uses (Eric Dumazet, 2013-10-05, 1 file, -15/+20)

    commit 047fe3605235888f3ebcda0c728cb31937eadfe6 upstream.

    Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered by
    splice_shrink_spd() called from vmsplice_to_pipe().

    Commit 35f3d14dbbc5 (pipe: add support for shrinking and growing pipes)
    added the capability to adjust pipe->buffers. The problem is that some
    paths don't hold the pipe mutex and assume pipe->buffers doesn't change
    for their duration.

    Fix this by adding an nr_pages_max field in struct splice_pipe_desc, and
    use it in place of pipe->buffers where appropriate.
    splice_shrink_spd() loses its struct pipe_inode_info argument.

    Reported-by: Dave Jones <davej@redhat.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Tom Herbert <therbert@google.com>
    Cc: stable <stable@vger.kernel.org> # 2.6.35
    Tested-by: Dave Jones <davej@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

| * fanotify: don't merge permission events (Lino Sanfilippo, 2013-10-01, 1 file, -0/+6)

    commit 03a1cec1f17ac1a6041996b3e40f96b5a2f90e1b upstream.

    Boyd Yang reported a problem for the case that multiple threads of the
    same thread group are waiting for a response for a permission event. In
    this case it is possible that some of the threads are never woken up, even
    if the response for the event has been received (see
    http://marc.info/?l=linux-kernel&m=131822913806350&w=2).

    The reason is that we are currently merging permission events if they
    belong to the same thread group. But we are not prepared to wake up more
    than one waiter for each event. We do

        wait_event(group->fanotify_data.access_waitq, event->response ||
                   atomic_read(&group->fanotify_data.bypass_perm));

    and after that

        event->response = 0;

    which is the reason that even if we woke up all waiters for the same
    event, some of them may see event->response already set to 0 again, then
    go back to sleep and block forever.

    With this patch we avoid that more than one thread is waiting for a
    response by not merging permission events for the same thread group any
    more.

    Reported-by: Boyd Yang <boyd.yang@gmail.com>
    Signed-off-by: Lino Sanfilippo <LinoSanfilipp@gmx.de>
    Signed-off-by: Eric Paris <eparis@redhat.com>
    Cc: Mihai Donțu <mihai.dontu@gmail.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

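    For context, a minimal sketch of the permission-event round trip whose
    waiters this patch is about: a privileged listener marks a mount for
    FAN_OPEN_PERM, reads an event, and writes back FAN_ALLOW; every thread
    blocked in open() on that mount waits for exactly this response. The
    watched path "/mnt" is a placeholder:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/fanotify.h>
    #include <unistd.h>

    int main(void)
    {
        /* Needs CAP_SYS_ADMIN. */
        int fd = fanotify_init(FAN_CLASS_CONTENT, O_RDONLY);
        if (fd < 0) {
            perror("fanotify_init");
            return 1;
        }

        if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_MOUNT, FAN_OPEN_PERM,
                          AT_FDCWD, "/mnt") < 0) {
            perror("fanotify_mark");
            return 1;
        }

        for (;;) {
            struct fanotify_event_metadata ev;

            if (read(fd, &ev, sizeof(ev)) < (ssize_t)sizeof(ev))
                continue;
            if (ev.mask & FAN_OPEN_PERM) {
                /* The opener(s) stay blocked until this response arrives. */
                struct fanotify_response resp = {
                    .fd = ev.fd,
                    .response = FAN_ALLOW,
                };
                write(fd, &resp, sizeof(resp));
            }
            close(ev.fd);
        }
    }
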
| * fuse: invalidate inode attributes on xattr modification (Anand Avati, 2013-09-26, 1 file, -0/+4)

    commit d331a415aef98717393dda0be69b7947da08eba3 upstream.

    Calls like setxattr and removexattr result in an update of ctime.
    Therefore invalidate the inode attributes to force a refresh.

    Signed-off-by: Anand Avati <avati@redhat.com>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

| * fuse: postpone end_page_writeback() in fuse_writepage_locked() (Maxim Patlasov, 2013-09-26, 1 file, -1/+2)

    commit 4a4ac4eba1010ef9a804569058ab29e3450c0315 upstream.

    The patch fixes a race between ftruncate(2), mmap-ed write and write(2):

    1) A user makes a page dirty via an mmap-ed write.
    2) The user performs a shrinking truncate(2) intended to purge the page.
    3) Before fuse_do_setattr calls truncate_pagecache, the page goes to
       writeback. fuse_writepage_locked fills the FUSE_WRITE request and
       releases the original page by end_page_writeback.
    4) fuse_do_setattr() completes and successfully returns. From now on,
       i_mutex is free.
    5) An ordinary write(2) extends i_size back to cover the page. Note that
       fuse_send_write_pages does wait for fuse writeback, but for another
       page->index.
    6) fuse_writepage_locked proceeds by queueing the FUSE_WRITE request.
       fuse_send_writepage is supposed to crop inarg->size of the request, but
       it doesn't, because i_size has already been extended back.

    Moving end_page_writeback to the end of fuse_writepage_locked fixes the
    race, because now the fact that truncate_pagecache has successfully
    returned infers that fuse_writepage_locked has already called
    end_page_writeback. And this, in turn, infers that fuse_flush_writepages
    has already called fuse_send_writepage, and the latter used the valid
    (shrunk) i_size. write(2) could not extend it because of the i_mutex held
    by ftruncate(2).

    Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
    Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>