| Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"As usual, many cleanups. The below blurbiage describes 42 patchsets.
21 of those are partially or fully cleanup work. "cleans up",
"cleanup", "maintainability", "rationalizes", etc.
I never knew the MM code was so dirty.
"mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes)
addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly
mapped VMAs were not eligible for merging with existing adjacent
VMAs.
"mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park)
adds a new kernel module which simplifies the setup and usage of
DAMON in production environments.
"stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig)
is a cleanup to the writeback code which removes a couple of
pointers from struct writeback_control.
"drivers/base/node.c: optimization and cleanups" (Donet Tom)
contains largely uncorrelated cleanups to the NUMA node setup and
management code.
"mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman)
does some maintenance work on the userfaultfd code.
"Readahead tweaks for larger folios" (Ryan Roberts)
implements some tuneups for pagecache readahead when it is reading
into order>0 folios.
"selftests/mm: Tweaks to the cow test" (Mark Brown)
provides some cleanups and consistency improvements to the
selftests code.
"Optimize mremap() for large folios" (Dev Jain)
does that. A 37% reduction in execution time was measured in a
memset+mremap+munmap microbenchmark.
"Remove zero_user()" (Matthew Wilcox)
expunges zero_user() in favor of the more modern memzero_page().
"mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand)
addresses some warts which David noticed in the huge page code.
These were not known to be causing any issues at this time.
"mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park)
provides some cleanup and consolidation work in DAMON.
"use vm_flags_t consistently" (Lorenzo Stoakes)
uses vm_flags_t in places where we were inappropriately using other
types.
"mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy)
increases the reliability of large page allocation in the memfd
code.
"mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple)
removes several now-unneeded PFN_* flags.
"mm/damon: decouple sysfs from core" (SeongJae Park)
implememnts some cleanup and maintainability work in the DAMON
sysfs layer.
"madvise cleanup" (Lorenzo Stoakes)
does quite a lot of cleanup/maintenance work in the madvise() code.
"madvise anon_name cleanups" (Vlastimil Babka)
provides additional cleanups on top or Lorenzo's effort.
"Implement numa node notifier" (Oscar Salvador)
creates a standalone notifier for NUMA node memory state changes.
Previously these were lumped under the more general memory
on/offline notifier.
"Make MIGRATE_ISOLATE a standalone bit" (Zi Yan)
cleans up the pageblock isolation code and fixes a potential issue
which doesn't seem to cause any problems in practice.
"selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park)
adds additional drgn- and python-based DAMON selftests which are
more comprehensive than the existing selftest suite.
"Misc rework on hugetlb faulting path" (Oscar Salvador)
fixes a rather obscure deadlock in the hugetlb fault code and
follows that fix with a series of cleanups.
"cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport)
rationalizes and cleans up the highmem-specific code in the CMA
allocator.
"mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand)
provides cleanups and future-preparedness to the migration code.
"mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park)
adds some tracepoints to some DAMON auto-tuning code.
"mm/damon: fix misc bugs in DAMON modules" (SeongJae Park)
does that.
"mm/damon: misc cleanups" (SeongJae Park)
also does what it claims.
"mm: folio_pte_batch() improvements" (David Hildenbrand)
cleans up the large folio PTE batching code.
"mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park)
facilitates dynamic alteration of DAMON's inter-node allocation
policy.
"Remove unmap_and_put_page()" (Vishal Moola)
provides a couple of page->folio conversions.
"mm: per-node proactive reclaim" (Davidlohr Bueso)
implements a per-node control of proactive reclaim - beyond the
current memcg-based implementation.
"mm/damon: remove damon_callback" (SeongJae Park)
replaces the damon_callback interface with a more general and
powerful damon_call()+damos_walk() interface.
"mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes)
implements a number of mremap cleanups (of course) in preparation
for adding new mremap() functionality: newly permit the remapping
of multiple VMAs when the user is specifying MREMAP_FIXED. It still
excludes some specialized situations where this cannot be performed
reliably.
"drop hugetlb_free_pgd_range()" (Anthony Yznaga)
switches some sparc hugetlb code over to the generic version and
removes the thus-unneeded hugetlb_free_pgd_range().
"mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park)
augments the present userspace-requested update of DAMON sysfs
monitoring files. Automatic update is now provided, along with a
tunable to control the update interval.
"Some randome fixes and cleanups to swapfile" (Kemeng Shi)
does what is claims.
"mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand)
provides (and uses) a means by which debug-style functions can grab
a copy of a pageframe and inspect it locklessly without tripping
over the races inherent in operating on the live pageframe
directly.
"use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan)
addresses the large contention issues which can be triggered by
reads from that procfs file. Latencies are reduced by more than
half in some situations. The series also introduces several new
selftests for the /proc/pid/maps interface.
"__folio_split() clean up" (Zi Yan)
cleans up __folio_split()!
"Optimize mprotect() for large folios" (Dev Jain)
provides some quite large (>3x) speedups to mprotect() when dealing
with large folios.
"selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian)
does some cleanup work in the selftests code.
"tools/testing: expand mremap testing" (Lorenzo Stoakes)
extends the mremap() selftest in several ways, including adding
more checking of Lorenzo's recently added "permit mremap() move of
multiple VMAs" feature.
"selftests/damon/sysfs.py: test all parameters" (SeongJae Park)
extends the DAMON sysfs interface selftest so that it tests all
possible user-requested parameters. Rather than the present minimal
subset"
* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
MAINTAINERS: add missing headers to mempory policy & migration section
MAINTAINERS: add missing file to cgroup section
MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
MAINTAINERS: add missing zsmalloc file
MAINTAINERS: add missing files to page alloc section
MAINTAINERS: add missing shrinker files
MAINTAINERS: move memremap.[ch] to hotplug section
MAINTAINERS: add missing mm_slot.h file THP section
MAINTAINERS: add missing interval_tree.c to memory mapping section
MAINTAINERS: add missing percpu-internal.h file to per-cpu section
mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
selftests/damon: introduce _common.sh to host shared function
selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
selftests/damon/sysfs.py: test non-default parameters runtime commit
selftests/damon/sysfs.py: generalize DAMON context commit assertion
selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
selftests/damon/sysfs.py: test DAMOS filters commitment
selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
selftests/damon/sysfs.py: test DAMOS destinations commitment
...
|
|
memzero_page() is the new name for zero_user().
Link: https://lkml.kernel.org/r/20250612143443.2848197-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
VFS has switched to i_rwsem for ten years now (9902af79c01a: parallel
lookups actual switch to rwsem), but the VFS documentation and comments
still has references to i_mutex.
Signed-off-by: Junxuan Liao <ljx@cs.wisc.edu>
Link: https://lore.kernel.org/72223729-5471-474a-af3c-f366691fba82@cs.wisc.edu
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
After commit c22198e78d52 ("direct-io: remove random prefetches"), Nothing
in this file needs anything from `linux/prefetch.h`.
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Link: https://lore.kernel.org/r/20240603014834.45294-1-youling.tang@linux.dev
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The variable retval is being assigned a value that is not being read,
it is being re-assigned later on in the function. The assignment
is redundant and can be removed.
Cleans up clang scan build warning:
fs/direct-io.c:1220:2: warning: Value stored to 'retval' is never
read [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20240410162221.292485-1-colin.i.king@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Restore support for passing data lifetime information from filesystems to
block drivers. This patch reverts commit b179c98f7697 ("block: Remove
request.write_hint") and commit c75e707fe1aa ("block: remove the
per-bio/request write hint").
This patch does not modify the size of struct bio because the new
bi_write_hint member fills a hole in struct bio. pahole reports the
following for struct bio on an x86_64 system with this patch applied:
/* size: 112, cachelines: 2, members: 20 */
/* sum members: 110, holes: 1, sum holes: 2 */
/* last cacheline: 48 bytes */
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240202203926.2478590-7-bvanassche@acm.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Sparse static analysis tools generate a warning with this message
"Using plain integer as NULL pointer". In this case this warning is
being shown because we are trying to initialize pointer to NULL using
integer value 0.
Signed-off-by: Abhinav Singh <singhabhinav9051571833@gmail.com>
Link: https://lore.kernel.org/r/20231108044550.1006555-1-singhabhinav9051571833@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
__read_mostly predates __ro_after_init. Many variables which are marked
__read_mostly should have been __ro_after_init from day 1.
Also, mark some stuff as "const" and "__init" while I'm at it.
[akpm@linux-foundation.org: revert sysctl_nr_open_min, sysctl_nr_open_max changes due to arm warning]
[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/4f6bb9c0-abba-4ee4-a7aa-89265e886817@p183
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
- Yosry Ahmed brought back some cgroup v1 stats in OOM logs
- Yosry has also eliminated cgroup's atomic rstat flushing
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning
- Lorenzo Stoakes has done some maintanance work on the
get_user_pages() interface
- Liam Howlett continues with cleanups and maintenance work to the
maple tree code. Peng Zhang also does some work on maple tree
- Johannes Weiner has done some cleanup work on the compaction code
- David Hildenbrand has contributed additional selftests for
get_user_pages()
- Thomas Gleixner has contributed some maintenance and optimization
work for the vmalloc code
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting
- Christoph Hellwig has some cleanups for the filemap/directio code
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the
provided APIs rather than open-coding accesses
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings
- John Hubbard has a series of fixes to the MM selftesting code
- ZhangPeng continues the folio conversion campaign
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
from 128 to 8
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code
- Vishal Moola also has done some folio conversion work
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch
* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
mm/hugetlb: remove hugetlb_set_page_subpool()
mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
hugetlb: revert use of page_cache_next_miss()
Revert "page cache: fix page_cache_next/prev_miss off by one"
mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
mm: memcg: rename and document global_reclaim()
mm: kill [add|del]_page_to_lru_list()
mm: compaction: convert to use a folio in isolate_migratepages_block()
mm: zswap: fix double invalidate with exclusive loads
mm: remove unnecessary pagevec includes
mm: remove references to pagevec
mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
mm: remove struct pagevec
net: convert sunrpc from pagevec to folio_batch
i915: convert i915_gpu_error to use a folio_batch
pagevec: rename fbatch_count()
mm: remove check_move_unevictable_pages()
drm: convert drm_gem_put_pages() to use a folio_batch
i915: convert shmem_sg_free_table() to use a folio_batch
scatterlist: add sg_set_folio()
...
|
|
Fix dio_bio_cleanup() to advance the head index into the list of pages past
the pages it has released, as __blockdev_direct_IO() will call it twice if
do_direct_IO() fails.
The issue was causing:
WARNING: CPU: 6 PID: 2220 at mm/gup.c:76 try_get_folio
This can be triggered by setting up a clean pair of UDF filesystems on
loopback devices and running the generic/451 xfstest with them as the
scratch and test partitions. Something like the following:
fallocate /mnt2/udf_scratch -l 1G
fallocate /mnt2/udf_test -l 1G
mknod /dev/lo0 b 7 0
mknod /dev/lo1 b 7 1
losetup lo0 /mnt2/udf_scratch
losetup lo1 /mnt2/udf_test
mkfs -t udf /dev/lo0
mkfs -t udf /dev/lo1
cd xfstests
./check generic/451
with xfstests configured by putting the following into local.config:
export FSTYP=udf
export DISABLE_UDF_TEST=1
export TEST_DEV=/dev/lo1
export TEST_DIR=/xfstest.test
export SCRATCH_DEV=/dev/lo0
export SCRATCH_MNT=/xfstest.scratch
Fixes: 1ccf164ec866 ("block: Use iov_iter_extract_pages() and page pinning in direct-io.c")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202306120931.a9606b88-oliver.sang@intel.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/1193485.1686693279@warthog.procyon.org.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a helper to invalidate page cache after a dio write.
Link: https://lkml.kernel.org/r/20230601145904.1385409-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Change the old block-based direct-I/O code to use iov_iter_extract_pages()
to pin user pages or leave kernel pages unpinned rather than taking refs
when submitting bios.
This makes use of the preceding patches to not take pins on the zero page
(thereby allowing insertion of zero pages in with pinned pages) and to get
additional pins on pages, allowing an extracted page to be used in multiple
bios without having to re-extract it.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Lorenzo Stoakes <lstoakes@gmail.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230526214142.958751-4-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Replace BIO_NO_PAGE_REF with a BIO_PAGE_REFFED flag that has the inverted
meaning is only set when a page reference has been acquired that needs to
be released by bio_release_pages().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230522205744.2825689-4-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
always NULL...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
sb_init_dio_done_wq is also used by the iomap code, so move it to
super.c in preparation for building direct-io.c conditionally.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230125065839.191256-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
PSI accounting is now done by the VM code, where it should have been
since the beginning.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220915094200.139713-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Most of the users immediately follow successful iov_iter_get_pages()
with advancing by the amount it had returned.
Provide inline wrappers doing that, convert trivial open-coded
uses of those.
BTW, iov_iter_get_pages() never returns more than it had been asked
to; such checks in cifs ought to be removed someday...
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Equivalent of single-segment iovec. Initialized by iov_iter_ubuf(),
checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
ones.
We are going to expose the things like ->write_iter() et.al. to those
in subsequent commits.
New predicate (user_backed_iter()) that is true for ITER_IOVEC and
ITER_UBUF; places like direct-IO handling should use that for
checking that pages we modify after getting them from iov_iter_get_pages()
would need to be dirtied.
DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
will solve all problems - there's code that uses iter_is_iovec() to
decide how to poke around in iov_iter guts and for that the predicate
replacement obviously won't suffice.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs iov_iter updates from Al Viro:
"Part 1 - isolated cleanups and optimizations.
One of the goals is to reduce the overhead of using ->read_iter() and
->write_iter() instead of ->read()/->write().
new_sync_{read,write}() has a surprising amount of overhead, in
particular inside iocb_flags(). That's the explanation for the
beginning of the series is in this pile; it's not directly
iov_iter-related, but it's a part of the same work..."
* tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
first_iovec_segment(): just return address
iov_iter: massage calling conventions for first_{iovec,bvec}_segment()
iov_iter: first_{iovec,bvec}_segment() - simplify a bit
iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment()
iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT
iov_iter_bvec_advance(): don't bother with bvec_iter
copy_page_{to,from}_iter(): switch iovec variants to generic
keep iocb_flags() result cached in struct file
iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC
struct file: use anonymous union member for rcuhead and llist
btrfs: use IOMAP_DIO_NOSYNC
teach iomap_dio_rw() to suppress dsync
No need of likely/unlikely on calls of check_copy_size()
|
|
Reduce the size of struct dio by combining the 'op' and 'op_flags' into
the new 'opf' member. Use the new blk_opf_t type to improve static type
checking. This patch does not change any functionality.
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-49-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
New helper to be used instead of direct checks for IOCB_DSYNC:
iocb_is_dsync(iocb). Checks converted, which allows to avoid
the IS_SYNC(iocb->ki_filp->f_mapping->host) part (4 cache lines)
from iocb_flags() - it's checked in iocb_is_dsync() instead
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Randomly poking into block device internals for manual prefetches isn't
exactly a very maintainable thing to do. And none of the performance
critical direct I/O implementations still use this library function
anyway, so just drop it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220415045258.199825-28-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
With the NVMe support for this gone, there are no consumers of these hints
left, so remove them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pass the block_device and operation that we plan to use this bio for to
bio_alloc to optimize the assignment. NULL/0 can be passed, both for the
passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.
Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The second argument was only used by the USB gadget code, yet everyone
pays the overhead of passing a zero to be passed into aio, where it
ends up being part of the aio res2 value.
Now that everybody is passing in zero, kill off the extra argument.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The polling support in the legacy direct-io support is a little crufty.
It already doesn't support the asynchronous polling needed for io_uring
polling, and is hard to adopt to upcoming changes in the polling
interfaces. Given that all the major file systems already use the iomap
direct I/O code, just drop the polling support.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
I encountered a hung task issue, but not a performance one. I run DIO
on a device (need lba continuous, for example open channel ssd), maybe
hungtask in below case:
DIO: Checkpoint:
get addr A(at boundary), merge into BIO,
no submit because boundary missing
flush dirty data(get addr A+1), wait IO(A+1)
writeback timeout, because DIO(A) didn't submit
get addr A+2 fail, because checkpoint is doing
dio_send_cur_page() may clear sdio->boundary, so prevent it from missing
a boundary.
Link: https://lkml.kernel.org/r/20210322042253.38312-1-jack.qiu@huawei.com
Fixes: b1058b981272 ("direct-io: submit bio after boundary buffer is added to it")
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull more block updates from Jens Axboe:
"A few stragglers (and one due to me missing it originally), and fixes
for changes in this merge window mostly. In particular:
- blktrace cleanups (Chaitanya, Greg)
- Kill dead blk_pm_* functions (Bart)
- Fixes for the bio alloc changes (Christoph)
- Fix for the partition changes (Christoph, Ming)
- Fix for turning off iopoll with polled IO inflight (Jeffle)
- nbd disconnect fix (Josef)
- loop fsync error fix (Mauricio)
- kyber update depth fix (Yang)
- max_sectors alignment fix (Mikulas)
- Add bio_max_segs helper (Matthew)"
* tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-block: (21 commits)
block: Add bio_max_segs
blktrace: fix documentation for blk_fill_rw()
block: memory allocations in bounce_clone_bio must not fail
block: remove the gfp_mask argument to bounce_clone_bio
block: fix bounce_clone_bio for passthrough bios
block-crypto-fallback: use a bio_set for splitting bios
block: fix logging on capacity change
blk-settings: align max_sectors on "logical_block_size" boundary
block: reopen the device in blkdev_reread_part
block: don't skip empty device in in disk_uevent
blktrace: remove debugfs file dentries from struct blk_trace
nbd: handle device refs for DESTROY_ON_DISCONNECT properly
kyber: introduce kyber_depth_updated()
loop: fix I/O error on fsync() in detached loop devices
block: fix potential IO hang when turning off io_poll
block: get rid of the trace rq insert wrapper
blktrace: fix blk_rq_merge documentation
blktrace: fix blk_rq_issue documentation
blktrace: add blk_fill_rwbs documentation comment
block: remove superfluous param in blk_fill_rwbs()
...
|
|
It's often inconvenient to use BIO_MAX_PAGES due to min() requiring the
sign to be the same. Introduce bio_max_segs() and change BIO_MAX_PAGES to
be unsigned to make it easier for the users.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Delete duplicate words in fs/*.c.
The doubled words that are being dropped are:
that, be, the, in, and, for
Link: https://lkml.kernel.org/r/20201224052810.25315-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Direct IO does not operate on the current working set of pages managed
by the kernel, so it should not be accounted as memory stall to PSI
infrastructure.
The block layer and iomap direct IO use bio_iov_iter_get_pages()
to build bios, and they are the only users of it, so to avoid PSI
tracking for them clear out BIO_WORKINGSET flag. Do same for
dio_bio_submit() because fs/direct_io constructs bios by hand directly
calling bio_add_page().
Reported-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Replace the gendisk pointer in struct bio with a pointer to the newly
improved struct block device. From that the gendisk can be trivially
accessed with an extra indirection, but it also allows to directly
look up all information related to partition remapping.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull direct-io fix from Jan Kara:
"Fix for unaligned direct IO read past EOF in legacy DIO code"
* tag 'dio_for_v5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
direct-io: defer alignment check until after the EOF check
direct-io: don't force writeback for reads beyond EOF
direct-io: clean up error paths of do_blockdev_direct_IO
|
|
Prior to commit 9fe55eea7e4b ("Fix race when checking i_size on direct
i/o read"), an unaligned direct read past end of file would trigger EOF,
since generic_file_aio_read detected this read-at-EOF condition and
skipped the direct IO read entirely, returning 0. After that change, the
read now reaches dio_generic, which detects the misalignment and returns
EINVAL.
This consolidates the generic direct-io to follow the same behavior of
filesystems. Apparently, this fix will only affect ocfs2 since other
filesystems do this verification before calling do_blockdev_direct_IO,
with the exception of f2fs, which has the same bug, but is fixed in the
next patch.
it can be verified by a read loop on a file that does a partial read
before EOF (On file that doesn't end at an aligned address). The
following code fails on an unaligned file on filesystems without
prior validation without this patch, but not on btrfs, ext4, and xfs.
while (done < total) {
ssize_t delta = pread(fd, buf + done, total - done, off + done);
if (!delta)
break;
...
}
Fix this regression by moving the misalignment check to after the EOF
check added by commit 74cedf9b6c60 ("direct-io: Fix negative return from
dio read beyond eof").
Based on a patch by Jamie Liu.
Link: https://lore.kernel.org/r/20201008062620.2928326-4-krisman@collabora.com
Reported-by: Jamie Liu <jamieliu@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
If a DIO read starts past EOF, the kernel won't attempt it, so we don't
need to flush dirty pages before failing the syscall.
Link: https://lore.kernel.org/r/20201008062620.2928326-3-krisman@collabora.com
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
In preparation to resort DIO checks, reduce code duplication of error
handling in do_blockdev_direct_IO.
Link: https://lore.kernel.org/r/20201008062620.2928326-2-krisman@collabora.com
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Since we removed the last user of dio_end_io() when btrfs got converted
to iomap infrastructure ("btrfs: switch to iomap for direct IO"), remove
the helper function dio_end_io().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Just use bd_disk->queue instead.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This reverts the direct io port to iomap infrastructure of btrfs
merged in the first pull request. We found problems in invalidate page
that don't seem to be fixable as regressions or without changing iomap
code that would not affect other filesystems.
There are four reverts in total, but three of them are followup
cleanups needed to revert a43a67a2d715 cleanly. The result is the
buffer head based implementation of direct io.
Reverts are not great, but under current circumstances I don't see
better options"
* tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Revert "btrfs: switch to iomap_dio_rw() for dio"
Revert "fs: remove dio_end_io()"
Revert "btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK"
Revert "btrfs: split btrfs_direct_IO to read and write part"
|
|
This reverts commit b75b7ca7c27dfd61dba368f390b7d4dc20b3a8cb.
The patch restores a helper that was not necessary after direct IO port
to iomap infrastructure, which gets reverted.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Highlights:
- speedup dead root detection during orphan cleanup, eg. when there
are many deleted subvolumes waiting to be cleaned, the trees are
now looked up in radix tree instead of a O(N^2) search
- snapshot creation with inherited qgroup will mark the qgroup
inconsistent, requires a rescan
- send will emit file capabilities after chown, this produces a
stream that does not need postprocessing to set the capabilities
again
- direct io ported to iomap infrastructure, cleaned up and simplified
code, notably removing last use of struct buffer_head in btrfs code
Core changes:
- factor out backreference iteration, to be used by ordinary
backreferences and relocation code
- improved global block reserve utilization
* better logic to serialize requests
* increased maximum available for unlink
* improved handling on large pages (64K)
- direct io cleanups and fixes
* simplify layering, where cloned bios were unnecessarily created
for some cases
* error handling fixes (submit, endio)
* remove repair worker thread, used to avoid deadlocks during
repair
- refactored block group reading code, preparatory work for new type
of block group storage that should improve mount time on large
filesystems
Cleanups:
- cleaned up (and slightly sped up) set/get helpers for metadata data
structure members
- root bit REF_COWS got renamed to SHAREABLE to reflect the that the
blocks of the tree get shared either among subvolumes or with the
relocation trees
Fixes:
- when subvolume deletion fails due to ENOSPC, the filesystem is not
turned read-only
- device scan deals with devices from other filesystems that changed
ownership due to overwrite (mkfs)
- fix a race between scrub and block group removal/allocation
- fix long standing bug of a runaway balance operation, printing the
same line to the syslog, caused by a stale status bit on a reloc
tree that prevented progress
- fix corrupt log due to concurrent fsync of inodes with shared
extents
- fix space underflow for NODATACOW and buffered writes when it for
some reason needs to fallback to COW mode"
* tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (133 commits)
btrfs: fix space_info bytes_may_use underflow during space cache writeout
btrfs: fix space_info bytes_may_use underflow after nocow buffered write
btrfs: fix wrong file range cleanup after an error filling dealloc range
btrfs: remove redundant local variable in read_block_for_search
btrfs: open code key_search
btrfs: split btrfs_direct_IO to read and write part
btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
fs: remove dio_end_io()
btrfs: switch to iomap_dio_rw() for dio
iomap: remove lockdep_assert_held()
iomap: add a filesystem hook for direct I/O bio submission
fs: export generic_file_buffered_read()
btrfs: turn space cache writeout failure messages into debug messages
btrfs: include error on messages about failure to write space/inode caches
btrfs: remove useless 'fail_unlock' label from btrfs_csum_file_blocks()
btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums
btrfs: make checksum item extension more efficient
btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents
btrfs: unexport btrfs_compress_set_level()
btrfs: simplify iget helpers
...
|
|
Since we removed the last user of dio_end_io(), remove the helper
function dio_end_io().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Sync dio could be big, or may take long time in discard or in case of
IO failure.
We have prevented task hung in submit_bio_wait() and blk_execute_rq(),
so apply the same trick for prevent task hung from happening in sync dio.
Add helper of blk_io_schedule() and use io_schedule_timeout() to prevent
task hung warning.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Cc: Salman Qazi <sqazi@google.com>
Cc: Jesse Barnes <jsbarnes@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Include fs/internal.h to address the following 'sparse' warning:
fs/direct-io.c:591:5: warning: symbol 'sb_init_dio_done_wq' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20191209234544.128302-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This helper prints warning if direct I/O write failed to invalidate cache,
and set EIO at inode to warn usersapce about possible data corruption.
See also commit 5a9d929d6e13 ("iomap: report collisions between directio
and buffered writes to userspace").
Direct I/O is supported by non-disk filesystems, for example NFS. Thus
generic code needs this even in kernel without CONFIG_BLOCK.
Link: http://lkml.kernel.org/r/157270038074.4812.7980855544557488880.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix kernel-doc warning in fs/direct-io.c:
fs/direct-io.c:258: warning: Excess function parameter 'offset' description in 'dio_complete'
Also, don't mark this function as having kernel-doc notation since it is
not exported.
Link: http://lkml.kernel.org/r/97908511-4328-4a56-17fe-f43a1d7aa470@infradead.org
Fixes: 6d544bb4d901 ("dio: centralize completion in dio_complete()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Zach Brown <zab@zabbo.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Use bio_release_pages instead of duplicating it.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
We only have two callers that need the integer loop iterator, and they
can easily maintain it themselves.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
then we can allow bio_for_each_segment_all() to iterate over multi-page bvec.
Given it is just one mechannical & simple change on all bio_for_each_segment_all()
users, this patch does tree-wide change in one single patch, so that we can
avoid to use a temporary helper for this conversion.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
On a DIO_SKIP_HOLES filesystem, the ->get_block() method is currently
not allowed to create blocks for an empty inode. This confusion comes
from trying to bit shift a negative number, so check the size of the
inode first.
The problem is most visible for hfsplus, because the fallback to
buffered I/O doesn't happen and the write fails with EIO. This is in
part the fault of the module, because it gives a wrong return value on
->get_block(); that will be fixed in a separate patch.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull in v4.20-rc5, solving a conflict we'll otherwise get in aio.c and
also getting the merge fix that went into mainline that users are
hitting testing for-4.21/block and/or for-next.
* tag 'v4.20-rc5': (664 commits)
Linux 4.20-rc5
PCI: Fix incorrect value returned from pcie_get_speed_cap()
MAINTAINERS: Update linux-mips mailing list address
ocfs2: fix potential use after free
mm/khugepaged: fix the xas_create_range() error path
mm/khugepaged: collapse_shmem() do not crash on Compound
mm/khugepaged: collapse_shmem() without freezing new_page
mm/khugepaged: minor reorderings in collapse_shmem()
mm/khugepaged: collapse_shmem() remember to clear holes
mm/khugepaged: fix crashes due to misaccounted holes
mm/khugepaged: collapse_shmem() stop if punched or truncated
mm/huge_memory: fix lockdep complaint on 32-bit i_size_read()
mm/huge_memory: splitting set mapping+index before unfreeze
mm/huge_memory: rename freeze_page() to unmap_page()
initramfs: clean old path before creating a hardlink
kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace
psi: make disabling/enabling easier for vendor kernels
proc: fixup map_files test on arm
debugobjects: avoid recursive calls with kmemleak
userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not set
...
|
|
commit e259221763a40403d5bb232209998e8c45804ab8 ("fs: simplify the
generic_write_sync prototype") reworked callers of generic_write_sync(),
and ended up dropping the error return for the directio path. Prior to
that commit, in dio_complete(), an error would be bubbled up the stack,
but after that commit, errors passed on to dio_complete were eaten up.
This was reported on the list earlier, and a fix was proposed in
https://lore.kernel.org/lkml/20160921141539.GA17898@infradead.org/, but
never followed up with. We recently hit this bug in our testing where
fencing io errors, which were previously erroring out with EIO, were
being returned as success operations after this commit.
The fix proposed on the list earlier was a little short -- it would have
still called generic_write_sync() in case `ret` already contained an
error. This fix ensures generic_write_sync() is only called when there's
no pending error in the write. Additionally, transferred is replaced
with ret to bring this code in line with other callers.
Fixes: e259221763a4 ("fs: simplify the generic_write_sync prototype")
Reported-by: Ravi Nankani <rnankani@amazon.com>
Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
CC: Torsten Mehlan <tomeh@amazon.de>
CC: Uwe Dannowski <uwed@amazon.de>
CC: Amit Shah <aams@amazon.de>
CC: David Woodhouse <dwmw@amazon.co.uk>
CC: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
blk_poll() has always kept spinning until it found an IO. This is
fine for SYNC polling, since we need to find one request we have
pending, but in preparation for ASYNC polling it can be beneficial
to just check if we have any entries available or not.
Existing callers are converted to pass in 'spin == true', to retain
the old behavior.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We use IOCB_HIPRI to poll for IO in the caller instead of scheduling.
This information is not available for (or after) IO submission. The
driver may make different queue choices based on the type of IO, so
make the fact that we will poll for this IO known to the lower layers
as well.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use accessor functions to access an iterator's type and direction. This
allows for the possibility of using some other method of determining the
type of iterator than if-chains with bitwise-AND conditions.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Same numerical value (for now at least), but a much better documentation
of intent.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Merge updates from Andrew Morton:
- a few misc things
- ocfs2 updates
- the v9fs maintainers have been missing for a long time. I've taken
over v9fs patch slinging.
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (116 commits)
mm,oom_reaper: check for MMF_OOM_SKIP before complaining
mm/ksm: fix interaction with THP
mm/memblock.c: cast constant ULLONG_MAX to phys_addr_t
headers: untangle kmemleak.h from mm.h
include/linux/mmdebug.h: make VM_WARN* non-rvals
mm/page_isolation.c: make start_isolate_page_range() fail if already isolated
mm: change return type to vm_fault_t
mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes
mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory
kernel/fork.c: detect early free of a live mm
mm: make counting of list_lru_one::nr_items lockless
mm/swap_state.c: make bool enable_vma_readahead and swap_vma_readahead() static
block_invalidatepage(): only release page if the full page was invalidated
mm: kernel-doc: add missing parameter descriptions
mm/swap.c: remove @cold parameter description for release_pages()
mm/nommu: remove description of alloc_vm_area
zram: drop max_zpage_size and use zs_huge_class_size()
zsmalloc: introduce zs_huge_class_size()
mm: fix races between swapoff and flush dcache
fs/direct-io.c: minor cleanups in do_blockdev_direct_IO
...
|
|
We already get the block counts and calculate the end block at the
beginning of the function. Let's use the local variables for
consistency and readability. No functional changes
[akpm@linux-foundation.org: constify the locals to prevent future slipups]
Link: http://lkml.kernel.org/r/1519638870-17756-1-git-send-email-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This flag was added by fe0f07d08ee3 ("direct-io: only inc/deci
inode->i_dio_count for file systems") as means to optimise the atomic
modificaiton of the variable for blockdevices. However with the advent
of 542ff7bf18c6 ("block: new direct I/O implementation") it became
unused. So let's remove it.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This flag was added by 6039257378e4 ("direct-io: add flag to allow aio
writes beyond i_size") to support XFS. However, with the rework of
XFS' DIO's path to use iomap in acdda3aae146 ("xfs: use iomap_dio_rw")
it became redundant. So let's remove it.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit e864f39569f4 "fs: add RWF_DSYNC aand RWF_SYNC" added additional
way for direct IO to become synchronous and thus trigger fsync from the
IO completion handler. Then commit 9830f4be159b "fs: Use RWF_* flags for
AIO operations" allowed these flags to be set for AIO as well. However
that commit forgot to update the condition checking whether the IO
completion handling should be defered to a workqueue and thus AIO DIO
with RWF_[D]SYNC set will call fsync() from IRQ context resulting in
sleep in atomic.
Fix the problem by checking directly iocb flags (the same way as it is
done in dio_complete()) instead of checking all conditions that could
lead to IO being synchronous.
CC: Christoph Hellwig <hch@lst.de>
CC: Goldwyn Rodrigues <rgoldwyn@suse.com>
CC: stable@vger.kernel.org
Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Fixes: 9830f4be159b29399d107bffb99e0132bc5aedd4
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If two programs simultaneously try to write to the same part of a file
via direct IO and buffered IO, there's a chance that the post-diowrite
pagecache invalidation will fail on the dirty page. When this happens,
the dio write succeeded, which means that the page cache is no longer
coherent with the disk!
Programs are not supposed to mix IO types and this is a clear case of
data corruption, so store an EIO which will be reflected to userspace
during the next fsync. Replace the WARN_ON with a ratelimited pr_crit
so that the developers have /some/ kind of breadcrumb to track down the
offending program(s) and file(s) involved.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
|
|
Pull core block layer updates from Jens Axboe:
"This is the main pull request for block storage for 4.15-rc1.
Nothing out of the ordinary in here, and no API changes or anything
like that. Just various new features for drivers, core changes, etc.
In particular, this pull request contains:
- A patch series from Bart, closing the whole on blk/scsi-mq queue
quescing.
- A series from Christoph, building towards hidden gendisks (for
multipath) and ability to move bio chains around.
- NVMe
- Support for native multipath for NVMe (Christoph).
- Userspace notifications for AENs (Keith).
- Command side-effects support (Keith).
- SGL support (Chaitanya Kulkarni)
- FC fixes and improvements (James Smart)
- Lots of fixes and tweaks (Various)
- bcache
- New maintainer (Michael Lyle)
- Writeback control improvements (Michael)
- Various fixes (Coly, Elena, Eric, Liang, et al)
- lightnvm updates, mostly centered around the pblk interface
(Javier, Hans, and Rakesh).
- Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)
- Writeback series that fix the much discussed hundreds of millions
of sync-all units. This goes all the way, as discussed previously
(me).
- Fix for missing wakeup on writeback timer adjustments (Yafang
Shao).
- Fix laptop mode on blk-mq (me).
- {mq,name} tupple lookup for IO schedulers, allowing us to have
alias names. This means you can use 'deadline' on both !mq and on
mq (where it's called mq-deadline). (me).
- blktrace race fix, oopsing on sg load (me).
- blk-mq optimizations (me).
- Obscure waitqueue race fix for kyber (Omar).
- NBD fixes (Josef).
- Disable writeback throttling by default on bfq, like we do on cfq
(Luca Miccio).
- Series from Ming that enable us to treat flush requests on blk-mq
like any other request. This is a really nice cleanup.
- Series from Ming that improves merging on blk-mq with schedulers,
getting us closer to flipping the switch on scsi-mq again.
- BFQ updates (Paolo).
- blk-mq atomic flags memory ordering fixes (Peter Z).
- Loop cgroup support (Shaohua).
- Lots of minor fixes from lots of different folks, both for core and
driver code"
* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
nvme: fix visibility of "uuid" ns attribute
blk-mq: fixup some comment typos and lengths
ide: ide-atapi: fix compile error with defining macro DEBUG
blk-mq: improve tag waiting setup for non-shared tags
brd: remove unused brd_mutex
blk-mq: only run the hardware queue if IO is pending
block: avoid null pointer dereference on null disk
fs: guard_bio_eod() needs to consider partitions
xtensa/simdisk: fix compile error
nvme: expose subsys attribute to sysfs
nvme: create 'slaves' and 'holders' entries for hidden controllers
block: create 'slaves' and 'holders' entries for hidden gendisks
nvme: also expose the namespace identification sysfs files for mpath nodes
nvme: implement multipath access to nvme subsystems
nvme: track shared namespaces
nvme: introduce a nvme_ns_ids structure
nvme: track subsystems
block, nvme: Introduce blk_mq_req_flags_t
block, scsi: Make SCSI quiesce and resume work reliably
block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
...
|
|
That we we can also poll non blk-mq queues. Mostly needed for
the NVMe multipath code, but could also be useful elsewhere.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Pull xfs fixes from Darrick Wong:
- fix some more CONFIG_XFS_RT related build problems
- fix data loss when writeback at eof races eofblocks gc and loses
- invalidate page cache after fs finishes a dio write
- remove dirty page state when invalidating pages so releasepage does
the right thing when handed a dirty page
* tag 'xfs-4.14-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: move two more RT specific functions into CONFIG_XFS_RT
xfs: trim writepage mapping to within eof
fs: invalidate page cache after end_io() in dio completion
xfs: cancel dirty pages on invalidation
|
|
Pull block fixes from Jens Axboe:
"Three small fixes:
- A fix for skd, it was using kfree() to free a structure allocate
with kmem_cache_alloc().
- Stable fix for nbd, fixing a regression using the normal ioctl
based tools.
- Fix for a previous fix in this series, that fixed up
inconsistencies between buffered and direct IO"
* 'for-linus' of git://git.kernel.dk/linux-block:
fs: Avoid invalidation in interrupt context in dio_complete()
nbd: don't set the device size until we're connected
skd: Use kmem_cache_free
|
|
Currently we try to defer completion of async DIO to the process context
in case there are any mapped pages associated with the inode so that we
can invalidate the pages when the IO completes. However the check is racy
and the pages can be mapped afterwards. If this happens we might end up
calling invalidate_inode_pages2_range() in dio_complete() in interrupt
context which could sleep. This can be reproduced by generic/451.
Fix this by passing the information whether we can or can't invalidate
to the dio_complete(). Thanks Eryu Guan for reporting this and Jan Kara
for suggesting a fix.
Fixes: 332391a9935d ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
Reported-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 332391a9935d ("fs: Fix page cache inconsistency when mixing
buffered and AIO DIO") moved page cache invalidation from
iomap_dio_rw() to iomap_dio_complete() for iomap based direct write
path, but before the dio->end_io() call, and it re-introdued the bug
fixed by commit c771c14baa33 ("iomap: invalidate page caches should
be after iomap_dio_complete() in direct write").
I found this because fstests generic/418 started failing on XFS with
v4.14-rc3 kernel, which is the regression test for this specific
bug.
So similarly, fix it by moving dio->end_io() (which does the
unwritten extent conversion) before page cache invalidation, to make
sure next buffer read reads the final real allocations not unwritten
extents. I also add some comments about why should end_io() go first
in case we get it wrong again in the future.
Note that, there's no such problem in the non-iomap based direct
write path, because we didn't remove the page cache invalidation
after the ->direct_IO() in generic_file_direct_write() call, but I
decided to fix dio_complete() too so we don't leave a landmine
there, also be consistent with iomap_dio_complete().
Fixes: 332391a9935d ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
|
|
In the code added to function submit_page_section by commit b1058b981,
sdio->bio can currently be NULL when calling dio_bio_submit. This then
leads to a NULL pointer access in dio_bio_submit, so check for a NULL
bio in submit_page_section before trying to submit it instead.
Fixes xfstest generic/250 on gfs2.
Cc: stable@vger.kernel.org # v3.10+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Currently when mixing buffered reads and asynchronous direct writes it
is possible to end up with the situation where we have stale data in the
page cache while the new data is already written to disk. This is
permanent until the affected pages are flushed away. Despite the fact
that mixing buffered and direct IO is ill-advised it does pose a thread
for a data integrity, is unexpected and should be fixed.
Fix this by deferring completion of asynchronous direct writes to a
process context in the case that there are mapped pages to be found in
the inode. Later before the completion in dio_complete() invalidate
the pages in question. This ensures that after the completion the pages
in the written area are either unmapped, or populated with up-to-date
data. Also do the same for the iomap case which uses
iomap_dio_complete() instead.
This has a side effect of deferring the completion to a process context
for every AIO DIO that happens on inode that has pages mapped. However
since the consensus is that this is ill-advised practice the performance
implication should not be a problem.
This was based on proposal from Jeff Moyer, thanks!
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).
For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.
Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
A new bio operation flag REQ_NOWAIT is introduced to identify bio's
orignating from iocb with IOCB_NOWAIT. This flag indicates
to return immediately if a request cannot be made instead
of retrying.
Stacked devices such as md (the ones with make_request_fn hooks)
currently are not supported because it may block for housekeeping.
For example, an md can have a part of the device suspended.
For this reason, only request based devices are supported.
In the future, this feature will be expanded to stacked devices
by teaching them how to handle the REQ_NOWAIT flags.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Only read bio->bi_error once in the common path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs
branch.
This patch also fixes multiple checkpatch warnings: WARNING: Prefer
'unsigned int' to bare use of 'unsigned'
Thanks to Andrew Morton for suggesting more appropriate function instead
of macro.
[geliangtang@gmail.com: truncate: use i_blocksize()]
Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The code currently uses sdio->blkbits to compute the number of blocks to
be cleaned. However sdio->blkbits is derived from the logical block size
of the underlying block device (Refer to the definition of
do_blockdev_direct_IO()). Due to this, generic/299 test would rarely
fail when executed on an ext4 filesystem with 64k as the block size and
when using a virtio based disk (having 512 byte as the logical block
size) inside a kvm guest.
This commit fixes the bug by using inode->i_blkbits to compute the
number of blocks to be cleaned.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixed up by Jeff Moyer to only use/evaluate inode->i_blkbits once,
to avoid issues with block size changes with IO in flight.
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs updates from Dave Chinner:
"There is quite a varied bunch of stuff in this update, and some of it
you will have already merged through the ext4 tree which imported the
dax-4.10-iomap-pmd topic branch from the XFS tree.
There is also a new direct IO implementation that uses the iomap
infrastructure. It's much simpler, faster, and has lower IO latency
than the existing direct IO infrastructure.
Summary:
- DAX PMD faults via iomap infrastructure
- Direct-io support in iomap infrastructure
- removal of now-redundant XFS inode iolock, replaced with VFS
i_rwsem
- synchronisation with fixes and changes in userspace libxfs code
- extent tree lookup helpers
- lots of little corruption detection improvements to verifiers
- optimised CRC calculations
- faster buffer cache lookups
- deprecation of barrier/nobarrier mount options - we always use
REQ_FUA/REQ_FLUSH where appropriate for data integrity now
- cleanups to speculative preallocation
- miscellaneous minor bug fixes and cleanups"
* tag 'xfs-for-linus-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (63 commits)
xfs: nuke unused tracepoint definitions
xfs: use GPF_NOFS when allocating btree cursors
xfs: use xfs_vn_setattr_size to check on new size
xfs: deprecate barrier/nobarrier mount option
xfs: Always flush caches when integrity is required
xfs: ignore leaf attr ichdr.count in verifier during log replay
xfs: use rhashtable to track buffer cache
xfs: optimise CRC updates
xfs: make xfs btree stats less huge
xfs: don't cap maximum dedupe request length
xfs: don't allow di_size with high bit set
xfs: error out if trying to add attrs and anextents > 0
xfs: don't crash if reading a directory results in an unexpected hole
xfs: complain if we don't get nextents bmap records
xfs: check for bogus values in btree block headers
xfs: forbid AG btrees with level == 0
xfs: several xattr functions can be void
xfs: handle cow fork in xfs_bmap_trace_exlist
xfs: pass state not whichfork to trace_xfs_extlist
xfs: Move AGI buffer type setting to xfs_read_agi
...
|
|
Pull fs meta data unmap optimization from Jens Axboe:
"A series from Jan Kara, providing a more efficient way for unmapping
meta data from in the buffer cache than doing it block-by-block.
Provide a general helper that existing callers can use"
* 'for-4.10/fs-unmap' of git://git.kernel.dk/linux-block:
fs: Remove unmap_underlying_metadata
fs: Add helper to clean bdev aliases under a bh and use it
ext2: Use clean_bdev_aliases() instead of iteration
ext4: Use clean_bdev_aliases() instead of iteration
direct-io: Use clean_bdev_aliases() instead of handmade iteration
fs: Provide function to unmap metadata for a range of blocks
|
|
We want to use the per-sb completion workqueue from the new iomap
direct I/O code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
The poll code is blk-mq specific, let's move it to blk-mq.c. This
is a prep patch for improving the polling code.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Use new provided function instead of an iteration through all allocated
blocks.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Make local filesystems treat a fault as shortened IO,
returning -EFAULT only if nothing had been transferred.
That's how everything else (NFS, FUSE, ceph, Lustre)
behaves.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
This patch has the dio code use a REQ_OP for the op and rq_flag_bits
for bi_rw flags. To set/get the op it uses the bio_set_op_attrs/bio_op
accssors.
It also begins to convert btrfs's dio_submit_t because of the dio
submit_io callout use. The next patches will completely convert
this code and the reset of the btrfs code paths.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixed up fs/ext4/crypto.c
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Currently direct writes inside i_size on a DIO_SKIP_HOLES filesystem are
not allowed to allocate blocks(get_more_blocks() sets 'create' to 0
before calling get_block() callback), if it's a sparse file, direct
writes fall back to buffered writes to avoid stale data exposure from
concurrent buffered read. But there're two cases that can result in
stale data exposure are not correctly detected.
1. The detection for "writing inside i_size" is not sufficient,
writes can be treated as "extending writes" wrongly. For example,
direct write 1FSB (file system block) to a 1FSB sparse file on
ext2/3/4, starting from offset 0, in this case it's writing inside
i_size, but 'create' is non-zero, because 'block_in_file' and
'(i_size_read(inode) >> blkbits' are both zero.
2. Direct writes starting from or beyong i_size (not inside i_size)
also could trigger block allocation and expose stale data. For
example, consider a sparse file with i_size of 2k, and a write to
offset 2k or 3k into the file, with a filesystem block size of 4k.
(Thanks to Jeff Moyer for pointing this case out in his review.)
The first problem can be demostrated by running ltp-aiodio test ADSP045
many times. When testing on extN filesystems, I see test failures
occasionally, buffered read could read non-zero (stale) data.
ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 1
dio_sparse 0 TINFO : Dirtying free blocks
dio_sparse 0 TINFO : Starting I/O tests
non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa
non-zero read at offset 0
dio_sparse 0 TINFO : Killing childrens(s)
dio_sparse 1 TFAIL : dio_sparse.c:191: 1 children(s) exited abnormally
The second problem can also be reproduced easily by a hacked dio_sparse
program, which accepts an option to specify the write offset.
What we should really do is to disable block allocation for writes that
could result in filling holes inside i_size.
Link: http://lkml.kernel.org/r/1463156728-13357-1-git-send-email-guaneryu@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The kiocb already has the new position, so use that. The only interesting
case is AIO, where we currently don't bother updating ki_pos. We're about
to free the kiocb after we're done, so we might as well update it to make
everyone's life simpler.
While we're at it also return the bytes written argument passed in if
we were successful so that the boilerplate error switch code in the
callers can go away.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
This will allow us to do per-I/O sync file writes, as required by a lot
of fileservers or storage targets.
XXX: Will need a few additional audits for O_DSYNC
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
It has to be identical to ki_pos of the iocb, so use that instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Including blkdev_direct_IO and dax_do_io. It has to be ki_pos to actually
work, so eliminate the superflous argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs updates from Dave Chinner:
"There's quite a lot in this request, and there's some cross-over with
ext4, dax and quota code due to the nature of the changes being made.
As for the rest of the XFS changes, there are lots of little things
all over the place, which add up to a lot of changes in the end.
The major changes are that we've reduced the size of the struct
xfs_inode by ~100 bytes (gives an inode cache footprint reduction of
>10%), the writepage code now only does a single set of mapping tree
lockups so uses less CPU, delayed allocation reservations won't
overrun under random write loads anymore, and we added compile time
verification for on-disk structure sizes so we find out when a commit
or platform/compiler change breaks the on disk structure as early as
possible.
Change summary:
- error propagation for direct IO failures fixes for both XFS and
ext4
- new quota interfaces and XFS implementation for iterating all the
quota IDs in the filesystem
- locking fixes for real-time device extent allocation
- reduction of duplicate information in the xfs and vfs inode, saving
roughly 100 bytes of memory per cached inode.
- buffer flag cleanup
- rework of the writepage code to use the generic write clustering
mechanisms
- several fixes for inode flag based DAX enablement
- rework of remount option parsing
- compile time verification of on-disk format structure sizes
- delayed allocation reservation overrun fixes
- lots of little error handling fixes
- small memory leak fixes
- enable xfsaild freezing again"
* tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (66 commits)
xfs: always set rvalp in xfs_dir2_node_trim_free
xfs: ensure committed is initialized in xfs_trans_roll
xfs: borrow indirect blocks from freed extent when available
xfs: refactor delalloc indlen reservation split into helper
xfs: update freeblocks counter after extent deletion
xfs: debug mode forced buffered write failure
xfs: remove impossible condition
xfs: check sizes of XFS on-disk structures at compile time
xfs: ioends require logically contiguous file offsets
xfs: use named array initializers for log item dumping
xfs: fix computation of inode btree maxlevels
xfs: reinitialise per-AG structures if geometry changes during recovery
xfs: remove xfs_trans_get_block_res
xfs: fix up inode32/64 (re)mount handling
xfs: fix format specifier , should be %llx and not %llu
xfs: sanitize remount options
xfs: convert mount option parsing to tokens
xfs: fix two memory leaks in xfs_attr_list.c error paths
xfs: XFS_DIFLAG2_DAX limited by PAGE_SIZE
xfs: dynamically switch modes when XFS_DIFLAG2_DAX is set/cleared
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
- Preparations of parallel lookups (the remaining main obstacle is the
need to move security_d_instantiate(); once that becomes safe, the
rest will be a matter of rather short series local to fs/*.c
- preadv2/pwritev2 series from Christoph
- assorted fixes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits)
splice: handle zero nr_pages in splice_to_pipe()
vfs: show_vfsstat: do not ignore errors from show_devname method
dcache.c: new helper: __d_add()
don't bother with __d_instantiate(dentry, NULL)
untangle fsnotify_d_instantiate() a bit
uninline d_add()
replace d_add_unique() with saner primitive
quota: use lookup_one_len_unlocked()
cifs_get_root(): use lookup_one_len_unlocked()
nfs_lookup: don't bother with d_instantiate(dentry, NULL)
kill dentry_unhash()
ceph_fill_trace(): don't bother with d_instantiate(dn, NULL)
autofs4: don't bother with d_instantiate(dentry, NULL) in ->lookup()
configfs: move d_rehash() into configfs_create() for regular files
ceph: don't bother with d_rehash() in splice_dentry()
namei: teach lookup_slow() to skip revalidate
namei: massage lookup_slow() to be usable by lookup_one_len_unlocked()
lookup_one_len_unlocked(): use lookup_dcache()
namei: simplify invalidation logics in lookup_dcache()
namei: change calling conventions for lookup_{fast,slow} and follow_managed()
...
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
This way we can pass back errors to the file system, and allow for
cleanup required for all direct I/O invocations.
Also allow the ->end_io handlers to return errors on their own, so that
I/O completion errors can be passed on to the callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
kasan reported the following error when i ran xfstest:
[ 701.826854] ==================================================================
[ 701.826864] BUG: KASAN: use-after-free in dio_bio_complete+0x41a/0x600 at addr ffff880080b95f94
[ 701.826870] Read of size 4 by task loop2/3874
[ 701.826879] page:ffffea000202e540 count:0 mapcount:0 mapping: (null) index:0x0
[ 701.826890] flags: 0x100000000000000()
[ 701.826895] page dumped because: kasan: bad access detected
[ 701.826904] CPU: 3 PID: 3874 Comm: loop2 Tainted: G B W L 4.5.0-rc1-next-20160129 #83
[ 701.826910] Hardware name: LENOVO 23205NG/23205NG, BIOS G2ET95WW (2.55 ) 07/09/2013
[ 701.826917] ffff88008fadf800 ffff88008fadf758 ffffffff81ca67bb 0000000041b58ab3
[ 701.826941] ffffffff830d1e74 ffffffff81ca6724 ffff88008fadf748 ffffffff8161c05c
[ 701.826963] 0000000000000282 ffff88008fadf800 ffffed0010172bf2 ffffea000202e540
[ 701.826987] Call Trace:
[ 701.826997] [<ffffffff81ca67bb>] dump_stack+0x97/0xdc
[ 701.827005] [<ffffffff81ca6724>] ? _atomic_dec_and_lock+0xc4/0xc4
[ 701.827014] [<ffffffff8161c05c>] ? __dump_page+0x32c/0x490
[ 701.827023] [<ffffffff816b0d03>] kasan_report_error+0x5f3/0x8b0
[ 701.827033] [<ffffffff817c302a>] ? dio_bio_complete+0x41a/0x600
[ 701.827040] [<ffffffff816b1119>] __asan_report_load4_noabort+0x59/0x80
[ 701.827048] [<ffffffff817c302a>] ? dio_bio_complete+0x41a/0x600
[ 701.827053] [<ffffffff817c302a>] dio_bio_complete+0x41a/0x600
[ 701.827057] [<ffffffff81bd19c8>] ? blk_queue_exit+0x108/0x270
[ 701.827060] [<ffffffff817c32b0>] dio_bio_end_aio+0xa0/0x4d0
[ 701.827063] [<ffffffff817c3210>] ? dio_bio_complete+0x600/0x600
[ 701.827067] [<ffffffff81bd2806>] ? blk_account_io_completion+0x316/0x5d0
[ 701.827070] [<ffffffff81bafe89>] bio_endio+0x79/0x200
[ 701.827074] [<ffffffff81bd2c9f>] blk_update_request+0x1df/0xc50
[ 701.827078] [<ffffffff81c02c27>] blk_mq_end_request+0x57/0x120
[ 701.827081] [<ffffffff81c03670>] __blk_mq_complete_request+0x310/0x590
[ 701.827084] [<ffffffff812348d8>] ? set_next_entity+0x2f8/0x2ed0
[ 701.827088] [<ffffffff8124b34d>] ? put_prev_entity+0x22d/0x2a70
[ 701.827091] [<ffffffff81c0394b>] blk_mq_complete_request+0x5b/0x80
[ 701.827094] [<ffffffff821e2a33>] loop_queue_work+0x273/0x19d0
[ 701.827098] [<ffffffff811f6578>] ? finish_task_switch+0x1c8/0x8e0
[ 701.827101] [<ffffffff8129d058>] ? trace_hardirqs_on_caller+0x18/0x6c0
[ 701.827104] [<ffffffff821e27c0>] ? lo_read_simple+0x890/0x890
[ 701.827108] [<ffffffff8129dd60>] ? debug_check_no_locks_freed+0x350/0x350
[ 701.827111] [<ffffffff811f63b0>] ? __hrtick_start+0x130/0x130
[ 701.827115] [<ffffffff82a0c8f6>] ? __schedule+0x936/0x20b0
[ 701.827118] [<ffffffff811dd6bd>] ? kthread_worker_fn+0x3ed/0x8d0
[ 701.827121] [<ffffffff811dd4ed>] ? kthread_worker_fn+0x21d/0x8d0
[ 701.827125] [<ffffffff8129d058>] ? trace_hardirqs_on_caller+0x18/0x6c0
[ 701.827128] [<ffffffff811dd57f>] kthread_worker_fn+0x2af/0x8d0
[ 701.827132] [<ffffffff811dd2d0>] ? __init_kthread_worker+0x170/0x170
[ 701.827135] [<ffffffff82a1ea46>] ? _raw_spin_unlock_irqrestore+0x36/0x60
[ 701.827138] [<ffffffff811dd2d0>] ? __init_kthread_worker+0x170/0x170
[ 701.827141] [<ffffffff811dd2d0>] ? __init_kthread_worker+0x170/0x170
[ 701.827144] [<ffffffff811dd00b>] kthread+0x24b/0x3a0
[ 701.827148] [<ffffffff811dcdc0>] ? kthread_create_on_node+0x4c0/0x4c0
[ 701.827151] [<ffffffff8129d70d>] ? trace_hardirqs_on+0xd/0x10
[ 701.827155] [<ffffffff8116d41d>] ? do_group_exit+0xdd/0x350
[ 701.827158] [<ffffffff811dcdc0>] ? kthread_create_on_node+0x4c0/0x4c0
[ 701.827161] [<ffffffff82a1f52f>] ret_from_fork+0x3f/0x70
[ 701.827165] [<ffffffff811dcdc0>] ? kthread_create_on_node+0x4c0/0x4c0
[ 701.827167] Memory state around the buggy address:
[ 701.827170] ffff880080b95e80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 701.827172] ffff880080b95f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 701.827175] >ffff880080b95f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 701.827177] ^
[ 701.827179] ffff880080b96000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 701.827182] ffff880080b96080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 701.827183] ==================================================================
The problem is that bio_check_pages_dirty calls bio_put, so we must
not access bio fields after bio_check_pages_dirty.
Fixes: 9b81c842355ac96097ba ("block: don't access bio->bi_error after bio_put()").
Signed-off-by: Mike Krinkin <krinkin.m.u@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).
Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
eof"
Sure, it's better to bail out of past-the-eof read and return 0 than return
a bogus negative value on such. Only we'd better make sure we are bailing out
with 0 and not -ENOMEM...
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Assume a filesystem with 4KB blocks. When a file has size 1000 bytes and
we issue direct IO read at offset 1024, blockdev_direct_IO() reads the
tail of the last block and the logic for handling short DIO reads in
dio_complete() results in a return value -24 (1000 - 1024) which
obviously confuses userspace.
Fix the problem by bailing out early once we sample i_size and can
reliably check that direct IO read starts beyond i_size.
Reported-by: Avi Kivity <avi@scylladb.com>
Fixes: 9fe55eea7e4b444bafc42fa0000cc2d1d2847275
CC: stable@vger.kernel.org
CC: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Pull block IO poll support from Jens Axboe:
"Various groups have been doing experimentation around IO polling for
(really) fast devices. The code has been reviewed and has been
sitting on the side for a few releases, but this is now good enough
for coordinated benchmarking and further experimentation.
Currently O_DIRECT sync read/write are supported. A framework is in
the works that allows scalable stats tracking so we can auto-tune
this. And we'll add libaio support as well soon. Fow now, it's an
opt-in feature for test purposes"
* 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
direct-io: be sure to assign dio->bio_bdev for both paths
directio: add block polling support
NVMe: add blk polling support
block: add block polling support
blk-mq: return tag/queue combo in the make_request_fn handlers
block: change ->make_request_fn() and users to return a queue cookie
|
|
btrfs sets ->submit_io(), and we failed to set the block dev for
that path. That resulted in a potential NULL dereference when
we later wait for IO in dio_await_one().
Reported-by: kernel test robot <ying.huang@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
This adds support for sync O_DIRECT read/write poll support.
Signed-off-by: Jens Axboe <axboe@fb.com>
[hch: split from a larger patch, minor updates]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
|
|
__GFP_WAIT was used to signal that the caller was in atomic context and
could not sleep. Now it is possible to distinguish between true atomic
context and callers that are not willing to sleep. The latter should
clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing
__GFP_WAIT behaves differently, there is a risk that people will clear the
wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
indicate what it does -- setting it allows all reclaim activity, clearing
them prevents it.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When direct read IO is submitted from kernel, it is often
unnecessary to dirty pages, for example of loop, dirtying pages
have been considered in the upper filesystem(over loop) side
already, and they don't need to be dirtied again.
So this patch doesn't dirtying pages for ITER_BVEC/ITER_KVEC
direct read, and loop should be the 1st case to use ITER_BVEC/ITER_KVEC
for direct read I/O.
The patch is based on previous Dave's patch.
Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Commit 4246a0b6 ("block: add a bi_error field to struct bio") has added a few
dereferences of 'bio' after a call to bio_put(). This causes use-after-frees
such as:
[521120.719695] BUG: KASan: use after free in dio_bio_complete+0x2b3/0x320 at addr ffff880f36b38714
[521120.720638] Read of size 4 by task mount.ocfs2/9644
[521120.721212] =============================================================================
[521120.722056] BUG kmalloc-256 (Not tainted): kasan: bad access detected
[521120.722968] -----------------------------------------------------------------------------
[521120.722968]
[521120.723915] Disabling lock debugging due to kernel taint
[521120.724539] INFO: Slab 0xffffea003cdace00 objects=32 used=25 fp=0xffff880f36b38600 flags=0x46fffff80004080
[521120.726037] INFO: Object 0xffff880f36b38700 @offset=1792 fp=0xffff880f36b38800
[521120.726037]
[521120.726974] Bytes b4 ffff880f36b386f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[521120.727898] Object ffff880f36b38700: 00 88 b3 36 0f 88 ff ff 00 00 d8 de 0b 88 ff ff ...6............
[521120.728822] Object ffff880f36b38710: 02 00 00 f0 00 00 00 00 00 00 00 00 00 00 00 00 ................
[521120.729705] Object ffff880f36b38720: 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 ................
[521120.730623] Object ffff880f36b38730: 00 00 00 00 00 00 00 00 01 00 00 00 00 02 00 00 ................
[521120.731621] Object ffff880f36b38740: 00 02 00 00 01 00 00 00 d0 f7 87 ad ff ff ff ff ................
[521120.732776] Object ffff880f36b38750: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[521120.733640] Object ffff880f36b38760: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[521120.734508] Object ffff880f36b38770: 01 00 03 00 01 00 00 00 88 87 b3 36 0f 88 ff ff ...........6....
[521120.735385] Object ffff880f36b38780: 00 73 22 ad 02 88 ff ff 40 13 e0 3c 00 ea ff ff .s".....@..<....
[521120.736667] Object ffff880f36b38790: 00 02 00 00 00 04 00 00 00 00 00 00 00 00 00 00 ................
[521120.737596] Object ffff880f36b387a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[521120.738524] Object ffff880f36b387b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[521120.739388] Object ffff880f36b387c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[521120.740277] Object ffff880f36b387d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[521120.741187] Object ffff880f36b387e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[521120.742233] Object ffff880f36b387f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[521120.743229] CPU: 41 PID: 9644 Comm: mount.ocfs2 Tainted: G B 4.2.0-rc6-next-20150810-sasha-00039-gf909086 #2420
[521120.744274] ffff880f36b38000 ffff880d89c8f638 ffffffffb6e9ba8a ffff880101c0e5c0
[521120.745025] ffff880d89c8f668 ffffffffad76a313 ffff880101c0e5c0 ffffea003cdace00
[521120.745908] ffff880f36b38700 ffff880f36b38798 ffff880d89c8f690 ffffffffad772854
[521120.747063] Call Trace:
[521120.747520] dump_stack (lib/dump_stack.c:52)
[521120.748053] print_trailer (mm/slub.c:653)
[521120.748582] object_err (mm/slub.c:660)
[521120.749079] kasan_report_error (include/linux/kasan.h:20 mm/kasan/report.c:152 mm/kasan/report.c:194)
[521120.750834] __asan_report_load4_noabort (mm/kasan/report.c:250)
[521120.753580] dio_bio_complete (fs/direct-io.c:478)
[521120.755752] do_blockdev_direct_IO (fs/direct-io.c:494 fs/direct-io.c:1291)
[521120.759765] __blockdev_direct_IO (fs/direct-io.c:1322)
[521120.761658] blkdev_direct_IO (fs/block_dev.c:162)
[521120.762993] generic_file_read_iter (mm/filemap.c:1738)
[521120.767405] blkdev_read_iter (fs/block_dev.c:1649)
[521120.768556] __vfs_read (fs/read_write.c:423 fs/read_write.c:434)
[521120.772126] vfs_read (fs/read_write.c:454)
[521120.773118] SyS_pread64 (fs/read_write.c:607 fs/read_write.c:594)
[521120.776062] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:186)
[521120.777375] Memory state around the buggy address:
[521120.778118] ffff880f36b38600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[521120.779211] ffff880f36b38680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[521120.780315] >ffff880f36b38700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[521120.781465] ^
[521120.782083] ffff880f36b38780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[521120.783717] ffff880f36b38800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[521120.784818] ==================================================================
This patch fixes a few of those places that I caught while auditing the patch, but the
original patch should be audited further for more occurences of this issue since I'm
not too familiar with the code.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.
For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:
clat percentiles (usec):
| 1.00th=[ 33], 5.00th=[ 34], 10.00th=[ 34], 20.00th=[ 34],
| 30.00th=[ 34], 40.00th=[ 34], 50.00th=[ 35], 60.00th=[ 35],
| 70.00th=[ 35], 80.00th=[ 35], 90.00th=[ 37], 95.00th=[ 80],
| 99.00th=[ 98], 99.50th=[ 151], 99.90th=[ 155], 99.95th=[ 155],
| 99.99th=[ 165]
After:
clat percentiles (usec):
| 1.00th=[ 95], 5.00th=[ 108], 10.00th=[ 129], 20.00th=[ 149],
| 30.00th=[ 155], 40.00th=[ 161], 50.00th=[ 167], 60.00th=[ 171],
| 70.00th=[ 177], 80.00th=[ 185], 90.00th=[ 201], 95.00th=[ 270],
| 99.00th=[ 390], 99.50th=[ 398], 99.90th=[ 418], 99.95th=[ 422],
| 99.99th=[ 438]
In other setups, Robert Elliott reported seeing good performance
improvements:
https://lkml.org/lkml/2015/4/3/557
The more applications accessing the device, the worse it gets.
Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
do_blockdev_direct_IO() that it need not worry about incrementing
or decrementing the inode i_dio_count for this caller.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Most filesystems call through to these at some point, so we'll start
here.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Most callers in the kernel want to perform synchronous file I/O, but
still have to bloat the stack with a full struct kiocb. Split out
the parts needed in filesystem code from those in the aio code, and
only allocate those needed to pass down argument on the stack. The
aio code embedds the generic iocb in the one it allocates and can
easily get back to it by using container_of.
Also add a ->ki_complete method to struct kiocb, this is used to call
into the aio code and thus removes the dependency on aio for filesystems
impementing asynchronous operations. It will also allow other callers
to substitute their own completion callback.
We also add a new ->ki_flags field to work around the nasty layering
violation recently introduced in commit 5e33f6 ("usb: gadget: ffs: add
eventfd notification about ffs events").
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
The third argument of fuse_get_user_pages() "nbytesp" refers to the number of
bytes a caller asked to pack into fuse request. This value may be lesser
than capacity of fuse request or iov_iter. So fuse_get_user_pages() must
ensure that *nbytesp won't grow.
Now, when helper iov_iter_get_pages() performs all hard work of extracting
pages from iov_iter, it can be done by passing properly calculated
"maxsize" to the helper.
The other caller of iov_iter_get_pages() (dio_refill_pages()) doesn't need
this capability, so pass LONG_MAX as the maxsize argument here.
Fixes: c9c37e2e6378 ("fuse: switch to iov_iter_get_pages()")
Reported-by: Werner Baumann <werner.baumann@onlinehome.de>
Tested-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
... instead of maximal size.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
The direct-io.c rewrite to use the iov_iter infrastructure stopped updating
the size field in struct dio_submit, and thus rendered the check for
allowing asynchronous completions to always return false. Fix this by
comparing it to the count of bytes in the iov_iter instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
|
|
The following warnings:
fs/direct-io.c: In function ‘__blockdev_direct_IO’:
fs/direct-io.c:1011:12: warning: ‘to’ may be used uninitialized in this function [-Wmaybe-uninitialized]
fs/direct-io.c:913:16: note: ‘to’ was declared here
fs/direct-io.c:1011:12: warning: ‘from’ may be used uninitialized in this function [-Wmaybe-uninitialized]
fs/direct-io.c:913:10: note: ‘from’ was declared here
are false positive because dio_get_page() either fails, or sets both
'from' and 'to'.
Paul Bolle said ...
Maybe it's better to move initializing "to" and "from" out of
dio_get_page(). That _might_ make it easier for both the the reader and
the compiler to understand what's going on. Something like this:
Christoph Hellwig said ...
The fix of moving the code definitively looks nicer, while I think
uninitialized_var is horrible wart that won't get anywhere near my code.
Boaz Harrosh: I agree with Christoph and Paul
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
counts the pages covered by iov_iter, up to given limit.
do_block_direct_io() and fuse_iter_npages() switched to
it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
iov_iter_get_pages(iter, pages, maxsize, &start) grabs references pinning
the pages of up to maxsize of (contiguous) data from iter. Returns the
amount of memory grabbed or -error. In case of success, the requested
area begins at offset start in pages[0] and runs through pages[1], etc.
Less than requested amount might be returned - either because the contiguous
area in the beginning of iterator is smaller than requested, or because
the kernel failed to pin that many pages.
direct-io.c switched to using iov_iter_get_pages()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
returns the value aligned as badly as the worst remaining segment
in iov_iter is. Use instead of open-coded equivalents.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Pull xfs update from Dave Chinner:
"There are a couple of new fallocate features in this request - it was
decided that it was easiest to push them through the XFS tree using
topic branches and have the ext4 support be based on those branches.
Hence you may see some overlap with the ext4 tree merge depending on
how they including those topic branches into their tree. Other than
that, there is O_TMPFILE support, some cleanups and bug fixes.
The main changes in the XFS tree for 3.15-rc1 are:
- O_TMPFILE support
- allowing AIO+DIO writes beyond EOF
- FALLOC_FL_COLLAPSE_RANGE support for fallocate syscall and XFS
implementation
- FALLOC_FL_ZERO_RANGE support for fallocate syscall and XFS
implementation
- IO verifier cleanup and rework
- stack usage reduction changes
- vm_map_ram NOIO context fixes to remove lockdep warings
- various bug fixes and cleanups"
* tag 'xfs-for-linus-3.15-rc1' of git://oss.sgi.com/xfs/xfs: (34 commits)
xfs: fix directory hash ordering bug
xfs: extra semi-colon breaks a condition
xfs: Add support for FALLOC_FL_ZERO_RANGE
fs: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
xfs: inode log reservations are still too small
xfs: xfs_check_page_type buffer checks need help
xfs: avoid AGI/AGF deadlock scenario for inode chunk allocation
xfs: use NOIO contexts for vm_map_ram
xfs: don't leak EFSBADCRC to userspace
xfs: fix directory inode iolock lockdep false positive
xfs: allocate xfs_da_args to reduce stack footprint
xfs: always do log forces via the workqueue
xfs: modify verifiers to differentiate CRC from other errors
xfs: print useful caller information in xfs_error_report
xfs: add xfs_verifier_error()
xfs: add helper for updating checksums on xfs_bufs
xfs: add helper for verifying checksums on xfs_bufs
xfs: Use defines for CRC offsets in all cases
xfs: skip pointless CRC updates after verifier failures
xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
...
|
|
The return value of bio_get_nr_vecs() cannot be bigger than
BIO_MAX_PAGES, so we can remove redundant the comparison between
nr_pages and BIO_MAX_PAGES.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Some filesystems can handle direct I/O writes beyond i_size safely,
so allow them to opt into receiving them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6
|
|
Not using the return value can in the generic case be racy, so it's
in general good practice to check the return value instead.
This also resolved the warning caused on ARM and other architectures:
fs/direct-io.c: In function 'sb_init_dio_done_wq':
fs/direct-io.c:557:2: warning: value computed is not used [-Wunused-value]
Signed-off-by: Olof Johansson <olof@lixom.net>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: H Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Call generic_write_sync() from the deferred I/O completion handler if
O_DSYNC is set for a write request. Also make sure various callers
don't call generic_write_sync if the direct I/O code returns
-EIOCBQUEUED.
Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from
Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Add support to the core direct-io code to defer AIO completions to user
context using a workqueue. This replaces opencoded and less efficient
code in XFS and ext4 (we save a memory allocation for each direct IO)
and will be needed to properly support O_(D)SYNC for AIO.
The communication between the filesystem and the direct I/O code requires
a new buffer head flag, which is a bit ugly but not avoidable until the
direct I/O code stops abusing the buffer_head structure for communicating
with the filesystems.
Currently this creates a per-superblock unbound workqueue for these
completions, which is taken from an earlier patch by Jan Kara. I'm
not really convinced about this use and would prefer a "normal" global
workqueue with a high concurrency limit, but this needs further discussion.
JK: Fixed ext4 part, dynamic allocation of the workqueue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Pull block core updates from Jens Axboe:
- Major bit is Kents prep work for immutable bio vecs.
- Stable candidate fix for a scheduling-while-atomic in the queue
bypass operation.
- Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
discard bios.
- Tejuns changes to convert the writeback thread pool to the generic
workqueue mechanism.
- Runtime PM framework, SCSI patches exists on top of these in James'
tree.
- A few random fixes.
* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
relay: move remove_buf_file inside relay_close_buf
partitions/efi.c: replace useless kzalloc's by kmalloc's
fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
block: fix max discard sectors limit
blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
Documentation: cfq-iosched: update documentation help for cfq tunables
writeback: expose the bdi_wq workqueue
writeback: replace custom worker pool implementation with unbound workqueue
writeback: remove unused bdi_pending_list
aoe: Fix unitialized var usage
bio-integrity: Add explicit field for owner of bip_buf
block: Add an explicit bio flag for bios that own their bvec
block: Add bio_alloc_pages()
block: Convert some code to bio_for_each_segment_all()
block: Add bio_for_each_segment_all()
bounce: Refactor __blk_queue_bounce to not use bi_io_vec
raid1: use bio_copy_data()
pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
pktcdvd: use bio_copy_data()
block: Add bio_copy_data()
...
|
|
Faster kernel compiles by way of fewer unnecessary includes.
[akpm@linux-foundation.org: fix fallout]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently, dio_send_cur_page() submits bio before current page and cached
sdio->cur_page is added to the bio if sdio->boundary is set. This is
actually wrong because sdio->boundary means the current buffer is the last
one before metadata needs to be read. So we should rather submit the bio
after the current page is added to it.
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When we read/write a file sequentially, we will read/write not only the
data blocks but also the indirect blocks that may not be physically
adjacent to the data blocks. So filesystems set the BH_Boundary flag to
submit the previous I/O before reading/writing an indirect block.
However the generic direct IO code mishandles buffer_boundary(), setting
sdio->boundary before each submit_page_section() call which results in
sending only one page bios as underlying code thinks this page is the last
in the contiguous extent. So fix the problem by setting sdio->boundary
only if the current page is really the last one in the mapped extent.
With this patch and "direct-io: submit bio after boundary buffer is added
to it" I've measured about 10% throughput improvement of direct IO reads
on ext3 with SATA harddrive (from 90 MB/s to 100 MB/s). With ramdisk, the
improvement was about 3-fold (from 350 MB/s to 1.2 GB/s). For other
filesystems (such as ext4), the improvements won't be as visible because
the frequency of BH_Boundary flag being set is much smaller.
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
More prep work for immutable bvecs:
A few places in the code were either open coding or using the wrong
version - fix.
After we introduce the bvec iter, it'll no longer be possible to modify
the biovec through bio_for_each_segment_all() - it doesn't increment a
pointer to the current bvec, you pass in a struct bio_vec (not a
pointer) which is updated with what the current biovec would be (taking
into account bi_bvec_done and bi_size).
So because of that it's more worthwhile to be consistent about
bio_for_each_segment()/bio_for_each_segment_all() usage.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Alexander Viro <viro@zeniv.linux.org.uk>
|
|
Running AIO is pinning inode in memory using file reference. Once AIO
is completed using aio_complete(), file reference is put and inode can
be freed from memory. So we have to be sure that calling aio_complete()
is the last thing we do with the inode.
CC: Christoph Hellwig <hch@infradead.org>
CC: Jens Axboe <axboe@kernel.dk>
CC: Jeff Moyer <jmoyer@redhat.com>
CC: stable@vger.kernel.org
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Since directio can work on a raw block device, and the block size of the
device can change under it, we need to do the same thing that
fs/buffer.c now does: read the block size a single time, using
ACCESS_ONCE().
Reading it multiple times can get different results, which will then
confuse the code because it actually encodes the i_blksize in
relationship to the underlying logical blocksize.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Move unplugging for direct I/O from around ->direct_IO() down to
do_blockdev_direct_IO(). This implicitly adds plugging for direct
writes.
CC: Li Shaohua <shli@fusionio.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
READ is 0, so the result of the bit-and operation is 0. Rewrite with == as
done elsewhere in the same file.
This problem was found using Coccinelle (http://coccinelle.lip6.fr/).
Signed-off-by: Julia Lawall <julia@diku.dk>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Use the same mechanism as the block devices are using, but move the
helper functions from fs/direct-io.c into fs/inode.c to remove the
dependency on CONFIG_BLOCK.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
With kernel 3.1, Christoph removed i_alloc_sem and replaced it with
calls (namely inode_dio_wait() and inode_dio_done()) which are
EXPORT_SYMBOL_GPL() thus they cannot be used by non-GPL file systems and
further inode_dio_wait() was pushed from notify_change() into the file
system ->setattr() method but no non-GPL file system can make this call.
That means non-GPL file systems cannot exist any more unless they do not
use any VFS functionality related to reading/writing as far as I can
tell or at least as long as they want to implement direct i/o.
Both Linus and Al (and others) have said on LKML that this breakage of
the VFS API should not have happened and that the change was simply
missed as it was not documented in the change logs of the patches that
did those changes.
This patch changes the two function exports in question to be
EXPORT_SYMBOL() thus restoring the VFS API as it used to be - accessible
for all modules.
Christoph, who introduced the two functions and exported them GPL-only
is CC-ed on this patch to give him the opportunity to object to the
symbols being changed in this manner if he did indeed intend them to be
GPL-only and does not want them to become available to all modules.
Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
CC: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Some investigation of a transaction processing workload showed that a
major consumer of cycles in __blockdev_direct_IO is the cache miss while
accessing the block size. This is because it has to walk the chain from
block_dev to gendisk to queue.
The block size is needed early on to check alignment and sizes. It's only
done if the check for the inode block size fails. But the costly block
device state is unconditionally fetched.
- Reorganize the code to only fetch block dev state when actually
needed.
Then do a prefetch on the block dev early on in the direct IO path. This
is worth it, because there is substantial code run before we actually
touch the block dev now.
- I also added some unlikelies to make it clear the compiler that block
device fetch code is not normally executed.
This gave a small, but measurable improvement on a large database
benchmark (about 0.3%)
[akpm@linux-foundation.org: coding-style fixes]
[sfr@canb.auug.org.au: using prefetch requires including prefetch.h]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In get_more_blocks(), we use dio_count to calcuate fs_count and do some
tricky things to increase fs_count if dio_count isn't aligned. But
actually it still has some corner cases that can't be coverd. See the
following example:
dio_write foo -s 1024 -w 4096
(direct write 4096 bytes at offset 1024). The same goes if the offset
isn't aligned to fs_blocksize.
In this case, the old calculation counts fs_count to be 1, but actually we
will write into 2 different blocks (if fs_blocksize=4096). The old code
just works, since it will call get_block twice (and may have to allocate
and create extents twice for filesystems like ext4). So we'd better call
get_block just once with the proper fs_count.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This doesn't change anything for the compiler, but hch thought it would
make the code clearer.
I moved the reference counting into its own little inline.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Add inlines to all the submission path functions. While this increases
code size it also gives gcc a lot of optimization opportunities
in this critical hotpath.
In particular -- together with some other changes -- this
allows gcc to get rid of the unnecessary clearing of
sdio at the beginning and optimize the messy parameter passing.
Any non inlining of a function which takes a sdio parameter
would break this optimization because they cannot be done if the
address of a structure is taken.
Note that benefits are only seen with CONFIG_OPTIMIZE_INLINING
and CONFIG_CC_OPTIMIZE_FOR_SIZE both set to off.
This gives about 2.2% improvement on a large database benchmark
with a high IOPS rate.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Only a single b_private field in the map_bh buffer head is needed after
the submission path. Move map_bh separately to avoid storing
this information in the long term slab.
This avoids the weird 104 byte hole in struct dio_submit which also needed
to be memseted early.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
A direct slab call is slightly faster than kmalloc and can be better cached
per CPU. It also avoids rounding to the next kmalloc slab.
In addition this enforces cache line alignment for struct dio to avoid
any false sharing.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Fix most problems reported by pahole.
There is still a weird 104 byte hole after map_bh. I'm not sure what
causes this.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
There's nothing on the stack, even before my changes.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
This large, but largely mechanic, patch moves all fields in struct dio
that are only used in the submission path into a separate on stack
data structure. This has the advantage that the memory is very likely
cache hot, which is not guaranteed for memory fresh out of kmalloc.
This also gives gcc more optimization potential because it can easier
determine that there are no external aliases for these variables.
The sdio initialization is a initialization now instead of memset.
This allows gcc to break sdio into individual fields and optimize
away unnecessary zeroing (after all the functions are inlined)
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
For filesystems that delay their end_io processing we should keep our
i_dio_count until the the processing is done. Enable this by moving
the inode_dio_done call to the end_io handler if one exist. Note that
the actual move to the workqueue for ext4 and XFS is not done in
this patch yet, but left to the filesystem maintainers. At least
for XFS it's not needed yet either as XFS has an internal equivalent
to i_dio_count.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
This these filesystems to also protect truncate against direct I/O requests
by using common code. Right now the only non-DIO_LOCKING filesystem that
appears to do so is XFS, which uses an opencoded variant of the i_dio_count
scheme.
Behaviour doesn't change for filesystems never calling inode_dio_wait.
For ext4 behaviour changes when using the dioread_nonlock option, which
previously was missing any protection between truncate and direct I/O reads.
For ocfs2 that handcrafted i_dio_count manipulations are replaced with
the common code now enable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
i_alloc_sem is a rather special rw_semaphore. It's the last one that may
be released by a non-owner, and it's write side is always mirrored by
real exclusion. It's intended use it to wait for all pending direct I/O
requests to finish before starting a truncate.
Replace it with a hand-grown construct:
- exclusion for truncates is already guaranteed by i_mutex, so it can
simply fall way
- the reader side is replaced by an i_dio_count member in struct inode
that counts the number of pending direct I/O requests. Truncate can't
proceed as long as it's non-zero
- when i_dio_count reaches non-zero we wake up a pending truncate using
wake_up_bit on a new bit in i_flags
- new references to i_dio_count can't appear while we are waiting for
it to read zero because the direct I/O count always needs i_mutex
(or an equivalent like XFS's i_iolock) for starting a new operation.
This scheme is much simpler, and saves the space of a spinlock_t and a
struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
system).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Reject zero sized reads as soon as we know our I/O length, and don't
borther with locks or allocations that might have to be cleaned up
otherwise.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
Documentation/iostats.txt: bit-size reference etc.
cfq-iosched: removing unnecessary think time checking
cfq-iosched: Don't clear queue stats when preempt.
blk-throttle: Reset group slice when limits are changed
blk-cgroup: Only give unaccounted_time under debug
cfq-iosched: Don't set active queue in preempt
block: fix non-atomic access to genhd inflight structures
block: attempt to merge with existing requests on plug flush
block: NULL dereference on error path in __blkdev_get()
cfq-iosched: Don't update group weights when on service tree
fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
block: Require subsystems to explicitly allocate bio_set integrity mempool
jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
fs: make fsync_buffers_list() plug
mm: make generic_writepages() use plugging
blk-cgroup: Add unaccounted time to timeslice_used.
block: fixup plugging stubs for !CONFIG_BLOCK
block: remove obsolete comments for blkdev_issue_zeroout.
blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
...
Fix up conflicts in fs/{aio.c,super.c}
|
|
With the plugging now being explicitly controlled by the
submitter, callers need not pass down unplugging hints
to the block layer. If they want to unplug, it's because they
manually plugged on their own - in which case, they should just
unplug at will.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
|
|
When using devices that support max_segments > BIO_MAX_PAGES (256), direct
IO tries to allocate a bio with more pages than allowed, which leads to an
oops in dio_bio_alloc(). Clamp the request to the supported maximum, and
change dio_bio_alloc() to reflect that bio_alloc() will always return a
bio when called with __GFP_WAIT and a valid number of vectors.
[akpm@linux-foundation.org: remove redundant BUG_ON()]
Signed-off-by: David Dillow <dillowda@ornl.gov>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Jiri Kosina <trivial@kernel.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
|
Fix up truncation (ssize_t->int). This only matters with >2G
reads/writes, which the kernel doesn't permit.
Signed-off-by: Edward Shishkin <edward@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
commit c2c6ca4 (direct-io: do not merge logically non-contiguous requests)
introduced a bug whereby all O_DIRECT I/Os were submitted a page at a time
to the block layer. The problem is that the code expected
dio->block_in_file to correspond to the current page in the dio. In fact,
it corresponds to the previous page submitted via submit_page_section.
This was purely an oversight, as the dio->cur_page_fs_offset field was
introduced for just this purpose. This patch simply uses the correct
variable when calculating whether there is a mismatch between contiguous
logical blocks and contiguous physical blocks (as described in the
comments).
I also switched the if conditional following this check to an else if, to
ensure that we never call dio_bio_submit twice for the same dio (in
theory, this should not happen, anyway).
I've tested this by running blktrace and verifying that a 64KB I/O was
submitted as a single I/O. I also ran the patched kernel through
xfstests' aio tests using xfs, ext4 (with 1k and 4k block sizes) and btrfs
and verified that there were no regressions as compared to an unpatched
kernel.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Josef Bacik <jbacik@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: <stable@kernel.org> [2.6.35.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Move the call to vmtruncate to get rid of accessive blocks to the callers
in prepearation of the new truncate calling sequence. This was only done
for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
was not needed anyway. Get rid of blockdev_direct_IO_no_locking and
its _newtrunc variant while at it as just opencoding the two additional
paramters is shorted than the name suffix.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Filesystems with unwritten extent support must not complete an AIO request
until the transaction to convert the extent has been commited. That means
the aio_complete calls needs to be moved into the ->end_io callback so
that the filesystem can control when to call it exactly.
This makes a bit of a mess out of dio_complete and the ->end_io callback
prototype even more complicated.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
setattr > vmtruncate > truncate, have filesystems call their truncate sequence
from ->setattr if filesystem specific operations are required. vmtruncate is
deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
previously should be used.
simple_setattr is introduced for simple in-ram filesystems to implement
the new truncate sequence. Eventually all filesystems should be converted
to implement a setattr, and the default code in notify_change should go
away.
simple_setsize is also introduced to perform just the ATTR_SIZE portion
of simple_setattr (ie. changing i_size and trimming pagecache).
To implement the new truncate sequence:
- filesystem specific manipulations (eg freeing blocks) must be done in
the setattr method rather than ->truncate.
- vmtruncate can not be used by core code to trim blocks past i_size in
the event of write failure after allocation, so this must be performed
in the fs code.
- convert usage of helpers block_write_begin, nobh_write_begin,
cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
variants. These avoid calling vmtruncate to trim blocks (see previous).
- inode_setattr should not be used. generic_setattr is a new function
to be used to copy simple attributes into the generic inode.
- make use of the better opportunity to handle errors with the new sequence.
Big problem with the previous calling sequence: the filesystem is not called
until i_size has already changed. This means it is not allowed to fail the
call, and also it does not know what the previous i_size was. Also, generic
code calling vmtruncate to truncate allocated blocks in case of error had
no good way to return a meaningful error (or, for example, atomically handle
block deallocation).
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Btrfs cannot handle having logically non-contiguous requests submitted. For
example if you have
Logical: [0-4095][HOLE][8192-12287]
Physical: [0-4095] [4096-8191]
Normally the DIO code would put these into the same BIO's. The problem is we
need to know exactly what offset is associated with what BIO so we can do our
checksumming and unlocking properly, so putting them in the same BIO doesn't
work. So add another check where we submit the current BIO if the physical
blocks are not contigous OR the logical blocks are not contiguous.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
|
Because BTRFS can do RAID and such, we need our own submit hook so we can setup
the bio's in the correct fashion, and handle checksum errors properly. So there
are a few changes here
1) The submit_io hook. This is straightforward, just call this instead of
submit_bio.
2) Allow the fs to return -ENOTBLK for reads. Usually this has only worked for
writes, since writes can fallback onto buffered IO. But BTRFS needs the option
of falling back on buffered IO if it encounters a compressed extent, since we
need to read the entire extent in and decompress it. So if we get -ENOTBLK back
from get_block we'll return back and fallback on buffered just like the write
case.
I've tested these changes with fsx and everything seems to work. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Currently the locking in blockdev_direct_IO is a mess, we have three
different locking types and very confusing checks for some of them. The
most complicated one is DIO_OWN_LOCKING for reads, which happens to not
actually be used.
This patch gets rid of the DIO_OWN_LOCKING - as mentioned above the read
case is unused anyway, and the write side is almost identical to
DIO_NO_LOCKING. The difference is that DIO_NO_LOCKING always sets the
create argument for the get_blocks callback to zero, but we can easily
move that to the actual get_blocks callbacks. There are four users of the
DIO_NO_LOCKING mode: gfs already ignores the create argument and thus is
fine with the new version, ocfs2 only errors out if create were ever set,
and we can remove this dead code now, the block device code only ever uses
create for an error message if we are fully beyond the device which can
never happen, and last but not least XFS will need the new behavour for
writes.
Now we can replace the lock_type variable with a flags one, where no flag
means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first
flag. Separate out the check for not allowing to fill holes into a
separate flag, although for now both flags always get set at the same
time.
Also revamp the documentation of the locking scheme to actually make
sense.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alex Elder <aelder@sgi.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Intel reported a performance regression caused by the following commit:
commit 848c4dd5153c7a0de55470ce99a8e13a63b4703f
Author: Zach Brown <zach.brown@oracle.com>
Date: Mon Aug 20 17:12:01 2007 -0700
dio: zero struct dio with kzalloc instead of manually
This patch uses kzalloc to zero all of struct dio rather than
manually trying to track which fields we rely on being zero. It
passed aio+dio stress testing and some bug regression testing on
ext3.
This patch was introduced by Linus in the conversation that lead up
to Badari's minimal fix to manually zero .map_bh.b_state in commit:
6a648fa72161d1f6468dabd96c5d3c0db04f598a
It makes the code a bit smaller. Maybe a couple fewer cachelines to
load, if we're lucky:
text data bss dec hex filename
3285925 568506 1304616 5159047 4eb887 vmlinux
3285797 568506 1304616 5158919 4eb807 vmlinux.patched
I was unable to measure a stable difference in the number of cpu
cycles spent in blockdev_direct_IO() when pushing aio+dio 256K reads
at ~340MB/s.
So the resulting intent of the patch isn't a performance gain but to
avoid exposing ourselves to the risk of finding another field like
.map_bh.b_state where we rely on zeroing but don't enforce it in the
code.
Zach surmised that zeroing out the page array was what caused most of
the problem, and suggested the approach taken in the attached patch for
resolving the issue. Intel re-tested with this patch and saw a 0.6%
performance gain (the original regression was 0.5%).
[akpm@linux-foundation.org: add comment]
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There seems to be a regression in direct write path due to following
commit in for-2.6.33 branch of block tree.
commit 1af60fbd759d31f565552fea315c2033947cfbe6
Author: Jeff Moyer <jmoyer@redhat.com>
Date: Fri Oct 2 18:56:53 2009 -0400
block: get rid of the WRITE_ODIRECT flag
Marking direct writes as WRITE_SYNC_PLUG instead of WRITE_ODIRECT, sets
the NOIDLE flag in bio and hence in request. This tells CFQ to not expect
more request from the queue and not idle on it (despite the fact that
queue's think time is less and it is not seeky).
So direct writers lose big time when competing with sequential readers.
Using fio, I have run one direct writer and two sequential readers and
following are the results with 2.6.32-rc7 kernel and with for-2.6.33
branch.
Test
====
1 direct writer and 2 sequential reader running simultaneously.
[global]
directory=/mnt/sdc/fio/
runtime=10
[seqwrite]
rw=write
size=4G
direct=1
[seqread]
rw=read
size=2G
numjobs=2
2.6.32-rc7
==========
direct writes: aggrb=2,968KB/s
readers : aggrb=101MB/s
for-2.6.33 branch
=================
direct write: aggrb=19KB/s
readers aggrb=137MB/s
This patch brings back the WRITE_ODIRECT flag, with the difference that we
don't set the BIO_RW_UNPLUG flag so that device is not unplugged after
submission of request and an explicit unplug from submitter is required.
That way we fix the jeff's issue of not enough merging taking place in aio
path as well as make sure direct writes get their fair share.
After the fix
=============
for-2.6.33 + fix
----------------
direct writes: aggrb=2,728KB/s
reads: aggrb=103MB/s
Thanks
Vivek
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
Hi,
Some workloads issue batches of small I/O, and the performance is poor
due to the call to blk_run_address_space for every single iocb. Nathan
Roberts pointed this out, and suggested that by deferring this call
until all I/Os in the iocb array are submitted to the block layer, we
can realize some impressive performance gains (up to 30% for sequential
4k reads in batches of 16).
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
Hi,
The WRITE_ODIRECT flag is only used in one place, and that code path
happens to also call blk_run_address_space. The introduction of this
flag, then, could result in the device being unplugged twice for every
I/O.
Further, with the batching changes in the next patch, we don't want an
O_DIRECT write to imply a queue unplug.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
Until now we have had a 1:1 mapping between storage device physical
block size and the logical block sized used when addressing the device.
With SATA 4KB drives coming out that will no longer be the case. The
sector size will be 4KB but the logical block size will remain
512-bytes. Hence we need to distinguish between the physical block size
and the logical ditto.
This patch renames hardsect_size to logical_block_size.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
Remove code handling bio_alloc failure with __GFP_WAIT.
GFP_KERNEL implies __GFP_WAIT.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
By default, CFQ will anticipate more IO from a given io context if the
previously completed IO was sync. This used to be fine, since the only
sync IO was reads and O_DIRECT writes. But with more "normal" sync writes
being used now, we don't want to anticipate for those.
Add a bio/request flag that informs the IO scheduler that this is a sync
request that we should not idle for. Introduce WRITE_ODIRECT specifically
for O_DIRECT writes, and make sure that the other sync writes set this
flag.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In case of error extending write may have instantiated a few blocks
outside i_size. We need to trim these blocks. We have to do it
*regardless* to blocksize. At least ext2, ext3 and reiserfs interpret
(i_size < biggest block) condition as error. Fsck will complain about
wrong i_size. Then fsck will fix the error by changing i_size according
to the biggest block. This is bad because this blocks contain garbage
from previous write attempt. And result in data corruption.
####TESTCASE_BEGIN
$touch /mnt/test/BIG_FILE
## at this moment /mnt/test/BIG_FILE size and blocks equal to zero
open("/mnt/test/BIG_FILE", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
write(3, "aaaaaaaaaaaa"..., 104857600) = -1 ENOSPC (No space left on device)
## size and block sould't be changed because write op failed.
$stat /mnt/test/BIG_FILE
File: `/mnt/test/BIG_FILE'
Size: 0 Blocks: 110896 IO Block: 1024 regular empty file
<<<<<<<<^^^^^^^^^^^^^^^^^^^^^^^^^^^^^file size is less than biggest block idx
Device: fe07h/65031d Inode: 14 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2007-01-24 20:03:38.000000000 +0300
Modify: 2007-01-24 20:03:38.000000000 +0300
Change: 2007-01-24 20:03:39.000000000 +0300
#fsck.ext3 -f /dev/VG/test
e2fsck 1.39 (29-May-2006)
Pass 1: Checking inodes, blocks, and sizes
Inode 14, i_size is 0, should be 56556544. Fix<y>? yes
Pass 2: Checking directory structure
....
#####TESTCASE_ENDdiff --git a/fs/direct-io.c b/fs/direct-io.c
index af0558d..4e88bea 100644
[akpm@linux-foundation.org: use i_size_read()]
Signed-off-by: Dmitri Monakhov <dmonakhov@openvz.org>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
People can use the real name an an index into MAINTAINERS to find the
current email address.
Signed-off-by: Francois Cami <francois.cami@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Use get_user_pages_fast in the common/generic block and fs direct IO paths.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Simplify page cache zeroing of segments of pages through 3 functions
zero_user_segments(page, start1, end1, start2, end2)
Zeros two segments of the page. It takes the position where to
start and end the zeroing which avoids length calculations and
makes code clearer.
zero_user_segment(page, start, end)
Same for a single segment.
zero_user(page, start, length)
Length variant for the case where we know the length.
We remove the zero_user_page macro. Issues:
1. Its a macro. Inline functions are preferable.
2. The KM_USER0 macro is only defined for HIGHMEM.
Having to treat this special case everywhere makes the
code needlessly complex. The parameter for zeroing is always
KM_USER0 except in one single case that we open code.
Avoiding KM_USER0 makes a lot of code not having to be dealing
with the special casing for HIGHMEM anymore. Dealing with
kmap is only necessary for HIGHMEM configurations. In those
configurations we use KM_USER0 like we do for a series of other
functions defined in highmem.h.
Since KM_USER0 is depends on HIGHMEM the existing zero_user_page
function could not be a macro. zero_user_* functions introduced
here can be be inline because that constant is not used when these
functions are called.
Also extract the flushing of the caches to be outside of the kmap.
[akpm@linux-foundation.org: fix nfs and ntfs build]
[akpm@linux-foundation.org: fix ntfs build some more]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note
A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
(and thus mapcounted and count towards shared rss). These writes to
the struct page could cause excessive cacheline bouncing on big
systems. There are a number of ways this could be addressed if it is
an issue.
And indeed this cacheline bouncing has shown up on large SGI systems.
There was a situation where an Altix system was essentially livelocked
tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
This situation can be avoided in userspace, but it does highlight the
potential scalability problem with refcounting ZERO_PAGE, and corner
cases where it can really hurt (we don't want the system to livelock!).
There are several broad ways to fix this problem:
1. add back some special casing to avoid refcounting ZERO_PAGE
2. per-node or per-cpu ZERO_PAGES
3. remove the ZERO_PAGE completely
I will argue for 3. The others should also fix the problem, but they
result in more complex code than does 3, with little or no real benefit
that I can see.
Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
false optimisation: if an application is performance critical, it would
not be doing many read faults of new memory, or at least it could be
expected to write to that memory soon afterwards. If cache or memory use
is critical, it should not be working with a significant number of
ZERO_PAGEs anyway (a more compact representation of zeroes should be
used).
As a sanity check -- mesuring on my desktop system, there are never many
mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
increase much without it.
When running a make -j4 kernel compile on my dual core system, there are
about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
is torn down without being COWed). So removing ZERO_PAGE will save 1,000
page faults per second when running kbuild, while keeping it only saves
less than 1 page clearing operation per second. 1 page clear is cheaper
than a thousand faults, presumably, so there isn't an obvious loss.
Neither the logical argument nor these basic tests give a guarantee of no
regressions. However, this is a reasonable opportunity to try to remove
the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
we can reintroduce it and just avoid refcounting it.
The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked. I don't see
much use to them except on benchmarks. All other users of ZERO_PAGE are
converted just to use ZERO_PAGE(0) for simplicity. We can look at
replacing them all and maybe ripping out ZERO_PAGE completely when we are
more satisfied with this solution.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus "snif" Torvalds <torvalds@linux-foundation.org>
|
|
As bi_end_io is only called once when the reqeust is complete,
the 'size' argument is now redundant. Remove it.
Now there is no need for bio_endio to subtract the size completed
from bi_size. So don't do that either.
While we are at it, change bi_end_io to return void.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
This patch uses kzalloc to zero all of struct dio rather than manually
trying to track which fields we rely on being zero. It passed aio+dio
stress testing and some bug regression testing on ext3.
This patch was introduced by Linus in the conversation that lead up to
Badari's minimal fix to manually zero .map_bh.b_state in commit:
6a648fa72161d1f6468dabd96c5d3c0db04f598a
It makes the code a bit smaller. Maybe a couple fewer cachelines to
load, if we're lucky:
text data bss dec hex filename
3285925 568506 1304616 5159047 4eb887 vmlinux
3285797 568506 1304616 5158919 4eb807 vmlinux.patched
I was unable to measure a stable difference in the number of cpu cycles
spent in blockdev_direct_IO() when pushing aio+dio 256K reads at
~340MB/s.
So the resulting intent of the patch isn't a performance gain but to
avoid exposing ourselves to the risk of finding another field like
.map_bh.b_state where we rely on zeroing but don't enforce it in the
code.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Need to initialize map_bh.b_state to zero. Otherwise, in case of a faulty
user-buffer its possible to go into dio_zero_block() and submit a page by
mistake - since it checks for buffer_new().
http://marc.info/?l=linux-kernel&m=118551339032528&w=2
akpm: Linus had a (better) patch to just do a kzalloc() in there, but it got
lost. Probably this version is better for -stable anwyay.
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: Joe Jin <joe.jin@oracle.com>
Acked-by: Zach Brown <zach.brown@oracle.com>
Cc: gurudas pai <gurudas.pai@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Badari Pulavarty reported a case of this BUG_ON is triggering during
testing. It's completely bogus and should be removed.
It's trying to notice if we left references to the dio hanging around in
the sync case. They should have been dropped as IO completed while this
path was in dio_await_completion(). This condition will also be
checked, via some twisty logic, by the BUG_ON(ret != -EIOCBQUEUED) a few
lines lower. So to start this BUG_ON() is redundant.
More fatally, it's dereferencing dio-> after having dropped its
reference. It's only safe to dereference the dio after releasing the
lock if the final reference was just dropped. Another CPU might free
the dio in bio completion and reuse the memory after this path drops the
dio lock but before the BUG_ON() is evaluated.
This patch passed aio+dio regression unit tests and aio-stress on ext3.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
sound: convert "sound" subdirectory to UTF-8
MAINTAINERS: Add cxacru website/mailing list
include files: convert "include" subdirectory to UTF-8
general: convert "kernel" subdirectory to UTF-8
documentation: convert the Documentation directory to UTF-8
Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
remove broken URLs from net drivers' output
Magic number prefix consistency change to Documentation/magic-number.txt
trivial: s/i_sem /i_mutex/
fix file specification in comments
drivers/base/platform.c: fix small typo in doc
misc doc and kconfig typos
Remove obsolete fat_cvf help text
Fix occurrences of "the the "
Fix minor typoes in kernel/module.c
Kconfig: Remove reference to external mqueue library
Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
Correct comments in genrtc.c to refer to correct /proc file.
Fix more "deprecated" spellos.
Fix "deprecated" typoes.
...
Fix trivial comment conflict in kernel/relay.c.
|
|
It's very common for file systems to need to zero part or all of a page,
the simplist way is just to use kmap_atomic() and memset(). There's
actually a library function in include/linux/highmem.h that does exactly
that, but it's confusingly named memclear_highpage_flush(), which is
descriptive of *how* it does the work rather than what the *purpose* is.
So this patchset renames the function to zero_user_page(), and calls it
from the various places that currently open code it.
This first patch introduces the new function call, and converts all the
core kernel callsites, both the open-coded ones and the old
memclear_highpage_flush() ones. Following this patch is a series of
conversions for each file system individually, per AKPM, and finally a
patch deprecating the old call. The diffstat below shows the entire
patchset.
[akpm@linux-foundation.org: fix a few things]
Signed-off-by: Nate Diller <nate.diller@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix the misspellings of "propogate", "writting" and (oh, the shame
:-) "kenrel" in the source tree.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
The wait_for_more_bios() function name was poorly chosen. While looking to
clean it up it I noticed that the dio struct refcounting between the bio
completion and dio submission paths was racey.
The bio submission path was simply freeing the dio struct if
atomic_dec_and_test() indicated that it dropped the final reference.
The aio bio completion path was dereferencing its dio struct pointer *after
dropping its reference* based on the remaining number of references.
These two paths could race and result in the aio bio completion path
dereferencing a freed dio, though this was not observed in the wild.
This moves the refcount under the bio lock so that bio completion can drop
its reference and decide to wake all in one atomic step.
Once testing and waking is locked dio_await_one() can test its sleeping
condition and mark itself uninterruptible under the lock. It gets simpler
and wait_for_more_bios() disappears.
The addition of the interrupt masking spin lock acquiry in dio_bio_submit()
looks alarming. This lock acquiry existed in that path before the recent
dio completion patch set. We shouldn't expect significant performance
regression from returning to the behaviour that existed before the
completion clean up work.
This passed 4k block ext3 O_DIRECT fsx and aio-stress on an SMP machine.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: <xfs-masters@oss.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The only time it is safe to call aio_complete() is when the ->ki_retry
function returns -EIOCBQUEUED to the AIO core. direct_io_worker() has
historically done this by relying on its caller to translate positive return
codes into -EIOCBQUEUED for the aio case. It did this by trying to keep
conditionals in sync. direct_io_worker() knew when finished_one_bio() was
going to call aio_complete(). It would reverse the test and wait and free the
dio in the cases it thought that finished_one_bio() wasn't going to.
Not surprisingly, it ended up getting it wrong. 'ret' could be a negative
errno from the submission path but it failed to communicate this to
finished_one_bio(). direct_io_worker() would return < 0, it's callers
wouldn't raise -EIOCBQUEUED, and aio_complete() would be called. In the
future finished_one_bio()'s tests wouldn't reflect this and aio_complete()
would be called for a second time which can manifest as an oops.
The previous cleanups have whittled the sync and async completion paths down
to the point where we can collapse them and clearly reassert the invariant
that we must only call aio_complete() after returning -EIOCBQUEUED.
direct_io_worker() will only return -EIOCBQUEUED when it is not the last to
drop the dio refcount and the aio bio completion path will only call
aio_complete() when it is the last to drop the dio refcount.
direct_io_worker() can ensure that it is the last to drop the reference count
by waiting for bios to drain. It does this for sync ops, of course, and for
partial dio writes that must fall back to buffered and for aio ops that saw
errors during submission.
This means that operations that end up waiting, even if they were issued as
aio ops, will not call aio_complete() from dio. Instead we return the return
code of the operation and let the aio core call aio_complete(). This is
purposely done to fix a bug where AIO DIO file extensions would call
aio_complete() before their callers have a chance to update i_size.
Now that direct_io_worker() is explicitly returning -EIOCBQUEUED its callers
no longer have to translate for it. XFS needs to be careful not to free
resources that will be used during AIO completion if -EIOCBQUEUED is returned.
We maintain the previous behaviour of trying to write fs metadata for O_SYNC
aio+dio writes.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: <xfs-masters@oss.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Now that we have a single refcount and waiting path we can reuse it in the
async 'should_wait' path. It continues to rely on the fragile link between
the conditional in dio_complete_aio() which decides to complete the AIO and
the conditional in direct_io_worker() which decides to wait and free.
By waiting before dropping the reference we stop dio_bio_end_aio() from
calling dio_complete_aio() which used to wake up the waiter after seeing the
reference count drop to 0. We hoist this wake up into dio_bio_end_aio() which
now notices when it's left a single remaining reference that is held by the
waiter.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Previously we had two confusing counts of bio progress. 'bio_count' was
decremented as bios were processed and freed by the dio core. It was used to
indicate final completion of the dio operation. 'bios_in_flight' reflected
how many bios were between submit_bio() and bio->end_io. It was used by the
sync path to decide when to wake up and finish completing bios and was ignored
by the async path.
This patch collapses the two notions into one notion of a dio reference count.
bios hold a dio reference when they're between submit_bio and bio->end_io.
Since bios_in_flight was only used in the sync path it is now equivalent to
dio->refcount - 1 which accounts for direct_io_worker() holding a reference
for the duration of the operation.
dio_bio_complete() -> finished_one_bio() was called from the sync path after
finding bios on the list that the bio->end_io function had deposited.
finished_one_bio() can not drop the dio reference on behalf of these bios now
because bio->end_io already has. The is_async test in finished_one_bio()
meant that it never actually did anything other than drop the bio_count for
sync callers. So we remove its refcount decrement, don't call it from
dio_bio_complete(), and hoist its call up into the async dio_bio_complete()
caller after an explicit refcount decrement. It is renamed dio_complete_aio()
to reflect the remaining work it actually does.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
We only need to call blk_run_address_space() once after all the bios for the
direct IO op have been submitted. This removes the chance of calling
blk_run_address_space() after spurious wake ups as the sync path waits for
bios to drain. It's also one less difference betwen the sync and async paths.
In the process we remove a redundant dio_bio_submit() that its caller had
already performed.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
There have been a lot of bugs recently due to the way direct_io_worker() tries
to decide how to finish direct IO operations. In the worst examples it has
failed to call aio_complete() at all (hang) or called it too many times
(oops).
This set of patches cleans up the completion phase with the goal of removing
the complexity that lead to these bugs. We end up with one path that
calculates the result of the operation after all off the bios have completed.
We decide when to generate a result of the operation using that path based on
the final release of a refcount on the dio structure.
I tried to progress towards the final state in steps that were relatively easy
to understand. Each step should compile but I only tested the final result of
having all the patches applied.
I've tested these on low end PC drives with aio-stress, the direct IO tests I
could manage to get running in LTP, orasim, and some home-brew functional
tests.
In http://lkml.org/lkml/2006/9/21/103 IBM reports success with ext2 and ext3
running DIO LTP tests. They found that XFS bug which has since been addressed
in the patch series.
This patch:
The mechanics which decide the result of a direct IO operation were duplicated
in the sync and async paths.
The async path didn't check page_errors which can manifest as silently
returning success when the final pointer in an operation faults and its
matching file region is filled with zeros.
The sync path and async path differed in whether they passed errors to the
caller's dio->end_io operation. The async path was passing errors to it which
trips an assertion in XFS, though it is apparently harmless.
This centralizes the completion phase of dio ops in one place. AIO will now
return EFAULT consistently and all paths fall back to the previously sync
behaviour of passing the number of bytes 'transferred' to the dio->end_io
callback, regardless of errors.
dio_await_completion() doesn't have to propogate EIO from non-uptodate bios
now that it's being propogated through dio_complete() via dio->io_error. This
lets it return void which simplifies its sole caller.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Account for direct-io.
Cc: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Cc: David Wright <daw@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Teach special (rwsem-in-irq) locking code to the lock validator. Has no
effect on non-lockdep kernels.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
A process flag to indicate whether we are doing sync io is incredibly
ugly. It also causes performance problems when one does a lot of async
io and then proceeds to sync it. Part of the io will go out as async,
and the other part as sync. This causes a disconnect between the
previously submitted io and the synced io. For io schedulers such as CFQ,
this will cause us lost merges and suboptimal behaviour in scheduling.
Remove PF_SYNCWRITE completely from the fsync/msync paths, and let
the O_DIRECT path just directly indicate that the writes are sync
by using WRITE_SYNC instead.
Signed-off-by: Jens Axboe <axboe@suse.de>
|
|
this changes if() BUG(); constructs to BUG_ON() which is
cleaner and can better optimized away
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|