| Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull ceph fixes from Ilya Dryomov:
"A patch to make sparse read handling work in msgr2 secure mode from
Slava and a couple of fixes from Ziming and myself to avoid operating
on potentially invalid memory, all marked for stable"
* tag 'ceph-for-6.18-rc8' of https://github.com/ceph/ceph-client:
libceph: prevent potential out-of-bounds writes in handle_auth_session_key()
libceph: replace BUG_ON with bounds check for map->max_osd
ceph: fix crash in process_v2_sparse_read() for encrypted directories
libceph: drop started parameter of __ceph_open_session()
libceph: fix potential use-after-free in have_mon_and_osd_map()
|
|
With the previous commit revamping the timeout handling, started isn't
used anymore. It could be taken into account by adjusting the initial
value of the timeout, but there is little point as both callers capture
the timestamp shortly before calling __ceph_open_session() -- the only
thing of note that happens in the interim is taking client->mount_mutex
and that isn't expected to take multiple seconds.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
|
|
When having a multiuser mount with domain= specified and using
cifscreds, cifs_set_cifscreds() will end up setting @ctx->domainname,
so it needs to be freed before leaving cifs_construct_tcon().
This fixes the following memory leak reported by kmemleak:
mount.cifs //srv/share /mnt -o domain=ZELDA,multiuser,...
su - testuser
cifscreds add -d ZELDA -u testuser
...
ls /mnt/1
...
umount /mnt
echo scan > /sys/kernel/debug/kmemleak
cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff8881203c3f08 (size 8):
comm "ls", pid 5060, jiffies 4307222943
hex dump (first 8 bytes):
5a 45 4c 44 41 00 cc cc ZELDA...
backtrace (crc d109a8cf):
__kmalloc_node_track_caller_noprof+0x572/0x710
kstrdup+0x3a/0x70
cifs_sb_tlink+0x1209/0x1770 [cifs]
cifs_get_fattr+0xe1/0xf50 [cifs]
cifs_get_inode_info+0xb5/0x240 [cifs]
cifs_revalidate_dentry_attr+0x2d1/0x470 [cifs]
cifs_getattr+0x28e/0x450 [cifs]
vfs_getattr_nosec+0x126/0x180
vfs_statx+0xf6/0x220
do_statx+0xab/0x110
__x64_sys_statx+0xd5/0x130
do_syscall_64+0xbb/0x380
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes: f2aee329a68f ("cifs: set domainName when a domain-key is used in multiuser")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: Jay Shin <jaeshin@redhat.com>
Cc: stable@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Pull xfs fix from Carlos Maiolino:
"A single out-of-bounds fix, nothing special"
* tag 'xfs-fixes-6.18-rc7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix out of bounds memory read error in symlink repair
|
|
Pull smb client fixes from Steve French:
- Fix potential memory leak in mount
- Add some missing read tracepoints
- Fix locking issue with directory leases
* tag 'v6.18-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Add the smb3_read_* tracepoints to SMB1
cifs: fix memory leak in smb3_fs_context_parse_param error path
smb: client: introduce close_cached_dir_locked()
|
|
xfs/286 produced this report on my test fleet:
==================================================================
BUG: KFENCE: out-of-bounds read in memcpy_orig+0x54/0x110
Out-of-bounds read at 0xffff88843fe9e038 (184B right of kfence-#184):
memcpy_orig+0x54/0x110
xrep_symlink_salvage_inline+0xb3/0xf0 [xfs]
xrep_symlink_salvage+0x100/0x110 [xfs]
xrep_symlink+0x2e/0x80 [xfs]
xrep_attempt+0x61/0x1f0 [xfs]
xfs_scrub_metadata+0x34f/0x5c0 [xfs]
xfs_ioc_scrubv_metadata+0x387/0x560 [xfs]
xfs_file_ioctl+0xe23/0x10e0 [xfs]
__x64_sys_ioctl+0x76/0xc0
do_syscall_64+0x4e/0x1e0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
kfence-#184: 0xffff88843fe9df80-0xffff88843fe9dfea, size=107, cache=kmalloc-128
allocated by task 3470 on cpu 1 at 263329.131592s (192823.508886s ago):
xfs_init_local_fork+0x79/0xe0 [xfs]
xfs_iformat_local+0xa4/0x170 [xfs]
xfs_iformat_data_fork+0x148/0x180 [xfs]
xfs_inode_from_disk+0x2cd/0x480 [xfs]
xfs_iget+0x450/0xd60 [xfs]
xfs_bulkstat_one_int+0x6b/0x510 [xfs]
xfs_bulkstat_iwalk+0x1e/0x30 [xfs]
xfs_iwalk_ag_recs+0xdf/0x150 [xfs]
xfs_iwalk_run_callbacks+0xb9/0x190 [xfs]
xfs_iwalk_ag+0x1dc/0x2f0 [xfs]
xfs_iwalk_args.constprop.0+0x6a/0x120 [xfs]
xfs_iwalk+0xa4/0xd0 [xfs]
xfs_bulkstat+0xfa/0x170 [xfs]
xfs_ioc_fsbulkstat.isra.0+0x13a/0x230 [xfs]
xfs_file_ioctl+0xbf2/0x10e0 [xfs]
__x64_sys_ioctl+0x76/0xc0
do_syscall_64+0x4e/0x1e0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
CPU: 1 UID: 0 PID: 1300113 Comm: xfs_scrub Not tainted 6.18.0-rc4-djwx #rc4 PREEMPT(lazy) 3d744dd94e92690f00a04398d2bd8631dcef1954
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-4.module+el8.8.0+21164+ed375313 04/01/2014
==================================================================
On further analysis, I realized that the second parameter to min() is
not correct. xfs_ifork::if_bytes is the size of the xfs_ifork::if_data
buffer. if_bytes can be smaller than the data fork size because:
(a) the forkoff code tries to keep the data area as large as possible
(b) for symbolic links, if_bytes is the ondisk file size + 1
(c) forkoff is always a multiple of 8.
Case in point: for a single-byte symlink target, forkoff will be
8 but the buffer will only be 2 bytes long.
In other words, the logic here is wrong and we walk off the end of the
incore buffer. Fix that.
Cc: stable@vger.kernel.org # v6.10
Fixes: 2651923d8d8db0 ("xfs: online repair of symbolic links")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Add the smb3_read_* tracepoints to SMB1's cifs_async_readv() and
cifs_readv_callback().
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Add proper cleanup of ctx->source and fc->source to the
cifs_parse_mount_err error handler. This ensures that memory allocated
for the source strings is correctly freed on all error paths, matching
the cleanup already performed in the success path by
smb3_cleanup_fs_context_contents().
Pointers are also set to NULL after freeing to prevent potential
double-free issues.
This change fixes a memory leak originally detected by syzbot. The
leak occurred when processing Opt_source mount options if an error
happened after ctx->source and fc->source were successfully
allocated but before the function completed.
The specific leak sequence was:
1. ctx->source = smb3_fs_context_fullpath(ctx, '/') allocates memory
2. fc->source = kstrdup(ctx->source, GFP_KERNEL) allocates more memory
3. A subsequent error jumps to cifs_parse_mount_err
4. The old error handler freed passwords but not the source strings,
causing the memory to leak.
This issue was not addressed by commit e8c73eb7db0a ("cifs: client:
fix memory leak in smb3_fs_context_parse_param"), which only fixed
leaks from repeated fsconfig() calls but not this error path.
Patch updated with minor change suggested by kernel test robot
Reported-by: syzbot+87be6809ed9bf6d718e3@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=87be6809ed9bf6d718e3
Fixes: 24e0a1eff9e2 ("cifs: switch to new mount api")
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Shaurya Rane <ssrane_b23@ee.vjti.ac.in>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Replace close_cached_dir() calls under cfid_list_lock with a new
close_cached_dir_locked() variant that uses kref_put() instead of
kref_put_lock() to avoid recursive locking when dropping references.
While the existing code works if the refcount >= 2 invariant holds,
this area has proven error-prone. Make deadlocks impossible and WARN
on invariant violations.
Cc: stable@vger.kernel.org
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix unitialized variable in statmount_string()
- Fix hostfs mounting when passing host root during boot
- Fix dynamic lookup to fail on cell lookup failure
- Fix missing file type when reading bfs inodes from disk
- Enforce checking of sb_min_blocksize() calls and update all callers
accordingly
- Restore write access before closing files opened by open_exec() in
binfmt_misc
- Always freeze efivarfs during suspend/hibernate cycles
- Fix statmount()'s and listmount()'s grab_requested_mnt_ns() helper to
actually allow mount namespace file descriptor in addition to mount
namespace ids
- Fix tmpfs remount when noswap is specified
- Switch Landlock to iput_not_last() to remove false-positives from
might_sleep() annotations in iput()
- Remove dead node_to_mnt_ns() code
- Ensure that per-queue kobjects are successfully created
* tag 'vfs-6.18-rc7.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
landlock: fix splats from iput() after it started calling might_sleep()
fs: add iput_not_last()
shmem: fix tmpfs reconfiguration (remount) when noswap is set
fs/namespace: correctly handle errors returned by grab_requested_mnt_ns
power: always freeze efivarfs
binfmt_misc: restore write access before closing files opened by open_exec()
block: add __must_check attribute to sb_min_blocksize()
virtio-fs: fix incorrect check for fsvq->kobj
xfs: check the return value of sb_min_blocksize() in xfs_fs_fill_super
isofs: check the return value of sb_min_blocksize() in isofs_fill_super
exfat: check return value of sb_min_blocksize in exfat_read_boot_sector
vfat: fix missing sb_min_blocksize() return value checks
mnt: Remove dead code which might prevent from building
bfs: Reconstruct file type when loading from disk
afs: Fix dynamic lookup to fail on cell lookup failure
hostfs: Fix only passing host root in boot stage with new mount
fs: Fix uninitialized 'offp' in statmount_string()
|
|
Pull NFS client fixes from Anna Schumaker:
- Various fixes when using NFS with TLS
- Localio direct-IO fixes
- Fix error handling in nfs_atomic_open_v23()
- Fix sysfs memory leak when nfs_client kobject add fails
- Fix an incorrect parameter when calling nfs4_call_sync()
- Fix a failing LTP test when using delegated timestamps
* tag 'nfs-for-6.18-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS: Fix LTP test failures when timestamps are delegated
NFSv4: Fix an incorrect parameter when calling nfs4_call_sync()
NFS: sysfs: fix leak when nfs_client kobject add fails
NFSv2/v3: Fix error handling in nfs_atomic_open_v23()
nfs/localio: do not issue misaligned DIO out-of-order
nfs/localio: Ensure DIO WRITE's IO on stable storage upon completion
nfs/localio: backfill missing partial read support for misaligned DIO
nfs/localio: add refcounting for each iocb IO associated with NFS pgio header
nfs/localio: remove unecessary ENOTBLK handling in DIO WRITE support
NFS: Check the TLS certificate fields in nfs_match_client()
pnfs: Set transport security policy to RPC_XPRTSEC_NONE unless using TLS
pnfs: Fix TLS logic in _nfs4_pnfs_v4_ds_connect()
pnfs: Fix TLS logic in _nfs4_pnfs_v3_ds_connect()
|
|
Pull smb client fixes from Steve French:
- Multichannel reconnect channel selection fix
- Fix for smbdirect (RDMA) disconnect bug
- Fix for incorrect username length check
- Fix memory leak in mount parm processing
* tag 'v6.18-rc5-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: let smbd_disconnect_rdma_connection() turn CREATED into DISCONNECTED
smb: fix invalid username check in smb3_fs_context_parse_param()
cifs: client: fix memory leak in smb3_fs_context_parse_param
smb: client: fix cifs_pick_channel when channel needs reconnect
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
- Add Chunhai Guo as a EROFS reviewer to get more eyes from interested
industry vendors
- Fix infinite loop caused by incomplete crafted zstd-compressed data
(thanks to Robert again!)
* tag 'erofs-for-6.18-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: avoid infinite loop due to incomplete zstd-compressed data
MAINTAINERS: erofs: add myself as reviewer
|
|
Pull smb server fixes from Steve French:
- Fix smbdirect (RDMA) disconnect hang bug
- Fix potential Denial of Service when connection limit exceeded
- Fix smbdirect (RDMA) connection (potentially accessing freed memory)
bug
* tag 'v6.18-rc5-smb-server-fixes' of git://git.samba.org/ksmbd:
smb: server: let smb_direct_disconnect_rdma_connection() turn CREATED into DISCONNECTED
ksmbd: close accepted socket when per-IP limit rejects connection
smb: server: rdma: avoid unmapping posted recv on accept failure
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
"Address recently reported issues or issues found at the recent NFS
bake-a-thon held in Raleigh, NC.
Issues reported with v6.18-rc:
- Address a kernel build issue
- Reorder SEQUENCE processing to avoid spurious NFS4ERR_SEQ_MISORDERED
Issues that need expedient stable backports:
- Close a refcount leak exposure
- Report support for NFSv4.2 CLONE correctly
- Fix oops during COPY_NOTIFY processing
- Prevent rare crash after XDR encoding failure
- Prevent crash due to confused or malicious NFSv4.1 client"
* tag 'nfsd-6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
Revert "SUNRPC: Make RPCSEC_GSS_KRB5 select CRYPTO instead of depending on it"
nfsd: ensure SEQUENCE replay sends a valid reply.
NFSD: Never cache a COMPOUND when the SEQUENCE operation fails
NFSD: Skip close replay processing if XDR encoding fails
NFSD: free copynotify stateid in nfs4_free_ol_stateid()
nfsd: add missing FATTR4_WORD2_CLONE_BLKSIZE from supported attributes
nfsd: fix refcount leak in nfsd_set_fh_dentry()
|
|
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251105212025.807549-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
grab_requested_mnt_ns was changed to return error codes on failure, but
its callers were not updated to check for error pointers, still checking
only for a NULL return value.
This commit updates the callers to use IS_ERR() or IS_ERR_OR_NULL() and
PTR_ERR() to correctly check for and propagate errors.
This also makes sure that the logic actually works and mount namespace
file descriptors can be used to refere to mounts.
Christian Brauner <brauner@kernel.org> says:
Rework the patch to be more ergonomic and in line with our overall error
handling patterns.
Fixes: 7b9d14af8777 ("fs: allow mount namespace fd")
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrei Vagin <avagin@google.com>
Link: https://patch.msgid.link/20251111062815.2546189-1-avagin@google.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The efivarfs filesystems must always be frozen and thawed to resync
variable state. Make it so.
Link: https://patch.msgid.link/20251105-vorbild-zutreffen-fe00d1dd98db@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix new inode name tracking in tree-log
- fix conventional zone and stripe calculations in zoned mode
- fix bio reference counts on error paths in relocation and scrub
* tag 'for-6.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: release root after error in data_reloc_print_warning_inode()
btrfs: scrub: put bio after errors in scrub_raid56_parity_stripe()
btrfs: do not update last_log_commit when logging inode due to a new name
btrfs: zoned: fix stripe width calculation
btrfs: zoned: fix conventional zone capacity calculation
|
|
DISCONNECTED
When smbd_disconnect_rdma_connection() turns SMBDIRECT_SOCKET_CREATED
into SMBDIRECT_SOCKET_ERROR, we'll have the situation that
smbd_disconnect_rdma_work() will set SMBDIRECT_SOCKET_DISCONNECTING
and call rdma_disconnect(), which likely fails as we never reached
the RDMA_CM_EVENT_ESTABLISHED. it means that
wait_event(sc->status_wait, sc->status == SMBDIRECT_SOCKET_DISCONNECTED)
in smbd_destroy() will hang forever in SMBDIRECT_SOCKET_DISCONNECTING
never reaching SMBDIRECT_SOCKET_DISCONNECTED.
So we directly go from SMBDIRECT_SOCKET_CREATED to
SMBDIRECT_SOCKET_DISCONNECTED.
Fixes: ffbfc73e84eb ("smb: client: let smbd_disconnect_rdma_connection() set SMBDIRECT_SOCKET_ERROR...")
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Since the maximum return value of strnlen(..., CIFS_MAX_USERNAME_LEN)
is CIFS_MAX_USERNAME_LEN, length check in smb3_fs_context_parse_param()
is always FALSE and invalid.
Fix the comparison in if statement.
Signed-off-by: Yiqi Sun <sunyiqixm@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
DISCONNECTED
When smb_direct_disconnect_rdma_connection() turns SMBDIRECT_SOCKET_CREATED
into SMBDIRECT_SOCKET_ERROR, we'll have the situation that
smb_direct_disconnect_rdma_work() will set SMBDIRECT_SOCKET_DISCONNECTING
and call rdma_disconnect(), which likely fails as we never reached
the RDMA_CM_EVENT_ESTABLISHED. it means that
wait_event(sc->status_wait, sc->status == SMBDIRECT_SOCKET_DISCONNECTED)
in free_transport() will hang forever in SMBDIRECT_SOCKET_DISCONNECTING
never reaching SMBDIRECT_SOCKET_DISCONNECTED.
So we directly go from SMBDIRECT_SOCKET_CREATED to
SMBDIRECT_SOCKET_DISCONNECTED.
Fixes: b3fd52a0d85c ("smb: server: let smb_direct_disconnect_rdma_connection() set SMBDIRECT_SOCKET_ERROR...")
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The utimes01 and utime06 tests fail when delegated timestamps are
enabled, specifically in subtests that modify the atime and mtime
fields using the 'nobody' user ID.
The problem can be reproduced as follow:
# echo "/media *(rw,no_root_squash,sync)" >> /etc/exports
# export -ra
# mount -o rw,nfsvers=4.2 127.0.0.1:/media /tmpdir
# cd /opt/ltp
# ./runltp -d /tmpdir -s utimes01
# ./runltp -d /tmpdir -s utime06
This issue occurs because nfs_setattr does not verify the inode's
UID against the caller's fsuid when delegated timestamps are
permitted for the inode.
This patch adds the UID check and if it does not match then the
request is sent to the server for permission checking.
Fixes: e12912d94137 ("NFSv4: Add support for delegated atime and mtime attributes")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
The Smatch static checker noted that in _nfs4_proc_lookupp(), the flag
RPC_TASK_TIMEOUT is being passed as an argument to nfs4_init_sequence(),
which is clearly incorrect.
Since LOOKUPP is an idempotent operation, nfs4_init_sequence() should
not ask the server to cache the result. The RPC_TASK_TIMEOUT flag needs
to be passed down to the RPC layer.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Fixes: 76998ebb9158 ("NFSv4: Observe the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
If adding the second kobject fails, drop both references to avoid sysfs
residue and memory leak.
Fixes: e96f9268eea6 ("NFS: Make all of /sys/fs/nfs network-namespace unique")
Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn>
Reviewed-by: Benjamin Coddington <ben.coddington@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
When nfs_do_create() returns an EEXIST error, it means that a regular
file could not be created. That could mean that a symlink needs to be
resolved. If that's the case, a lookup needs to be kicked off.
Reported-by: Stephen Abbene <sabbene87@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=220710
Fixes: 7c6c5249f061 ("NFS: add atomic_open for NFSv3 to handle O_TRUNC correctly.")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
From https://lore.kernel.org/linux-nfs/aQHASIumLJyOoZGH@infradead.org/
On Wed, Oct 29, 2025 at 12:20:40AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 27, 2025 at 12:18:30PM -0400, Mike Snitzer wrote:
> > LOCALIO's misaligned DIO will issue head/tail followed by O_DIRECT
> > middle (via AIO completion of that aligned middle). So out of order
> > relative to file offset.
>
> That's in general a really bad idea. It will obviously work, but
> both on SSDs and out of place write file systems it is a sure way
> to increase your garbage collection overhead a lot down the line.
Fix this by never issuing misaligned DIO out of order. This fix means
the DIO-aligned middle will only use AIO completion if there is no
misaligned end segment. Otherwise, all 3 segments of a misaligned DIO
will be issued without AIO completion to ensure file offset increases
properly for all partial READ or WRITE situations.
Factoring out nfs_local_iter_setup() helps standardize repetitive
nfs_local_iters_setup_dio() code and is inspired by cleanup work that
Chuck Lever did on the NFSD Direct code.
Fixes: c817248fc831 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
LOCALIO's misaligned DIO WRITE support requires synchronous IO for any
misaligned head and/or tail that are issued using buffered IO. In
addition, it is important that the O_DIRECT middle be on stable
storage upon its completion via AIO.
Otherwise, a misaligned DIO WRITE could mix buffered IO for the
head/tail and direct IO for the DIO-aligned middle -- which could lead
to problems associated with deferred writes to stable storage (such as
out of order partial completions causing incorrect advancement of the
file's offset, etc).
Fixes: c817248fc831 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
Misaligned DIO read can be split into 3 IOs, must handle potential for
short read from each component IO (follows same pattern used for
handling partial writes, except upper layer read code handles advancing
offset before retry).
Fixes: c817248fc831 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
Improve completion handling of as many as 3 IOs associated with each
misaligned DIO by using a atomic_t to track completion of each IO.
Update nfs_local_pgio_done() to use precise atomic_t accounting for
remaining iov_iter (up to 3) associated with each iocb, so that each
NFS LOCALIO pgio header is only released after all IOs have completed.
But also allow early return if/when a short read or write occurs.
Fixes reported BUG: KASAN: slab-use-after-free in nfs_local_call_read:
https://lore.kernel.org/linux-nfs/aPSvi5Yr2lGOh5Jh@dell-per750-06-vm-07.rhts.eng.pek2.redhat.com/
Reported-by: Yongcheng Yang <yoyang@redhat.com>
Fixes: c817248fc831 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
Each filesystem is meant to fallback to retrying DIO in terms buffered
IO when it might encounter -ENOTBLK when issuing DIO (which can happen
if the VFS cannot invalidate the page cache).
So NFS doesn't need special handling for -ENOTBLK.
Also, explicitly initialize a couple DIO related iocb members rather
than simply rely on data structure zeroing.
Fixes: c817248fc831 ("nfs/localio: add proper O_DIRECT support for READ and WRITE")
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
If the TLS security policy is of type RPC_XPRTSEC_TLS_X509, then the
cert_serial and privkey_serial fields need to match as well since they
define the client's identity, as presented to the server.
Fixes: 90c9550a8d65 ("NFS: support the kernel keyring for TLS")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
The default setting for the transport security policy must be
RPC_XPRTSEC_NONE, when using a TCP or RDMA connection without TLS.
Conversely, when using TLS, the security policy needs to be set.
Fixes: 6c0a8c5fcf71 ("NFS: Have struct nfs_client carry a TLS policy field")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
Don't try to add an RDMA transport to a client that is already marked as
being a TCP/TLS transport.
Fixes: a35518cae4b3 ("NFSv4.1/pnfs: fix NFS with TLS in pnfs")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
Don't try to add an RDMA transport to a client that is already marked as
being a TCP/TLS transport.
Fixes: 04a15263662a ("pnfs/flexfiles: connect to NFSv3 DS using TLS if MDS connection uses TLS")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
nfsd4_enc_sequence_replay() uses nfsd4_encode_operation() to encode a
new SEQUENCE reply when replaying a request from the slot cache - only
ops after the SEQUENCE are replayed from the cache in ->sl_data.
However it does this in nfsd4_replay_cache_entry() which is called
*before* nfsd4_sequence() has filled in reply fields.
This means that in the replayed SEQUENCE reply:
maxslots will be whatever the client sent
target_maxslots will be -1 (assuming init to zero, and
nfsd4_encode_sequence() subtracts 1)
status_flags will be zero
The incorrect maxslots value, in particular, can cause the client to
think the slot table has been reduced in size so it can discard its
knowledge of current sequence number of the later slots, though the
server has not discarded those slots. When the client later wants to
use a later slot, it can get NFS4ERR_SEQ_MISORDERED from the server.
This patch moves the setup of the reply into a new helper function and
call it *before* nfsd4_replay_cache_entry() is called. Only one of the
updated fields was used after this point - maxslots. So the
nfsd4_sequence struct has been extended to have separate maxslots for
the request and the response.
Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Closes: https://lore.kernel.org/linux-nfs/20251010194449.10281-1-okorniev@redhat.com/
Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
RFC 8881 normatively mandates that operations where the initial
SEQUENCE operation in a compound fails must not modify the slot's
replay cache.
nfsd4_cache_this() doesn't prevent such caching. So when SEQUENCE
fails, cstate.data_offset is not set, allowing
read_bytes_from_xdr_buf() to access uninitialized memory.
Reported-by: rtm@csail.mit.edu
Closes: https://lore.kernel.org/linux-nfs/c3628d57-94ae-48cf-8c9e-49087a28cec9@oracle.com/T/#t
Fixes: 468de9e54a90 ("nfsd41: expand solo sequence check")
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The replay logic added by commit 9411b1d4c7df ("nfsd4: cleanup
handling of nfsv4.0 closed stateid's") cannot be done if encoding
failed due to a short send buffer; there's no guarantee that the
operation encoder has actually encoded the data that is being copied
to the replay cache.
Reported-by: rtm@csail.mit.edu
Closes: https://lore.kernel.org/linux-nfs/c3628d57-94ae-48cf-8c9e-49087a28cec9@oracle.com/T/#t
Fixes: 9411b1d4c7df ("nfsd4: cleanup handling of nfsv4.0 closed stateid's")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Typically copynotify stateid is freed either when parent's stateid
is being close/freed or in nfsd4_laundromat if the stateid hasn't
been used in a lease period.
However, in case when the server got an OPEN (which created
a parent stateid), followed by a COPY_NOTIFY using that stateid,
followed by a client reboot. New client instance while doing
CREATE_SESSION would force expire previous state of this client.
It leads to the open state being freed thru release_openowner->
nfs4_free_ol_stateid() and it finds that it still has copynotify
stateid associated with it. We currently print a warning and is
triggerred
WARNING: CPU: 1 PID: 8858 at fs/nfsd/nfs4state.c:1550 nfs4_free_ol_stateid+0xb0/0x100 [nfsd]
This patch, instead, frees the associated copynotify stateid here.
If the parent stateid is freed (without freeing the copynotify
stateids associated with it), it leads to the list corruption
when laundromat ends up freeing the copynotify state later.
[ 1626.839430] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[ 1626.842828] Modules linked in: nfnetlink_queue nfnetlink_log bluetooth cfg80211 rpcrdma rdma_cm iw_cm ib_cm ib_core nfsd nfs_acl lockd grace nfs_localio ext4 crc16 mbcache jbd2 overlay uinput snd_seq_dummy snd_hrtimer qrtr rfkill vfat fat uvcvideo snd_hda_codec_generic videobuf2_vmalloc videobuf2_memops snd_hda_intel uvc snd_intel_dspcfg videobuf2_v4l2 videobuf2_common snd_hda_codec snd_hda_core videodev snd_hwdep snd_seq mc snd_seq_device snd_pcm snd_timer snd soundcore sg loop auth_rpcgss vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vmw_vmci vsock xfs 8021q garp stp llc mrp nvme ghash_ce e1000e nvme_core sr_mod nvme_keyring nvme_auth cdrom vmwgfx drm_ttm_helper ttm sunrpc dm_mirror dm_region_hash dm_log iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse dm_multipath dm_mod nfnetlink
[ 1626.855594] CPU: 2 UID: 0 PID: 199 Comm: kworker/u24:33 Kdump: loaded Tainted: G B W 6.17.0-rc7+ #22 PREEMPT(voluntary)
[ 1626.857075] Tainted: [B]=BAD_PAGE, [W]=WARN
[ 1626.857573] Hardware name: VMware, Inc. VMware20,1/VBSA, BIOS VMW201.00V.24006586.BA64.2406042154 06/04/2024
[ 1626.858724] Workqueue: nfsd4 laundromat_main [nfsd]
[ 1626.859304] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 1626.860010] pc : __list_del_entry_valid_or_report+0x148/0x200
[ 1626.860601] lr : __list_del_entry_valid_or_report+0x148/0x200
[ 1626.861182] sp : ffff8000881d7a40
[ 1626.861521] x29: ffff8000881d7a40 x28: 0000000000000018 x27: ffff0000c2a98200
[ 1626.862260] x26: 0000000000000600 x25: 0000000000000000 x24: ffff8000881d7b20
[ 1626.862986] x23: ffff0000c2a981e8 x22: 1fffe00012410e7d x21: ffff0000920873e8
[ 1626.863701] x20: ffff0000920873e8 x19: ffff000086f22998 x18: 0000000000000000
[ 1626.864421] x17: 20747562202c3839 x16: 3932326636383030 x15: 3030666666662065
[ 1626.865092] x14: 6220646c756f6873 x13: 0000000000000001 x12: ffff60004fd9e4a3
[ 1626.865713] x11: 1fffe0004fd9e4a2 x10: ffff60004fd9e4a2 x9 : dfff800000000000
[ 1626.866320] x8 : 00009fffb0261b5e x7 : ffff00027ecf2513 x6 : 0000000000000001
[ 1626.866938] x5 : ffff00027ecf2510 x4 : ffff60004fd9e4a3 x3 : 0000000000000000
[ 1626.867553] x2 : 0000000000000000 x1 : ffff000096069640 x0 : 000000000000006d
[ 1626.868167] Call trace:
[ 1626.868382] __list_del_entry_valid_or_report+0x148/0x200 (P)
[ 1626.868876] _free_cpntf_state_locked+0xd0/0x268 [nfsd]
[ 1626.869368] nfs4_laundromat+0x6f8/0x1058 [nfsd]
[ 1626.869813] laundromat_main+0x24/0x60 [nfsd]
[ 1626.870231] process_one_work+0x584/0x1050
[ 1626.870595] worker_thread+0x4c4/0xc60
[ 1626.870893] kthread+0x2f8/0x398
[ 1626.871146] ret_from_fork+0x10/0x20
[ 1626.871422] Code: aa1303e1 aa1403e3 910e8000 97bc55d7 (d4210000)
[ 1626.871892] SMP: stopping secondary CPUs
Reported-by: rtm@csail.mit.edu
Closes: https://lore.kernel.org/linux-nfs/d8f064c1-a26f-4eed-b4f0-1f7f608f415f@oracle.com/T/#t
Fixes: 624322f1adc5 ("NFSD add COPY_NOTIFY operation")
Cc: stable@vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Because kthread_stop did not stop sc_task properly and returned -EINTR,
the sc_timer was not properly closed, ultimately causing the problem [1]
reported by syzbot when freeing sci due to the sc_timer not being closed.
Because the thread sc_task main function nilfs_segctor_thread() returns 0
when it succeeds, when the return value of kthread_stop() is not 0 in
nilfs_segctor_destroy(), we believe that it has not properly closed
sc_timer.
We use timer_shutdown_sync() to sync wait for sc_timer to shutdown, and
set the value of sc_task to NULL under the protection of lock
sc_state_lock, so as to avoid the issue caused by sc_timer not being
properly shutdowned.
[1]
ODEBUG: free active (active state 0) object: 00000000dacb411a object type: timer_list hint: nilfs_construction_timeout
Call trace:
nilfs_segctor_destroy fs/nilfs2/segment.c:2811 [inline]
nilfs_detach_log_writer+0x668/0x8cc fs/nilfs2/segment.c:2877
nilfs_put_super+0x4c/0x12c fs/nilfs2/super.c:509
Link: https://lkml.kernel.org/r/20251029225226.16044-1-konishi.ryusuke@gmail.com
Fixes: 3f66cc261ccb ("nilfs2: use kthread_create and kthread_stop for the log writer thread")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+24d8b70f039151f65590@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=24d8b70f039151f65590
Tested-by: syzbot+24d8b70f039151f65590@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Cc: <stable@vger.kernel.org> [6.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Pde is erased from subdir rbtree through rb_erase(), but not set the node
to EMPTY, which may result in uaf access. We should use RB_CLEAR_NODE()
set the erased node to EMPTY, then pde_subdir_next() will return NULL to
avoid uaf access.
We found an uaf issue while using stress-ng testing, need to run testcase
getdent and tun in the same time. The steps of the issue is as follows:
1) use getdent to traverse dir /proc/pid/net/dev_snmp6/, and current
pde is tun3;
2) in the [time windows] unregister netdevice tun3 and tun2, and erase
them from rbtree. erase tun3 first, and then erase tun2. the
pde(tun2) will be released to slab;
3) continue to getdent process, then pde_subdir_next() will return
pde(tun2) which is released, it will case uaf access.
CPU 0 | CPU 1
-------------------------------------------------------------------------
traverse dir /proc/pid/net/dev_snmp6/ | unregister_netdevice(tun->dev) //tun3 tun2
sys_getdents64() |
iterate_dir() |
proc_readdir() |
proc_readdir_de() | snmp6_unregister_dev()
pde_get(de); | proc_remove()
read_unlock(&proc_subdir_lock); | remove_proc_subtree()
| write_lock(&proc_subdir_lock);
[time window] | rb_erase(&root->subdir_node, &parent->subdir);
| write_unlock(&proc_subdir_lock);
read_lock(&proc_subdir_lock); |
next = pde_subdir_next(de); |
pde_put(de); |
de = next; //UAF |
rbtree of dev_snmp6
|
pde(tun3)
/ \
NULL pde(tun2)
Link: https://lkml.kernel.org/r/20251025024233.158363-1-albin_yang@163.com
Signed-off-by: Wei Yang <albinwyang@tencent.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: wangzijie <wangzijie1@honor.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When the per-IP connection limit is exceeded in ksmbd_kthread_fn(),
the code sets ret = -EAGAIN and continues the accept loop without
closing the just-accepted socket. That leaks one socket per rejected
attempt from a single IP and enables a trivial remote DoS.
Release client_sk before continuing.
This bug was found with ZeroPath.
Cc: stable@vger.kernel.org
Signed-off-by: Joshua Rogers <linux@joshua.hu>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
smb_direct_prepare_negotiation() posts a recv and then, if
smb_direct_accept_client() fails, calls put_recvmsg() on the same
buffer. That unmaps and recycles a buffer that is still posted on
the QP., which can lead to device DMA into unmapped or reused memory.
Track whether the recv was posted and only return it if it was never
posted. If accept fails after a post, leave it for teardown to drain
and complete safely.
Signed-off-by: Joshua Rogers <linux@joshua.hu>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The user calls fsconfig twice, but when the program exits, free() only
frees ctx->source for the second fsconfig, not the first.
Regarding fc->source, there is no code in the fs context related to its
memory reclamation.
To fix this memory leak, release the source memory corresponding to ctx
or fc before each parsing.
syzbot reported:
BUG: memory leak
unreferenced object 0xffff888128afa360 (size 96):
backtrace (crc 79c9c7ba):
kstrdup+0x3c/0x80 mm/util.c:84
smb3_fs_context_parse_param+0x229b/0x36c0 fs/smb/client/fs_context.c:1444
BUG: memory leak
unreferenced object 0xffff888112c7d900 (size 96):
backtrace (crc 79c9c7ba):
smb3_fs_context_fullpath+0x70/0x1b0 fs/smb/client/fs_context.c:629
smb3_fs_context_parse_param+0x2266/0x36c0 fs/smb/client/fs_context.c:1438
Reported-by: syzbot+72afd4c236e6bc3f4bac@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=72afd4c236e6bc3f4bac
Cc: stable@vger.kernel.org
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
cifs_pick_channel iterates candidate channels using cur. The
reconnect-state test mistakenly used a different variable.
This checked the wrong slot and would cause us to skip a healthy channel
and to dispatch on one that needs reconnect, occasionally failing
operations when a channel was down.
Fix by replacing for the correct variable.
Fixes: fc43a8ac396d ("cifs: cifs_pick_channel should try selecting active channels")
Cc: stable@vger.kernel.org
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Pull smb client fixes from Steve French:
- Fix change notify packet validation check
- Refcount fix (e.g. rename error paths)
- Fix potential UAF due to missing locks on directory lease refcount
* tag 'v6.18rc4-SMB-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: validate change notify buffer before copy
smb: client: fix refcount leak in smb2_set_path_attr
smb: client: fix potential UAF in smb2_close_cached_fid()
|
|
Pull xfs fixes from Carlos Maiolino:
"This contain fixes for the RT and zoned allocator, and a few fixes for
atomic writes"
* tag 'xfs-fixes-6.18-rc5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: free xfs_busy_extents structure when no RT extents are queued
xfs: fix zone selection in xfs_select_open_zone_mru
xfs: fix a rtgroup leak when xfs_init_zone fails
xfs: fix various problems in xfs_atomic_write_cow_iomap_begin
xfs: fix delalloc write failures in software-provided atomic writes
|
|
SMB2_change_notify called smb2_validate_iov() but ignored the return
code, then kmemdup()ed using server provided OutputBufferOffset/Length.
Check the return of smb2_validate_iov() and bail out on error.
Discovered with help from the ZeroPath security tooling.
Signed-off-by: Joshua Rogers <linux@joshua.hu>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: stable@vger.kernel.org
Fixes: e3e9463414f61 ("smb3: improve SMB3 change notification support")
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Pull smb server fixes from Steve French:
- More safely detect RDMA capable devices correctly
* tag 'v6.18-rc4-smb-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: detect RDMA capable netdevs include IPoIB
ksmbd: detect RDMA capable lower devices when bridge and vlan netdev is used
|
|
Pull fscrypt fix from Eric Biggers:
"Fix an UBSAN warning that started occurring when the block layer
started supporting logical_block_size > PAGE_SIZE"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
fscrypt: fix left shift underflow when inode->i_blkbits > PAGE_SHIFT
|
|
Currently, the decompression logic incorrectly spins if compressed
data is truncated in crafted (deliberately corrupted) images.
Fixes: 7c35de4df105 ("erofs: Zstandard compression support")
Reported-by: Robert Morris <rtm@csail.mit.edu>
Closes: https://lore.kernel.org/r/50958.1761605413@localhost
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chunhai Guo <guochunhai@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
|
|
kmemleak occasionally reports leaking xfs_busy_extents structure
from xfs_scrub calls after running xfs/528 (but attributed to following
tests), which seems to be caused by not freeing the xfs_busy_extents
structure when tr.queued is 0 and xfs_trim_rtgroup_extents breaks out
of the main loop. Free the structure in this case.
Fixes: a3315d11305f ("xfs: use rtgroup busy extent list for FITRIM")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
data_reloc_print_warning_inode() calls btrfs_get_fs_root() to obtain
local_root, but fails to release its reference when paths_from_inode()
returns an error. This causes a potential memory leak.
Add a missing btrfs_put_root() call in the error path to properly
decrease the reference count of local_root.
Fixes: b9a9a85059cde ("btrfs: output affected files when relocation fails")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
scrub_raid56_parity_stripe() allocates a bio with bio_alloc(), but
fails to release it on some error paths, leading to a potential
memory leak.
Add the missing bio_put() calls to properly drop the bio reference
in those error cases.
Fixes: 1009254bf22a3 ("btrfs: scrub: use scrub_stripe to implement RAID56 P/Q scrub")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When logging that a new name exists, we skip updating the inode's
last_log_commit field to prevent a later explicit fsync against the inode
from doing nothing (as updating last_log_commit makes btrfs_inode_in_log()
return true). We are detecting, at btrfs_log_inode(), that logging a new
name is happening by checking the logging mode is not LOG_INODE_EXISTS,
but that is not enough because we may log parent directories when logging
a new name of a file in LOG_INODE_ALL mode - we need to check that the
logging_new_name field of the log context too.
An example scenario where this results in an explicit fsync against a
directory not persisting changes to the directory is the following:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ touch /mnt/foo
$ sync
$ mkdir /mnt/dir
# Write some data to our file and fsync it.
$ xfs_io -c "pwrite -S 0xab 0 64K" -c "fsync" /mnt/foo
# Add a new link to our file. Since the file was logged before, we
# update it in the log tree by calling btrfs_log_new_name().
$ ln /mnt/foo /mnt/dir/bar
# fsync the root directory - we expect it to persist the dentry for
# the new directory "dir".
$ xfs_io -c "fsync" /mnt
<power fail>
After mounting the fs the entry for directory "dir" does not exists,
despite the explicit fsync on the root directory.
Here's why this happens:
1) When we fsync the file we log the inode, so that it's present in the
log tree;
2) When adding the new link we enter btrfs_log_new_name(), and since the
inode is in the log tree we proceed to updating the inode in the log
tree;
3) We first set the inode's last_unlink_trans to the current transaction
(early in btrfs_log_new_name());
4) We then eventually enter btrfs_log_inode_parent(), and after logging
the file's inode, we call btrfs_log_all_parents() because the inode's
last_unlink_trans matches the current transaction's ID (updated in the
previous step);
5) So btrfs_log_all_parents() logs the root directory by calling
btrfs_log_inode() for the root's inode with a log mode of LOG_INODE_ALL
so that new dentries are logged;
6) At btrfs_log_inode(), because the log mode is LOG_INODE_ALL, we
update root inode's last_log_commit to the last transaction that
changed the inode (->last_sub_trans field of the inode), which
corresponds to the current transaction's ID;
7) Then later when user space explicitly calls fsync against the root
directory, we enter btrfs_sync_file(), which calls skip_inode_logging()
and that returns true, since its call to btrfs_inode_in_log() returns
true and there are no ordered extents (it's a directory, never has
ordered extents). This results in btrfs_sync_file() returning without
syncing the log or committing the current transaction, so all the
updates we did when logging the new name, including logging the root
directory, are not persisted.
So fix this by but updating the inode's last_log_commit if we are sure
we are not logging a new name (if ctx->logging_new_name is false).
A test case for fstests will follow soon.
Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/03c5d7ec-5b3d-49d1-95bc-8970a7f82d87@gmail.com/
Fixes: 130341be7ffa ("btrfs: always update the logged transaction when logging new names")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The stripe offset calculation in the zoned code for raid0 and raid10
wrongly uses map->stripe_size to calculate it. In fact, map->stripe_size is
the size of the device extent composing the block group, which always is
the zone_size on the zoned setup.
Fix it by using BTRFS_STRIPE_LEN and BTRFS_STRIPE_LEN_SHIFT. Also, optimize
the calculation a bit by doing the common calculation only once.
Fixes: c0d90a79e8e6 ("btrfs: zoned: fix alloc_offset calculation for partly conventional block groups")
CC: stable@vger.kernel.org # 6.17+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When a block group contains both conventional zone and sequential zone, the
capacity of the block group is wrongly set to the block group's full
length. The capacity should be calculated in btrfs_load_block_group_* using
the last allocation offset.
Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
CC: stable@vger.kernel.org # v6.12+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
xfs_select_open_zone_mru needs to pass XFS_ZONE_ALLOC_OK to
xfs_try_use_zone because we only want to tightly pack into zones of the
same or a compatible temperature instead of any available zone.
This got broken in commit 0301dae732a5 ("xfs: refactor hint based zone
allocation"), which failed to update this particular caller when
switching to an enum. xfs/638 sometimes, but not reliably fails due to
this change.
Fixes: 0301dae732a5 ("xfs: refactor hint based zone allocation")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Drop the rtgrop reference when xfs_init_zone fails for a conventional
device.
Fixes: 4e4d52075577 ("xfs: add the zoned space allocator")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
I think there are several things wrong with this function:
A) xfs_bmapi_write can return a much larger unwritten mapping than what
the caller asked for. We convert part of that range to written, but
return the entire written mapping to iomap even though that's
inaccurate.
B) The arguments to xfs_reflink_convert_cow_locked are wrong -- an
unwritten mapping could be *smaller* than the write range (or even
the hole range). In this case, we convert too much file range to
written state because we then return a smaller mapping to iomap.
C) It doesn't handle delalloc mappings. This I covered in the patch
that I already sent to the list.
D) Reassigning count_fsb to handle the hole means that if the second
cmap lookup attempt succeeds (due to racing with someone else) we
trim the mapping more than is strictly necessary. The changing
meaning of count_fsb makes this harder to notice.
E) The tracepoint is kinda wrong because @length is mutated. That makes
it harder to chase the data flows through this function because you
can't just grep on the pos/bytecount strings.
F) We don't actually check that the br_state = XFS_EXT_NORM assignment
is accurate, i.e that the cow fork actually contains a written
mapping for the range we're interested in
G) Somewhat inadequate documentation of why we need to xfs_trim_extent
so aggressively in this function.
H) Not sure why xfs_iomap_end_fsb is used here, the vfs already clamped
the write range to s_maxbytes.
Fix these issues, and then the atomic writes regressions in generic/760,
generic/617, generic/091, generic/263, and generic/521 all go away for
me.
Cc: stable@vger.kernel.org # v6.16
Fixes: bd1d2c21d5d249 ("xfs: add xfs_atomic_write_cow_iomap_begin()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
With the 20 Oct 2025 release of fstests, generic/521 fails for me on
regular (aka non-block-atomic-writes) storage:
QA output created by 521
dowrite: write: Input/output error
LOG DUMP (8553 total operations):
1( 1 mod 256): SKIPPED (no operation)
2( 2 mod 256): WRITE 0x7e000 thru 0x8dfff (0x10000 bytes) HOLE
3( 3 mod 256): READ 0x69000 thru 0x79fff (0x11000 bytes)
4( 4 mod 256): FALLOC 0x53c38 thru 0x5e853 (0xac1b bytes) INTERIOR
5( 5 mod 256): COPY 0x55000 thru 0x59fff (0x5000 bytes) to 0x25000 thru 0x29fff
6( 6 mod 256): WRITE 0x74000 thru 0x88fff (0x15000 bytes)
7( 7 mod 256): ZERO 0xedb1 thru 0x11693 (0x28e3 bytes)
with a warning in dmesg from iomap about XFS trying to give it a
delalloc mapping for a directio write. Fix the software atomic write
iomap_begin code to convert the reservation into a written mapping.
This doesn't fix the data corruption problems reported by generic/760,
but it's a start.
Cc: stable@vger.kernel.org # v6.16
Fixes: bd1d2c21d5d249 ("xfs: add xfs_atomic_write_cow_iomap_begin()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
sb_min_blocksize() may return 0. Check its return value to avoid the
filesystem super block when sb->s_blocksize is 0.
Cc: stable@vger.kernel.org # v6.15
Fixes: a64e5a596067bd ("bdev: add back PAGE_SIZE block size validation for sb_set_blocksize()")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Link: https://patch.msgid.link/20251104125009.2111925-5-yangyongpeng.storage@gmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
sb_min_blocksize() may return 0. Check its return value to avoid
opt->blocksize and sb->s_blocksize is 0.
Cc: stable@vger.kernel.org # v6.15
Fixes: 1b17a46c9243e9 ("isofs: convert isofs to use the new mount API")
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Link: https://patch.msgid.link/20251104125009.2111925-4-yangyongpeng.storage@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
sb_min_blocksize() may return 0. Check its return value to avoid
accessing the filesystem super block when sb->s_blocksize is 0.
Cc: stable@vger.kernel.org # v6.15
Fixes: 719c1e1829166d ("exfat: add super block operations")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Link: https://patch.msgid.link/20251104125009.2111925-3-yangyongpeng.storage@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When emulating an nvme device on qemu with both logical_block_size and
physical_block_size set to 8 KiB, but without format, a kernel panic
was triggered during the early boot stage while attempting to mount a
vfat filesystem.
[95553.682035] EXT4-fs (nvme0n1): unable to set blocksize
[95553.684326] EXT4-fs (nvme0n1): unable to set blocksize
[95553.686501] EXT4-fs (nvme0n1): unable to set blocksize
[95553.696448] ISOFS: unsupported/invalid hardware sector size 8192
[95553.697117] ------------[ cut here ]------------
[95553.697567] kernel BUG at fs/buffer.c:1582!
[95553.697984] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[95553.698602] CPU: 0 UID: 0 PID: 7212 Comm: mount Kdump: loaded Not tainted 6.18.0-rc2+ #38 PREEMPT(voluntary)
[95553.699511] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[95553.700534] RIP: 0010:folio_alloc_buffers+0x1bb/0x1c0
[95553.701018] Code: 48 8b 15 e8 93 18 02 65 48 89 35 e0 93 18 02 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc <0f> 0b 90 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
[95553.702648] RSP: 0018:ffffd1b0c676f990 EFLAGS: 00010246
[95553.703132] RAX: ffff8cfc4176d820 RBX: 0000000000508c48 RCX: 0000000000000001
[95553.703805] RDX: 0000000000002000 RSI: 0000000000000000 RDI: 0000000000000000
[95553.704481] RBP: ffffd1b0c676f9c8 R08: 0000000000000000 R09: 0000000000000000
[95553.705148] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[95553.705816] R13: 0000000000002000 R14: fffff8bc8257e800 R15: 0000000000000000
[95553.706483] FS: 000072ee77315840(0000) GS:ffff8cfdd2c8d000(0000) knlGS:0000000000000000
[95553.707248] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[95553.707782] CR2: 00007d8f2a9e5a20 CR3: 0000000039d0c006 CR4: 0000000000772ef0
[95553.708439] PKRU: 55555554
[95553.708734] Call Trace:
[95553.709015] <TASK>
[95553.709266] __getblk_slow+0xd2/0x230
[95553.709641] ? find_get_block_common+0x8b/0x530
[95553.710084] bdev_getblk+0x77/0xa0
[95553.710449] __bread_gfp+0x22/0x140
[95553.710810] fat_fill_super+0x23a/0xfc0
[95553.711216] ? __pfx_setup+0x10/0x10
[95553.711580] ? __pfx_vfat_fill_super+0x10/0x10
[95553.712014] vfat_fill_super+0x15/0x30
[95553.712401] get_tree_bdev_flags+0x141/0x1e0
[95553.712817] get_tree_bdev+0x10/0x20
[95553.713177] vfat_get_tree+0x15/0x20
[95553.713550] vfs_get_tree+0x2a/0x100
[95553.713910] vfs_cmd_create+0x62/0xf0
[95553.714273] __do_sys_fsconfig+0x4e7/0x660
[95553.714669] __x64_sys_fsconfig+0x20/0x40
[95553.715062] x64_sys_call+0x21ee/0x26a0
[95553.715453] do_syscall_64+0x80/0x670
[95553.715816] ? __fs_parse+0x65/0x1e0
[95553.716172] ? fat_parse_param+0x103/0x4b0
[95553.716587] ? vfs_parse_fs_param_source+0x21/0xa0
[95553.717034] ? __do_sys_fsconfig+0x3d9/0x660
[95553.717548] ? __x64_sys_fsconfig+0x20/0x40
[95553.717957] ? x64_sys_call+0x21ee/0x26a0
[95553.718360] ? do_syscall_64+0xb8/0x670
[95553.718734] ? __x64_sys_fsconfig+0x20/0x40
[95553.719141] ? x64_sys_call+0x21ee/0x26a0
[95553.719545] ? do_syscall_64+0xb8/0x670
[95553.719922] ? x64_sys_call+0x1405/0x26a0
[95553.720317] ? do_syscall_64+0xb8/0x670
[95553.720702] ? __x64_sys_close+0x3e/0x90
[95553.721080] ? x64_sys_call+0x1b5e/0x26a0
[95553.721478] ? do_syscall_64+0xb8/0x670
[95553.721841] ? irqentry_exit+0x43/0x50
[95553.722211] ? exc_page_fault+0x90/0x1b0
[95553.722681] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[95553.723166] RIP: 0033:0x72ee774f3afe
[95553.723562] Code: 73 01 c3 48 8b 0d 0a 33 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 49 89 ca b8 af 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d da 32 0f 00 f7 d8 64 89 01 48
[95553.725188] RSP: 002b:00007ffe97148978 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
[95553.725892] RAX: ffffffffffffffda RBX: 00005dcfe53d0080 RCX: 000072ee774f3afe
[95553.726526] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
[95553.727176] RBP: 00007ffe97148ac0 R08: 0000000000000000 R09: 000072ee775e7ac0
[95553.727818] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[95553.728459] R13: 00005dcfe53d04b0 R14: 000072ee77670b00 R15: 00005dcfe53d1a28
[95553.729086] </TASK>
The panic occurs as follows:
1. logical_block_size is 8KiB, causing {struct super_block *sb}->s_blocksize
is initialized to 0.
vfat_fill_super
- fat_fill_super
- sb_min_blocksize
- sb_set_blocksize //return 0 when size is 8KiB.
2. __bread_gfp is called with size == 0, causing folio_alloc_buffers() to
compute an offset equal to folio_size(folio), which triggers a BUG_ON.
fat_fill_super
- sb_bread
- __bread_gfp // size == {struct super_block *sb}->s_blocksize == 0
- bdev_getblk
- __getblk_slow
- grow_buffers
- grow_dev_folio
- folio_alloc_buffers // size == 0
- folio_set_bh //offset == folio_size(folio) and panic
To fix this issue, add proper return value checks for
sb_min_blocksize().
Cc: stable@vger.kernel.org # v6.15
Fixes: a64e5a596067bd ("bdev: add back PAGE_SIZE block size validation for sb_set_blocksize()")
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Link: https://patch.msgid.link/20251104125009.2111925-2-yangyongpeng.storage@gmail.com
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
bm_register_write() opens an executable file using open_exec(), which
internally calls do_open_execat() and denies write access on the file to
avoid modification while it is being executed.
However, when an error occurs, bm_register_write() closes the file using
filp_close() directly. This does not restore the write permission, which
may cause subsequent write operations on the same file to fail.
Fix this by calling exe_file_allow_write_access() before filp_close() to
restore the write permission properly.
Fixes: e7850f4d844e ("binfmt_misc: fix possible deadlock in bm_register_write")
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Link: https://patch.msgid.link/20251105022923.1813587-1-zilin@seu.edu.cn
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In virtio_fs_add_queues_sysfs(), the code incorrectly checks fs->mqs_kobj
after calling kobject_create_and_add(). Change the check to fsvq->kobj
(fs->mqs_kobj -> fsvq->kobj) to ensure the per-queue kobject is
successfully created.
Fixes: 87cbdc396a31 ("virtio_fs: add sysfs entries for queue information")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Link: https://patch.msgid.link/20251027104658.1668537-1-alok.a.tiwari@oracle.com
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When simulating an nvme device on qemu with both logical_block_size and
physical_block_size set to 8 KiB, an error trace appears during
partition table reading at boot time. The issue is caused by
inode->i_blkbits being larger than PAGE_SHIFT, which leads to a left
shift of -1 and triggering a UBSAN warning.
[ 2.697306] ------------[ cut here ]------------
[ 2.697309] UBSAN: shift-out-of-bounds in fs/crypto/inline_crypt.c:336:37
[ 2.697311] shift exponent -1 is negative
[ 2.697315] CPU: 3 UID: 0 PID: 274 Comm: (udev-worker) Not tainted 6.18.0-rc2+ #34 PREEMPT(voluntary)
[ 2.697317] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 2.697320] Call Trace:
[ 2.697324] <TASK>
[ 2.697325] dump_stack_lvl+0x76/0xa0
[ 2.697340] dump_stack+0x10/0x20
[ 2.697342] __ubsan_handle_shift_out_of_bounds+0x1e3/0x390
[ 2.697351] bh_get_inode_and_lblk_num.cold+0x12/0x94
[ 2.697359] fscrypt_set_bio_crypt_ctx_bh+0x44/0x90
[ 2.697365] submit_bh_wbc+0xb6/0x190
[ 2.697370] block_read_full_folio+0x194/0x270
[ 2.697371] ? __pfx_blkdev_get_block+0x10/0x10
[ 2.697375] ? __pfx_blkdev_read_folio+0x10/0x10
[ 2.697377] blkdev_read_folio+0x18/0x30
[ 2.697379] filemap_read_folio+0x40/0xe0
[ 2.697382] filemap_get_pages+0x5ef/0x7a0
[ 2.697385] ? mmap_region+0x63/0xd0
[ 2.697389] filemap_read+0x11d/0x520
[ 2.697392] blkdev_read_iter+0x7c/0x180
[ 2.697393] vfs_read+0x261/0x390
[ 2.697397] ksys_read+0x71/0xf0
[ 2.697398] __x64_sys_read+0x19/0x30
[ 2.697399] x64_sys_call+0x1e88/0x26a0
[ 2.697405] do_syscall_64+0x80/0x670
[ 2.697410] ? __x64_sys_newfstat+0x15/0x20
[ 2.697414] ? x64_sys_call+0x204a/0x26a0
[ 2.697415] ? do_syscall_64+0xb8/0x670
[ 2.697417] ? irqentry_exit_to_user_mode+0x2e/0x2a0
[ 2.697420] ? irqentry_exit+0x43/0x50
[ 2.697421] ? exc_page_fault+0x90/0x1b0
[ 2.697422] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 2.697425] RIP: 0033:0x75054cba4a06
[ 2.697426] Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08
[ 2.697427] RSP: 002b:00007fff973723a0 EFLAGS: 00000202 ORIG_RAX: 0000000000000000
[ 2.697430] RAX: ffffffffffffffda RBX: 00005ea9a2c02760 RCX: 000075054cba4a06
[ 2.697432] RDX: 0000000000002000 RSI: 000075054c190000 RDI: 000000000000001b
[ 2.697433] RBP: 00007fff973723c0 R08: 0000000000000000 R09: 0000000000000000
[ 2.697434] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
[ 2.697434] R13: 00005ea9a2c027c0 R14: 00005ea9a2be5608 R15: 00005ea9a2be55f0
[ 2.697436] </TASK>
[ 2.697436] ---[ end trace ]---
This situation can happen for block devices because when
CONFIG_TRANSPARENT_HUGEPAGE is enabled, the maximum logical_block_size
is 64 KiB. set_init_blocksize() then sets the block device
inode->i_blkbits to 13, which is within this limit.
File I/O does not trigger this problem because for filesystems that do
not support the FS_LBS feature, sb_set_blocksize() prevents
sb->s_blocksize_bits from being larger than PAGE_SHIFT. During inode
allocation, alloc_inode()->inode_init_always() assigns inode->i_blkbits
from sb->s_blocksize_bits. Currently, only xfs_fs_type has the FS_LBS
flag, and since xfs I/O paths do not reach submit_bh_wbc(), it does not
hit the left-shift underflow issue.
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Fixes: 47dd67532303 ("block/bdev: lift block size restrictions to 64k")
Cc: stable@vger.kernel.org
[EB: use folio_pos() and consolidate the two shifts by i_blkbits]
Link: https://lore.kernel.org/r/20251105003642.42796-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Fix refcount leak in `smb2_set_path_attr` when path conversion fails.
Function `cifs_get_writable_path` returns `cfile` with its reference
counter `cfile->count` increased on success. Function `smb2_compound_op`
would decrease the reference counter for `cfile`, as stated in its
comment. By calling `smb2_rename_path`, the reference counter of `cfile`
would leak if `cifs_convert_path_to_utf16` fails in `smb2_set_path_attr`.
Fixes: 8de9e86c67ba ("cifs: create a helper to find a writeable handle by path name")
Acked-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Shuhao Fu <sfual@cse.ust.hk>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
RFC 7862 Section 4.1.2 says that if the server supports CLONE it MUST
support clone_blksize attribute.
Fixes: d6ca7d2643ee ("NFSD: Implement FATTR4_CLONE_BLKSIZE attribute")
Cc: stable@vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
nfsd exports a "pseudo root filesystem" which is used by NFSv4 to find
the various exported filesystems using LOOKUP requests from a known root
filehandle. NFSv3 uses the MOUNT protocol to find those exported
filesystems and so is not given access to the pseudo root filesystem.
If a v3 (or v2) client uses a filehandle from that filesystem,
nfsd_set_fh_dentry() will report an error, but still stores the export
in "struct svc_fh" even though it also drops the reference (exp_put()).
This means that when fh_put() is called an extra reference will be dropped
which can lead to use-after-free and possible denial of service.
Normal NFS usage will not provide a pseudo-root filehandle to a v3
client. This bug can only be triggered by the client synthesising an
incorrect filehandle.
To fix this we move the assignments to the svc_fh later, after all
possible error cases have been detected.
Reported-and-tested-by: tianshuo han <hantianshuo233@gmail.com>
Fixes: ef7f6c4904d0 ("nfsd: move V4ROOT version check to nfsd_set_fh_dentry()")
Signed-off-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
find_or_create_cached_dir() could grab a new reference after kref_put()
had seen the refcount drop to zero but before cfid_list_lock is acquired
in smb2_close_cached_fid(), leading to use-after-free.
Switch to kref_put_lock() so cfid_release() is called with
cfid_list_lock held, closing that gap.
Fixes: ebe98f1447bb ("cifs: enable caching of directories for which a lease is held")
Cc: stable@vger.kernel.org
Reported-by: Jay Shin <jaeshin@redhat.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Current ksmbd_rdma_capable_netdev fails to mark certain RDMA-capable
inerfaces such as IPoIB as RDMA capable after reverting GUID matching code
due to layer violation.
This patch check the ARPHRD_INFINIBAND type safely identifies an IPoIB
interface without introducing a layer violation, ensuring RDMA
functionality is correctly enabled for these interfaces.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
If user set bridge interface as actual RDMA-capable NICs are lower devices,
ksmbd can not detect as RDMA capable. This patch can detect the RDMA
capable lower devices from bridge master or VLAN. With this change, ksmbd
can accept both TCP and RDMA connections through the same bridge IP
address, allowing mixed transport operation without requiring separate
interfaces.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix memory leak in qgroup relation ioctl when qgroup levels are
invalid
- don't write back dirty metadata on filesystem with errors
- properly log renamed links
- properly mark prealloc extent range beyond inode size as dirty (when
no-noles is not enabled)
* tag 'for-6.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: mark dirty extent range for out of bound prealloc extents
btrfs: set inode flag BTRFS_INODE_COPY_EVERYTHING when logging new name
btrfs: fix memory leak of qgroup_list in btrfs_add_qgroup_relation
btrfs: ensure no dirty metadata is written back for an fs with errors
|
|
Pull xfs fixes from Carlos Maiolino:
"Just a single bug fix (and documentation for the issue)"
* tag 'xfs-fixes-6.18-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: document another racy GC case in xfs_zoned_map_extent
xfs: prevent gc from picking the same zone twice
|
|
Pull smb client fixes from Steve French:
- fix potential UAF in statfs
- DFS fix for expired referrals
- fix minor modinfo typo
- small improvement to reconnect for smbdirect
* tag '6.18-rc3-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: call smbd_destroy() in the same splace as kernel_sock_shutdown()/sock_release()
smb: client: handle lack of IPC in dfs_cache_refresh()
smb: client: fix potential cfid UAF in smb2_query_info_compound
cifs: fix typo in enable_gcm_256 module parameter
|
|
Besides blocks being invalidated, there is another case when the original
mapping could have changed between querying the rmap for GC and calling
xfs_zoned_map_extent. Document it there as it took us quite some time
to figure out what is going on while developing the multiple-GC
protection fix.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
When we are picking a zone for gc it might already be in the pipeline
which can lead to us moving the same data twice resulting in in write
amplification and a very unfortunate case where we keep on garbage
collecting the zone we just filled with migrated data stopping all
forward progress.
Fix this by introducing a count of on-going GC operations on a zone, and
skip any zone with ongoing GC when picking a new victim.
Fixes: 080d01c41 ("xfs: implement zoned garbage collection")
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
In btrfs_fallocate(), when the allocated range overlaps with a prealloc
extent and the extent starts after i_size, the range doesn't get marked
dirty in file_extent_tree. This results in persisting an incorrect
disk_i_size for the inode when not using the no-holes feature.
This is reproducible since commit 41a2ee75aab0 ("btrfs: introduce
per-inode file extent tree"), then became hidden since commit 3d7db6e8bd22
("btrfs: don't allocate file extent tree for non regular files") and then
visible again after commit 8679d2687c35 ("btrfs: initialize
inode::file_extent_tree after i_mode has been set"), which fixes the
previous commit.
The following reproducer triggers the problem:
$ cat test.sh
MNT=/mnt/test
DEV=/dev/vdb
mkdir -p $MNT
mkfs.btrfs -f -O ^no-holes $DEV
mount $DEV $MNT
touch $MNT/file1
fallocate -n -o 1M -l 2M $MNT/file1
umount $MNT
mount $DEV $MNT
len=$((1 * 1024 * 1024))
fallocate -o 1M -l $len $MNT/file1
du --bytes $MNT/file1
umount $MNT
mount $DEV $MNT
du --bytes $MNT/file1
umount $MNT
Running the reproducer gives the following result:
$ ./test.sh
(...)
2097152 /mnt/test/file1
1048576 /mnt/test/file1
The difference is exactly 1048576 as we assigned.
Fix by adding a call to btrfs_inode_set_file_extent_range() in
btrfs_fallocate_update_isize().
Fixes: 41a2ee75aab0 ("btrfs: introduce per-inode file extent tree")
Signed-off-by: austinchang <austinchang@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we are logging a new name make sure our inode has the runtime flag
BTRFS_INODE_COPY_EVERYTHING set so that at btrfs_log_inode() we will find
new inode refs/extrefs in the subvolume tree and copy them into the log
tree.
We are currently doing it when adding a new link but we are missing it
when renaming.
An example where this makes a new name not persisted:
1) create symlink with name foo in directory A
2) fsync directory A, which persists the symlink
3) rename the symlink from foo to bar
4) fsync directory A to persist the new symlink name
Step 4 isn't working correctly as it's not logging the new name and also
leaving the old inode ref in the log tree, so after a power failure the
symlink still has the old name of "foo". This is because when we first
fsync directoy A we log the symlink's inode (as it's a new entry) and at
btrfs_log_inode() we set the log mode to LOG_INODE_ALL and then because
we are using that mode and the inode has the runtime flag
BTRFS_INODE_NEEDS_FULL_SYNC set, we clear that flag as well as the flag
BTRFS_INODE_COPY_EVERYTHING. That means the next time we log the inode,
during the rename through the call to btrfs_log_new_name() (calling
btrfs_log_inode_parent() and then btrfs_log_inode()), we will not search
the subvolume tree for new refs/extrefs and jump directory to the
'log_extents' label.
Fix this by making sure we set BTRFS_INODE_COPY_EVERYTHING on an inode
when we are about to log a new name. A test case for fstests will follow
soon.
Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/ac949c74-90c2-4b9a-b7fd-1ffc5c3175c7@gmail.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When btrfs_add_qgroup_relation() is called with invalid qgroup levels
(src >= dst), the function returns -EINVAL directly without freeing the
preallocated qgroup_list structure passed by the caller. This causes a
memory leak because the caller unconditionally sets the pointer to NULL
after the call, preventing any cleanup.
The issue occurs because the level validation check happens before the
mutex is acquired and before any error handling path that would free
the prealloc pointer. On this early return, the cleanup code at the
'out' label (which includes kfree(prealloc)) is never reached.
In btrfs_ioctl_qgroup_assign(), the code pattern is:
prealloc = kzalloc(sizeof(*prealloc), GFP_KERNEL);
ret = btrfs_add_qgroup_relation(trans, sa->src, sa->dst, prealloc);
prealloc = NULL; // Always set to NULL regardless of return value
...
kfree(prealloc); // This becomes kfree(NULL), does nothing
When the level check fails, 'prealloc' is never freed by either the
callee or the caller, resulting in a 64-byte memory leak per failed
operation. This can be triggered repeatedly by an unprivileged user
with access to a writable btrfs mount, potentially exhausting kernel
memory.
Fix this by freeing prealloc before the early return, ensuring prealloc
is always freed on all error paths.
Fixes: 4addc1ffd67a ("btrfs: qgroup: preallocate memory before adding a relation")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Shardul Bankar <shardulsb08@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
During development of a minor feature (make sure all btrfs_bio::end_io()
is called in task context), I noticed a crash in generic/388, where
metadata writes triggered new works after btrfs_stop_all_workers().
It turns out that it can even happen without any code modification, just
using RAID5 for metadata and the same workload from generic/388 is going
to trigger the use-after-free.
[CAUSE]
If btrfs hits an error, the fs is marked as error, no new
transaction is allowed thus metadata is in a frozen state.
But there are some metadata modifications before that error, and they are
still in the btree inode page cache.
Since there will be no real transaction commit, all those dirty folios
are just kept as is in the page cache, and they can not be invalidated
by invalidate_inode_pages2() call inside close_ctree(), because they are
dirty.
And finally after btrfs_stop_all_workers(), we call iput() on btree
inode, which triggers writeback of those dirty metadata.
And if the fs is using RAID56 metadata, this will trigger RMW and queue
new works into rmw_workers, which is already stopped, causing warning
from queue_work() and use-after-free.
[FIX]
Add a special handling for write_one_eb(), that if the fs is already in
an error state, immediately mark the bbio as failure, instead of really
submitting them.
Then during close_ctree(), iput() will just discard all those dirty
tree blocks without really writing them back, thus no more new jobs for
already stopped-and-freed workqueues.
The extra discard in write_one_eb() also acts as an extra safenet.
E.g. the transaction abort is triggered by some extent/free space
tree corruptions, and since extent/free space tree is already corrupted
some tree blocks may be allocated where they shouldn't be (overwriting
existing tree blocks). In that case writing them back will further
corrupting the fs.
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
kernel_sock_shutdown()/sock_release()
With commit b0432201a11b ("smb: client: let destroy_mr_list() keep
smbdirect_mr_io memory if registered") the changes from commit
214bab448476 ("cifs: Call MID callback before destroying transport") and
commit 1d2a4f57cebd ("cifs:smbd When reconnecting to server, call
smbd_destroy() after all MIDs have been called") are no longer needed.
And it's better to use the same logic flow, so that
the chance of smbdirect related problems is smaller.
Fixes: 214bab448476 ("cifs: Call MID callback before destroying transport")
Fixes: 1d2a4f57cebd ("cifs:smbd When reconnecting to server, call smbd_destroy() after all MIDs have been called")
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
In very rare cases, DFS mounts could end up with SMB sessions without
any IPC connections. These mounts are only possible when having
unexpired cached DFS referrals, hence not requiring any IPC
connections during the mount process.
Try to establish those missing IPC connections when refreshing DFS
referrals. If the server is still rejecting it, then simply ignore
and leave expired cached DFS referral for any potential DFS failovers.
Reported-by: Jay Shin <jaeshin@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Clang, in particular, is not happy about dead code:
fs/namespace.c:135:37: error: unused function 'node_to_mnt_ns' [-Werror,-Wunused-function]
135 | static inline struct mnt_namespace *node_to_mnt_ns(const struct rb_node *node)
| ^~~~~~~~~~~~~~
1 error generated.
Remove a leftover from the previous cleanup.
Fixes: 7d7d16498958 ("mnt: support ns lookup")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20251024132336.1666382-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
syzbot is reporting that S_IFMT bits of inode->i_mode can become bogus when
the S_IFMT bits of the 32bits "mode" field loaded from disk are corrupted
or when the 32bits "attributes" field loaded from disk are corrupted.
A documentation says that BFS uses only lower 9 bits of the "mode" field.
But I can't find an explicit explanation that the unused upper 23 bits
(especially, the S_IFMT bits) are initialized with 0.
Therefore, ignore the S_IFMT bits of the "mode" field loaded from disk.
Also, verify that the value of the "attributes" field loaded from disk is
either BFS_VREG or BFS_VDIR (because BFS supports only regular files and
the root directory).
Reported-by: syzbot+895c23f6917da440ed0d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=895c23f6917da440ed0d
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://patch.msgid.link/fabce673-d5b9-4038-8287-0fd65d80203b@I-love.SAKURA.ne.jp
Reviewed-by: Tigran Aivazian <aivazian.tigran@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When a process tries to access an entry in /afs, normally what happens is
that an automount dentry is created by ->lookup() and then triggered, which
jumps through the ->d_automount() op. Currently, afs_dynroot_lookup() does
not do cell DNS lookup, leaving that to afs_d_automount() to perform -
however, it is possible to use access() or stat() on the automount point,
which will always return successfully, have briefly created an afs_cell
record if one did not already exist.
This means that something like:
test -d "/afs/.west" && echo Directory exists
will print "Directory exists" even though no such cell is configured. This
breaks the "west" python module available on PIP as it expects this access
to fail.
Now, it could be possible to make afs_dynroot_lookup() perform the DNS[*]
lookup, but that would make "ls --color /afs" do this for each cell in /afs
that is listed but not yet probed. kafs-client, probably wrongly, preloads
the entire cell database and all the known cells are then listed in /afs -
and doing ls /afs would be very, very slow, especially if any cell supplied
addresses but was wholly inaccessible.
[*] When I say "DNS", actually read getaddrinfo(), which could use any one
of a host of mechanisms. Could also use static configuration.
To fix this, make the following changes:
(1) Create an enum to specify the origination point of a call to
afs_lookup_cell() and pass this value into that function in place of
the "excl" parameter (which can be derived from it). There are six
points of origination:
- Cell preload through /proc/net/afs/cells
- Root cell config through /proc/net/afs/rootcell
- Lookup in dynamic root
- Automount trigger
- Direct mount with mount() syscall
- Alias check where YFS tells us the cell name is different
(2) Add an extra state into the afs_cell state machine to indicate a cell
that's been initialised, but not yet looked up. This is separate from
one that can be considered active and has been looked up at least
once.
(3) Make afs_lookup_cell() vary its behaviour more, depending on where it
was called from:
If called from preload or root cell config, DNS lookup will not happen
until we definitely want to use the cell (dynroot mount, automount,
direct mount or alias check). The cell will appear in /afs but stat()
won't trigger DNS lookup.
If the cell already exists, dynroot will not wait for the DNS lookup
to complete. If the cell did not already exist, dynroot will wait.
If called from automount, direct mount or alias check, it will wait
for the DNS lookup to complete.
(4) Make afs_lookup_cell() return an error if lookup failed in one way or
another. We try to return -ENOENT if the DNS says the cell does not
exist and -EDESTADDRREQ if we couldn't access the DNS.
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220685
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/1784747.1761158912@warthog.procyon.org.uk
Fixes: 1d0b929fc070 ("afs: Change dynroot to create contents on demand")
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pull smb server fixes from Steve French:
- Improve check for malformed payload
- Fix free transport smbdirect potential race
- Fix potential race in credit allocation during smbdirect negotiation
* tag 'v6.18-rc3-smb-server-fixes' of git://git.samba.org/ksmbd:
smb: server: let smb_direct_cm_handler() call ib_drain_qp() after smb_direct_disconnect_rdma_work()
smb: server: call smb_direct_post_recv_credits() when the negotiation is done
ksmbd: transport_ipc: validate payload size before reading handle
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
"Regression fixes:
- Revert the patch that removed the cap on MAX_OPS_PER_COMPOUND
- Address a kernel build issue
Stable fixes:
- Fix crash when a client queries new attributes on forechannel
- Fix rare NFSD crash when tracing is enabled"
* tag 'nfsd-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
Revert "NFSD: Remove the cap on number of operations per NFSv4 COMPOUND"
nfsd: Avoid strlen conflict in nfsd4_encode_components_esc()
NFSD: Fix crash in nfsd4_read_release()
NFSD: Define actions for the new time_deleg FATTR4 attributes
|
|
When smb2_query_info_compound() retries, a previously allocated cfid may
have been freed in the first attempt.
Because cfid wasn't reset on replay, later cleanup could act on a stale
pointer, leading to a potential use-after-free.
Reinitialize cfid to NULL under the replay label.
Example trace (trimmed):
refcount_t: underflow; use-after-free.
WARNING: CPU: 1 PID: 11224 at ../lib/refcount.c:28 refcount_warn_saturate+0x9c/0x110
[...]
RIP: 0010:refcount_warn_saturate+0x9c/0x110
[...]
Call Trace:
<TASK>
smb2_query_info_compound+0x29c/0x5c0 [cifs f90b72658819bd21c94769b6a652029a07a7172f]
? step_into+0x10d/0x690
? __legitimize_path+0x28/0x60
smb2_queryfs+0x6a/0xf0 [cifs f90b72658819bd21c94769b6a652029a07a7172f]
smb311_queryfs+0x12d/0x140 [cifs f90b72658819bd21c94769b6a652029a07a7172f]
? kmem_cache_alloc+0x18a/0x340
? getname_flags+0x46/0x1e0
cifs_statfs+0x9f/0x2b0 [cifs f90b72658819bd21c94769b6a652029a07a7172f]
statfs_by_dentry+0x67/0x90
vfs_statfs+0x16/0xd0
user_statfs+0x54/0xa0
__do_sys_statfs+0x20/0x50
do_syscall_64+0x58/0x80
Cc: stable@kernel.org
Fixes: 4f1fffa237692 ("cifs: commands that are retried should have replay flag set")
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Acked-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
smb_direct_disconnect_rdma_work()
All handlers triggered by ib_drain_qp() should already see the
broken connection.
smb_direct_cm_handler() is called under a mutex of the rdma_cm,
we should make sure ib_drain_qp() and all rdma layer logic completes
and unlocks the mutex.
It means free_transport() will also already see the connection
as SMBDIRECT_SOCKET_DISCONNECTED, so we need to call
crdma_[un]lock_handler(sc->rdma.cm_id) around
ib_drain_qp(), rdma_destroy_qp(), ib_free_cq() and ib_dealloc_pd().
Otherwise we free resources while the ib_drain_qp() within
smb_direct_cm_handler() is still running.
We have to unlock before rdma_destroy_id() as it locks again.
Fixes: 141fa9824c0f ("ksmbd: call ib_drain_qp when disconnected")
Fixes: 4c564f03e23b ("smb: server: make use of common smbdirect_socket")
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
We now activate sc->recv_io.posted.refill_work and sc->idle.immediate_work
only after a successful negotiation, before sending the negotiation
response.
It means the queue_work(sc->workqueue, &sc->recv_io.posted.refill_work)
in put_recvmsg() of the negotiate request, is a no-op now.
It also means our explicit smb_direct_post_recv_credits() will
have queue_work(sc->workqueue, &sc->idle.immediate_work) as no-op.
This should make sure we don't have races and post any immediate
data_transfer message that tries to grant credits to the peer,
before we send the negotiation response, as that will grant
the initial credits to the peer.
Fixes: 0626e6641f6b ("cifsd: add server handler for central processing and tranport layers")
Fixes: 1cde0a74a7a8 ("smb: server: don't use delayed_work for post_recv_credits_work")
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
handle_response() dereferences the payload as a 4-byte handle without
verifying that the declared payload size is at least 4 bytes. A malformed
or truncated message from ksmbd.mountd can lead to a 4-byte read past the
declared payload size. Validate the size before dereferencing.
This is a minimal fix to guard the initial handle read.
Fixes: 0626e6641f6b ("cifsd: add server handler for central processing and tranport layers")
Cc: stable@vger.kernel.org
Reported-by: Qianchang Zhao <pioooooooooip@gmail.com>
Signed-off-by: Qianchang Zhao <pioooooooooip@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Fix typo in description of enable_gcm_256 module parameter
Suggested-by: Thomas Spear <speeddymon@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Remove dead code leftovers after a recent mitigations cleanup which
fail a Clang build
- Make sure a Retbleed mitigation message is printed only when
necessary
- Correct the last Zen1 microcode revision for which Entrysign sha256
check is needed
- Fix a NULL ptr deref when mounting the resctrl fs on a system which
supports assignable counters but where L3 total and local bandwidth
monitoring has been disabled at boot
* tag 'x86_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bugs: Remove dead code which might prevent from building
x86/bugs: Qualify RETBLEED_INTEL_MSG
x86/microcode: Fix Entrysign revision check for Zen1/Naples
x86,fs/resctrl: Fix NULL pointer dereference with events force-disabled in mbm_event mode
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core fixes from Danilo Krummrich:
- In Device::parent(), do not make any assumptions on the device
context of the parent device
- Check visibility before changing ownership of a sysfs attribute
group
- In topology_parse_cpu_capacity(), replace an incorrect usage of
PTR_ERR_OR_ZERO() with IS_ERR_OR_NULL()
- In devcoredump, fix a circular locking dependency between
struct devcd_entry::mutex and kernfs
- Do not warn about a pending fw_devlink sync state
* tag 'driver-core-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
arch_topology: Fix incorrect error check in topology_parse_cpu_capacity()
rust: device: fix device context of Device::parent()
sysfs: check visibility before changing group attribute ownership
devcoredump: Fix circular locking dependency with devcd->mutex.
driver core: fw_devlink: Don't warn about sync_state() pending
|
|
Pull xfs fixes from Carlos Maiolino:
"The main highlight here is a fix for a bug brought in by the removal
of attr2 mount option, where some installations might actually have
'attr2' explicitly configured in fstab preventing system to boot by
not being able to remount the rootfs as RW.
Besides that there are a couple fix to the zonefs implementation,
changing XFS_ONLINE_SCRUB_STATS to depend on DEBUG_FS (was select
before), and some other minor changes"
* tag 'xfs-fixes-6.18-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix locking in xchk_nlinks_collect_dir
xfs: loudly complain about defunct mount options
xfs: always warn about deprecated mount options
xfs: don't set bt_nr_sectors to a negative number
xfs: don't use __GFP_NOFAIL in xfs_init_fs_context
xfs: cache open zone in inode->i_private
xfs: avoid busy loops in GCD
xfs: XFS_ONLINE_SCRUB_STATS should depend on DEBUG_FS
xfs: do not tightly pack-write large files
xfs: Improve CONFIG_XFS_RT Kconfig help
|
|
Pull smb server fixes from Steve French:
"smbdirect (RDMA) fixes in order avoid potential submission queue
overflows:
- free transport teardown fix
- credit related fixes (five server related, one client related)"
* tag 'v6.18-rc2-smb-server-fixes' of git://git.samba.org/ksmbd:
smb: server: let free_transport() wait for SMBDIRECT_SOCKET_DISCONNECTED
smb: client: make use of smbdirect_socket.send_io.lcredits.*
smb: server: make use of smbdirect_socket.send_io.lcredits.*
smb: server: simplify sibling_list handling in smb_direct_flush_send_list/send_done
smb: server: smb_direct_disconnect_rdma_connection() already wakes all waiters on error
smb: smbdirect: introduce smbdirect_socket.send_io.lcredits.*
smb: server: allocate enough space for RW WRs and ib_drain_qp()
|
|
Pull smb client fixes from Steve French:
- add missing tracepoints
- smbdirect (RDMA) fix
- fix potential issue with credits underflow
- rename fix
- improvement to calc_signature and additional cleanup patch
* tag '6.18-rc2-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: #include cifsglob.h before trace.h to allow structs in tracepoints
cifs: Call the calc_signature functions directly
smb: client: get rid of d_drop() in cifs_do_rename()
cifs: Fix TCP_Server_Info::credits to be signed
cifs: Add a couple of missing smb3_rw_credits tracepoints
smb: client: allocate enough space for MR WRs and ib_drain_qp()
|
|
We should wait for the rdma_cm to become SMBDIRECT_SOCKET_DISCONNECTED!
At least on the client side (with similar code)
wait_event_interruptible() often returns with -ERESTARTSYS instead of
waiting for SMBDIRECT_SOCKET_DISCONNECTED.
We should use wait_event() here too, which makes the code be identical
in client and server, which will help when moving to common functions.
Fixes: b31606097de8 ("smb: server: move smb_direct_disconnect_rdma_work() into free_transport()")
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- in send, fix duplicated rmdir operations when using extrefs
(hardlinks), receive can fail with ENOENT
- fixup of error check when reading extent root in ref-verify and
damaged roots are allowed by mount option (found by smatch)
- fix freeing partially initialized fs info (found by syzkaller)
- fix use-after-free when printing ref_tracking status of delayed
inodes
* tag 'for-6.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: ref-verify: fix IS_ERR() vs NULL check in btrfs_build_ref_tree()
btrfs: fix delayed_node ref_tracker use after free
btrfs: send: fix duplicated rmdir operations when using extrefs
btrfs: directly free partially initialized fs_info in btrfs_check_leaked_roots()
|
|
Make cifs #include cifsglob.h in advance of #including trace.h so that the
structures defined in cifsglob.h can be accessed directly by the cifs
tracepoints rather than the callers having to manually pass in the bits and
pieces.
This should allow the tracepoints to be made more efficient to use as well
as easier to read in the code.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
As the SMB1 and SMB2/3 calc_signature functions are called from separate
sign and verify paths, just call them directly rather than using a function
pointer. The SMB3 calc_signature then jumps to the SMB2 variant if
necessary.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Enzo Matsumiya <ematsumiya@suse.de>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
There is no need to force a lookup by unhashing the moved dentry after
successfully renaming the file on server. The file metadata will be
re-fetched from server, if necessary, in the next call to
->d_revalidate() anyways.
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: stable@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Fix TCP_Server_Info::credits to be signed, just as echo_credits and
oplock_credits are. This also fixes what ought to get at least a
compilation warning if not an outright error in *get_credits_field() as a
pointer to the unsigned server->credits field is passed back as a pointer
to a signed int.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cifs@vger.kernel.org
Cc: stable@vger.kernel.org
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Acked-by: Pavel Shilovskiy <pshilovskiy@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
This makes the logic to prevent on overflow of
the send submission queue with ib_post_send() easier.
As we first get a local credit and then a remote credit
before we mark us as pending.
For now we'll keep the logic around smbdirect_socket.send_io.pending.*,
but that will likely change or be removed completely.
The server will get a similar logic soon, so
we'll be able to share the send code in future.
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
This introduces logic to prevent on overflow of
the send submission queue with ib_post_send() easier.
As we first get a local credit and then a remote credit
before we mark us as pending.
From reading the git history of the linux smbdirect
implementations in client and server) it was seen
that a peer granted more credits than we requested.
I guess that only happened because of bugs in our
implementation which was active as client and server.
I guess Windows won't do that.
So the local credits make sure we only use the amount
of credits we asked for.
Fixes: 0626e6641f6b ("cifsd: add server handler for central processing and tranport layers")
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
smb_direct_flush_send_list/send_done
We have a list handling that is much easier to understand:
1. Before smb_direct_flush_send_list() is called all
struct smbdirect_send_io messages are part of
send_ctx->msg_list
2. Before smb_direct_flush_send_list() calls
smb_direct_post_send() we remove the last
element in send_ctx->msg_list and move all
others into last->sibling_list. As only
last has IB_SEND_SIGNALED and gets a completion
vis send_done().
3. send_done() has an easy way to free all others
in sendmsg->sibling_list (if there are any).
And use list_for_each_entry_safe() instead of
a complex custom logic.
This will help us to share send_done() in common
code soon, as it will work fine for the client too,
where last->sibling_list is currently always an empty list.
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
waiters on error
There's no need to care about pending or credit counters when we
already disconnecting.
And all related wait_event conditions already check for broken
connections too.
This will simplify the code and makes the following changes simpler.
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
This will be used to implement a logic in order to make sure
we don't overflow the send submission queue for ib_post_send().
We will initialize the local credits with the
fixed sp->send_credit_target value, which matches
the reserved slots in the submission queue for ib_post_send().
We will be a local credit first and then wait for a remote credit,
if we managed to get both we are allowed to post an
IB_WR_SEND[_WITH_INV]. The local credit is given back to
the pool when we get the local ib_post_send() completion,
while remote credits are granted by the peer.
From reading the git history of the linux smbdirect
implementations in client and server) it was seen
that a peer granted more credits than we requested.
I guess that only happened because of bugs in our
implementation which was active as client and server.
I guess Windows won't do that.
So the local credits make sure we only use the amount
of credits we asked for.
The client already has some logic for this based on
smbdirect_socket.send_io.pending.count, but that
counts in the order direction and makes it complex it
share common logic for various credits classes.
That logic will be replaced soon.
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Make use of rdma_rw_mr_factor() to calculate the number of rw
credits and the number of pages per RDMA RW operation.
We get the same numbers for iWarp connections, tested
with siw.ko and irdma.ko (in iWarp mode).
siw:
CIFS: max_qp_rd_atom=128, max_fast_reg_page_list_len = 256
CIFS: max_sgl_rd=0, max_sge_rd=1
CIFS: responder_resources=32 max_frmr_depth=256 mr_io.type=0
CIFS: max_send_wr 384, device reporting max_cqe 3276800 max_qp_wr 32768
ksmbd: max_fast_reg_page_list_len = 256, max_sgl_rd=0, max_sge_rd=1
ksmbd: device reporting max_cqe 3276800 max_qp_wr 32768
ksmbd: Old sc->rw_io.credits: max = 9, num_pages = 256
ksmbd: New sc->rw_io.credits: max = 9, num_pages = 256, maxpages=2048
ksmbd: Info: rdma_send_wr 27 + max_send_wr 256 = 283
irdma (in iWarp mode):
CIFS: max_qp_rd_atom=127, max_fast_reg_page_list_len = 262144
CIFS: max_sgl_rd=0, max_sge_rd=13
CIFS: responder_resources=32 max_frmr_depth=2048 mr_io.type=0
CIFS: max_send_wr 384, device reporting max_cqe 1048574 max_qp_wr 4063
ksmbd: max_fast_reg_page_list_len = 262144, max_sgl_rd=0, max_sge_rd=13
ksmbd: device reporting max_cqe 1048574 max_qp_wr 4063
ksmbd: Old sc->rw_io.credits: max = 9, num_pages = 256
ksmbd: New sc->rw_io.credits: max = 9, num_pages = 256, maxpages=2048
ksmbd: rdma_send_wr 27 + max_send_wr 256 = 283
This means that we get the different correct numbers for ROCE,
tested with rdma_rxe.ko and irdma.ko (in RoCEv2 mode).
rxe:
CIFS: max_qp_rd_atom=128, max_fast_reg_page_list_len = 512
CIFS: max_sgl_rd=0, max_sge_rd=32
CIFS: responder_resources=32 max_frmr_depth=512 mr_io.type=0
CIFS: max_send_wr 384, device reporting max_cqe 32767 max_qp_wr 1048576
ksmbd: max_fast_reg_page_list_len = 512, max_sgl_rd=0, max_sge_rd=32
ksmbd: device reporting max_cqe 32767 max_qp_wr 1048576
ksmbd: Old sc->rw_io.credits: max = 9, num_pages = 256
ksmbd: New sc->rw_io.credits: max = 65, num_pages = 32, maxpages=2048
ksmbd: rdma_send_wr 65 + max_send_wr 256 = 321
irdma (in RoCEv2 mode):
CIFS: max_qp_rd_atom=127, max_fast_reg_page_list_len = 262144,
CIFS: max_sgl_rd=0, max_sge_rd=13
CIFS: responder_resources=32 max_frmr_depth=2048 mr_io.type=0
CIFS: max_send_wr 384, device reporting max_cqe 1048574 max_qp_wr 4063
ksmbd: max_fast_reg_page_list_len = 262144, max_sgl_rd=0, max_sge_rd=13
ksmbd: device reporting max_cqe 1048574 max_qp_wr 4063
ksmbd: Old sc->rw_io.credits: max = 9, num_pages = 256,
ksmbd: New sc->rw_io.credits: max = 159, num_pages = 13, maxpages=2048
ksmbd: rdma_send_wr 159 + max_send_wr 256 = 415
And rely on rdma_rw_init_qp() to setup ib_mr_pool_init() for
RW MRs. ib_mr_pool_destroy() will be called by rdma_rw_cleanup_mrs().
It seems the code was implemented before the rdma_rw_* layer
was fully established in the kernel.
While there also add additional space for ib_drain_qp().
This should make sure ib_post_send() will never fail
because the submission queue is full.
Fixes: ddbdc861e37c ("ksmbd: smbd: introduce read/write credits for RDMA read/write")
Fixes: 4c564f03e23b ("smb: server: make use of common smbdirect_socket")
Fixes: 177368b99243 ("smb: server: make use of common smbdirect_socket_parameters")
Fixes: 95475d8886bd ("smb: server: make use smbdirect_socket.rw_io.credits")
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"17 hotfixes. 12 are cc:stable and 14 are for MM.
There's a two-patch DAMON series from SeongJae Park which addresses a
missed check and possible memory leak. Apart from that it's all
singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2025-10-22-12-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
csky: abiv2: adapt to new folio flags field
mm/damon/core: use damos_commit_quota_goal() for new goal commit
mm/damon/core: fix potential memory leak by cleaning ops_filter in damon_destroy_scheme
hugetlbfs: move lock assertions after early returns in huge_pmd_unshare()
vmw_balloon: indicate success when effectively deflating during migration
mm/damon/core: fix list_add_tail() call on damon_call()
mm/mremap: correctly account old mapping after MREMAP_DONTUNMAP remap
mm: prevent poison consumption when splitting THP
ocfs2: clear extent cache after moving/defragmenting extents
mm: don't spin in add_stack_record when gfp flags don't allow
dma-debug: don't report false positives with DMA_BOUNCE_UNALIGNED_KMALLOC
mm/damon/sysfs: dealloc commit test ctx always
mm/damon/sysfs: catch commit test ctx alloc failure
hung_task: fix warnings caused by unaligned lock pointers
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
"Just three small fixes to address fuzzed images in relatively new
features, as reported by Robert.
- Hardening against fuzzed encoded extents
- Fix infinite loops due to crafted subpage compact indexes
- Improve z_erofs_extent_lookback()"
* tag 'erofs-for-6.18-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: consolidate z_erofs_extent_lookback()
erofs: avoid infinite loops due to corrupted subpage compact indexes
erofs: fix crafted invalid cases for encoded extents
|
|
Pull 9pfs fix from Dominique Martinet:
"Fix 9p cache=mmap regression by revert
This reverts the problematic commit instead of trying to fix it in a
rush"
* tag '9p-for-6.18-rc3-v2' of https://github.com/martinetd/linux:
Revert "fs/9p: Refresh metadata in d_revalidate for uncached mode too"
|
|
On a filesystem with parent pointers, xchk_nlinks_collect_dir walks both
the directory entries (data fork) and the parent pointers (attr fork) to
determine the correct link count. Unfortunately I forgot to update the
lock mode logic to handle the case of a directory whose attr fork is in
btree format and has not yet been loaded *and* whose data fork doesn't
need loading.
This leads to a bunch of assertions from xfs/286 in xfs_iread_extents
because we only took ILOCK_SHARED, not ILOCK_EXCL. You'd need the rare
happenstance of a directory with a large number of non-pptr extended
attributes set and enough memory pressure to cause the directory to be
evicted and partially reloaded from disk.
I /think/ this only started in 6.18-rc1 because I've started seeing OOM
errors with the maple tree slab using 70% of memory, and this didn't
happen in 6.17. Yay dynamic systems!
Cc: stable@vger.kernel.org # v6.10
Fixes: 77ede5f44b0d86 ("xfs: walk directory parent pointers to determine backref count")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Apparently we can never deprecate mount options in this project, because
it will invariably turn out that some foolish userspace depends on some
behavior and break. From Oleksandr Natalenko:
In v6.18, the attr2 XFS mount option is removed. This may silently
break system boot if the attr2 option is still present in /etc/fstab
for rootfs.
Consider Arch Linux that is being set up from scratch with / being
formatted as XFS. The genfstab command that is used to generate
/etc/fstab produces something like this by default:
/dev/sda2 on / type xfs (rw,relatime,attr2,discard,inode64,logbufs=8,logbsize=32k,noquota)
Once the system is set up and rebooted, there's no deprecation warning
seen in the kernel log:
# cat /proc/cmdline
root=UUID=77b42de2-397e-47ee-a1ef-4dfd430e47e9 rootflags=discard rd.luks.options=discard quiet
# dmesg | grep -i xfs
[ 2.409818] SGI XFS with ACLs, security attributes, realtime, scrub, repair, quota, no debug enabled
[ 2.415341] XFS (sda2): Mounting V5 Filesystem 77b42de2-397e-47ee-a1ef-4dfd430e47e9
[ 2.442546] XFS (sda2): Ending clean mount
Although as per the deprecation intention, it should be there.
Vlastimil (in Cc) suggests this is because xfs_fs_warn_deprecated()
doesn't produce any warning by design if the XFS FS is set to be
rootfs and gets remounted read-write during boot. This imposes two
problems:
1) a user doesn't see the deprecation warning; and
2) with v6.18 kernel, the read-write remount fails because of unknown
attr2 option rendering system unusable:
systemd[1]: Switching root.
systemd-remount-fs[225]: /usr/bin/mount for / exited with exit status 32.
# mount -o rw /
mount: /: fsconfig() failed: xfs: Unknown parameter 'attr2'.
Thorsten (in Cc) suggested reporting this as a user-visible regression.
From my PoV, although the deprecation is in place for 5 years already,
it may not be visible enough as the warning is not emitted for rootfs.
Considering the amount of systems set up with XFS on /, this may
impose a mass problem for users.
Vlastimil suggested making attr2 option a complete noop instead of
removing it.
IOWs, the initrd mounts the root fs with (I assume) no mount options,
and mount -a remounts with whatever options are in fstab. However,
XFS doesn't complain about deprecated mount options during a remount, so
technically speaking we were not warning all users in all combinations
that they were heading for a cliff.
Gotcha!!
Now, how did 'attr2' get slurped up on so many systems? The old code
would put that in /proc/mounts if the filesystem happened to be in attr2
mode, even if user hadn't mounted with any such option. IOWs, this is
because someone thought it would be a good idea to advertise system
state via /proc/mounts.
The easy way to fix this is to reintroduce the four mount options but
map them to a no-op option that ignores them, and hope that nobody's
depending on attr2 to appear in /proc/mounts. (Hint: use the fsgeometry
ioctl). But we've learned our lesson, so complain as LOUDLY as possible
about the deprecation.
Lessons learned:
1. Don't expose system state via /proc/mounts; the only strings that
ought to be there are options *explicitly* provided by the user.
2. Never tidy, it's not worth the stress and irritation.
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: stable@vger.kernel.org # v6.18-rc1
Fixes: b9a176e54162f8 ("xfs: remove deprecated mount options")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
The deprecation of the 'attr2' mount option in 6.18 wasn't entirely
successful because nobody noticed that the kernel never printed a
warning about attr2 being set in fstab if the only xfs filesystem is the
root fs; the initramfs mounts the root fs with no mount options; and the
init scripts only conveyed the fstab options by remounting the root fs.
Fix this by making it complain all the time.
Cc: stable@vger.kernel.org # v5.13
Fixes: 92cf7d36384b99 ("xfs: Skip repetitive warnings about mount options")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
xfs_daddr_t is a signed type, which means that xfs_buf_map_verify is
using a signed comparison. This causes problems if bt_nr_sectors is
never overridden (e.g. in the case of an xfbtree for rmap btree repairs)
because even daddr 0 can't pass the verifier test in that case.
Define an explicit max constant and set the initial bt_nr_sectors to a
positive value.
Found by xfs/422.
Cc: stable@vger.kernel.org # v6.18-rc1
Fixes: 42852fe57c6d2a ("xfs: track the number of blocks in each buftarg")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
btrfs_extent_root()/btrfs_global_root() does not return error pointers,
it returns NULL on error.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/all/aNJfvxj0anEnk9Dm@stanley.mountain/
Fixes : ed4e6b5d644c ("btrfs: ref-verify: handle damaged extent root tree")
CC: stable@vger.kernel.org # 6.17+
Signed-off-by: Amit Dhingra <mechanicalamit@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Move the print before releasing the delayed node.
In my initial testing there was a bug that was causing delayed_nodes
to not get freed which is why I put the print after the release. This
obviously neglects the case where the delayed node is properly freed.
Add condition to make sure we only print if we have more than one
reference to the delayed_node to prevent printing when we only have
the reference taken in btrfs_kill_all_delayed_nodes().
Fixes: b767a28d6154 ("btrfs: print leaked references in kill_all_delayed_nodes()")
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This reverts commit 290434474c332a2ba9c8499fe699c7f2e1153280.
That commit broke cache=mmap, a mode that doesn't cache metadata,
but still has writeback cache.
In commit 290434474c33 ("fs/9p: Refresh metadata in d_revalidate
for uncached mode too") we considered metadata cache to be enough to
not look at the server, but in writeback cache too looking at the server
size would make the vfs consider the file has been truncated before the
data has been flushed out, making the following repro fail (nothing is
ever read back, the resulting file ends up with no data written)
```
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
char buf[4096];
int main(int argc, char *argv[])
{
int ret, i;
int fdw, fdr;
if (argc < 2)
return 1;
fdw = openat(AT_FDCWD, argv[1], O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0600);
if (fdw < 0) {
fprintf(stderr, "cannot open fdw\n");
return 1;
}
write(fdw, buf, sizeof(buf));
fdr = openat(AT_FDCWD, argv[1], O_RDONLY|O_CLOEXEC);
if (fdr < 0) {
fprintf(stderr, "cannot open fdr\n");
close(fdw);
return 1;
}
for (i = 0; i < 10; i++) {
ret = read(fdr, buf, sizeof(buf));
fprintf(stderr, "i: %d, read returns %d\n", i, ret);
}
close(fdr);
close(fdw);
return 0;
}
```
There is a fix for this particular reproducer but it looks like there
are other problems around metadata refresh (e.g. around file rename), so
revert this to avoid d_revalidate in uncached mode for now.
Reported-by: Song Liu <song@kernel.org>
Link: https://lkml.kernel.org/r/CAHzjS_u_SYdt5=2gYO_dxzMKXzGMt-TfdE_ueowg-Hq5tRCAiw@mail.gmail.com
Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Link: https://lore.kernel.org/bpf/CAEf4BzZbCE4tLoDZyUf_aASpgAGFj75QMfSXX4a4dLYixnOiLg@mail.gmail.com/
Fixes: 290434474c33 ("fs/9p: Refresh metadata in d_revalidate for uncached mode too")
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
The initial m.delta[0] also needs to be checked against zero.
In addition, also drop the redundant logic that errors out for
lcn == 0 / m.delta[0] == 1 case.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
Robert reported an infinite loop observed by two crafted images.
The root cause is that `clusterofs` can be larger than `lclustersize`
for !NONHEAD `lclusters` in corrupted subpage compact indexes, e.g.:
blocksize = lclustersize = 512 lcn = 6 clusterofs = 515
Move the corresponding check for full compress indexes to
`z_erofs_load_lcluster_from_disk()` to also cover subpage compact
compress indexes.
It also fixes the position of `m->type >= Z_EROFS_LCLUSTER_TYPE_MAX`
check, since it should be placed right after
`z_erofs_load_{compact,full}_lcluster()`.
Fixes: 8d2517aaeea3 ("erofs: fix up compacted indexes for block size < 4096")
Fixes: 1a5223c182fd ("erofs: do sanity check on m->type in z_erofs_load_compact_lcluster()")
Reported-by: Robert Morris <rtm@csail.mit.edu>
Closes: https://lore.kernel.org/r/35167.1760645886@localhost
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
When hugetlb_vmdelete_list() processes VMAs during truncate operations, it
may encounter VMAs where huge_pmd_unshare() is called without the required
shareable lock. This triggers an assertion failure in
hugetlb_vma_assert_locked().
The previous fix in commit dd83609b8898 ("hugetlbfs: skip VMAs without
shareable locks in hugetlb_vmdelete_list") skipped entire VMAs without
shareable locks to avoid the assertion. However, this prevented pages
from being unmapped and freed, causing a regression in
fallocate(PUNCH_HOLE) operations where pages were not freed immediately,
as reported by Mark Brown.
Instead of checking locks in the caller or skipping VMAs, move the lock
assertions in huge_pmd_unshare() to after the early return checks. The
assertions are only needed when actual PMD unsharing work will be
performed. If the function returns early because sz != PMD_SIZE or the
PMD is not shared, no locks are required and assertions should not fire.
This approach reverts the VMA skipping logic from commit dd83609b8898
("hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list")
while moving the assertions to avoid the assertion failure, keeping all
the logic within huge_pmd_unshare() itself and allowing page unmapping and
freeing to proceed for all VMAs.
Link: https://lkml.kernel.org/r/20251014113344.21194-1-kartikey406@gmail.com
Fixes: dd83609b8898 ("hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list")
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reported-by: <syzbot+f26d7c75c26ec19790e7@syzkaller.appspotmail.com>
Reported-by: Mark Brown <broonie@kernel.org>
Closes: https://syzkaller.appspot.com/bug?extid=f26d7c75c26ec19790e7
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Oscar Salvador <osalvador@suse.de>
Tested-by: <syzbot+f26d7c75c26ec19790e7@syzkaller.appspotmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
I've found that pynfs COMP6 now leaves the connection or lease in a
strange state, which causes CLOSE9 to hang indefinitely. I've dug
into it a little, but I haven't been able to root-cause it yet.
However, I bisected to commit 48aab1606fa8 ("NFSD: Remove the cap on
number of operations per NFSv4 COMPOUND").
Tianshuo Han also reports a potential vulnerability when decoding
an NFSv4 COMPOUND. An attacker can place an arbitrarily large op
count in the COMPOUND header, which results in:
[ 51.410584] nfsd: vmalloc error: size 1209533382144, exceeds total
pages, mode:0xdc0(GFP_KERNEL|__GFP_ZERO),
nodemask=(null),cpuset=/,mems_allowed=0
when NFSD attempts to allocate the COMPOUND op array.
Let's restore the operation-per-COMPOUND limit, but increased to 200
for now.
Reported-by: tianshuo han <hantianshuo233@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org
Tested-by: Tianshuo Han <hantianshuo233@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
There is an error building nfs4xdr.c with CONFIG_SUNRPC_DEBUG_TRACE=y
and CONFIG_FORTIFY_SOURCE=n due to the local variable strlen conflicting
with the function strlen():
In file included from include/linux/cpumask.h:11,
from arch/x86/include/asm/paravirt.h:21,
from arch/x86/include/asm/irqflags.h:102,
from include/linux/irqflags.h:18,
from include/linux/spinlock.h:59,
from include/linux/mmzone.h:8,
from include/linux/gfp.h:7,
from include/linux/slab.h:16,
from fs/nfsd/nfs4xdr.c:37:
fs/nfsd/nfs4xdr.c: In function 'nfsd4_encode_components_esc':
include/linux/kernel.h:321:46: error: called object 'strlen' is not a function or function pointer
321 | __trace_puts(_THIS_IP_, str, strlen(str)); \
| ^~~~~~
include/linux/kernel.h:265:17: note: in expansion of macro 'trace_puts'
265 | trace_puts(fmt); \
| ^~~~~~~~~~
include/linux/sunrpc/debug.h:34:41: note: in expansion of macro 'trace_printk'
34 | # define __sunrpc_printk(fmt, ...) trace_printk(fmt, ##__VA_ARGS__)
| ^~~~~~~~~~~~
include/linux/sunrpc/debug.h:42:17: note: in expansion of macro '__sunrpc_printk'
42 | __sunrpc_printk(fmt, ##__VA_ARGS__); \
| ^~~~~~~~~~~~~~~
include/linux/sunrpc/debug.h:25:9: note: in expansion of macro 'dfprintk'
25 | dfprintk(FACILITY, fmt, ##__VA_ARGS__)
| ^~~~~~~~
fs/nfsd/nfs4xdr.c:2646:9: note: in expansion of macro 'dprintk'
2646 | dprintk("nfsd4_encode_components(%s)\n", components);
| ^~~~~~~
fs/nfsd/nfs4xdr.c:2643:13: note: declared here
2643 | int strlen, count=0;
| ^~~~~~
This dprintk() instance is not particularly useful, so just remove it
altogether to get rid of the immediate strlen() conflict.
At the same time, eliminate the local strlen variable to avoid potential
conflicts with strlen() in the future.
Fixes: ec7d8e68ef0e ("sunrpc: add a Kconfig option to redirect dfprintk() output to trace buffer")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
When tracing is enabled, the trace_nfsd_read_done trace point
crashes during the pynfs read.testNoFh test.
Fixes: 15a8b55dbb1b ("nfsd: call op_release, even when op_func returns an error")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
NFSv4 clients won't send legitimate GETATTR requests for these new
attributes because they are intended to be used only with CB_GETATTR
and SETATTR. But NFSD has to do something besides crashing if it
ever sees a GETATTR request that queries these attributes.
RFC 8881 Section 18.7.3 states:
> The server MUST return a value for each attribute that the client
> requests if the attribute is supported by the server for the
> target file system. If the server does not support a particular
> attribute on the target file system, then it MUST NOT return the
> attribute value and MUST NOT set the attribute bit in the result
> bitmap. The server MUST return an error if it supports an
> attribute on the target but cannot obtain its value. In that case,
> no attribute values will be returned.
Further, RFC 9754 Section 5 states:
> These new attributes are invalid to be used with GETATTR, VERIFY,
> and NVERIFY, and they can only be used with CB_GETATTR and SETATTR
> by a client holding an appropriate delegation.
Thus there does not appear to be a specific server response mandated
by specification. Taking the guidance that querying these attributes
via GETATTR is "invalid", NFSD will return nfserr_inval, failing the
request entirely.
Reported-by: Robert Morris <rtm@csail.mit.edu>
Closes: https://lore.kernel.org/linux-nfs/7819419cf0cb50d8130dc6b747765d2b8febc88a.camel@kernel.org/T/#t
Fixes: 51c0d4f7e317 ("nfsd: add support for FATTR4_OPEN_ARGUMENTS")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
In the old mount proceedure, hostfs could only pass root directory during
boot. This is because it constructed the root directory using the @root_ino
event without any mount options. However, when using it with the new mount
API, this step is no longer triggered. As a result, if users mounts without
specifying any mount options, the @host_root_path remains uninitialized. To
prevent this issue, the @host_root_path should be initialized at the time
of allocation.
Reported-by: Geoffrey Thorpe <geoff@geoffthorpe.net>
Closes: https://lore.kernel.org/all/643333a0-f434-42fb-82ac-d25a0b56f3b7@geoffthorpe.net/
Fixes: cd140ce9f611 ("hostfs: convert hostfs to use the new mount API")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://patch.msgid.link/20251011092235.29880-1-lihongbo22@huawei.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In statmount_string(), most flags assign an output offset pointer (offp)
which is later updated with the string offset. However, the
STATMOUNT_MNT_UIDMAP and STATMOUNT_MNT_GIDMAP cases directly set the
struct fields instead of using offp. This leaves offp uninitialized,
leading to a possible uninitialized dereference when *offp is updated.
Fix it by assigning offp for UIDMAP and GIDMAP as well, keeping the code
path consistent.
Fixes: 37c4a9590e1e ("statmount: allow to retrieve idmappings")
Fixes: e52e97f09fb6 ("statmount: let unset strings be empty")
Cc: stable@vger.kernel.org
Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
Link: https://patch.msgid.link/20251013114151.664341-1-zhen.ni@easystack.cn
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
With enough debug options enabled, struct xfs_mount is larger
than 4k and thus NOFAIL allocations won't work for it.
xfs_init_fs_context is early in the mount process, and if we really
are out of memory there we'd better give up ASAP anyway.
Fixes: 7b77b46a6137 ("xfs: use kmem functions for struct xfs_mount")
Reported-by: syzbot+359a67b608de1ef72f65@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
The MRU cache for open zones is unfortunately still not ideal, as it can
time out pretty easily when doing heavy I/O to hard disks using up most
or all open zones. One option would be to just increase the timeout,
but while looking into that I realized we're just better off caching it
indefinitely as there is no real downside to that once we don't hold a
reference to the cache open zone.
So switch the open zone to RCU freeing, and then stash the last used
open zone into inode->i_private. This helps to significantly reduce
fragmentation by keeping I/O localized to zones for workloads that
write using many open files to HDD.
Fixes: 4e4d52075577 ("xfs: add the zoned space allocator")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
When GCD has no new work to handle, but read, write or reset commands
are outstanding, it currently busy loops, which is a bit suboptimal,
and can lead to softlockup warnings in case of stuck commands.
Change the code so that the task state is only set to running when work
is performed, which looks a bit tricky due to the design of the
reading/writing/resetting lists that contain both in-flight and finished
commands.
Fixes: 080d01c41d44 ("xfs: implement zoned garbage collection")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Currently, XFS_ONLINE_SCRUB_STATS selects DEBUG_FS. However, DEBUG_FS
is meant for debugging, and people may want to disable it on production
systems. Since commit 0ff51a1fd786f47b ("xfs: enable online fsck by
default in Kconfig")), XFS_ONLINE_SCRUB_STATS is enabled by default,
forcing DEBUG_FS enabled too.
Fix this by replacing the selection of DEBUG_FS by a dependency on
DEBUG_FS, which is what most other options controlling the gathering and
exposing of statistics do.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
When using a zoned realtime device, tightly packing of data blocks
belonging to multiple closed files into the same realtime group (RTG)
is very efficient at improving write performance. This is especially
true with SMR HDDs as this can reduce, and even suppress, disk head
seeks.
However, such tight packing does not make sense for large files that
require at least a full RTG. If tight packing placement is applied for
such files, the VM writeback thread switching between inodes result in
the large files to be fragmented, thus increasing the garbage collection
penalty later when the RTG needs to be reclaimed.
This problem can be avoided with a simple heuristic: if the size of the
inode being written back is at least equal to the RTG size, do not use
tight-packing. Modify xfs_zoned_pack_tight() to always return false in
this case.
With this change, a multi-writer workload writing files of 256 MB on a
file system backed by an SMR HDD with 256 MB zone size as a realtime
device sees all files occupying exactly one RTG (i.e. one device zone),
thus completely removing the heavy fragmentation observed without this
change.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Improve the description of the XFS_RT configuration option to document
that this option is required for zoned block devices.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Add missing smb3_rw_credits tracepoints to cifs_readv_callback() (for SMB1)
to match those of SMB2/3.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify fixes from Jan Kara:
- Stop-gap solution for a race between unmount of a filesystem with
fsnotify marks and someone inspecting fdinfo of fsnotify group with
those marks in procfs.
A proper solution is in the works but it will get a while to settle.
- Fix for non-decodable file handles (used by unprivileged apps using
fanotify)
* tag 'fsnotify_for_v6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fs/notify: call exportfs_encode_fid with s_umount
expfs: Fix exportfs_can_encode_fh() for EXPORT_FH_FID
|
|
mbm_event mode
The following NULL pointer dereference is encountered on mount of resctrl fs
after booting a system that supports assignable counters with the
"rdt=!mbmtotal,!mbmlocal" kernel parameters:
BUG: kernel NULL pointer dereference, address: 0000000000000008
RIP: 0010:mbm_cntr_get
Call Trace:
rdtgroup_assign_cntr_event
rdtgroup_assign_cntrs
rdt_get_tree
Specifying the kernel parameter "rdt=!mbmtotal,!mbmlocal" effectively disables
the legacy X86_FEATURE_CQM_MBM_TOTAL and X86_FEATURE_CQM_MBM_LOCAL features
and the MBM events they represent. This results in the per-domain MBM event
related data structures to not be allocated during early initialization.
resctrl fs initialization follows by implicitly enabling both MBM total and
local events on a system that supports assignable counters (mbm_event mode),
but this enabling occurs after the per-domain data structures have been
created.
After booting, resctrl fs assumes that an enabled event can access all its
state. This results in NULL pointer dereference when resctrl attempts to
access the un-allocated structures of an enabled event.
Remove the late MBM event enabling from resctrl fs.
This leaves a problem where the X86_FEATURE_CQM_MBM_TOTAL and
X86_FEATURE_CQM_MBM_LOCAL features may be disabled while assignable counter
(mbm_event) mode is enabled without any events to support. Switching between
the "default" and "mbm_event" mode without any events is not practical.
Create a dependency between the X86_FEATURE_{CQM_MBM_TOTAL,CQM_MBM_LOCAL} and
X86_FEATURE_ABMC (assignable counter) hardware features. An x86 system that
supports assignable counters now requires support of X86_FEATURE_CQM_MBM_TOTAL
or X86_FEATURE_CQM_MBM_LOCAL.
This ensures all needed MBM related data structures are created before use and
that it is only possible to switch between "default" and "mbm_event" mode when
the same events are available in both modes. This dependency does not exist in
the hardware but this usage of these feature settings work for known systems.
[ bp: Massage commit message. ]
Fixes: 13390861b426e ("x86,fs/resctrl: Detect Assignable Bandwidth Monitoring feature details")
Co-developed-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/a62e6ac063d0693475615edd213d5be5e55443e6.1760560934.git.babu.moger@amd.com
|
|
The IB_WR_REG_MR and IB_WR_LOCAL_INV operations for smbdirect_mr_io
structures should never fail because the submission or completion queues
are too small. So we allocate more send_wr depending on the (local) max
number of MRs.
While there also add additional space for ib_drain_qp().
This should make sure ib_post_send() will never fail
because the submission queue is full.
Fixes: f198186aa9bb ("CIFS: SMBD: Establish SMB Direct connection")
Fixes: cc55f65dd352 ("smb: client: make use of common smbdirect_socket_parameters")
Cc: stable@vger.kernel.org
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
Pull exfat fixes from Namjae Jeon:
- Fix out-of-bounds in FS_IOC_SETFSLABEL
- Add validation for stream entry size to prevent infinite loop
* tag 'exfat-for-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: fix out-of-bounds in exfat_nls_to_ucs2()
exfat: fix improper check of dentry.stream.valid_size
|
|
Pull NFS client fixes from Anna Schumaker:
- Fix for FlexFiles mirror->dss allocation
- Apply delay_retrans to async operations
- Check if suid/sgid is cleared after a write when needed
- Fix setting the state renewal timer for early mounts after a reboot
* tag 'nfs-for-6.18-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS4: Fix state renewals missing after boot
NFS: check if suid/sgid was cleared after a write as needed
NFS4: Apply delay_retrans to async operations
NFSv4/flexfiles: fix to allocate mirror->dss before use
|
|
Pull smb client fixes from Steve French:
"smb client fixes, security and smbdirect improvements, and some minor cleanup:
- Important OOB DFS fix
- Fix various potential tcon refcount leaks
- smbdirect (RDMA) fixes (following up from test event a few weeks
ago):
- Fixes to improve and simplify handling of memory lifetime of
smbdirect_mr_io structures, when a connection gets disconnected
- Make sure we really wait to reach SMBDIRECT_SOCKET_DISCONNECTED
before destroying resources
- Make sure the send/recv submission/completion queues are large
enough to avoid ib_post_send() from failing under pressure
- convert cifs.ko to use the recommended crypto libraries (instead of
crypto_shash), this also can improve performance
- Three small cleanup patches"
* tag '6.18-rc1-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: (24 commits)
smb: client: Consolidate cmac(aes) shash allocation
smb: client: Remove obsolete crypto_shash allocations
smb: client: Use HMAC-MD5 library for NTLMv2
smb: client: Use MD5 library for SMB1 signature calculation
smb: client: Use MD5 library for M-F symlink hashing
smb: client: Use HMAC-SHA256 library for SMB2 signature calculation
smb: client: Use HMAC-SHA256 library for key generation
smb: client: Use SHA-512 library for SMB3.1.1 preauth hash
cifs: parse_dfs_referrals: prevent oob on malformed input
smb: client: Fix refcount leak for cifs_sb_tlink
smb: client: let smbd_destroy() wait for SMBDIRECT_SOCKET_DISCONNECTED
smb: move some duplicate definitions to common/cifsglob.h
smb: client: let destroy_mr_list() keep smbdirect_mr_io memory if registered
smb: client: let destroy_mr_list() call ib_dereg_mr() before ib_dma_unmap_sg()
smb: client: call ib_dma_unmap_sg if mr->sgt.nents is not 0
smb: client: improve logic in smbd_deregister_mr()
smb: client: improve logic in smbd_register_mr()
smb: client: improve logic in allocate_mr_list()
smb: client: let destroy_mr_list() remove locked from the list
smb: client: let destroy_mr_list() call list_del(&mr->list)
...
|
|
Commit 29d6d30f5c8a ("Btrfs: send, don't send rmdir for same target
multiple times") has fixed an issue that a send stream contained a rmdir
operation for the same directory multiple times. After that fix we keep
track of the last directory for which we sent a rmdir operation and
compare with it before sending a rmdir for the parent inode of a deleted
hardlink we are processing. But there is still a corner case that in
between rmdir dir operations for the same inode we find deleted hardlinks
for other parent inodes, so tracking just the last inode for which we sent
a rmdir operation is not enough.
Hardlinks of a file in the same directory are stored in the same INODE_REF
item, but if the number of hardlinks is too large and can not fit in a
leaf, we use INODE_EXTREF items to store them. The key of an INODE_EXTREF
item is (inode_id, INODE_EXTREF, hash[name, parent ino]), so between two
hardlinks for the same parent directory, we can find others for other
parent directories. For example for the reproducer below we get the
following (from a btrfs inspect-internal dump-tree output):
item 0 key (259 INODE_EXTREF 2309449) itemoff 16257 itemsize 26
index 6925 parent 257 namelen 8 name: foo.6923
item 1 key (259 INODE_EXTREF 2311350) itemoff 16231 itemsize 26
index 6588 parent 258 namelen 8 name: foo.6587
item 2 key (259 INODE_EXTREF 2457395) itemoff 16205 itemsize 26
index 6611 parent 257 namelen 8 name: foo.6609
(...)
So tracking the last directory's inode number does not work in this case
since we process a link for parent inode 257, then for 258 and then back
again for 257, and that second time we process a deleted link for 257 we
think we have not yet sent a rmdir operation.
Fix this by using a rbtree to keep track of all the directories for which
we have already sent rmdir operations, and add those directories to the
'check_dirs' ref list in process_recorded_refs() only if the directory is
not yet in the rbtree, otherwise skip it since it means we have already
sent a rmdir operation for that directory.
The following test script reproduces the problem:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV
mount $DEV $MNT
mkdir $MNT/a $MNT/b
echo 123 > $MNT/a/foo
for ((i = 1; i <= 1000; i++)); do
ln $MNT/a/foo $MNT/a/foo.$i
ln $MNT/a/foo $MNT/b/foo.$i
done
btrfs subvolume snapshot -r $MNT $MNT/snap1
btrfs send $MNT/snap1 -f /tmp/base.send
rm -r $MNT/a $MNT/b
btrfs subvolume snapshot -r $MNT $MNT/snap2
btrfs send -p $MNT/snap1 $MNT/snap2 -f /tmp/incremental.send
umount $MNT
mkfs.btrfs -f $DEV
mount $DEV $MNT
btrfs receive $MNT -f /tmp/base.send
btrfs receive $MNT -f /tmp/incremental.send
rm -f /tmp/base.send /tmp/incremental.send
umount $MNT
When running it, it fails like this:
$ ./test.sh
(...)
At subvol snap1
At snapshot snap2
ERROR: rmdir o257-9-0 failed: No such file or directory
CC: <stable@vger.kernel.org>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Ting-Chang Hou <tchou@synology.com>
[ Updated changelog ]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If fs_info->super_copy or fs_info->super_for_commit allocated failed in
btrfs_get_tree_subvol(), then no need to call btrfs_free_fs_info().
Otherwise btrfs_check_leaked_roots() would access NULL pointer because
fs_info->allocated_roots had not been initialised.
syzkaller reported the following information:
------------[ cut here ]------------
BUG: unable to handle page fault for address: fffffffffffffbb0
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 64c9067 P4D 64c9067 PUD 64cb067 PMD 0
Oops: Oops: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 1402 Comm: syz.1.35 Not tainted 6.15.8 #4 PREEMPT(lazy)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), (...)
RIP: 0010:arch_atomic_read arch/x86/include/asm/atomic.h:23 [inline]
RIP: 0010:raw_atomic_read include/linux/atomic/atomic-arch-fallback.h:457 [inline]
RIP: 0010:atomic_read include/linux/atomic/atomic-instrumented.h:33 [inline]
RIP: 0010:refcount_read include/linux/refcount.h:170 [inline]
RIP: 0010:btrfs_check_leaked_roots+0x18f/0x2c0 fs/btrfs/disk-io.c:1230
[...]
Call Trace:
<TASK>
btrfs_free_fs_info+0x310/0x410 fs/btrfs/disk-io.c:1280
btrfs_get_tree_subvol+0x592/0x6b0 fs/btrfs/super.c:2029
btrfs_get_tree+0x63/0x80 fs/btrfs/super.c:2097
vfs_get_tree+0x98/0x320 fs/super.c:1759
do_new_mount+0x357/0x660 fs/namespace.c:3899
path_mount+0x716/0x19c0 fs/namespace.c:4226
do_mount fs/namespace.c:4239 [inline]
__do_sys_mount fs/namespace.c:4450 [inline]
__se_sys_mount fs/namespace.c:4427 [inline]
__x64_sys_mount+0x28c/0x310 fs/namespace.c:4427
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x92/0x180 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f032eaffa8d
[...]
Fixes: 3bb17a25bcb0 ("btrfs: add get_tree callback for new mount API")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Dewei Meng <mengdewei@cqsoftware.com.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since commit 0c17270f9b92 ("net: sysfs: Implement is_visible for
phys_(port_id, port_name, switch_id)"), __dev_change_net_namespace() can
hit WARN_ON() when trying to change owner of a file that isn't visible.
See the trace below:
WARNING: CPU: 6 PID: 2938 at net/core/dev.c:12410 __dev_change_net_namespace+0xb89/0xc30
CPU: 6 UID: 0 PID: 2938 Comm: incusd Not tainted 6.17.1-1-mainline #1 PREEMPT(full) 4b783b4a638669fb644857f484487d17cb45ed1f
Hardware name: Framework Laptop 13 (AMD Ryzen 7040Series)/FRANMDCP07, BIOS 03.07 02/19/2025
RIP: 0010:__dev_change_net_namespace+0xb89/0xc30
[...]
Call Trace:
<TASK>
? if6_seq_show+0x30/0x50
do_setlink.isra.0+0xc7/0x1270
? __nla_validate_parse+0x5c/0xcc0
? security_capable+0x94/0x1a0
rtnl_newlink+0x858/0xc20
? update_curr+0x8e/0x1c0
? update_entity_lag+0x71/0x80
? sched_balance_newidle+0x358/0x450
? psi_task_switch+0x113/0x2a0
? __pfx_rtnl_newlink+0x10/0x10
rtnetlink_rcv_msg+0x346/0x3e0
? sched_clock+0x10/0x30
? __pfx_rtnetlink_rcv_msg+0x10/0x10
netlink_rcv_skb+0x59/0x110
netlink_unicast+0x285/0x3c0
? __alloc_skb+0xdb/0x1a0
netlink_sendmsg+0x20d/0x430
____sys_sendmsg+0x39f/0x3d0
? import_iovec+0x2f/0x40
___sys_sendmsg+0x99/0xe0
__sys_sendmsg+0x8a/0xf0
do_syscall_64+0x81/0x970
? __sys_bind+0xe3/0x110
? syscall_exit_work+0x143/0x1b0
? do_syscall_64+0x244/0x970
? sock_alloc_file+0x63/0xc0
? syscall_exit_work+0x143/0x1b0
? do_syscall_64+0x244/0x970
? alloc_fd+0x12e/0x190
? put_unused_fd+0x2a/0x70
? do_sys_openat2+0xa2/0xe0
? syscall_exit_work+0x143/0x1b0
? do_syscall_64+0x244/0x970
? exc_page_fault+0x7e/0x1a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
[...]
</TASK>
Fix this by checking is_visible() before trying to touch the attribute.
Fixes: 303a42769c4c ("sysfs: add sysfs_group{s}_change_owner()")
Fixes: 0c17270f9b92 ("net: sysfs: Implement is_visible for phys_(port_id, port_name, switch_id)")
Reported-by: Cynthia <cynthia@kosmx.dev>
Closes: https://lore.kernel.org/netdev/01070199e22de7f8-28f711ab-d3f1-46d9-b9a0-048ab05eb09b-000000@eu-central-1.amazonses.com/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20251016101456.4087-1-fmancera@suse.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Robert recently reported two corrupted images that can cause system
crashes, which are related to the new encoded extents introduced
in Linux 6.15:
- The first one [1] has plen != 0 (e.g. plen == 0x2000000) but
(plen & Z_EROFS_EXTENT_PLEN_MASK) == 0. It is used to represent
special extents such as sparse extents (!EROFS_MAP_MAPPED), but
previously only plen == 0 was handled;
- The second one [2] has pa 0xffffffffffdcffed and plen 0xb4000,
then "cur [0xfffffffffffff000] += bvec.bv_len [0x1000]" in
"} while ((cur += bvec.bv_len) < end);" wraps around, causing an
out-of-bound access of pcl->compressed_bvecs[] in
z_erofs_submit_queue(). EROFS only supports 48-bit physical block
addresses (up to 1EiB for 4k blocks), so add a sanity check to
enforce this.
Fixes: 1d191b4ca51d ("erofs: implement encoded extent metadata")
Reported-by: Robert Morris <rtm@csail.mit.edu>
Closes: https://lore.kernel.org/r/75022.1759355830@localhost [1]
Closes: https://lore.kernel.org/r/80524.1760131149@localhost [2]
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs fixes from Jaegeuk Kim:
- fix soft lockupg caused by iput() added in bc986b1d756482a ("fs: stop
accessing ->i_count directly in f2fs and gfs2")
- fix a wrong block address map on multiple devices
Link: https://lore.kernel.org/oe-lkp/202509301450.138b448f-lkp@intel.com [1]
* tag 'f2fs-fix-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: fix wrong block mapping for multi-devices
f2fs: don't call iput() from f2fs_drop_inode()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- in tree-checker fix extref bounds check
- reorder send context structure to avoid
-Wflex-array-member-not-at-end warning
- fix extent readahead length for compressed extents
- fix memory leaks on error paths (qgroup assign ioctl, zone loading
with raid stripe tree enabled)
- fix how device specific mount options are applied, in particular the
'ssd' option will be set unexpectedly
- fix tracking of relocation state when tasks are running and
cancellation is attempted
- adjust assertion condition for folios allocated for scrub
- remove incorrect assertion checking for block group when populating
free space tree
* tag 'for-6.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: send: fix -Wflex-array-member-not-at-end warning in struct send_ctx
btrfs: tree-checker: fix bounds check in check_inode_extref()
btrfs: fix memory leaks when rejecting a non SINGLE data profile without an RST
btrfs: fix incorrect readahead expansion length
btrfs: do not assert we found block group item when creating free space tree
btrfs: do not use folio_test_partial_kmap() in ASSERT()s
btrfs: only set the device specific options after devices are opened
btrfs: fix memory leak on duplicated memory in the qgroup assign ioctl
btrfs: fix clearing of BTRFS_FS_RELOC_RUNNING if relocation already running
|
|
Pull smb server fixes from Steve French:
- Fix RPC hang due to locking bug
- Fix for memory leak in read and refcount leak (in session setup)
- Minor cleanup
* tag 'v6.18-rc1-smb-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix recursive locking in RPC handle list access
smb/server: fix possible refcount leak in smb2_sess_setup()
smb/server: fix possible memory leak in smb2_read()
smb: server: Use common error handling code in smb_direct_rdma_xmit()
|
|
Now that smb3_crypto_shash_allocate() and smb311_crypto_shash_allocate()
are identical and only allocate "cmac(aes)", delete the latter and
replace the call to it with the former.
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Now that the SMB client accesses MD5, HMAC-MD5, HMAC-SHA256, and SHA-512
only via the library API and not via crypto_shash, allocating
crypto_shash objects for these algorithms is no longer necessary.
Remove all these allocations, their corresponding kconfig selections,
and their corresponding module soft dependencies.
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
For the HMAC-MD5 computations in NTLMv2, use the HMAC-MD5 library
instead of a "hmac(md5)" crypto_shash. This is simpler and faster.
With the library there's no need to allocate memory, no need to handle
errors, and the HMAC-MD5 code is accessed directly without inefficient
indirect calls and other unnecessary API overhead.
To preserve the existing behavior of NTLMv2 support being disabled when
the kernel is booted with "fips=1", make setup_ntlmv2_rsp() check
fips_enabled itself. Previously it relied on the error from
cifs_alloc_hash("hmac(md5)", &hmacmd5).
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Convert cifs_calc_signature() to use the MD5 library instead of a "md5"
crypto_shash. This is simpler and faster. With the library there's no
need to allocate memory, no need to handle errors, and the MD5 code is
accessed directly without inefficient indirect calls and other
unnecessary API overhead.
To preserve the existing behavior of MD5 signature support being
disabled when the kernel is booted with "fips=1", make
cifs_calc_signature() check fips_enabled itself. Previously it relied
on the error from cifs_alloc_hash("md5", &server->secmech.md5).
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Convert parse_mf_symlink() and format_mf_symlink() to use the MD5
library instead of a "md5" crypto_shash. This is simpler and faster.
With the library there's no need to allocate memory, no need to handle
errors, and the MD5 code is accessed directly without inefficient
indirect calls and other unnecessary API overhead.
This also fixes an issue where these functions did not work on kernels
booted in FIPS mode. The use of MD5 here is for data integrity rather
than a security purpose, so it can use a non-FIPS-approved algorithm.
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Convert smb2_calc_signature() to use the HMAC-SHA256 library instead of
a "hmac(sha256)" crypto_shash. This is simpler and faster. With the
library there's no need to allocate memory, no need to handle errors,
and the HMAC-SHA256 code is accessed directly without inefficient
indirect calls and other unnecessary API overhead.
To make this possible, make __cifs_calc_signature() support both the
HMAC-SHA256 library and crypto_shash. (crypto_shash is still needed for
HMAC-MD5 and AES-CMAC. A later commit will switch HMAC-MD5 from shash
to the library. I'd like to eventually do the same for AES-CMAC, but it
doesn't have a library API yet. So for now, shash is still needed.)
Also remove the unnecessary 'sigptr' variable.
For now smb3_crypto_shash_allocate() still allocates a "hmac(sha256)"
crypto_shash. It will be removed in a later commit.
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Convert generate_key() to use the HMAC-SHA256 library instead of a
"hmac(sha256)" crypto_shash. This is simpler and faster. With the
library there's no need to allocate memory, no need to handle errors,
and the HMAC-SHA256 code is accessed directly without inefficient
indirect calls and other unnecessary API overhead.
Also remove the unnecessary 'hashptr' variable.
For now smb3_crypto_shash_allocate() still allocates a "hmac(sha256)"
crypto_shash. It will be removed in a later commit.
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Convert smb311_update_preauth_hash() to use the SHA-512 library instead
of a "sha512" crypto_shash. This is simpler and faster. With the
library there's no need to allocate memory, no need to handle errors,
and the SHA-512 code is accessed directly without inefficient indirect
calls and other unnecessary API overhead.
Remove the call to smb311_crypto_shash_allocate() from
smb311_update_preauth_hash(), since it appears to have been needed only
to allocate the "sha512" crypto_shash. (It also had the side effect of
allocating the "cmac(aes)" crypto_shash, but that's also done in
generate_key() which is where the AES-CMAC key is initialized.)
For now the "sha512" crypto_shash is still being allocated elsewhere.
It will be removed in a later commit.
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Malicious SMB server can send invalid reply to FSCTL_DFS_GET_REFERRALS
- reply smaller than sizeof(struct get_dfs_referral_rsp)
- reply with number of referrals smaller than NumberOfReferrals in the
header
Processing of such replies will cause oob.
Return -EINVAL error on such replies to prevent oob-s.
Signed-off-by: Eugene Korenevsky <ekorenevsky@aliyun.com>
Cc: stable@vger.kernel.org
Suggested-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Fix three refcount inconsistency issues related to `cifs_sb_tlink`.
Comments for `cifs_sb_tlink` state that `cifs_put_tlink()` needs to be
called after successful calls to `cifs_sb_tlink()`. Three calls fail to
update refcount accordingly, leading to possible resource leaks.
Fixes: 8ceb98437946 ("CIFS: Move rename to ops struct")
Fixes: 2f1afe25997f ("cifs: Use smb 2 - 3 and cifsacl mount options getacl functions")
Fixes: 366ed846df60 ("cifs: Use smb 2 - 3 and cifsacl mount options setacl function")
Cc: stable@vger.kernel.org
Signed-off-by: Shuhao Fu <sfual@cse.ust.hk>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Handle inode number mismatches in nsfs file handles
- Update the comment to init_file()
- Add documentation link for EBADF in the rust file code
- Skip read lock assertion for read-only filesystems when using dax
- Don't leak disconnected dentries during umount
- Fix new coredump input pattern validation
- Handle ENOIOCTLCMD conversion in vfs_fileattr_{g,s}et() correctly
- Remove redundant IOCB_DIO_CALLER_COMP clearing in overlayfs
* tag 'vfs-6.18-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
ovl: remove redundant IOCB_DIO_CALLER_COMP clearing
fs: return EOPNOTSUPP from file_setattr/file_getattr syscalls
Revert "fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP"
coredump: fix core_pattern input validation
vfs: Don't leak disconnected dentries on umount
dax: skip read lock assertion for read-only filesystems
rust: file: add intra-doc link for 'EBADF'
fs: update comment in init_file()
nsfs: handle inode number mismatches gracefully in file handles
|
|
The extent map cache can become stale when extents are moved or
defragmented, causing subsequent operations to see outdated extent flags.
This triggers a BUG_ON in ocfs2_refcount_cal_cow_clusters().
The problem occurs when:
1. copy_file_range() creates a reflinked extent with OCFS2_EXT_REFCOUNTED
2. ioctl(FITRIM) triggers ocfs2_move_extents()
3. __ocfs2_move_extents_range() reads and caches the extent (flags=0x2)
4. ocfs2_move_extent()/ocfs2_defrag_extent() calls __ocfs2_move_extent()
which clears OCFS2_EXT_REFCOUNTED flag on disk (flags=0x0)
5. The extent map cache is not invalidated after the move
6. Later write() operations read stale cached flags (0x2) but disk has
updated flags (0x0), causing a mismatch
7. BUG_ON(!(rec->e_flags & OCFS2_EXT_REFCOUNTED)) triggers
Fix by clearing the extent map cache after each extent move/defrag
operation in __ocfs2_move_extents_range(). This ensures subsequent
operations read fresh extent data from disk.
Link: https://lore.kernel.org/all/20251009142917.517229-1-kartikey406@gmail.com/T/
Link: https://lkml.kernel.org/r/20251009154903.522339-1-kartikey406@gmail.com
Fixes: 53069d4e7695 ("Ocfs2/move_extents: move/defrag extents within a certain range.")
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reported-by: syzbot+6fdd8fa3380730a4b22c@syzkaller.appspotmail.com
Tested-by: syzbot+6fdd8fa3380730a4b22c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?id=2959889e1f6e216585ce522f7e8bc002b46ad9e7
Reviewed-by: Mark Fasheh <mark@fasheh.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We should wait for the rdma_cm to become SMBDIRECT_SOCKET_DISCONNECTED,
it turns out that (at least running some xfstests e.g. cifs/001)
often triggers the case where wait_event_interruptible() returns
with -ERESTARTSYS instead of waiting for SMBDIRECT_SOCKET_DISCONNECTED
to be reached.
Or we are already in SMBDIRECT_SOCKET_DISCONNECTING and never wait
for SMBDIRECT_SOCKET_DISCONNECTED.
Fixes: 050b8c374019 ("smbd: Make upper layer decide when to destroy the transport")
Fixes: e8b3bfe9bc65 ("cifs: smbd: Don't destroy transport on RDMA disconnect")
Fixes: b0aa92a229ab ("smb: client: make sure smbd_disconnect_rdma_work() doesn't run after smbd_destroy() took over")
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bug fixes from Ted Ts'o:
- Fix regression caused by removing CONFIG_EXT3_FS when testing some
very old defconfigs
- Avoid a BUG_ON when opening a file on a maliciously corrupted file
system
- Avoid mm warnings when freeing a very large orphan file metadata
- Avoid a theoretical races between metadata writeback and checkpoints
(it's very hard to hit in practice, since the race requires that the
writeback take a very long time)
* tag 'ext4_for_linus-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
Use CONFIG_EXT4_FS instead of CONFIG_EXT3_FS in all of the defconfigs
ext4: free orphan info with kvfree
ext4: detect invalid INLINE_DATA + EXTENTS flag combination
ext4, doc: fix and improve directory hash tree description
ext4: wait for ongoing I/O to complete before freeing blocks
jbd2: ensure that all ongoing I/O complete before freeing blocks
|
|
In order to maintain the code more easily, move duplicate definitions to
new common header file.
Co-developed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: ZhangGuoDong <zhangguodong@kylinos.cn>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
If a smbdirect_mr_io structure if still visible to callers of
smbd_register_mr() we can't free the related memory when the
connection is disconnected! Otherwise smbd_deregister_mr()
will crash.
Now we use a mutex and refcounting in order to keep the
memory around if the connection is disconnected.
It means smbd_deregister_mr() can be called at any later time to free
the memory, which is no longer referenced by nor referencing the
connection.
It also means smbd_destroy() no longer needs to wait for
mr_io.used.count to become 0.
Fixes: 050b8c374019 ("smbd: Make upper layer decide when to destroy the transport")
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Since commit 305853cce3794 ("ksmbd: Fix race condition in RPC handle list
access"), ksmbd_session_rpc_method() attempts to lock sess->rpc_lock.
This causes hung connections / tasks when a client attempts to open
a named pipe. Using Samba's rpcclient tool:
$ rpcclient //192.168.1.254 -U user%password
$ rpcclient $> srvinfo
<connection hung here>
Kernel side:
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/0:0 state:D stack:0 pid:5021 tgid:5021 ppid:2 flags:0x00200000
Workqueue: ksmbd-io handle_ksmbd_work
Call trace:
__schedule from schedule+0x3c/0x58
schedule from schedule_preempt_disabled+0xc/0x10
schedule_preempt_disabled from rwsem_down_read_slowpath+0x1b0/0x1d8
rwsem_down_read_slowpath from down_read+0x28/0x30
down_read from ksmbd_session_rpc_method+0x18/0x3c
ksmbd_session_rpc_method from ksmbd_rpc_open+0x34/0x68
ksmbd_rpc_open from ksmbd_session_rpc_open+0x194/0x228
ksmbd_session_rpc_open from create_smb2_pipe+0x8c/0x2c8
create_smb2_pipe from smb2_open+0x10c/0x27ac
smb2_open from handle_ksmbd_work+0x238/0x3dc
handle_ksmbd_work from process_scheduled_works+0x160/0x25c
process_scheduled_works from worker_thread+0x16c/0x1e8
worker_thread from kthread+0xa8/0xb8
kthread from ret_from_fork+0x14/0x38
Exception stack(0x8529ffb0 to 0x8529fff8)
The task deadlocks because the lock is already held:
ksmbd_session_rpc_open
down_write(&sess->rpc_lock)
ksmbd_rpc_open
ksmbd_session_rpc_method
down_read(&sess->rpc_lock) <-- deadlock
Adjust ksmbd_session_rpc_method() callers to take the lock when necessary.
Fixes: 305853cce3794 ("ksmbd: Fix race condition in RPC handle list access")
Signed-off-by: Marios Makassikis <mmakassikis@freebox.fr>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Reference count of ksmbd_session will leak when session need reconnect.
Fix this by adding the missing ksmbd_user_session_put().
Co-developed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: ZhangGuoDong <zhangguodong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Memory leak occurs when ksmbd_vfs_read() fails.
Fix this by adding the missing kvfree().
Co-developed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: ZhangGuoDong <zhangguodong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Since the len argument value passed to exfat_ioctl_set_volume_label()
from exfat_nls_to_utf16() is passed 1 too large, an out-of-bounds read
occurs when dereferencing p_cstring in exfat_nls_to_ucs2() later.
And because of the NLS_NAME_OVERLEN macro, another error occurs when
creating a file with a period at the end using utf8 and other iocharsets.
So to avoid this, you should remove the code that uses NLS_NAME_OVERLEN
macro and make the len argument value be the length of the label string,
but with a maximum length of FSLABEL_MAX - 1.
Reported-by: syzbot+98cc76a76de46b3714d4@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=98cc76a76de46b3714d4
Fixes: d01579d590f7 ("exfat: Add support for FS_IOC_{GET,SET}FSLABEL")
Suggested-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
We found an infinite loop bug in the exFAT file system that can lead to a
Denial-of-Service (DoS) condition. When a dentry in an exFAT filesystem is
malformed, the following system calls — SYS_openat, SYS_ftruncate, and
SYS_pwrite64 — can cause the kernel to hang.
Root cause analysis shows that the size validation code in exfat_find()
does not check whether dentry.stream.valid_size is negative. As a result,
the system calls mentioned above can succeed and eventually trigger the DoS
issue.
This patch adds a check for negative dentry.stream.valid_size to prevent
this vulnerability.
Co-developed-by: Seunghun Han <kkamagui@gmail.com>
Signed-off-by: Seunghun Han <kkamagui@gmail.com>
Co-developed-by: Jihoon Kwon <jimmyxyz010315@gmail.com>
Signed-off-by: Jihoon Kwon <jimmyxyz010315@gmail.com>
Signed-off-by: Jaehun Gou <p22gone@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fix from Chuck Lever:
- Fix a crasher reported by rtm@csail.mit.edu
* tag 'nfsd-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: Define a proc_layoutcommit for the FlexFiles layout type
|
|
Assuming the disk layout as below,
disk0: 0 --- 0x00035abfff
disk1: 0x00035ac000 --- 0x00037abfff
disk2: 0x00037ac000 --- 0x00037ebfff
and we want to read data from offset=13568 having len=128 across the block
devices, we can illustrate the block addresses like below.
0 .. 0x00037ac000 ------------------- 0x00037ebfff, 0x00037ec000 -------
| ^ ^ ^
| fofs 0 13568 13568+128
| ------------------------------------------------------
| LBA 0x37e8aa9 0x37ebfa9 0x37ec029
--- map 0x3caa9 0x3ffa9
In this example, we should give the relative map of the target block device
ranging from 0x3caa9 to 0x3ffa9 where the length should be calculated by
0x37ebfff + 1 - 0x37ebfa9.
In the below equation, however, map->m_pblk was supposed to be the original
address instead of the one from the target block address.
- map->m_len = min(map->m_len, dev->end_blk + 1 - map->m_pblk);
Cc: stable@vger.kernel.org
Fixes: 71f2c8206202 ("f2fs: multidevice: support direct IO")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
iput() calls the problematic routine, which does a ->i_count inc/dec
cycle. Undoing it with iput() recurses into the problem.
Note f2fs should not be playing games with the refcount to begin with,
but that will be handled later. Right now solve the immediate
regression.
Fixes: bc986b1d756482a ("fs: stop accessing ->i_count directly in f2fs and gfs2")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202509301450.138b448f-lkp@intel.com
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
The warning -Wflex-array-member-not-at-end was introduced in GCC-14, and
we are getting ready to enable it, globally.
Fix the following warning:
fs/btrfs/send.c:181:24: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
and move the declaration of send_ctx::cur_inode_path to the end.
Notice that struct fs_path contains a flexible array member inline_buf,
but also a padding array and a limit calculated for the usable space of
inline_buf (FS_PATH_INLINE_SIZE). It is not the pattern where flexible
array is in the middle of a structure and could potentially overwrite
other members.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The parentheses for the unlikely() annotation were put in the wrong
place so it means that the condition is basically never true and the
bounds checking is skipped.
Fixes: aab9458b9f00 ("btrfs: tree-checker: add inode extref checks")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At the end of btrfs_load_block_group_zone_info() the first thing we do
is to ensure that if the mapping type is not a SINGLE one and there is
no RAID stripe tree, then we return early with an error.
Doing that, though, prevents the code from running the last calls from
this function which are about freeing memory allocated during its
run. Hence, in this case, instead of returning early, we set the ret
value and fall through the rest of the cleanup code.
Fixes: 5906333cc4af ("btrfs: zoned: don't skip block group profile checks on conventional zones")
CC: stable@vger.kernel.org # 6.8+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The intent of btrfs_readahead_expand() was to expand to the length of
the current compressed extent being read. However, "ram_bytes" is *not*
that, in the case where a single physical compressed extent is used for
multiple file extents.
Consider this case with a large compressed extent C and then later two
non-compressed extents N1 and N2 written over C, leaving C1 and C2
pointing to offset/len pairs of C:
[ C ]
[ N1 ][ C1 ][ N2 ][ C2 ]
In such a case, ram_bytes for both C1 and C2 is the full uncompressed
length of C. So starting readahead in C1 will expand the readahead past
the end of C1, past N2, and into C2. This will then expand readahead
again, to C2_start + ram_bytes, way past EOF. First of all, this is
totally undesirable, we don't want to read the whole file in arbitrary
chunks of the large underlying extent if it happens to exist. Secondly,
it results in zeroing the range past the end of C2 up to ram_bytes. This
is particularly unpleasant with fs-verity as it can zero and set
uptodate pages in the verity virtual space past EOF. This incorrect
readahead behavior can lead to verity verification errors, if we iterate
in a way that happens to do the wrong readahead.
Fix this by using em->len for readahead expansion, not em->ram_bytes,
resulting in the expected behavior of stopping readahead at the extent
boundary.
Reported-by: Max Chernoff <git@maxchernoff.ca>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2399898
Fixes: 9e9ff875e417 ("btrfs: use readahead_expand() on compressed extents")
CC: stable@vger.kernel.org # 6.17
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently, when building a free space tree at populate_free_space_tree(),
if we are not using the block group tree feature, we always expect to find
block group items (either extent items or a block group item with key type
BTRFS_BLOCK_GROUP_ITEM_KEY) when we search the extent tree with
btrfs_search_slot_for_read(), so we assert that we found an item. However
this expectation is wrong since we can have a new block group created in
the current transaction which is still empty and for which we still have
not added the block group's item to the extent tree, in which case we do
not have any items in the extent tree associated to the block group.
The insertion of a new block group's block group item in the extent tree
happens at btrfs_create_pending_block_groups() when it calls the helper
insert_block_group_item(). This typically is done when a transaction
handle is released, committed or when running delayed refs (either as
part of a transaction commit or when serving tickets for space reservation
if we are low on free space).
So remove the assertion at populate_free_space_tree() even when the block
group tree feature is not enabled and update the comment to mention this
case.
Syzbot reported this with the following stack trace:
BTRFS info (device loop3 state M): rebuilding free space tree
assertion failed: ret == 0 :: 0, in fs/btrfs/free-space-tree.c:1115
------------[ cut here ]------------
kernel BUG at fs/btrfs/free-space-tree.c:1115!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 1 UID: 0 PID: 6352 Comm: syz.3.25 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025
RIP: 0010:populate_free_space_tree+0x700/0x710 fs/btrfs/free-space-tree.c:1115
Code: ff ff e8 d3 (...)
RSP: 0018:ffffc9000430f780 EFLAGS: 00010246
RAX: 0000000000000043 RBX: ffff88805b709630 RCX: fea61d0e2e79d000
RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
RBP: ffffc9000430f8b0 R08: ffffc9000430f4a7 R09: 1ffff92000861e94
R10: dffffc0000000000 R11: fffff52000861e95 R12: 0000000000000001
R13: 1ffff92000861f00 R14: dffffc0000000000 R15: 0000000000000000
FS: 00007f424d9fe6c0(0000) GS:ffff888125afc000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd78ad212c0 CR3: 0000000076d68000 CR4: 00000000003526f0
Call Trace:
<TASK>
btrfs_rebuild_free_space_tree+0x1ba/0x6d0 fs/btrfs/free-space-tree.c:1364
btrfs_start_pre_rw_mount+0x128f/0x1bf0 fs/btrfs/disk-io.c:3062
btrfs_remount_rw fs/btrfs/super.c:1334 [inline]
btrfs_reconfigure+0xaed/0x2160 fs/btrfs/super.c:1559
reconfigure_super+0x227/0x890 fs/super.c:1076
do_remount fs/namespace.c:3279 [inline]
path_mount+0xd1a/0xfe0 fs/namespace.c:4027
do_mount fs/namespace.c:4048 [inline]
__do_sys_mount fs/namespace.c:4236 [inline]
__se_sys_mount+0x313/0x410 fs/namespace.c:4213
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f424e39066a
Code: d8 64 89 02 (...)
RSP: 002b:00007f424d9fde68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007f424d9fdef0 RCX: 00007f424e39066a
RDX: 0000200000000180 RSI: 0000200000000380 RDI: 0000000000000000
RBP: 0000200000000180 R08: 00007f424d9fdef0 R09: 0000000000000020
R10: 0000000000000020 R11: 0000000000000246 R12: 0000200000000380
R13: 00007f424d9fdeb0 R14: 0000000000000000 R15: 00002000000002c0
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
Reported-by: syzbot+884dc4621377ba579a6f@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/68dc3dab.a00a0220.102ee.004e.GAE@google.com/
Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree")
CC: <stable@vger.kernel.org> # 6.1.x: 1961d20f6fa8: btrfs: fix assertion when building free space tree
CC: <stable@vger.kernel.org> # 6.1.x
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
Syzbot reported an ASSERT() triggered inside scrub:
BTRFS info (device loop0): scrub: started on devid 1
assertion failed: !folio_test_partial_kmap(folio) :: 0, in fs/btrfs/scrub.c:697
------------[ cut here ]------------
kernel BUG at fs/btrfs/scrub.c:697!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 6077 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025
RIP: 0010:scrub_stripe_get_kaddr+0x1bb/0x1c0 fs/btrfs/scrub.c:697
Call Trace:
<TASK>
scrub_bio_add_sector fs/btrfs/scrub.c:932 [inline]
scrub_submit_initial_read+0xf21/0x1120 fs/btrfs/scrub.c:1897
submit_initial_group_read+0x423/0x5b0 fs/btrfs/scrub.c:1952
flush_scrub_stripes+0x18f/0x1150 fs/btrfs/scrub.c:1973
scrub_stripe+0xbea/0x2a30 fs/btrfs/scrub.c:2516
scrub_chunk+0x2a3/0x430 fs/btrfs/scrub.c:2575
scrub_enumerate_chunks+0xa70/0x1350 fs/btrfs/scrub.c:2839
btrfs_scrub_dev+0x6e7/0x10e0 fs/btrfs/scrub.c:3153
btrfs_ioctl_scrub+0x249/0x4b0 fs/btrfs/ioctl.c:3163
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
---[ end trace 0000000000000000 ]---
Which doesn't make much sense, as all the folios we allocated for scrub
should not be highmem.
[CAUSE]
Thankfully syzbot has a detailed kernel config file, showing that
CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP is set to y.
And that debug option will force all folio_test_partial_kmap() to return
true, to improve coverage on highmem tests.
But in our case we really just want to make sure the folios we allocated
are not highmem (and they are indeed not). Such incorrect result from
folio_test_partial_kmap() is just screwing up everything.
[FIX]
Replace folio_test_partial_kmap() to folio_test_highmem() so that we
won't bother those highmem specific debuging options.
Fixes: 5fbaae4b8567 ("btrfs: prepare scrub to support bs > ps cases")
Reported-by: syzbot+bde59221318c592e6346@syzkaller.appspotmail.com
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
With v6.17-rc kernels, btrfs will always set 'ssd' mount option even if
the block device is not a rotating one:
# cat /sys/block/sdd/queue/rotational
1
# cat /etc/fstab:
LABEL=DATA2 /data2 btrfs rw,relatime,space_cache=v2,subvolid=5,subvol=/,nofail,nosuid,nodev 0 0
# mount
[...]
/dev/sdd on /data2 type btrfs (rw,nosuid,nodev,relatime,ssd,space_cache=v2,subvolid=5,subvol=/)
[CAUSE]
The 'ssd' mount option is set by set_device_specific_options(), and it
expects that if there is any rotating device in the btrfs, it will set
fs_devices::rotating.
However after commit bddf57a70781 ("btrfs: delay btrfs_open_devices()
until super block is created"), the device opening is delayed until the
super block is created.
But the timing of set_device_specific_options() is still left as is,
this makes the function be called without any device opened.
Since no device is opened, thus fs_devices::rotating will never be set,
making btrfs incorrectly set 'ssd' mount option.
[FIX]
Only call set_device_specific_options() after btrfs_open_devices().
Also only call set_device_specific_options() after a new mount, if we're
mounting a mounted btrfs, there is no need to set the device specific
mount options again.
Reported-by: HAN Yuwei <hrx@bupt.moe>
Link: https://lore.kernel.org/linux-btrfs/C8FF75669DFFC3C5+5f93bf8a-80a0-48a6-81bf-4ec890abc99a@bupt.moe/
Fixes: bddf57a70781 ("btrfs: delay btrfs_open_devices() until super block is created")
CC: stable@vger.kernel.org # 6.17
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
On 'btrfs_ioctl_qgroup_assign' we first duplicate the argument as
provided by the user, which is kfree'd in the end. But this was not the
case when allocating memory for 'prealloc'. In this case, if it somehow
failed, then the previous code would go directly into calling
'mnt_drop_write_file', without freeing the string duplicated from the
user space.
Fixes: 4addc1ffd67a ("btrfs: qgroup: preallocate memory before adding a relation")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When starting relocation, at reloc_chunk_start(), if we happen to find
the flag BTRFS_FS_RELOC_RUNNING is already set we return an error
(-EINPROGRESS) to the callers, however the callers call reloc_chunk_end()
which will clear the flag BTRFS_FS_RELOC_RUNNING, which is wrong since
relocation was started by another task and still running.
Finding the BTRFS_FS_RELOC_RUNNING flag already set is an unexpected
scenario, but still our current behaviour is not correct.
Fix this by never calling reloc_chunk_end() if reloc_chunk_start() has
returned an error, which is what logically makes sense, since the general
widespread pattern is to have end functions called only if the counterpart
start functions succeeded. This requires changing reloc_chunk_start() to
clear BTRFS_FS_RELOC_RUNNING if there's a pending cancel request.
Fixes: 907d2710d727 ("btrfs: add cancellable chunk relocation support")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since the last renewal time was initialized to 0 and jiffies start
counting at -5 minutes, any clients connected in the first 5 minutes
after a reboot would have their renewal timer set to a very long
interval. If the connection was idle, this would result in the client
state timing out on the server and the next call to the server would
return NFS4ERR_BADSESSION.
Fix this by initializing the last renewal time to the current jiffies
instead of 0.
Signed-off-by: Joshua Watt <jpewhacker@gmail.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
I noticed xfstests generic/193 and generic/355 started failing against
knfsd after commit e7a8ebc305f2 ("NFSD: Offer write delegation for OPEN
with OPEN4_SHARE_ACCESS_WRITE").
I ran those same tests against ONTAP (which has had write delegation
support for a lot longer than knfsd) and they fail there too... so
while it's a new failure against knfsd, it isn't an entirely new
failure.
Add the NFS_INO_REVAL_FORCED flag so that the presence of a delegation
doesn't keep the inode from being revalidated to fetch the updated mode.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
The setting of delay_retrans is applied to synchronous RPC operations
because the retransmit count is stored in same struct nfs4_exception
that is passed each time an error is checked. However, for asynchronous
operations (READ, WRITE, LOCKU, CLOSE, DELEGRETURN), a new struct
nfs4_exception is made on the stack each time the task callback is
invoked. This means that the retransmit count is always zero and thus
delay_retrans never takes effect.
Apply delay_retrans to these operations by tracking and updating their
retransmit count.
Change-Id: Ieb33e046c2b277cb979caa3faca7f52faf0568c9
Signed-off-by: Joshua Watt <jpewhacker@gmail.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
Move mirror_array's dss_count initialization and dss allocation to
ff_layout_alloc_mirror(), just before the loop that initializes each
nfs4_ff_layout_ds_stripe's nfs_file_localio.
Also handle NULL return from kcalloc() and remove one level of indent
in ff_layout_alloc_mirror().
This commit fixes dangling nfsd_serv refcount issues seen when using
NFS LOCALIO and then attempting to stop the NFSD service.
Fixes: 20b1d75fb840 ("NFSv4/flexfiles: Add support for striped layouts")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
This is more consistent as we call ib_dma_unmap_sg() only
when the memory is no longer registered.
This is the same pattern as calling ib_dma_unmap_sg() after
IB_WR_LOCAL_INV.
Fixes: c7398583340a ("CIFS: SMBD: Implement RDMA memory registration")
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
This seems to be the more reliable way to check if we need to
call ib_dma_unmap_sg().
Fixes: c7398583340a ("CIFS: SMBD: Implement RDMA memory registration")
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
- use 'mr' as variable name
- style fixes
This will make further changes easier.
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
- use 'mr' as variable name
- style fixes
This will make further changes easier.
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
- use 'mr' as variable name
- use goto lables for easier cleanup
- use destroy_mr_list()
- style fixes
- INIT_WORK(&sc->mr_io.recovery_work, smbd_mr_recovery_work) on success
This will make further changes easier.
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
This should make sure get_mr() can't see the removed entries.
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
This makes the code clearer and will make further changes easier.
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
No callers checks the return value and this makes further
changes easier.
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
This will be used in the next commits in order to improve the
client code.
A broken connection can just disable the smbdirect_mr_io while
keeping the memory arround for the caller.
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The local variable “rc” is assigned a value in an if branch without
using it before it is reassigned there.
Thus delete this assignment statement.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Return a status code without storing it in an intermediate variable.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Add two jump targets so that a bit of exception handling can be better
reused at the end of this function implementation.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|