sysctl: allow change system v ipc sysctls inside ipc namespace Patch series "Allow to change ipc/mq sysctls inside ipc namespace", v3. Right now ipc and mq limits count as per ipc namespace, but only real root can change them. By default, the current values of these limits are such that it can only be reduced. Since only root can change the values, it is impossible to reduce these limits in the rootless container. We can allow limit changes within ipc namespace because mq parameters are limited by RLIMIT_MSGQUEUE and ipc parameters are not limited to anything other than cgroups. This patch (of 3): Rootless containers are not allowed to modify kernel IPC parameters. All default limits are set to such high values that in fact there are no limits at all. All limits are not inherited and are initialized to default values when a new ipc_namespace is created. For new ipc_namespace: size_t ipc_ns.shm_ctlmax = SHMMAX; // (ULONG_MAX - (1UL << 24)) size_t ipc_ns.shm_ctlall = SHMALL; // (ULONG_MAX - (1UL << 24)) int ipc_ns.shm_ctlmni = IPCMNI; // (1 << 15) int ipc_ns.shm_rmid_forced = 0; unsigned int ipc_ns.msg_ctlmax = MSGMAX; // 8192 unsigned int ipc_ns.msg_ctlmni = MSGMNI; // 32000 unsigned int ipc_ns.msg_ctlmnb = MSGMNB; // 16384 The shm_tot (total amount of shared pages) has also ceased to be global, it is located in ipc_namespace and is not inherited from anywhere. In such conditions, it cannot be said that these limits limit anything. The real limiter for them is cgroups. If we allow rootless containers to change these parameters, then it can only be reduced. Link: https://lkml.kernel.org/r/cover.1705333426.git.legion@kernel.org Link: https://lkml.kernel.org/r/d2f4603305cbfed58a24755aa61d027314b73a45.1705333426.git.legion@kernel.org Signed-off-by: Alexey Gladkov <legion@kernel.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Link: https://lkml.kernel.org/r/e2d84d3ec0172cfff759e6065da84ce0cc2736f8.1663756794.git.legion@kernel.org Cc: Christian Brauner <brauner@kernel.org> Cc: Joel Granados <joel.granados@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
sysctl: Add a size arg to __register_sysctl_table We make these changes in order to prepare __register_sysctl_table and its callers for when we remove the sentinel element (empty element at the end of ctl_table arrays). We don't actually remove any sentinels in this commit, but we *do* make sure to use ARRAY_SIZE so the table_size is available when the removal occurs. We add a table_size argument to __register_sysctl_table and adjust callers, all of which pass ctl_table pointers and need an explicit call to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in order to forward the size of the array pointer received from the network register calls. The new table_size argument does not yet have any effect in the init_header call which is still dependent on the sentinel's presence. table_size *does* however drive the `kzalloc` allocation in __register_sysctl_table with no adverse effects as the allocated memory is either one element greater than the calculated ctl_table array (for the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size of the calculated ctl_table array (for the call from sysctl_net.c and register_sysctl). This approach will allows us to "just" remove the sentinel without further changes to __register_sysctl_table as table_size will represent the exact size for all the callers at that point. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
ipc: Remove extra braces Fix coding style. In the previous commit, I added braces because, in addition to changing .data, .extra1 also changed. Now this is not needed. Fixes: 1f5c135ee509 ("ipc: Store ipc sysctls in the ipc namespace") Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/37687827f630bc150210f5b8abeeb00f1336814e.1651584847.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
ipc: Check permissions for checkpoint_restart sysctls at open time As Eric Biederman pointed out, it is possible not to use a custom proc_handler and check permissions for every write, but to use a .permission handler. That will allow the checkpoint_restart sysctls to perform all of their permission checks at open time, and not need any other special code. Link: https://lore.kernel.org/lkml/87czib9g38.fsf@email.froward.int.ebiederm.org/ Fixes: 1f5c135ee509 ("ipc: Store ipc sysctls in the ipc namespace") Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/65fa8459803830608da4610a39f33c76aa933eb9.1651584847.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
ipc: Remove extra1 field abuse to pass ipc namespace Eric Biederman pointed out that using .extra1 to pass ipc namespace looks like an ugly hack and there is a better solution. We can get the ipc_namespace using the .data field. Link: https://lore.kernel.org/lkml/87czib9g38.fsf@email.froward.int.ebiederm.org/ Fixes: 1f5c135ee509 ("ipc: Store ipc sysctls in the ipc namespace") Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/93df64a8fe93ba20ebbe1d9f8eda484b2f325426.1651584847.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
ipc: Use the same namespace to modify and validate In the 1f5c135ee509 ("ipc: Store ipc sysctls in the ipc namespace") I missed that in addition to the modification of sem_ctls[3], the change is validated. This validation must occur in the same namespace. Link: https://lore.kernel.org/lkml/875ymnvryb.fsf@email.froward.int.ebiederm.org/ Fixes: 1f5c135ee509 ("ipc: Store ipc sysctls in the ipc namespace") Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/b3cb9a25cce6becbef77186bc1216071a08a969b.1651584847.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
ipc: Store ipc sysctls in the ipc namespace The ipc sysctls are not available for modification inside the user namespace. Following the mqueue sysctls, we changed the implementation to be more userns friendly. So far, the changes do not provide additional access to files. This will be done in a future patch. Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/be6f9d014276f4dddd0c3aa05a86052856c1c555.1644862280.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL Compilation of ipc/ipc_sysctl.c is controlled by obj-$(CONFIG_SYSVIPC_SYSCTL) [see ipc/Makefile] And CONFIG_SYSVIPC_SYSCTL depends on SYSCTL [see init/Kconfig] An SYSCTL is selected by PROC_SYSCTL. [see fs/proc/Kconfig] Thus: #ifndef CONFIG_PROC_SYSCTL in ipc/ipc_sysctl.c is impossible, the fallback can be removed. Link: https://lkml.kernel.org/r/20210918145337.3369-1-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ipc: check checkpoint_restore_ns_capable() to modify C/R proc files This commit removes the requirement to be root to modify sem_next_id, msg_next_id and shm_next_id and checks checkpoint_restore_ns_capable instead. Since those files are specific to the IPC namespace, there is no reason they should require root privileges. This is similar to ns_last_pid, which also only checks checkpoint_restore_ns_capable. [akpm@linux-foundation.org: ipc/ipc_sysctl.c needs capability.h for checkpoint_restore_ns_capable()] Link: https://lkml.kernel.org/r/20210916163717.3179496-1-mclapinski@google.com Signed-off-by: Michal Clapinski <mclapinski@google.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Manfred Spraul <manfred@colorfullife.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ipc: adjust proc_ipc_sem_dointvec definition to match prototype Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") changed ctl_table.proc_handler to take a kernel pointer. Adjust the signature of proc_ipc_sem_dointvec to match ctl_table.proc_handler which fixes the following sparse error/warning: ipc/ipc_sysctl.c:94:47: warning: incorrect type in argument 3 (different address spaces) ipc/ipc_sysctl.c:94:47: expected void *buffer ipc/ipc_sysctl.c:94:47: got void [noderef] __user *buffer ipc/ipc_sysctl.c:194:35: warning: incorrect type in initializer (incompatible argument 3 (different address spaces)) ipc/ipc_sysctl.c:194:35: expected int ( [usertype] *proc_handler )( ... ) ipc/ipc_sysctl.c:194:35: got int ( * )( ... ) Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Link: https://lkml.kernel.org/r/20200825105846.5193-1-tklauser@distanz.ch Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sysctl: pass kernel pointers to ->proc_handler Instead of having all the sysctl handlers deal with user pointers, which is rather hairy in terms of the BPF interaction, copy the input to and from userspace in common code. This also means that the strings are always NUL-terminated by the common code, making the API a little bit safer. As most handler just pass through the data to one of the common handlers a lot of the changes are mechnical. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
proc/sysctl: add shared variables for range check In the sysctl code the proc_dointvec_minmax() function is often used to validate the user supplied value between an allowed range. This function uses the extra1 and extra2 members from struct ctl_table as minimum and maximum allowed value. On sysctl handler declaration, in every source file there are some readonly variables containing just an integer which address is assigned to the extra1 and extra2 members, so the sysctl range is enforced. The special values 0, 1 and INT_MAX are very often used as range boundary, leading duplication of variables like zero=0, one=1, int_max=INT_MAX in different source files: $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l 248 Add a const int array containing the most commonly used values, some macros to refer more easily to the correct array member, and use them instead of creating a local one for every object file. This is the bloat-o-meter output comparing the old and new binary compiled with the default Fedora config: # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164) Data old new delta sysctl_vals - 12 +12 __kstrtab_sysctl_vals - 12 +12 max 14 10 -4 int_max 16 - -16 one 68 - -68 zero 128 28 -100 Total: Before=20583249, After=20583085, chg -0.00% [mcroce@redhat.com: tipc: remove two unused variables] Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c] [arnd@arndb.de: proc/sysctl: make firmware loader table conditional] Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de [akpm@linux-foundation.org: fix fs/eventpoll.c] Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441 Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation version 2 of the license extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 315 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Armijn Hemel <armijn@tjaldur.nl> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ipc: do cyclic id allocation for the ipc object. For ipcmni_extend mode, the sequence number space is only 7 bits. So the chance of id reuse is relatively high compared with the non-extended mode. To alleviate this id reuse problem, this patch enables cyclic allocation for the index to the radix tree (idx). The disadvantage is that this can cause a slight slow-down of the fast path, as the radix tree could be higher than necessary. To limit the radix tree height, I have chosen the following limits: 1) The cycling is done over in_use*1.5. 2) At least, the cycling is done over "normal" ipcnmi mode: RADIX_TREE_MAP_SIZE elements "ipcmni_extended": 4096 elements Result: - for normal mode: No change for <= 42 active ipc elements. With more than 42 active ipc elements, a 2nd level would be added to the radix tree. Without cyclic allocation, a 2nd level would be added only with more than 63 active elements. - for extended mode: Cycling creates always at least a 2-level radix tree. With more than 2730 active objects, a 3rd level would be added, instead of > 4095 active objects until the 3rd level is added without cyclic allocation. For a 2-level radix tree compared to a 1-level radix tree, I have observed < 1% performance impact. Notes: 1) Normal "x=semget();y=semget();" is unaffected: Then the idx is e.g. a and a+1, regardless if idr_alloc() or idr_alloc_cyclic() is used. 2) The -1% happens in a microbenchmark after this situation: x=semget(); for(i=0;i<4000;i++) {t=semget();semctl(t,0,IPC_RMID);} y=semget(); Now perform semget calls on x and y that do not sleep. 3) The worst-case reuse cycle time is unfortunately unaffected: If you have 2^24-1 ipc objects allocated, and get/remove the last possible element in a loop, then the id is reused after 128 get/remove pairs. Performance check: A microbenchmark that performes no-op semop() randomly on two IDs, with only these two IDs allocated. The IDs were set using /proc/sys/kernel/sem_next_id. The test was run 5 times, averages are shown. 1 & 2: Base (6.22 seconds for 10.000.000 semops) 1 & 40: -0.2% 1 & 3348: - 0.8% 1 & 27348: - 1.6% 1 & 15777204: - 3.2% Or: ~12.6 cpu cycles per additional radix tree level. The cpu is an Intel I3-5010U. ~1300 cpu cycles/syscall is slower than what I remember (spectre impact?). V2 of the patch: - use "min" and "max" - use RADIX_TREE_MAP_SIZE * RADIX_TREE_MAP_SIZE instead of (2<<12). [akpm@linux-foundation.org: fix max() warning] Link: http://lkml.kernel.org/r/20190329204930.21620-3-longman@redhat.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Waiman Long <longman@redhat.com> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ipc: allow boot time extension of IPCMNI from 32k to 16M The maximum number of unique System V IPC identifiers was limited to 32k. That limit should be big enough for most use cases. However, there are some users out there requesting for more, especially those that are migrating from Solaris which uses 24 bits for unique identifiers. To satisfy the need of those users, a new boot time kernel option "ipcmni_extend" is added to extend the IPCMNI value to 16M. This is a 512X increase which should be big enough for users out there that need a large number of unique IPC identifier. The use of this new option will change the pattern of the IPC identifiers returned by functions like shmget(2). An application that depends on such pattern may not work properly. So it should only be used if the users really need more than 32k of unique IPC numbers. This new option does have the side effect of reducing the maximum number of unique sequence numbers from 64k down to 128. So it is a trade-off. The computation of a new IPC id is not done in the performance critical path. So a little bit of additional overhead shouldn't have any real performance impact. Link: http://lkml.kernel.org/r/20190329204930.21620-1-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ipc: IPCMNI limit check for semmni For SysV semaphores, the semmni value is the last part of the 4-element sem number array. To make semmni behave in a similar way to msgmni and shmmni, we can't directly use the _minmax handler. Instead, a special sem specific handler is added to check the last argument to make sure that it is limited to the [0, IPCMNI] range. An error will be returned if this is not the case. Link: http://lkml.kernel.org/r/1536352137-12003-3-git-send-email-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ipc: IPCMNI limit check for msgmni and shmmni Patch series "ipc: IPCMNI limit check for *mni & increase that limit", v9. The sysctl parameters msgmni, shmmni and semmni have an inherent limit of IPC_MNI (32k). However, users may not be aware of that because they can write a value much higher than that without getting any error or notification. Reading the parameters back will show the newly written values which are not real. The real IPCMNI limit is now enforced to make sure that users won't put in an unrealistic value. The first 2 patches enforce the limits. There are also users out there requesting increase in the IPCMNI value. The last 2 patches attempt to do that by using a boot kernel parameter "ipcmni_extend" to increase the IPCMNI limit from 32k to 8M if the users really want the extended value. This patch (of 4): A user can write arbitrary integer values to msgmni and shmmni sysctl parameters without getting error, but the actual limit is really IPCMNI (32k). This can mislead users as they think they can get a value that is not real. The right limits are now set for msgmni and shmmni so that the users will become aware if they set a value outside of the acceptable range. Link: http://lkml.kernel.org/r/1536352137-12003-2-git-send-email-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Luis R. Rodriguez <mcgrof@kernel.org> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Kees Cook <keescook@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ipc/msg: increase MSGMNI, remove scaling SysV can be abused to allocate locked kernel memory. For most systems, a small limit doesn't make sense, see the discussion with regards to SHMMAX. Therefore: increase MSGMNI to the maximum supported. And: If we ignore the risk of locking too much memory, then an automatic scaling of MSGMNI doesn't make sense. Therefore the logic can be removed. The code preserves auto_msgmni to avoid breaking any user space applications that expect that the value exists. Notes: 1) If an administrator must limit the memory allocations, then he can set MSGMNI as necessary. Or he can disable sysv entirely (as e.g. done by Android). 2) MSGMAX and MSGMNB are intentionally not increased, as these values are used to control latency vs. throughput: If MSGMNB is large, then msgsnd() just returns and more messages can be queued before a task switch to a task that calls msgrcv() is forced. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ipc: always handle a new value of auto_msgmni proc_dointvec_minmax() returns zero if a new value has been set. So we don't need to check all charecters have been handled. Below you can find two examples. In the new value has not been handled properly. $ strace ./a.out open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3 write(3, "0\n\0", 3) = 2 close(3) = 0 exit_group(0) $ cat /sys/kernel/debug/tracing/trace $strace ./a.out open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3 write(3, "0\n", 2) = 2 close(3) = 0 $ cat /sys/kernel/debug/tracing/trace a.out-697 [000] .... 3280.998235: unregister_ipcns_notifier <-proc_ipcauto_dointvec_minmax Fixes: 9eefe520c814 ("ipc: do not use a negative value to re-enable msgmni automatic recomputin") Signed-off-by: Andrey Vagin <avagin@openvz.org> Cc: Mathias Krause <minipli@googlemail.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Joe Perches <joe@perches.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ipc: convert use of typedef ctl_table to struct ctl_table This typedef is unnecessary and should just be removed. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>