Skip to content

Conversation

@stornoralle
Copy link

No description provided.

hikalium and others added 30 commits January 12, 2022 14:28
This adds a Virtio based video driver for video streaming device that
operates input and output data buffers to share video devices with
several guests. The current implementation consist of V4L2 based video
driver supporting video functions of decoder and encoder. The device
uses command structures to advertise and negotiate stream formats and
controls. This allows the driver to modify the processing logic of the
device on a per stream basis.

Signed-off-by: Dmitry Sepp <[email protected]>
Signed-off-by: Kiran Pawar <[email protected]>
Signed-off-by: Nikolay Martyanov <[email protected]>
Signed-off-by: Samiullah Khawaja <[email protected]>

(am from https://patchwork.linuxtv.org/patch/61717/)

BUG=b:120456557
TEST=compile with VIRTIO_VIDEO enabled

Fixes:
* Fixed a typo in the commit message: "video_video" -> "virtio_video"
* Fixed SPDX-License-Identifier in /include/uapi/linux/virtio_video.h
* Removed vidioc_enum_fmt_vid_{cap, out}_mplane callbacks that were removed by
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7e98b7b542a456582ea3029be857cc99a3b19bd5

Change-Id: I037b20a9faa1b31bb260cc2a0d57932e1979395e
Signed-off-by: Keiichi Watanabe <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2066459
Reviewed-by: Alexandre Courbot <[email protected]>
[5.4-arcvm: picked from 5.4 for initial merge, resolved conflicts on:
            include/uapi/linux/virtio_ids.h]
Signed-off-by: Hikaru Nishida <[email protected]>
Replace pix_mp->width with pix_mp->height.

BUG=b:151703605
TEST=compile

Change-Id: I75baead1b9679b54990fd1e13267f634078c2df1
Signed-off-by: Keiichi Watanabe <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2107041
Reviewed-by: Alexandre Courbot <[email protected]>
[5.4-arcvm: picked from 5.4 for initial merge]
Add missing NULL check in init_ctrls callbacks for decoder and encoder.

BUG=b:151703605
TEST=compile

Change-Id: I2789ef2a4eae1c140f7d28a0228dd1c29ad36eee
Signed-off-by: Keiichi Watanabe <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2107042
Reviewed-by: Alexandre Courbot <[email protected]>
[5.4-arcvm: picked from 5.4 for initial merge]
…ontaining frame sizes in stream

Some coded format streams like VP8 and H.264 contain metadata.
For such formats, try_fmt is called with width=0 and height=0.
This CL makes try_fmt callback support this case.

BUG=b:151703605
TEST=compile

Change-Id: I7a950f52c0ffdba0aba9febdc10f11e640d347ed
Signed-off-by: Keiichi Watanabe <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2107043
Reviewed-by: Alexandre Courbot <[email protected]>
[5.4-arcvm: picked from 5.4 for initial merge]
Support V4L2_SEL_TGT_COMPOSE as a selection target in the g_selection
callback.

BUG=b:151703605
TEST=compile

Change-Id: Ibc156fc2c8eeda4ddb1c45d8bd65b03a0d160df8
Signed-off-by: Keiichi Watanabe <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2107044
Reviewed-by: Alexandre Courbot <[email protected]>
[5.4-arcvm: picked from 5.4 for initial merge]
…r with EOS flag

The host can return an empty CAPTURE buffer to notify EOS or an error.
In such case, the driver must be mark the buffer as "done" with byteused=0.

BUG=b:151703605
TEST=compile

Change-Id: I78700fb33a8797b42a7c522f368c41b693fb81f9
Signed-off-by: Keiichi Watanabe <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2107045
Reviewed-by: Alexandre Courbot <[email protected]>
[5.4-arcvm: picked from 5.4 for initial merge]
…deo driver

This patch makes the virtio-video driver use virtio objects as DMA buffers.
So, users will beable to import resources exported by other virtio devices
such as virtio-gpu.

Currently, we assumes that only one virtio object for each v4l2_buffer
format even if it's for a multiplanar format.

Signed-off-by: Keiichi Watanabe <[email protected]>
(am from https://patchwork.kernel.org/patch/11457439/)

BUG=b:120456557
TEST=run simple decoding test

Change-Id: I6cd47add6e760fa3b3b8bf08e3a3d93cc97243a8
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2060511
Reviewed-by: Keiichi Watanabe <[email protected]>
Reviewed-by: Sean Paul <[email protected]>
Tested-by: Keiichi Watanabe <[email protected]>
Commit-Queue: Keiichi Watanabe <[email protected]>
[5.4-arcvm: picked from 5.4 for initial merge]

Change-Id: Ie92e7cedca5da5553162cfd7fc27947fb7ae8fb5
…iver

0-day reports:

drivers/media/virtio/virtio_video_driver.c:103:5-11: ERROR:
	allocation function on line 102 returns NULL not ERR_PTR on failure

The various basic memory allocation functions don't return ERR_PTR.

BUG=b:120456557, b:151703605
TEST=compile with VIRTIO_VIDEO enabled

Change-Id: Ia556120dd001a4f395bcfb66a29f5f3c81ee8dd0
Fixes: bbca485 ("BACKPORT: FROMLIST: virtio_video: Add the Virtio Video V4L2 driver")
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2138634
Reviewed-by: Alexandre Courbot <[email protected]>
Reviewed-by: Keiichi Watanabe <[email protected]>
[5.4-arcvm: picked from 5.4 for initial merge]
This invalidates a uuid associated with a vb2_buffer object when the
uuid is used for another vb2_buffer.

This patch will fix a problem that happens in scenarios like this:

(1) QBUF(index=0, DMABUF uuid=1)
(2) DQBUF(index=0)
(3) QBUF(index=1, DMABUF uuid=1)
(4) DQBUF(index=1)
(5) QBUF(index=0, DMABUF uuid=1)

Note that index is associated with a vb2_buffer and a resource_id here.

In this scenario, resource_create needs to be sent to register of a new
combination of resource_id and a uuid for (1), (3) and (5).

However, in the previous implementation, resource_create wasn't sent at (5)
because the combination (index=0, uuid=1) had been registered for (1) and
not been invalidated.

With this patch, a uuid in a vbb2_buffer object will be invalidated when
the uuid is tied with another vb2_buffer.
In this scenario, the combination registered at (1) will be invalidated
at (3).

Change-Id: Ie2eb7f50980cc2258540f692331a16aa11b5c334

---

This patch fixes a bug introduced by CL:2060511, which was submitted to
LKML but not approved. So, this change should be included in the next
revision of CL:2060511.

BUG=b:151703605
TEST=compile

Change-Id: I441bb15d4c5da834a189f8709a2b95246cea36fb
Signed-off-by: Keiichi Watanabe <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2120830
Reviewed-by: Alexandre Courbot <[email protected]>
[5.4-arcvm: picked from 5.4 for initial merge]
When the virtio-video driver sends RESOURCE_DESTROY_ALL command, it must
wait for the host returning a response for the command.

BUG=b:151703605
TEST=compile

Change-Id: I7eeea5d3ad6beab172f9590611fe558b2e5035fb
Signed-off-by: Keiichi Watanabe <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2060510
[5.4-arcvm: picked from 5.4 for initial merge]
…iver

0-day reports:

drivers/media/virtio/virtio_video_dec.c:156:5: warning:
	no previous prototype for 'virtio_video_dec_init_ctrls'
drivers/media/virtio/virtio_video_dec.c:177:5: warning:
	no previous prototype for 'virtio_video_dec_init_queues'
drivers/media/virtio/virtio_video_dec.c:335:5: warning:
	no previous prototype for 'virtio_video_dec_enum_fmt_vid_out'
drivers/media/virtio/virtio_video_dec.c:417:5: warning:
	no previous prototype for 'virtio_video_dec_init'
drivers/media/virtio/virtio_video_dec.c: In function 'virtio_video_dec_init':
drivers/media/virtio/virtio_video_dec.c:419:10: warning:
	variable 'num' set but not used

drivers/media/virtio/virtio_video_enc.c:162:5: warning:
	no previous prototype for 'virtio_video_enc_init_ctrls'
drivers/media/virtio/virtio_video_enc.c:223:5: warning:
	no previous prototype for 'virtio_video_enc_init_queues'
drivers/media/virtio/virtio_video_enc.c:559:5: warning:
	no previous prototype for 'virtio_video_enc_init'
drivers/media/virtio/virtio_video_enc.c: In function 'virtio_video_enc_init':
drivers/media/virtio/virtio_video_enc.c:561:10: warning:
	variable 'num' set but not used

BUG=b:120456557, b:151703605
TEST=compile with VIRTIO_VIDEO enabled

Change-Id: Ic0e9f9b5efc54b627cd9024e90b3559a731ab1cc
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2214405
Reviewed-by: Keiichi Watanabe <[email protected]>
Tested-by: Keiichi Watanabe <[email protected]>
Commit-Queue: Keiichi Watanabe <[email protected]>
[5.4-arcvm: picked from 5.4 for initial merge]
…iver

0-day reports:

drivers/media/virtio/virtio_video_device.c:1188:6: error:
	no previous prototype for 'virtio_video_device_destroy'
drivers/media/virtio/virtio_video_helpers.c:188:10: error:
	no previous prototype for 'virtio_video_get_format_from_virtio_profile'

Somewhat guessing here, virtio_video_device_destroy() is only called
locally and thus marked static.
virtio_video_get_format_from_virtio_profile() is not called at all and
presumably intended to be used as helper function. Removed this function,
as the conversion from a profile to a format doesn't make much sense.
If we need a V4L2 format, we should virtio_video_format_to_v4l2 instead.

BUG=b:120456557, b:151703605
TEST=compile with VIRTIO_VIDEO enabled

Change-Id: I02d577f8e9b1a65cbbe9877391f8f755a474fc78
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Keiichi Watanabe <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2226628
[5.4-arcvm: picked from 5.4 for initial merge]
BUG=b:151703605
TEST=tast run ARCVM-DUT arc.VideoDecodeAccel*vm

Change-Id: I8c699abfef01c5063494546031496afa806274ea
Signed-off-by: David Stevens <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2253500
Reviewed-by: Tomasz Figa <[email protected]>
Reviewed-by: Keiichi Watanabe <[email protected]>
Reviewed-by: Alexandre Courbot <[email protected]>
[5.4-arcvm: picked from 5.4 for initial merge]
After calling V4L2_DEC_CMD_START, the stream state should be updated
from STOPPED to RUNNING.

BUG=b:168557465
BUG=b:151703605
TEST=pass android.media.cts.AdaptivePlaybackTest#testVP8_eosFlushSeek

Change-Id: I849a5d92c692bf675c54d443c7157da86877db50
Signed-off-by: Chih-Yu Huang <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2410146
Reviewed-by: Keiichi Watanabe <[email protected]>
Reviewed-by: Alexandre Courbot <[email protected]>
Commit-Queue: Chih-Yu Huang <[email protected]>
Tested-by: Chih-Yu Huang <[email protected]>
[5.4-arcvm: picked from 5.4 for initial merge]
0-day reports:

virtio_video_device.c:(.text+0xc00): undefined reference to `virtio_dma_buf_get_uuid'

VIRTIO_VIDEO now unconditionally requires VIRTIO_DMA_SHARED_BUFFER.
Since VIRTIO_DMA_SHARED_BUFFER depends on VIRTIO_MENU, which is
independent of VIRTIO, VIRTIO_VIDEO now unconditionally depends on
VIRTIO_MENU as well.

This still leaves

WARNING: unmet direct dependencies detected for VIRTIO_DMA_SHARED_BUFFER
  Depends on [n]: VIRTIO_MENU [=n] && DMA_SHARED_BUFFER [=y]
  Selected by [m]:
  - DRM_VIRTIO_GPU [=m] && HAS_IOMEM [=y] && DRM [=m] && VIRTIO [=y] && MMU [=y] && PCI [=y]

which is seen if VIRTIO_MENU is disabled. This problem was introduced by
commit 7c5e019 ("BACKPORT: FROMGIT: virtio: fix build for configs
without dma-bufs"). Since that commit, DRM_VIRTIO_GPU depends on
VIRTIO_MENU as well. Add that dependency.

BUG=b:142423916
TEST=Test builds

Change-Id: Ifa87ff51b39bf39feffc28e65c3c346a324415c9
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2410825
Reviewed-by: David Stevens <[email protected]>
[5.4-arcvm: picked from 5.4 for initial merge]
BUG=b:151703605
TEST=manual - Run ARCVM

Signed-off-by: Lepton Wu <[email protected]>
Change-Id: Ife9d5a8c35e6496a523e15ae95c585cce4949c84
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2477075
Reviewed-by: Keiichi Watanabe <[email protected]>
Reviewed-by: Alexandre Courbot <[email protected]>
Tested-by: Keiichi Watanabe <[email protected]>
Commit-Queue: Chih-Yu Huang <[email protected]>
[5.4-arcvm: picked from 5.4 for initial merge]
When possible, avoid calling dma_buf_get. If dma_buf_get is called, make
sure dma_buf_put is also called.

BUG=b:151703605, b:170702290, b:167992701
TEST=No leaks in /sys/kernel/debug/dma_buf/bufinfo after decoding

Change-Id: I1b39f8dd2599c625cb6396bd6b5914147976f564
Signed-off-by: David Stevens <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2483785
Reviewed-by: Keiichi Watanabe <[email protected]>
Reviewed-by: Alexandre Courbot <[email protected]>
[5.4-arcvm: picked from 5.4 for initial merge]
BUG=b:151703605
TEST=manual - Run ARCVM

Signed-off-by: Lepton Wu <[email protected]>
Change-Id: I6621bd04d76efc920348eeda0393e1efd0b79c7f
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2491443
Commit-Queue: Keiichi Watanabe <[email protected]>
Commit-Queue: Alexandre Courbot <[email protected]>
Reviewed-by: Keiichi Watanabe <[email protected]>
Reviewed-by: Alexandre Courbot <[email protected]>
[5.4-arcvm: picked from 5.4 for initial merge]
This CL introduces the VIRTIO_VIDEO_CONTROL_FORCE_KEYFRAME control to
the virtio encoder, to support the V4L2_CID_MPEG_VIDEO_FORCE_KEY_FRAME
control.

Signed-off-by: David Staessens <[email protected]>

BUG=b:161498590,b:174444769,b:175270403
TEST=tast run DUT arc.VideoEncodeAccel.h264_192p_i420_vm

[5.4-arcvm: picked from 5.4]

Change-Id: I4b7057dcf4595c91b26bb2c70b4b3a584f9baa13
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2594594
Reviewed-by: Alexandre Courbot <[email protected]>
Tested-by: David Staessens <[email protected]>
Commit-Queue: David Staessens <[email protected]>
When stopping a stream, set its state to drain before sending the drain
command to prevent a race where the drain command callback is processed
before the state gets set to drain.

BUG=b:151703605
TEST=No flakes in AdaptivePlaybackTest#testVP9_adaptiveEosFlushSeek

[5.4-arcvm: picked from 5.4]
Change-Id: I680b60d59ac0a7619e7acb2066419d1a2128efd4
Signed-off-by: David Stevens <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2592625
Reviewed-by: Hikaru Nishida <[email protected]>
Treat VIRTIO_VIDEO_EVENT_ERROR as an unrecoverable errors for the
respective stream.

The spec isn't explicit as to whether or not the error event is fatal.
However, since the error event is generic, the cause is opaque to the
driver, so the driver can't meaningfully do anything to recover in the
case of a non-fatal error. Therefore it doesn't make sense for the
device to send non-fatal error events in the first place.

BUG=b:151703605, b:174445948
TEST=android.security.cts.StagefrightTest#testStagefright_bug_33818508

[5.4-arcvm: picked from 5.4]
Change-Id: I4d9eee47ff3e42058768c0e6e70c69518e8abe20
Signed-off-by: David Stevens <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2594587
Reviewed-by: Hikaru Nishida <[email protected]>
…O_BITRATE.

Upon initialization the encode device requests the bitrate from the
underlying encoder and uses that as both the default and maximum value.
This prevents us from configuring bitrates higher than the initial
value, so this CL changes the maximum value to 1GBs instead.

BUG=b:162799179,b:177213709
TEST=tast run DUT arc.VideoEncodeAccel.h264_192p_i420_vm

[5.4-arcvm: picked from 5.4]
(cherry picked from commit 675b572)

Signed-off-by: David Staessens <[email protected]>
Change-Id: Ic32f5de210f93d2d2789be24602aee2c0a736a16
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2627240
Reviewed-by: Hikaru Nishida <[email protected]>
Tested-by: David Staessens <[email protected]>
Commit-Queue: David Staessens <[email protected]>
Remove streams from the id map and wait for any ongoing events to
complete before cleanup begins.

BUG=b:177697115, b:151703605
TEST=android.security.cts.StagefrightTest

Change-Id: I141b3844ffb507bfda0ea0d94c9eacb8173781a9
Signed-off-by: David Stevens <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2684016
Reviewed-by: Keiichi Watanabe <[email protected]>
Reviewed-by: Alexandre Courbot <[email protected]>
BUG=b:177697115, b:151703605
TEST=android.security.cts.StagefrightTest

Change-Id: Ib83251555bc0af40cb015fd08c6a9335c3e3f4a5
Signed-off-by: David Stevens <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2683951
Reviewed-by: Alexandre Courbot <[email protected]>
Reviewed-by: Keiichi Watanabe <[email protected]>
If a buffer's data offset changes, create a new object for the buffer.
In particular, this can happen when data offset is used to skip headers
at the start of input buffers.

BUG=b:174531173,b:151703605
TEST=android.media.cts.MediaDrmClearkeyTest#testClearKeyPlaybackMpeg2ts

Change-Id: I538a0bd1667a9e4b17d77f8e4707dda9700c824b
Signed-off-by: David Stevens <[email protected]>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2696229
Reviewed-by: Chih-Yu Huang <[email protected]>
Reviewed-by: Keiichi Watanabe <[email protected]>
This CL removes the unused virtio_video_mark_drain_complete function
from the virtio video device. All steps required to finish draining are
already done in the virtio_video_buf_done function.

Signed-off-by: David Staessens <[email protected]>

BUG=b:151703605
TEST=emerge-$BOARD sys-kernel/arcvm-kernel-ack-5_4

Change-Id: Ibcb59624b4d5e4d016a7a3627db755c92a5a2d67
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2879433
Reviewed-by: Keiichi Watanabe <[email protected]>
Tested-by: David Staessens <[email protected]>
Commit-Queue: David Staessens <[email protected]>
When draining the V4L2 encode device or handling a drain done callback,
the stream is not properly locked. In some cases this causes the
virtio_video_buf_done (EOS done) and virtio_video_encoder_cmd
(restart encoder) functions to be executed before the original
virtio_video_encoder_cmd call to drain the encoder is even finished
executing. This leads to the stream ending up in the draining state
after it has already finished draining.

Extra mutex lock and unlock operations have been introduced here to
serialize access to the virtio_video_encoder_cmd and
virtio_video_buf_done functions to avoid the above issue.

Signed-off-by: David Staessens <[email protected]>

BUG=b:186588566,b:151703605,b:187470003
TEST=android.mediav2.cts.EncoderProfileLevelTest

Change-Id: I35d36b31549a9883eaff1362d8b47653222f5613
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2859568
Reviewed-by: Keiichi Watanabe <[email protected]>
Reviewed-by: David Stevens <[email protected]>
Reviewed-by: Alexandre Courbot <[email protected]>
Commit-Queue: David Staessens <[email protected]>
Tested-by: David Staessens <[email protected]>
This reverts commit 3f5e7b5cba1430eef2a7d22236de4df678afb804.

Reason for revert: It caused GTS regression: b/188009828 

Original change's description:
> CHROMIUM: virtio_video_enc: Lock stream mutex when draining.
>
> When draining the V4L2 encode device or handling a drain done callback,
> the stream is not properly locked. In some cases this causes the
> virtio_video_buf_done (EOS done) and virtio_video_encoder_cmd
> (restart encoder) functions to be executed before the original
> virtio_video_encoder_cmd call to drain the encoder is even finished
> executing. This leads to the stream ending up in the draining state
> after it has already finished draining.
>
> Extra mutex lock and unlock operations have been introduced here to
> serialize access to the virtio_video_encoder_cmd and
> virtio_video_buf_done functions to avoid the above issue.
>
> Signed-off-by: David Staessens <[email protected]>
>
> BUG=b:186588566,b:151703605,b:187470003
> TEST=android.mediav2.cts.EncoderProfileLevelTest
>
> Change-Id: I35d36b31549a9883eaff1362d8b47653222f5613
> Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2859568
> Reviewed-by: Keiichi Watanabe <[email protected]>
> Reviewed-by: David Stevens <[email protected]>
> Reviewed-by: Alexandre Courbot <[email protected]>
> Commit-Queue: David Staessens <[email protected]>
> Tested-by: David Staessens <[email protected]>

Bug: b:186588566
Bug: b:151703605
Bug: b:187470003
Bug: b:188009828
Change-Id: Id1053f1ea2f19ae44ad9d78855687c12e7fa8707
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2896200
Reviewed-by: Chih-Yu Huang <[email protected]>
Tested-by: Chih-Yu Huang <[email protected]>
Owners-Override: Chih-Yu Huang <[email protected]>
Auto-Submit: Chih-Yu Huang <[email protected]>
Bot-Commit: Rubber Stamper <[email protected]>
Commit-Queue: Chih-Yu Huang <[email protected]>
This CL relands crrev.com/c/2859568 which was reverted in
crrev.com/c/2896200 because of b/188009828.

When draining the V4L2 encode device or handling a drain done callback,
the stream was not properly locked leading to b/186588566. In some
cases this caused virtio_video_buf_done() to be executed before the
V4L2_ENC_CMD_STOP command was fully executed.

Adding locks unfortunately seems to have caused another issue. This CL
reduces the scope of the lock so draining is properly synced, while at
the same time avoiding holding the lock during
virtio_video_queue_eos_event() and v4l2_m2m_buf_done() which might
cause issues.

Original change's description:
> CHROMIUM: virtio_video_enc: Lock stream mutex when draining.
>
> When draining the V4L2 encode device or handling a drain done callback,
> the stream is not properly locked. In some cases this causes the
> virtio_video_buf_done (EOS done) and virtio_video_encoder_cmd
> (restart encoder) functions to be executed before the original
> virtio_video_encoder_cmd call to drain the encoder is even finished
> executing. This leads to the stream ending up in the draining state
> after it has already finished draining.
>
> Extra mutex lock and unlock operations have been introduced here to
> serialize access to the virtio_video_encoder_cmd and
> virtio_video_buf_done functions to avoid the above issue.
>
> Signed-off-by: David Staessens <[email protected]>
>
> BUG=b:186588566,b:151703605,b:187470003
> TEST=android.mediav2.cts.EncoderProfileLevelTest
>
> Change-Id: I35d36b31549a9883eaff1362d8b47653222f5613
> Reviewed-on: http://crrev.com/c/2859568
> Reviewed-by: Keiichi Watanabe <[email protected]>
> Reviewed-by: David Stevens <[email protected]>
> Reviewed-by: Alexandre Courbot <[email protected]>
> Commit-Queue: David Staessens <[email protected]>
> Tested-by: David Staessens <[email protected]>

Signed-off-by: David Staessens <[email protected]>

BUG=b:188009828,b:186588566,b:151703605,b:187470003
TEST=android.mediav2.cts.EncoderProfileLevelTest
     com.google.android.exoplayer.gts.DashTest#testH264Adaptive

Change-Id: I4b7948f24e0634b857405bf84ee08cabbdbcf3cd
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2900138
Reviewed-by: Alexandre Courbot <[email protected]>
Commit-Queue: David Staessens <[email protected]>
Tested-by: David Staessens <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
commit c652887 ("KVM: arm64: vgic-v3: Allow userspace to write
GICD_TYPER2.nASSGIcap") makes the allocation of vPEs depend on nASSGIcap
for GICv4.1 hosts. While the vGIC v4 initialization and teardown is
handled correctly, it erroneously attempts to establish a vLPI mapping
to a VM that has no vPEs allocated:

  Unable to handle kernel NULL pointer dereference at virtual address 00000000000000a8
   Mem abort info:
     ESR = 0x0000000096000044
     EC = 0x25: DABT (current EL), IL = 32 bits
     SET = 0, FnV = 0
     EA = 0, S1PTW = 0
     FSC = 0x04: level 0 translation fault
   Data abort info:
     ISV = 0, ISS = 0x00000044, ISS2 = 0x00000000
     CM = 0, WnR = 1, TnD = 0, TagAccess = 0
     GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
   user pgtable: 4k pages, 48-bit VAs, pgdp=00000073a453b000
   [00000000000000a8] pgd=0000000000000000, p4d=0000000000000000
   Internal error: Oops: 0000000096000044 [#1] SMP
   pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
   pc : its_irq_set_vcpu_affinity+0x58c/0x95c
   lr : its_irq_set_vcpu_affinity+0x1e0/0x95c
   sp : ffff8001029bb9e0
   pmr_save: 00000060
   x29: ffff8001029bba20 x28: ffff0001ca5e28c0 x27: 0000000000000000
   x26: 0000000000000000 x25: ffff00019eee9f80 x24: ffff0001992b3f00
   x23: ffff8001029bbab8 x22: ffff00001159fb80 x21: 00000000000024a7
   x20: 00000000000024a7 x19: ffff00019eee9fb4 x18: 0000000000000494
   x17: 000000000000000e x16: 0000000000000494 x15: 0000000000000002
   x14: ffff0001a7f34600 x13: ffffccaad1203000 x12: 0000000000000018
   x11: ffff000011991000 x10: 0000000000000000 x9 : 00000000000000a2
   x8 : 00000000000020a8 x7 : 0000000000000000 x6 : 000000000000003f
   x5 : 0000000000000040 x4 : 0000000000000000 x3 : 0000000000000004
   x2 : 0000000000000000 x1 : ffff8001029bbab8 x0 : 00000000000000a8
   Call trace:
    its_irq_set_vcpu_affinity+0x58c/0x95c
    irq_set_vcpu_affinity+0x74/0xc8
    its_map_vlpi+0x4c/0x94
    kvm_vgic_v4_set_forwarding+0x134/0x298
    kvm_arch_irq_bypass_add_producer+0x28/0x34
    irq_bypass_register_producer+0xf8/0x1d8
    vfio_msi_set_vector_signal+0x2c8/0x308
    vfio_pci_set_msi_trigger+0x198/0x2d4
    vfio_pci_set_irqs_ioctl+0xf0/0x104
    vfio_pci_core_ioctl+0x6ac/0xc5c
    vfio_device_fops_unl_ioctl+0x128/0x370
    __arm64_sys_ioctl+0x98/0xd0
    el0_svc_common+0xd8/0x1d8
    do_el0_svc+0x28/0x34
    el0_svc+0x40/0xb8
    el0t_64_sync_handler+0x70/0xbc
    el0t_64_sync+0x1a8/0x1ac
   Code: 321f0129 f940094a 8b08014 d1400900 (39000009)
   ---[ end trace 0000000000000000 ]---

Fix it by moving the GICv4.1 special-casing to
vgic_supports_direct_msis(), returning false if the user explicitly
disabled nASSGIcap for the VM.

Fixes: c652887 ("KVM: arm64: vgic-v3: Allow userspace to write GICD_TYPER2.nASSGIcap")
Suggested-by: Oliver Upton <[email protected]>
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
In gicv5_irs_of_init_affinity() a WARN_ON() is triggered if:

 1) a phandle in the "cpus" property does not correspond to a valid OF
    node
 2  a CPU logical id does not exist for a given OF cpu_node

#1 is a firmware bug and should be reported as such but does not warrant a
   WARN_ON() backtrace.

#2 is not necessarily an error condition (eg a kernel can be booted with
   nr_cpus=X limiting the number of cores artificially) and therefore there
   is no reason to clutter the kernel log with WARN_ON() output when the
   condition is hit.

Rework the IRS affinity parsing code to remove undue WARN_ON()s thus
making it less noisy.

Signed-off-by: Lorenzo Pieralisi <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
When there are memory-only nodes (nodes without CPUs), these nodes are not
properly initialized, causing kernel panic during boot.

of_numa_init
	of_numa_parse_cpu_nodes
		node_set(nid, numa_nodes_parsed);
	of_numa_parse_memory_nodes

In of_numa_parse_cpu_nodes, numa_nodes_parsed gets updated only for nodes
containing CPUs.  Memory-only nodes should have been updated in
of_numa_parse_memory_nodes, but they weren't.

Subsequently, when free_area_init() attempts to access NODE_DATA() for
these uninitialized memory nodes, the kernel panics due to NULL pointer
dereference.

This can be reproduced on ARM64 QEMU with 1 CPU and 2 memory nodes:

qemu-system-aarch64 \
-cpu host -nographic \
-m 4G -smp 1 \
-machine virt,accel=kvm,gic-version=3,iommu=smmuv3 \
-object memory-backend-ram,size=2G,id=mem0 \
-object memory-backend-ram,size=2G,id=mem1 \
-numa node,nodeid=0,memdev=mem0 \
-numa node,nodeid=1,memdev=mem1 \
-kernel $IMAGE \
-hda $DISK \
-append "console=ttyAMA0 root=/dev/vda rw earlycon"

[    0.000000] Booting Linux on physical CPU 0x0000000000 [0x481fd010]
[    0.000000] Linux version 6.17.0-rc1-00001-gabb4b3daf18c-dirty (yintirui@local) (gcc (GCC) 12.3.1, GNU ld (GNU Binutils) 2.41) #52 SMP PREEMPT Mon Aug 18 09:49:40 CST 2025
[    0.000000] KASLR enabled
[    0.000000] random: crng init done
[    0.000000] Machine model: linux,dummy-virt
[    0.000000] efi: UEFI not found.
[    0.000000] earlycon: pl11 at MMIO 0x0000000009000000 (options '')
[    0.000000] printk: legacy bootconsole [pl11] enabled
[    0.000000] OF: reserved mem: Reserved memory: No reserved-memory node in the DT
[    0.000000] NODE_DATA(0) allocated [mem 0xbfffd9c0-0xbfffffff]
[    0.000000] node 1 must be removed before remove section 23
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000040000000-0x00000000ffffffff]
[    0.000000]   DMA32    empty
[    0.000000]   Normal   [mem 0x0000000100000000-0x000000013fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000040000000-0x00000000bfffffff]
[    0.000000]   node   1: [mem 0x00000000c0000000-0x000000013fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000040000000-0x00000000bfffffff]
[    0.000000] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000a0
[    0.000000] Mem abort info:
[    0.000000]   ESR = 0x0000000096000004
[    0.000000]   EC = 0x25: DABT (current EL), IL = 32 bits
[    0.000000]   SET = 0, FnV = 0
[    0.000000]   EA = 0, S1PTW = 0
[    0.000000]   FSC = 0x04: level 0 translation fault
[    0.000000] Data abort info:
[    0.000000]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[    0.000000]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[    0.000000]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[    0.000000] [00000000000000a0] user address but active_mm is swapper
[    0.000000] Internal error: Oops: 0000000096000004 [#1]  SMP
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.17.0-rc1-00001-g760c6dabf762-dirty torvalds#54 PREEMPT
[    0.000000] Hardware name: linux,dummy-virt (DT)
[    0.000000] pstate: 800000c5 (Nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    0.000000] pc : free_area_init+0x50c/0xf9c
[    0.000000] lr : free_area_init+0x5c0/0xf9c
[    0.000000] sp : ffffa02ca0f33c00
[    0.000000] x29: ffffa02ca0f33cb0 x28: 0000000000000000 x27: 0000000000000000
[    0.000000] x26: 4ec4ec4ec4ec4ec5 x25: 00000000000c0000 x24: 00000000000c0000
[    0.000000] x23: 0000000000040000 x22: 0000000000000000 x21: ffffa02ca0f3b368
[    0.000000] x20: ffffa02ca14c7b98 x19: 0000000000000000 x18: 0000000000000002
[    0.000000] x17: 000000000000cacc x16: 0000000000000001 x15: 0000000000000001
[    0.000000] x14: 0000000080000000 x13: 0000000000000018 x12: 0000000000000002
[    0.000000] x11: ffffa02ca0fd4f00 x10: ffffa02ca14bab20 x9 : ffffa02ca14bab38
[    0.000000] x8 : 00000000000c0000 x7 : 0000000000000001 x6 : 0000000000000002
[    0.000000] x5 : 0000000140000000 x4 : ffffa02ca0f33c90 x3 : ffffa02ca0f33ca0
[    0.000000] x2 : ffffa02ca0f33c98 x1 : 0000000080000000 x0 : 0000000000000001
[    0.000000] Call trace:
[    0.000000]  free_area_init+0x50c/0xf9c (P)
[    0.000000]  bootmem_init+0x110/0x1dc
[    0.000000]  setup_arch+0x278/0x60c
[    0.000000]  start_kernel+0x70/0x748
[    0.000000]  __primary_switched+0x88/0x90
[    0.000000] Code: d503201f b98093e0 52800016 f8607a93 (f9405260)
[    0.000000] ---[ end trace 0000000000000000 ]---
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 7675076 ("arch_numa: switch over to numa_memblks")
Signed-off-by: Yin Tirui <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Kefeng Wang <[email protected]>
Cc: Chen Jun <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Joanthan Cameron <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Saravana Kannan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
While working on the lazy MMU mode enablement for s390 I hit pretty
curious issues in the kasan code.

The first is related to a custom kasan-based sanitizer aimed at catching
invalid accesses to PTEs and is inspired by [1] conversation.  The kasan
complains on valid PTE accesses, while the shadow memory is reported as
unpoisoned:

[  102.783993] ==================================================================
[  102.784008] BUG: KASAN: out-of-bounds in set_pte_range+0x36c/0x390
[  102.784016] Read of size 8 at addr 0000780084cf9608 by task vmalloc_test/0/5542
[  102.784019] 
[  102.784040] CPU: 1 UID: 0 PID: 5542 Comm: vmalloc_test/0 Kdump: loaded Tainted: G           OE       6.16.0-gcc-ipte-kasan-11657-gb2d930c4950e torvalds#340 PREEMPT 
[  102.784047] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[  102.784049] Hardware name: IBM 8561 T01 703 (LPAR)
[  102.784052] Call Trace:
[  102.784054]  [<00007fffe0147ac0>] dump_stack_lvl+0xe8/0x140 
[  102.784059]  [<00007fffe0112484>] print_address_description.constprop.0+0x34/0x2d0 
[  102.784066]  [<00007fffe011282c>] print_report+0x10c/0x1f8 
[  102.784071]  [<00007fffe090785a>] kasan_report+0xfa/0x220 
[  102.784078]  [<00007fffe01d3dec>] set_pte_range+0x36c/0x390 
[  102.784083]  [<00007fffe01d41c2>] leave_ipte_batch+0x3b2/0xb10 
[  102.784088]  [<00007fffe07d3650>] apply_to_pte_range+0x2f0/0x4e0 
[  102.784094]  [<00007fffe07e62e4>] apply_to_pmd_range+0x194/0x3e0 
[  102.784099]  [<00007fffe07e820e>] __apply_to_page_range+0x2fe/0x7a0 
[  102.784104]  [<00007fffe07e86d8>] apply_to_page_range+0x28/0x40 
[  102.784109]  [<00007fffe090a3ec>] __kasan_populate_vmalloc+0xec/0x310 
[  102.784114]  [<00007fffe090aa36>] kasan_populate_vmalloc+0x96/0x130 
[  102.784118]  [<00007fffe0833a04>] alloc_vmap_area+0x3d4/0xf30 
[  102.784123]  [<00007fffe083a8ba>] __get_vm_area_node+0x1aa/0x4c0 
[  102.784127]  [<00007fffe083c4f6>] __vmalloc_node_range_noprof+0x126/0x4e0 
[  102.784131]  [<00007fffe083c980>] __vmalloc_node_noprof+0xd0/0x110 
[  102.784135]  [<00007fffe083ca32>] vmalloc_noprof+0x32/0x40 
[  102.784139]  [<00007fff608aa336>] fix_size_alloc_test+0x66/0x150 [test_vmalloc] 
[  102.784147]  [<00007fff608aa710>] test_func+0x2f0/0x430 [test_vmalloc] 
[  102.784153]  [<00007fffe02841f8>] kthread+0x3f8/0x7a0 
[  102.784159]  [<00007fffe014d8b4>] __ret_from_fork+0xd4/0x7d0 
[  102.784164]  [<00007fffe299c00a>] ret_from_fork+0xa/0x30 
[  102.784173] no locks held by vmalloc_test/0/5542.
[  102.784176] 
[  102.784178] The buggy address belongs to the physical page:
[  102.784186] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x84cf9
[  102.784198] flags: 0x3ffff00000000000(node=0|zone=1|lastcpupid=0x1ffff)
[  102.784212] page_type: f2(table)
[  102.784225] raw: 3ffff00000000000 0000000000000000 0000000000000122 0000000000000000
[  102.784234] raw: 0000000000000000 0000000000000000 f200000000000001 0000000000000000
[  102.784248] page dumped because: kasan: bad access detected
[  102.784250] 
[  102.784252] Memory state around the buggy address:
[  102.784260]  0000780084cf9500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  102.784274]  0000780084cf9580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  102.784277] >0000780084cf9600: fd 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  102.784290]                          ^
[  102.784293]  0000780084cf9680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  102.784303]  0000780084cf9700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  102.784306] ==================================================================

The second issue hits when the custom sanitizer above is not implemented,
but the kasan itself is still active:

[ 1554.438028] Unable to handle kernel pointer dereference in virtual kernel address space
[ 1554.438065] Failing address: 001c0ff0066f0000 TEID: 001c0ff0066f0403
[ 1554.438076] Fault in home space mode while using kernel ASCE.
[ 1554.438103] AS:00000000059d400b R2:0000000ffec5c00b R3:00000000c6c9c007 S:0000000314470001 P:00000000d0ab413d 
[ 1554.438158] Oops: 0011 ilc:2 [#1]SMP 
[ 1554.438175] Modules linked in: test_vmalloc(E+) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) sunrpc(E) pkey_pckmo(E) uvdevice(E) s390_trng(E) rng_core(E) eadm_sch(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) sch_fq_codel(E) drm(E) loop(E) i2c_core(E) drm_panel_orientation_quirks(E) nfnetlink(E) ctcm(E) fsm(E) zfcp(E) scsi_transport_fc(E) diag288_wdt(E) watchdog(E) ghash_s390(E) prng(E) aes_s390(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha512_s390(E) sha1_s390(E) sha_common(E) pkey(E) autofs4(E)
[ 1554.438319] Unloaded tainted modules: pkey_uv(E):1 hmac_s390(E):2
[ 1554.438354] CPU: 1 UID: 0 PID: 1715 Comm: vmalloc_test/0 Kdump: loaded Tainted: G            E       6.16.0-gcc-ipte-kasan-11657-gb2d930c4950e torvalds#350 PREEMPT 
[ 1554.438368] Tainted: [E]=UNSIGNED_MODULE
[ 1554.438374] Hardware name: IBM 8561 T01 703 (LPAR)
[ 1554.438381] Krnl PSW : 0704e00180000000 00007fffe1d3d6ae (memset+0x5e/0x98)
[ 1554.438396]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[ 1554.438409] Krnl GPRS: 0000000000000001 001c0ff0066f0000 001c0ff0066f0000 00000000000000f8
[ 1554.438418]            00000000000009fe 0000000000000009 0000000000000000 0000000000000002
[ 1554.438426]            0000000000005000 000078031ae655c8 00000feffdcf9f59 0000780258672a20
[ 1554.438433]            0000780243153500 00007f8033780000 00007fffe083a510 00007f7fee7cfa00
[ 1554.438452] Krnl Code: 00007fffe1d3d6a0: eb540008000c	srlg	%r5,%r4,8
           00007fffe1d3d6a6: b9020055		ltgr	%r5,%r5
          #00007fffe1d3d6aa: a784000b		brc	8,00007fffe1d3d6c0
          >00007fffe1d3d6ae: 42301000		stc	%r3,0(%r1)
           00007fffe1d3d6b2: d2fe10011000	mvc	1(255,%r1),0(%r1)
           00007fffe1d3d6b8: 41101100		la	%r1,256(%r1)
           00007fffe1d3d6bc: a757fff9		brctg	%r5,00007fffe1d3d6ae
           00007fffe1d3d6c0: 42301000		stc	%r3,0(%r1)
[ 1554.438539] Call Trace:
[ 1554.438545]  [<00007fffe1d3d6ae>] memset+0x5e/0x98 
[ 1554.438552] ([<00007fffe083a510>] remove_vm_area+0x220/0x400)
[ 1554.438562]  [<00007fffe083a9d6>] vfree.part.0+0x26/0x810 
[ 1554.438569]  [<00007fff6073bd50>] fix_align_alloc_test+0x50/0x90 [test_vmalloc] 
[ 1554.438583]  [<00007fff6073c73a>] test_func+0x46a/0x6c0 [test_vmalloc] 
[ 1554.438593]  [<00007fffe0283ac8>] kthread+0x3f8/0x7a0 
[ 1554.438603]  [<00007fffe014d8b4>] __ret_from_fork+0xd4/0x7d0 
[ 1554.438613]  [<00007fffe299ac0a>] ret_from_fork+0xa/0x30 
[ 1554.438622] INFO: lockdep is turned off.
[ 1554.438627] Last Breaking-Event-Address:
[ 1554.438632]  [<00007fffe1d3d65c>] memset+0xc/0x98
[ 1554.438644] Kernel panic - not syncing: Fatal exception: panic_on_oops

This series fixes the above issues and is a pre-requisite for the s390
lazy MMU mode implementation.

test_vmalloc was used to stress-test the fixes.


This patch (of 2):

When vmalloc shadow memory is established the modification of the
corresponding page tables is not protected by any locks.  Instead, the
locking is done per-PTE.  This scheme however has defects.

kasan_populate_vmalloc_pte() - while ptep_get() read is atomic the
sequence pte_none(ptep_get()) is not.  Doing that outside of the lock
might lead to a concurrent PTE update and what could be seen as a shadow
memory corruption as result.

kasan_depopulate_vmalloc_pte() - by the time a page whose address was
extracted from ptep_get() read and cached in a local variable outside of
the lock is attempted to get free, could actually be freed already.

To avoid these put ptep_get() itself and the code that manipulates the
result of the read under lock.  In addition, move freeing of the page out
of the atomic context.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/adb258634194593db294c0d1fb35646e894d6ead.1755528662.git.agordeev@linux.ibm.com
Link: https://lore.kernel.org/linux-mm/[email protected]/ [1]
Fixes: 3c5c3cf ("kasan: support backing vmalloc space with real shadow memory")
Signed-off-by: Alexander Gordeev <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Daniel Axtens <[email protected]>
Cc: Marc Rutland <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
During our internal testing, we started observing intermittent boot
failures when the machine uses 4-level paging and has a large amount of
persistent memory:

  BUG: unable to handle page fault for address: ffffe70000000034
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 0 P4D 0 
  Oops: 0002 [#1] SMP NOPTI
  RIP: 0010:__init_single_page+0x9/0x6d
  Call Trace:
   <TASK>
   __init_zone_device_page+0x17/0x5d
   memmap_init_zone_device+0x154/0x1bb
   pagemap_range+0x2e0/0x40f
   memremap_pages+0x10b/0x2f0
   devm_memremap_pages+0x1e/0x60
   dev_dax_probe+0xce/0x2ec [device_dax]
   dax_bus_probe+0x6d/0xc9
   [... snip ...]
   </TASK>

It turns out that the kernel panics while initializing vmemmap (struct
page array) when the vmemmap region spans two PGD entries, because the new
PGD entry is only installed in init_mm.pgd, but not in the page tables of
other tasks.

And looking at __populate_section_memmap():
  if (vmemmap_can_optimize(altmap, pgmap))                                
          // does not sync top level page tables
          r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
  else                                                                    
          // sync top level page tables in x86
          r = vmemmap_populate(start, end, nid, altmap);

In the normal path, vmemmap_populate() in arch/x86/mm/init_64.c
synchronizes the top level page table (See commit 9b86152 ("x86-64,
mem: Update all PGDs for direct mapping and vmemmap mapping changes")) so
that all tasks in the system can see the new vmemmap area.

However, when vmemmap_can_optimize() returns true, the optimized path
skips synchronization of top-level page tables.  This is because
vmemmap_populate_compound_pages() is implemented in core MM code, which
does not handle synchronization of the top-level page tables.  Instead,
the core MM has historically relied on each architecture to perform this
synchronization manually.

We're not the first party to encounter a crash caused by not-sync'd top
level page tables: earlier this year, Gwan-gyeong Mun attempted to address
the issue [1] [2] after hitting a kernel panic when x86 code accessed the
vmemmap area before the corresponding top-level entries were synced.  At
that time, the issue was believed to be triggered only when struct page
was enlarged for debugging purposes, and the patch did not get further
updates.

It turns out that current approach of relying on each arch to handle the
page table sync manually is fragile because 1) it's easy to forget to sync
the top level page table, and 2) it's also easy to overlook that the
kernel should not access the vmemmap and direct mapping areas before the
sync.

# The solution: Make page table sync more code robust and harder to miss

To address this, Dave Hansen suggested [3] [4] introducing
{pgd,p4d}_populate_kernel() for updating kernel portion of the page tables
and allow each architecture to explicitly perform synchronization when
installing top-level entries.  With this approach, we no longer need to
worry about missing the sync step, reducing the risk of future
regressions.

The new interface reuses existing ARCH_PAGE_TABLE_SYNC_MASK,
PGTBL_P*D_MODIFIED and arch_sync_kernel_mappings() facility used by
vmalloc and ioremap to synchronize page tables.

pgd_populate_kernel() looks like this:
static inline void pgd_populate_kernel(unsigned long addr, pgd_t *pgd,
                                       p4d_t *p4d)
{
        pgd_populate(&init_mm, pgd, p4d);
        if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED)
                arch_sync_kernel_mappings(addr, addr);
}

It is worth noting that vmalloc() and apply_to_range() carefully
synchronizes page tables by calling p*d_alloc_track() and
arch_sync_kernel_mappings(), and thus they are not affected by this patch
series.

This series was hugely inspired by Dave Hansen's suggestion and hence
added Suggested-by: Dave Hansen.

Cc stable because lack of this series opens the door to intermittent
boot failures.


This patch (of 3):

Move ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings() to
linux/pgtable.h so that they can be used outside of vmalloc and ioremap.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/linux-mm/[email protected] [1] 
Link: https://lore.kernel.org/linux-mm/[email protected] [2] 
Link: https://lore.kernel.org/linux-mm/[email protected] [3] 
Link: https://lore.kernel.org/linux-mm/[email protected]  [4] 
Fixes: 8d40091 ("x86/vmemmap: handle unpopulated sub-pmd ranges")
Signed-off-by: Harry Yoo <[email protected]>
Acked-by: Kiryl Shutsemau <[email protected]>
Reviewed-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: "Uladzislau Rezki (Sony)" <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: bibo mao <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dev Jain <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Gwan-gyeong Mun <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Joao Martins <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Thomas Huth <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
…ings()

Define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings() to ensure
page tables are properly synchronized when calling p*d_populate_kernel().

For 5-level paging, synchronization is performed via
pgd_populate_kernel().  In 4-level paging, pgd_populate() is a no-op, so
synchronization is instead performed at the P4D level via
p4d_populate_kernel().

This fixes intermittent boot failures on systems using 4-level paging and
a large amount of persistent memory:

  BUG: unable to handle page fault for address: ffffe70000000034
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 0 P4D 0
  Oops: 0002 [#1] SMP NOPTI
  RIP: 0010:__init_single_page+0x9/0x6d
  Call Trace:
   <TASK>
   __init_zone_device_page+0x17/0x5d
   memmap_init_zone_device+0x154/0x1bb
   pagemap_range+0x2e0/0x40f
   memremap_pages+0x10b/0x2f0
   devm_memremap_pages+0x1e/0x60
   dev_dax_probe+0xce/0x2ec [device_dax]
   dax_bus_probe+0x6d/0xc9
   [... snip ...]
   </TASK>

It also fixes a crash in vmemmap_set_pmd() caused by accessing vmemmap
before sync_global_pgds() [1]:

  BUG: unable to handle page fault for address: ffffeb3ff1200000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
  Tainted: [W]=WARN
  RIP: 0010:vmemmap_set_pmd+0xff/0x230
   <TASK>
   vmemmap_populate_hugepages+0x176/0x180
   vmemmap_populate+0x34/0x80
   __populate_section_memmap+0x41/0x90
   sparse_add_section+0x121/0x3e0
   __add_pages+0xba/0x150
   add_pages+0x1d/0x70
   memremap_pages+0x3dc/0x810
   devm_memremap_pages+0x1c/0x60
   xe_devm_add+0x8b/0x100 [xe]
   xe_tile_init_noalloc+0x6a/0x70 [xe]
   xe_device_probe+0x48c/0x740 [xe]
   [... snip ...]

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 8d40091 ("x86/vmemmap: handle unpopulated sub-pmd ranges")
Signed-off-by: Harry Yoo <[email protected]>
Closes: https://lore.kernel.org/linux-mm/[email protected] [1]
Suggested-by: Dave Hansen <[email protected]>
Acked-by: Kiryl Shutsemau <[email protected]>
Reviewed-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: bibo mao <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dev Jain <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Joao Martins <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Thomas Huth <[email protected]>
Cc: "Uladzislau Rezki (Sony)" <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
sched_numa_find_nth_cpu() uses a bsearch to look for the 'closest'
CPU in sched_domains_numa_masks and given cpus mask. However they
might not intersect if all CPUs in the cpus mask are offline. bsearch
will return NULL in that case, bail out instead of dereferencing a
bogus pointer.

The previous behaviour lead to this bug when using maxcpus=4 on an
rk3399 (LLLLbb) (i.e. booting with all big CPUs offline):

[    1.422922] Unable to handle kernel paging request at virtual address ffffff8000000000
[    1.423635] Mem abort info:
[    1.423889]   ESR = 0x0000000096000006
[    1.424227]   EC = 0x25: DABT (current EL), IL = 32 bits
[    1.424715]   SET = 0, FnV = 0
[    1.424995]   EA = 0, S1PTW = 0
[    1.425279]   FSC = 0x06: level 2 translation fault
[    1.425735] Data abort info:
[    1.425998]   ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
[    1.426499]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[    1.426952]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[    1.427428] swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000004a9f000
[    1.428038] [ffffff8000000000] pgd=18000000f7fff403, p4d=18000000f7fff403, pud=18000000f7fff403, pmd=0000000000000000
[    1.429014] Internal error: Oops: 0000000096000006 [#1]  SMP
[    1.429525] Modules linked in:
[    1.429813] CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.17.0-rc4-dirty torvalds#343 PREEMPT
[    1.430559] Hardware name: Pine64 RockPro64 v2.1 (DT)
[    1.431012] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    1.431634] pc : sched_numa_find_nth_cpu+0x2a0/0x488
[    1.432094] lr : sched_numa_find_nth_cpu+0x284/0x488
[    1.432543] sp : ffffffc084e1b960
[    1.432843] x29: ffffffc084e1b960 x28: ffffff80078a8800 x27: ffffffc0846eb1d0
[    1.433495] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[    1.434144] x23: 0000000000000000 x22: fffffffffff7f093 x21: ffffffc081de6378
[    1.434792] x20: 0000000000000000 x19: 0000000ffff7f093 x18: 00000000ffffffff
[    1.435441] x17: 3030303866666666 x16: 66663d736b73616d x15: ffffffc104e1b5b7
[    1.436091] x14: 0000000000000000 x13: ffffffc084712860 x12: 0000000000000372
[    1.436739] x11: 0000000000000126 x10: ffffffc08476a860 x9 : ffffffc084712860
[    1.437389] x8 : 00000000ffffefff x7 : ffffffc08476a860 x6 : 0000000000000000
[    1.438036] x5 : 000000000000bff4 x4 : 0000000000000000 x3 : 0000000000000000
[    1.438683] x2 : 0000000000000000 x1 : ffffffc0846eb000 x0 : ffffff8000407b68
[    1.439332] Call trace:
[    1.439559]  sched_numa_find_nth_cpu+0x2a0/0x488 (P)
[    1.440016]  smp_call_function_any+0xc8/0xd0
[    1.440416]  armv8_pmu_init+0x58/0x27c
[    1.440770]  armv8_cortex_a72_pmu_init+0x20/0x2c
[    1.441199]  arm_pmu_device_probe+0x1e4/0x5e8
[    1.441603]  armv8_pmu_device_probe+0x1c/0x28
[    1.442007]  platform_probe+0x5c/0xac
[    1.442347]  really_probe+0xbc/0x298
[    1.442683]  __driver_probe_device+0x78/0x12c
[    1.443087]  driver_probe_device+0xdc/0x160
[    1.443475]  __driver_attach+0x94/0x19c
[    1.443833]  bus_for_each_dev+0x74/0xd4
[    1.444190]  driver_attach+0x24/0x30
[    1.444525]  bus_add_driver+0xe4/0x208
[    1.444874]  driver_register+0x60/0x128
[    1.445233]  __platform_driver_register+0x24/0x30
[    1.445662]  armv8_pmu_driver_init+0x28/0x4c
[    1.446059]  do_one_initcall+0x44/0x25c
[    1.446416]  kernel_init_freeable+0x1dc/0x3bc
[    1.446820]  kernel_init+0x20/0x1d8
[    1.447151]  ret_from_fork+0x10/0x20
[    1.447493] Code: 90022e21 f000e5f5 910de2b5 2a1703e2 (f8767803)
[    1.448040] ---[ end trace 0000000000000000 ]---
[    1.448483] note: swapper/0[1] exited with preempt_count 1
[    1.449047] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    1.449741] SMP: stopping secondary CPUs
[    1.450105] Kernel Offset: disabled
[    1.450419] CPU features: 0x000000,00080000,20002001,0400421b
[    1.450935] Memory Limit: none
[    1.451217] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

Yury: with the fix, the function returns cpu == nr_cpu_ids, and later in

	smp_call_function_any ->
	  smp_call_function_single ->
	     generic_exec_single

we test the cpu for '>= nr_cpu_ids' and return -ENXIO. So everything is
handled correctly.

Fixes: cd7f553 ("sched: add sched_numa_find_nth_cpu()")
Cc: [email protected]
Signed-off-by: Christian Loehle <[email protected]>
Signed-off-by: Yury Norov (NVIDIA) <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
tee_shm_put have NULL pointer dereference:

__optee_disable_shm_cache -->
	shm = reg_pair_to_ptr(...);//shm maybe return NULL
        tee_shm_free(shm); -->
		tee_shm_put(shm);//crash

Add check in tee_shm_put to fix it.

panic log:
Unable to handle kernel paging request at virtual address 0000000000100cca
Mem abort info:
ESR = 0x0000000096000004
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x04: level 0 translation fault
Data abort info:
ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
CM = 0, WnR = 0, TnD = 0, TagAccess = 0
GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
user pgtable: 4k pages, 48-bit VAs, pgdp=0000002049d07000
[0000000000100cca] pgd=0000000000000000, p4d=0000000000000000
Internal error: Oops: 0000000096000004 [#1] SMP
CPU: 2 PID: 14442 Comm: systemd-sleep Tainted: P OE ------- ----
6.6.0-39-generic torvalds#38
Source Version: 938b255f6cb8817c95b0dd5c8c2944acfce94b07
Hardware name: greatwall GW-001Y1A-FTH, BIOS Great Wall BIOS V3.0
10/26/2022
pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : tee_shm_put+0x24/0x188
lr : tee_shm_free+0x14/0x28
sp : ffff001f98f9faf0
x29: ffff001f98f9faf0 x28: ffff0020df543cc0 x27: 0000000000000000
x26: ffff001f811344a0 x25: ffff8000818dac00 x24: ffff800082d8d048
x23: ffff001f850fcd18 x22: 0000000000000001 x21: ffff001f98f9fb88
x20: ffff001f83e76218 x19: ffff001f83e761e0 x18: 000000000000ffff
x17: 303a30303a303030 x16: 0000000000000000 x15: 0000000000000003
x14: 0000000000000001 x13: 0000000000000000 x12: 0101010101010101
x11: 0000000000000001 x10: 0000000000000001 x9 : ffff800080e08d0c
x8 : ffff001f98f9fb88 x7 : 0000000000000000 x6 : 0000000000000000
x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
x2 : ffff001f83e761e0 x1 : 00000000ffff001f x0 : 0000000000100cca
Call trace:
tee_shm_put+0x24/0x188
tee_shm_free+0x14/0x28
__optee_disable_shm_cache+0xa8/0x108
optee_shutdown+0x28/0x38
platform_shutdown+0x28/0x40
device_shutdown+0x144/0x2b0
kernel_power_off+0x3c/0x80
hibernate+0x35c/0x388
state_store+0x64/0x80
kobj_attr_store+0x14/0x28
sysfs_kf_write+0x48/0x60
kernfs_fop_write_iter+0x128/0x1c0
vfs_write+0x270/0x370
ksys_write+0x6c/0x100
__arm64_sys_write+0x20/0x30
invoke_syscall+0x4c/0x120
el0_svc_common.constprop.0+0x44/0xf0
do_el0_svc+0x24/0x38
el0_svc+0x24/0x88
el0t_64_sync_handler+0x134/0x150
el0t_64_sync+0x14c/0x15

Fixes: dfd0743 ("tee: handle lookup of shm with reference count 0")
Signed-off-by: Pei Xiao <[email protected]>
Reviewed-by: Sumit Garg <[email protected]>
Signed-off-by: Jens Wiklander <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
BUG: kernel NULL pointer dereference, address: 00000000000002ec
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP PTI
CPU: 28 UID: 0 PID: 343 Comm: kworker/28:1 Kdump: loaded Tainted: G        OE       6.17.0-rc2+ torvalds#9 NONE
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
Workqueue: smc_hs_wq smc_listen_work [smc]
RIP: 0010:smc_ib_is_sg_need_sync+0x9e/0xd0 [smc]
...
Call Trace:
 <TASK>
 smcr_buf_map_link+0x211/0x2a0 [smc]
 __smc_buf_create+0x522/0x970 [smc]
 smc_buf_create+0x3a/0x110 [smc]
 smc_find_rdma_v2_device_serv+0x18f/0x240 [smc]
 ? smc_vlan_by_tcpsk+0x7e/0xe0 [smc]
 smc_listen_find_device+0x1dd/0x2b0 [smc]
 smc_listen_work+0x30f/0x580 [smc]
 process_one_work+0x18c/0x340
 worker_thread+0x242/0x360
 kthread+0xe7/0x220
 ret_from_fork+0x13a/0x160
 ret_from_fork_asm+0x1a/0x30
 </TASK>

If the software RoCE device is used, ibdev->dma_device is a null pointer.
As a result, the problem occurs. Null pointer detection is added to
prevent problems.

Fixes: 0ef69e7 ("net/smc: optimize for smc_sndbuf_sync_sg_for_device and smc_rmb_sync_sg_for_cpu")
Signed-off-by: Liu Jian <[email protected]>
Reviewed-by: Guangguan Wang <[email protected]>
Reviewed-by: Zhu Yanjun <[email protected]>
Reviewed-by: D. Wythe <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
VXLAN FDB entries can point to either a remote destination or an FDB
nexthop group. The latter is usually used in EVPN deployments where
learning is disabled.

However, when learning is enabled, an incoming packet might try to
refresh an FDB entry that points to an FDB nexthop group and therefore
does not have a remote. Such packets should be dropped, but they are
only dropped after dereferencing the non-existent remote, resulting in a
NPD [1] which can be reproduced using [2].

Fix by dropping such packets earlier. Remove the misleading comment from
first_remote_rcu().

[1]
BUG: kernel NULL pointer dereference, address: 0000000000000000
[...]
CPU: 13 UID: 0 PID: 361 Comm: mausezahn Not tainted 6.17.0-rc1-virtme-g9f6b606b6b37 #1 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-4.fc41 04/01/2014
RIP: 0010:vxlan_snoop+0x98/0x1e0
[...]
Call Trace:
 <TASK>
 vxlan_encap_bypass+0x209/0x240
 encap_bypass_if_local+0xb1/0x100
 vxlan_xmit_one+0x1375/0x17e0
 vxlan_xmit+0x6b4/0x15f0
 dev_hard_start_xmit+0x5d/0x1c0
 __dev_queue_xmit+0x246/0xfd0
 packet_sendmsg+0x113a/0x1850
 __sock_sendmsg+0x38/0x70
 __sys_sendto+0x126/0x180
 __x64_sys_sendto+0x24/0x30
 do_syscall_64+0xa4/0x260
 entry_SYSCALL_64_after_hwframe+0x4b/0x53

[2]
 #!/bin/bash

 ip address add 192.0.2.1/32 dev lo
 ip address add 192.0.2.2/32 dev lo

 ip nexthop add id 1 via 192.0.2.3 fdb
 ip nexthop add id 10 group 1 fdb

 ip link add name vx0 up type vxlan id 10010 local 192.0.2.1 dstport 12345 localbypass
 ip link add name vx1 up type vxlan id 10020 local 192.0.2.2 dstport 54321 learning

 bridge fdb add 00:11:22:33:44:55 dev vx0 self static dst 192.0.2.2 port 54321 vni 10020
 bridge fdb add 00:aa:bb:cc:dd:ee dev vx1 self static nhid 10

 mausezahn vx0 -a 00:aa:bb:cc:dd:ee -b 00:11:22:33:44:55 -c 1 -q

Fixes: 1274e1c ("vxlan: ecmp support for mac fdb entries")
Reported-by: Marlin Cremers <[email protected]>
Reviewed-by: Petr Machata <[email protected]>
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Nikolay Aleksandrov <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
Ido Schimmel says:

====================
vxlan: Fix NPDs when using nexthop objects

With FDB nexthop groups, VXLAN FDB entries do not necessarily point to
a remote destination but rather to an FDB nexthop group. This means that
first_remote_{rcu,rtnl}() can return NULL and a few places in the driver
were not ready for that, resulting in NULL pointer dereferences.
Patches #1-#2 fix these NPDs.

Note that vxlan_fdb_find_uc() still dereferences the remote returned by
first_remote_rcu() without checking that it is not NULL, but this
function is only invoked by a single driver which vetoes the creation of
FDB nexthop groups. I will patch this in net-next to make the code less
fragile.

Patch #3 adds a selftests which exercises these code paths and tests
basic Tx functionality with FDB nexthop groups. I verified that the test
crashes the kernel without the first two patches.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
When transmitting a PTP frame which is timestamp using 2 step, the
following warning appears if CONFIG_PROVE_LOCKING is enabled:
=============================
[ BUG: Invalid wait context ]
6.17.0-rc1-00326-ge6160462704e torvalds#427 Not tainted
-----------------------------
ptp4l/119 is trying to lock:
c2a44ed4 (&vsc8531->ts_lock){+.+.}-{3:3}, at: vsc85xx_txtstamp+0x50/0xac
other info that might help us debug this:
context-{4:4}
4 locks held by ptp4l/119:
 #0: c145f068 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x58/0x1440
 #1: c29df974 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+...}-{2:2}, at: __dev_queue_xmit+0x5c4/0x1440
 #2: c2aaaad0 (_xmit_ETHER#2){+.-.}-{2:2}, at: sch_direct_xmit+0x108/0x350
 #3: c2aac170 (&lan966x->tx_lock){+.-.}-{2:2}, at: lan966x_port_xmit+0xd0/0x350
stack backtrace:
CPU: 0 UID: 0 PID: 119 Comm: ptp4l Not tainted 6.17.0-rc1-00326-ge6160462704e torvalds#427 NONE
Hardware name: Generic DT based system
Call trace:
 unwind_backtrace from show_stack+0x10/0x14
 show_stack from dump_stack_lvl+0x7c/0xac
 dump_stack_lvl from __lock_acquire+0x8e8/0x29dc
 __lock_acquire from lock_acquire+0x108/0x38c
 lock_acquire from __mutex_lock+0xb0/0xe78
 __mutex_lock from mutex_lock_nested+0x1c/0x24
 mutex_lock_nested from vsc85xx_txtstamp+0x50/0xac
 vsc85xx_txtstamp from lan966x_fdma_xmit+0xd8/0x3a8
 lan966x_fdma_xmit from lan966x_port_xmit+0x1bc/0x350
 lan966x_port_xmit from dev_hard_start_xmit+0xc8/0x2c0
 dev_hard_start_xmit from sch_direct_xmit+0x8c/0x350
 sch_direct_xmit from __dev_queue_xmit+0x680/0x1440
 __dev_queue_xmit from packet_sendmsg+0xfa4/0x1568
 packet_sendmsg from __sys_sendto+0x110/0x19c
 __sys_sendto from sys_send+0x18/0x20
 sys_send from ret_fast_syscall+0x0/0x1c
Exception stack(0xf0b05fa8 to 0xf0b05ff0)
5fa0:                   00000001 0000000 0000000 0004b47a 0000003a 00000000
5fc0: 00000001 0000000 00000000 00000121 0004af58 00044874 00000000 00000000
5fe0: 00000001 bee9d420 00025a10 b6e75c7c

So, instead of using the ts_lock for tx_queue, use the spinlock that
skb_buff_head has.

Reviewed-by: Vadim Fedorenko <[email protected]>
Fixes: 7d272e6 ("net: phy: mscc: timestamping and PHC support")
Signed-off-by: Horatiu Vultur <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
The commit ced17ee ("Revert "virtio: reject shm region if length is zero"")
exposes the following DAX page fault bug (this fix the failure that getting shm
region alway returns false because of zero length):

The commit 21aa65b ("mm: remove callers of pfn_t functionality") handles
the DAX physical page address incorrectly: the removed macro 'phys_to_pfn_t()'
should be replaced with 'PHYS_PFN()'.

[    1.390321] BUG: unable to handle page fault for address: ffffd3fb40000008
[    1.390875] #PF: supervisor read access in kernel mode
[    1.391257] #PF: error_code(0x0000) - not-present page
[    1.391509] PGD 0 P4D 0
[    1.391626] Oops: Oops: 0000 [#1] SMP NOPTI
[    1.391806] CPU: 6 UID: 1000 PID: 162 Comm: weston Not tainted 6.17.0-rc3-WSL2-STABLE #2 PREEMPT(none)
[    1.392361] RIP: 0010:dax_to_folio+0x14/0x60
[    1.392653] Code: 52 c9 c3 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 c1 ef 05 48 c1 e7 06 48 03 3d 34 b5 31 01 <48> 8b 57 08 48 89 f8 f6 c2 01 75 2b 66 90 c3 cc cc cc cc f7 c7 ff
[    1.393727] RSP: 0000:ffffaf7d04407aa8 EFLAGS: 00010086
[    1.394003] RAX: 000000a000000000 RBX: ffffaf7d04407bb0 RCX: 0000000000000000
[    1.394524] RDX: ffffd17b40000008 RSI: 0000000000000083 RDI: ffffd3fb40000000
[    1.394967] RBP: 0000000000000011 R08: 000000a000000000 R09: 0000000000000000
[    1.395400] R10: 0000000000001000 R11: ffffaf7d04407c10 R12: 0000000000000000
[    1.395806] R13: ffffa020557be9c0 R14: 0000014000000001 R15: 0000725970e94000
[    1.396268] FS:  000072596d6d2ec0(0000) GS:ffffa0222dc59000(0000) knlGS:0000000000000000
[    1.396715] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.397100] CR2: ffffd3fb40000008 CR3: 000000011579c005 CR4: 0000000000372ef0
[    1.397518] Call Trace:
[    1.397663]  <TASK>
[    1.397900]  dax_insert_entry+0x13b/0x390
[    1.398179]  dax_fault_iter+0x2a5/0x6c0
[    1.398443]  dax_iomap_pte_fault+0x193/0x3c0
[    1.398750]  __fuse_dax_fault+0x8b/0x270
[    1.398997]  ? vm_mmap_pgoff+0x161/0x210
[    1.399175]  __do_fault+0x30/0x180
[    1.399360]  do_fault+0xc4/0x550
[    1.399547]  __handle_mm_fault+0x8e3/0xf50
[    1.399731]  ? do_syscall_64+0x72/0x1e0
[    1.399958]  handle_mm_fault+0x192/0x2f0
[    1.400204]  do_user_addr_fault+0x20e/0x700
[    1.400418]  exc_page_fault+0x66/0x150
[    1.400602]  asm_exc_page_fault+0x26/0x30
[    1.400831] RIP: 0033:0x72596d1bf703
[    1.401076] Code: 31 f6 45 31 e4 48 8d 15 b3 73 00 00 e8 06 03 00 00 8b 83 68 01 00 00 e9 8e fa ff ff 0f 1f 00 48 8b 44 24 08 4c 89 ee 48 89 df <c7> 00 21 43 34 12 e8 72 09 00 00 e9 6a fa ff ff 0f 1f 44 00 00 e8
[    1.402172] RSP: 002b:00007ffc350f6dc0 EFLAGS: 00010202
[    1.402488] RAX: 0000725970e94000 RBX: 00005b7c642c2560 RCX: 0000725970d359a7
[    1.402898] RDX: 0000000000000003 RSI: 00007ffc350f6dc0 RDI: 00005b7c642c2560
[    1.403284] RBP: 00007ffc350f6e90 R08: 000000000000000d R09: 0000000000000000
[    1.403634] R10: 00007ffc350f6dd8 R11: 0000000000000246 R12: 0000000000000001
[    1.404078] R13: 00007ffc350f6dc0 R14: 0000725970e29ce0 R15: 0000000000000003
[    1.404450]  </TASK>
[    1.404570] Modules linked in:
[    1.404821] CR2: ffffd3fb40000008
[    1.405029] ---[ end trace 0000000000000000 ]---
[    1.405323] RIP: 0010:dax_to_folio+0x14/0x60
[    1.405556] Code: 52 c9 c3 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 c1 ef 05 48 c1 e7 06 48 03 3d 34 b5 31 01 <48> 8b 57 08 48 89 f8 f6 c2 01 75 2b 66 90 c3 cc cc cc cc f7 c7 ff
[    1.406639] RSP: 0000:ffffaf7d04407aa8 EFLAGS: 00010086
[    1.406910] RAX: 000000a000000000 RBX: ffffaf7d04407bb0 RCX: 0000000000000000
[    1.407379] RDX: ffffd17b40000008 RSI: 0000000000000083 RDI: ffffd3fb40000000
[    1.407800] RBP: 0000000000000011 R08: 000000a000000000 R09: 0000000000000000
[    1.408246] R10: 0000000000001000 R11: ffffaf7d04407c10 R12: 0000000000000000
[    1.408666] R13: ffffa020557be9c0 R14: 0000014000000001 R15: 0000725970e94000
[    1.409170] FS:  000072596d6d2ec0(0000) GS:ffffa0222dc59000(0000) knlGS:0000000000000000
[    1.409608] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.409977] CR2: ffffd3fb40000008 CR3: 000000011579c005 CR4: 0000000000372ef0
[    1.410437] Kernel panic - not syncing: Fatal exception
[    1.410857] Kernel Offset: 0xc000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Fixes: 21aa65b ("mm: remove callers of pfn_t functionality")
Signed-off-by: Haiyue Wang <[email protected]>
Link: https://lore.kernel.org/[email protected]
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Miklos Szeredi <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
A crash was observed with the following output:

BUG: kernel NULL pointer dereference, address: 0000000000000010
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 2 UID: 0 PID: 92 Comm: osnoise_cpus Not tainted 6.17.0-rc4-00201-gd69eb204c255 torvalds#138 PREEMPT(voluntary)
RIP: 0010:bitmap_parselist+0x53/0x3e0
Call Trace:
 <TASK>
 osnoise_cpus_write+0x7a/0x190
 vfs_write+0xf8/0x410
 ? do_sys_openat2+0x88/0xd0
 ksys_write+0x60/0xd0
 do_syscall_64+0xa4/0x260
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
 </TASK>

This issue can be reproduced by below code:

fd=open("/sys/kernel/debug/tracing/osnoise/cpus", O_WRONLY);
write(fd, "0-2", 0);

When user pass 'count=0' to osnoise_cpus_write(), kmalloc() will return
ZERO_SIZE_PTR (16) and cpulist_parse() treat it as a normal value, which
trigger the null pointer dereference. Add check for the parameter 'count'.

Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/[email protected]
Fixes: 17f8910 ("tracing/osnoise: Allow arbitrarily long CPU string")
Signed-off-by: Wang Liang <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
Steven Rostedt reported a crash with "ftrace=function" kernel command
line:

[    0.159269] BUG: kernel NULL pointer dereference, address: 000000000000001c
[    0.160254] #PF: supervisor read access in kernel mode
[    0.160975] #PF: error_code(0x0000) - not-present page
[    0.161697] PGD 0 P4D 0
[    0.162055] Oops: Oops: 0000 [#1] SMP PTI
[    0.162619] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.17.0-rc2-test-00006-g48d06e78b7cb-dirty torvalds#9 PREEMPT(undef)
[    0.164141] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[    0.165439] RIP: 0010:kmem_cache_alloc_noprof (mm/slub.c:4237)
[ 0.166186] Code: 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 e4 f0 48 83 ec 20 8b 05 c9 b6 7e 01 <44> 8b 77 1c 65 4c 8b 2d b5 ea 20 02 4c 89 6c 24 18 41 89 f5 21 f0
[    0.168811] RSP: 0000:ffffffffb2e03b30 EFLAGS: 00010086
[    0.169545] RAX: 0000000001fff33f RBX: 0000000000000000 RCX: 0000000000000000
[    0.170544] RDX: 0000000000002800 RSI: 0000000000002800 RDI: 0000000000000000
[    0.171554] RBP: ffffffffb2e03b80 R08: 0000000000000004 R09: ffffffffb2e03c90
[    0.172549] R10: ffffffffb2e03c90 R11: 0000000000000000 R12: 0000000000000000
[    0.173544] R13: ffffffffb2e03c90 R14: ffffffffb2e03c90 R15: 0000000000000001
[    0.174542] FS:  0000000000000000(0000) GS:ffff9d2808114000(0000) knlGS:0000000000000000
[    0.175684] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.176486] CR2: 000000000000001c CR3: 000000007264c001 CR4: 00000000000200b0
[    0.177483] Call Trace:
[    0.177828]  <TASK>
[    0.178123] mas_alloc_nodes (lib/maple_tree.c:176 (discriminator 2) lib/maple_tree.c:1255 (discriminator 2))
[    0.178692] mas_store_gfp (lib/maple_tree.c:5468)
[    0.179223] execmem_cache_add_locked (mm/execmem.c:207)
[    0.179870] execmem_alloc (mm/execmem.c:213 mm/execmem.c:313 mm/execmem.c:335 mm/execmem.c:475)
[    0.180397] ? ftrace_caller (arch/x86/kernel/ftrace_64.S:169)
[    0.180922] ? __pfx_ftrace_caller (arch/x86/kernel/ftrace_64.S:158)
[    0.181517] execmem_alloc_rw (mm/execmem.c:487)
[    0.182052] arch_ftrace_update_trampoline (arch/x86/kernel/ftrace.c:266 arch/x86/kernel/ftrace.c:344 arch/x86/kernel/ftrace.c:474)
[    0.182778] ? ftrace_caller_op_ptr (arch/x86/kernel/ftrace_64.S:182)
[    0.183388] ftrace_update_trampoline (kernel/trace/ftrace.c:7947)
[    0.184024] __register_ftrace_function (kernel/trace/ftrace.c:368)
[    0.184682] ftrace_startup (kernel/trace/ftrace.c:3048)
[    0.185205] ? __pfx_function_trace_call (kernel/trace/trace_functions.c:210)
[    0.185877] register_ftrace_function_nolock (kernel/trace/ftrace.c:8717)
[    0.186595] register_ftrace_function (kernel/trace/ftrace.c:8745)
[    0.187254] ? __pfx_function_trace_call (kernel/trace/trace_functions.c:210)
[    0.187924] function_trace_init (kernel/trace/trace_functions.c:170)
[    0.188499] tracing_set_tracer (kernel/trace/trace.c:5916 kernel/trace/trace.c:6349)
[    0.189088] register_tracer (kernel/trace/trace.c:2391)
[    0.189642] early_trace_init (kernel/trace/trace.c:11075 kernel/trace/trace.c:11149)
[    0.190204] start_kernel (init/main.c:970)
[    0.190732] x86_64_start_reservations (arch/x86/kernel/head64.c:307)
[    0.191381] x86_64_start_kernel (??:?)
[    0.191955] common_startup_64 (arch/x86/kernel/head_64.S:419)
[    0.192534]  </TASK>
[    0.192839] Modules linked in:
[    0.193267] CR2: 000000000000001c
[    0.193730] ---[ end trace 0000000000000000 ]---

The crash happens because on x86 ftrace allocations from execmem require
maple tree to be initialized.

Move maple tree initialization that depends only on slab availability
earlier in boot so that it will happen right after mm_core_init().

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 5d79c2b ("x86/ftrace: enable EXECMEM_ROX_CACHE for ftrace allocations")
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Reported-by: Steven Rostedt (Google) <[email protected]>
Tested-by: Steven Rostedt (Google) <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Reviewed-by: Masami Hiramatsu (Google) <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
…on memory

When I did memory failure tests, below panic occurs:

page dumped because: VM_BUG_ON_PAGE(PagePoisoned(page))
kernel BUG at include/linux/page-flags.h:616!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 3 PID: 720 Comm: bash Not tainted 6.10.0-rc1-00195-g148743902568 torvalds#40
RIP: 0010:unpoison_memory+0x2f3/0x590
RSP: 0018:ffffa57fc8787d60 EFLAGS: 00000246
RAX: 0000000000000037 RBX: 0000000000000009 RCX: ffff9be25fcdc9c8
RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff9be25fcdc9c0
RBP: 0000000000300000 R08: ffffffffb4956f88 R09: 0000000000009ffb
R10: 0000000000000284 R11: ffffffffb4926fa0 R12: ffffe6b00c000000
R13: ffff9bdb453dfd00 R14: 0000000000000000 R15: fffffffffffffffe
FS:  00007f08f04e4740(0000) GS:ffff9be25fcc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000564787a30410 CR3: 000000010d4e2000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 unpoison_memory+0x2f3/0x590
 simple_attr_write_xsigned.constprop.0.isra.0+0xb3/0x110
 debugfs_attr_write+0x42/0x60
 full_proxy_write+0x5b/0x80
 vfs_write+0xd5/0x540
 ksys_write+0x64/0xe0
 do_syscall_64+0xb9/0x1d0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f08f0314887
RSP: 002b:00007ffece710078 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007f08f0314887
RDX: 0000000000000009 RSI: 0000564787a30410 RDI: 0000000000000001
RBP: 0000564787a30410 R08: 000000000000fefe R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000009
R13: 00007f08f041b780 R14: 00007f08f0417600 R15: 00007f08f0416a00
 </TASK>
Modules linked in: hwpoison_inject
---[ end trace 0000000000000000 ]---
RIP: 0010:unpoison_memory+0x2f3/0x590
RSP: 0018:ffffa57fc8787d60 EFLAGS: 00000246
RAX: 0000000000000037 RBX: 0000000000000009 RCX: ffff9be25fcdc9c8
RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff9be25fcdc9c0
RBP: 0000000000300000 R08: ffffffffb4956f88 R09: 0000000000009ffb
R10: 0000000000000284 R11: ffffffffb4926fa0 R12: ffffe6b00c000000
R13: ffff9bdb453dfd00 R14: 0000000000000000 R15: fffffffffffffffe
FS:  00007f08f04e4740(0000) GS:ffff9be25fcc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000564787a30410 CR3: 000000010d4e2000 CR4: 00000000000006f0
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x31c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception ]---

The root cause is that unpoison_memory() tries to check the PG_HWPoison
flags of an uninitialized page.  So VM_BUG_ON_PAGE(PagePoisoned(page)) is
triggered.  This can be reproduced by below steps:

1.Offline memory block:

 echo offline > /sys/devices/system/memory/memory12/state

2.Get offlined memory pfn:

 page-types -b n -rlN

3.Write pfn to unpoison-pfn

 echo <pfn> > /sys/kernel/debug/hwpoison/unpoison-pfn

This scenario can be identified by pfn_to_online_page() returning NULL. 
And ZONE_DEVICE pages are never expected, so we can simply fail if
pfn_to_online_page() == NULL to fix the bug.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: f1dd2cd ("mm, memory_hotplug: do not associate hotadded memory to zones until online")
Signed-off-by: Miaohe Lin <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
Problem description
===================

Lockdep reports a possible circular locking dependency (AB/BA) between
&pl->state_mutex and &phy->lock, as follows.

phylink_resolve() // acquires &pl->state_mutex
-> phylink_major_config()
   -> phy_config_inband() // acquires &pl->phydev->lock

whereas all the other call sites where &pl->state_mutex and
&pl->phydev->lock have the locking scheme reversed. Everywhere else,
&pl->phydev->lock is acquired at the top level, and &pl->state_mutex at
the lower level. A clear example is phylink_bringup_phy().

The outlier is the newly introduced phy_config_inband() and the existing
lock order is the correct one. To understand why it cannot be the other
way around, it is sufficient to consider phylink_phy_change(), phylink's
callback from the PHY device's phy->phy_link_change() virtual method,
invoked by the PHY state machine.

phy_link_up() and phy_link_down(), the (indirect) callers of
phylink_phy_change(), are called with &phydev->lock acquired.
Then phylink_phy_change() acquires its own &pl->state_mutex, to
serialize changes made to its pl->phy_state and pl->link_config.
So all other instances of &pl->state_mutex and &phydev->lock must be
consistent with this order.

Problem impact
==============

I think the kernel runs a serious deadlock risk if an existing
phylink_resolve() thread, which results in a phy_config_inband() call,
is concurrent with a phy_link_up() or phy_link_down() call, which will
deadlock on &pl->state_mutex in phylink_phy_change(). Practically
speaking, the impact may be limited by the slow speed of the medium
auto-negotiation protocol, which makes it unlikely for the current state
to still be unresolved when a new one is detected, but I think the
problem is there. Nonetheless, the problem was discovered using lockdep.

Proposed solution
=================

Practically speaking, the phy_config_inband() requirement of having
phydev->lock acquired must transfer to the caller (phylink is the only
caller). There, it must bubble up until immediately before
&pl->state_mutex is acquired, for the cases where that takes place.

Solution details, considerations, notes
=======================================

This is the phy_config_inband() call graph:

                          sfp_upstream_ops :: connect_phy()
                          |
                          v
                          phylink_sfp_connect_phy()
                          |
                          v
                          phylink_sfp_config_phy()
                          |
                          |   sfp_upstream_ops :: module_insert()
                          |   |
                          |   v
                          |   phylink_sfp_module_insert()
                          |   |
                          |   |   sfp_upstream_ops :: module_start()
                          |   |   |
                          |   |   v
                          |   |   phylink_sfp_module_start()
                          |   |   |
                          |   v   v
                          |   phylink_sfp_config_optical()
 phylink_start()          |   |
   |   phylink_resume()   v   v
   |   |  phylink_sfp_set_config()
   |   |  |
   v   v  v
 phylink_mac_initial_config()
   |   phylink_resolve()
   |   |  phylink_ethtool_ksettings_set()
   v   v  v
   phylink_major_config()
            |
            v
    phy_config_inband()

phylink_major_config() caller #1, phylink_mac_initial_config(), does not
acquire &pl->state_mutex nor do its callers. It must acquire
&pl->phydev->lock prior to calling phylink_major_config().

phylink_major_config() caller #2, phylink_resolve() acquires
&pl->state_mutex, thus also needs to acquire &pl->phydev->lock.

phylink_major_config() caller #3, phylink_ethtool_ksettings_set(), is
completely uninteresting, because it only calls phylink_major_config()
if pl->phydev is NULL (otherwise it calls phy_ethtool_ksettings_set()).
We need to change nothing there.

Other solutions
===============

The lock inversion between &pl->state_mutex and &pl->phydev->lock has
occurred at least once before, as seen in commit c718af2 ("net:
phylink: fix ethtool -A with attached PHYs"). The solution there was to
simply not call phy_set_asym_pause() under the &pl->state_mutex. That
cannot be extended to our case though, where the phy_config_inband()
call is much deeper inside the &pl->state_mutex section.

Fixes: 5fd0f1a ("net: phylink: add negotiation of in-band capabilities")
Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Russell King (Oracle) <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
If request_irq() in i40e_vsi_request_irq_msix() fails in an iteration
later than the first, the error path wants to free the IRQs requested
so far. However, it uses the wrong dev_id argument for free_irq(), so
it does not free the IRQs correctly and instead triggers the warning:

 Trying to free already-free IRQ 173
 WARNING: CPU: 25 PID: 1091 at kernel/irq/manage.c:1829 __free_irq+0x192/0x2c0
 Modules linked in: i40e(+) [...]
 CPU: 25 UID: 0 PID: 1091 Comm: NetworkManager Not tainted 6.17.0-rc1+ #1 PREEMPT(lazy)
 Hardware name: [...]
 RIP: 0010:__free_irq+0x192/0x2c0
 [...]
 Call Trace:
  <TASK>
  free_irq+0x32/0x70
  i40e_vsi_request_irq_msix.cold+0x63/0x8b [i40e]
  i40e_vsi_request_irq+0x79/0x80 [i40e]
  i40e_vsi_open+0x21f/0x2f0 [i40e]
  i40e_open+0x63/0x130 [i40e]
  __dev_open+0xfc/0x210
  __dev_change_flags+0x1fc/0x240
  netif_change_flags+0x27/0x70
  do_setlink.isra.0+0x341/0xc70
  rtnl_newlink+0x468/0x860
  rtnetlink_rcv_msg+0x375/0x450
  netlink_rcv_skb+0x5c/0x110
  netlink_unicast+0x288/0x3c0
  netlink_sendmsg+0x20d/0x430
  ____sys_sendmsg+0x3a2/0x3d0
  ___sys_sendmsg+0x99/0xe0
  __sys_sendmsg+0x8a/0xf0
  do_syscall_64+0x82/0x2c0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [...]
  </TASK>
 ---[ end trace 0000000000000000 ]---

Use the same dev_id for free_irq() as for request_irq().

I tested this with inserting code to fail intentionally.

Fixes: 493fb30 ("i40e: Move q_vectors from pointer to array to array of pointers")
Signed-off-by: Michal Schmidt <[email protected]>
Reviewed-by: Aleksandr Loktionov <[email protected]>
Reviewed-by: Subbaraya Sundeep <[email protected]>
Tested-by: Rinitha S <[email protected]> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
Hangbin Liu says:

====================
hsr: fix lock warnings

hsr_for_each_port is called in many places without holding the RCU read
lock, this may trigger warnings on debug kernels like:

  [   40.457015] [  T201] WARNING: suspicious RCU usage
  [   40.457020] [  T201] 6.17.0-rc2-virtme #1 Not tainted
  [   40.457025] [  T201] -----------------------------
  [   40.457029] [  T201] net/hsr/hsr_main.c:137 RCU-list traversed in non-reader section!!
  [   40.457036] [  T201]
                          other info that might help us debug this:

  [   40.457040] [  T201]
                          rcu_scheduler_active = 2, debug_locks = 1
  [   40.457045] [  T201] 2 locks held by ip/201:
  [   40.457050] [  T201]  #0: ffffffff93040a40 (&ops->srcu){.+.+}-{0:0}, at: rtnl_link_ops_get+0xf2/0x280
  [   40.457080] [  T201]  #1: ffffffff92e7f968 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_newlink+0x5e1/0xb20
  [   40.457102] [  T201]
                          stack backtrace:
  [   40.457108] [  T201] CPU: 2 UID: 0 PID: 201 Comm: ip Not tainted 6.17.0-rc2-virtme #1 PREEMPT(full)
  [   40.457114] [  T201] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  [   40.457117] [  T201] Call Trace:
  [   40.457120] [  T201]  <TASK>
  [   40.457126] [  T201]  dump_stack_lvl+0x6f/0xb0
  [   40.457136] [  T201]  lockdep_rcu_suspicious.cold+0x4f/0xb1
  [   40.457148] [  T201]  hsr_port_get_hsr+0xfe/0x140
  [   40.457158] [  T201]  hsr_add_port+0x192/0x940
  [   40.457167] [  T201]  ? __pfx_hsr_add_port+0x10/0x10
  [   40.457176] [  T201]  ? lockdep_init_map_type+0x5c/0x270
  [   40.457189] [  T201]  hsr_dev_finalize+0x4bc/0xbf0
  [   40.457204] [  T201]  hsr_newlink+0x3c3/0x8f0
  [   40.457212] [  T201]  ? __pfx_hsr_newlink+0x10/0x10
  [   40.457222] [  T201]  ? rtnl_create_link+0x173/0xe40
  [   40.457233] [  T201]  rtnl_newlink_create+0x2cf/0x750
  [   40.457243] [  T201]  ? __pfx_rtnl_newlink_create+0x10/0x10
  [   40.457247] [  T201]  ? __dev_get_by_name+0x12/0x50
  [   40.457252] [  T201]  ? rtnl_dev_get+0xac/0x140
  [   40.457259] [  T201]  ? __pfx_rtnl_dev_get+0x10/0x10
  [   40.457285] [  T201]  __rtnl_newlink+0x22c/0xa50
  [   40.457305] [  T201]  rtnl_newlink+0x637/0xb20

Adding rcu_read_lock() for all hsr_for_each_port() looks confusing.

Introduce a new helper, hsr_for_each_port_rtnl(), that assumes the
RTNL lock is held. This allows callers in suitable contexts to iterate
ports safely without explicit RCU locking.

Other code paths that rely on RCU protection continue to use
hsr_for_each_port() with rcu_read_lock().
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
Avoid below overlapping mappings by using a contiguous
non-cacheable buffer.

[    4.077708] DMA-API: stm32_fmc2_nfc 48810000.nand-controller: cacheline tracking EEXIST,
overlapping mappings aren't supported
[    4.089103] WARNING: CPU: 1 PID: 44 at kernel/dma/debug.c:568 add_dma_entry+0x23c/0x300
[    4.097071] Modules linked in:
[    4.100101] CPU: 1 PID: 44 Comm: kworker/u4:2 Not tainted 6.1.82 #1
[    4.106346] Hardware name: STMicroelectronics STM32MP257F VALID1 SNOR / MB1704 (LPDDR4 Power discrete) + MB1703 + MB1708 (SNOR MB1730) (DT)
[    4.118824] Workqueue: events_unbound deferred_probe_work_func
[    4.124674] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    4.131624] pc : add_dma_entry+0x23c/0x300
[    4.135658] lr : add_dma_entry+0x23c/0x300
[    4.139792] sp : ffff800009dbb490
[    4.143016] x29: ffff800009dbb4a0 x28: 0000000004008022 x27: ffff8000098a6000
[    4.150174] x26: 0000000000000000 x25: ffff8000099e7000 x24: ffff8000099e7de8
[    4.157231] x23: 00000000ffffffff x22: 0000000000000000 x21: ffff8000098a6a20
[    4.164388] x20: ffff000080964180 x19: ffff800009819ba0 x18: 0000000000000006
[    4.171545] x17: 6361727420656e69 x16: 6c6568636163203a x15: 72656c6c6f72746e
[    4.178602] x14: 6f632d646e616e2e x13: ffff800009832f58 x12: 00000000000004ec
[    4.185759] x11: 00000000000001a4 x10: ffff80000988af58 x9 : ffff800009832f58
[    4.192916] x8 : 00000000ffffefff x7 : ffff80000988af58 x6 : 80000000fffff000
[    4.199972] x5 : 000000000000bff4 x4 : 0000000000000000 x3 : 0000000000000000
[    4.207128] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000812d2c40
[    4.214185] Call trace:
[    4.216605]  add_dma_entry+0x23c/0x300
[    4.220338]  debug_dma_map_sg+0x198/0x350
[    4.224373]  __dma_map_sg_attrs+0xa0/0x110
[    4.228411]  dma_map_sg_attrs+0x10/0x2c
[    4.232247]  stm32_fmc2_nfc_xfer.isra.0+0x1c8/0x3fc
[    4.237088]  stm32_fmc2_nfc_seq_read_page+0xc8/0x174
[    4.242127]  nand_read_oob+0x1d4/0x8e0
[    4.245861]  mtd_read_oob_std+0x58/0x84
[    4.249596]  mtd_read_oob+0x90/0x150
[    4.253231]  mtd_read+0x68/0xac

Signed-off-by: Christophe Kerello <[email protected]>
Cc: [email protected]
Fixes: 2cd457f ("mtd: rawnand: stm32_fmc2: add STM32 FMC2 NAND flash controller driver")
Signed-off-by: Miquel Raynal <[email protected]>
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
5da3d94 ("PCI: mvebu: Use for_each_of_range() iterator for parsing
"ranges"") simplified code by using the for_each_of_range() iterator, but
it broke PCI enumeration on Turris Omnia (and probably other mvebu
targets).

Issue #1:

To determine range.flags, of_pci_range_parser_one() uses bus->get_flags(),
which resolves to of_bus_pci_get_flags(), which already returns an
IORESOURCE bit field, and NOT the original flags from the "ranges"
resource.

Then mvebu_get_tgt_attr() attempts the very same conversion again.  Remove
the misinterpretation of range.flags in mvebu_get_tgt_attr(), to restore
the intended behavior.

Issue #2:

The driver needs target and attributes, which are encoded in the raw
address values of the "/soc/pcie/ranges" resource. According to
of_pci_range_parser_one(), the raw values are stored in range.bus_addr and
range.parent_bus_addr, respectively. range.cpu_addr is a translated version
of range.parent_bus_addr, and not relevant here.

Use the correct range structure member, to extract target and attributes.
This restores the intended behavior.

Fixes: 5da3d94 ("PCI: mvebu: Use for_each_of_range() iterator for parsing "ranges"")
Reported-by: Jan Palus <[email protected]>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220479
Signed-off-by: Klaus Kudielka <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Tested-by: Tony Dinh <[email protected]>
Tested-by: Jan Palus <[email protected]>
Link: https://patch.msgid.link/[email protected]
Gnurou pushed a commit that referenced this pull request Sep 30, 2025
The function ceph_process_folio_batch() sets folio_batch entries to
NULL, which is an illegal state.  Before folio_batch_release() crashes
due to this API violation, the function ceph_shift_unused_folios_left()
is supposed to remove those NULLs from the array.

However, since commit ce80b76 ("ceph: introduce
ceph_process_folio_batch() method"), this shifting doesn't happen
anymore because the "for" loop got moved to ceph_process_folio_batch(),
and now the `i` variable that remains in ceph_writepages_start()
doesn't get incremented anymore, making the shifting effectively
unreachable much of the time.

Later, commit 1551ec6 ("ceph: introduce ceph_submit_write()
method") added more preconditions for doing the shift, replacing the
`i` check (with something that is still just as broken):

- if ceph_process_folio_batch() fails, shifting never happens

- if ceph_move_dirty_page_in_page_array() was never called (because
  ceph_process_folio_batch() has returned early for some of various
  reasons), shifting never happens

- if `processed_in_fbatch` is zero (because ceph_process_folio_batch()
  has returned early for some of the reasons mentioned above or
  because ceph_move_dirty_page_in_page_array() has failed), shifting
  never happens

Since those two commits, any problem in ceph_process_folio_batch()
could crash the kernel, e.g. this way:

 BUG: kernel NULL pointer dereference, address: 0000000000000034
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 PGD 0 P4D 0
 Oops: Oops: 0002 [#1] SMP NOPTI
 CPU: 172 UID: 0 PID: 2342707 Comm: kworker/u778:8 Not tainted 6.15.10-cm4all1-es torvalds#714 NONE
 Hardware name: Dell Inc. PowerEdge R7615/0G9DHV, BIOS 1.6.10 12/08/2023
 Workqueue: writeback wb_workfn (flush-ceph-1)
 RIP: 0010:folios_put_refs+0x85/0x140
 Code: 83 c5 01 39 e8 7e 76 48 63 c5 49 8b 5c c4 08 b8 01 00 00 00 4d 85 ed 74 05 41 8b 44 ad 00 48 8b 15 b0 >
 RSP: 0018:ffffb880af8db778 EFLAGS: 00010207
 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000003
 RDX: ffffe377cc3b0000 RSI: 0000000000000000 RDI: ffffb880af8db8c0
 RBP: 0000000000000000 R08: 000000000000007d R09: 000000000102b86f
 R10: 0000000000000001 R11: 00000000000000ac R12: ffffb880af8db8c0
 R13: 0000000000000000 R14: 0000000000000000 R15: ffff9bd262c97000
 FS:  0000000000000000(0000) GS:ffff9c8efc303000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000034 CR3: 0000000160958004 CR4: 0000000000770ef0
 PKRU: 55555554
 Call Trace:
  <TASK>
  ceph_writepages_start+0xeb9/0x1410

The crash can be reproduced easily by changing the
ceph_check_page_before_write() return value to `-E2BIG`.

(Interestingly, the crash happens only if `huge_zero_folio` has
already been allocated; without `huge_zero_folio`,
is_huge_zero_folio(NULL) returns true and folios_put_refs() skips NULL
entries instead of dereferencing them.  That makes reproducing the bug
somewhat unreliable.  See
https://lore.kernel.org/[email protected]
for a discussion of this detail.)

My suggestion is to move the ceph_shift_unused_folios_left() to right
after ceph_process_folio_batch() to ensure it always gets called to
fix up the illegal folio_batch state.

Cc: [email protected]
Fixes: ce80b76 ("ceph: introduce ceph_process_folio_batch() method")
Link: https://lore.kernel.org/ceph-devel/[email protected]/
Signed-off-by: Max Kellermann <[email protected]>
Reviewed-by: Viacheslav Dubeyko <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
Gnurou pushed a commit that referenced this pull request Oct 26, 2025
Since ixgbe_adapter is embedded in devlink, calling devlink_free()
prematurely in the ixgbe_remove() path can lead to UAF. Move devlink_free()
to the end.

KASAN report:

 BUG: KASAN: use-after-free in ixgbe_reset_interrupt_capability+0x140/0x180 [ixgbe]
 Read of size 8 at addr ffff0000adf813e0 by task bash/2095
 CPU: 1 UID: 0 PID: 2095 Comm: bash Tainted: G S  6.17.0-rc2-tnguy.net-queue+ #1 PREEMPT(full)
 [...]
 Call trace:
  show_stack+0x30/0x90 (C)
  dump_stack_lvl+0x9c/0xd0
  print_address_description.constprop.0+0x90/0x310
  print_report+0x104/0x1f0
  kasan_report+0x88/0x180
  __asan_report_load8_noabort+0x20/0x30
  ixgbe_reset_interrupt_capability+0x140/0x180 [ixgbe]
  ixgbe_clear_interrupt_scheme+0xf8/0x130 [ixgbe]
  ixgbe_remove+0x2d0/0x8c0 [ixgbe]
  pci_device_remove+0xa0/0x220
  device_remove+0xb8/0x170
  device_release_driver_internal+0x318/0x490
  device_driver_detach+0x40/0x68
  unbind_store+0xec/0x118
  drv_attr_store+0x64/0xb8
  sysfs_kf_write+0xcc/0x138
  kernfs_fop_write_iter+0x294/0x440
  new_sync_write+0x1fc/0x588
  vfs_write+0x480/0x6a0
  ksys_write+0xf0/0x1e0
  __arm64_sys_write+0x70/0xc0
  invoke_syscall.constprop.0+0xcc/0x280
  el0_svc_common.constprop.0+0xa8/0x248
  do_el0_svc+0x44/0x68
  el0_svc+0x54/0x160
  el0t_64_sync_handler+0xa0/0xe8
  el0t_64_sync+0x1b0/0x1b8

Fixes: a028523 ("ixgbe: add initial devlink support")
Signed-off-by: Koichiro Den <[email protected]>
Tested-by: Rinitha S <[email protected]>
Reviewed-by: Jedrzej Jagielski <[email protected]>
Reviewed-by: Aleksandr Loktionov <[email protected]>
Reviewed-by: Paul Menzel <[email protected]>
Signed-off-by: Jacob Keller <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Gnurou pushed a commit that referenced this pull request Oct 26, 2025
Since blamed commit, unregister_netdevice_many_notify() takes the netdev
mutex if the device needs it.

If the device list is too long, this will lock more device mutexes than
lockdep can handle:

unshare -n \
 bash -c 'for i in $(seq 1 100);do ip link add foo$i type dummy;done'

BUG: MAX_LOCK_DEPTH too low!
turning off the locking correctness validator.
depth: 48  max: 48!
48 locks held by kworker/u16:1/69:
 #0: ..148 ((wq_completion)netns){+.+.}-{0:0}, at: process_one_work
 #1: ..d40 (net_cleanup_work){+.+.}-{0:0}, at: process_one_work
 #2: ..bd0 (pernet_ops_rwsem){++++}-{4:4}, at: cleanup_net
 #3: ..aa8 (rtnl_mutex){+.+.}-{4:4}, at: default_device_exit_batch
 #4: ..cb0 (&dev_instance_lock_key#3){+.+.}-{4:4}, at: unregister_netdevice_many_notify
[..]

Add a helper to close and then unlock a list of net_devices.
Devices that are not up have to be skipped - netif_close_many always
removes them from the list without any other actions taken, so they'd
remain in locked state.

Close devices whenever we've used up half of the tracking slots or we
processed entire list without hitting the limit.

Fixes: 7e4d784 ("net: hold netdev instance lock during rtnetlink operations")
Signed-off-by: Florian Westphal <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Gnurou pushed a commit that referenced this pull request Oct 26, 2025
[BUG]
Syzbot reported an ASSERT() triggered inside scrub:

  BTRFS info (device loop0): scrub: started on devid 1
  assertion failed: !folio_test_partial_kmap(folio) :: 0, in fs/btrfs/scrub.c:697
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/scrub.c:697!
  Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
  CPU: 0 UID: 0 PID: 6077 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025
  RIP: 0010:scrub_stripe_get_kaddr+0x1bb/0x1c0 fs/btrfs/scrub.c:697
  Call Trace:
   <TASK>
   scrub_bio_add_sector fs/btrfs/scrub.c:932 [inline]
   scrub_submit_initial_read+0xf21/0x1120 fs/btrfs/scrub.c:1897
   submit_initial_group_read+0x423/0x5b0 fs/btrfs/scrub.c:1952
   flush_scrub_stripes+0x18f/0x1150 fs/btrfs/scrub.c:1973
   scrub_stripe+0xbea/0x2a30 fs/btrfs/scrub.c:2516
   scrub_chunk+0x2a3/0x430 fs/btrfs/scrub.c:2575
   scrub_enumerate_chunks+0xa70/0x1350 fs/btrfs/scrub.c:2839
   btrfs_scrub_dev+0x6e7/0x10e0 fs/btrfs/scrub.c:3153
   btrfs_ioctl_scrub+0x249/0x4b0 fs/btrfs/ioctl.c:3163
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:597 [inline]
   __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
   </TASK>
  ---[ end trace 0000000000000000 ]---

Which doesn't make much sense, as all the folios we allocated for scrub
should not be highmem.

[CAUSE]
Thankfully syzbot has a detailed kernel config file, showing that
CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP is set to y.

And that debug option will force all folio_test_partial_kmap() to return
true, to improve coverage on highmem tests.

But in our case we really just want to make sure the folios we allocated
are not highmem (and they are indeed not). Such incorrect result from
folio_test_partial_kmap() is just screwing up everything.

[FIX]
Replace folio_test_partial_kmap() to folio_test_highmem() so that we
won't bother those highmem specific debuging options.

Fixes: 5fbaae4 ("btrfs: prepare scrub to support bs > ps cases")
Reported-by: [email protected]
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Gnurou pushed a commit that referenced this pull request Oct 26, 2025
…ce tree

Currently, when building a free space tree at populate_free_space_tree(),
if we are not using the block group tree feature, we always expect to find
block group items (either extent items or a block group item with key type
BTRFS_BLOCK_GROUP_ITEM_KEY) when we search the extent tree with
btrfs_search_slot_for_read(), so we assert that we found an item. However
this expectation is wrong since we can have a new block group created in
the current transaction which is still empty and for which we still have
not added the block group's item to the extent tree, in which case we do
not have any items in the extent tree associated to the block group.

The insertion of a new block group's block group item in the extent tree
happens at btrfs_create_pending_block_groups() when it calls the helper
insert_block_group_item(). This typically is done when a transaction
handle is released, committed or when running delayed refs (either as
part of a transaction commit or when serving tickets for space reservation
if we are low on free space).

So remove the assertion at populate_free_space_tree() even when the block
group tree feature is not enabled and update the comment to mention this
case.

Syzbot reported this with the following stack trace:

  BTRFS info (device loop3 state M): rebuilding free space tree
  assertion failed: ret == 0 :: 0, in fs/btrfs/free-space-tree.c:1115
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/free-space-tree.c:1115!
  Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
  CPU: 1 UID: 0 PID: 6352 Comm: syz.3.25 Not tainted syzkaller #0 PREEMPT(full)
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025
  RIP: 0010:populate_free_space_tree+0x700/0x710 fs/btrfs/free-space-tree.c:1115
  Code: ff ff e8 d3 (...)
  RSP: 0018:ffffc9000430f780 EFLAGS: 00010246
  RAX: 0000000000000043 RBX: ffff88805b709630 RCX: fea61d0e2e79d000
  RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
  RBP: ffffc9000430f8b0 R08: ffffc9000430f4a7 R09: 1ffff92000861e94
  R10: dffffc0000000000 R11: fffff52000861e95 R12: 0000000000000001
  R13: 1ffff92000861f00 R14: dffffc0000000000 R15: 0000000000000000
  FS:  00007f424d9fe6c0(0000) GS:ffff888125afc000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fd78ad212c0 CR3: 0000000076d68000 CR4: 00000000003526f0
  Call Trace:
   <TASK>
   btrfs_rebuild_free_space_tree+0x1ba/0x6d0 fs/btrfs/free-space-tree.c:1364
   btrfs_start_pre_rw_mount+0x128f/0x1bf0 fs/btrfs/disk-io.c:3062
   btrfs_remount_rw fs/btrfs/super.c:1334 [inline]
   btrfs_reconfigure+0xaed/0x2160 fs/btrfs/super.c:1559
   reconfigure_super+0x227/0x890 fs/super.c:1076
   do_remount fs/namespace.c:3279 [inline]
   path_mount+0xd1a/0xfe0 fs/namespace.c:4027
   do_mount fs/namespace.c:4048 [inline]
   __do_sys_mount fs/namespace.c:4236 [inline]
   __se_sys_mount+0x313/0x410 fs/namespace.c:4213
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7f424e39066a
  Code: d8 64 89 02 (...)
  RSP: 002b:00007f424d9fde68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
  RAX: ffffffffffffffda RBX: 00007f424d9fdef0 RCX: 00007f424e39066a
  RDX: 0000200000000180 RSI: 0000200000000380 RDI: 0000000000000000
  RBP: 0000200000000180 R08: 00007f424d9fdef0 R09: 0000000000000020
  R10: 0000000000000020 R11: 0000000000000246 R12: 0000200000000380
  R13: 00007f424d9fdeb0 R14: 0000000000000000 R15: 00002000000002c0
   </TASK>
  Modules linked in:
  ---[ end trace 0000000000000000 ]---

Reported-by: [email protected]
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Fixes: a5ed918 ("Btrfs: implement the free space B-tree")
CC: <[email protected]> # 6.1.x: 1961d20: btrfs: fix assertion when building free space tree
CC: <[email protected]> # 6.1.x
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Gnurou pushed a commit that referenced this pull request Oct 26, 2025
Another day, another syzkaller bug. KVM erroneously allows userspace to
pend vCPU events for a vCPU that hasn't been initialized yet, leading to
KVM interpreting a bunch of uninitialized garbage for routing /
injecting the exception.

In one case the injection code and the hyp disagree on whether the vCPU
has a 32bit EL1 and put the vCPU into an illegal mode for AArch64,
tripping the BUG() in exception_target_el() during the next injection:

  kernel BUG at arch/arm64/kvm/inject_fault.c:40!
  Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
  CPU: 3 UID: 0 PID: 318 Comm: repro Not tainted 6.17.0-rc4-00104-g10fd0285305d torvalds#6 PREEMPT
  Hardware name: linux,dummy-virt (DT)
  pstate: 21402009 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
  pc : exception_target_el+0x88/0x8c
  lr : pend_serror_exception+0x18/0x13c
  sp : ffff800082f03a10
  x29: ffff800082f03a10 x28: ffff0000cb132280 x27: 0000000000000000
  x26: 0000000000000000 x25: ffff0000c2a99c20 x24: 0000000000000000
  x23: 0000000000008000 x22: 0000000000000002 x21: 0000000000000004
  x20: 0000000000008000 x19: ffff0000c2a99c20 x18: 0000000000000000
  x17: 0000000000000000 x16: 0000000000000000 x15: 00000000200000c0
  x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
  x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
  x8 : ffff800082f03af8 x7 : 0000000000000000 x6 : 0000000000000000
  x5 : ffff800080f621f0 x4 : 0000000000000000 x3 : 0000000000000000
  x2 : 000000000040009b x1 : 0000000000000003 x0 : ffff0000c2a99c20
  Call trace:
   exception_target_el+0x88/0x8c (P)
   kvm_inject_serror_esr+0x40/0x3b4
   __kvm_arm_vcpu_set_events+0xf0/0x100
   kvm_arch_vcpu_ioctl+0x180/0x9d4
   kvm_vcpu_ioctl+0x60c/0x9f4
   __arm64_sys_ioctl+0xac/0x104
   invoke_syscall+0x48/0x110
   el0_svc_common.constprop.0+0x40/0xe0
   do_el0_svc+0x1c/0x28
   el0_svc+0x34/0xf0
   el0t_64_sync_handler+0xa0/0xe4
   el0t_64_sync+0x198/0x19c
  Code: f946bc01 b4fffe61 9101e020 17fffff2 (d4210000)

Reject the ioctls outright as no sane VMM would call these before
KVM_ARM_VCPU_INIT anyway. Even if it did the exception would've been
thrown away by the eventual reset of the vCPU's state.

Cc: [email protected] # 6.17
Fixes: b7b27fa ("arm/arm64: KVM: Add KVM_GET/SET_VCPU_EVENTS")
Signed-off-by: Oliver Upton <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Gnurou pushed a commit that referenced this pull request Oct 26, 2025
…/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.18, take #1

Improvements and bug fixes:

- Fix the handling of ZCR_EL2 in NV VMs
  ([email protected])

- Pick the correct translation regime when doing a PTW on
  the back of a SEA ([email protected])

- Prevent userspace from injecting an event into a vcpu that isn't
  initialised yet ([email protected])

- Move timer save/restore to the sysreg handling code, fixing EL2 timer
  access in the process ([email protected])

- Add FGT-based trapping of MDSCR_EL1 to reduce the overhead of debug
  ([email protected])

- Fix trapping configuration when the host isn't GICv3
  ([email protected])

- Improve the detection of HCR_EL2.E2H being RES1
  ([email protected])

- Drop a spurious 'break' statement in the S1 PTW
  ([email protected])

- Don't try to access SPE when owned by EL3
  ([email protected])

Documentation updates:

- Document the failure modes of event injection
  ([email protected])

- Document that a GICv3 guest can be created on a GICv5 host
  with FEAT_GCIE_LEGACY ([email protected])

Selftest improvements:

- Add a selftest for the effective value of HCR_EL2.AMO
  ([email protected])

- Address build warning in the timer selftest when building
  with clang ([email protected])

- Teach irq_fd selftests about non-x86 architectures
  ([email protected])

- Add missing sysregs to the set_id_regs selftest
  ([email protected])

- Fix vcpu allocation in the vgic_lpi_stress selftest
  ([email protected])

- Correctly enable interrupts in the vgic_lpi_stress selftest
  ([email protected])
Gnurou pushed a commit that referenced this pull request Oct 26, 2025
Expand the prefault memory selftest to add a regression test for a KVM bug
where KVM's retry logic would result in (breakable) deadlock due to the
memslot deletion waiting on prefaulting to release SRCU, and prefaulting
waiting on the memslot to fully disappear (KVM uses a two-step process to
delete memslots, and KVM x86 retries page faults if a to-be-deleted, a.k.a.
INVALID, memslot is encountered).

To exercise concurrent memslot remove, spawn a second thread to initiate
memslot removal at roughly the same time as prefaulting.  Test memslot
removal for all testcases, i.e. don't limit concurrent removal to only the
success case.  There are essentially three prefault scenarios (so far)
that are of interest:

 1. Success
 2. ENOENT due to no memslot
 3. EAGAIN due to INVALID memslot

For all intents and purposes, #1 and #2 are mutually exclusive, or rather,
easier to test via separate testcases since writing to non-existent memory
is trivial.  But for #3, making it mutually exclusive with #1 _or_ #2 is
actually more complex than testing memslot removal for all scenarios.  The
only requirement to let memslot removal coexist with other scenarios is a
way to guarantee a stable result, e.g. that the "no memslot" test observes
ENOENT, not EAGAIN, for the final checks.

So, rather than make memslot removal mutually exclusive with the ENOENT
scenario, simply restore the memslot and retry prefaulting.  For the "no
memslot" case, KVM_PRE_FAULT_MEMORY should be idempotent, i.e. should
always fail with ENOENT regardless of how many times userspace attempts
prefaulting.

Pass in both the base GPA and the offset (instead of the "full" GPA) so
that the worker can recreate the memslot.

Signed-off-by: Yan Zhao <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Gnurou pushed a commit that referenced this pull request Oct 26, 2025
cxl EDAC calls cxl_feature_info() to get the feature information and
if the hardware has no Features support, cxlfs may be passed in as
NULL.

[   51.957498] BUG: kernel NULL pointer dereference, address: 0000000000000008
[   51.965571] #PF: supervisor read access in kernel mode
[   51.971559] #PF: error_code(0x0000) - not-present page
[   51.977542] PGD 17e4f6067 P4D 0
[   51.981384] Oops: Oops: 0000 [#1] SMP NOPTI
[   51.986300] CPU: 49 UID: 0 PID: 3782 Comm: systemd-udevd Not tainted 6.17.0dj
test+ torvalds#64 PREEMPT(voluntary)
[   51.997355] Hardware name: <removed>
[   52.009790] RIP: 0010:cxl_feature_info+0xa/0x80 [cxl_core]

Add a check for cxlfs before dereferencing it and return -EOPNOTSUPP if
there is no cxlfs created due to no hardware support.

Fixes: eb5dfcb ("cxl: Add support to handle user feature commands for set feature")
Reviewed-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Alison Schofield <[email protected]>
Signed-off-by: Dave Jiang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants