Preventing containers from being unable to be deleted #4757
I was unable to add integration tests for this PR without resorting to some hacky methods, but I tested whether this issue was resolved in the kubernetes-sigs/kind repository. In brief, I discovered this issue while working in the kubernetes/kubernetes repo to propagate kubelet's context to the container runtime. The issue manifested as the test job being unable to tear down after the k/k repo's e2e tests completed, because of the leaked processes. Therefore, I opened a PR in the kubernetes-sigs/kind repo to debug this issue by manually replacing the containerd/runc binaries in the CI environment. After building the code from this PR and replacing the binaries in the CI environment, the test job no longer failed to tear down due to systemd being unable to shut down, as the leaked processes were resolved. Ref: kubernetes-sigs/kind#3903 (Some job failures occurred due to the instability of the k/k repo e2e tests, but they are unrelated to this issue.)

I also conducted some manual tests targeting the scenarios where the leftover container is in the paused and stopped states.

Paused: inject a sleep so we can control where the code is interrupted:

```diff
diff --git a/vendor/github.com/opencontainers/cgroups/systemd/v1.go b/vendor/github.com/opencontainers/cgroups/systemd/v1.go
index 8453e9b4..bbe3524c 100644
--- a/vendor/github.com/opencontainers/cgroups/systemd/v1.go
+++ b/vendor/github.com/opencontainers/cgroups/systemd/v1.go
@@ -6,6 +6,7 @@ import (
 	"path/filepath"
 	"strings"
 	"sync"
+	"time"
 
 	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
 	"github.com/sirupsen/logrus"
@@ -361,6 +362,7 @@ func (m *LegacyManager) Set(r *cgroups.Resources) error {
 		}
 	}
 	setErr := setUnitProperties(m.dbus, unitName, properties...)
+	time.Sleep(time.Second * 30)
 	if needsThaw {
 		if err := m.doFreeze(cgroups.Thawed); err != nil {
 			logrus.Infof("thaw container after SetUnitProperties failed: %v", err)
```

Stopped: inject a sleep so we can control where the code is interrupted:

```diff
diff --git a/libcontainer/process_linux.go b/libcontainer/process_linux.go
index 96e3ca5f..350e3660 100644
--- a/libcontainer/process_linux.go
+++ b/libcontainer/process_linux.go
@@ -613,6 +613,7 @@ func (p *initProcess) start() (retErr error) {
 			return fmt.Errorf("unable to apply cgroup configuration: %w", err)
 		}
 	}
+	time.Sleep(time.Second * 30)
 	if p.intelRdtManager != nil {
 		if err := p.intelRdtManager.Apply(p.pid()); err != nil {
 			return fmt.Errorf("unable to apply Intel RDT configuration: %w", err)
```
See also: #2575
@HirazawaUi thanks! So my comment was spot-on, but you didn't need to remove the assignment? For testing, I'd like to have something. It should be simple and kind of reliable. Here are some ideas, but we don't need a test if we can't find a reasonable and simple way to test this:
I believe that removing this assignment and delaying the assignment process until after:

- runc/libcontainer/container_linux.go, lines 893 to 895 (at a4b9868)
- runc/libcontainer/container_linux.go, lines 905 to 913 (at a4b9868)
I will try testing it in the direction of Suggestion 2 (it seems the most effective). If it cannot be implemented, I will promptly provide feedback here :)
Test case has been added. Compared to event monitoring, this approach better aligns with the scenario we encountered and is completely asynchronous. The only downside seems to be its fragility, but I added numerous safeguards against that.

@rata Do you think this test case sufficiently covers the scenarios for this PR?
ping @kolyshkin @AkihiroSuda @rata Could you take another look at this PR? Any feedback would be greatly appreciated.
kolyshkin
left a comment
(Sorry, had some pending review comments which I forgot to submit)
Also, you need a proper name/description for the second commit. Currently it just says "add integration test" which is enough in the context of this PR, but definitely not enough when looking at git history.
rata
left a comment
LGTM
@HirazawaUi you need to sign off the commits before we merge. Can you do that?
Signed-off-by: HirazawaUi <[email protected]>
Thanks for the reminder, signed.
Oh, @lifubang requested changes, although he wrote to feel free to ignore them if we want to merge the test; that is why I wanted to do that. @HirazawaUi Can you remove the test, then? That way we can easily merge now. You can open another PR with the test, if you want. Although I feel we need to explore more options to have a reliable test (and I'm not sure it's worth it? Maybe it is). Something that might work is using seccomp notify and making runc hang in some syscall, but that is also fragile. Maybe if we use a rare syscall, only when compiled with some build tags (like the access syscall), and then compile and run runc like that for the tests. The test needs more thought, definitely :)
Removed. While I'd prefer to keep it given the considerable effort invested in designing and implementing this testing approach, I respect the consensus to remove it. Perhaps the journey of exploration matters more than the outcome itself.
@lifubang PTAL
Oh, yeah, I can dismiss the review but let's just wait for @lifubang to take another look...
@HirazawaUi Would you like this change to be backported to release-1.3?
I'd be very happy to backport this to still-maintained older releases; I'll do this tomorrow :)
runc v1.3.1 includes several bug fixes: https://github.com/opencontainers/runc/releases/tag/v1.3.1

* opencontainers/runc#4645
* opencontainers/runc#4757

These two changes are important for our use cases.

Signed-off-by: yongxiu <[email protected]>
This is follow-up to PR #4645, I am taking over from @jianghao65536 to continue addressing this issue.
If the `runc create` process is terminated due to receiving a SIGKILL signal, it may lead to the `runc init` process leaking due to issues like cgroup freezing, and it cannot be cleaned up by `runc delete`/`runc stop` because the container lacks a `state.json` file. This typically occurs when higher-level container runtimes terminate the `runc create` process due to context cancellation or timeout.

In PR #4645, the `Creating` state was added to clean up processes in the
`STAGE_PARENT`/`STAGE_CHILD` stage within the cgroup. This PR no longer adds the `Creating` state for the following reasons. Although `runc init` `STAGE_PARENT`/`STAGE_CHILD` may exist simultaneously when `runc create` receives the SIGKILL signal, after `runc create` terminates, `STAGE_PARENT`/`STAGE_CHILD` will also terminate due to the termination of `runc create`:

- `STAGE_PARENT` uses `pipenum` to communicate with `runc create`. When `runc create` terminates, `pipenum` is closed, causing `STAGE_PARENT` to fail when reading/writing to `pipenum`, triggering bail and termination.
- `STAGE_CHILD` uses `syncfd` to communicate with `STAGE_PARENT`. When `STAGE_PARENT` terminates, `syncfd` is closed, causing `STAGE_CHILD` to fail when reading/writing to `syncfd`, triggering bail and termination.

If the
`runc create` process is terminated during execution, the container may be in one of the following states:

- If `runc create` receives the SIGKILL signal during the process of setting the cgroup, the container will be in a paused state. At this point, the `runc init` process becomes a zombie process and cannot be killed. However, `pausedState.destroy` will thaw the cgroup and terminate the `runc init` process.
- If `runc create` is terminated in the `STAGE_PARENT` -> `STAGE_CHILD` phase, the container will be in a stopped state. As described above, `STAGE_PARENT`/`STAGE_CHILD` will terminate due to the termination of `runc create`, so no processes will be left behind. We only need to clean up the remaining cgroup files, and `stoppedState.destroy` will handle this cleanup.
Therefore, based on the above reasons, the existing `paused` and `stopped` states are sufficient to handle the abnormal termination of `runc create` due to a SIGKILL signal.