CLOUDSTACK-8915 - Cannot SSH into VMs deployed Redundant VPC routers #908
|
Ping @remibergsma @DaanHoogland @karuturi @miguelaferreira @borisroman @wido Test environment:
Although I had 4 exceptions during the tests because the cleanup failed (which means my environment still had some leftovers from the tests), all the tests passed. The following tests were executed successfully:
nosetests --with-marvin --marvin-config=${marvinCfg} -s -a tags=advanced,required_hardware=true
nosetests --with-marvin --marvin-config=${marvinCfg} -s -a tags=advanced,required_hardware=false
Hardware ==> true
Create a redundant VPC with two networks with two VMs in each network ... === TestName: test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL | Status : SUCCESS ===
ok
Create a redundant VPC with two networks with two VMs in each network and check default routes ... === TestName: test_02_redundant_VPC_default_routes | Status : SUCCESS ===
ok
Test iptables default INPUT/FORWARD policy on RouterVM ... === TestName: test_02_routervm_iptables_policies | Status : SUCCESS ===
ok
Test iptables default INPUT/FORWARD policies on VPC router ... === TestName: test_01_single_VPC_iptables_policies | Status : SUCCESS ===
ok
=== TestName: test_01_single_VPC_iptables_policies | Status : EXCEPTION ===
ERROR
Create a VPC with two networks with one VM in each network and test nics after destroy ... === TestName: test_01_VPC_nics_after_destroy | Status : SUCCESS ===
ok
Create a VPC with two networks with one VM in each network and test default routes ... === TestName: test_02_VPC_default_routes | Status : SUCCESS ===
ok
Exception 1: Exception during cleanup : Execute cmd: deletenetworkoffering failed, due to: errorCode: 431, errorText:Can't delete network offering 20 as its used by 1 networks. To make the network offering unavaiable, disable it
Hardware ==> false
Test router internal advanced zone ... === TestName: test_02_router_internal_adv | Status : SUCCESS ===
ok
Test restart network ... === TestName: test_03_restart_network_cleanup | Status : SUCCESS ===
ok
Test router basic setup ... === TestName: test_05_router_basic | Status : SUCCESS ===
ok
Test router advanced setup ... === TestName: test_06_router_advanced | Status : SUCCESS ===
ok
Test stop router ... === TestName: test_07_stop_router | Status : SUCCESS ===
ok
Test start router ... === TestName: test_08_start_router | Status : SUCCESS ===
ok
Test reboot router ... === TestName: test_09_reboot_router | Status : SUCCESS ===
ok
test_privategw_acl (integration.smoke.test_privategw_acl.TestPrivateGwACL) ... === TestName: test_privategw_acl | Status : SUCCESS ===
ok
Test reset virtual machine on reboot ... === TestName: test_01_reset_vm_on_reboot | Status : SUCCESS ===
ok
Test advanced zone virtual router ... === TestName: test_advZoneVirtualRouter | Status : SUCCESS ===
ok
Test Deploy Virtual Machine ... === TestName: test_deploy_vm | Status : SUCCESS ===
ok
Test Multiple Deploy Virtual Machine ... === TestName: test_deploy_vm_multiple | Status : SUCCESS ===
ok
Test Stop Virtual Machine ... === TestName: test_01_stop_vm | Status : SUCCESS ===
ok
Test Start Virtual Machine ... === TestName: test_02_start_vm | Status : SUCCESS ===
ok
Test Reboot Virtual Machine ... === TestName: test_03_reboot_vm | Status : SUCCESS ===
ok
Test destroy Virtual Machine ... === TestName: test_06_destroy_vm | Status : SUCCESS ===
ok
Test recover Virtual Machine ... === TestName: test_07_restore_vm | Status : SUCCESS ===
ok
Test migrate VM ... SKIP: At least two hosts should be present in the zone for migration
Test destroy(expunge) Virtual Machine ... === TestName: test_09_expunge_vm | Status : SUCCESS ===
ok
Test VPN in VPC ... === TestName: test_vpc_remote_access_vpn | Status : SUCCESS ===
ok
Test VPN in VPC ... === TestName: test_vpc_site2site_vpn | Status : SUCCESS ===
ok
Test to create service offering ... === TestName: test_01_create_service_offering | Status : SUCCESS ===
ok
Test to update existing service offering ... === TestName: test_02_edit_service_offering | Status : SUCCESS ===
ok
Test to delete service offering ... === TestName: test_03_delete_service_offering | Status : SUCCESS ===
ok
Test create VPC offering ... === TestName: test_01_create_vpc_offering | Status : SUCCESS ===
ok
Test VPC offering without load balancing service ... === TestName: test_03_vpc_off_without_lb | Status : EXCEPTION ===
ERROR
Test VPC offering without static NAT service ... === TestName: test_04_vpc_off_without_static_nat | Status : EXCEPTION ===
ERROR
Test VPC offering without port forwarding service ... === TestName: test_05_vpc_off_without_pf | Status : EXCEPTION ===
ERROR
Test VPC offering with invalid services ... === TestName: test_06_vpc_off_invalid_services | Status : SUCCESS ===
ok
Test update VPC offering ... === TestName: test_07_update_vpc_off | Status : SUCCESS ===
ok
Test list VPC offering ... === TestName: test_08_list_vpc_off | Status : SUCCESS ===
ok
test_09_create_redundant_vpc_offering (integration.component.test_vpc_offerings.TestVPCOffering) ... === TestName: test_09_create_redundant_vpc_offering | Status : SUCCESS ===
ok
Test start/stop of router after addition of one guest network ... === TestName: test_01_start_stop_router_after_addition_of_one_guest_network | Status : SUCCESS ===
ok
Test reboot of router after addition of one guest network ... === TestName: test_02_reboot_router_after_addition_of_one_guest_network | Status : SUCCESS ===
ok
Test to change service offering of router after addition of one guest network ... === TestName: test_04_chg_srv_off_router_after_addition_of_one_guest_network | Status : SUCCESS ===
ok
Test destroy of router after addition of one guest network ... === TestName: test_05_destroy_router_after_addition_of_one_guest_network | Status : SUCCESS ===
ok
Test to stop and start router after creation of VPC ... === TestName: test_01_stop_start_router_after_creating_vpc | Status : SUCCESS ===
ok
Test to reboot the router after creating a VPC ... === TestName: test_02_reboot_router_after_creating_vpc | Status : SUCCESS ===
ok
Tests to change service offering of the Router after ... === TestName: test_04_change_service_offerring_vpc | Status : SUCCESS ===
ok
Test to destroy the router after creating a VPC ... === TestName: test_05_destroy_router_after_creating_vpc | Status : SUCCESS ===
ok
Exception 1: Exception during cleanup : Execute cmd: deletenetworkoffering failed, due to: errorCode: 431, errorText:Can't delete network offering 25 as its used by 1 networks. To make the network offering unavaiable, disable it
Exception 2: Exception during cleanup : Execute cmd: deletenetworkoffering failed, due to: errorCode: 431, errorText:Can't delete network offering 26 as its used by 1 networks. To make the network offering unavaiable, disable it
Exception 3: Exception during cleanup : Execute cmd: deletenetworkoffering failed, due to: errorCode: 431, errorText:Can't delete network offering 27 as its used by 1 networks. To make the network offering unavaiable, disable it |
|
@wilderrodrigues I've run the same tests you proposed, on CentOS KVM hosts. They succeeded apart from the same exceptions, so that's 👍 LGTM. I also manually tried to SSH into the VPC, which worked as expected. Regarding the exceptions, see [1]: here the network is created. But in the test the network was never removed, so cleanup would definitely fail [2]. These integration tests should be rewritten so the network is removed inside the test. This could be done now, or in a separate PR, as it doesn't alter system behavior. I do think, however, that you need to amend the last commit title. [1]
[2]
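The cleanup failure described above comes down to ordering: an offering cannot be deleted while a network still uses it. A minimal sketch of that ordering with plain `unittest` follows. `FakeCloud` and its methods are hypothetical in-memory stand-ins invented for illustration; they are not Marvin or CloudStack APIs, and the real tests may handle cleanup differently.

```python
import unittest


class FakeCloud:
    """Hypothetical in-memory stand-in for the API; not Marvin."""

    def __init__(self):
        self.networks = {}      # network name -> offering name
        self.offerings = set()

    def create_offering(self, name):
        self.offerings.add(name)

    def create_network(self, name, offering):
        self.networks[name] = offering

    def delete_network(self, name):
        del self.networks[name]

    def delete_offering(self, name):
        # Mirrors the error in the logs: an offering that is still
        # used by a network cannot be deleted.
        if name in self.networks.values():
            raise RuntimeError("offering %s is used by a network" % name)
        self.offerings.remove(name)


class TestCleanupOrder(unittest.TestCase):
    cloud = FakeCloud()

    def test_network_is_removed_before_offering(self):
        self.cloud.create_offering("off-1")
        # Registered first, so it runs last (cleanups run in LIFO order).
        self.addCleanup(self.cloud.delete_offering, "off-1")
        self.cloud.create_network("net-1", "off-1")
        # Registered last, so it runs first and frees the offering.
        self.addCleanup(self.cloud.delete_network, "net-1")
        self.assertIn("net-1", self.cloud.networks)
```

Registering each cleanup right after the resource is created keeps teardown in reverse creation order, so the network goes away before its offering.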
|
This was recently changed from copy_if_needed to copy in PR 763 / 692 to solve CLOUDSTACK-8725. Are we sure we want to reverse that? Tests are green but there may not be a test for this? Or was this the reason in the first place the redundant VPC was broken? Please double check.
Discussion: #692
Jira: https://issues.apache.org/jira/browse/CLOUDSTACK-8725
PR: https://github.com/apache/cloudstack/pull/763/files
What fixed the RVR was the change in the CsRedundant._collect_ips function. That one was kept; I just fixed the rVPC part of it, which had been changed for no reason.
Concerning the template file (conntrackd.conf.templ), that's just a default conntrackd file, completely commented out. If we copy it on every operation in the routers, conntrackd will restart every time, because it copies the template and afterwards applies the configuration we need:
    # conntrackd configuration
    connt = CsFile(self.CONNTRACKD_CONF)
    if guest is not None:
        connt.section("Multicast {", "}", [
                      "IPv4_address 225.0.0.50\n",
                      "Group 3780\n",
                      "IPv4_interface %s\n" % guest.get_ip(),
                      "Interface %s\n" % guest.get_device(),
                      "SndSocketBuffer 1249280\n",
                      "RcvSocketBuffer 1249280\n",
                      "Checksum on\n"])
        connt.section("Address Ignore {", "}", self._collect_ignore_ips())
        connt.commit()
The copy if needed is fine.
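To illustrate the difference being discussed, here is a minimal sketch of a "copy only when the destination is missing" helper. It is an assumption about the behaviour, not the actual `CsHelper.copy_if_needed` code; the `demo()` function and its file contents are made up for the example.

```python
import os
import shutil
import tempfile


def copy_if_needed(src, dest):
    """Copy src to dest only when dest is missing; return True if copied."""
    if os.path.isfile(dest):
        return False
    shutil.copy(src, dest)
    return True


def demo():
    workdir = tempfile.mkdtemp()
    template = os.path.join(workdir, "conntrackd.conf.templ")
    conf = os.path.join(workdir, "conntrackd.conf")
    with open(template, "w") as handle:
        handle.write("# default, fully commented-out template\n")

    first = copy_if_needed(template, conf)    # conf is seeded once
    with open(conf, "a") as handle:
        # Stands in for the sections generated by the router scripts.
        handle.write("Multicast { ... }\n")
    second = copy_if_needed(template, conf)   # generated conf is left alone

    with open(conf) as handle:
        content = handle.read()
    shutil.rmtree(workdir)
    return first, second, content
```

With an unconditional copy, the second call would overwrite the generated file with the template, the file would count as changed again after the sections are re-applied, and the service would restart on every run.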
OK, thanks!
|
Hi @borisroman Yes, I agree on the cleanup; I mentioned in the PR that it was broken. Since I'm working on improving the tests, I will create another PR with the fix and also with the complete test for egress on isolated networks. What would you suggest for the commit message's title? Cheers, |
|
Hi @remibergsma @borisroman @DaanHoogland @miguelaferreira I executed the RVR test here, see results below: The test above does the following: 1 listNetworks should show the created network in allocated state I don't consider it enough, so I will write a test that checks the conntrackd service in the routers as well. Should the PR wait until the test is ready, or would it be enough if I do it manually now (to close this blocker quicker)? Cheers, |
|
Hi @remibergsma @borisroman @DaanHoogland @miguelaferreira @wido @Runseb , I did the following manual tests:
I then tried the following as well: As you can see, I could not ping google from inside the VM. I then went to the Master router and did: So, no default route on RVR routers. I then added them: After that, I went back to the VM and ping was successful!
This bug is not related to this PR and was not mentioned before; probably nobody tested it. I would suggest creating a separate issue, which should include the fix and a test to cover it. What do you think?
Concerning conntrackd, I did the following: As you can see, the configuration file is not good! And that's the same problem that was reported before. I also created a Redundant VPC in order to double check the conntrackd configuration in the routers. The results were as follows: I will apply the copy stuff and test both rVPC and RVR again. Cheers, |
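The manual check for a missing default route can be sketched as a small parser over `ip route show` output. The sample outputs below are hypothetical and use documentation-range addresses; they are not taken from the routers in this PR.

```python
def has_default_route(route_output):
    """Return True if any line of `ip route show` output is a default route."""
    for line in route_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "default":
            return True
    return False


# Hypothetical sample outputs, for illustration only.
WITHOUT_DEFAULT = (
    "192.0.2.0/24 dev eth2  proto kernel  scope link  src 192.0.2.10\n"
)
WITH_DEFAULT = "default via 192.0.2.1 dev eth2\n" + WITHOUT_DEFAULT
```

A test could feed the router's route table into such a helper and fail when it returns `False`, instead of relying on a ping from inside the VM alone.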
|
Guys, there is also a glitch in test_vpc_redundant.py: it was expected to fail on the new test_routes, because the default route is broken on R-VPC, which needs an issue first and a new PR. The glitch in the test was: Which would pass if we got "100.0% packet loss". I changed it to: I will submit the changes to this PR later today or tomorrow morning. Cheers, |
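The original and corrected assertions are not shown in this conversation, so the following is only a sketch of how such a check can go wrong: a plain substring test for "0% packet loss" also matches "100.0% packet loss". Parsing the percentage avoids that. The sample summary lines are hypothetical.

```python
import re


def packet_loss_percent(ping_output):
    """Extract the packet loss percentage from a ping summary line."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    if match is None:
        raise ValueError("no packet loss summary found")
    return float(match.group(1))


# Hypothetical summary lines, for illustration only.
ALL_LOST = "3 packets transmitted, 0 packets received, 100.0% packet loss"
NONE_LOST = "3 packets transmitted, 3 packets received, 0% packet loss"

# The naive substring check is true even when every packet was lost.
NAIVE_CHECK_ON_ALL_LOST = "0% packet loss" in ALL_LOST
```

Asserting `packet_loss_percent(output) == 0` then fails correctly when the default route is missing.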
|
@remibergsma I see that we currently invoke the main function of configure.py every time update_config is called, that is, whenever a new config is pushed to the VR. The main function tries to process all the components again, even the ones which were not modified. IMHO this is not a good way to update the VR config. We should only configure things which have newly changed. I think we need to fix this, as it is not scalable and redoing the whole config might result in restarting one or many services, as you said is happening with conntrackd. However, I think we can fix the conntrackd issue temporarily by copying the conntrackd config to some non-standard location and running conntrackd with that config file. Currently the config file is copied to a standard location in /etc; doing this will make sure that the conntrackd config will not be there when this script runs for the first time. I do not know of any tests for the conntrackd case or for the default routes case in isolated networks. I think we need to test this manually. |
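One possible shape for "only configure what changed" is to fingerprint each configuration section and run a handler only when the fingerprint differs from the last applied one. This is a sketch under that assumption; the function names, section names, and handler mapping are all hypothetical and are not part of configure.py.

```python
import hashlib
import json


def digest(section):
    """Stable fingerprint of one configuration section."""
    payload = json.dumps(section, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def changed_sections(new_config, applied_digests):
    """Names of sections whose content differs from what was last applied."""
    return sorted(
        name for name, section in new_config.items()
        if applied_digests.get(name) != digest(section)
    )


def apply_config(new_config, applied_digests, handlers):
    """Run handlers only for changed sections and remember their digests."""
    touched = changed_sections(new_config, applied_digests)
    for name in touched:
        handlers[name](new_config[name])
        applied_digests[name] = digest(new_config[name])
    return touched
```

On the first push every section is applied; on later pushes an unchanged conntrackd section would be skipped, so its service would not be restarted.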
|
Hi @remibergsma and @bvbharat, Please have a look at my comment from 2 days ago. I mentioned that the change would indeed create a regression. So this PR won't get merged with the current change to the copy/copy_if_needed stuff. I will change it back and also refactor the code of the copy() function, which currently duplicates copy_if_needed. Concerning the current way the VR is being configured, with the configuration applied every time update_config is called, I agree it's not optimal. Our plan is to stabilise all the VRs in the coming 2 weeks. If we get the BVTs to work as expected, we will then move towards the next phase: refactoring the persistent configuration, which should be available for a 4.6.1 release. Cheers, |
… been configured yet. - In case of rVPC we experienced the wrong route being added to the VPC tiers
…o make it more clear
- The cidr was replaced by the single IP, which broke the feature. - Wait during transition from master to backup otherwise the test fails due to wrong state
- If the file is always copied, it will result in restarting keepalived every time, which makes the routers transit between master/backup
- That's not the place to fix the default routes for redundant VPC,
- Adding tests to cover PF and FW in isolated networks
* Will still add some tests for egress as well
- Add egress tests in order to check if VMs can reach the outside world - Increase the wait when testing redundant routers: they fight to become master - Make sure the clean up is done properly
- It will help to increase coverage of VR use: PF; LB and FW
…dant_on() function is called - Also refactored the copy() function under CsHelper.py
- Due to an issue with VPC routers (CLOUDSTACK-8935) we are not able to destroy networks before destroying the routers - Added a forcestop/destroy routers inside the tearDown to make sure it passes. The issue will be addressed in a separate PR - Make sure the routers list is cleaned after destroy_routers() is called - Populate routers list after the router is recreated
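The commit notes above say that VRRP needs a CIDR rather than a single IP. As a small illustration of the difference, the standard `ipaddress` module can combine an address and a netmask into address/prefix notation. The helper name `to_cidr` and the sample address are hypothetical; the router scripts may build the value differently.

```python
import ipaddress


def to_cidr(ip, netmask):
    """Combine an address and netmask into address/prefix notation."""
    return ipaddress.ip_interface("%s/%s" % (ip, netmask)).with_prefixlen


# A bare IP carries no prefix length; the CIDR form keeps it.
EXAMPLE = to_cidr("192.0.2.10", "255.255.255.0")
```

Passing only the bare address drops the prefix length that the CIDR form carries, which is the kind of information loss the commit describes as breaking the feature.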
|
Hi @remibergsma @bvbharat @karuturi @DaanHoogland @borisroman @wido @miguelaferreira The VRRP code is again compliant with RVR and r-VPC; I just refactored the copy method in order to avoid duplication. Whilst trying to fix the cleanup I found another bug: CLOUDSTACK-8935. So, in order to get the cleanup working, I had to add a step that removes the routers in the tearDown, otherwise the networks cannot be removed. There is also a bug concerning the r-VPC and RVR default routes: CLOUDSTACK-8934. As you can see, I created only one issue for both, but if you think it would be better we can split it. Both issues mentioned above will be tackled in different PRs. Tests were executed against the following environment:
Below the results for the tests that require hardware: The "test_02_redundant_VPC_default_routes" is expected to fail. I already mentioned the bug above. I will add a separate report for the other tests as soon as they are ready. Cheers, |
|
Test environment:
Test results: The exceptions are again related to cleanup, although I haven't changed any of those tests. Perhaps it's something to be investigated. Cheers, |
|
ping @remibergsma @DaanHoogland @borisroman @wido @bvbharat @karuturi @bhaisaab @miguelaferreira Anyone that could give a second LGTM to this PR? We need it to be merged. Thanks in advance. Cheers, |
|
tl;dr LGTM 👍 @wilderrodrigues @remibergsma @karuturi I ran the tests overnight last night. Same specs as Wilder, so I haven't tested xen, vmware, hyperv or ubuntu. The test results from running the integration tests came back positive, except for the 3 pointed out above by Wilder. He created a ticket besides this one as he thinks it's unrelated. See https://issues.apache.org/jira/browse/CLOUDSTACK-8935. Besides the tests which were run, I tested it manually. Deployed ACS with an advanced zone/pod/cluster/hypervisor. Deployed the RVPCR. Started a user VM and tested SSH to that VM. It worked! Conclusion: this specific issue is fixed. Another one was found, but that will be fixed in a separate PR. LGTM 👍 |
|
Hey @remibergsma Any news on the tests? It would be nice to get this PR through before I push the next PR, the one that fixes the https://issues.apache.org/jira/browse/CLOUDSTACK-8934 issue. Cheers, |
|
LGTM. Deployed an environment with this branch and it seems to work fine. The same tests also pass here, so I then tried to test something different. Deployed two VPCs, each with one tier and one VM, and then created a VPN between them. They can now ping each other over the site-to-site VPN. Thanks @wilderrodrigues, will now merge. |
CLOUDSTACK-8915 - Cannot SSH into VMs deployed Redundant VPC routers
In order to reproduce the problem, I did the following
* Create a Redundant VPC
* Add a tier
* Add a new VM to the tier
* Add an ACL, open port 22 and associate the ACL with the tier
* Acquire a pub IP
* Add a PF rule to port 22 towards the VM
* Try to SSH to the VM through the Pub IP
It failed with "No route to host".
This PR contains the following:
* Fix for the keepalived (vrrp) configuration;
* Refactor the default router code for both isolated and [r]VPC routers
* Revert CsRedundant changes
* Add default route tests
* Add logging to tests - so we see what's happening during test execution.
* pr/908:
CLOUDSTACK-8915 - Making sure cleanup resources passes
CLOUDSTACK-8915 - Fix the assertion used for the default routes test
CLOUDSTACK-8915 - Copy the conntrackd configuration every time _redundant_on() function is called
CLOUDSTACK-8915 - This test is still under construction
CLOUDSTACK-8915 - Adding logging to tests
CLOUDSTACK-8915 - Improve routers tests
CLOUDSTACK-8915 - Reverting changes from commit id 1a02773
CLOUDSTACK-8915 - Reverting changes from commit id 18dbc0c
CLOUDSTACK-8915 - VRRP needs a cidr in order to work properly
CLOUDSTACK-8915 - Rearrenging a bit the default route code in order to make it more clear
CLOUDSTACK-8915 - Add the default route only on address that have not been configured yet.
Signed-off-by: Remi Bergsma <[email protected]>
|
Nice test, @remibergsma ! Thanks! |
Signed-off-by: Rohit Yadav <[email protected]>

