@DennisKonrad
Contributor

@DennisKonrad DennisKonrad commented Sep 23, 2019

Description

We noticed that VMs that are the target of a static NAT cannot use private gateways.

Investigation showed that these VMs behave differently from all other VMs in a VPC: all of their traffic gets marked and is only allowed to leave through the public IP provided by the static NAT.

Could be related to: #3366
Maybe @richardlawley @rhtyd can contribute if this is an unwanted side-effect.

To avoid breaking anything, we propose a solution that marks only the NAT portion of the traffic and lets the VM use the private gateway as normal.
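
As a rough illustration of the difference, here is a minimal Python sketch. It is not the actual router code: the `Packet` type, both helper functions, and all addresses are hypothetical, and it ignores connection tracking entirely. It only models which packets a rule keyed on the VM's internal source address would mark, compared with a rule keyed on the public IP as destination.

```python
from dataclasses import dataclass
from ipaddress import ip_address


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


def marked_by_source_rule(pkt: Packet, internal_ip: str) -> bool:
    """Rule keyed on the VM's internal address as source: marks all traffic from the VM."""
    return ip_address(pkt.src) == ip_address(internal_ip)


def marked_by_destination_rule(pkt: Packet, public_ip: str) -> bool:
    """Rule keyed on the static NAT public IP as destination: marks only inbound NAT traffic."""
    return ip_address(pkt.dst) == ip_address(public_ip)


# Hypothetical addresses, chosen only for the example.
INTERNAL_IP = "10.1.1.5"      # the VM behind the static NAT
PUBLIC_IP = "203.0.113.10"    # the static NAT public IP

to_private_gw = Packet(src=INTERNAL_IP, dst="172.16.0.20")    # VM -> host behind the private gateway
inbound_nat = Packet(src="198.51.100.7", dst=PUBLIC_IP)       # internet -> public IP

for name, pkt in (("to_private_gw", to_private_gw), ("inbound_nat", inbound_nat)):
    print(
        name,
        "source rule:", marked_by_source_rule(pkt, INTERNAL_IP),
        "destination rule:", marked_by_destination_rule(pkt, PUBLIC_IP),
    )
```

In this model the source-keyed rule also marks the packet headed for the private gateway, while the destination-keyed rule leaves it unmarked.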

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

How Has This Been Tested?

We built this and tested it manually to check that the functionality is restored and nothing else breaks.

@richardlawley
Contributor

I doubt this was introduced in #3366 - all I did there was change to the proper way of adding rules to the front. Before this change, rules were still added to the front, but this happened multiple times as the code couldn't detect it as a duplicate rule.

@DennisKonrad
Contributor Author

Thank you for helping to find the source of this issue, @richardlawley.
We noticed that the behaviour of the NATed VM changed somehow in the past year. Our systemvm where the private gateways were working (for the NATed VM) was about a year old. Maybe something else changed that I did not find/see?

Or could the issue solved by #3366 have caused the packet mark not to work somehow?

@richardlawley
Contributor

I'm not sure. I made a few changes around the same time for static nat, especially when there are multiple public networks, but some of these were in code branches which don't get hit on a vpc.

Unfortunately I'm only running 4.11.2 in production, but have all of my fixes ported into our systemvm iso. If I get a chance I'll see if I can reproduce the problem.

@DennisKonrad
Contributor Author

That would be a great help. I think most CloudStack users don't use private gateways, so maybe this NAT issue went unnoticed?

What do you think of the proposed changes? Do you think it breaks functionality?

@rohityadavcloud rohityadavcloud changed the base branch from master to 4.13 September 25, 2019 05:39
@rohityadavcloud rohityadavcloud changed the base branch from 4.13 to master September 25, 2019 05:39
@DennisKonrad
Contributor Author

DennisKonrad commented Oct 7, 2019

@rhtyd I looked into the failed test and I think it failed because of:

deletenetworkoffering failed, due to: errorCode: 431, errorText:Can't delete network offering 75 as its used by 1 networks. To make the network offering unavaiable, disable it

It seems this is not caused by the changes in this PR. Can you confirm?

@DaanHoogland
Contributor

should this go on 4.13 as well, @andrijapanicsb @DennisKonrad ?

@andrijapanicsb
Contributor

Makes sense @DaanHoogland

@andrijapanicsb
Contributor

@DennisKonrad I would like to test this manually.
Can you please give an EXACT explanation of the setup needed to reproduce the issue (create this, create that, etc.)?

Thanks!

@DennisKonrad
Contributor Author

DennisKonrad commented Jan 20, 2020

@andrijapanicsb I'll describe the setup to you so you can reproduce it. In short:

If you use VPC with private Gateways and static NAT, the one VM that the NAT points to isn't able to use the private gateway anymore.
The private gateway is used to connect to an outside network (with static routes).

This worked up to a change that I cannot pin down exactly; after we updated, the NATed VM lost its connection to the machines behind the private gateway. Therefore the fix.
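
To make the failure mode concrete, here is a small Python sketch of policy-based route selection. Everything in it is an assumption for illustration: the table contents, the networks, and the `lookup` and `next_hop` helpers are hypothetical and do not reflect the real routing tables on the router. The idea is that a marked packet is looked up in a per-public-interface table that has no static route to the outside network, so it never reaches the private gateway.

```python
from ipaddress import ip_address, ip_network

# Hypothetical routing tables as (destination network, next hop label) pairs.
MAIN_TABLE = [
    ("192.168.50.0/24", "private gateway"),    # static route to the outside network
    ("0.0.0.0/0", "public default gateway"),
]
PUBLIC_NIC_TABLE = [
    ("0.0.0.0/0", "public default gateway"),   # table assumed to be selected by the packet mark
]


def lookup(table, dst):
    """Longest-prefix match over a list of (network, next_hop) entries."""
    matches = [
        (ip_network(net), hop)
        for net, hop in table
        if ip_address(dst) in ip_network(net)
    ]
    return max(matches, key=lambda m: m[0].prefixlen)[1]


def next_hop(dst, marked):
    """Marked packets use the public NIC table; unmarked packets use the main table."""
    table = PUBLIC_NIC_TABLE if marked else MAIN_TABLE
    return lookup(table, dst)


# A host behind the private gateway (hypothetical address).
remote_host = "192.168.50.20"
print("marked:  ", next_hop(remote_host, marked=True))
print("unmarked:", next_hop(remote_host, marked=False))
```

Under these assumptions, the same destination resolves to the public default gateway when the packet is marked and to the private gateway when it is not, which matches the symptom described above.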

@rohityadavcloud rohityadavcloud changed the base branch from master to 4.13 January 28, 2020 05:17
@rohityadavcloud rohityadavcloud changed the base branch from 4.13 to master January 28, 2020 05:17
@rohityadavcloud
Member

@DennisKonrad can you change the PR base to 4.13 and rebase against origin/4.13?

@DaanHoogland
Contributor

@DennisKonrad can you change the PR base to 4.13 and rebase against origin/4.13?

and mark milestone 4.13.1, please?

@DennisKonrad DennisKonrad changed the base branch from master to 4.13 January 28, 2020 09:18
@DennisKonrad DennisKonrad modified the milestones: 4.14.0.0, 4.13.1.0 Jan 28, 2020
@DennisKonrad
Contributor Author

@DaanHoogland @rhtyd After a lot of pain:
Rebased to origin/4.13
Changed PR base to 4.13
Changed Milestone to 4.13

@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
Contributor

@DaanHoogland DaanHoogland left a comment

changing the rule on source/internal to destination/public makes sense in view of the report. code looks good

@apache apache deleted a comment from rohityadavcloud Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from rohityadavcloud Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from andrijapanicsb Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-691

@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-844)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 31585 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3604-t844-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_internal_lb.py
Intermittent failure detected: /marvin/tests/smoke/test_routers_network_ops.py
Intermittent failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Smoke tests completed. 77 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@DaanHoogland
Contributor

tests look good, second pair of eyes, @andrijapanicsb @wido @rhtyd @weizhouapache ?

Member

@weizhouapache weizhouapache left a comment

LGTM

tested manually
the issue can be reproduced, and fixed with this change.

Contributor

@wido wido left a comment

LGTM

@DaanHoogland DaanHoogland merged commit 82d94a8 into apache:4.13 Jan 30, 2020
@DaanHoogland
Contributor

tnx guys

DaanHoogland added a commit that referenced this pull request Jan 30, 2020
* 4.13:
  Fix Policy Based Routing for private gateway static routes (#3604)
ustcweizhou pushed a commit to ustcweizhou/cloudstack that referenced this pull request Feb 28, 2020
* Fix for routing table issue with NAT interfaces

* Mark only packets with the public ip as destination
ustcweizhou added a commit to ustcweizhou/cloudstack that referenced this pull request Nov 19, 2020
ustcweizhou added a commit to ustcweizhou/cloudstack that referenced this pull request Nov 23, 2020
qrry pushed a commit to qrry/cloudstack that referenced this pull request Dec 4, 2020
* master: (25 commits)
  integration test: skip vlan of public ip range in get_free_vlan
  vpc vr: plugin nics by this order: public/private/guest
  vpc vr: fix Conflicting device id on private gw nic
  Adding zone name to physicalnetworkresponse (apache#4510)
  Disallowing udp for lb rules for haproxy (apache#4501)
  Make global setting non-dynamic (apache#4505)
  Adding cpuallocated percentage and value to host and hostsformigrationresponse (apache#4499)
  kvm: fix router.aggregation.command.each.timeout is reset to 600 when update other kvm configs (apache#4496)
  fix failures with test_multiple_nic_support.py (apache#4495)
  Fix hosts for migration count (apache#4500)
  sql: Fix Zones are returned in a random order (apache#3934) (apache#4494)
  integration test: update steps
  integration test: add private gateway in test
  integration test: verify public nics state
  bugfix apache#9 vpc vr: Add PREROUTING rule for vm with static nat to multiple private gateways
  bugfix apache#8 vpc: add rule for traffic between vm and private gateway
  bugfix apache#7 vpc vr: allow servers in private gateway to reach internet via the VPC VR if it is gateway
  bugfix apache#6 vpc vr: Add iptables rules for ACL of private gateway
  Revert "Fix Policy Based Routing for private gateway static routes (apache#3604)"
  Revert "Add private gateway IP to router initialization config"
  ...

8 participants