
Conversation

@rohityadavcloud
Member

@rohityadavcloud rohityadavcloud commented Jun 5, 2024

Description

Followup from #8782

  • Changes behaviour of details param handling via global setting:
    • listVirtualMachines API: when the details param is not provided, whether stats are returned is controlled by a new global setting list.vm.default.details.stats
    • listVirtualMachinesMetrics API: when the details param is not provided, it uses all details including stats
  • Users affected by slow listVirtualMachines API response times can set list.vm.default.details.stats to false
  • Remove ConfigKey vm.stats.increment.metrics.in.memory which was renamed to vm.stats.increment.metrics in Persistence of VM stats #5984 and also remove unused/unnecessary global settings via upgrade path
  • Changes default value of VM stats accumulation setting vm.stats.increment.metrics to false until a better solution emerges. Since Persistence of VM stats #5984, this is true and during the execution of listVM APIs the stats are clubbed/calculated which can immensely slow down list VM API calls. Any costly operations such as summing of stats shouldn't be done during the course of a synchronous API, such as the list VM API.
  • Fix UI that uses listVirtualMachinesMetrics to not call stats detail when in list view without metrics selected.
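The default-details resolution described above can be sketched roughly as follows. This is an illustrative simplification, not CloudStack's actual Java code; the setting `list.vm.default.details.stats` is the one this PR introduces, but the helper name and the detail names other than `stats` are hypothetical:

```python
# Illustrative sketch (not CloudStack's actual code) of how the two list APIs
# could resolve the effective "details" when the caller omits the param.

ALL_DETAILS = {"group", "nics", "secgrp", "volume", "affgrp", "stats"}

def resolve_details(requested, is_metrics_api, default_stats_setting):
    """requested: iterable of details from the API call, or None if omitted."""
    if requested is not None:
        return set(requested)            # explicit details param always wins
    if is_metrics_api:
        return set(ALL_DETAILS)          # listVirtualMachinesMetrics: all incl. stats
    details = set(ALL_DETAILS)
    if not default_stats_setting:        # list.vm.default.details.stats = false
        details.discard("stats")         # skip the costly stats collection
    return details
```

So with the setting set to false, a plain listVirtualMachines call without a details param would skip stats, while listVirtualMachinesMetrics would still include them.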

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

- Changes behaviour of details param handling via global setting:
  - listVirtualMachines API: when the details param is not provided, it returns `all` details, with `stats` included or excluded as controlled by a new global setting `list.vm.default.details.stats`
  - listVirtualMachinesMetrics API: when the details param is not provided, it uses `all` details including `stats`
- Users who are affected by the stats-related change can restore backward compatibility, at the cost of higher listVirtualMachines API response times, by setting `list.vm.default.details.stats` to true
- Remove ConfigKey vm.stats.increment.metrics.in.memory which was renamed to `vm.stats.increment.metrics` in apache#5984 and also remove unused/unnecessary global settings via upgrade path
- Changes default value of VM stats accumulation setting `vm.stats.increment.metrics` to false until a better solution emerges. Since apache#5984, this is true and during the execution of listVM APIs the stats are clubbed/calculated which can immensely slow down list VM API calls.
- Fix UI that uses listVirtualMachinesMetrics to not call `stats` detail when in list view without metrics selected.

Signed-off-by: Rohit Yadav <[email protected]>
@rohityadavcloud
Member Author

@blueorangutan package

@blueorangutan

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9796

@rohityadavcloud
Member Author

@blueorangutan test

@blueorangutan

@rohityadavcloud a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@codecov

codecov bot commented Jun 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 14.96%. Comparing base (631d6ad) to head (1efbefc).
Report is 31 commits behind head on 4.19.

Additional details and impacted files
@@            Coverage Diff             @@
##               4.19    #9177    +/-   ##
==========================================
  Coverage     14.96%   14.96%            
- Complexity    10991    11011    +20     
==========================================
  Files          5373     5377     +4     
  Lines        469203   469574   +371     
  Branches      60225    60756   +531     
==========================================
+ Hits          70198    70278    +80     
- Misses       391232   391513   +281     
- Partials       7773     7783    +10     
Flag Coverage Δ
uitests 4.29% <ø> (-0.01%) ⬇️
unittests 15.67% <100.00%> (+<0.01%) ⬆️


Contributor

@DaanHoogland DaanHoogland left a comment


clgtm

@blueorangutan

[SF] Trillian test result (tid-10375)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 49031 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9177-t10375-kvm-centos7.zip
Smoke tests completed. 131 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@GutoVeronezi
Contributor

-1 on changing the list.vm.default.details.stats default value. Refer to the discussion in #8782.

@rohityadavcloud rohityadavcloud requested a review from shwstppr June 10, 2024 10:50
@rohityadavcloud
Member Author

@GutoVeronezi this isn't a vote; could you explain why you're opposing the default value change now, when we agreed in #8782 to address this via documentation of the change in the release notes?

@GutoVeronezi
Contributor

@rohityadavcloud, let me build the story line for you:

We (the community) did not agree on changing the default behavior, and I am -1 because of that.

What bothers me in this whole situation is that we do not have a well-defined and documented policy for introducing disruptive features, removing deprecated methods, technologies, and designs, changing default behaviors, and so on. It is insane that some disruptive changes are introduced while others are not, just because a few want it; it is not a well-defined and documented process, it is a subjective process that is confusing, volatile, unpredictable, and not scalable, which is not healthy for a community. We are having a discussion about it at #8970 (which almost everybody seems to ignore). Thus, it is important for everybody to participate so the community can agree on something.

@DaanHoogland
Contributor

@GutoVeronezi , I understand your point, but we do not have a simple usable procedure yet. The implicit procedure was "we don't introduce any backwards incompatibility, .. unless we all agree". And so in this case you are right. That does not remove the need to decide on individual issues.
Maybe @rohityadavcloud is a bit quick at the gun, but it makes sense to first introduce a setting and then later reverse the default. We have done it like that in the past. Maybe we can add this to #8970 (if it is not included yet).

So how do you propose we take this forward? cc @JoaoJandre @sureshanaparti @weizhouapache

@GutoVeronezi
Contributor

> @GutoVeronezi , I understand your point, but we do not have a simple usable procedure yet. The implicit procedure was "we don't introduce any backwards incompatability, .. unless we all agree". And so in this case you are right. That does not remove the need to decide on individual issues. Maybe @rohityadavcloud is a bit quick art the gun, but it makes sense to first introduce a setting and than later reverse the default. We have done it like that in the past. Maybe we can add this to #8970 (if it is not included yet).
>
> So how do you propose we take this forward? cc @JoaoJandre @sureshanaparti @weizhouapache

As we already have a configuration to change the behavior, we should keep the current behavior and guide people through the release notes on how to change it. Then, we should continue the #8970 discussion. Once a process is defined and documented, we can look back to the configuration and change the default according to the policy we defined (or even remove the configuration if we decide so). Other systems do that way, like GitLab and OpenStack; the problem is that we do not have this defined and documented yet.

@rohityadavcloud
Member Author

rohityadavcloud commented Jun 12, 2024

@GutoVeronezi by your own admission I changed my stand, closed my PR to support Joao's, and built support and consensus with others on Joao's PR and merged his PR. I also don't appreciate some of your remarks and behaviour towards me. It's unfair to me when I'm genuinely trying to help and support everybody and anybody in the community.

I don't see anything wrong with somebody changing their stand when presented with new information or convincing arguments. My reason for changing my stand wrt the default behaviour is that most users don't consume stats (outside of the UI) via the list APIs, but many integrations such as the Terraform provider, k8s-provider, CAPC, users' Ansible scripts, etc. (point of note: these integrations can't be changed) call the list VM APIs and can benefit from any potential speed-up. Additionally, there were changes in my previously closed PR #8985 that weren't in #8782, so I had to open a PR with them anyway.

@GutoVeronezi @JoaoJandre and anyone else - what are your reasons for changing your stand wrt the default behaviour now?

For example, look at the two user reported issues that also contributed to change my stand:
#8975
#7910

I think we introduced the behaviour, and we can discuss and agree whether we want to change it if that is largely beneficial to users. I don't think it would be detrimental for a large number of users if they got list API performance improvements and still had a global setting to revert to the old behaviour if they wanted.

That said, I don't have the energy or bandwidth to fight over this, if this is the only thing blocking this I can change the global setting. Have a look, review and advise - @JoaoJandre @DaanHoogland @weizhouapache @sureshanaparti @GutoVeronezi @shwstppr and others...

Contributor

@shwstppr shwstppr left a comment


Code LGTM

  • Slow VM listing needs to be addressed as it has already been reported in various forms
  • I don't see changing the value for a boolean global config as a big breaking change if we plan to document it properly. Users can revert it after upgrade if they wish to continue using incremental stats.
  • If we did not agree on changing config value in the past we can discuss it now with pros and cons
  • I don't think it makes sense to block this change till we decide on #8970. It deals with wider discussion and there are no clear outcomes in the past 45 days or so

@GutoVeronezi
Contributor

@rohityadavcloud, I did not judge your change of opinion. If you read it carefully, you will see that what I am doing is pointing out that there was no consensus on the change of the default behavior in the whole discussion. There were no direct attacks on anyone, I am just making the timeline clear.

We would like to introduce some breaking changes and changes to default behavior, but we do not have a documented process for this, and doing it on a case-by-case basis is prone to heated discussions and unfair decisions. Discussion #8970 is there to make these processes well-defined and documented so we can avoid this kind of discussions in the future.

@shwstppr

If more people interact there, we can get some outcomes 😃

@rohityadavcloud
Member Author

Amusing, we're going in circles now and obstinately determined about it.

I've tried my part, got non-technical and discourteous reasons to block the PR, coerced to unrelated agenda. I'm also flattered some want my attention and validation so bad.

I'm gonna sit on the fence on this one, for now; save my time and energy.

list.vm.default.details.stats defaults to true now.

@blueorangutan package

@blueorangutan

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@rohityadavcloud
Member Author

@blueorangutan ui

@blueorangutan

@rohityadavcloud a Jenkins job has been kicked to build UI QA env. I'll keep you posted as I make progress.

@blueorangutan

UI build: ✔️
Live QA URL: https://qa.cloudstack.cloud/simulator/pr/9177 (QA-JID-373)

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9914

@rohityadavcloud
Member Author

Requesting re-review as code has changed a bit

@blueorangutan test

@blueorangutan

@rohityadavcloud a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

Contributor

@DaanHoogland DaanHoogland left a comment


code still lgtm

@blueorangutan

[SF] Trillian Build Failed (tid-10429)

@rohityadavcloud
Member Author

@blueorangutan test

@blueorangutan

@rohityadavcloud a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

Contributor

@shwstppr shwstppr left a comment


Code LGTM

@blueorangutan

[SF] Trillian test result (tid-10430)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 43903 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9177-t10430-kvm-centos7.zip
Smoke tests completed. 130 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_02_trigger_shutdown Failure 341.76 test_safe_shutdown.py

@rohityadavcloud rohityadavcloud merged commit 2ca0857 into apache:4.19 Jun 14, 2024
@rohityadavcloud rohityadavcloud deleted the metrics-api-improvements branch June 14, 2024 05:33
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Jun 17, 2024
apache#9177)

- Changes behaviour of details param handling via global setting:
  - listVirtualMachines API: when the details param is not provided, it returns whether stats are returned controlled by a new global setting `list.vm.default.details.stats`
  - listVirtualMachinesMetrics API: when the details param is not provided, it uses `all` details including `stats`
- Users who are affected slow performance of the listVirtualMachines API response time can set `list.vm.default.details.stats` to `false`
- Remove ConfigKey vm.stats.increment.metrics.in.memory which was renamed to `vm.stats.increment.metrics` in apache#5984 and also remove unused/unnecessary global settings via upgrade path
- Changes default value of VM stats accumulation setting `vm.stats.increment.metrics` to false until a better solution emerges. Since apache#5984, this is true and during the execution of listVM APIs the stats are clubbed/calculated which can immensely slow down list VM API calls. Any costly operations such as summing of stats shouldn't be done during the course of a synchronous API, such as the list VM API.
- Fix UI that uses listVirtualMachinesMetrics to not call `stats` detail when in list view without metrics selected.

Signed-off-by: Rohit Yadav <[email protected]>
weizhouapache pushed a commit to shapeblue/cloudstack that referenced this pull request Mar 4, 2025
…Pools (apache#446)

Following changes and improvements have been added:
- Allows configuring connection pool library for database connection. As default, replaces dbcp2 connection pool library with more performant HikariCP.
db.<DATABASE>.connectionPoolLib property can be set in the db.properties to use the desired library.

> Set dbcp for using DBCP2
> Set hikaricp for using HikariCP
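As an illustration, a `db.properties` fragment selecting HikariCP for the `cloud` database might look like the following. Only the `db.<DATABASE>.connectionPoolLib` property pattern comes from the description above; the fragment itself is illustrative:

```
# db.properties (fragment) - illustrative
# Select the connection pool library for the "cloud" database
# (set to "dbcp" to keep the legacy DBCP2 pool)
db.cloud.connectionPoolLib=hikaricp
```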

- Improvements in handling of PingRoutingCommand
   
    1. Added global config - `vm.sync.power.state.transitioning`, default value: true, to control syncing of power states for transitioning VMs. This can be set to false to prevent computation of transitioning state VMs.
    2. Improved VirtualMachinePowerStateSync to allow power state sync for host VMs in a batch
    3. Optimized scanning stalled VMs

- Added option to set worker threads for capacity calculation using config - `capacity.calculate.workers`
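The worker-thread approach for capacity calculation can be sketched as below. This is an illustrative Python sketch, not CloudStack's Java implementation; the function names are hypothetical, and only the idea of a bounded pool sized by a setting like `capacity.calculate.workers` is taken from the description:

```python
# Sketch: fan per-host capacity calculations out to a bounded worker pool,
# sized by a setting analogous to capacity.calculate.workers.
from concurrent.futures import ThreadPoolExecutor

def calculate_capacity(host_id):
    # Placeholder for the per-host CPU/RAM capacity computation.
    return {"host": host_id, "used_pct": host_id % 100}

def calculate_all(host_ids, workers=4):
    # A bounded pool limits concurrent DB/CPU pressure while still
    # parallelising the scan across hosts; map preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(calculate_capacity, host_ids))
```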

- Added caching framework based on Caffeine in-memory caching library, https://github.com/ben-manes/caffeine

- Added caching for dynamic config keys with expiration after write set to 30 seconds.

- Added caching for account/user role API access, with expiration after write; the period can be configured using config - `dynamic.apichecker.cache.period`. If set to zero, there will be no caching. Default is 0.
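An expire-after-write cache like the one described (CloudStack uses Caffeine's `expireAfterWrite` on the Java side) can be sketched minimally as follows; this is an illustrative Python sketch with a hypothetical class name, not the actual implementation:

```python
# Minimal expire-after-write cache sketch: entries are served from memory
# until a fixed TTL after they were written, then reloaded.
import time

class ExpireAfterWriteCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable clock, handy for testing
        self._store = {}            # key -> (value, written_at)

    def get(self, key, loader):
        entry = self._store.get(key)
        now = self.clock()
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]         # fresh hit: no DB/config lookup
        value = loader(key)         # expired or missing: reload and restamp
        self._store[key] = (value, now)
        return value
```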

- Added caching for some recurring DB retrievals
   
    1. CapacityManager - listing service offerings - beneficial in host capacity calculation
    2. LibvirtServerDiscoverer - existing host for the cluster - beneficial for host joins
    3. DownloadListener - hypervisors for zone - beneficial for host joins
    4. VirtualMachineManagerImpl - VMs in progress - beneficial for processing stalled VMs during PingRoutingCommands
    
- Optimized MS list retrieval for agent connect 

- Optimized finding a ready systemvm template for a zone

- Database retrieval optimisations - fix and refactor for cases where only IDs or counts are used mainly for hosts and other infra entities. Also similar cases for VMs and other entities related to host concerning background tasks

- Changes in agent-agentmanager connection with NIO client-server classes
   
    1. Optimized the use of the executor service
    2. Refactored the Agent class to better handle connections.
    3. Do SSL handshakes within worker threads
    4. Added global configs to control the behaviour depending on the infra. SSL handshake and initial processing of a new agent could be a bottleneck during agent connections. Config `agent.max.concurrent.new.connections` can be used to control the number of new connections the management server handles at a time; `agent.ssl.handshake.timeout` can be used to set the number of seconds after which the SSL handshake times out at the MS end.
    5. On the agent side, backoff and SSL handshake timeout can be controlled by agent properties; the `backoff.seconds` and `ssl.handshake.timeout` properties can be used.
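For example, the agent-side knobs mentioned above would go into `agent.properties`; the values here are illustrative:

```
# agent.properties (fragment) - illustrative values
backoff.seconds=10
ssl.handshake.timeout=30
```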

- Improvements in StatsCollection - minimize DB retrievals.

- Improvements in DeploymentPlanner to allow the retrieval of only the desired host fields, with fewer retrievals.

- Improvements in host connections for a storage pool. Added config - `storage.pool.host.connect.workers` to control the number of worker threads that can be used to connect hosts to a storage pool. The worker thread approach is currently followed only for NFS and ScaleIO pools.

- Minor improvements in resource limit calculations wrt DB retrievals

### Schema changes
Schema changes that need to be applied if updating from 4.18.1.x
[FR73B-Phase1-sql-changes.sql.txt](https://github.com/user-attachments/files/17485581/FR73B-Phase1-sql-changes.sql.txt)


Upstream PR: apache#9840

### Changes and details from scoping phase
<details>
<summary>Changes and details from scoping phase</summary>


FR73B isn't a traditional feature FR per se, and the only way to scope this is to find classes of problems, try to put them in buckets, and propose a time-bound phase of developing and delivering optimisations. Instead of a specific proposal on how to fix them, we're looking to find approaches and methodologies that can be applied as sprints (or short investigation/fix cycles), as well as split out well-defined problems as separate FRs.

Below are some examples of the type of problem we can find around resource contention or spikes (where resource can be CPU, RAM, DB):

- Resources spikes on management server start/restart (such as maintenance led restarts)
- Resource spikes on addition of Hosts
- Resource spikes on deploying VMs
- Resource spikes or slowness on running list APIs

As examples, the following issues were found during the scoping exercise:

### 1. Reduce CPU and DB spikes on adding hosts or restarting mgmt server (direct agents, such as Simulator)

Introduced in apache#1403; the fix gates the logic to XenServer, the only hypervisor where it would run at all. The specific code is only applicable for XenServer and SolidFire (https://youtu.be/YQ3pBeL-WaA?si=ed_gT_A8lZYJiEh).

Fixing this hotspot alone took away about 20-40% of the CPU & DB pressure:

<img width="1002" alt="Screenshot 2024-05-03 at 3 10 13 PM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/f7f86c44-f865-4734-a6fd-89bd6a85ab73">

<img width="1067" alt="Screenshot 2024-05-03 at 3 11 41 PM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/caa5081b-8fd6-46cd-acb1-f4c5d6b5d10f">

**After the fix:**

![Screenshot 2024-05-03 at 5 31 05 PM](https://github.com/shapeblue/cloudstack-apple/assets/95203/2ba0b1c9-9922-44a9-ae4f-fb65f77866d4)

### 2. Reduce DB load on capacity scans

Another type of code/programming pattern wherein we fetch all DB records only to count and then discard them. Refactoring such cases can reduce CPU/DB load for environments with a really large number of hosts. The common pattern to search for in code is listings of hosts/HostVOs to optimise. The DB hot-spot was reduced by ~5-13% during aggressive scans.

### 3. Reduce DB load on Ping command

Upon handling Ping commands, we try to fetch a whole bunch of columns from the vm_instance table (joined to other tables), but only use the `id` column. We can optimise and reduce DB load by fetching only the `id`. Handling of power reports was further optimised (for example, previously it ran a DB query and then used an iterator; this was optimised to a select query excluding a list of VM ids).

With 1, 2 and 3, a single management server host + simulator deployed against a single MySQL 8.x DB was found to handle up to 20k hosts across two clusters.

### 4. API and UI optimisation

In this type of issue, the metrics APIs for zone and cluster were optimised so the pages would load faster. This sort of thing may be possible across the UI for resources that are very high in number.

### 5. Log optimisations

Reducing (unnecessary) logging can yield anywhere between a 5-10% improvement in overall performance throughput (API or operational).

### 6. DB, SQL Query and Mgmt server CPU load Optimisations

Several optimisations were possible. As an example, `isZoneReady` was improved, as it was causing both DB scans/load and a CPU hotspot:

<img width="1314" alt="Screenshot 2024-05-04 at 9 19 33 PM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/b0749642-0819-4bb9-803a-faa9754ccefa">

The following were explored:

- Using mysql slow-query logging along with index scan logging to find hotspot, along with jprofiler
- Adding missing indexes to speed up queries
- Reduce table scans by optimising sql query and using indexes
- Optimising sql queries to remove duplicate rows (use of distinct)
- Reduce CPU and DB load by using jprofiler to optimise both sql query
  and CPU hotspots

Example fix:

> server: reduce CPU and DB load caused by systemvm ::isZoneReady()
>
> For this case, the sql query was performing a large number of table scans only to determine if the zone has any available pool+host to launch systemvms. Accordingly, the code and sql queries, along with index optimisations, were used to lower both DB scans and mgmt server CPU load.

Further, tools such as EXPLAIN or EXPLAIN ANALYZE, or visual explaining of queries, can help optimise queries; for example, before:

<img width="508" alt="Screenshot 2024-05-08 at 6 16 17 PM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/d85f4d19-36a2-41ee-9334-c119a4b2fc52">

After adding an index:

<img width="558" alt="Screenshot 2024-05-08 at 6 22 32 PM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/14ef3d13-2d25-4f41-ba25-ee68e37b5b76">

Here's a bigger view of the user_vm_view that's optimised against by adding an index to user_ip_address table:

![zzexplain](https://github.com/shapeblue/cloudstack-apple/assets/95203/72e44291-a657-49da-adcd-5803a2fa91f9)

### 7. Better DB Connection Pooling: HikariCP

Several CPU and DB hotspots suggested about 20+% of time was spent processing the `SELECT 1` validation query, which was later found to be unnecessary for JDBC 4 compliant drivers, which use Connection::isValid to ascertain whether a connection is good. Further, heap and GC spikes were seen due to load on the mgmt server with 50k hosts. By replacing the dbcp2-based library with HikariCP, a more performant library with low production overhead, it was found that the application heap/GC load and DB CPU/query load could be reduced further. For existing environments, the validation query can be set to `/* ping */ SELECT 1`, which performs a lower-overhead application ping between the mgmt server and DB.

Migration to HikariCP and related changes shows a lower select query load, and about 10-15% lower CPU load:

<img width="1071" alt="Screenshot 2024-05-09 at 10 56 09 PM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/5dbf919e-4d15-48a3-ab87-5647db666132">
<img width="372" alt="Screenshot 2024-05-09 at 10 58 40 PM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/9cfc80c6-eb91-4036-b7f2-1e24b6c5b78a">

Caveat: this has led to unit test failures, as many tests depend on dbcp2-based assumptions, which can be fixed in due time. However, the build is passing and a simulator-based test setup seems to be working. The following is telemetry of the application (mgmt server) after 50k hosts join:

<img width="1184" alt="Screenshot 2024-05-10 at 12 31 09 AM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/e47cd71e-2bae-4640-949c-a457c420ab70">
<img width="1188" alt="Screenshot 2024-05-10 at 12 31 26 AM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/33dec07b-834c-44b8-a9a4-1d7502973fc7">

For 100k hosts added/joining, the connection scaling looks even better:

<img width="1180" alt="Screenshot 2024-05-22 at 8 32 44 PM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/ee4d3c5d-4b6d-43f0-8efb-28aba64917d9">

### 8. Using MySQL slow logs to optimise application logic and queries

MySQL slow query logging was enabled using the following configuration:

```
slow_query_log		= 1
slow_query_log_file	= /var/log/mysql/mysql-slow.log
long_query_time = 1
log_queries_not_using_indexes = 1
min_examined_row_limit = 100
```

Upon analysing the slow logs, the network_offering and user_vm_view related views, queries, and application logic were optimised as examples, to demonstrate how the methodology can be used to measure, find, and optimise bottlenecks. It was found that queries that end up doing more table scans than the rows they return to the application (ACS mgmt server) were adding pressure on the DB.

- In case of network_offering_view adding an index reduced table scans.
- In case of user_vm_view, it was found that MySQL was picking the wrong index, which caused a lot of scans as there were many IP addresses in the user_ip_address table. It turned out to be related to, or the same as, an old MySQL server bug https://bugs.mysql.com/bug.php?id=41220 and the workaround was to force the relevant index. This sped up the listVirtualMachines API in my test env (with 50-100k hosts) from 17s to under 200ms (measured locally).

### 9. Bottlenecks identified and categorised

As part of the FR scoping effort, not everything could possibly be fixed; as an example, some of the code has been marked with FIXME or TODO comments that relate to hotspots discovered during the profiling process. Some of it was commented out, for example to speed up host additions while reducing CPU/DB load (to allow testing of 50k-100k hosts joining).

Such code can be further optimised by exploring and using new caching layer(s) that could be built using the Caffeine library and Hazelcast.

Misc: if distributed multi-primary MySQL cluster support is to be explored:
shapeblue/cloudstack-apple#437

Misc: list API optimisations may be worth back porting:
apache#9177
apache#8782

</details>

---------

Signed-off-by: Rohit Yadav <[email protected]>
Signed-off-by: Abhishek Kumar <[email protected]>
Co-authored-by: Abhishek Kumar <[email protected]>
Co-authored-by: Fabricio Duarte <[email protected]>