<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Btech Engineering on Medium]]></title>
        <description><![CDATA[Stories by Btech Engineering on Medium]]></description>
        <link>https://medium.com/@btech-engineering?source=rss-587830e9864f------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*n0MC56HeQ_Jxj5rkLUGj2Q.png</url>
            <title>Stories by Btech Engineering on Medium</title>
            <link>https://medium.com/@btech-engineering?source=rss-587830e9864f------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Tue, 19 May 2026 03:17:48 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@btech-engineering/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Securing OpenStack with MFA: Implementing TOTP Authentication on Horizon]]></title>
            <link>https://medium.com/@btech-engineering/securing-openstack-with-mfa-implementing-totp-authentication-on-horizon-466c79b0d4fb?source=rss-587830e9864f------2</link>
            <guid isPermaLink="false">https://medium.com/p/466c79b0d4fb</guid>
            <category><![CDATA[openstack]]></category>
            <category><![CDATA[mfa]]></category>
            <category><![CDATA[authentication]]></category>
            <category><![CDATA[cloud-computing]]></category>
            <category><![CDATA[open-source]]></category>
            <dc:creator><![CDATA[Btech Engineering]]></dc:creator>
            <pubDate>Tue, 31 Mar 2026 06:49:46 GMT</pubDate>
            <atom:updated>2026-03-31T06:49:46.459Z</atom:updated>
            <content:encoded><![CDATA[<blockquote>Two-Factor Authentication is no longer optional. Here’s how to enable it on your OpenStack deployment using TOTP, Keystone, and Horizon.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/828/1*E4vspFD9GQWnh87PsHJKIQ.png" /></figure><h3>The Problem: One Password Is Never Enough</h3><p>In cloud infrastructure, a single compromised password can mean everything: lost data, downed services, and a very bad day for your team. OpenStack’s Keystone has supported Time-based One-Time Password (TOTP) authentication for a while, but there was a painful gap: <strong>Horizon, the web dashboard, had no native TOTP support</strong>. If TOTP was enforced in Keystone, users couldn’t log in to the web UI at all.</p><p>That gap is now closed.</p><p>OpenStack Horizon now supports TOTP-based Multi-Factor Authentication (MFA) natively, enabling true 2FA directly from the browser, with no workarounds and no CLI-only flows.</p><h3>What Is TOTP and Why Does It Matter?</h3><p>TOTP (Time-based One-Time Password) is the technology behind authenticator apps like Google Authenticator and Duo Mobile. It generates a 6-digit code that changes every 30 seconds, derived from a shared secret key and the current timestamp.</p><p>The security model is simple and powerful:</p><ul><li><strong>Something you know</strong> → your password</li><li><strong>Something you have</strong> → your phone (TOTP code)</li></ul><p>Even if an attacker obtains your password through phishing or a data breach, they still cannot log in without the rotating code from your device.</p><p>This feature was originally driven by demand from Infomaniak’s public cloud customers, and it’s now available to anyone running OpenStack Epoxy or later.</p><h3>Architecture Overview</h3><p>The MFA flow works like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Ye5zMsrFw07vh-Up.png" /></figure><p>Keystone handles token validation; Horizon provides the UI entry point. 
The TOTP secret is stored as a <strong>credential</strong> object tied to the user in Keystone.</p><h3>Environment</h3><p>For this walkthrough, we’re using an <strong>All-in-One OpenStack Epoxy</strong> deployment via Kolla-Ansible.</p><h3>Part 1: Deploy OpenStack Epoxy with Kolla-Ansible</h3><p>If you already have OpenStack running, skip to Part 2.</p><p>Install Dependencies</p><pre>sudo apt-get update<br>sudo apt-get install python3-dev libffi-dev gcc libssl-dev \<br>  python3-selinux python3-setuptools python3-venv -y<br><br>python3 -m venv os-venv<br>source os-venv/bin/activate<br>pip install -U pip<br># NOTE: version pin kept from the original walkthrough; confirm it against<br># the Ansible range supported by your Kolla-Ansible release<br>pip install ansible==2.9.13</pre><p>Install Kolla-Ansible</p><pre>pip install kolla-ansible==20.3.0<br>sudo mkdir -p /etc/kolla<br>sudo chown $USER:$USER /etc/kolla</pre><p>Configure and Deploy</p><pre>cd ~<br>cp -r os-venv/share/kolla-ansible/etc_examples/kolla/* /etc/kolla<br>cp os-venv/share/kolla-ansible/ansible/inventory/all-in-one .<br>kolla-ansible install-deps<br>kolla-genpwd</pre><p>Edit /etc/kolla/globals.yml with the values for your environment:</p><pre>kolla_base_distro: &quot;ubuntu&quot;<br>network_interface: &quot;ens3&quot;<br>neutron_external_interface: &quot;ens4&quot;<br>kolla_internal_vip_address: &quot;192.168.10.240&quot;<br>enable_openstack_core: &quot;yes&quot;</pre><p>Then deploy:</p><pre>kolla-ansible bootstrap-servers -i ./all-in-one<br>kolla-ansible prechecks -i ./all-in-one<br>kolla-ansible deploy -i ./all-in-one</pre><blockquote><strong>Common Issues:</strong></blockquote><blockquote>Docker module not found → pip install docker</blockquote><blockquote>Dbus module not found → sudo apt install -y libdbus-1-dev libdbus-glib-1-dev gcc, then pip install dbus-python</blockquote><h3>Part 2: Enable TOTP in Keystone</h3><p>Create a custom Keystone config that adds totp as an authentication method:</p><pre>mkdir -p /etc/kolla/config/keystone<br>vim /etc/kolla/config/keystone/keystone.conf</pre><p>Add the following:</p><pre>[auth]<br>methods = 
password,token,totp</pre><h3>Part 3: Enable TOTP in Horizon</h3><p>Edit the Horizon Jinja2 template used by Kolla-Ansible (adjust the path if your virtual environment lives somewhere other than /root/os-venv):</p><pre>vim /root/os-venv/share/kolla-ansible/ansible/roles/horizon/templates/_9999-custom-settings.py.j2</pre><p>Add:</p><pre>OPENSTACK_KEYSTONE_MFA_TOTP_ENABLED = True<br><br>AUTHENTICATION_PLUGINS = [<br>    &#39;openstack_auth.plugin.password.PasswordPlugin&#39;,<br>    &#39;openstack_auth.plugin.totp.TotpPlugin&#39;,<br>]</pre><p>Then reconfigure both services:</p><pre>kolla-ansible -i all-in-one reconfigure --tags keystone,horizon</pre><h3>Part 4: Create a TOTP-Enabled User</h3><h4>Generate a TOTP Secret Key</h4><p>The secret must be <strong>at least 16 characters</strong> long and is then Base32-encoded:</p><pre>echo &quot;iam16characters.&quot; | base32 | tr -d &quot;=&quot;<br># Output: NFQW2MJWMNUGC4TBMN2GK4TTFYFZ</pre><blockquote>Keys shorter than 16 characters will fail silently: the auth code will never match.</blockquote><p>Create User and Assign Credentials</p><pre># Create user with MFA enforced<br>openstack user create \<br>  --project admin \<br>  --domain default \<br>  --project-domain default \<br>  --password-prompt \<br>  --enable-multi-factor-auth \<br>  --multi-factor-auth-rule password,totp \<br>  myuser<br><br># Assign role<br>openstack role add --user myuser --project admin member<br><br># Attach TOTP credential<br>openstack credential create --type totp myuser NFQW2MJWMNUGC4TBMN2GK4TTFYFZ</pre><p>Verify that the credential was created:</p><pre>openstack credential list</pre><p>Expected output:</p><pre>+----------------------------------+------+----------------------------------+------------------------------+------------+<br>| ID                               | Type | User ID                          | Data                         | Project ID |<br>+----------------------------------+------+----------------------------------+------------------------------+------------+<br>| 1642a6997e754564aa45e102af92b6b4 | totp | 
c4014ddfce22424bb7416c63a50a7a47 | NFQW2MJWMNUGC4TBMN2GK4TTFYFZ | None       |<br>+----------------------------------+------+----------------------------------+------------------------------+------------+</pre><h4>Register in Your Authenticator App</h4><p>Open <strong>Duo Mobile</strong> (or Google Authenticator):</p><ol><li>Add a new account → Choose <strong>Google</strong> or <strong>Enter key manually</strong></li><li>Paste the key: NFQW2MJWMNUGC4TBMN2GK4TTFYFZ</li><li>Give it a name (e.g., OpenStack Admin) → Save</li></ol><p>You’ll now see a 6-digit code that rotates every 30 seconds.</p><h3>Part 5: Logging In via Horizon</h3><p>Go to your Horizon dashboard and log in with:</p><ul><li><strong>Username</strong></li><li><strong>Password</strong></li><li><strong>TOTP Code</strong> (from your authenticator app)</li></ul><p>That’s it: full 2FA through the web interface.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*BSqCATY8tSH82XyZ.png" /></figure><h3>Final Thoughts</h3><p>Enabling MFA on OpenStack is a relatively small configuration change with a massive security payoff. With TOTP now supported end-to-end (from Keystone to Horizon to the CLI), there’s no reason to leave your cloud dashboard protected by a single password.</p><p>If you’re running OpenStack in any environment exposed to multiple users or external networks, <strong>turn this on</strong>.</p><h3>Want to Go Deeper?</h3><p>This guide is based on hands-on work by the <strong>Boer Technology</strong> engineering team. Boer Technology is a managed cloud and professional services company that runs OpenStack, Kubernetes, and cloud-native platforms at scale for enterprise customers.</p><h3>Get in Touch</h3><p>Ready to optimize your infrastructure? 
Let’s discuss how we can help:</p><ul><li>Website: <a href="https://btech.id/">www.btech.id</a></li><li>Email: <a href="mailto:support@btech.id">support@btech.id</a></li></ul><p><em>References:</em></p><ul><li><a href="https://openmetal.io/docs/manuals/tutorials/enable-totp-openstack">Configuring TOTP on OpenStack</a></li><li><a href="https://docs.openstack.org/keystone/latest/admin/auth-totp.html#auth-totp">Keystone TOTP Password</a></li><li><a href="https://platform9.com/kb/openstack/using-openstack-cli-with-mfa">Using OpenStack CLI with MFA</a></li></ul><p><strong>Author</strong>:<br>Managed Services Team, PT. Boer Technology</p><p><strong>Tags</strong>: #Openstack #TOTP #MFA #2FA #Authentication #Horizon #Keystone #OpenSource</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Ansible AWX: Infrastructure Automation on Top of Kubernetes]]></title>
            <link>https://medium.com/@btech-engineering/ansible-awx-infrastructure-automation-on-top-of-kubernetes-9c81986131c4?source=rss-587830e9864f------2</link>
            <guid isPermaLink="false">https://medium.com/p/9c81986131c4</guid>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[ansible-awx]]></category>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[automation-platform]]></category>
            <category><![CDATA[devops]]></category>
            <dc:creator><![CDATA[Btech Engineering]]></dc:creator>
            <pubDate>Mon, 16 Mar 2026 04:45:58 GMT</pubDate>
            <atom:updated>2026-03-16T04:45:58.152Z</atom:updated>
            <content:encoded><![CDATA[<blockquote>This article documents our team’s research into Ansible AWX as an infrastructure automation and orchestration platform, from initial deployment and OpenStack integration to air-gapped installation.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kfgwq87JbPNkWlsgOK5uaw.png" /></figure><h3>What is Ansible AWX?</h3><p>Ansible AWX is the open-source upstream project behind Red Hat Ansible Automation Platform. It provides a web UI for managing Ansible resources interactively. With AWX, teams can run playbooks, manage inventories, schedule jobs, and handle credentials, all through a browser and without needing to SSH into a terminal.</p><p>Since version 18.0, the recommended way to deploy AWX is with the AWX Operator on top of a Kubernetes platform.</p><h3>Stage 1: Deploying AWX on K3s</h3><h4>Why K3s?</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/361/1*RfjGHyeG0msi26OTtFAHHg.png" /></figure><p>K3s is a lightweight Kubernetes distribution built by Rancher that retains the full functionality of Kubernetes while consuming far fewer resources than vanilla Kubernetes. 
It’s well suited to lab environments with limited specs (minimum 4 cores / 8 GB RAM for a single-node AWX setup).</p><h4>Deployment Steps</h4><p>Install K3s:</p><pre>curl -sfL https://get.k3s.io | sh -</pre><p>Install the AWX Operator:</p><pre>git clone https://github.com/ansible/awx-operator.git<br>cd awx-operator<br>git checkout 2.19.1<br>make deploy</pre><p>Create an AWX instance:</p><pre>apiVersion: awx.ansible.com/v1beta1<br>kind: AWX<br>metadata:<br>  name: awx-instance<br>  namespace: awx<br>spec:<br>  service_type: NodePort</pre><p>AWX is then accessible via NodePort, and the admin password can be retrieved with:</p><pre>kubectl get secret -n awx awx-instance-admin-password -o jsonpath=&quot;{.data.password}&quot; | base64 --decode</pre><h3>Stage 2: Dynamic Inventory from OpenStack</h3><h4>Problem: DNS Resolution Failure</h4><p>When we first tried to sync the dynamic inventory from OpenStack, the sync failed because pods inside Kubernetes couldn’t resolve internal domains. The fix was to update the CoreDNS ConfigMap:</p><pre>kubectl edit configmaps coredns -n kube-system</pre><p>Add the required host entries to the NodeHosts block. After restarting CoreDNS, nslookup from inside the pod succeeded and the inventory sync completed.</p><h4>Setting Up OpenStack Dynamic Inventory</h4><p>Create a custom Credential Type for OpenStack with these input fields: username, password, project_name, auth_url, and region_name. The Injector Configuration maps these fields to OS_* environment variables.</p><p>For <strong>multi-project</strong> support, add the following to the inventory Source Variables:</p><pre>plugin: openstack.cloud.openstack<br>cloud: myopenstack<br>expand_hostvars: true<br>fail_on_errors: true<br>all_projects: true</pre><h3>Stage 3: Handling Dynamic SSH Users</h3><p>One of the main challenges was that each OpenStack instance could have a different default SSH user (ubuntu, centos, cloud-user) and a different SSH key. 
Our team developed several approaches:</p><h4>Approach 1: Bash Script to Update ansible_user</h4><p>The update_host.sh script reads an ip_user_map.json file and sends PATCH requests to the AWX API to update each host&#39;s variables accordingly.</p><h4>Approach 2: Auto-Detection via Ansible Facts</h4><p>Create two Job Templates: one for <em>setup</em> (collecting Ansible facts and saving them to fact storage), and another that runs the main tasks using the user already known from the stored facts.</p><pre>- name: Set ansible_user dynamically<br>  set_fact:<br>    ansible_user: &gt;-<br>      {{<br>        ansible_env.SUDO_USER |<br>        default(ansible_user_id, true) |<br>        default(&#39;ubuntu&#39;)<br>      }}</pre><h4>Approach 3: Detection via OpenStack Image Name</h4><p>Use the image_name metadata from an OpenStack volume to determine the correct SSH user automatically via regex_replace.</p><h3>Stage 4: AWX Execution Nodes (Multi-Project)</h3><h4>Network Isolation Between Projects</h4><p>Each OpenStack project has its own isolated network. The AWX control plane cannot always reach instances across all projects. 
The solution: deploy a <strong>Receptor-based Execution Node</strong> in each project as an execution agent.</p><h4>How It Works</h4><p>The AWX control plane sends instructions via Receptor → the Execution Node inside the target project runs the playbook → results are returned to AWX.</p><h4>OS Requirements for Execution Nodes</h4><p><strong>Ubuntu 22.04 (Jammy)</strong> or <strong>RHEL 9</strong> is required because:</p><ul><li>Podman is available from the official Ubuntu repositories starting with Ubuntu 20.10</li><li>Python &gt;= 3.9 is already the base version</li><li>The latest Ansible versions support FQCN syntax</li></ul><h4>Installing Receptor</h4><ol><li>Add a new instance in the AWX UI and download the install bundle</li><li>Install Ansible on the execution node</li><li>Edit inventory.yml inside the bundle and update ansible_host and ansible_user</li><li>Run the installer playbook from the bundle</li><li>Add a DNS mapping for the execution node hostname in CoreDNS</li><li>Perform a health check from the AWX dashboard</li><li>Create an Instance Group and assign the execution node to that group</li></ol><h3>Stage 5: Manual Playbooks &amp; Volume Mounting</h3><p>To run playbooks that are stored manually (not pulled from SCM/Git), AWX must be configured so that its pods mount the same directory on the host.</p><pre>spec:<br>  web_extra_volume_mounts: |<br>    - name: playbook-volume<br>      mountPath: /var/lib/awx/projects<br>  task_extra_volume_mounts: |<br>    - name: playbook-volume<br>      mountPath: /var/lib/awx/projects<br>  extra_volumes: |<br>    - name: playbook-volume<br>      hostPath:<br>        path: /data/projects<br>        type: Directory</pre><p>All playbooks can then simply be stored in /data/projects on the AWX host VM.</p><h3>Stage 6: Custom Execution Environment (EE)</h3><p>If a playbook requires a collection that is not available in the default EE image (e.g., openstack.cloud), a custom EE image must be built with ansible-builder.</p><pre>pip3 install ansible-builder<br>ansible-builder build -t ee-openstack:latest</pre><p>The 
image is then pushed to a registry and registered in AWX as a new Execution Environment.</p><h3>Stage 7: Air-Gap Install (Offline Method)</h3><p>In production environments isolated from the internet, the installation is done fully offline using:</p><ol><li><strong>Registry Mirror</strong> (Docker Registry v2) to cache images from docker.io and quay.io</li><li><strong>K3s Airgap Install</strong> using a pre-downloaded k3s-airgap-images-amd64.tar.zst file</li><li><strong>registries.yaml configuration</strong> so K3s pulls images from the local mirror</li></ol><p>The end result is a fully functional AWX installation running <strong>without any internet connection</strong>, using the local registry mirror as the image source.</p><h4>Capacity &amp; Forks Calculation</h4><p>AWX calculates execution node capacity based on:</p><ul><li><strong>CPU:</strong> num_cores × 4 forks</li><li><strong>RAM:</strong> ram_size_mb / 100 forks</li></ul><p>The <strong>lower</strong> of the two values is used. For an instance with 4 vCPU / 16 GB RAM, AWX can process up to 1,100 events/second and execute up to 137 forks simultaneously, which is more than enough to manage thousands of hosts at once.</p><h3>Lessons Learned</h3><p>Key takeaways from this research:</p><ul><li><strong>DNS is everything.</strong> The most common issue when deploying AWX on K8s comes back to CoreDNS configuration.</li><li><strong>SSH users must be dynamic.</strong> In a multi-OS OpenStack environment, a single-credential-for-all approach simply won’t work.</li><li><strong>One Execution Node per project</strong> is the best solution for network isolation across OpenStack projects.</li><li><strong>Air-gap install requires thorough preparation.</strong> All dependencies must be available in the local registry before the installation starts.</li><li><strong>AWX Fact Storage is incredibly useful</strong> for caching host information so playbooks don’t need to gather facts on every run.</li></ul><h3>Closing</h3><p>Ansible AWX 
proves to be a powerful platform for managing infrastructure automation at scale. With its operator-based approach on Kubernetes, AWX is highly flexible, from single-node deployments to clusters with multiple execution nodes in fully isolated enterprise environments.</p><p>This research is ongoing as we explore more advanced AWX features. In some production environments, we have used AWX to upgrade services automatically across hundreds of nodes from an interactive dashboard, which made the rollouts faster and more efficient.</p><h3>Need Help with Your Automation Platform?</h3><p>Implementing and managing an automation platform can be challenging. The work ranges from designing the architecture and writing the method of procedure to implementing and managing a platform such as AWX so that it is ready for production use.</p><p>At <strong>Boer Technology</strong>, we specialize in designing, implementing, and managing automation platform solutions, including AWX. Our team has experience managing production environments with hundreds of nodes, so your environment can be managed centrally and efficiently.</p><h3>Get in Touch</h3><p>Ready to manage your platform automatically? Let’s discuss how we can help:</p><ul><li>Website: <a href="https://btech.id/">www.btech.id</a></li><li>Email: <a href="mailto:support@btech.id">support@btech.id</a></li></ul><p><strong>References</strong></p><ul><li><a href="https://github.com/ansible/awx">Github: AWX</a></li><li><a href="https://docs.ansible.com/projects/awx-operator/en/latest/">Ansible AWX Operator Documentation</a></li></ul><p><strong>Author</strong></p><p>Managed Services Team, PT. Boer Technology</p><p><strong>Tags:</strong> #AWX #AutomationPlatform #Ansible #OpenSource #DevOps #Kubernetes</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[VictoriaLogs Deployment: Single Node vs Cluster Mode — A Comprehensive Guide]]></title>
            <link>https://medium.com/@btech-engineering/victorialogs-deployment-single-node-vs-cluster-mode-a-comprehensive-guide-24284c0d1134?source=rss-587830e9864f------2</link>
            <guid isPermaLink="false">https://medium.com/p/24284c0d1134</guid>
            <category><![CDATA[graylog]]></category>
            <category><![CDATA[monitoring]]></category>
            <category><![CDATA[grafana]]></category>
            <category><![CDATA[victorialogs]]></category>
            <dc:creator><![CDATA[Btech Engineering]]></dc:creator>
            <pubDate>Fri, 13 Feb 2026 08:26:26 GMT</pubDate>
            <atom:updated>2026-02-13T08:26:26.219Z</atom:updated>
            <content:encoded><![CDATA[<h3>Introduction</h3><p>In log management, finding the right balance between performance, scalability, and resource efficiency is crucial. VictoriaLogs, developed by the VictoriaMetrics team, offers a compelling solution with its lightweight architecture and powerful querying capabilities. But when should you choose a single-node deployment over cluster mode? This guide walks through both deployment strategies to help you make an informed decision for your infrastructure.</p><h3>What is VictoriaLogs?</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*VP2ekZZwY1R79gSP.png" /></figure><p>VictoriaLogs is a high-performance, open-source log management system designed for efficient storage and analysis of large volumes of logs. It stands out with:</p><ul><li><strong>Superior Compression</strong>: Uses significantly less storage space than Elasticsearch or Loki</li><li><strong>Low Resource Usage</strong>: Lower CPU/RAM/disk requirements</li><li><strong>Simple Deployment</strong>: Single binary with minimal configuration</li><li><strong>Powerful Query Language</strong>: LogsQL provides intuitive and fast log searching</li><li><strong>Grafana Integration</strong>: Seamless visualization and alerting capabilities</li></ul><h4>VictoriaLogs vs Graylog</h4><p>Before diving into deployment modes, let’s look at how VictoriaLogs compares to traditional solutions:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/920/1*qG1-HKArS4HNwgPgg6tlVA.png" /></figure><h3>Part 1: Single Node Deployment</h3><h4>When to Use Single Node?</h4><p>Single node deployment is ideal for:</p><ul><li>Small to medium-sized infrastructures</li><li>Development and testing environments</li><li>Organizations with limited resources</li><li>Scenarios where vertical scaling is sufficient</li></ul><h4>Architecture Overview</h4><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/1024/1*JQVU54IE_gMZQ5XZrctQtw.png" /></figure><p>Step-by-Step Installation</p><p>1. Install VictoriaLogs</p><p>Download and install the VictoriaLogs binary:</p><pre># Download VictoriaLogs<br>wget https://github.com/VictoriaMetrics/VictoriaLogs/releases/download/v1.43.0/victoria-logs-linux-amd64-v1.43.0.tar.gz<br><br># Extract archive<br>tar xvf victoria-logs-linux-amd64-*.tar.gz<br><br># Move binary to system path<br>sudo mv victoria-logs-prod /usr/local/bin/<br><br># Create storage directory<br>sudo mkdir -p /data/victoria-logs</pre><p>2. Create Systemd Service</p><pre>cat &lt;&lt;EOF | sudo tee /etc/systemd/system/victoria-logs.service<br>[Unit]<br>Description=Victoria Logs<br>After=network-online.target<br>Wants=network-online.target systemd-networkd-wait-online.service<br><br>[Service]<br>Restart=on-failure<br>RestartSec=5s<br>PrivateTmp=true<br>PrivateDevices=false<br>ProtectHome=true<br>ProtectSystem=full<br>ExecStart=/usr/local/bin/victoria-logs-prod \\<br>    -retentionPeriod=1w \\<br>    -storageDataPath=/data/victoria-logs<br><br>[Install]<br>WantedBy=multi-user.target<br>EOF</pre><p>Start and enable the service:</p><pre>sudo systemctl daemon-reload<br>sudo systemctl enable --now victoria-logs<br>sudo systemctl status victoria-logs</pre><p>3. 
Configure Filebeat for Log Ingestion</p><p>Install the required packages:</p><pre># Install prerequisites (Filebeat is a standalone Go binary and does not need a JDK)<br>sudo apt install -y apt-transport-https gnupg2<br><br># Add the Elastic APT repository<br>wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -<br>echo &quot;deb https://artifacts.elastic.co/packages/8.x/apt stable main&quot; | sudo tee -a /etc/apt/sources.list.d/elastic-8.x.list<br><br># Install Filebeat<br>sudo apt update &amp;&amp; sudo apt install -y filebeat</pre><p>Configure Filebeat (/etc/filebeat/filebeat.yml):</p><pre>filebeat.inputs:<br>- type: filestream<br>  id: my-filestream-id<br>  enabled: true<br>  paths:<br>    - /var/log/*.log<br><br>output.elasticsearch:<br>  hosts: [&quot;http://&lt;VICTORIA_HOST_IP&gt;:9428/insert/elasticsearch/&quot;]<br>  parameters:<br>    _msg_field: &quot;message&quot;<br>    _time_field: &quot;@timestamp&quot;<br>    _stream_fields: &quot;host.hostname,log.file.path&quot;<br>  preset: balanced</pre><p>Restart the Filebeat service:</p><pre>sudo systemctl restart filebeat<br>journalctl -u filebeat -f  # Verify connection</pre><p>4. 
Test Setup</p><p>Query logs from the CLI:</p><pre>curl http://localhost:9428/select/logsql/query -d &#39;query=*&#39;</pre><h3>Part 2: Cluster Mode Deployment</h3><h4>When to Use Cluster Mode?</h4><p>Cluster mode becomes necessary when:</p><ul><li>A single node reaches its vertical scaling limits</li><li>You need horizontal scaling across multiple machines</li><li>High availability is required</li><li>Log volume exceeds single-node capacity</li></ul><h4>Architecture Overview</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/446/1*6YLUBj4Y-F7P2S-TJFEIDw.png" /></figure><h4>Cluster Components</h4><p>A VictoriaLogs cluster consists of three primary components:</p><ol><li><strong>vlinsert</strong>: Log ingestion frontend (Master Node)</li><li><strong>vlstorage</strong>: Storage nodes (Worker Nodes)</li><li><strong>vlselect</strong>: Query layer (Master Node)</li></ol><p>Step-by-Step Installation</p><p>1. Prepare All Nodes</p><p>On all nodes (master and workers), download and install VictoriaLogs:</p><pre># Download VictoriaLogs binary<br>wget https://github.com/VictoriaMetrics/VictoriaLogs/releases/download/v1.43.0/victoria-logs-linux-amd64-v1.43.0.tar.gz<br><br># Extract and install<br>tar xvf victoria-logs-linux-amd64-*.tar.gz<br>sudo mv victoria-logs-prod /usr/local/bin/<br><br># Create storage directory<br>sudo mkdir -p /data/victoria-logs</pre><p>2. 
Deploy vlstorage (Worker Nodes)</p><p>On each worker node, create the storage service:</p><pre>cat &lt;&lt;&#39;EOF&#39; | sudo tee /etc/systemd/system/victoria-logs-storage.service<br>[Unit]<br>Description=VictoriaLogs Storage (vlstorage)<br>After=network-online.target<br>Wants=network-online.target systemd-networkd-wait-online.service<br><br>[Service]<br>Type=simple<br>Restart=on-failure<br>RestartSec=5s<br>User=root<br>Group=root<br><br># Hardening<br>PrivateTmp=true<br>PrivateDevices=false<br>ProtectHome=true<br>ProtectSystem=full<br>NoNewPrivileges=true<br><br>ExecStart=/usr/local/bin/victoria-logs-prod \<br>  -httpListenAddr=:9430 \<br>  -storageDataPath=/data/victoria-logs \<br>  -retentionPeriod=1w<br><br>[Install]<br>WantedBy=multi-user.target<br>EOF</pre><p>Start the service:</p><pre>sudo systemctl daemon-reload<br>sudo systemctl enable --now victoria-logs-storage<br>sudo systemctl status victoria-logs-storage</pre><p>3. Deploy vlinsert (Master Node)</p><p>On the master node, create the insert service:</p><pre>cat &lt;&lt;&#39;EOF&#39; | sudo tee /etc/systemd/system/victoria-logs-insert.service<br>[Unit]<br>Description=VictoriaLogs Insert (vlinsert)<br>After=network-online.target<br>Wants=network-online.target<br><br>[Service]<br>Type=simple<br>Restart=on-failure<br>RestartSec=5s<br>User=root<br>Group=root<br><br># Hardening<br>PrivateTmp=true<br>PrivateDevices=false<br>ProtectHome=true<br>ProtectSystem=full<br>NoNewPrivileges=true<br><br>ExecStart=/usr/local/bin/victoria-logs-prod \<br>  -httpListenAddr=:9428 \<br>  -storageNode=&lt;WORKER1_IP&gt;:9430 \<br>  -storageNode=&lt;WORKER2_IP&gt;:9430<br><br>[Install]<br>WantedBy=multi-user.target<br>EOF</pre><p><strong>Important</strong>: Replace &lt;WORKER1_IP&gt; and &lt;WORKER2_IP&gt; with the actual worker IP addresses.</p><p>Start the service:</p><pre>sudo systemctl daemon-reload<br>sudo systemctl enable --now victoria-logs-insert<br>sudo systemctl status victoria-logs-insert</pre><p>4. 
Deploy vlselect (Master Node)</p><p>On the master node, create the select service:</p><pre>cat &lt;&lt;&#39;EOF&#39; | sudo tee /etc/systemd/system/victoria-logs-select.service<br>[Unit]<br>Description=VictoriaLogs Select (vlselect)<br>After=network-online.target<br>Wants=network-online.target<br><br>[Service]<br>Type=simple<br>Restart=on-failure<br>RestartSec=5s<br>User=root<br>Group=root<br><br># Hardening<br>PrivateTmp=true<br>PrivateDevices=false<br>ProtectHome=true<br>ProtectSystem=full<br>NoNewPrivileges=true<br><br>ExecStart=/usr/local/bin/victoria-logs-prod \<br>  -httpListenAddr=:9429 \<br>  -storageNode=&lt;WORKER1_IP&gt;:9430 \<br>  -storageNode=&lt;WORKER2_IP&gt;:9430<br><br>[Install]<br>WantedBy=multi-user.target<br>EOF</pre><p>Start the service:</p><pre>sudo systemctl daemon-reload<br>sudo systemctl enable --now victoria-logs-select<br>sudo systemctl status victoria-logs-select</pre><p>5. Configure Filebeat (Worker Nodes)</p><p>On each worker node, configure Filebeat to send logs to vlinsert:</p><pre>filebeat.inputs:<br>- type: filestream<br>  id: my-filestream-id<br>  enabled: true<br>  paths:<br>    - /var/log/*.log<br><br>output.elasticsearch:<br>  hosts: [&quot;http://&lt;MASTER_NODE_IP&gt;:9428/insert/elasticsearch/&quot;]<br>  parameters:<br>    _msg_field: &quot;message&quot;<br>    _time_field: &quot;@timestamp&quot;<br>    _stream_fields: &quot;host.hostname,log.file.path&quot;<br>  preset: balanced</pre><p>Restart Filebeat:</p><pre>sudo systemctl restart filebeat<br>sudo systemctl enable filebeat<br>journalctl -u filebeat -f</pre><p>6. 
Verify Cluster Health</p><p>Check component health:</p><pre># Check vlinsert health<br>curl -s http://&lt;MASTER_IP&gt;:9428/health<br><br># Check vlselect health<br>curl -s http://&lt;MASTER_IP&gt;:9429/health<br><br># Test query<br>curl -s http://&lt;MASTER_IP&gt;:9429/select/logsql/query -d &#39;query=*&#39;</pre><p>Check the service logs:</p><pre># On master node<br>sudo journalctl -u victoria-logs-insert -f<br>sudo journalctl -u victoria-logs-select -f<br><br># On worker nodes<br>sudo journalctl -u victoria-logs-storage -f</pre><h4>Grafana Integration (Both Modes)</h4><p>Install Grafana (Master Node)</p><pre># Install prerequisites<br>sudo apt-get install -y software-properties-common<br><br># Add Grafana GPG key<br>wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -<br><br># Add Grafana repository<br>echo &quot;deb https://packages.grafana.com/oss/deb stable main&quot; | sudo tee /etc/apt/sources.list.d/grafana.list<br><br># Install Grafana<br>sudo apt-get update<br>sudo apt-get install -y grafana<br><br># Start and enable service<br>sudo systemctl start grafana-server<br>sudo systemctl enable grafana-server<br>sudo systemctl status grafana-server</pre><p>Install the VictoriaLogs Plugin</p><pre># For Grafana v10.2.0<br>sudo grafana-cli plugins install victoriametrics-logs-datasource 0.16.3<br><br># For latest Grafana<br>sudo grafana-cli plugins install victoriametrics-logs-datasource<br><br># Restart Grafana<br>sudo systemctl restart grafana-server</pre><h4>Configure Data Source</h4><ol><li>Access Grafana: http://&lt;MASTER_NODE_IP&gt;:3000</li><li>Log in with the default credentials: admin/admin</li><li>Go to <strong>Configuration → Data Sources</strong></li><li>Click <strong>Add data source</strong></li><li>Search for <strong>VictoriaLogs</strong></li><li>Set the URL for your deployment mode and click <strong>Save &amp; Test</strong>:</li></ol><blockquote><strong>Single Node</strong>: <em>http://&lt;VICTORIA_HOST_IP&gt;:9428</em></blockquote><blockquote><strong>Cluster Mode</strong>: 
http://&lt;MASTER_NODE_IP&gt;:9429</blockquote><h4>Create Dashboard</h4><p>Example LogsQL queries:</p><pre># Basic query - all logs from specific host<br>_stream:{host.hostname=&quot;worker01&quot;,log.file.path=&quot;/var/log/syslog&quot;}<br><br># Count logs by time<br>_stream:{host.hostname=&quot;worker01&quot;} | stats by (_time) count()<br><br># Search for specific pattern<br>_stream:{host.hostname=&quot;worker01&quot;} &quot;error&quot; | stats by (_time, log.level) count()<br><br># Advanced filtering<br>&quot;processor error&quot; | stats by (_time, host.name, host.ip, log.file.path) count()</pre><h4>Set Up Alerting</h4><p>Create an alert rule in Grafana:</p><ol><li>Navigate to <strong>Alerting → Alert Rules</strong></li><li>Click <strong>New alert rule</strong></li><li>Configure the query:</li></ol><pre>&quot;processor error&quot; | stats by (_time, host.name, host.ip, log.file.path) count()</pre><p>Next, set the alert condition:</p><ul><li><strong>Time Range</strong>: now-15m to now</li><li><strong>Threshold</strong>: IS ABOVE 0</li></ul><p>Next, add custom annotations with an Alert-ID, then save and test the rule.</p><p>Test the alert by sending a log line that contains the phrase the rule searches for:</p><pre># Send test log (must contain &quot;processor error&quot; to match the alert query)<br>logger -p local0.crit &quot;CRITICAL_TEST: processor error test alert&quot;<br><br># Verify in Grafana</pre><h4>Comparison: Single Node vs Cluster Mode</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/737/1*QZvii2JjmXuIrjc7nfMzlQ.png" /></figure><h4>When to Migrate from Single Node to Cluster</h4><p>Consider migrating when you experience:</p><ol><li><strong>Resource Saturation</strong>: CPU, RAM, or disk consistently at &gt;80% usage</li><li><strong>Slow Queries</strong>: Query response times degrading</li><li><strong>Storage Limits</strong>: Approaching disk capacity limits</li><li><strong>High Ingestion Rates</strong>: Log ingestion causing service degradation</li><li><strong>Availability Requirements</strong>: Need for zero-downtime operations</li><li><strong>Business Growth</strong>: Anticipating significant log 
volume increase</li></ol><h4>Migration Path: Single Node to Cluster</h4><p>If you need to migrate from single node to cluster mode:</p><p>Step 1: Backup Current Data</p><pre>sudo systemctl stop victoria-logs<br>sudo tar -czf victoria-logs-migration-backup.tar.gz /data/victoria-logs</pre><p>Step 2: Deploy Cluster Infrastructure</p><ul><li>Set up master and worker nodes</li><li>Deploy vlinsert, vlselect, vlstorage as described above</li></ul><p>Step 3: Migrate Data (Optional)</p><pre># Copy data to one of the worker nodes<br>scp victoria-logs-migration-backup.tar.gz worker1:/tmp/<br><br># On worker1<br>sudo systemctl stop victoria-logs-storage<br>sudo tar -xzf /tmp/victoria-logs-migration-backup.tar.gz -C /<br>sudo systemctl start victoria-logs-storage</pre><p>Step 4: Update Filebeat Configuration</p><p>Update all Filebeat instances to point to new cluster:</p><pre>output.elasticsearch:<br>  hosts: [&quot;http://&lt;MASTER_NODE_IP&gt;:9428/insert/elasticsearch/&quot;]</pre><p>Step 5: Verify and Monitor</p><ul><li>Check cluster health</li><li>Verify log ingestion</li><li>Monitor performance metrics</li></ul><h3>Conclusion</h3><p>Both VictoriaLogs deployment modes offer compelling advantages depending on your scale and requirements:</p><h4>Choose Single Node When:</h4><ul><li>Log volume is under 500GB/day</li><li>Budget is limited</li><li>Operational simplicity is priority</li><li>You have development/testing workloads</li><li>Downtime windows are acceptable</li></ul><h4>Choose Cluster Mode When:</h4><ul><li>Log volume exceeds 500GB/day</li><li>High availability is critical</li><li>You need horizontal scalability</li><li>Zero-downtime operations are required</li><li>You have distributed infrastructure</li></ul><h4>Why VictoriaLogs Stands Out:</h4><ol><li><strong>Resource Efficiency</strong>: Uses 50–80% less resources than Elasticsearch</li><li><strong>Storage Compression</strong>: 10x better compression than alternatives</li><li><strong>Simple Setup</strong>: Single 
binary, minimal dependencies</li><li><strong>Powerful Query Language</strong>: LogsQL is intuitive and fast</li><li><strong>Cost-Effective</strong>: Significantly reduces infrastructure costs</li></ol><h3>Need Help with Your VictoriaLogs Deployment?</h3><p>Implementing and managing a robust log management system can be challenging. Whether you’re setting up your first VictoriaLogs instance or scaling to cluster mode, having the right expertise makes all the difference.</p><p>At <strong>Boer Technology</strong>, we specialize in deploying and managing high-performance observability solutions including VictoriaLogs. Our team has extensive experience with both single-node and cluster deployments across various scales.</p><h4>Get in Touch</h4><p>Ready to optimize your log management infrastructure? Let’s discuss how we can help:</p><ul><li>Website: <a href="https://btech.id">www.btech.id</a></li><li>Email: <a href="mailto:support@btech.id">support@btech.id</a></li></ul><p>References:</p><ul><li><a href="https://docs.victoriametrics.com/VictoriaLogs/">VictoriaLogs Official Documentation</a></li><li><a href="https://github.com/VictoriaMetrics/VictoriaLogs">VictoriaLogs GitHub Repository</a></li><li><a href="https://grafana.com/grafana/plugins/victoriametrics-logs-datasource/">VictoriaLogs Grafana Plugin</a></li><li><a href="https://victoriametrics.com/blog/victorialogs-vs-loki/">VictoriaLogs vs Loki Benchmarking</a></li></ul><p><strong>Author</strong>:<br>Managed Services Team, PT. Boer Technology</p><p><strong>Tags</strong>: #VictoriaLogs #LogManagement #DevOps #Observability #OpenSource #Monitoring #Grafana #Kubernetes #CloudNative #SRE</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Karma: A Centralized Dashboard for Prometheus Alertmanagers]]></title>
            <link>https://medium.com/@btech-engineering/karma-a-centralized-dashboard-for-prometheus-alertmanagers-39787e0194ca?source=rss-587830e9864f------2</link>
            <guid isPermaLink="false">https://medium.com/p/39787e0194ca</guid>
            <category><![CDATA[prometheus]]></category>
            <category><![CDATA[alertmanager]]></category>
            <category><![CDATA[monitoring]]></category>
            <category><![CDATA[karma]]></category>
            <category><![CDATA[alerting]]></category>
            <dc:creator><![CDATA[Btech Engineering]]></dc:creator>
            <pubDate>Sat, 09 Aug 2025 08:57:37 GMT</pubDate>
            <atom:updated>2025-08-11T04:40:09.596Z</atom:updated>
            <content:encoded><![CDATA[<p><strong>A. Getting to Know Karma<br></strong>Juggling alerts from multiple Prometheus environments can feel like herding cats: noisy, scattered, and hard to control. While Alertmanager gets the job done, its built-in UI isn’t designed for easily managing alerts across clusters. That’s where <strong>Karma</strong> steps in, giving you a single, centralized dashboard to keep your alerting chaos in check.</p><p>Karma is an open-source web UI designed to aggregate and manage alerts from one or more Prometheus Alertmanager instances.<br> In the Prometheus ecosystem, it acts as a centralized control panel, giving operators, SREs, and DevOps teams a unified place to view, filter, group, and silence alerts across different environments. This makes large-scale alert management more efficient and less error-prone.</p><p>Karma comes with several benefits. Some of them are:</p><ul><li><strong>Centralization: </strong>View alerts from multiple Alertmanager instances in one unified dashboard</li><li><strong>Advanced Filtering: </strong>Quickly narrow down alerts by labels, severity, source, or custom queries</li><li><strong>Easy Silencing: </strong>Create, edit, and remove silences directly from the UI without manual API calls</li></ul><p><strong>B. Installing Karma and Verifying Access<br></strong>This section demonstrates how to install Karma from the release binary. We’ll also verify that Karma is reachable after installation. 
We’ll use this topology:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/543/1*b0918G0YHhmRNmB768b-rg.jpeg" /><figcaption>Karma Demonstration Topology</figcaption></figure><p>The components and pre-configuration in this lab are:</p><ul><li><strong>Bare workload cluster<br></strong>- This cluster simulates machines with bare workloads<br>- Machines in this cluster are already configured with Node Exporter to expose machine metrics</li><li><strong>Containerized workload cluster</strong><br>- This cluster simulates machines with Docker container workloads<br>- The Docker service in this cluster is already configured to expose Docker metrics</li><li><strong>bare-mon (192.168.1.232/24)<br></strong>- This machine hosts Prometheus and Alertmanager for the bare workload machines<br>- Prometheus is already configured to scrape metrics from the Node Exporters in the bare workload cluster and has the following alert rules configured: <strong>NodeCPUUsageHigh</strong>, <strong>NodeMemoryPressure</strong>, <strong>NodeFileSystemAlmostFull</strong>, <strong>NodeProcsBlocked</strong>, and <strong>DeadMansSwitch</strong> for monitoring the health of the alerting system<br>- Alertmanager is already configured and integrated with Prometheus on the same machine</li><li><strong>ctr-mon (192.168.1.234/24)<br></strong>- This machine hosts Prometheus and Alertmanager for the containerized workload machines<br>- Prometheus is already configured to scrape Docker metrics from the containerized workload cluster and has six alert rules configured: <strong>DockerContainerRunningLow</strong>, <strong>DockerContainerStartSlow</strong>, <strong>DockerProcessMemoryHigh</strong>, <strong>DockerProcessFdsHigh</strong>, <strong>DockerNetworkBytesLow</strong>, and <strong>DeadMansSwitch</strong> for monitoring the health of the alerting system<br>- Alertmanager is already configured and integrated with Prometheus on the same machine</li><li><strong>karma (192.168.1.235/24)<br></strong>- This machine is using 
<strong>Ubuntu 22.04 Server</strong> as its OS. IPv4 addressing, internet access, and the hostname are already configured</li></ul><p>With all prerequisites in place, we can now proceed to installing Karma.</p><ul><li>Get the latest Karma binary. At the time of writing this article, the latest version is <strong>v0.121</strong></li></ul><pre>wget https://github.com/prymitive/karma/releases/download/v0.121/karma-linux-amd64.tar.gz<br>tar -zxvf karma-linux-amd64.tar.gz</pre><ul><li>Make the binary executable and move it to a directory in PATH (such as /usr/local/bin)</li></ul><pre>sudo chmod +x karma-linux-amd64<br>sudo mv karma-linux-amd64 /usr/local/bin/karma</pre><ul><li>Now, we’ll create a configuration file for Karma. Create a new directory and a new configuration file</li></ul><pre>sudo mkdir /etc/karma<br>sudo nano /etc/karma/karma.yml</pre><p>The configuration covers the following:<br>- For security, enable HTTP basic authentication so that Karma is not exposed without any access control</p><pre>authentication:<br>  basicAuth:<br>    users:<br>      - username: alert-admin<br>        password: Passw0rd$</pre><p>- Configure Karma to get alerts from the Alertmanagers on <strong>bare-mon </strong>and <strong>ctr-mon</strong>. The default Alertmanager URL is http://&lt;ip_or_fqdn&gt;:9093. 
We’ll also configure a health check for each Alertmanager using the <strong>DeadMansSwitch</strong> alert, which is always triggered to indicate that the alerting system is healthy.</p><pre>alertmanager:<br>  servers:<br>    - name: bare-alertmanager<br>      uri: http://192.168.1.232:9093<br>      healthcheck:<br>        filters:<br>          bare-mon:<br>            - alertname=DeadMansSwitch<br>            - instance=bare-mon<br><br>    - name: ctr-alertmanager<br>      uri: http://192.168.1.234:9093<br>      healthcheck:<br>        filters:<br>          ctr-mon:<br>            - alertname=DeadMansSwitch<br>            - instance=ctr-mon</pre><p>Final full configuration inside of /etc/karma/karma.yml will be like this:</p><pre>authentication:<br>  basicAuth:<br>    users:<br>      - username: alert-admin<br>        password: Passw0rd$<br>alertmanager:<br>  servers:<br>    - name: bare-alertmanager<br>      uri: http://192.168.1.232:9093<br>      healthcheck:<br>        filters:<br>          bare-mon:<br>            - alertname=DeadMansSwitch<br>            - instance=bare-mon<br><br>    - name: ctr-alertmanager<br>      uri: http://192.168.1.234:9093<br>      healthcheck:<br>        filters:<br>          ctr-mon:<br>            - alertname=DeadMansSwitch<br>            - instance=ctr-mon</pre><ul><li>Save the configuration. Next, we’ll create a systemd service file so the Karma process is managed as a service</li></ul><pre>sudo nano /etc/systemd/system/karma.service</pre><p>Content of the file will be:</p><pre>[Unit]<br>Description = Karma Service<br><br>[Service]<br>ExecStart = /usr/local/bin/karma --config.file /etc/karma/karma.yml<br><br>[Install]<br>WantedBy = multi-user.target</pre><ul><li>Save the file, then enable and start the Karma service. 
Verify that Karma is running and listening on its default port (8080)</li></ul><pre>sudo systemctl enable --now karma.service<br>sudo systemctl status karma<br>sudo lsof -P -i -n | grep karma</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/956/1*lS7FWvc36j204q_mctnmNw.png" /><figcaption>Karma Service Status and Listener on Port 8080</figcaption></figure><ul><li>Test access to Karma by opening its URL: http://&lt;ip_or_fqdn&gt;:8080. You’ll be prompted for HTTP basic auth credentials</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0kP8I5ghSS2SUABzyZEdEw.png" /><figcaption>Karma HTTP Basic Auth</figcaption></figure><p>Enter the credentials defined in the Karma configuration. If the login succeeds, you’ll be redirected to the Karma dashboard</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1cD_EvR0KAG6SrKSlbeTwA.png" /><figcaption>Karma Dashboard Access</figcaption></figure><p><strong>C. Essential Alert Management in Action with Karma</strong></p><ul><li>Similar to Alertmanager, the Karma dashboard displays only alerts that are firing. For example, we’ll trigger the NodeCPUUsageHigh alert by running a stress test on one machine in the bare workload cluster</li></ul><pre>sudo apt install stress-ng -y<br>stress-ng --cpu &quot;$(nproc)&quot; --cpu-method matrixprod --timeout 600s</pre><p>Wait until the alert is firing. It will then appear in the Karma dashboard</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*B7sjeXHeFIkaefcVePPqZQ.png" /><figcaption>Alert status in Prometheus</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iyiI1ZsHf1zZzAHNE7HjAg.png" /><figcaption>Karma Dashboard Showing the NodeCPUUsageHigh Alert</figcaption></figure><ul><li>We can also configure silences directly from the Karma dashboard instead of using the Alertmanager dashboard. 
Click the silence icon at the top right of the Karma dashboard, then choose the Alertmanager instance to configure. From there, add the silence attributes such as label matchers, duration, and other relevant options</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uqFGtjTxX02yHXBv1Gff1w.png" /><figcaption>Configuring silence from Karma Dashboard</figcaption></figure><ul><li>As alerts from multiple Alertmanagers are aggregated in the Karma dashboard, we can apply label filters to display only specific alerts</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MlOJ4DLbZAMPGJqwIcmwKQ.png" /><figcaption>Applying label filter to show only specific alerts in Karma Dashboard</figcaption></figure><ul><li>Some tips for working with multiple Alertmanagers aggregated in the Karma dashboard: <br>- <strong>Use consistent labels: </strong>Ensure every alert includes identification labels with a clear, uniform naming convention (such as instance, cluster, etc.)<br>- <strong>Separate environments clearly: </strong>Consistent environment labeling prevents mixing production and staging alerts in the same view<br>- <strong>Assign meaningful severities: </strong>Use standardized severity label values such as info, low, medium, high, and critical to help prioritize alerts<br>- <strong>Keep labels short and descriptive: </strong>Avoid overly long or ambiguous label values for faster filtering and grouping in Karma</li></ul><p><strong>References:</strong><br>- <a href="https://github.com/prymitive/karma">https://github.com/prymitive/karma</a></p><p><strong>Author</strong>:<br>Kevin Timoteus Sirait, PT. Boer Technology | <a href="https://medium.com/@kevintim">Medium</a> | <a href="https://www.linkedin.com/in/kevin-tim/">LinkedIn</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Build an Interactive OpenStack Compute Node Monitoring System with Prometheus, Grafana, and…]]></title>
            <link>https://medium.com/@btech-engineering/build-an-interactive-openstack-compute-node-monitoring-system-with-prometheus-grafana-and-5fd6e1970e37?source=rss-587830e9864f------2</link>
            <guid isPermaLink="false">https://medium.com/p/5fd6e1970e37</guid>
            <category><![CDATA[prometheus]]></category>
            <category><![CDATA[monitoring-system]]></category>
            <category><![CDATA[grafana]]></category>
            <category><![CDATA[openstack]]></category>
            <category><![CDATA[telegram-bot]]></category>
            <dc:creator><![CDATA[Btech Engineering]]></dc:creator>
            <pubDate>Sat, 19 Oct 2024 03:12:13 GMT</pubDate>
            <atom:updated>2024-10-19T03:12:13.069Z</atom:updated>
            <content:encoded><![CDATA[<h3>Build an Interactive OpenStack Compute Node Monitoring System with Prometheus, Grafana, and Telegram Bot for Real-Time and On-Demand Queries</h3><p>In this article, we’ll explore how to build an interactive OpenStack compute node monitoring system with Prometheus, Grafana, and a Telegram bot for real-time and on-demand resource queries. This guide walks through setting up Prometheus to collect metrics, Grafana to visualize them, and a Telegram bot to answer quick, on-demand resource usage queries. By the end, you’ll have a monitoring solution that delivers real-time insights and on-demand updates directly to your Telegram chat, making it easier to manage OpenStack compute nodes. Whether you’re responsible for cloud infrastructure or looking to optimize your OpenStack environment, this setup will extend your monitoring capabilities.</p><p>We’ll use this topology:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/627/1*_OVaBcHi_GcNag5CN2f_Bw.jpeg" /><figcaption>Lab Topology</figcaption></figure><p>The components and prerequisites for this lab include:</p><ul><li><strong>OpenStack Cluster (Version 2024.1)<br></strong>An operational OpenStack cluster is required for this lab. Here, we already have a cluster configured with one controller node (os-controller01) and two compute nodes (os-compute01 and os-compute02). Some instances must also be running within the cluster.</li><li><strong>Prometheus (Version 2.31.2+ds1)<br></strong>Prometheus must also be installed in this lab. It scrapes metrics from the compute nodes</li><li><strong>Grafana (Version 11.2.2)<br></strong>Grafana is used in this lab for metrics visualization</li><li><strong>Bot (Python3 Version 3.10.12)<br></strong>To integrate with Telegram, we first need to create a bot. 
You can create your own Telegram bot by contacting <a href="https://t.me/BotFather">@BotFather</a>. A server with a Python 3 environment is also required to host the bot</li></ul><p>This lab will use <strong>Ubuntu 22.04 Server</strong> as the operating system for all servers and instances inside the OpenStack cluster.</p><p>Now that all the prerequisites are in place, we can move on to the main configuration.</p><p><strong>A. Install node-exporter and libvirt-exporter on the Compute Nodes</strong></p><ul><li>Log in as root to make our tasks easier</li></ul><pre>sudo -i</pre><ul><li>Install node-exporter via snap. After that, enable and start the service</li></ul><pre># Install the node-exporter from snap edge channel<br>snap install node-exporter --edge<br><br># Enable and start node-exporter service<br>systemctl enable --now snap.node-exporter.node-exporter<br><br># Verify the node-exporter service status<br>systemctl status snap.node-exporter.node-exporter</pre><ul><li>Install libvirt-exporter via snap. Then, grant libvirt-exporter access to the libvirt interface. After that, enable and start the service</li></ul><pre># Install the libvirt-exporter from snap stable channel<br>snap install prometheus-libvirt-exporter<br><br># Grant access for libvirt-exporter to libvirt interface<br>snap connect prometheus-libvirt-exporter:libvirt<br><br># Enable and start libvirt-exporter service<br>systemctl enable --now snap.prometheus-libvirt-exporter.daemon<br><br># Verify the libvirt-exporter service status<br>systemctl status snap.prometheus-libvirt-exporter.daemon</pre><ul><li>To verify, we can scrape the metrics manually using cURL. These requests should return the metrics collected by the exporters. 
Note that node-exporter listens on port 9100, while libvirt-exporter listens on port 9177</li></ul><pre># Verify that node-exporter is running and collecting metrics<br>curl http://localhost:9100/metrics<br><br># Verify that libvirt-exporter is running and collecting metrics<br>curl http://localhost:9177/metrics</pre><p><strong>B. Configure Prometheus to Scrape Node and Libvirt Metrics on the Prometheus Server</strong></p><ul><li>Log in as root to make our tasks easier</li></ul><pre>sudo -i</pre><ul><li>After that, edit the prometheus.yml file. Its location varies with the Prometheus installation method. In this lab, Prometheus is installed using apt, so the file is located at /etc/prometheus/prometheus.yml</li></ul><pre>nano /etc/prometheus/prometheus.yml</pre><ul><li>In the scrape_configs section, add a job configuration to scrape the node-exporter and libvirt-exporter metrics from each compute node. Define the job_name and scrape targets. For the target, use the &lt;ip_or_hostname&gt;:&lt;exporter-port&gt; format. In this lab, we&#39;ll use the hostnames of the compute nodes. 
To achieve this, the Prometheus server is already configured to resolve the hostname of each compute node to its IP address (using /etc/hosts)</li></ul><pre># Put these in scrape_configs section<br># This configuration will scrape node-exporter from os-compute01 and os-compute02<br>- job_name: &#39;Node Exporter&#39;<br>  static_configs:<br>    - targets: [&#39;os-compute01:9100&#39;, &#39;os-compute02:9100&#39;]<br><br># This configuration will scrape libvirt-exporter from os-compute01 and os-compute02<br>- job_name: &#39;Libvirt Exporter&#39;<br>  static_configs:<br>    - targets: [&#39;os-compute01:9177&#39;, &#39;os-compute02:9177&#39;]</pre><ul><li>Save the configuration, and then restart the Prometheus service to apply it</li></ul><pre>service prometheus restart</pre><ul><li>To verify the configuration, open the Prometheus URL in a web browser (http://&lt;prometheus_server_ip_or_hostname&gt;:9090). Then go to <strong>Status &gt; Targets</strong>. Make sure the previously added jobs are up</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*f3LtPEXijy7I676wKWCtCg.png" /><figcaption>Prometheus ‘Node Exporter’ and ‘Libvirt Exporter’ Jobs Status</figcaption></figure><p><strong>C. Grafana Configuration</strong></p><p>In this part, we’ll configure Grafana to visualize three resource utilization metrics:<br>- CPU usage of the compute nodes and the instances running on them<br>- Memory usage of the compute nodes and the instances running on them<br>- Bandwidth utilization of the compute nodes and the instances running on them<br>We’ll also install the grafana-image-renderer plugin and create a service account with a token that we’ll use later.</p><ul><li>Add Prometheus as a data source in Grafana. 
Set the data source connection URL to the Prometheus URL.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wW7mreGnEOUuvc3ovdDI4Q.png" /><figcaption>Grafana Data Source Pointing to Prometheus URL</figcaption></figure><ul><li>After that, add a new Grafana dashboard. Then, add three variables to it. The first variable will be called <strong>host</strong>. This variable stores the hostnames of the compute nodes. The variable type will be Query. The Query type is Label values with Label nodename.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/638/1*3xLZI47HZp2cU1W0vvk10A.png" /><figcaption>‘host’ Variable Configuration</figcaption></figure><ul><li>The second variable will be called <strong>domain</strong>. This variable stores the domain names of instances retrieved by the Libvirt Exporters. The variable type will be Query. The Query type is Label values with Label domain. Also, add the Label filter instance=$host:9177 so it only queries instances on the selected compute node. We’ll also enable the Multi-Value and Include All Option settings for this variable</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/537/1*x3F-vnFNgHCCCkosQri0nQ.png" /><figcaption>‘domain’ Variable Query Configuration</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/377/1*B-K_uGxnVZWn5ykzl0P4RQ.png" /><figcaption>‘domain’ Variable Selected Options</figcaption></figure><ul><li>The third variable will be called <strong>netiface</strong>. This variable stores the network device names of the compute nodes retrieved by the Node Exporters. The variable type will be Query. The Query type is Label values with Label device. 
Also, add the Label filter instance=$host:9100 so it only queries devices on the selected compute node.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/549/1*VhZYpAiGG9s42LnAcmoaEQ.png" /><figcaption>‘netiface’ Variable Configuration</figcaption></figure><ul><li>With all variables in place, we can proceed to add visualization panels. The first panel visualizes the CPU usage of the compute node and all instances running on it (in percent). To retrieve the CPU usage of the compute node, we can use this PromQL:</li></ul><pre>(1 - avg(irate(node_cpu_seconds_total{mode=&quot;idle&quot;, instance=~&quot;$host:9100&quot;}[1m])) by (instance)) * 100</pre><p>For the CPU usage of all instances on the compute node, use this PromQL:</p><pre>avg by (domain) (rate(libvirt_domain_vcpu_time_seconds_total{instance=~&quot;$host:9177&quot;, domain=~&quot;$domain&quot;}[1m]))*100</pre><p>For the Unit of Measurement, use Percent (0–100)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HaJFzsv8hXuw-lp3Ke-22A.png" /><figcaption>Host and Instances CPU Usage Visualization Panel Configuration</figcaption></figure><ul><li>The second panel visualizes the memory usage of the compute node and all instances running on it (in bytes). 
To retrieve Memory usage of compute node, we can use this PromQL:</li></ul><pre>sum by (instance) (avg_over_time(node_memory_MemTotal_bytes{instance=~&quot;$host:9100&quot;}[1m]) - (avg_over_time(node_memory_MemFree_bytes{instance=~&quot;$host:9100&quot;}[1m]) + avg_over_time(node_memory_Cached_bytes{instance=~&quot;$host:9100&quot;}[1m])))</pre><p>We also need to retrieve the memory total of compute node by using this PromQL:</p><pre>sum by (instance) (avg_over_time(node_memory_MemTotal_bytes{instance=~&quot;$host:9100&quot;}[1m]))</pre><p>Use this PromQL to retrieve memory usages of instances inside of the compute node:</p><pre>sum by (domain) (libvirt_domain_memory_stats_available_bytes{instance=~&quot;$host:9177&quot;, domain=~&quot;$domain&quot;} - libvirt_domain_memory_stats_usable_bytes{instance=~&quot;$host:9177&quot;, domain=~&quot;$domain&quot;})</pre><p>For the Unit of Measurement, use bytes(SI)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mRZkrP6OV1dWir06UbHE7A.png" /><figcaption>Host and Instances Memory Usage Visualization Panel Configuration</figcaption></figure><ul><li>The third panel will be visualization of Bandwidth usage of compute node and all instances inside of it (in bytes/sec). 
To retrieve the bandwidth usage of the compute node, we can use this PromQL:</li></ul><pre>avg by (instance) (rate(node_network_receive_bytes_total{instance=~&quot;$host:9100&quot;,device=~&quot;$netiface&quot;}[1m]) + rate(node_network_transmit_bytes_total{instance=~&quot;$host:9100&quot;,device=~&quot;$netiface&quot;}[1m]))</pre><p>And use this PromQL to retrieve the bandwidth usage of all instances on the compute node:</p><pre>avg by (domain) (rate(libvirt_domain_interface_stats_receive_bytes_total{instance=~&quot;$host:9177&quot;, domain=~&quot;$domain&quot;}[1m]) + rate(libvirt_domain_interface_stats_transmit_bytes_total{instance=~&quot;$host:9177&quot;, domain=~&quot;$domain&quot;}[1m]))</pre><p>For the Unit of Measurement, use bytes/sec(SI)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FfDkXyW77UlUnQXzrjPPng.png" /><figcaption>Host and Instances Bandwidth Usage Visualization Panel Configuration</figcaption></figure><ul><li>Save everything. We now have a dashboard with 3 variables and 3 panels to monitor the resource usage of the compute nodes and their instances</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*04lCdZ47ChRsa4rtNNxYeA.png" /><figcaption>Grafana Dashboard to Monitor Resource Usage of Compute Nodes and Instances</figcaption></figure><ul><li>Now, we need to create a service account with a token so that our Python 3 script can access the visualization panels later. From Grafana, go to <strong>Menu &gt; Administration &gt; Users and access &gt; Service accounts</strong>. After that, add a new service account with the Viewer role.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/735/1*-DzxCTLmuLSawi99x4z-Ug.png" /><figcaption>Creating a new Viewer Service Account in Grafana</figcaption></figure><ul><li>Once it is created, you’ll be redirected to the Service Account detail page. On this page, add a new service account token. 
Generate and save the token for later.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/549/1*kG0loBJSU6ZG3Lz_b9DGLQ.png" /><figcaption>Generated Grafana Service Account Token. Save This Token for Later</figcaption></figure><ul><li>Next, we’ll install the grafana-image-renderer plugin. We need this plugin to retrieve rendered images of the Grafana monitoring panels. We’ll do this step from the Grafana server’s terminal</li></ul><pre># Login as root <br>sudo -i<br><br># Install libgbm1. This package is required by the plugin<br>apt update &amp;&amp; apt install libgbm1 -y<br><br># Install grafana-image-renderer plugin<br>grafana-cli plugins install grafana-image-renderer<br><br># Set appropriate owner for /var/lib/grafana directory<br>chown -R grafana:grafana /var/lib/grafana<br><br># Restart grafana-server service<br>service grafana-server restart</pre><p>To verify, show the grafana-server status</p><pre>service grafana-server status</pre><p>From the output, make sure the grafana-image-renderer plugin is loaded in the CGroup section</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9EBg8HVPdI1GhlIz1CLaSg.png" /><figcaption>Grafana Server Status Showing That Plugin is Loaded</figcaption></figure><p><strong>D. Create Python3 Script to Query Resource Utilization and Get the Image of Monitoring Panel</strong></p><ul><li>Install the requests Python 3 library</li></ul><pre>pip3 install requests</pre><ul><li>Create a new file for the script. For example, here we’ll store the project files in a directory and create a new Python 3 file called lab_util.py</li></ul><pre>mkdir bot_util<br>mkdir bot_util/images<br>cd bot_util<br>touch lab_util.py</pre><ul><li>Now, we’ll code the script. 
First, import all needed libraries</li></ul><pre># Import needed libraries<br>from datetime import datetime, timezone<br>import requests<br># Import the urllib.request submodule explicitly, since the image download uses it<br>import urllib.request</pre><ul><li>After that, we’ll define three global variables</li></ul><pre># Prometheus API URL<br>PROMETHEUS_API_URL = &quot;http://192.168.1.201:9090/api/v1/query&quot;<br><br># Grafana Render base URL<br>GRAFANA_BASE_URL = &quot;http://192.168.1.203:3000/render/d/ae0kg8ynstzb4f/openstack-resource-monitoring-dashboard&quot;<br><br># Grafana authorization header. Replace &lt;service_account_token&gt; with appropriate Grafana Service Account Token<br>GRAFANA_HEADERS = [(&#39;Authorization&#39;, &#39;Bearer &lt;service_account_token&gt;&#39;)]</pre><p>Here is a brief explanation of each variable:<br>- <strong>PROMETHEUS_API_URL</strong>: This variable stores the API endpoint for Prometheus queries. The format will be http://&lt;ip_or_hostname_of_prometheus&gt;:&lt;prometheus_port&gt;/api/v1/query<br>- <strong>GRAFANA_BASE_URL</strong>: This variable stores the base render URL for Grafana. Assuming the Grafana Dashboard URL is http://&lt;ip_or_hostname_of_grafana&gt;:&lt;grafana_port&gt;/d/&lt;dashboard_UID_string&gt;/&lt;dashboard_name&gt;, add /render before /d/ so it will be http://&lt;ip_or_hostname_of_grafana&gt;:&lt;grafana_port&gt;/render/d/&lt;dashboard_UID_string&gt;/&lt;dashboard_name&gt; <br>- <strong>GRAFANA_HEADERS</strong>: The authorization header so the script can access the Grafana render URL defined above. 
The format will be [(&#39;Authorization&#39;, &#39;Bearer &lt;service_account_token&gt;&#39;)]</p><ul><li>After that, we’ll define a function that retrieves all instances from a specific compute node, builds a dictionary that maps each libvirt domain to its instance name, and returns it</li></ul><pre>def instances_name_map(node_name):<br>    # Query will be done to libvirt_exporter metrics result<br>    # We need to append the libvirt_exporter port to the node_name<br>    libvirt_exporter = node_name + &quot;:9177&quot;<br><br>    # Request parameter that will be sent to the Prometheus Query API<br>    request_params = {<br>        &quot;query&quot;: f&#39;sum by (domain, instance_name) (libvirt_domain_info_meta{{instance=~&quot;{libvirt_exporter}&quot;}})&#39;,<br>        &#39;current_time&#39; : datetime.now(timezone.utc).isoformat() + &#39;Z&#39;<br>    }<br>    # Initiate empty dictionary that will store the domain to instance name mapping<br>    instances_mapping = {}<br>    <br>    try:<br>        # Send request to Prometheus Query API and get the result JSON<br>        instance_map_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()<br>        # Iterate the JSON result to build the domain to instance name mapping<br>        for instance in instance_map_json[&quot;data&quot;][&quot;result&quot;]:<br>            # Check if the instance name is defined in the JSON result<br>            if &#39;instance_name&#39; in instance[&#39;metric&#39;]:<br>                instances_mapping[instance[&#39;metric&#39;][&#39;domain&#39;]] = instance[&#39;metric&#39;][&#39;instance_name&#39;]<br>            # If not, use the domain name as the instance name<br>            else:<br>                instances_mapping[instance[&#39;metric&#39;][&#39;domain&#39;]] = instance[&#39;metric&#39;][&#39;domain&#39;]<br>        # Return the dictionary with the domain to instance name mapping<br>        return instances_mapping<br>    # On exception, print error message and exit<br>    except:<br>      
  print (&quot;Error in getting instances name map&quot;)<br>        exit(1)</pre><ul><li>Then, define the cpu_util function to calculate the CPU usage for a specified node by constructing request parameters and querying the Prometheus API for both the node&#39;s overall CPU percentage and the top three instances with the highest CPU usage. Format the results into a summary message that includes the node name, overall CPU usage, and the top instances, and return this message from the function.</li></ul><pre>def cpu_util(node_name):<br>    # Define the node exporter and libvirt exporter hostname based on the node name<br>    node_exporter = node_name + &quot;:9100&quot;<br>    libvirt_exporter= node_name + &quot;:9177&quot;<br><br>    # Prepare the request parameters for querying CPU usage from Prometheus<br>    request_params = {<br>        &quot;query&quot;: f&#39;(1 - avg(irate(node_cpu_seconds_total{{mode=&quot;idle&quot;, instance=~&quot;{node_exporter}&quot;}}[1m])) by (instance)) * 100&#39;,<br>        &#39;current_time&#39; : datetime.now(timezone.utc).isoformat() + &#39;Z&#39;  # Get the current time in UTC format<br>    }<br><br>    try:<br>        # Send a GET request to the Prometheus API to retrieve CPU usage data<br>        node_cpu_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()<br>        # Extract and round the CPU usage percentage from the response<br>        node_cpu_percent = round(float(node_cpu_json[&quot;data&quot;][&quot;result&quot;][0][&quot;value&quot;][1]), 2)<br>    except:<br>        # Handle any exceptions by setting the CPU percentage to a default value<br>        node_cpu_percent = &quot;[No Value]&quot;<br>    <br>    # Initialize a string to hold the CPU usage information for instances<br>    instances_usage_text = &quot;&quot;<br>    # Get the mapping of instance names for the given node<br>    instances_map = instances_name_map(node_name)<br>    # Update the request parameters to query the top 3 instances by 
CPU usage<br>    request_params[&#39;query&#39;] = f&#39;topk(3, avg by (domain) (rate(libvirt_domain_vcpu_time_seconds_total{{instance=~&quot;{libvirt_exporter}&quot;}}[1m]))*100)&#39;<br>    <br>    try:<br>        # Send a GET request to the Prometheus API to retrieve CPU usage data for instances<br>        instances_cpu_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()<br>        # Check if any results were returned<br>        if len(instances_cpu_json[&#39;data&#39;][&#39;result&#39;]) == 0:<br>            instances_usage_text = &quot;[No Value]&quot;  # No instances found<br>        else:<br>            # Iterate through the results to format the CPU usage information for each instance<br>            for i in range(len(instances_cpu_json[&#39;data&#39;][&#39;result&#39;])):<br>                instance_cpu_percent = round(float(instances_cpu_json[&quot;data&quot;][&quot;result&quot;][i][&quot;value&quot;][1]), 2)<br>                instance_name = instances_map[instances_cpu_json[&quot;data&quot;][&quot;result&quot;][i][&quot;metric&quot;][&quot;domain&quot;]]<br>                instances_usage_text += f&quot;{i+1}. {instance_name}: {instance_cpu_percent}%\n&quot;<br>    except:<br>        # Handle any exceptions by setting the instances usage text to a default value<br>        instances_usage_text = &quot;[No Value]&quot;<br>    <br>    # Prepare the final result message with node name and CPU usage information<br>    result_message = f&quot;&quot;&quot;<br>Node name: {node_name}<br>CPU usage: {node_cpu_percent}% out of 100%\n<br>Top 3 instances with highest CPU usage inside node:<br>{instances_usage_text}<br>    &quot;&quot;&quot;<br><br>    return result_message  # Return the formatted result message</pre><ul><li>Next, define the memory_util function to calculate the memory usage for a specified node by constructing request parameters to query the Prometheus API for both the total and used memory. 
Format the results into a summary message that includes the node name, memory usage, and the top instances, and return this message from the function.</li></ul><pre>def memory_util(node_name):<br>    # Define the node_exporter and libvirt_exporter variables with the appropriate ports for the specified node<br>    node_exporter = node_name + &quot;:9100&quot;<br>    libvirt_exporter = node_name + &quot;:9177&quot;<br><br>    # Create request parameters to calculate the memory usage by subtracting free, cached, buffers, and reclaimable memory from total memory<br>    request_params = {<br>        &quot;query&quot;: f&#39;sum by (instance) (avg_over_time(node_memory_MemTotal_bytes{{instance=~&quot;{node_exporter}&quot;}}[1m]) - (avg_over_time(node_memory_MemFree_bytes{{instance=~&quot;{node_exporter}&quot;}}[1m]) + avg_over_time(node_memory_Cached_bytes{{instance=~&quot;{node_exporter}&quot;}}[1m]) + avg_over_time(node_memory_Buffers_bytes{{instance=~&quot;{node_exporter}&quot;}}[1m]) + avg_over_time(node_memory_SReclaimable_bytes{{instance=~&quot;{node_exporter}&quot;}}[1m])))&#39;,<br>        &#39;current_time&#39;: datetime.now(timezone.utc).isoformat() + &#39;Z&#39;<br>    }<br><br>    # Try to fetch the node&#39;s memory usage from the Prometheus API and convert it to gigabytes<br>    try:<br>        node_memory_usage_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()<br>        # The value arrives as a string and can be fractional after avg_over_time, so parse it with float() instead of int()<br>        node_memory_usage_gb = round(float(node_memory_usage_json[&quot;data&quot;][&quot;result&quot;][0][&quot;value&quot;][1]) / 1024 / 1024 / 1024, 2)<br>    except:<br>        node_memory_usage_gb = &quot;[No Value]&quot;<br><br>    # Update request parameters to fetch the total memory available for the node<br>    request_params[&#39;query&#39;] = f&#39;sum by (instance) (avg_over_time(node_memory_MemTotal_bytes{{instance=~&quot;{node_exporter}&quot;}}[1m]))&#39;<br>    # Try to fetch the total memory from the Prometheus API and convert it to gigabytes<br>    try:<br>       
 node_memory_total_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()<br>        node_memory_total_gb = round(int(node_memory_total_json[&quot;data&quot;][&quot;result&quot;][0][&quot;value&quot;][1]) / 1024 / 1024 / 1024, 2)<br>    except:  <br>        node_memory_total_gb = &quot;[No Value]&quot;<br>    <br>    # Calculate the percentage of memory usage based on the usage and total memory, handling potential division errors<br>    try:<br>        node_usage_percent = round(node_memory_usage_gb / node_memory_total_gb * 100, 2)<br>    except:<br>        node_usage_percent = &quot;[No Value]&quot;<br><br>    # Initialize an empty string to hold the memory usage details for individual instances<br>    instances_usage_text = &quot;&quot;<br>    # Get the mapping of instances for the specified node<br>    instances_map = instances_name_map(node_name)<br>    # Update request parameters to fetch the top three instances with the highest memory usage<br>    request_params[&#39;query&#39;] = f&#39;topk(3, (libvirt_domain_memory_stats_available_bytes{{instance=~&quot;{libvirt_exporter}&quot;}} - libvirt_domain_memory_stats_usable_bytes{{instance=~&quot;{libvirt_exporter}&quot;}}))&#39;<br>    # Try to fetch the memory usage for the top instances from the Prometheus API<br>    try:<br>        instances_memory_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()<br>        # Check if any instances were returned and format their memory usage into the instances_usage_text<br>        if len(instances_memory_json[&#39;data&#39;][&#39;result&#39;]) == 0:<br>            instances_usage_text = &quot;[No Value]&quot;<br>        else:<br>            for i in range(len(instances_memory_json[&#39;data&#39;][&#39;result&#39;])):<br>                instance_memory_gb = round(int(instances_memory_json[&quot;data&quot;][&quot;result&quot;][i][&quot;value&quot;][1]) / 1024 / 1024 / 1024, 2)<br>                instance_name = 
instances_map[instances_memory_json[&quot;data&quot;][&quot;result&quot;][i][&quot;metric&quot;][&quot;domain&quot;]]<br>                instances_usage_text += f&quot;{i + 1}. {instance_name}: {instance_memory_gb} GB\n&quot;<br>    except Exception as e:<br>        instances_usage_text = &quot;[No Value]&quot;<br><br>    # Construct a result message summarizing the node&#39;s memory usage and the top instances<br>    result_message = f&quot;&quot;&quot;<br>Node name: {node_name}<br>Memory usage: {node_memory_usage_gb} GB out of {node_memory_total_gb} GB ({node_usage_percent}% usage)\n<br>Top 3 instances with highest memory usage inside node:<br>{instances_usage_text}<br>&quot;&quot;&quot;<br>    <br>    # Return the formatted result message<br>    return result_message</pre><ul><li>Now, define the bandwidth_util function to calculate the bandwidth usage for a specified node and network device by constructing request parameters to query the Prometheus API for both received and transmitted bytes. 
Format the results into a summary message that includes the node name, bandwidth usage, and the top instances, and return this message from the function.</li></ul><pre>def bandwidth_util(node_name, net_dev):<br>    # Define the node_exporter and libvirt_exporter variables with the appropriate ports for the specified node<br>    node_exporter = node_name + &quot;:9100&quot;<br>    libvirt_exporter = node_name + &quot;:9177&quot;<br><br>    # Create request parameters to calculate the average bandwidth by summing the received and transmitted bytes for the specified network device<br>    request_params = {<br>        &quot;query&quot;: f&#39;avg by (instance) (rate(node_network_receive_bytes_total{{instance=~&quot;{node_exporter}&quot;,device=~&quot;{net_dev}&quot;}}[1m]) + rate(node_network_transmit_bytes_total{{instance=~&quot;{node_exporter}&quot;,device=~&quot;{net_dev}&quot;}}[1m]))&#39;,<br>        &#39;current_time&#39;: datetime.now(timezone.utc).isoformat() + &#39;Z&#39;<br>    }<br><br>    # Try to fetch the node&#39;s bandwidth usage from the Prometheus API and convert it to kilobytes<br>    try:<br>        node_banwidth_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()<br>        node_bandwidth_kb = round(float(node_banwidth_json[&quot;data&quot;][&quot;result&quot;][0][&quot;value&quot;][1]) / 1024, 3)<br>    except:<br>        node_bandwidth_kb = &quot;[No Value]&quot;<br>    <br>    # Initialize an empty string to hold the bandwidth usage details for individual instances<br>    instances_usage_text = &quot;&quot;<br>    # Get the mapping of instances for the specified node<br>    instances_map = instances_name_map(node_name)<br>    # Update request parameters to fetch the top three instances with the highest bandwidth usage<br>    request_params[&#39;query&#39;] = f&#39;topk(3, avg by (domain) (rate(libvirt_domain_interface_stats_receive_bytes_total{{instance=~&quot;{libvirt_exporter}&quot;}}[1m]) + 
rate(libvirt_domain_interface_stats_transmit_bytes_total{{instance=~&quot;{libvirt_exporter}&quot;}}[1m])))&#39;<br>    # Try to fetch the bandwidth usage for the top instances from the Prometheus API<br>    try:<br>        instances_bandwidth_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()<br>        # Check if any instances were returned and format their bandwidth usage into the instances_usage_text<br>        if len(instances_bandwidth_json[&#39;data&#39;][&#39;result&#39;]) == 0:<br>            instances_usage_text = &quot;[No Value]&quot;<br>        else:<br>            for i in range(len(instances_bandwidth_json[&#39;data&#39;][&#39;result&#39;])):<br>                instance_bandwidth_kb = round(float(instances_bandwidth_json[&quot;data&quot;][&quot;result&quot;][i][&quot;value&quot;][1]) / 1024, 3)<br>                instance_name = instances_map[instances_bandwidth_json[&quot;data&quot;][&quot;result&quot;][i][&quot;metric&quot;][&quot;domain&quot;]]<br>                instances_usage_text += f&quot;{i + 1}. {instance_name}: {instance_bandwidth_kb} KB/s\n&quot;<br>    except Exception as e:<br>        instances_usage_text = &quot;[No Value]&quot;<br><br>    # Construct a result message summarizing the node&#39;s bandwidth usage and the top instances<br>    result_message = f&quot;&quot;&quot;<br>Node name: {node_name}<br>Bandwidth of {net_dev}: {node_bandwidth_kb} KB/s\n<br>Top 3 instances with highest bandwidth inside node:<br>{instances_usage_text}<br>&quot;&quot;&quot;<br>    <br>    # Return the formatted result message<br>    return result_message</pre><ul><li>Last step in this sub-part, implement the get_grafana_dashboard function to retrieve and save Grafana dashboard images based on specified resource types and node names. Ensure to include error handling for the image retrieval process and construct the necessary query parameters for the Grafana API request. 
This function will return the path to the saved image or exit the program if an error occurs.<br>Note that to retrieve the panel number/ID, view the panel and check the Grafana Panel URL, then use the viewPanel parameter value found in the URL. In this lab, the panels for CPU, memory, and bandwidth utilization have Panel IDs 1, 2, and 3, respectively</li></ul><pre>def get_grafana_dashboard(resource_type, node_name, net_dev=None):<br>    # Map resource types to their corresponding panel numbers in Grafana<br>    panel_map = {<br>        &quot;cpu&quot;: 1,<br>        &quot;memory&quot;: 2,<br>        &quot;bandwidth&quot;: 3<br>    }<br>    <br>    # Construct the query parameters for the Grafana API request<br>    grafana_params = f&#39;orgId=1&amp;from=now-1m&amp;var-host={node_name}&amp;var-domain=All&amp;var-instances=All&amp;viewPanel={panel_map[resource_type]}&amp;width=1366&amp;height=1024&amp;autofitpanels&#39;<br>    <br>    # If a specific network device is provided, include it in the parameters<br>    if net_dev != None:<br>        grafana_params += f&#39;&amp;var-netiface={net_dev}&#39;<br>    <br>    # Create the full URL for the Grafana image request<br>    image_url = f&#39;{GRAFANA_BASE_URL}?{grafana_params}&#39;<br>    <br>    # Define the name and directory for saving the image<br>    image_name = f&#39;{node_name}.png&#39;<br>    image_directory = &quot;images/&quot;<br>    <br>    try:<br>        # Set up a URL opener with custom headers for the Grafana request<br>        opener = urllib.request.build_opener()<br>        opener.addheaders = GRAFANA_HEADERS<br>        urllib.request.install_opener(opener)<br>        <br>        # Retrieve the image from the constructed URL and save it to the specified directory<br>        urllib.request.urlretrieve(image_url, f&#39;{image_directory}/{image_name}&#39;)<br>        <br>        # Return the path to the saved image<br>        return f&#39;{image_directory}/{image_name}&#39;<br>    <br>    except 
Exception as e:<br>        # Print an error message if the image retrieval fails and exit the program<br>        print(f&quot;Error in retrieving image: {e}&quot;)<br>        exit(1)</pre><ul><li>Save the Python3 script file and proceed to the next part</li></ul><p><strong>E. Create Python3 Script to Serve the Telegram Bot</strong></p><ul><li>Install the pyTelegramBotAPI library</li></ul><pre>pip3 install pyTelegramBotAPI</pre><ul><li>Create a new text file that will store the help message explaining how to use the bot</li></ul><pre>nano bot_util/help.txt</pre><p>For the content, write a short explanation of how to use the bot. Example:</p><pre>Usage:  /node_util [host] [resource] &lt;network_ifname&gt;<br><br>Parameters:<br>  host - The node/host name. Make sure it exists in Prometheus.<br>  resource - Resource type. Either cpu/memory/bandwidth.<br>  network_ifname - Network interface name to check. Required for, and only used with, the &#39;bandwidth&#39; resource type.</pre><p>After that, save the file and proceed to the next step</p><ul><li>Create the Python3 script file</li></ul><pre>cd bot_util<br>touch bot.py</pre><ul><li>Now we’ll proceed to the coding part. 
First, import needed script and library</li></ul><pre># Import needed script and library<br>import lab_util<br>import telebot</pre><ul><li>Next, define BOT_API_TOKEN global variable that will store the Telegram bot token, and create a new Telegram bot instance using that token</li></ul><pre># Define the API token for the bot, which is required to authenticate with the Telegram Bot API<br># Replace &lt;Telegram Bot Token&gt; with appropriate Telegram Bot Token<br>BOT_API_TOKEN = &quot;&lt;Telegram Bot Token&gt;&quot;<br><br># Create an instance of the TeleBot class using the provided API token<br># This instance will be used to interact with the Telegram Bot API and handle messages<br>bot = telebot.TeleBot(BOT_API_TOKEN)</pre><ul><li>After that, create a function that will return text content of help.txt file</li></ul><pre>def help_message():<br>    # This function attempts to read the content of the &#39;help.txt&#39; file<br>    try:<br>        # Open the &#39;help.txt&#39; file in read mode<br>        with open(&#39;help.txt&#39;, &#39;r&#39;) as file:<br>            # Read the entire content of the file<br>            content = file.read()<br>            # Return the content read from the file<br>            return content<br>    except:<br>        # If an error occurs (e.g., file not found), return a default error message<br>        return &quot;Can&#39;t retrieve help message&quot;</pre><ul><li>Create message handler that will send help message when user send /start and /help command to the bot</li></ul><pre># Send the help message when user send &#39;/start&#39; command<br>@bot.message_handler(commands=[&#39;start&#39;])<br>def send_welcome(message):<br>    bot.reply_to(message, help_message())<br><br># Send the help message when user send &#39;/help&#39; command<br>@bot.message_handler(commands=[&#39;help&#39;])<br>def send_help(message):<br>    bot.reply_to(message, help_message())</pre><ul><li>Next, implement message handler when user send the /node_util 
command followed by the needed parameters. The handle_node_util function processes user commands for retrieving node resource utilization metrics. It parses the incoming message to identify the requested node name and resource type, then calls the appropriate utility functions to gather the data. Finally, the bot responds with both the text summary and the related panel image, and falls back to the help message for invalid commands.</li></ul><pre>@bot.message_handler(commands=[&#39;node_util&#39;])<br>def handle_node_util(message):<br>    # Split the incoming message text into command parameters<br>    command_params = message.text.split()<br><br>    # Check if there are exactly 3 tokens (the command, node_name, and resource_type)<br>    if len(command_params) == 3:<br>        node_name = command_params[1]  # Extract the node name from the parameters<br>        resource_type = command_params[2]  # Extract the resource type from the parameters<br>        <br>        # If the resource type is &#39;cpu&#39;, retrieve CPU utilization and corresponding image<br>        if resource_type == &#39;cpu&#39;:<br>            cpu_util_text = lab_util.cpu_util(node_name)  # Get CPU utilization text<br>            cpu_util_image = lab_util.get_grafana_dashboard(resource_type, node_name)  # Get the Grafana dashboard image for CPU<br>            bot.reply_to(message, cpu_util_text)  # Send the CPU utilization text back to the user<br>            with open(cpu_util_image, &#39;rb&#39;) as image:  # Open the image file<br>                bot.send_photo(message.chat.id, image)  # Send the CPU utilization image to the user<br>        <br>        # If the resource type is &#39;memory&#39;, retrieve memory utilization and corresponding image<br>        elif resource_type == &#39;memory&#39;:<br>            memory_util_test = lab_util.memory_util(node_name)  # Get memory utilization text<br>            memory_util_image = lab_util.get_grafana_dashboard(resource_type, 
node_name)  # Get the Grafana dashboard image for memory<br>            bot.reply_to(message, memory_util_test)  # Send the memory utilization text back to the user<br>            with open(memory_util_image, &#39;rb&#39;) as image:  # Open the image file<br>                bot.send_photo(message.chat.id, image)  # Send the memory utilization image to the user<br>        <br>        # If the resource type is not recognized, send the help message<br>        else:<br>            bot.reply_to(message, help_message())<br>    <br>    # Check if there are exactly 4 parameters (node_name, resource_type, and net_dev)<br>    elif len(command_params) == 4:<br>        node_name = command_params[1]  # Extract the node name from the parameters<br>        resource_type = command_params[2]  # Extract the resource type from the parameters<br>        net_dev = command_params[3]  # Extract the network device from the parameters<br>        <br>        # If the resource type is &#39;bandwidth&#39;, retrieve bandwidth utilization and corresponding image<br>        if resource_type == &#39;bandwidth&#39;:<br>            bandwidth_util_text = lab_util.bandwidth_util(node_name, net_dev)  # Get bandwidth utilization text<br>            bandwidth_util_image = lab_util.get_grafana_dashboard(resource_type, node_name, net_dev=net_dev)  # Get the Grafana dashboard image for bandwidth<br>            bot.reply_to(message, bandwidth_util_text)  # Send the bandwidth utilization text back to the user<br>            with open(bandwidth_util_image, &#39;rb&#39;) as image:  # Open the image file<br>                bot.send_photo(message.chat.id, image)  # Send the bandwidth utilization image to the user<br>        <br>        # If the resource type is not recognized, send the help message<br>        else:<br>            bot.reply_to(message, help_message())<br><br>    # If the number of parameters is not 3 or 4, send the help message<br>    else:<br>        bot.reply_to(message, 
help_message())</pre><ul><li>Finally, start the bot polling mechanism to listen for user messages</li></ul><pre>bot.polling()</pre><p><strong>F. Wrap-Up and Testing</strong></p><ul><li>Node Exporter and Libvirt Exporter have been installed on the compute nodes and are exposing the related metrics</li><li>Prometheus has been configured to scrape metrics data from the exporters installed on the compute nodes</li><li>Grafana has been configured to visualize resource utilization for compute nodes and the instances inside them. The plugin that renders a Grafana visualization panel into an image has also been installed, and the required Service Account Token has been created so the Python3 script can retrieve the rendered panel image</li><li>Python3 script to retrieve resource utilization and return the result as a human-readable summary text and the related Grafana panel image:</li></ul><pre># Import needed libraries<br>from datetime import datetime, timezone<br>import requests<br># Import the urllib.request submodule explicitly, since the image download uses it<br>import urllib.request<br><br># Prometheus API URL<br>PROMETHEUS_API_URL = &quot;http://192.168.1.201:9090/api/v1/query&quot;<br><br># Grafana Render base URL<br>GRAFANA_BASE_URL = &quot;http://192.168.1.203:3000/render/d/ae0kg8ynstzb4f/openstack-resource-monitoring-dashboard&quot;<br><br># Grafana authorization header. 
Replace &lt;service_account_token&gt; with appropriate Grafana Service Account Token<br>GRAFANA_HEADERS = [(&#39;Authorization&#39;, &#39;Bearer &lt;service_account_token&gt;&#39;)]<br><br>def instances_name_map(node_name):<br>    # Query will be done to libvirt_exporter metrics result<br>    # We need to append the libvirt_exporter port to the node_name<br>    libvirt_exporter= node_name + &quot;:9177&quot;<br><br>    # Request parameter that will be sent to the Prometheus Query API<br>    request_params = {<br>        &quot;query&quot;: f&#39;sum by (domain, instance_name) (libvirt_domain_info_meta{{instance=~&quot;{libvirt_exporter}&quot;}})&#39;,<br>        &#39;current_time&#39; : datetime.now(timezone.utc).isoformat() + &#39;Z&#39;<br>    }<br><br>    # Initiate empty dictionary that will store instance name to domain mapping<br>    instances_mapping = {}<br><br>    <br>    try:<br>        # Send request to Prometheus Query API and get the result JSON<br>        instance_map_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()<br><br>        #Iterate the JSON result to get the instance name to domain mapping<br>        for instance in instance_map_json[&quot;data&quot;][&quot;result&quot;]:<br>            # Check if the instance name is defined in the JSON result<br>            if &#39;instance_name&#39; in instance[&#39;metric&#39;]:<br>                instances_mapping[instance[&#39;metric&#39;][&#39;domain&#39;]] = instance[&#39;metric&#39;][&#39;instance_name&#39;]<br><br>            # If not, use the domain name as the instance name<br>            else:<br>                instances_mapping[instance[&#39;metric&#39;][&#39;domain&#39;]] = instance[&#39;metric&#39;][&#39;domain&#39;]<br><br>        # Return the dictionary with the instance name to domain mapping<br>        return instances_mapping<br><br>    # On exception, print error message and exit<br>    except:<br>        print (&quot;Error in getting instances name map&quot;)<br>     
   exit(1)<br><br>def cpu_util(node_name):<br>    # Define the node exporter and libvirt exporter hostname based on the node name<br>    node_exporter = node_name + &quot;:9100&quot;<br>    libvirt_exporter= node_name + &quot;:9177&quot;<br><br>    # Prepare the request parameters for querying CPU usage from Prometheus<br>    request_params = {<br>        &quot;query&quot;: f&#39;(1 - avg(irate(node_cpu_seconds_total{{mode=&quot;idle&quot;, instance=~&quot;{node_exporter}&quot;}}[1m])) by (instance)) * 100&#39;,<br>        &#39;current_time&#39; : datetime.now(timezone.utc).isoformat() + &#39;Z&#39;  # Get the current time in UTC format<br>    }<br><br>    try:<br>        # Send a GET request to the Prometheus API to retrieve CPU usage data<br>        node_cpu_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()<br>        # Extract and round the CPU usage percentage from the response<br>        node_cpu_percent = round(float(node_cpu_json[&quot;data&quot;][&quot;result&quot;][0][&quot;value&quot;][1]), 2)<br>    except:<br>        # Handle any exceptions by setting the CPU percentage to a default value<br>        node_cpu_percent = &quot;[No Value]&quot;<br>    <br>    # Initialize a string to hold the CPU usage information for instances<br>    instances_usage_text = &quot;&quot;<br>    # Get the mapping of instance names for the given node<br>    instances_map = instances_name_map(node_name)<br>    # Update the request parameters to query the top 3 instances by CPU usage<br>    request_params[&#39;query&#39;] = f&#39;topk(3, avg by (domain) (rate(libvirt_domain_vcpu_time_seconds_total{{instance=~&quot;{libvirt_exporter}&quot;}}[1m]))*100)&#39;<br>    <br>    try:<br>        # Send a GET request to the Prometheus API to retrieve CPU usage data for instances<br>        instances_cpu_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()<br>        # Check if any results were returned<br>        if 
len(instances_cpu_json[&#39;data&#39;][&#39;result&#39;]) == 0:<br>            instances_usage_text = &quot;[No Value]&quot;  # No instances found<br>        else:<br>            # Iterate through the results to format the CPU usage information for each instance<br>            for i in range(len(instances_cpu_json[&#39;data&#39;][&#39;result&#39;])):<br>                instance_cpu_percent = round(float(instances_cpu_json[&quot;data&quot;][&quot;result&quot;][i][&quot;value&quot;][1]), 2)<br>                instance_name = instances_map[instances_cpu_json[&quot;data&quot;][&quot;result&quot;][i][&quot;metric&quot;][&quot;domain&quot;]]<br>                instances_usage_text += f&quot;{i+1}. {instance_name}: {instance_cpu_percent}%\n&quot;<br>    except:<br>        # Handle any exceptions by setting the instances usage text to a default value<br>        instances_usage_text = &quot;[No Value]&quot;<br>    <br>    # Prepare the final result message with node name and CPU usage information<br>    result_message = f&quot;&quot;&quot;<br>Node name: {node_name}<br>CPU usage: {node_cpu_percent}% out of 100%\n<br>Top 3 instances with highest CPU usage inside node:<br>{instances_usage_text}<br>    &quot;&quot;&quot;<br><br>    return result_message  # Return the formatted result message<br><br>def memory_util(node_name):<br>    # Define the node_exporter and libvirt_exporter variables with the appropriate ports for the specified node<br>    node_exporter = node_name + &quot;:9100&quot;<br>    libvirt_exporter = node_name + &quot;:9177&quot;<br><br>    # Create request parameters to calculate the memory usage by subtracting free, cached, buffers, and reclaimable memory from total memory<br>    request_params = {<br>        &quot;query&quot;: f&#39;sum by (instance) (avg_over_time(node_memory_MemTotal_bytes{{instance=~&quot;{node_exporter}&quot;}}[1m]) - (avg_over_time(node_memory_MemFree_bytes{{instance=~&quot;{node_exporter}&quot;}}[1m]) + 
avg_over_time(node_memory_Cached_bytes{{instance=~&quot;{node_exporter}&quot;}}[1m]) + avg_over_time(node_memory_Buffers_bytes{{instance=~&quot;{node_exporter}&quot;}}[1m]) + avg_over_time(node_memory_SReclaimable_bytes{{instance=~&quot;{node_exporter}&quot;}}[1m])))&#39;,<br>        &#39;current_time&#39;: datetime.now(timezone.utc).isoformat() + &#39;Z&#39;<br>    }<br><br>    # Try to fetch the node&#39;s memory usage from the Prometheus API and convert it to gigabytes<br>    # float() instead of int(): Prometheus returns sample values as strings that may contain a fractional part<br>    try:<br>        node_memory_usage_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()<br>        node_memory_usage_gb = round(float(node_memory_usage_json[&quot;data&quot;][&quot;result&quot;][0][&quot;value&quot;][1]) / 1024 / 1024 / 1024, 2)<br>    except Exception:<br>        node_memory_usage_gb = &quot;[No Value]&quot;<br><br>    # Update request parameters to fetch the total memory available for the node<br>    request_params[&#39;query&#39;] = f&#39;sum by (instance) (avg_over_time(node_memory_MemTotal_bytes{{instance=~&quot;{node_exporter}&quot;}}[1m]))&#39;<br>    # Try to fetch the total memory from the Prometheus API and convert it to gigabytes<br>    try:<br>        node_memory_total_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()<br>        node_memory_total_gb = round(float(node_memory_total_json[&quot;data&quot;][&quot;result&quot;][0][&quot;value&quot;][1]) / 1024 / 1024 / 1024, 2)<br>    except Exception:<br>        node_memory_total_gb = &quot;[No Value]&quot;<br>    <br>    # Calculate the percentage of memory usage; this fails if either value is the &quot;[No Value]&quot; placeholder or the total is zero<br>    try:<br>        node_usage_percent = round(node_memory_usage_gb / node_memory_total_gb * 100, 2)<br>    except (TypeError, ZeroDivisionError):<br>        node_usage_percent = &quot;[No Value]&quot;<br><br>    # Initialize an empty string to hold the memory usage details for individual instances<br>    instances_usage_text = &quot;&quot;<br>    # Get the mapping of instances for the specified 
node<br>    instances_map = instances_name_map(node_name)<br>    # Update request parameters to fetch the top three instances with the highest memory usage<br>    request_params[&#39;query&#39;] = f&#39;topk(3, (libvirt_domain_memory_stats_available_bytes{{instance=~&quot;{libvirt_exporter}&quot;}} - libvirt_domain_memory_stats_usable_bytes{{instance=~&quot;{libvirt_exporter}&quot;}}))&#39;<br>    # Try to fetch the memory usage for the top instances from the Prometheus API<br>    try:<br>        instances_memory_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()<br>        # Check if any instances were returned and format their memory usage into the instances_usage_text<br>        if len(instances_memory_json[&#39;data&#39;][&#39;result&#39;]) == 0:<br>            instances_usage_text = &quot;[No Value]&quot;<br>        else:<br>            for i in range(len(instances_memory_json[&#39;data&#39;][&#39;result&#39;])):<br>                # float() tolerates sample values that contain a fractional part<br>                instance_memory_gb = round(float(instances_memory_json[&quot;data&quot;][&quot;result&quot;][i][&quot;value&quot;][1]) / 1024 / 1024 / 1024, 2)<br>                instance_name = instances_map[instances_memory_json[&quot;data&quot;][&quot;result&quot;][i][&quot;metric&quot;][&quot;domain&quot;]]<br>                instances_usage_text += f&quot;{i + 1}. 
{instance_name}: {instance_memory_gb} GB\n&quot;<br>    except Exception:<br>        instances_usage_text = &quot;[No Value]&quot;<br><br>    # Construct a result message summarizing the node&#39;s memory usage and the top instances<br>    result_message = f&quot;&quot;&quot;<br>Node name: {node_name}<br>Memory usage: {node_memory_usage_gb} GB out of {node_memory_total_gb} GB ({node_usage_percent}% usage)\n<br>Top 3 instances with highest memory usage inside node:<br>{instances_usage_text}<br>&quot;&quot;&quot;<br>    <br>    # Return the formatted result message<br>    return result_message<br><br>def bandwidth_util(node_name, net_dev):<br>    # Define the node_exporter and libvirt_exporter variables with the appropriate ports for the specified node<br>    node_exporter = node_name + &quot;:9100&quot;<br>    libvirt_exporter = node_name + &quot;:9177&quot;<br><br>    # Create request parameters to calculate the average bandwidth by summing the received and transmitted bytes for the specified network device<br>    request_params = {<br>        &quot;query&quot;: f&#39;avg by (instance) (rate(node_network_receive_bytes_total{{instance=~&quot;{node_exporter}&quot;,device=~&quot;{net_dev}&quot;}}[1m]) + rate(node_network_transmit_bytes_total{{instance=~&quot;{node_exporter}&quot;,device=~&quot;{net_dev}&quot;}}[1m]))&#39;,<br>        &#39;current_time&#39;: datetime.now(timezone.utc).isoformat() + &#39;Z&#39;<br>    }<br><br>    # Try to fetch the node&#39;s bandwidth usage from the Prometheus API and convert it from bytes/s to KB/s<br>    try:<br>        node_bandwidth_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()<br>        node_bandwidth_kb = round(float(node_bandwidth_json[&quot;data&quot;][&quot;result&quot;][0][&quot;value&quot;][1]) / 1024, 3)<br>    except Exception:<br>        node_bandwidth_kb = &quot;[No Value]&quot;<br>    <br>    # Initialize an empty string to hold the bandwidth usage details for individual instances<br>    instances_usage_text 
= &quot;&quot;<br>    # Get the mapping of instances for the specified node<br>    instances_map = instances_name_map(node_name)<br>    # Update request parameters to fetch the top three instances with the highest bandwidth usage<br>    request_params[&#39;query&#39;] = f&#39;topk(3, avg by (domain) (rate(libvirt_domain_interface_stats_receive_bytes_total{{instance=~&quot;{libvirt_exporter}&quot;}}[1m]) + rate(libvirt_domain_interface_stats_transmit_bytes_total{{instance=~&quot;{libvirt_exporter}&quot;}}[1m])))&#39;<br>    # Try to fetch the bandwidth usage for the top instances from the Prometheus API<br>    try:<br>        instances_bandwidth_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()<br>        # Check if any instances were returned and format their bandwidth usage into the instances_usage_text<br>        if len(instances_bandwidth_json[&#39;data&#39;][&#39;result&#39;]) == 0:<br>            instances_usage_text = &quot;[No Value]&quot;<br>        else:<br>            for i in range(len(instances_bandwidth_json[&#39;data&#39;][&#39;result&#39;])):<br>                instance_bandwidth_kb = round(float(instances_bandwidth_json[&quot;data&quot;][&quot;result&quot;][i][&quot;value&quot;][1]) / 1024, 3)<br>                instance_name = instances_map[instances_bandwidth_json[&quot;data&quot;][&quot;result&quot;][i][&quot;metric&quot;][&quot;domain&quot;]]<br>                instances_usage_text += f&quot;{i + 1}. 
{instance_name}: {instance_bandwidth_kb} KB/s\n&quot;<br>    except Exception as e:<br>        instances_usage_text = &quot;[No Value]&quot;<br><br>    # Construct a result message summarizing the node&#39;s bandwidth usage and the top instances<br>    result_message = f&quot;&quot;&quot;<br>Node name: {node_name}<br>Bandwidth of {net_dev}: {node_bandwidth_kb} KB/s\n<br>Top 3 instances with highest bandwidth inside node:<br>{instances_usage_text}<br>&quot;&quot;&quot;<br>    <br>    # Return the formatted result message<br>    return result_message<br><br>def get_grafana_dashboard(resource_type, node_name, net_dev=None):<br>    # Map resource types to their corresponding panel numbers in Grafana<br>    panel_map = {<br>        &quot;cpu&quot;: 1,<br>        &quot;memory&quot;: 2,<br>        &quot;bandwidth&quot;: 3<br>    }<br>    <br>    # Construct the query parameters for the Grafana API request<br>    grafana_params = f&#39;orgId=1&amp;from=now-1m&amp;var-host={node_name}&amp;var-domain=All&amp;var-instances=All&amp;viewPanel={panel_map[resource_type]}&amp;width=1366&amp;height=1024&amp;autofitpanels&#39;<br>    <br>    # If a specific network device is provided, include it in the parameters<br>    if net_dev != None:<br>        grafana_params += f&#39;&amp;var-netiface={net_dev}&#39;<br>    <br>    # Create the full URL for the Grafana image request<br>    image_url = f&#39;{GRAFANA_BASE_URL}?{grafana_params}&#39;<br>    <br>    # Define the name and directory for saving the image<br>    image_name = f&#39;{node_name}.png&#39;<br>    image_directory = &quot;images/&quot;<br>    <br>    try:<br>        # Set up a URL opener with custom headers for the Grafana request<br>        opener = urllib.request.build_opener()<br>        opener.addheaders = GRAFANA_HEADERS<br>        urllib.request.install_opener(opener)<br>        <br>        # Retrieve the image from the constructed URL and save it to the specified directory<br>        
urllib.request.urlretrieve(image_url, f&#39;{image_directory}/{image_name}&#39;)<br>        <br>        # Return the path to the saved image<br>        return f&#39;{image_directory}/{image_name}&#39;<br>    <br>    except Exception as e:<br>        # Print an error message if the image retrieval fails and exit the program<br>        print(f&quot;Error in retrieving image: {e}&quot;)<br>        exit(1)</pre><ul><li>Python3 script to serve the Telegram bot:</li></ul><pre># Import needed script and library<br>import lab_util<br>import telebot<br><br># Define the API token for the bot, which is required to authenticate with the Telegram Bot API<br># Replace &lt;Telegram Bot Token&gt; with appropriate Telegram Bot Token<br>BOT_API_TOKEN = &quot;&lt;Telegram Bot Token&gt;&quot;<br><br># Create an instance of the TeleBot class using the provided API token<br># This instance will be used to interact with the Telegram Bot API and handle messages<br>bot = telebot.TeleBot(BOT_API_TOKEN)<br><br>def help_message():<br>    # This function attempts to read the content of the &#39;help.txt&#39; file<br>    try:<br>        # Open the &#39;help.txt&#39; file in read mode<br>        with open(&#39;help.txt&#39;, &#39;r&#39;) as file:<br>            # Read the entire content of the file<br>            content = file.read()<br>            # Return the content read from the file<br>            return content<br>    except:<br>        # If an error occurs (e.g., file not found), return a default error message<br>        return &quot;Can&#39;t retrieve help message&quot;<br><br># Send the help message when user send &#39;/start&#39; command<br>@bot.message_handler(commands=[&#39;start&#39;])<br>def send_welcome(message):<br>    bot.reply_to(message, help_message())<br><br># Send the help message when user send &#39;/help&#39; command<br>@bot.message_handler(commands=[&#39;help&#39;])<br>def send_help(message):<br>    bot.reply_to(message, 
help_message())<br><br>@bot.message_handler(commands=[&#39;node_util&#39;])<br>def handle_node_util(message):<br>    # Split the incoming message text into command parameters<br>    command_params = message.text.split()<br><br>    # Check if there are exactly 3 parameters (node_name and resource_type)<br>    if len(command_params) == 3:<br>        node_name = command_params[1]  # Extract the node name from the parameters<br>        resource_type = command_params[2]  # Extract the resource type from the parameters<br>        <br>        # If the resource type is &#39;cpu&#39;, retrieve CPU utilization and corresponding image<br>        if resource_type == &#39;cpu&#39;:<br>            cpu_util_text = lab_util.cpu_util(node_name)  # Get CPU utilization text<br>            cpu_util_image = lab_util.get_grafana_dashboard(resource_type, node_name)  # Get the Grafana dashboard image for CPU<br>            bot.reply_to(message, cpu_util_text)  # Send the CPU utilization text back to the user<br>            with open(cpu_util_image, &#39;rb&#39;) as image:  # Open the image file<br>                bot.send_photo(message.chat.id, image)  # Send the CPU utilization image to the user<br>        <br>        # If the resource type is &#39;memory&#39;, retrieve memory utilization and corresponding image<br>        elif resource_type == &#39;memory&#39;:<br>            memory_util_test = lab_util.memory_util(node_name)  # Get memory utilization text<br>            memory_util_image = lab_util.get_grafana_dashboard(resource_type, node_name)  # Get the Grafana dashboard image for memory<br>            bot.reply_to(message, memory_util_test)  # Send the memory utilization text back to the user<br>            with open(memory_util_image, &#39;rb&#39;) as image:  # Open the image file<br>                bot.send_photo(message.chat.id, image)  # Send the memory utilization image to the user<br>        <br>        # If the resource type is not recognized, send the help message<br>       
 else:<br>            bot.reply_to(message, help_message())<br>    <br>    # Check if there are exactly 4 parameters (node_name, resource_type, and net_dev)<br>    elif len(command_params) == 4:<br>        node_name = command_params[1]  # Extract the node name from the parameters<br>        resource_type = command_params[2]  # Extract the resource type from the parameters<br>        net_dev = command_params[3]  # Extract the network device from the parameters<br>        <br>        # If the resource type is &#39;bandwidth&#39;, retrieve bandwidth utilization and corresponding image<br>        if resource_type == &#39;bandwidth&#39;:<br>            bandwidth_util_text = lab_util.bandwidth_util(node_name, net_dev)  # Get bandwidth utilization text<br>            bandwidth_util_image = lab_util.get_grafana_dashboard(resource_type, node_name, net_dev=net_dev)  # Get the Grafana dashboard image for bandwidth<br>            bot.reply_to(message, bandwidth_util_text)  # Send the bandwidth utilization text back to the user<br>            with open(bandwidth_util_image, &#39;rb&#39;) as image:  # Open the image file<br>                bot.send_photo(message.chat.id, image)  # Send the bandwidth utilization image to the user<br>        <br>        # If the resource type is not recognized, send the help message<br>        else:<br>            bot.reply_to(message, help_message())<br><br>    # If the number of parameters is not 3 or 4, send the help message<br>    else:<br>        bot.reply_to(message, help_message())<br><br>bot.polling()</pre><ul><li>To test the Telegram bot, run the bot script</li></ul><pre>cd bot_util<br>python3 bot.py</pre><ul><li>Open Telegram, send the command to bot as message. 
For example, first send the /start and /help commands to show the help message.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*cnw_KklnmXHIwGE9NmKt0Q.png" /><figcaption>Bot replying with the help message</figcaption></figure><ul><li>Then, send a command to query CPU utilization on the os-compute01 node and retrieve the top 3 instances with the highest CPU utilization on it.</li></ul><pre>/node_util os-compute01 cpu</pre><p>The bot replies with the expected result and the CPU utilization Grafana panel.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/787/1*diW5vVGuKLFzVNkYE4A_UA.png" /><figcaption>CPU utilization query result from the bot</figcaption></figure><ul><li>Next, send a command to query memory utilization on the os-compute01 node and retrieve the top 3 instances with the highest memory utilization on it.</li></ul><pre>/node_util os-compute01 memory</pre><p>The bot replies with the expected result and the memory utilization Grafana panel.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/1*ATPSLkXxGke51eB3Is4LEQ.png" /><figcaption>Memory utilization query result from the bot</figcaption></figure><ul><li>Last, send a command to query bandwidth utilization of the enp1s0 network interface on the os-compute01 node and retrieve the top 3 instances with the highest bandwidth utilization on it. Note that the per-instance bandwidth figures are not specific to a network interface.</li></ul><pre>/node_util os-compute01 bandwidth enp1s0</pre><p>The bot replies with the expected result and the bandwidth utilization Grafana panel image.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/793/1*u8sfq62fLk_B04LcYdZo9g.png" /><figcaption>Bandwidth utilization query result from the bot</figcaption></figure><p><strong>Author</strong>: <br>Kevin Timoteus Sirait, PT. 
Boer Technology | <a href="https://medium.com/@kevintim">Medium</a> | <a href="https://www.linkedin.com/in/kevin-tim/">LinkedIn</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Deploy Monitoring Server Using Zabbix With External PostgreSQL-16]]></title>
            <link>https://medium.com/@btech-engineering/deploy-monitoring-server-using-zabbix-with-external-postgresql-16-d3c7eed09bca?source=rss-587830e9864f------2</link>
            <guid isPermaLink="false">https://medium.com/p/d3c7eed09bca</guid>
            <dc:creator><![CDATA[Btech Engineering]]></dc:creator>
            <pubDate>Wed, 02 Oct 2024 03:20:47 GMT</pubDate>
            <atom:updated>2024-10-02T03:24:25.363Z</atom:updated>
<content:encoded><![CDATA[<p>In this scenario, the administrator wants to monitor the database server and the core router and produce monthly reports. To do so, the administrator adds a monitoring server and chooses Zabbix as the monitoring system.</p><h3>Topology</h3><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*13Ya7jcXOEDxKdN1Ys23fw.png" /></figure><h3>Lab Specification:</h3><ul><li>Ubuntu Server 24.04 LTS (Zabbix only supports LTS releases)</li><li>Zabbix Server 7.0</li><li>Zabbix Agent</li><li>PostgreSQL-16</li><li>NGINX</li><li>SNMP</li></ul><h3>Lab Requirement:</h3><p>Zabbix, Ubuntu Server 24.04 LTS, PostgreSQL-16, snmpwalk, Zabbix agent.</p><h3>Lab Technical Configuration</h3><h4>PostgreSQL As Database Server</h4><p><strong>&gt; Installing PostgreSQL-16</strong></p><blockquote><strong>Adding PostgreSQL-16 repository</strong></blockquote><p>First, add the PostgreSQL repository to your system and refresh the package index.</p><pre>apt install curl ca-certificates<br>install -d /usr/share/postgresql-common/pgdg<br>curl -o /usr/share/postgresql-common/pgdg/apt.postgresql.org.asc --fail https://www.postgresql.org/media/keys/ACCC4CF8.asc<br>sudo sh -c &#39;echo &quot;deb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.asc] https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main&quot; &gt; /etc/apt/sources.list.d/pgdg.list&#39;<br>apt update</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*t96ZgLAVlnD39sAfU3H8Qw.png" /></figure><blockquote><strong>Installing PostgreSQL-16 Packages</strong></blockquote><p>Now install the database packages on your system.</p><pre>sudo apt -y install postgresql-16</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eQl1IPEFZ3C3ot4txvXLzA.png" /></figure><p><strong>&gt; Creating Zabbix Database, User, And Password</strong></p><blockquote><strong>Log in to PostgreSQL-16</strong></blockquote><p>After the installation finishes, log in 
to the database system as the <strong>postgres</strong> user and open the psql shell:</p><pre>su - postgres<br>psql</pre><blockquote><strong>Creating the database and user</strong></blockquote><p>Create the user and the database, then make the user the owner of the database:</p><pre>CREATE USER zabbix PASSWORD &#39;password&#39;;<br>CREATE DATABASE zabbix;<br>ALTER DATABASE zabbix OWNER TO zabbix;<br>\l<br>\q</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4N5UibwUsroKXJHZv9imRw.png" /></figure><p><strong>&gt; Exposing PostgreSQL To Allow Zabbix Access</strong></p><blockquote><strong>Configure the PostgreSQL main config</strong></blockquote><p>By default, PostgreSQL accepts connections only from localhost, so you need to change listen_addresses to allow remote connections. Note that listen_addresses takes the addresses of the server&#39;s own interfaces (or &#39;*&#39; for all of them), not client networks; client ranges such as 192.168.0.0/24 (a network range) or 192.168.0.1/32 (a single host) are set in pg_hba.conf in the next step. For production, I recommend allowing only a specific range or a single host.</p><pre>nano /etc/postgresql/16/main/postgresql.conf<br>---<br>listen_addresses = &#39;*&#39;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/692/1*d8YLfNJTz-Q_pnPIbv3ERA.png" /></figure><blockquote><strong>Configure the access rule</strong></blockquote><p>pg_hba.conf is a second security layer: each rule determines which database, which user, and which client address can connect, and with which authentication method.</p><p>For this lab, I will allow all. 
Again, for production, I recommend listing only the specific users, databases, and client addresses that need access.</p><pre>nano /etc/postgresql/16/main/pg_hba.conf<br>---<br>host    all             all             0.0.0.0/0               md5</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/627/1*QxUN2gZ5mtrpEmQDNECI7g.png" /></figure><p><strong>&gt; Applying &amp; Checking PostgreSQL Services</strong></p><p>After changing the configuration, restart the service and check its status to make sure it is running.</p><pre>systemctl restart postgresql<br>systemctl status postgresql</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/755/1*EfeLWAXJmH4WdJEL6bVmeA.png" /></figure><h4>Zabbix Server As Monitoring Server</h4><ol><li><strong>Installing Zabbix Server 7.0</strong></li></ol><p><strong>&gt; Adding Zabbix Repository</strong></p><p>Add the Zabbix repository to the system and refresh the package index.</p><pre>wget https://repo.zabbix.com/zabbix/7.0/ubuntu/pool/main/z/zabbix-release/zabbix-release_7.0-2+ubuntu24.04_all.deb<br>dpkg -i zabbix-release_7.0-2+ubuntu24.04_all.deb<br>apt update</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tG8p2k88LVIW77MRL_LxKg.png" /></figure><p><strong>&gt; Installing Zabbix Packages And PostgreSQL Client</strong></p><blockquote><strong>Installing Zabbix Server, Frontend, Agent.</strong></blockquote><pre>apt -y install zabbix-server-pgsql zabbix-frontend-php php8.3-pgsql zabbix-nginx-conf zabbix-sql-scripts zabbix-agent</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5fgrnmo_vSUvVT-xqtTAqg.png" /></figure><blockquote><strong>Installing PostgreSQL Client On Zabbix Server.</strong></blockquote><pre>apt install -y postgresql-client-common postgresql-client-16</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ae92N9qId-gfN5XCH2_ykg.png" /></figure><p><strong>&gt; Importing The Zabbix Database Schema Into PostgreSQL</strong></p><blockquote><strong>Import 
Zabbix Database From Script To Database Server</strong></blockquote><p>Load the initial schema and data into the remote database using the SQL script shipped with the zabbix-sql-scripts package. Replace ip_server with the address of the PostgreSQL server.</p><pre>zcat /usr/share/zabbix-sql-scripts/postgresql/server.sql.gz | psql -h ip_server -U zabbix -d zabbix</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1013/1*lRHwr8YYCXi9Stqad510uQ.png" /></figure><blockquote><strong>Checking Zabbix database content</strong></blockquote><p>After the import, connect to the database and list its tables to confirm that the Zabbix tables were created. Here, 192.168.11.99 is the database server address used in this lab.</p><pre>psql -h 192.168.11.99 -U zabbix -d zabbix<br>\c zabbix<br>\dt</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/685/1*7jvdGDSS_G2xa_7k-lmYPQ.png" /></figure><p><strong>2. Configuring Zabbix Server &amp; Connecting It To PostgreSQL</strong></p><p><strong>&gt; Connecting Zabbix To PostgreSQL</strong></p><p>Configure the Zabbix server to connect to the database.</p><pre>nano /etc/zabbix/zabbix_server.conf<br>---<br>DBHost=ip_server<br>DBName=zabbix<br>DBUser=zabbix<br>DBPassword=password</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/962/1*Xu1hmfWny1aIYKJTlIP9pg.png" /></figure><p><strong>&gt; Configuring NGINX For Zabbix Dashboard</strong></p><p>Remove the default nginx site and symlink the nginx configuration shipped with Zabbix into the sites-enabled directory.</p><pre>cd /etc/nginx/sites-enabled<br>rm default<br>ln -s /etc/zabbix/nginx.conf zabbix.conf</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/688/1*ebQLbSSgo4WZdC6enWIj1w.png" /></figure><p><strong>&gt; Starting Zabbix Server Services</strong></p><p>After configuring, restart the Zabbix, nginx, and PHP-FPM services and enable them to start at boot.</p><pre>systemctl restart zabbix-server zabbix-agent nginx php8.3-fpm<br>systemctl enable zabbix-server zabbix-agent nginx php8.3-fpm</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KKE-HTFBffxElQyuvcdzNg.png" 
/></figure><p><strong>&gt; Accessing &amp; Configuring Zabbix From The Web Interface</strong></p><blockquote><strong>Configuring Zabbix Server</strong></blockquote><p>After restarting the services, open the Zabbix web interface in a browser and complete the setup wizard.</p><pre>http://your_ip_server</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*C5mptUZ59cH5S4mFRimn1g.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5DTbQ0EfhNFRKTV9xx2ByA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Q3gJemRHWNOVHCwRfrnpnA.png" /></figure><blockquote><strong>Accessing Zabbix Server</strong></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GrHwKSiHgfMKRUA7a7nTmw.png" /><figcaption>Default login: username Admin, password zabbix</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GvXlCxVeXecR8ht53mK5xQ.png" /></figure><h4>Author:</h4><p><a href="https://www.linkedin.com/in/muhammad-huda-fiqri-b171b5208/">Muhammad Huda Fiqri, PT. Boer Technology</a></p><h4>Reference:</h4><p><a href="https://www.zabbix.com/download?zabbix=6.4&amp;os_distribution=ubuntu&amp;os_version=24.04&amp;components=server_frontend_agent&amp;db=pgsql&amp;ws=nginx">Zabbix Server Installation Documentation</a><br><a href="https://www.postgresql.org/download/linux/ubuntu/">PostgreSQL Installation Documentation</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Enhancing Kubernetes Observability with Pixie]]></title>
            <link>https://medium.com/@btech-engineering/enhancing-kubernetes-observability-with-pixie-80cac8902418?source=rss-587830e9864f------2</link>
            <guid isPermaLink="false">https://medium.com/p/80cac8902418</guid>
            <dc:creator><![CDATA[Btech Engineering]]></dc:creator>
            <pubDate>Wed, 17 Jul 2024 10:04:43 GMT</pubDate>
            <atom:updated>2024-07-17T10:04:43.417Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/213/1*ldM1wUftyOlDktSBVvKwEw.png" /></figure><h3>Introduction</h3><p>In the dynamic world of Kubernetes, observability is crucial. Pixie, an open-source observability tool, lets developers monitor their Kubernetes applications without changing application code.</p><p>Leveraging eBPF, Pixie captures telemetry data automatically, eliminating the need for manual instrumentation. This post covers Pixie’s features, components, advantages, and setup to help you get started.</p><h4>What is Pixie?</h4><p>Pixie provides real-time visibility into your Kubernetes clusters. It covers everything from high-level overviews of service maps and application traffic to detailed insights like pod states and flame graphs. The core concept involves Kubernetes clusters sending metrics via Vizier to Pixie Cloud, which processes the data through the Pixie API and generates dashboard views.</p><h4>Key Components</h4><p><strong>Pixie Edge Module (PEM): </strong>Pixie’s agent, installed per node, uses eBPF to collect data, which is stored locally on the node.</p><p><strong>Vizier: </strong>Installed per cluster, Vizier is responsible for query execution and managing PEMs.</p><p><strong>Pixie Cloud: </strong>Handles user management, authentication, and data proxying.</p><p><strong>Pixie CLI: </strong>Deploys Pixie, runs queries, and manages resources like API keys.</p><p><strong>Pixie Client API: </strong>Provides programmatic access to Pixie for integrations, Slackbots, and custom user logic that require Pixie data.</p><h4>Pixie vs. 
Other Cloud-Native Monitoring Tools</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/961/1*DbahnOSQE2RLmDTpBTYUiQ.png" /></figure><h3>Setting Up Pixie Server</h3><h4>Hardware Requirements</h4><ul><li>CPU: x86-64 architecture, &gt;4 cores</li><li>Memory: 16 GB for workers</li></ul><p>Note: Pixie Cloud deploys Elastic, which by itself is configured with 700m vCPU and 8 GB RAM (not counting other pods).</p><h4>Software Prerequisites</h4><ul><li>Kubernetes Cluster v1.21+</li><li>MetalLB</li><li>Default StorageClass</li><li>mkcert</li><li>kustomize</li></ul><h4>Step 1: Clone Pixie Repository</h4><pre>git clone https://github.com/pixie-io/pixie.git<br>cd pixie</pre><h4>Step 2: Get the Latest Production-Ready Release Tag</h4><pre>export LATEST_CLOUD_RELEASE=$(git tag | grep &#39;release/cloud&#39; | sort -r | head -n 1 | awk -F/ &#39;{print $NF}&#39;)</pre><h4>Step 3: Update Image Version in Kustomization</h4><pre>perl -pi -e &quot;s|newTag: latest|newTag: \&quot;${LATEST_CLOUD_RELEASE}\&quot;|g&quot; k8s/cloud/public/kustomization.yaml</pre><h4>Step 4 (Optional): Change the Default Domain and Use a CA Certificate</h4><p>Modify the following files if you wish to change the default domain or use a CA-signed certificate:</p><ul><li>k8s/cloud/public/proxy_envoy.yaml</li><li>k8s/cloud/public/domain_config.yaml</li><li>scripts/create_cloud_secrets.sh</li></ul><p>Install the mkcert root CA into the local trust store:</p><pre>mkcert -install</pre><h4>Step 5: Create Namespace</h4><pre>kubectl create namespace plc</pre><h4>Step 6: Create Secrets File</h4><pre>./scripts/create_cloud_secrets.sh</pre><h4>Step 7: Deploy Elastic and Postgres</h4><pre>kustomize build k8s/cloud_deps/base/elastic/operator | kubectl apply -f -</pre><pre>kustomize build k8s/cloud_deps/public | kubectl apply -f -</pre><h4>Step 8: Deploy Pixie Labs</h4><pre>kustomize build k8s/cloud/public/ | kubectl apply -f -</pre><h4>Step 9: Check Pods in plc Namespace</h4><pre>kubectl -n plc get pods</pre><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/459/1*A0m6bAN86jJ7mWlZZCOP1Q.png" /><figcaption>Result</figcaption></figure><h4>Step 10: Check the External IP of the Service and Set Up Hosts</h4><pre>kubectl -n plc get svc</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*chwYbVRKeoXxP0-J95Xtrg.png" /></figure><p>Add the IP and domain to `/etc/hosts`:</p><pre>192.168.x.x dev.withpixie.dev work.dev.withpixie.dev</pre><h4>Step 11: Access the Pixie Site</h4><p>Open `work.dev.withpixie.dev` or your custom domain.</p><h4>Step 12: Log In</h4><p>Use the default credentials: username <strong>admin@default.com</strong> and password <strong>admin</strong>.</p><p>Note: Refer to <a href="https://docs.px.dev/installing-pixie/install-guides/self-hosted-pixie/">Pixie Documentation</a> for more details.</p><h3>Setting Up Pixie Client</h3><h4>Hardware Requirements</h4><ul><li>Free Memory: ≥ 6 GB RAM</li></ul><h4>Software Prerequisites</h4><ul><li>Kubernetes Cluster</li><li>Pixie Server in plc namespace</li></ul><h4>Step 1: Install Pixie CLI</h4><pre>bash -c &quot;$(curl -fsSL https://work.dev.withpixie.dev/install.sh)&quot;</pre><h4>Step 2: Set PL_CLOUD_ADDR</h4><pre>export PL_CLOUD_ADDR=dev.withpixie.dev</pre><h4>Step 3: Authenticate Pixie Client</h4><pre>px auth login</pre><h4>Step 4: Deploy Pixie Client</h4><pre>px deploy --dev_cloud_namespace plc --pem_memory_limit=1Gi</pre><p><strong>Note</strong>: The minimum <strong>pem_memory_limit</strong> is 1Gi.</p><h4>Step 5: Open Pixie Website</h4><p>Once all pods are running, the data is displayed and updated automatically.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*il6P78zDLNukvBi_sYj08g.png" /></figure><h3>Monitoring</h3><h4>Network Monitoring</h4><p>With Pixie, you can monitor network traffic using the `px/net_flow_graph` script. 
This script displays connectivity between pods and clusters, showing data sent and received.</p><p>Key Metrics:</p><ul><li><strong>FROM_ENTITY</strong>: Source pod</li><li><strong>TO_ENTITY</strong>: Target pod/service</li><li><strong>BYTES_SENT</strong>: Total data sent</li><li><strong>BYTES_RECV</strong>: Total data received</li><li><strong>BYTES_TOTAL</strong>: Sum of sent and received data</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dkWKe7tZeT-_czE8xwPoag.png" /></figure><h4>Infrastructure Health Monitoring</h4><p>Monitor resource usage of nodes and pods using the `px/nodes` script. This script provides insights into CPU usage, network traffic, and data traffic.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QDJTR_CRMVKklCIT0iRCBA.png" /></figure><h4>Detailed Node Monitoring</h4><p>You can view detailed usage grouped by pod, namespace, or service by clicking on a node.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ErzcI1BxsfpO6w11TNem8Q.png" /></figure><h4>Service Performance Monitoring</h4><p>Monitor service performance using the `px/service` script. It displays HTTP performance metrics and traffic data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3Y73OJQOU-3MPMfORS0Vnw.png" /></figure><h4>Traffic Insights</h4><p>The script also shows the traffic sent to the service and the request paths.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ldQh6hQVLsIU9UHZbT63FQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*O5xQI5IEU8q6rB96srNwpQ.png" /></figure><h4>Database Query Profiling</h4><p>Pixie supports database performance monitoring for MySQL and PostgreSQL. 
Use the `px/mysql_stats` script for database usage graphs and `px/mysql_data` for query monitoring.</p><p>Key Metrics:</p><ul><li>Source and destination of database calls</li><li>Executed queries</li><li>Query performance</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lBLxD9HxEWxMUE3WnQgUkQ.png" /><figcaption>Example of Monitoring Database MySQL</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*a3vlIV6u22MwNpLY84Lzbw.png" /><figcaption>Example of Monitoring Database MySQL for each query</figcaption></figure><h4>Request Tracing</h4><p>Pixie can trace HTTP requests using the `px/http_data_filtered` script. This script monitors request speed, data size, and latency based on the requested path.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vfns-34TNmjTg0cTaNkLMg.png" /><figcaption>Example of Request Tracing HTTP on sock-shop/catalogue service</figcaption></figure><h3>Conclusion</h3><p>Pixie offers a comprehensive solution for Kubernetes observability, providing detailed insights into your clusters’ performance and health. With its easy deployment, powerful monitoring capabilities, and user-friendly interface, Pixie is an invaluable tool for any Kubernetes environment.</p><p>Explore Pixie and enhance your Kubernetes observability today! For more information, visit the <a href="https://docs.px.dev/">Pixie documentation</a>.</p><h3>Author :</h3><p>Farih Nazihullah, DevSecOps Specialist | <a href="https://www.linkedin.com/in/farihnazihullah/"><strong>LinkedIn</strong></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=80cac8902418" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Choosing the Right Machine Type on Google Cloud Platform (GCP): Comparison, Benchmark, and Cost]]></title>
            <link>https://medium.com/@btech-engineering/choosing-the-right-machine-type-on-google-cloud-platform-gcp-comparison-benchmark-and-cost-21c08fe732a4?source=rss-587830e9864f------2</link>
            <guid isPermaLink="false">https://medium.com/p/21c08fe732a4</guid>
            <category><![CDATA[gcp]]></category>
            <category><![CDATA[cloud-benchmark]]></category>
            <category><![CDATA[vm]]></category>
            <category><![CDATA[public-cloud]]></category>
            <category><![CDATA[google-cloud-platform]]></category>
            <dc:creator><![CDATA[Btech Engineering]]></dc:creator>
            <pubDate>Fri, 12 Jul 2024 03:47:05 GMT</pubDate>
            <atom:updated>2024-07-12T03:47:05.303Z</atom:updated>
            <content:encoded><![CDATA[<p>In this fast-paced digital era, choosing the right cloud infrastructure is an important decision for every organization. Google Cloud Platform (GCP) offers various machine types that can be customized to suit your workload needs. However, with so many options available, how can you choose the most suitable machine type?</p><p>This article discusses three main aspects of choosing a machine type on GCP: comparison, benchmark, and cost. Through comparisons, we will see the key differences between the various machine types offered by GCP. Benchmarking will help us understand the performance of each machine type based on specific workloads. Lastly, cost analysis will allow us to consider budget efficiency in the long run.</p><p>By understanding these three factors, you will have a comprehensive guide for selecting machine types on GCP and for optimizing performance and cost for your organization’s specific needs. Let’s start by looking at the comparisons between the machine types available on GCP.</p><h3>Generation Table</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/969/0*JIuHFIUrf-vzXgNL.png" /></figure><h3>2nd Generation (N2, N2D, C2, C2D)</h3><h4><strong>N2 (Intel Cascade Lake)</strong></h4><ul><li>The N2 series is suitable for workloads that utilize high clock frequencies, providing higher performance per thread. 
This benefits applications that require high responsiveness and fast computing.</li><li>Designed to deliver high CPU performance.</li><li>Ideal for workloads that require maximum computing power, such as high-traffic web applications, game servers, or complex data analysis.</li></ul><h4><strong>N2D (AMD EPYC Rome)</strong></h4><ul><li>Designed to provide a more affordable option without significantly compromising performance.</li><li>Suitable for workloads such as application development, testing, or development environments that do not require maximum CPU power.</li></ul><h4><strong>C2 (Intel Cascade Lake)</strong></h4><ul><li>C2 VMs run on 2nd generation Intel Xeon Scalable processors (Cascade Lake) that offer a sustained single-core maximum turbo frequency of up to 3.9 GHz. C2 offers VMs with 4 to 60 vCPUs and 4 GB of memory per vCPU.</li><li>The C2 machine series provides full transparency into the underlying server platform architecture, so you can tune performance. The machines in this series offer significantly more computing power and are generally more robust for compute-intensive workloads compared to the N1 high-CPU machines.</li><li>More suitable for applications that require high turbo frequency and good responsiveness. Ideal for compute-intensive workloads that require large computing power.</li></ul><h4><strong>C2D (AMD EPYC Milan)</strong></h4><ul><li>C2D VMs run on 3rd generation AMD EPYC Milan processors and offer increased frequency up to 3.5 GHz. C2D VMs are flexible in size, from 2 to 112 vCPUs and 2 to 8 GB of memory per vCPU.</li><li>The C2D series of machines provides the largest VM sizes and is best suited for high-performance computing (HPC). The C2D series also has the largest last-level cache (LLC) per core.</li><li>More suitable for high-performance computing (HPC) and applications that require multiple vCPUs with larger VM sizes and configuration flexibility. 
Supports larger memory per vCPU and has the largest last-level cache (LLC) per core.</li></ul><h3><strong>Performance</strong></h3><ul><li>N2 and C2 (Intel): Provide strong single-thread performance, suitable for applications that require fast computation and high responsiveness.</li><li>N2D and C2D (AMD): Provide better multi-threaded performance with more vCPUs and memory, ideal for applications with high CPU intensity and high-performance computing workloads.</li></ul><h3><strong>3rd Generation (C3, C3D)</strong></h3><h4><strong>C3 (Intel Sapphire Rapids)</strong></h4><ul><li>C3 VMs are powered by 4th generation Intel Xeon Scalable processors (codenamed Sapphire Rapids), DDR5 memory, and Titanium. The C3 machine type is optimized for the underlying NUMA architecture to provide optimal, reliable, and consistent performance.</li></ul><h4><strong>C3D (AMD EPYC Genoa)</strong></h4><ul><li>C3D VMs are powered by 4th generation AMD EPYC™ (Genoa) processors with a maximum frequency of 3.7 GHz. The C3D machine type is optimized for the underlying hardware architecture to deliver optimal, reliable, and consistent performance.</li><li>The high-CPU configuration offers the lowest price per performance for compute-bound workloads that do not require large amounts of memory.</li></ul><h3>Benchmark Comparison Table</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/940/1*eMo15-BnsE1jVwNUSHkO-Q.png" /></figure><h3>Budget Recommendation</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/939/1*HzuMeKHsZrahA4cggVEJ4Q.png" /></figure><h3>Summary</h3><ul><li>The C series has limitations in terms of regional availability and does not support custom specifications.</li><li><strong>Best Performance CPU</strong>: For applications that require high-performance CPUs, the C3D series with 4th generation AMD EPYC™ (Genoa) processors offers optimal performance at the best price-per-performance for workloads that require a lot of CPU and little memory.</li><li><strong>Web Server 
Applications</strong>: If your web server application is more CPU-intensive and requires high responsiveness, choose the C3 (Intel Sapphire Rapids) or C3D (AMD Genoa) series VM. However, if you need cost efficiency with a high number of vCPUs, choose N2D (AMD EPYC Rome).</li></ul><h3><strong>Author</strong></h3><ul><li><a href="https://www.linkedin.com/in/adelia-nurlina/">Adelia Nurlina Putri — PT.Boer Technology</a></li><li><a href="https://www.linkedin.com/in/ajie-fauhad/">Ajie Fauhad Fadhlullah — PT.Boer Technology</a></li><li><a href="https://www.linkedin.com/in/mahbubi-hamdani/">Mahbubi Hamdani — PT.Boer Technology</a></li></ul><h3>Reference</h3><ul><li><a href="https://dev.to/dkechag/google-cloud-c3d-review-record-breaking-performance-with-epyc-genoa-g13">https://dev.to/dkechag/google-cloud-c3d-review-record-breaking-performance-with-epyc-genoa-g13</a></li><li><a href="https://gcloud-compute.com/grid.html?platform=amd">https://gcloud-compute.com/grid.html?platform=amd</a></li><li><a href="https://cloud.google.com/compute/docs/machine-resource#machine_type_comparison">https://cloud.google.com/compute/docs/machine-resource#machine_type_comparison</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=21c08fe732a4" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unlocking the Future of Container Management: Discover the Power of RKE2]]></title>
            <link>https://medium.com/@btech-engineering/unlocking-the-future-of-container-management-discover-the-power-of-rke2-c3e5aa4b6b88?source=rss-587830e9864f------2</link>
            <guid isPermaLink="false">https://medium.com/p/c3e5aa4b6b88</guid>
            <category><![CDATA[rke2]]></category>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[rancher]]></category>
            <dc:creator><![CDATA[Btech Engineering]]></dc:creator>
            <pubDate>Mon, 08 Jul 2024 08:43:23 GMT</pubDate>
            <atom:updated>2024-07-09T02:10:34.899Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/396/0*_2Xuhkfcq24s1N0T.png" /></figure><p>Have you ever explored tools beyond traditional Kubernetes for managing your containers? Meet <strong>RKE2</strong>, a Kubernetes distribution for orchestrating containerized applications.</p><p>RKE2, also known as RKE Government or Rancher Kubernetes Engine 2, brings enhanced security, simplicity, and efficiency to your container management needs.</p><h3>Comparison with RKE1 &amp; K3s</h3><p>RKE2 combines the best of both worlds from the 1.x version of RKE (hereafter referred to as RKE1) and K3s.</p><p>From K3s, it inherits the usability, ease-of-operations, and deployment model.</p><p>From RKE1, it inherits close alignment with upstream Kubernetes. In some places, K3s has diverged from upstream Kubernetes to optimize for edge deployments, but RKE1 and RKE2 can stay closely aligned with upstream.</p><p>Importantly, RKE2 does not rely on Docker as RKE1 does. RKE1 leveraged Docker for deploying and managing the control plane components and as the container runtime for Kubernetes. RKE2 launches control plane components as static pods, managed by the kubelet. 
The embedded container runtime is containerd.</p><h3>Key features of RKE2</h3><ul><li>Ease-of-operations</li><li>Data Center Optimization</li><li>Scalability</li></ul><h3>Table of Contents</h3><ul><li>Topology of Lab</li><li>Deploying new RKE2 Cluster</li><li>Accessing &amp; Knowing more</li><li>Managing RKE2 Certificate</li><li>Integrating with external storage (NFS)</li><li>Create a sample container app</li><li>Backup &amp; Restore Cluster</li><li>Rancher UI &amp; How it Works with RKE2</li></ul><p>Let&#39;s get into the Lab!</p><h3>Topology of Lab</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/990/0*-AuCARr2D3cn2ej5.png" /></figure><p>The topology consists of three RKE2 Server nodes and one NFS server, which we will use later for the external storage scenario. Why three RKE2 Servers? RKE2 has two node types, Server &amp; Agent. An RKE2 Server acts as a controller/master node, while an RKE2 Agent acts as a worker node. In today’s lab, I’ll show how easily a highly available Kubernetes topology can be deployed with RKE2, in line with its key feature, “ease of operations”. Unfortunately, we won’t test how it handles failures here; maybe in another topic.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/704/1*DDt9nmsQBFuYHEOw9DrpVQ.png" /></figure><h3>Deploying New RKE2 Cluster</h3><h4>Pre-Installation</h4><ol><li>Map the IP addresses to domain names</li></ol><pre>cat &lt;&lt; EOF &gt;&gt; /etc/hosts<br>172.16.90.10 rke2-node1 rke2-node1.btech.id<br>172.16.90.20 rke2-node2 rke2-node2.btech.id<br>172.16.90.30 rke2-node3 rke2-node3.btech.id<br>EOF</pre><p>2. Make sure each node can reach the others over passwordless SSH</p><pre>#run this on every node to generate a key pair<br>ssh-keygen -t rsa</pre><pre>#run these on every node to copy its public key to all nodes<br>ssh-copy-id rke2-node1<br>ssh-copy-id rke2-node2<br>ssh-copy-id rke2-node3</pre><p>3. 
Disable AppArmor, firewalld &amp; swap to avoid conflicts and incompatibilities while installing RKE2</p><pre>systemctl disable apparmor.service<br>systemctl disable firewalld.service<br>systemctl stop apparmor.service<br>systemctl stop firewalld.service</pre><pre>systemctl disable swap.target<br>swapoff -a</pre><h4>Installing &amp; Configuring RKE2 Server</h4><ol><li>Download &amp; install RKE2 Server</li></ol><pre>curl -sfL https://get.rke2.io | sh -</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*VhLRZp-boQL1uOgM.png" /></figure><p>2. Configure RKE2 Server on the first node so that the other nodes can join the cluster later.</p><pre>mkdir -p /etc/rancher/rke2<br>nano /etc/rancher/rke2/config.yaml</pre><pre>---<br>token: my-shared-secret-token<br>tls-san:<br>  - rke2-node1.btech.id<br>  - rke2-node2.btech.id<br>  - rke2-node3.btech.id</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*5JiO8XcCW0PKZaxG.png" /></figure><p>3. Enable &amp; verify the service on the first node. It takes about 2 minutes to bring the service up, depending on your server specification.</p><pre>systemctl enable --now rke2-server.service<br>journalctl -xe</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*MDUrwh78vYGlaN0O.png" /></figure><p>4. Verify the cluster nodes. Only the first node appears, because we haven’t yet configured the second &amp; third nodes to join the cluster. We’ll do that in the next step.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*iBD63Oa_9zXiBqhS.png" /></figure><p>5. 
Configure the second &amp; third nodes to join the cluster, then verify. Install RKE2 on each of them as in step 1 before editing the configuration, and enable rke2-server.service afterwards as in step 3.</p><pre>mkdir -p /etc/rancher/rke2<br>nano /etc/rancher/rke2/config.yaml</pre><pre>server: <a href="https://rke2-node1.btech.id:9345">https://rke2-node1.btech.id:9345</a><br>token: my-shared-secret-token<br>tls-san:<br>  - rke2-node1.btech.id<br>  - rke2-node2.btech.id<br>  - rke2-node3.btech.id</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*DZeIr-04AOXzrmlR.png" /></figure><p>6. After configuring the second &amp; third servers, we can verify the members of the RKE2 cluster again. See the difference?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*oGnlsOBamtrgrrWi.png" /></figure><h3>Accessing &amp; Knowing more</h3><h4>Accessing the cluster</h4><p>After installation, you can access the RKE2 cluster by exporting the KUBECONFIG environment variable. The kubeconfig is stored in /etc/rancher/rke2/rke2.yaml by default. kubectl is also bundled with RKE2, so you only need to copy the binary to a directory on your PATH.</p><pre>#copy the kubectl binary to an executable path<br>cp /var/lib/rancher/rke2/bin/kubectl /usr/local/bin</pre><pre>#add the environment variables kubectl needs to connect to the RKE2 cluster, and put the rke2 binaries on PATH</pre><pre>export PATH=$PATH:/opt/rke2/bin:/var/lib/rancher/rke2/bin<br>export KUBECONFIG=/etc/rancher/rke2/rke2.yaml</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*M8mWgj8TIjayiG5g.png" /></figure><ol><li>Verify cluster component status &amp; client version</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*UTXWc1h4dJCE2-aG.png" /></figure><p>2. 
By default, the ingress controller &amp; network component are installed with the RKE2 server, so you don’t have to install them separately. You can see that the installed network component is Canal. Put simply, it combines two CNI networking projects (Flannel &amp; Calico), making it more powerful!</p><pre>kubectl get pods -A | grep ingress<br>kubectl get pods -A | grep canal</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*_rOjQ0UfIdWpYEdn.png" /></figure><h4>Managing RKE2 Certificate</h4><ol><li>You can check the certificate expiration time &amp; also rotate it manually</li></ol><pre>#for checking<br>rke2 certificate check</pre><p>2. Before renewal, the certificate expires on 25 June 2025, 14:05 WIB (after converting to UTC+7)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*PbaMnQWJPDpNAZ9S.png" /></figure><p>3. Let&#39;s try to rotate/renew</p><pre>#execute on all nodes<br>#for renewing<br>systemctl stop rke2-server <br>rke2 certificate rotate<br>systemctl start rke2-server</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nQozB_lMW2z_hdlU.png" /></figure><p>4. Verification: as you can see in the picture below, the certificate now expires on 30 June 2025, 21:00 WIB (after converting to UTC+7)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*a-KtEWzeNNnJsh4n.png" /></figure><h3>Integrating with external storage (NFS)</h3><ol><li>Install nfs-server on the NFS server node</li></ol><pre>apt -y install nfs-kernel-server</pre><p>2. Create the shared folder</p><pre>mkdir -p /data/nfs-share<br>chmod 777 /data/nfs-share</pre><p>3. Configure nfs-server to share the folder</p><pre>nano /etc/exports<br>---<br>/data/nfs-share 172.16.90.0/24(rw,sync,no_subtree_check,no_root_squash)</pre><p>4. Export the share and restart the service</p><pre>sudo exportfs -av<br>systemctl restart nfs-server<br>systemctl status nfs-server</pre><p>5. 
Then test mounting the share &amp; creating a file from one of the nodes (rke2-node-1)</p><pre>apt install nfs-common -y<br>mkdir -p /mnt/nfs_clientshare<br>sudo mount -t nfs 192.168.100.10:/data/nfs-share /mnt/nfs_clientshare<br>touch /mnt/nfs_clientshare/index.php</pre><p>6. Once NFS is successfully configured, let&#39;s deploy the nfs provisioner from the controller node (rke2-node-1), which lets your applications use the external NFS storage easily.</p><pre>#make sure to install the nfs client on all kubernetes nodes<br>apt install nfs-common -y<br># install helm first if you have not done so before<br>curl https://baltocdn.com/helm/signing.asc | sudo apt-key add -<br>sudo apt-get install apt-transport-https --yes<br>echo &quot;deb https://baltocdn.com/helm/stable/debian/ all main&quot; | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list<br>sudo apt-get update<br>sudo apt-get install helm</pre><p>7. Deploy the nfs provisioner using helm</p><pre>#add the chart repository first<br>helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/<br>helm repo update<br>helm install nfs-cluster-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \<br>    --set nfs.server=192.168.100.10 \<br>    --set nfs.path=/data/nfs-share \<br>    --set storageClass.name=nfs-cluster-client \<br>    --set storageClass.provisionerName=cluster.local/nfs-cluster-provisioner</pre><p>8. Make sure the nfs provisioner pods are running (here we are using the default namespace, but a dedicated namespace is better)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*X-yrBvMpjdf-IsH6.png" /></figure><p>9. Create a PVC (Persistent Volume Claim)</p><pre># pvc<br>apiVersion: v1<br>kind: PersistentVolumeClaim<br>metadata:<br>  name: sample-nfs-pvc<br>spec:<br>  accessModes:<br>    - ReadWriteOnce<br>  storageClassName: nfs-cluster-client<br>  resources:<br>    requests:<br>      storage: 10Gi</pre><p>10. Apply it &amp; verify. As the picture below shows, the Persistent Volume is created automatically. That is the advantage of using the nfs provisioner: you don’t have to define a PV first. 
Also, in our shared nfs folder there is now a folder whose name starts with default- (the namespace of the PVC). This folder stores the files created by our pods when they consume the NFS share volume.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*3-XrkBteWiki7mqv.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*WUMw-XASEr0uq3CQ.png" /></figure><p>That’s all for the nfs provisioner. Next, we will create an application to see whether the nfs-server works with it.</p><h3>Create a sample container app</h3><ol><li>Create a deployment that will consume a volume from the external NFS server.</li></ol><pre># deployment<br><br>apiVersion: apps/v1<br>kind: Deployment<br>metadata:<br>  labels:<br>    app: web-nginx<br>  name: nfs-nginx<br>spec:<br>  replicas: 3<br>  selector:<br>    matchLabels:<br>      app: web-nginx<br>  template:<br>    metadata:<br>      labels:<br>        app: web-nginx<br>    spec:<br>      volumes:<br>        - name: nfs-nginx<br>          persistentVolumeClaim:<br>            claimName: sample-nfs-pvc<br>      containers:<br>        - image: nginx<br>          name: nginx<br>          volumeMounts:<br>            - name: nfs-nginx<br>              mountPath: /usr/share/nginx/html</pre><p>2. Create the service</p><pre>apiVersion: v1<br>kind: Service<br>metadata:<br>  name: nginx-service<br>spec:<br>  selector:<br>    app: web-nginx<br>  ports:<br>    - protocol: TCP<br>      port: 80<br>      targetPort: 80</pre><p>3. 
Create an ingress to expose the service outside the cluster using the domain.</p><pre>apiVersion: networking.k8s.io/v1<br>kind: Ingress<br>metadata:<br>  name: sample-ingress<br>spec:<br>  ingressClassName: nginx<br>  rules:<br>    - host: rke2-test.btech.id<br>      http:<br>        paths:<br>          - path: /<br>            pathType: Prefix<br>            backend:<br>              service:<br>                name: nginx-service<br>                port:<br>                  number: 80</pre><p>4. For verification, let’s get into one of the nginx containers we created before, then create a file in <strong>/usr/share/nginx/html</strong> to see whether it shows up in the shared folder on the NFS server.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*XNALrBQ6ljU-Zo9W.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*OOtEnLQ4Ty2N2X8u.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1008/0*smfCiAjz0sl8Jmtv.png" /></figure><p>That’s all for testing our application, which consumes the external nfs-server shared filesystem. It works as expected, so we can change the content in the nfs share folder dynamically without disturbing the containers.</p><h3>Backup &amp; Restore Cluster</h3><p>Rancher provides functionality to back up our cluster, allowing us to restore it in the event of an incident. So, let&#39;s get into it!</p><h4>Prerequisite</h4><ul><li>The new cluster is similar to the old cluster.</li><li>The new nodes are prepared (SSH keys shared, system updated, new domains mapped in /etc/hosts)</li></ul><ol><li>We have 3 new servers similar to the old ones, and here are their domains.</li></ol><pre>#from old node<br>172.16.90.10 new-rke2-node-1.btech.id new-rke2-test.btech.id<br>172.16.90.20 new-rke2-node-2.btech.id <br>172.16.90.30 new-rke2-node-3.btech.id</pre><p>2. 
To back up our old cluster, these are the files and folders we have to save.</p><pre>- /var/lib/rancher/rke2/server/cred<br>- /var/lib/rancher/rke2/server/tls<br>- /var/lib/rancher/rke2/server/token<br>- /etc/rancher/*<br>- Snapshot (/var/lib/rancher/rke2/server/db/snapshots)</pre><p>3. Before backing up those folders, let’s create a snapshot.</p><pre>rke2 etcd-snapshot save --name backup-restore-rke2-testing</pre><pre>mkdir -p /root/backup</pre><pre>cp /var/lib/rancher/rke2/server/db/snapshots/backup-restore-rke2-testing-rke2-node-1-1720013739 /root/backup/<br>cp /var/lib/rancher/rke2/server/token /root/backup/<br>cp -r /etc/rancher /root/backup/<br>cp -r /var/lib/rancher/rke2/server/cred /root/backup/<br>cp -r /var/lib/rancher/rke2/server/tls /root/backup/</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*28B_VamTmC-NelkW.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ik8SX2dhuY4yzqNR.png" /></figure><p>4. Then transfer /root/backup from the old node to the new node</p><pre>#from old node<br>scp -r /root/backup root@172.16.90.11:/root/</pre><p>5. Then restore the configuration files we backed up before</p><pre>#from new node<br>mkdir -p /var/lib/rancher/rke2/server<br>mkdir -p /etc/rancher</pre><pre>cp -r /root/backup/rancher/* /etc/rancher/<br>cp -r /root/backup/token /var/lib/rancher/rke2/server/token<br>cp -r /root/backup/cred /var/lib/rancher/rke2/server/<br>cp -r /root/backup/tls /var/lib/rancher/rke2/server/</pre><p>6. Next, we will install the rke2 cluster using the old backup. First, on new-rke2-node-1, update the configuration and install rke2 server. The token must match the one used by the old cluster.</p><pre>nano /etc/rancher/rke2/config.yaml<br>---<br>token: my-shared-secret-token<br>tls-san:<br>  - new-rke2-node-1.btech.id<br>  - new-rke2-node-2.btech.id<br>  - new-rke2-node-3.btech.id<br>---<br>#save &amp; exit -&gt;<br><br>curl -sfL https://get.rke2.io | sh -</pre><p>7. 
Stop the rke2 server &amp; restore the backup data.</p><pre>systemctl stop rke2-server<br>rke2 server \<br>  --cluster-reset \<br>  --cluster-reset-restore-path=/root/backup/backup-restore-rke2-testing-rke2-node-1-1720013739</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9wXp08JNOHK8picH.png" /></figure><p>8. Start the service after the restoration is done</p><pre>systemctl start rke2-server</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*V7YCyD7B7RFtMOu2.png" /></figure><p>9. On the other new nodes, follow the instructions below:</p><pre>curl -sfL https://get.rke2.io | sh -</pre><pre>nano /etc/rancher/rke2/config.yaml<br>---<br>server: <a href="https://new-rke2-node-1.btech.id:9345">https://new-rke2-node-1.btech.id:9345</a><br>token: my-shared-secret-token<br>tls-san:<br>  - new-rke2-node-1.btech.id<br>  - new-rke2-node-2.btech.id<br>  - new-rke2-node-3.btech.id<br>  <br>systemctl enable --now rke2-server.service</pre><p>10. After the rke2 service is running on all new servers, verify the cluster using the following commands.</p><pre>cp /var/lib/rancher/rke2/bin/kubectl /usr/local/bin<br>export PATH=$PATH:/opt/rke2/bin:/var/lib/rancher/rke2/bin<br>export KUBECONFIG=/etc/rancher/rke2/rke2.yaml<br>kubectl get componentstatuses<br>kubectl get nodes</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*5xqCpdbzijfDnE6k.png" /></figure><p>11. Let&#39;s take a look at our application. You can see that the ingress still uses the old domain, so we have to change it.</p><pre>kubectl edit ingress sample-ingress</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/614/0*lTjk-wFs0sDaqTyf.png" /></figure><ul><li>before</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*CaFtSJVXjxpLXzWv.png" /></figure><ul><li>after</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8XLm23RAySCgELbQ.png" /></figure><p>12. Verify at the end. 
Perfect!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/962/0*H0knJ5MQjg6JC9n3.png" /></figure><p>So we have backed up and restored our RKE2 cluster, and as you can see, the application still works properly. We had to adjust a few things along the way, but this process gives us a way to recover our servers when an incident occurs.</p><h3>Rancher UI &amp; How it Works with RKE2</h3><p>Rancher is a comprehensive container management platform for production environments, enabling organizations to run Kubernetes clusters efficiently.</p><p>It simplifies Kubernetes deployment and management, meets IT requirements, and empowers DevOps teams. You can import your existing Kubernetes clusters, whether they run on-premises or at a cloud provider, and manage them in one platform with an intuitive user interface.</p><p>Let’s take a look at how it works with RKE2.</p><h4>Prerequisite</h4><ul><li>Kubernetes cluster</li><li>Ingress Controller</li><li>Helm tools</li><li>Cert manager (Installed in Kubernetes)</li></ul><ol><li>We will use the new cluster from the last restore, so install Helm first.</li></ol><pre>curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg &gt; /dev/null<br>sudo apt-get install apt-transport-https --yes<br>echo &quot;deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main&quot; | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list<br>sudo apt-get update<br>sudo apt-get install helm</pre><p>2. 
Install cert-manager</p><pre>#install cert manager<br><br># If you have installed the CRDs manually instead of with the `--set installCRDs=true` option added to your Helm install command, you should upgrade your CRD resources before upgrading the Helm chart:<br>kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/&lt;VERSION&gt;/cert-manager.crds.yaml<br><br># Add the Jetstack Helm repository<br>helm repo add jetstack https://charts.jetstack.io<br><br># Update your local Helm chart repository cache<br>helm repo update<br><br># Install the cert-manager Helm chart<br>helm install cert-manager jetstack/cert-manager \<br>  --namespace cert-manager \<br>  --create-namespace \<br>  --set installCRDs=true<br><br><br>kubectl get pods --namespace cert-manager</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*bDJ_D9aUy6N1HaU8.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ytubSWbGVZLqDfB3.png" /></figure><p>3. Deploy the Rancher management UI</p><pre># Add the Rancher stable Helm repository if you have not added it before<br>helm repo add rancher-stable https://releases.rancher.com/server-charts/stable<br>helm repo update<br><br>helm install rancher rancher-stable/rancher \<br>  --namespace cattle-system \<br>  --create-namespace \<br>  --set hostname=rancher-rke2-test.btech.id \<br>  --set bootstrapPassword=btechid<br><br>kubectl get pods -n cattle-system</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*7EuuBOQ8YrpYod5L.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/980/0*pO5ScQ22JwcTg9RZ.png" /></figure><p>4. Access the web dashboard:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*f7HFYzvkJAoHHDyk.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*BuVhY8JST5N-ou6i.png" /></figure><p>5. You can now manage your existing cluster and operate it easily from the UI.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*1T8wnTwAxYMWQ6y4.png" /></figure><p>6. But we only have one existing cluster. How do we import an external cluster? 
Go to the home dashboard -&gt; Import</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*p4VuLSCUxB3Oci1x.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*FkN98hAF45p4cHyC.png" /></figure><p>7. Give it a name. You can also specify which users are allowed to operate this cluster, but in this case we’ll leave the defaults.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wrxyvQSn0jGItdZw.png" /></figure><p>8. Then you’ll see the screen below. Copy the second command. The first command always fails for us with a certificate authority error because we’re using a self-signed certificate, so copy the second one and run it on the controller node of the cluster you are importing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*T994jghd6P6sNc6t.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Ygo0a9bDTY57vgFN.png" /></figure><p>9. Check the agent pod created in the cattle-system namespace. If you see a pod in an <strong>Error</strong> or <strong>CrashLoopBackOff</strong> state with the following error log, do the following steps.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*yMtYUpB8pHaKErX2.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xFgcA7XJy5WRAgLK.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*GyYocBgCt2TCBvOB.png" /></figure><p>10. Edit the deployment of the cattle agent</p><pre>kubectl edit deployment -n cattle-system cattle-cluster-agent</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/938/0*xl6EqryXC5FkYq2Z.png" /></figure><p>11. Save it and you’ll see it’s running now.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*DuclKnvoVIFpAsmz.png" /></figure><p>12. 
Return to the Rancher management UI, where you can now manage your imported cluster.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ebQLkjYztcAG-kbN.png" /></figure><p>With this guide, you can get started with Kubernetes on RKE2 and Rancher. You now have the tools to streamline your container operations, from setting up your cluster to integrating storage, managing certificates, and more. Dive into the world of Kubernetes with confidence, and enjoy the simplicity and power that Rancher brings to your DevOps journey.</p><p>Happy containerizing, folks!</p><p>See you in the next topic. Ciao!</p><p><a href="https://www.linkedin.com/in/marvin-saputra/"><strong>Marvin — PT. Boer Technology.</strong></a></p><h3><strong>Bonus</strong></h3><p>In March 2024, the Btech team embarked on an exciting journey with one of Indonesia’s prominent state-owned enterprises (BUMN). The mission was to revolutionize the enterprise’s development lifecycle. By implementing Rancher management and an RKE2 cluster, Btech laid the foundation for a more efficient and scalable infrastructure. This move is a significant step towards modernizing the company’s IT operations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gdtleahB4lqF_tg6gKtsIg.jpeg" /></figure><p>The Btech team is implementing a detailed plan to transition from a traditional development process to a modern DevSecOps environment using GitLab CI. This plan is not just a strategy; it represents a commitment to being agile, secure, and continuously innovative. 
The youthful and forward-thinking Btech team is leading this transformation with a clear vision and unwavering dedication to thriving in the constantly changing digital landscape.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Migrating Virtual Machines from VMware to AWS using AWS Application Migration Service]]></title>
            <link>https://medium.com/@btech-engineering/migrating-virtual-machines-from-vmware-to-aws-using-aws-application-migration-service-b5ffa75bcfa2?source=rss-587830e9864f------2</link>
            <guid isPermaLink="false">https://medium.com/p/b5ffa75bcfa2</guid>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[vm]]></category>
            <category><![CDATA[migrate]]></category>
            <dc:creator><![CDATA[Btech Engineering]]></dc:creator>
            <pubDate>Wed, 26 Jun 2024 02:58:47 GMT</pubDate>
            <atom:updated>2024-06-26T02:58:47.464Z</atom:updated>
            <content:encoded><![CDATA[<p>Many companies prioritize migrating virtual machine workloads from on-premises environments, such as VMware, to the cloud to take advantage of its benefits, including scalability, flexibility, and cost efficiency.</p><p>However, this migration process can be very complex and time-consuming. Fortunately, one solution is available: <strong>AWS Application Migration Service</strong>.</p><p><strong>AWS Application Migration Service (AWS MGN)</strong> is a highly automated lift-and-shift (rehost) solution that simplifies physical, virtual, and cloud migration to AWS without compatibility issues, performance disruption, or long cutover windows. AWS MGN replicates source servers to AWS using &quot;Agents&quot; that are installed directly on the VMs to be migrated.</p><h3><strong>Set Up the VM on VMware</strong></h3><h4><strong>Step 1 — VM to be migrated</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Uyj7HIpDF5CiqZWNFITIQg.png" /></figure><h4>Step 2 — Example data and service for migration verification</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4TSY753y0ferKaFhemMFiQ.png" /></figure><h3>Migrate VM to AWS</h3><h4>Step 1 — Go to AWS MGN</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Y6AgH8_w-a380IXGVORSTg.png" /></figure><h4>Step 2 — Create an IAM user with AWS MGN permissions</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xB3if8wNYrQoc-91mik9vg.png" /></figure><h4>Step 3 — Create an access key for the user that was created</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FttMVp9hnxTxabrd3V-V6Q.png" /></figure><h4>Step 4 — Add a source server in AWS MGN</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pRsK4INItY2_Knm_N-XHuA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zke8dwX7T2eGuoprq44GqA.png" /></figure><h4>Step 5 — Run the command on the 
VMware VM</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VbImGBVxgH02BbM6XsNQJg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jFWiVHnGGFAoSQVFsKrAOQ.png" /></figure><h4>Step 6 — Verify source server</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*r2y2hURMuFZ11dgQN1KbRQ.png" /></figure><h4>Step 7 — Monitor the progress of the replication</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LqlsgWtM9ENmxW10Q6IGwA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pJslvuEdVtI7LRRcKSrBiw.png" /></figure><h4>Step 8 — Launch a test instance</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kW19xDoOAhV6megp1RzE8g.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/922/1*WAl3wohtONNsLyRY9vnyTw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*02XI5yU9M_nwj1Cye3xClQ.png" /></figure><h4>Step 9 — Mark test instances as ready for cutover</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*g_AJOMszD3CMOjtTKOov1g.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/884/1*q-RXVDAvG6hIwrtx9JpLtg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SryWMOsQ9IggfSJ0JUDjTw.png" /></figure><h4>Step 10 — Launch a cutover</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Mw1P2TCwKk__m03JYxIXmg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/882/1*XYT9tqEgsHIr7bxf5hjtnQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cgKVEECFlKPTwQB2Y_wrvQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EwsVS-VeY70MDk4yM50eFQ.png" /></figure><h4>Step 11 — Initiate cutover and finalize migration</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Nv9qMK5wDc6YmRtnoI4nfQ.png" 
/></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/890/1*h4M8UEzEJs9HYmehC3swuQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vNIQqNbqYaveaRi9lR4DzA.png" /></figure><h3>Verify Migration</h3><h4>Step 1 — Allocate an Elastic IP to the migrated VM</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nL2MnfCbDn2k4Gs9DjFDzg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*l3Tm_xSYDasRSy6tvq_N3A.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wzFCkvqAVtCG0CM9wskzaQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rbb42nAg2_b-JLwhO4ouRw.png" /></figure><h4>Step 2 — Verify data and service</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/956/1*ZQhD9BUxvDXNTJzK8F5oiQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*u9IARYeW3Pk5iaFNr5Fe2g.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bnf-HI2sgAxz2Y2Pkbnk3g.png" /></figure><p>Migrating virtual machines from VMware to AWS using AWS Application Migration Service offers a range of benefits, including a more straightforward migration process, minimal disruption, and improved operational efficiency.</p><p>The service not only eases the migration process but also helps virtual machine workloads run safely and efficiently on AWS. If you are considering migrating virtual machines from an on-premises environment to AWS, AWS Application Migration Service is a solution worth considering.</p><p><strong>By </strong><a href="https://www.linkedin.com/in/muhammad-alfian-tirta-kusuma/"><strong>Muhammad Alfian Tirta</strong></a><strong>, CX Team Btech</strong></p><h3>Our Tagline</h3><p><em># Together is Better &amp; Continuous Learning</em></p>]]></content:encoded>
        </item>
    </channel>
</rss>