Stories by Btech Engineering on Medium

Securing OpenStack with MFA: Implementing TOTP Authentication on Horizon

Btech Engineering — Tue, 31 Mar 2026 06:49:46 GMT

Two-Factor Authentication is no longer optional — here’s how to enable it on your OpenStack deployment using TOTP, Keystone, and Horizon.

The Problem: One Password Is Never Enough

In cloud infrastructure, a single compromised password can mean everything: lost data, downed services, and a very bad day for your team. OpenStack’s Keystone has supported Time-based One-Time Password (TOTP) internally for a while, but there was a painful gap: Horizon, the web dashboard, had no native TOTP support. If TOTP was active in Keystone, users couldn’t even log in to the web UI.

That gap is now closed.

OpenStack Horizon now supports TOTP-based Multi-Factor Authentication (MFA) natively, enabling true 2FA directly from the browser, with no workarounds and no CLI-only flows.

What Is TOTP and Why Does It Matter?

TOTP (Time-based One-Time Password) is the technology behind authenticator apps like Google Authenticator and Duo Mobile. It generates a 6-digit code that changes every 30 seconds, derived from a shared secret key and the current timestamp.
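
The derivation can be sketched in a few lines of Python. This is a minimal illustration of RFC 6238 with HMAC-SHA1, 6 digits, and a 30-second step; the function name and parameters are our own and are not part of Keystone or Horizon.

```python
import base64
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret_b32: str, at_time: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """Return the TOTP code for a Base32 secret (RFC 6238, HMAC-SHA1)."""
    # Secrets are usually exchanged as unpadded Base32; restore the padding.
    padded = secret_b32.upper() + "=" * (-len(secret_b32) % 8)
    key = base64.b32decode(padded)
    # The moving factor is the number of time steps since the Unix epoch.
    now = time.time() if at_time is None else at_time
    counter = int(now // step)
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low nibble of the last byte is the offset.
    offset = digest[-1] & 0x0F
    number = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(number % 10 ** digits).zfill(digits)


if __name__ == "__main__":
    # Test secret from RFC 6238 (ASCII "12345678901234567890"), not a real key.
    demo_secret = base64.b32encode(b"12345678901234567890").decode("ascii")
    print(totp(demo_secret))
```

Because both sides compute the same HMAC over the same time counter, the server only needs the shared secret and a reasonably accurate clock to verify the code.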

The security model is simple and powerful:

Something you know → your password
Something you have → your phone (TOTP code)

Even if an attacker obtains your password through phishing or a data breach, they still cannot log in without the rotating code from your device.

This feature was originally driven by demand from Infomaniak’s public cloud customers — and it’s now available for anyone running OpenStack Epoxy or later.

Architecture Overview

The MFA flow works like this: Keystone handles token validation, while Horizon provides the UI entry point. The TOTP secret is stored as a credential object tied to the user in Keystone.
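
To make the Keystone side concrete, here is a sketch of the token request body a client sends when both methods are required. It follows the Keystone v3 identity API layout for the password and totp methods; the helper name and the placeholder values are our own.

```python
import json


def build_mfa_auth_body(user_id: str, password: str,
                        passcode: str, project_id: str) -> dict:
    """Build a Keystone v3 token request that uses password and totp together."""
    return {
        "auth": {
            "identity": {
                "methods": ["password", "totp"],
                "password": {"user": {"id": user_id, "password": password}},
                # The passcode is the current 6-digit code from the authenticator app.
                "totp": {"user": {"id": user_id, "passcode": passcode}},
            },
            "scope": {"project": {"id": project_id}},
        }
    }


if __name__ == "__main__":
    # Placeholder values; a real client would POST this to <keystone>/v3/auth/tokens.
    body = build_mfa_auth_body("USER_ID", "PASSWORD", "123456", "PROJECT_ID")
    print(json.dumps(body, indent=2))
```

Horizon collects the same three inputs through its login form and passes them to Keystone on the user's behalf.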

Environment

For this walkthrough, we’re using an All-in-One OpenStack Epoxy deployment via Kolla-Ansible.

Part 1: Deploy OpenStack Epoxy with Kolla-Ansible

If you already have OpenStack running, skip to Part 2.

Install Dependencies

sudo apt-get update
sudo apt-get install python3-dev libffi-dev gcc libssl-dev \
  python3-selinux python3-setuptools python3-venv -y

python3 -m venv os-venv
source os-venv/bin/activate
pip install -U pip
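# NOTE: Ansible 2.9 predates Kolla-Ansible 20.x (Epoxy); if this pin fails,
# install the Ansible version range listed in the Kolla-Ansible docs for your release
pip install ansible==2.9.13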

Install Kolla-Ansible

pip install kolla-ansible==20.3.0
sudo mkdir -p /etc/kolla
sudo chown $USER:$USER /etc/kolla

Configure and Deploy

cd ~
cp -r os-venv/share/kolla-ansible/etc_examples/kolla/* /etc/kolla
cp os-venv/share/kolla-ansible/ansible/inventory/all-in-one .
kolla-ansible install-deps
kolla-genpwd

Edit /etc/kolla/globals.yml with your environment values:

kolla_base_distro: "ubuntu"
network_interface: "ens3"
neutron_external_interface: "ens4"
kolla_internal_vip_address: "192.168.10.240"
enable_openstack_core: "yes"

Then deploy:

kolla-ansible bootstrap-servers -i ./all-in-one
kolla-ansible prechecks -i ./all-in-one
kolla-ansible deploy -i ./all-in-one

Common Issues:

Docker module not found → pip install docker

Dbus module not found → sudo apt install -y libdbus-1-dev libdbus-glib-1-dev gcc then pip install dbus-python

Part 2: Enable TOTP in Keystone

Create a custom Keystone config to add totp as an authentication method:

mkdir -p /etc/kolla/config/keystone
vim /etc/kolla/config/keystone/keystone.conf

Add the following:

[auth]
methods = password,token,totp
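
Before reconfiguring, it can help to confirm that the override file parses and really lists totp. The following is a small sketch using Python's standard configparser; the helper name is our own, and it only inspects the [auth] section shown above.

```python
import configparser
from typing import List


def enabled_auth_methods(conf_text: str) -> List[str]:
    """Return the auth methods listed under [auth] in a keystone.conf snippet."""
    # Interpolation is disabled so that literal % characters do not raise errors.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    raw = parser.get("auth", "methods", fallback="")
    return [method.strip() for method in raw.split(",") if method.strip()]


if __name__ == "__main__":
    sample = "[auth]\nmethods = password,token,totp\n"
    methods = enabled_auth_methods(sample)
    print(methods, "totp enabled:", "totp" in methods)
```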

Part 3: Enable TOTP in Horizon

Edit the Horizon Jinja2 template used by Kolla-Ansible:

vim ~/os-venv/share/kolla-ansible/ansible/roles/horizon/templates/_9999-custom-settings.py.j2

Add:

OPENSTACK_KEYSTONE_MFA_TOTP_ENABLED = True

AUTHENTICATION_PLUGINS = [
    'openstack_auth.plugin.password.PasswordPlugin',
    'openstack_auth.plugin.totp.TotpPlugin',
]

Then reconfigure both services:

kolla-ansible reconfigure -i ./all-in-one --tags keystone,horizon

Part 4: Create a TOTP-Enabled User

Generate a TOTP Secret Key

The secret must be at least 16 characters long and Base32-encoded, with the padding removed:

echo "iam16characters." | base32 | tr -d "="
# Output: NFQW2MJWMNUGC4TBMN2GK4TTFYFZ

Keys shorter than 16 characters fail silently: the authentication code will never match.
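
The same encoding can be done in Python, which makes the length check explicit. This is a sketch with a helper name of our own choosing. Note that a plain shell echo appends a newline, so the pipeline above encodes 17 bytes and returns a 28-character string, whereas encoding exactly 16 bytes yields 26 characters.

```python
import base64


def encode_totp_secret(secret: bytes) -> str:
    """Base32-encode a TOTP secret and strip the '=' padding."""
    if len(secret) < 16:
        # Shorter keys are the case the article warns about.
        raise ValueError("TOTP secret should be at least 16 bytes long")
    return base64.b32encode(secret).decode("ascii").rstrip("=")


if __name__ == "__main__":
    print(encode_totp_secret(b"iam16characters."))    # exactly 16 bytes
    print(encode_totp_secret(b"iam16characters.\n"))  # what `echo` without -n feeds to base32
```

Either form works as long as Keystone and the authenticator app are given the identical Base32 string.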

Create User and Assign Credentials

# Create user with MFA enforced
openstack user create \
  --project admin \
  --domain default \
  --project-domain default \
  --password-prompt \
  --enable-multi-factor-auth \
  --multi-factor-auth-rule password,totp \
  myuser

# Assign role
openstack role add --user myuser --project admin member

# Attach TOTP credential
openstack credential create --type totp myuser NFQW2MJWMNUGC4TBMN2GK4TTFYFZ

Verify that the credential was created:

openstack credential list

Expected output

+----------------------------------+------+----------------------------------+------------------------------+------------+
| ID                               | Type | User ID                          | Data                         | Project ID |
+----------------------------------+------+----------------------------------+------------------------------+------------+
| 1642a6997e754564aa45e102af92b6b4 | totp | c4014ddfce22424bb7416c63a50a7a47 | NFQW2MJWMNUGC4TBMN2GK4TTFYFZ | None       |
+----------------------------------+------+----------------------------------+------------------------------+------------+

Register in Your Authenticator App

Open Duo Mobile (or Google Authenticator):

Add new account → Choose Google or Enter key manually
Paste the key: NFQW2MJWMNUGC4TBMN2GK4TTFYFZ
Give it a name (e.g., OpenStack Admin) → Save

You’ll now see a 6-digit code that rotates every 30 seconds.
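
Instead of typing the key by hand, most authenticator apps can also scan a QR code that encodes an otpauth:// URI. The sketch below builds such a URI in the widely used Key URI format; the issuer and account labels are illustrative assumptions, and rendering the QR image is left to a tool of your choice.

```python
from urllib.parse import quote, urlencode


def build_otpauth_uri(secret_b32: str, account: str, issuer: str) -> str:
    """Build an otpauth:// URI for a TOTP secret (SHA1, 6 digits, 30 seconds)."""
    label = quote(f"{issuer}:{account}")
    params = urlencode(
        {
            "secret": secret_b32,
            "issuer": issuer,
            "algorithm": "SHA1",
            "digits": 6,
            "period": 30,
        },
        quote_via=quote,  # encode spaces as %20 rather than '+'
    )
    return f"otpauth://totp/{label}?{params}"


if __name__ == "__main__":
    # Uses the example key from this walkthrough; never publish a real secret.
    print(build_otpauth_uri("NFQW2MJWMNUGC4TBMN2GK4TTFYFZ", "myuser", "OpenStack"))
```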

Part 5: Logging In via Horizon

Go to your Horizon dashboard and log in with:

Username
Password
TOTP Code (from your authenticator app)

That’s it — full 2FA through the web interface.

Final Thoughts

Enabling MFA on OpenStack is a relatively small configuration change with a massive security payoff. With TOTP now supported end-to-end, from Keystone to Horizon to the CLI, there’s no reason to leave your cloud dashboard protected by a single password.

If you’re running OpenStack in any environment exposed to multiple users or external networks, turn this on.

Want to Go Deeper?

This guide is based on hands-on work from the Boer Technology engineering team, a managed cloud and professional services company that runs OpenStack, Kubernetes, and cloud-native platforms at scale for enterprise customers.

Get in Touch

Ready to optimize your infrastructure? Let’s discuss how we can help:

Website: www.btech.id
Email: support@btech.id

Author:
Managed Services Team — PT. Boer Technology

Tags: #Openstack #TOTP #MFA #2FA #Authentication #Horizon #Keystone #OpenSource

Ansible AWX: Infrastructure Automation on Top of Kubernetes

Btech Engineering — Mon, 16 Mar 2026 04:45:58 GMT

This article documents our team’s research journey exploring Ansible AWX as an infrastructure automation orchestration platform — from initial deployment and OpenStack integration to air-gap installation.

What is Ansible AWX?

Ansible AWX is the open-source version of Red Hat Ansible Automation Platform, providing a web UI to manage Ansible resources interactively. With AWX, teams can run playbooks, manage inventories, schedule jobs, and handle credentials, all through a browser and without working from a terminal over SSH.

Since version 18.0, the recommended way to deploy AWX is with the AWX Operator on top of a Kubernetes platform.

Stage 1: Deploying AWX on K3s

Why K3s?

K3s is a lightweight Kubernetes distribution built by Rancher that retains the full functionality of Kubernetes while consuming far fewer resources compared to vanilla Kubernetes. It’s well-suited for lab environments with limited specs (minimum 4 cores / 8 GB RAM for a single-node AWX setup).

Deployment Steps

Install K3s:

curl -sfL https://get.k3s.io | sh -

Install AWX Operator:

git clone https://github.com/ansible/awx-operator.git
cd awx-operator
git checkout 2.19.1
make deploy

Create an AWX Instance:

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx-instance
  namespace: awx
spec:
  service_type: NodePort

AWX is then accessible via NodePort, and the admin password can be retrieved with:

kubectl get secret -n awx awx-instance-admin-password -o jsonpath="{.data.password}" | base64 --decode

Stage 2: Dynamic Inventory from OpenStack

Problem: DNS Resolution Failure

When attempting to sync dynamic inventory from OpenStack, AWX failed because pods inside Kubernetes couldn’t resolve internal domains. The fix was to update the CoreDNS ConfigMap:

kubectl edit configmaps coredns -n kube-system

Add the required host entries to the NodeHosts block. After restarting CoreDNS, nslookup from inside the pod succeeded and inventory sync completed successfully.

Setting Up OpenStack Dynamic Inventory

Create a custom Credential Type for OpenStack with input fields: username, password, project_name, auth_url, and region_name. The Injector Configuration maps these fields to OS_* environment variables.
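
As an illustration, the two definitions can be generated from the field list. This sketch follows the general layout AWX uses for custom credential types (an inputs document with fields and required, and an injectors document with env); the labels and the choice to mark only password as secret are our assumptions.

```python
import json
from typing import Dict, Sequence, Tuple

FIELDS = ("username", "password", "project_name", "auth_url", "region_name")


def build_credential_type(fields: Sequence[str] = FIELDS,
                          secret_fields: Sequence[str] = ("password",)
                          ) -> Tuple[Dict, Dict]:
    """Return (inputs, injectors) for an OpenStack-style custom credential type."""
    inputs = {
        "fields": [
            {
                "id": name,
                "type": "string",
                "label": name.replace("_", " ").title(),
                "secret": name in secret_fields,
            }
            for name in fields
        ],
        "required": list(fields),
    }
    # Each field is exposed to the job as the matching OS_* environment variable.
    injectors = {"env": {f"OS_{name.upper()}": "{{ " + name + " }}" for name in fields}}
    return inputs, injectors


if __name__ == "__main__":
    inputs, injectors = build_credential_type()
    print(json.dumps(inputs, indent=2))
    print(json.dumps(injectors, indent=2))
```

The printed JSON can be pasted into the Input Configuration and Injector Configuration boxes, which accept JSON as well as YAML.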

For multi-project support, add the following to the inventory Source Variables:

plugin: openstack.cloud.openstack
cloud: myopenstack
expand_hostvars: true
fail_on_errors: true
all_projects: true

Stage 3: Handling Dynamic SSH Users

One of the main challenges was that each OpenStack instance could have a different default SSH user (ubuntu, centos, cloud-user) and a different SSH key. Our team developed several approaches:

Approach 1: Bash Script to Update ansible_user

The update_host.sh script reads an ip_user_map.json file and sends PATCH requests to the AWX API to update each host's variables accordingly.
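
The script itself is not reproduced here, so the following is only a Python sketch of the kind of request it would send. It assumes an OAuth2 bearer token and a known numeric host ID; the AWX URL, token, and helper name are placeholders.

```python
import json
import urllib.request


def build_host_patch(awx_url: str, token: str, host_id: int,
                     ansible_user: str) -> urllib.request.Request:
    """Build a PATCH request that sets ansible_user in an AWX host's variables."""
    # AWX stores host variables as a JSON (or YAML) string. Sending "variables"
    # replaces the whole string, so merge any existing variables first.
    payload = {"variables": json.dumps({"ansible_user": ansible_user})}
    return urllib.request.Request(
        url=f"{awx_url.rstrip('/')}/api/v2/hosts/{host_id}/",
        data=json.dumps(payload).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_host_patch("https://awx.example.com", "TOKEN", 42, "ubuntu")
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req) would send it; omitted in this sketch.
```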

Approach 2: Auto-Detection via Ansible Facts

Create two Job Templates: one for setup (collecting Ansible facts and storing them in fact storage) and another that runs the main tasks using the already-known user from the stored facts.

- name: Set ansible_user dynamically
  set_fact:
    ansible_user: >-
      {{
        ansible_env.SUDO_USER |
        default(ansible_user_id, true) |
        default('ubuntu')
      }}

Approach 3: Detection via OpenStack Image Name

Use image_name metadata from an OpenStack volume to automatically determine the correct SSH user via regex_replace.
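
The mapping logic is easy to prototype outside Ansible. In this sketch the patterns and resulting users are hypothetical examples; adjust them to the images actually present in your cloud.

```python
import re

# Hypothetical image-name patterns and the default SSH user they imply.
IMAGE_USER_PATTERNS = [
    (re.compile(r"ubuntu", re.IGNORECASE), "ubuntu"),
    (re.compile(r"centos", re.IGNORECASE), "centos"),
    (re.compile(r"rhel|red\s*hat", re.IGNORECASE), "cloud-user"),
]


def ssh_user_for_image(image_name: str, default: str = "ubuntu") -> str:
    """Return the SSH user for an image name, or the default if nothing matches."""
    for pattern, user in IMAGE_USER_PATTERNS:
        if pattern.search(image_name):
            return user
    return default


if __name__ == "__main__":
    for name in ("Ubuntu-22.04-x86_64", "CentOS-Stream-9", "rhel-9.3", "custom-image"):
        print(name, "->", ssh_user_for_image(name))
```

The first matching pattern wins, so more specific patterns should be listed before broader ones.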

Stage 4: AWX Execution Nodes (Multi-Project)

Network Isolation Between Projects

Each OpenStack project has its own isolated network, so the AWX control plane cannot always reach instances across all projects. The solution is to deploy a Receptor-based Execution Node in each project as an execution agent.

How It Works

AWX control plane → sends instructions via Receptor → Execution Node inside the target project runs the playbook → results are returned to AWX.

OS Requirements for Execution Nodes

Ubuntu 22.04 (Jammy) or RHEL 9 is required because:

Podman is available in the official Ubuntu repositories starting from 20.10
Python >= 3.9 is already the base version
The latest Ansible versions support FQCN syntax

Installing Receptor

Add a new instance in the AWX UI and download the bundle installer
Install Ansible on the execution node
Edit inventory.yml inside the bundle and update ansible_host and ansible_user
Run the installer playbook from the bundle
Add DNS mapping for the execution node hostname in CoreDNS
Perform a health check from the AWX dashboard
Create an Instance Group and assign the execution node to that group

Stage 5: Manual Playbooks & Volume Mounting

To run playbooks stored manually (not from SCM/Git), AWX needs to be configured so that its pods mount the same directory from the host.

spec:
  web_extra_volume_mounts: |
    - name: playbook-volume
      mountPath: /var/lib/awx/projects
  task_extra_volume_mounts: |
    - name: playbook-volume
      mountPath: /var/lib/awx/projects
  extra_volumes: |
    - name: playbook-volume
      hostPath:
        path: /data/projects
        type: Directory

All playbooks can then simply be stored in /data/projects on the AWX host VM.

Stage 6: Custom Execution Environment (EE)

If a playbook requires a collection not available in the default EE image (e.g., openstack.cloud), a custom EE image needs to be built using ansible-builder.

pip3 install ansible-builder
ansible-builder build -t ee-openstack:latest

The image is then pushed to a registry and registered in AWX as a new Execution Environment.

Stage 7: Air-Gap Install (Offline Method)

In production environments isolated from the internet, the installation is done fully offline using:

Registry Mirror (Docker Registry v2) to cache images from docker.io and quay.io
K3s Airgap Install using a pre-downloaded k3s-airgap-images-amd64.tar.zst file
registries.yaml configuration so K3s pulls images from the local mirror

The end result is a fully functional AWX installation running without any internet connection, using the local registry mirror as the image source.

Capacity & Forks Calculation

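AWX calculates execution node capacity from two values:

CPU: num_cores × 4 forks
RAM: (ram_size_mb − 2048) / 100 forks (about 2 GB is reserved for the system)

The instance’s capacity adjustment setting then picks a value between the lower and the higher of the two; when it is set to its maximum, the higher value is used. For an instance with 4 vCPU / 16 GB RAM, AWX can process up to 1,100 events/second and execute up to 137 forks simultaneously (the memory-bound value), which is more than enough to manage thousands of hosts at once.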
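
The arithmetic can be checked with a short calculation. This sketch reflects our understanding of AWX's documented capacity model: 4 forks per core, 100 MB per fork after reserving about 2 GB, and a capacity adjustment between 0.0 (the lower of the two values) and 1.0 (the higher). The 15,750 MB figure is an assumption for how much of a nominal 16 GB the OS reports, chosen because it reproduces the 137 forks quoted above.

```python
def cpu_capacity(num_cores: int, forks_per_core: int = 4) -> int:
    """CPU-bound capacity: a fixed number of forks per core."""
    return num_cores * forks_per_core


def mem_capacity(ram_mb: int, mb_per_fork: int = 100, reserved_mb: int = 2048) -> int:
    """Memory-bound capacity: forks that fit after reserving memory for the system."""
    return max(1, (ram_mb - reserved_mb) // mb_per_fork)


def instance_capacity(num_cores: int, ram_mb: int, adjustment: float = 1.0) -> int:
    """Blend the two bounds: 0.0 gives the lower value, 1.0 the higher one."""
    cpu = cpu_capacity(num_cores)
    mem = mem_capacity(ram_mb)
    low, high = min(cpu, mem), max(cpu, mem)
    return int(low + (high - low) * adjustment)


if __name__ == "__main__":
    for adj in (0.0, 0.5, 1.0):
        print(f"adjustment={adj}: {instance_capacity(4, 15750, adj)} forks")
```

With the adjustment at 0.0 the node is limited to the CPU-bound 16 forks, which is the conservative choice for CPU-heavy playbooks.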

Lessons Learned

Key takeaways from this research:

DNS is everything. The most common issue when deploying AWX on K8s always comes back to CoreDNS configuration.
SSH users must be dynamic. In a multi-OS OpenStack environment, a single-credential-for-all approach simply won’t work.
One Execution Node per project is the best solution for network isolation across OpenStack projects.
Air-gap install requires thorough preparation: all dependencies must be available in the local registry before starting the installation.
AWX Fact Storage is incredibly useful for caching host information so playbooks don’t need to gather facts on every run.

Closing

Ansible AWX proves to be a powerful platform for managing infrastructure automation at scale. With its operator-based approach on Kubernetes, AWX provides high flexibility — from single-node deployments to multi-execution-node clustering in fully isolated enterprise environments.

This research is still ongoing as we explore more advanced AWX features. In some production environments, we have already used AWX to automatically upgrade services across hundreds of nodes from an interactive dashboard, which shortened implementation time and made the process more efficient.

Need Help with Your Automation Platform?

Implementing and managing an automation platform can be challenging, from designing the architecture and writing the method of procedure to implementing and operating a platform such as AWX at production level.

At Boer Technology, we specialize in designing, implementing, and managing automation platform solutions, including AWX. Our team has experience managing production environments with hundreds of nodes, so your environment can be managed centrally and efficiently.

Author

Managed Services Team — PT. Boer Technology

Tags: #AWX #AutomationPlatform #Ansible #OpenSource #DevOps #Kubernetes

VictoriaLogs Deployment: Single Node vs Cluster Mode — A Comprehensive Guide

Btech Engineering — Fri, 13 Feb 2026 08:26:26 GMT

Introduction

In the world of log management, finding the right balance between performance, scalability, and resource efficiency is crucial. VictoriaLogs, developed by the VictoriaMetrics team, offers a compelling solution with its lightweight architecture and powerful querying capabilities. But when should you choose a single-node deployment versus a cluster mode? This guide will walk you through both deployment strategies, helping you make an informed decision for your infrastructure.

What is VictoriaLogs?

VictoriaLogs is a high-performance, open-source log management system designed for efficient storage and analysis of large volumes of logs. It stands out with:

Superior Compression: Uses significantly less storage space compared to Elasticsearch or Loki
Low Resource Usage: Lower CPU/RAM/disk requirements
Simple Deployment: Single binary with minimal configuration
Powerful Query Language: LogsQL provides intuitive and fast log searching
Grafana Integration: Seamless visualization and alerting capabilities

VictoriaLogs vs Graylog

Before diving into deployment modes, let’s understand how VictoriaLogs compares to traditional solutions:

Part 1: Single Node Deployment

When to Use Single Node?

Single node deployment is ideal for:

Small to medium-sized infrastructures
Development and testing environments
Organizations with limited resources
Scenarios where vertical scaling is sufficient

Architecture Overview

Step-by-Step Installation

1. Install VictoriaLogs

Download and install the VictoriaLogs binary:

# Download VictoriaLogs
wget https://github.com/VictoriaMetrics/VictoriaLogs/releases/download/v1.43.0/victoria-logs-linux-amd64-v1.43.0.tar.gz

# Extract archive
tar xvf victoria-logs-linux-amd64-*.tar.gz

# Move binary to system path
sudo mv victoria-logs-prod /usr/local/bin/

# Create storage directory
sudo mkdir -p /data/victoria-logs

2. Create Systemd Service

cat <<EOF | sudo tee /etc/systemd/system/victoria-logs.service
[Unit]
Description=Victoria Logs
After=network-online.target
Wants=network-online.target systemd-networkd-wait-online.service

[Service]
Restart=on-failure
RestartSec=5s
PrivateTmp=true
PrivateDevices=false
ProtectHome=true
ProtectSystem=full
ExecStart=/usr/local/bin/victoria-logs-prod \\
    -retentionPeriod=1w \\
    -storageDataPath=/data/victoria-logs

[Install]
WantedBy=multi-user.target
EOF

Start and enable service:

sudo systemctl daemon-reload
sudo systemctl enable --now victoria-logs
sudo systemctl status victoria-logs

3. Configure Filebeat for Log Ingestion

Install required packages:

# Install prerequisites (Filebeat itself does not require Java)
sudo apt install -y openjdk-11-jdk apt-transport-https gnupg2

# Add Elasticsearch repository
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb https://artifacts.elastic.co/packages/8.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-8.x.list

# Install Filebeat
sudo apt update && sudo apt install -y filebeat

Configure Filebeat:

filebeat.inputs:
- type: filestream
  id: my-filestream-id
  enabled: true
  paths:
    - /var/log/*.log

output.elasticsearch:
  hosts: ["http://<VICTORIALOGS_IP>:9428/insert/elasticsearch/"]
  parameters:
    _msg_field: "message"
    _time_field: "@timestamp"
    _stream_fields: "host.hostname,log.file.path"
  preset: balanced

Restart Filebeat service:

sudo systemctl restart filebeat
journalctl -u filebeat -f  # Verify connection

4. Test Setup

Query logs from CLI:

curl http://localhost:9428/select/logsql/query -d 'query=*'
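
The same query can be issued from a script. VictoriaLogs streams results as one JSON object per line, so the response needs line-by-line parsing. This is a sketch with helper names of our own; the base URL assumes the single-node instance above is listening on localhost.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List


def build_query_request(base_url: str, query: str,
                        limit: int = 10) -> urllib.request.Request:
    """Build a POST request for the /select/logsql/query endpoint."""
    data = urllib.parse.urlencode({"query": query, "limit": limit}).encode("ascii")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/select/logsql/query", data=data, method="POST"
    )


def parse_json_lines(body: str) -> List[Dict]:
    """Parse a newline-delimited JSON response into a list of log entries."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


if __name__ == "__main__":
    req = build_query_request("http://localhost:9428", "*")
    print(req.get_method(), req.full_url, req.data)
    # With a running instance:
    # with urllib.request.urlopen(req) as resp:
    #     for entry in parse_json_lines(resp.read().decode("utf-8")):
    #         print(entry.get("_time"), entry.get("_msg"))
```

Passing a limit keeps an exploratory query such as * from returning the entire log store.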

Part 2: Cluster Mode Deployment

When to Use Cluster Mode?

Cluster mode becomes necessary when:

Single-node reaches vertical scalability limits
You need horizontal scaling across multiple machines
High availability is required
Log volume exceeds single-node capacity

Architecture Overview

Cluster Components

VictoriaLogs Cluster consists of three primary components:

vlinsert — Log ingestion frontend (Master Node)
vlstorage — Storage nodes (Worker Nodes)
vlselect — Query layer (Master Node)

Step-by-Step Installation

1. Prepare All Nodes

On all nodes (master and workers), download and install VictoriaLogs:

# Download VictoriaLogs binary
wget https://github.com/VictoriaMetrics/VictoriaLogs/releases/download/v1.43.0/victoria-logs-linux-amd64-v1.43.0.tar.gz

# Extract and install
tar xvf victoria-logs-linux-amd64-*.tar.gz
sudo mv victoria-logs-prod /usr/local/bin/

# Create storage directory
sudo mkdir -p /data/victoria-logs

2. Deploy vlstorage (Worker Nodes)

On each worker node, create the storage service:

cat <<'EOF' | sudo tee /etc/systemd/system/victoria-logs-storage.service
[Unit]
Description=VictoriaLogs Storage (vlstorage)
After=network-online.target
Wants=network-online.target systemd-networkd-wait-online.service

[Service]
Type=simple
Restart=on-failure
RestartSec=5s
User=root
Group=root

# Hardening
PrivateTmp=true
PrivateDevices=false
ProtectHome=true
ProtectSystem=full
NoNewPrivileges=true

ExecStart=/usr/local/bin/victoria-logs-prod \
  -httpListenAddr=:9430 \
  -storageDataPath=/data/victoria-logs \
  -retentionPeriod=1w

[Install]
WantedBy=multi-user.target
EOF

Start service:

sudo systemctl daemon-reload
sudo systemctl enable --now victoria-logs-storage
sudo systemctl status victoria-logs-storage

3. Deploy vlinsert (Master Node)

On the master node, create the insert service:

cat <<'EOF' | sudo tee /etc/systemd/system/victoria-logs-insert.service
[Unit]
Description=VictoriaLogs Insert (vlinsert)
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
Restart=on-failure
RestartSec=5s
User=root
Group=root

# Hardening
PrivateTmp=true
PrivateDevices=false
ProtectHome=true
ProtectSystem=full
NoNewPrivileges=true

ExecStart=/usr/local/bin/victoria-logs-prod \
  -httpListenAddr=:9428 \
  -storageNode=<WORKER1_IP>:9430 \
  -storageNode=<WORKER2_IP>:9430

[Install]
WantedBy=multi-user.target
EOF

Important: Replace <WORKER1_IP> and <WORKER2_IP> with the actual worker IP addresses.

Start service:

sudo systemctl daemon-reload
sudo systemctl enable --now victoria-logs-insert
sudo systemctl status victoria-logs-insert

4. Deploy vlselect (Master Node)

On the master node, create the select service:

cat <<'EOF' | sudo tee /etc/systemd/system/victoria-logs-select.service
[Unit]
Description=VictoriaLogs Select (vlselect)
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
Restart=on-failure
RestartSec=5s
User=root
Group=root

# Hardening
PrivateTmp=true
PrivateDevices=false
ProtectHome=true
ProtectSystem=full
NoNewPrivileges=true

ExecStart=/usr/local/bin/victoria-logs-prod \
  -httpListenAddr=:9429 \
  -storageNode=<WORKER1_IP>:9430 \
  -storageNode=<WORKER2_IP>:9430

[Install]
WantedBy=multi-user.target
EOF

Start service:

sudo systemctl daemon-reload
sudo systemctl enable --now victoria-logs-select
sudo systemctl status victoria-logs-select

5. Configure Filebeat (Worker Nodes)

On each worker node, configure Filebeat to send logs to vlinsert:

filebeat.inputs:
- type: filestream
  id: my-filestream-id
  enabled: true
  paths:
    - /var/log/*.log

output.elasticsearch:
  hosts: ["http://<MASTER_IP>:9428/insert/elasticsearch/"]
  parameters:
    _msg_field: "message"
    _time_field: "@timestamp"
    _stream_fields: "host.hostname,log.file.path"
  preset: balanced

Restart Filebeat:

sudo systemctl restart filebeat
sudo systemctl enable filebeat
journalctl -u filebeat -f

6. Verify Cluster Health

Check component health:

# Check vlinsert health
curl -s http://<MASTER_IP>:9428/health

# Check vlselect health
curl -s http://<MASTER_IP>:9429/health

# Test query
curl -s http://<MASTER_IP>:9429/select/logsql/query -d 'query=*'

Check service logs:

# On master node
sudo journalctl -u victoria-logs-insert -f
sudo journalctl -u victoria-logs-select -f

# On worker nodes
sudo journalctl -u victoria-logs-storage -f
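
For routine checks, the two health probes can be wrapped in a small script. This sketch only looks at the HTTP status of the /health endpoints on the ports used in this guide; the function names are our own, and the fetch function is injectable so the logic can be exercised without a live cluster.

```python
import urllib.request
from typing import Callable, Dict


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for a GET request."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def check_cluster(master_host: str,
                  fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Probe vlinsert (9428) and vlselect (9429) on the master node."""
    endpoints = {
        "vlinsert": f"http://{master_host}:9428/health",
        "vlselect": f"http://{master_host}:9429/health",
    }
    results = {}
    for name, url in endpoints.items():
        try:
            results[name] = fetch(url) == 200
        except OSError:
            # Connection errors and HTTP errors from urllib are OSError subclasses.
            results[name] = False
    return results


if __name__ == "__main__":
    # "127.0.0.1" is a placeholder; use the master node's address.
    print(check_cluster("127.0.0.1"))
```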

Grafana Integration (Both Modes)

Install Grafana (Master Node)

# Install prerequisites
sudo apt-get install -y software-properties-common

# Add Grafana GPG key
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -

# Add Grafana repository
echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list

# Install Grafana
sudo apt-get update
sudo apt-get install -y grafana

# Start and enable service
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
sudo systemctl status grafana-server

Install VictoriaLogs Plugin

# For Grafana v10.2.0
sudo grafana-cli plugins install victoriametrics-logs-datasource 0.16.3

# For latest Grafana
sudo grafana-cli plugins install victoriametrics-logs-datasource

# Restart Grafana
sudo systemctl restart grafana-server

Configure Data Source

Access Grafana: http://<MASTER_IP>:3000
Default credentials: admin/admin
Go to Configuration → Data Sources
Click Add data source
Search for VictoriaLogs
Configure the URL and click Save & Test:

Single Node: http://<VICTORIALOGS_IP>:9428

Cluster Mode: http://<MASTER_IP>:9429

Create Dashboard

Example LogsQL queries:

# Basic query - all logs from specific host
_stream:{host.hostname="worker01",log.file.path="/var/log/syslog"}

# Count logs by time
_stream:{host.hostname="worker01"} | stats by (_time) count()

# Search for specific pattern
_stream:{host.hostname="worker01"} "error" | stats by (_time, log.level) count()

# Advanced filtering
"processor error" | stats by (_time, host.name, host.ip, log.file.path) count()

Setup Alerting

Create an alert rule in Grafana:

Navigate to Alerting → Alert Rules
Click New alert rule
Configure query:

"processor error" | stats by (_time, host.name, host.ip, log.file.path) count()

Next, set the alert condition:

Time Range: now-15m to now
Threshold: IS ABOVE 0

Next, add custom annotations with an Alert-ID, then save and test.

Test alert:

# Send test log
# The message includes "processor error" so that it matches the alert query above
logger -p local0.crit "CRITICAL_TEST: processor error (test alert)"

# Verify in Grafana

Comparison: Single Node vs Cluster Mode

When to Migrate from Single Node to Cluster

Consider migrating when you experience:

Resource Saturation: CPU, RAM, or disk consistently at >80% usage
Slow Queries: Query response times degrading
Storage Limits: Approaching disk capacity limits
High Ingestion Rates: Log ingestion causing service degradation
Availability Requirements: Need for zero-downtime operations
Business Growth: Anticipating significant log volume increase

Migration Path: Single Node to Cluster

If you need to migrate from single node to cluster mode:

Step 1: Backup Current Data

sudo systemctl stop victoria-logs
sudo tar -czf victoria-logs-migration-backup.tar.gz /data/victoria-logs

Step 2: Deploy Cluster Infrastructure

Set up master and worker nodes
Deploy vlinsert, vlselect, vlstorage as described above

Step 3: Migrate Data (Optional)

# Copy data to one of the worker nodes
scp victoria-logs-migration-backup.tar.gz worker1:/tmp/

# On worker1
sudo systemctl stop victoria-logs-storage
sudo tar -xzf /tmp/victoria-logs-migration-backup.tar.gz -C /
sudo systemctl start victoria-logs-storage

Step 4: Update Filebeat Configuration

Update all Filebeat instances to point to the new cluster:

output.elasticsearch:
  hosts: ["http://<MASTER_IP>:9428/insert/elasticsearch/"]

Step 5: Verify and Monitor

Check cluster health
Verify log ingestion
Monitor performance metrics

Conclusion

Both VictoriaLogs deployment modes offer compelling advantages depending on your scale and requirements:

Choose Single Node When:

Log volume is under 500GB/day
Budget is limited
Operational simplicity is a priority
You have development/testing workloads
Downtime windows are acceptable

Choose Cluster Mode When:

Log volume exceeds 500GB/day
High availability is critical
You need horizontal scalability
Zero-downtime operations are required
You have distributed infrastructure

Why VictoriaLogs Stands Out:

Resource Efficiency: Uses 50–80% less resources than Elasticsearch
Storage Compression: 10x better compression than alternatives
Simple Setup: Single binary, minimal dependencies
Powerful Query Language: LogsQL is intuitive and fast
Cost-Effective: Significantly reduces infrastructure costs

Need Help with Your VictoriaLogs Deployment?

Implementing and managing a robust log management system can be challenging. Whether you’re setting up your first VictoriaLogs instance or scaling to a cluster mode, having the right expertise makes all the difference.

At Boer Technology, we specialize in deploying and managing high-performance observability solutions including VictoriaLogs. Our team has extensive experience with both single-node and cluster deployments across various scales.

Author:
Managed Services Team — PT. Boer Technology

Tags: #VictoriaLogs #LogManagement #DevOps #Observability #OpenSource #Monitoring #Grafana #Kubernetes #CloudNative #SRE

Karma: A Centralized Dashboard for Prometheus Alertmanagers

Btech Engineering — Sat, 09 Aug 2025 08:57:37 GMT

A. Getting to Know Karma
Juggling alerts from multiple Prometheus environments can feel like herding cats — noisy, scattered, and hard to control. While Alertmanager gets the job done, its built-in UI isn’t designed for easily managing alerts across clusters. That’s where Karma steps in, giving you a single, centralized dashboard to keep your alerting chaos in check.

Karma is an open-source web UI designed to aggregate and manage alerts from one or more Prometheus Alertmanager instances.
In the Prometheus ecosystem, it acts as a centralized control panel — giving operators, SREs, and DevOps teams a unified place to view, filter, group, and silence alerts across different environments, making large-scale alert management more efficient and less error-prone.

Karma comes with several benefits. Some of them are:

Centralization: View alerts from multiple Alertmanager instances in one unified dashboard
Advanced Filtering: Quickly narrow down alerts by labels, severity, source, or custom queries
Easy Silencing: Create, edit, and remove silences directly from the UI without manual API calls

B. Installing Karma and Verifying Access
This section demonstrates how to install Karma using a binary deployment. We’ll also test initial access to Karma after installation. We’ll use this topology:

Karma Demonstration Topology

The components and pre-configuration in this lab are:

Bare workload cluster
- This cluster simulates machines with bare workloads
- Machines in this cluster are already configured with Node Exporters to expose bare machine metrics
Containerized workload cluster
- This cluster simulates machines with Docker container workloads
- The Docker service in this cluster is already configured to expose Docker metrics
bare-mon (192.168.1.232/24)
- This machine hosts Prometheus and Alertmanager for Bare workload machines
- Prometheus is already configured to scrape metrics from the Node Exporters in the bare workload cluster and has the following alert rules configured: NodeCPUUsageHigh, NodeMemoryPressure, NodeFileSystemAlmostFull, NodeProcsBlocked, and DeadMansSwitch for monitoring the health of the alerting system
- Alertmanager is already configured and integrated with Prometheus in the same machine
ctr-mon (192.168.1.234/24)
- This machine hosts Prometheus and Alertmanager for Containerized workload machines
- Prometheus is already configured to scrape Docker metrics from the containerized workload cluster and has six alert rules configured: DockerContainerRunningLow, DockerContainerStartSlow, DockerProcessMemoryHigh, DockerProcessFdsHigh, DockerNetworkBytesLow, and DeadMansSwitch for monitoring the health of the alerting system
- Alertmanager is already configured and integrated with Prometheus in the same machine
karma (192.168.1.235/24)
- This machine runs Ubuntu 22.04 Server. IPv4, Internet access, and the hostname are already configured

With all prerequisites in place, we can now proceed to installing Karma.

Get the latest Karma binary. At the time of writing, the latest version is v0.121.

wget https://github.com/prymitive/karma/releases/download/v0.121/karma-linux-amd64.tar.gz
tar -zxvf karma-linux-amd64.tar.gz

Give executable permission to the binary and move it to a PATH directory (such as /usr/local/bin)

sudo chmod +x karma-linux-amd64
sudo mv karma-linux-amd64 /usr/local/bin/karma

Now, we’ll create a configuration file for Karma. Create a new directory and a new configuration file

sudo mkdir /etc/karma
sudo nano /etc/karma/karma.yml

In the configuration file, we’ll set up the following:
- For security, enable HTTP basic authentication so Karma is not open without any security measures

authentication:
  basicAuth:
    users:
      - username: alert-admin
        password: Passw0rd$

- Configure Karma to get alerts from the Alertmanagers on bare-mon and ctr-mon. The default URL for an Alertmanager is http://<alertmanager-ip>:9093. We’ll also configure a health check for each Alertmanager using the DeadMansSwitch alert, which is always firing to indicate that the alerting system is healthy.

alertmanager:
  servers:
    - name: bare-alertmanager
      uri: http://192.168.1.232:9093
      healthcheck:
        filters:
          bare-mon:
            - alertname=DeadMansSwitch
            - instance=bare-mon

    - name: ctr-alertmanager
      uri: http://192.168.1.234:9093
      healthcheck:
        filters:
          ctr-mon:
            - alertname=DeadMansSwitch
            - instance=ctr-mon
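
To make the health check idea concrete, here is a minimal Python sketch of the logic behind it. It is not Karma's actual implementation: it only illustrates that an Alertmanager is considered healthy while at least one alert matches every filter. The function name is_healthy and the sample alerts are assumptions made for illustration.

```python
def is_healthy(alerts, filters):
    """Return True if at least one alert carries every label=value pair in filters."""
    wanted = dict(f.split("=", 1) for f in filters)
    return any(
        all(alert.get("labels", {}).get(key) == value for key, value in wanted.items())
        for alert in alerts
    )


# Hypothetical alerts, shaped like entries in an Alertmanager alert list
sample_alerts = [
    {"labels": {"alertname": "NodeCPUUsageHigh", "instance": "bare-node01"}},
    {"labels": {"alertname": "DeadMansSwitch", "instance": "bare-mon"}},
]
filters = ["alertname=DeadMansSwitch", "instance=bare-mon"]

print(is_healthy(sample_alerts, filters))      # the DeadMansSwitch alert is present
print(is_healthy(sample_alerts[:1], filters))  # the DeadMansSwitch alert is missing
```

If the DeadMansSwitch alert ever disappears, the check fails, which is exactly the signal that the alerting pipeline itself needs attention.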

The complete /etc/karma/karma.yml then looks like this:

authentication:
  basicAuth:
    users:
      - username: alert-admin
        password: Passw0rd$
alertmanager:
  servers:
    - name: bare-alertmanager
      uri: http://192.168.1.232:9093
      healthcheck:
        filters:
          bare-mon:
            - alertname=DeadMansSwitch
            - instance=bare-mon

    - name: ctr-alertmanager
      uri: http://192.168.1.234:9093
      healthcheck:
        filters:
          ctr-mon:
            - alertname=DeadMansSwitch
            - instance=ctr-mon

Save the configuration. Next, we’ll create a systemd service file so the Karma process is managed as a service

sudo nano /etc/systemd/system/karma.service

Content of the file will be:

[Unit]
Description = Karma Service

[Service]
ExecStart = /usr/local/bin/karma --config.file /etc/karma/karma.yml

[Install]
WantedBy = multi-user.target

Save the file, then enable and start the Karma service. Verify that Karma is running and listening on its default port (8080)

sudo systemctl enable --now karma.service
sudo systemctl status karma
sudo lsof -P -i -n | grep karma

Karma service status, showing it listening on port 8080
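
The same port check can be scripted. The following is a minimal sketch using only the Python standard library; the function name port_is_open and the 127.0.0.1 address are assumptions for illustration, so substitute the address of your own Karma machine.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


# Assumed address: run this on the Karma machine itself
print(port_is_open("127.0.0.1", 8080))
```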

Test access to Karma by opening its URL: http://:8080. We’ll be prompted for HTTP basic auth credentials

Karma HTTP Basic Auth

Enter the credentials defined in the Karma configuration. If the login succeeds, you’ll be redirected to the Karma dashboard

Karma Dashboard Access

C. Essential Alert Management in Action with Karma

Similar to Alertmanager, the Karma dashboard displays only alerts that are firing. For example, we’ll trigger the NodeCPUUsageHigh alert by running a stress test on one machine in the bare workload cluster

sudo apt install stress-ng -y
stress-ng --cpu "$(nproc)" --cpu-method matrixprod --timeout 600s

Wait until the alert is firing. It will then appear in the Karma dashboard

Alert status in Prometheus

Karma Dashboard is showing NodeCPUUsageHigh Alert

We can also configure silences directly from the Karma dashboard instead of using the Alertmanager dashboard. Click the silence icon at the top right of the Karma dashboard, then choose the Alertmanager instance to configure. From there, add the silence attributes such as label matchers, duration, and other relevant options

Configuring silence from Karma Dashboard

As alerts from multiple Alertmanagers are aggregated in the Karma dashboard, we can apply label filters to display only specific alerts

Applying label filter to show only specific alerts in Karma Dashboard

Some tips for working with multiple Alertmanagers aggregated in the Karma dashboard:
- Use consistent labels: Ensure every alert includes an identification label with a clear, uniform naming convention (such as instance or cluster)
- Separate environments clearly: Consistent environment labeling prevents mixing production and staging alerts in the same view
- Assign meaningful severities: Use standardized severity label values like info, critical, high, medium, and low to help prioritize alerts
- Keep labels short and descriptive: Avoid overly long or ambiguous label values for faster filtering and grouping in Karma
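
These conventions are easy to check automatically. Below is a small sketch of a label linter, assuming a convention where every alert must carry instance, cluster, and severity labels. The required label set, the function name lint_alert_labels, and the sample labels are all illustrative assumptions, not something Karma enforces.

```python
ALLOWED_SEVERITIES = {"info", "low", "medium", "high", "critical"}
REQUIRED_LABELS = ("instance", "cluster", "severity")  # assumed team convention


def lint_alert_labels(labels):
    """Return a list of human-readable problems found in one alert's labels."""
    problems = [f"missing label: {name}" for name in REQUIRED_LABELS if name not in labels]
    severity = labels.get("severity")
    if severity is not None and severity not in ALLOWED_SEVERITIES:
        problems.append(f"unknown severity: {severity}")
    return problems


print(lint_alert_labels({"instance": "bare-node01", "cluster": "bare", "severity": "high"}))
print(lint_alert_labels({"instance": "ctr-node01", "severity": "urgent"}))
```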

References:
- https://github.com/prymitive/karma

Author:
Kevin Timoteus Sirait — PT. Boer Technology | Medium | LinkedIn

Build an Interactive OpenStack Compute Node Monitoring System with Prometheus, Grafana, and…

Btech Engineering — Sat, 19 Oct 2024 03:12:13 GMT

Build an Interactive OpenStack Compute Node Monitoring System with Prometheus, Grafana, and Telegram Bot for Real-Time and On-Demand Queries

In this article, we’ll explore how to build an interactive OpenStack compute node monitoring system with Prometheus, Grafana, and a Telegram bot for real-time and on-demand resource queries. This guide will take you through setting up Prometheus to collect metrics, Grafana to visualize data, and a Telegram bot to allow for quick, on-demand resource usage queries. By the end, you’ll have a dynamic monitoring solution that delivers real-time insights and instant, customizable updates directly to your Telegram chat, enhancing your ability to manage OpenStack compute nodes efficiently. Whether you’re responsible for cloud infrastructure or looking to optimize your OpenStack environment, this setup will elevate your monitoring capabilities.

We’ll use this topology:

Lab Topology

The components and prerequisites for this lab include:

OpenStack Cluster (Version 2024.1)
An operational OpenStack cluster is required for this lab. In this instance, we already have a cluster configured with one controller node (os-controller01) and two compute nodes (os-compute01 and os-compute02). Additionally, we need to have some instances running within the cluster.
Prometheus (Version 2.31.2+ds1)
Prometheus must also be installed for this lab. It acts as the scraping agent that collects metrics from the compute nodes
Grafana (Version 11.2.2)
Grafana is used to visualize the collected metrics
Bot (Python3 Version 3.10.12)
For integration with Telegram Bot, we need to create one first. You can create your own telegram bot by contacting @BotFather. A server with Python3 environment installed is also required to host the bot

This lab will use Ubuntu 22.04 Server as operating system for all servers and instances inside the OpenStack Cluster.

Now that all the prerequisites are in place, we can move on to the main configuration.

A. Install node-exporter and libvirt-exporter to Compute Nodes

sudo -i

Install node-exporter via snap. After that, enable and start the service

# Install the node-exporter from snap edge channel
snap install node-exporter --edge

# Enable and start node-exporter service
systemctl enable --now snap.node-exporter.node-exporter

# Verify the node-exporter service status
systemctl status snap.node-exporter.node-exporter

Install libvirt-exporter via snap. Then, grant access from libvirt-exporter to libvirt interface. After that, enable and start the service

# Install the libvirt-exporter from snap stable channel
snap install prometheus-libvirt-exporter

# Grant access for libvirt-exporter to libvirt interface
snap connect prometheus-libvirt-exporter:libvirt

# Enable and start libvirt-exporter service
systemctl enable --now snap.prometheus-libvirt-exporter.daemon 

# Verify the libvirt-exporter service status
systemctl status snap.prometheus-libvirt-exporter.daemon

To verify, we can test to scrape the metrics manually using cURL. These operations should return metrics that are collected by the exporters. Note that node-exporter is running on port 9100 while libvirt-exporter is using port 9177

# Verify that node-exporter is running and collecting metrics
curl http://localhost:9100/metrics

# Verify that libvirt-exporter is running and collecting metrics
curl http://localhost:9177/metrics
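
The output of these endpoints is plain text, one sample per line. As a rough illustration of how to read it, here is a minimal Python sketch that parses such text into a dictionary. It assumes samples carry no trailing timestamp, and the function name parse_metrics and the sample text are made up for this example; it is not a complete parser for the exposition format.

```python
def parse_metrics(text):
    """Map each 'name{labels}' series to its float value, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


# Hypothetical excerpt of node-exporter output
sample = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.52
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""

for series, value in parse_metrics(sample).items():
    print(series, value)
```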

B. Configure Prometheus to Scrape Node and Libvirt Metrics in Prometheus Server

sudo -i

After that, edit the prometheus.yml file. Its location varies with the Prometheus installation method. In this lab, Prometheus is installed using apt, so the file is located at /etc/prometheus/prometheus.yml

nano /etc/prometheus/prometheus.yml

In the scrape_configs section, add a job configuration to scrape the node-exporter and libvirt-exporter metrics from each compute node. Define the job_name and scrape targets. For the target, use the : format. In this lab, we'll use the hostnames of the compute nodes. To achieve this, the Prometheus server is already configured to translate the hostname of each compute node to its IP address (using /etc/hosts)

# Put these in scrape_configs section
# This configuration will scrape node-exporter from os-compute01 and os-compute02
- job_name: 'Node Exporter'
  static_configs:
    - targets: ['os-compute01:9100', 'os-compute02:9100']

# This configuration will scrape libvirt-exporter from os-compute01 and os-compute02
- job_name: 'Libvirt Exporter'
  static_configs:
    - targets: ['os-compute01:9177', 'os-compute02:9177']

Save the configuration, and then restart prometheus service to apply

service prometheus restart

To verify the configuration, open the Prometheus URL in a web browser (http://:9090). Then go to Status > Targets. Make sure the previously added jobs are up

Prometheus ‘Node Exporter’ and ‘Libvirt Exporter’ Jobs Status
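
The same check can be done against the Prometheus targets API (/api/v1/targets), which returns the active targets along with a health field. The sketch below works on an already-decoded response so it can run offline; the function name down_targets and the sample response are assumptions for illustration.

```python
def down_targets(targets_response):
    """Return the scrape URLs of active targets whose health is not 'up'."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [target["scrapeUrl"] for target in active if target.get("health") != "up"]


# Hypothetical, trimmed-down response body from /api/v1/targets
sample_response = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://os-compute01:9100/metrics", "health": "up"},
            {"scrapeUrl": "http://os-compute02:9177/metrics", "health": "down"},
        ]
    },
}

print(down_targets(sample_response))
```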

C. Grafana Configuration

In this part, we’ll configure Grafana to visualize three resource utilization metrics:
- CPU Usage of compute nodes and instances inside of it
- Memory Usage of compute nodes and instances inside of it
- Bandwidth Utilization of compute nodes and instances inside of it
We’ll also install grafana-image-renderer plugin and create a service account with token that we’ll use later.

Add Prometheus as data source to Grafana. Specify the data source connection string to the prometheus URL.

Grafana Data Source Pointing to Prometheus URL

After that, add a new Grafana dashboard, then add three variables to it. The first variable is called host and stores the hostnames of the compute nodes. Set the variable type to Query, the query type to Label values, and the label to nodename.

‘host’ Variable Configuration

The second variable is called domain and stores the domain names of instances retrieved by the Libvirt Exporters. Set the variable type to Query, the query type to Label values, and the label to domain. Also add the label filter instance=$host:9177 so it only queries instances on the selected compute node. Finally, enable the Multi-value and Include All option settings for this variable

‘domain’ Variable Query Configuration

‘domain’ Variable Selected Options

The third variable is called netiface and stores the network device names of compute nodes retrieved by the Node Exporters. Set the variable type to Query, the query type to Label values, and the label to device. Also add the label filter instance=$host:9100 so it only queries devices on the selected compute node.

‘netiface’ Variable Configuration

With all variables in place, we can proceed to add visualization panels. The first panel visualizes the CPU usage of the compute node and all instances inside it (in percent). To retrieve the CPU usage of the compute node, we can use this PromQL:

(1 - avg(irate(node_cpu_seconds_total{mode="idle", instance=~"$host:9100"}[1m])) by (instance)) * 100

For CPU usage of all instances inside the compute node, use this PromQL:

avg by (domain) (rate(libvirt_domain_vcpu_time_seconds_total{instance=~"$host:9177", domain=~"$domain"}[1m]))*100

For the Unit of Measurement, use Percent (0–100)

Host and Instances CPU Usage Visualization Panel Configuration
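
To see what the node-level query computes, here is the same arithmetic as a short Python sketch: take the per-CPU increase of the idle counter over an interval, average it, and subtract it from 1. The function name cpu_usage_percent and the counter values are made up for illustration.

```python
def cpu_usage_percent(idle_before, idle_after, elapsed_seconds):
    """Compute (1 - average idle rate) * 100 from per-CPU idle-second counters."""
    idle_rates = [
        (after - before) / elapsed_seconds
        for before, after in zip(idle_before, idle_after)
    ]
    return (1 - sum(idle_rates) / len(idle_rates)) * 100


# Two CPUs observed 15 seconds apart: one was idle for 12 s, the other for 9 s
usage = cpu_usage_percent([1000.0, 2000.0], [1012.0, 2009.0], 15)
print(round(usage, 2))  # roughly 30 percent busy
```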

The second panel will be visualization of Memory usage of compute node and all instances inside of it (in bytes). To retrieve Memory usage of compute node, we can use this PromQL:

sum by (instance) (avg_over_time(node_memory_MemTotal_bytes{instance=~"$host:9100"}[1m]) - (avg_over_time(node_memory_MemFree_bytes{instance=~"$host:9100"}[1m]) + avg_over_time(node_memory_Cached_bytes{instance=~"$host:9100"}[1m])))

We also need to retrieve the memory total of compute node by using this PromQL:

sum by (instance) (avg_over_time(node_memory_MemTotal_bytes{instance=~"$host:9100"}[1m]))

Use this PromQL to retrieve memory usages of instances inside of the compute node:

sum by (domain) (libvirt_domain_memory_stats_available_bytes{instance=~"$host:9177", domain=~"$domain"} - libvirt_domain_memory_stats_usable_bytes{instance=~"$host:9177", domain=~"$domain"})

For the Unit of Measurement, use bytes(SI)

Host and Instances Memory Usage Visualization Panel Configuration

The third panel will be visualization of Bandwidth usage of compute node and all instances inside of it (in bytes/sec). To retrieve Bandwidth usage of compute node, we can use this PromQL:

avg by (instance) (rate(node_network_receive_bytes_total{instance=~"$host:9100",device=~"$netiface"}[1m]) + rate(node_network_transmit_bytes_total{instance=~"$host:9100",device=~"$netiface"}[1m]))

And use this PromQL to retrieve bandwidth usage of all instances inside of the compute node:

avg by (domain) (rate(libvirt_domain_interface_stats_receive_bytes_total{instance=~"$host:9177", domain=~"$domain"}[1m]) + rate(libvirt_domain_interface_stats_transmit_bytes_total{instance=~"$host:9177", domain=~"$domain"}[1m]))

For the Unit of Measurement, use bytes/sec(SI)

Host and Instances Bandwidth Usage Visualization Panel Configuration

Save everything. We now have a visualization dashboard with 3 variables and 3 panels to monitor the resource usage of compute nodes and the instances inside them

Grafana Dashboard to Monitor Resource Usage of Compute Nodes and Instances

Now, we need to create a service account with token so we can access the visualization panels with our python3 script later. From Grafana, go to Menu > Administration > Users and access > Service accounts. After that, Add a new service account with Viewer role.

Creating a new Viewer Service Account in Grafana

Once it is created, you’ll be redirected to the service account detail page. On this page, add a new service account token. Generate the token and save it for later.

Generated Grafana Service Account Token. Save This Token for Later

Next, we’ll proceed to install grafana-image-renderer plugin. We need this plugin so we can retrieve rendered image of grafana monitoring panel. We’ll do this step from Grafana Server’s terminal

# Login as root 
sudo -i

# Install libgbm1. This package is required by the plugin
apt update && apt install libgbm1 -y

# Install grafana-image-renderer plugin
grafana-cli plugins install grafana-image-renderer

# Set appropriate owner for /var/lib/grafana directory
chown -R grafana:grafana /var/lib/grafana

# Restart grafana-server service
service grafana-server restart

To verify, show the grafana-server status

service grafana-server status

From the output, make sure grafana-image-renderer plugin is loaded in the CGroup section

Grafana Server Status Showing That Plugin is Loaded

D. Create Python3 Script to Query Resource Utilization and Get the Image of Monitoring Panel

Install requests Python3 library

pip3 install requests

Create a new file for the script. For example, here we’ll store the project files in a directory and create a new Python3 file called lab_util.py

mkdir bot_util
mkdir bot_util/images
cd bot_util
touch lab_util.py

Now, we’ll code the script. First, import all needed libraries

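# Import needed libraries
from datetime import datetime, timezone
import requests
# Import the submodule explicitly; a bare "import urllib" does not guarantee urllib.request is loaded
import urllib.request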

After that, we’ll define three global variables

# Prometheus API URL
PROMETHEUS_API_URL = "http://192.168.1.201:9090/api/v1/query"

# Grafana Render base URL
GRAFANA_BASE_URL = "http://192.168.1.203:3000/render/d/ae0kg8ynstzb4f/openstack-resource-monitoring-dashboard"

# Grafana authorization header. Replace  with appropriate Grafana Service Account Token
GRAFANA_HEADERS = [('Authorization', 'Bearer ')]

Here is a brief explanation of each variable:
- PROMETHEUS_API_URL: This variable stores the API endpoint for prometheus query. The format will be http://:/api/v1/query
- GRAFANA_BASE_URL: This variable stores the base render URL for Grafana. Assuming the Grafana Dashboard URL is http://:/d//, add /render before /d/ so it will be http://:/render/d//
- GRAFANA_HEADERS: The authorization header so the script can access the Grafana render URL defined above. The format will be [('Authorization', 'Bearer ')]
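
The render URL rule can be expressed in a few lines of Python. This is only a sketch of the string transformation described above; the helper name to_render_url is hypothetical, and the example URL is the dashboard address used in this lab.

```python
def to_render_url(dashboard_url):
    """Insert '/render' in front of the first '/d/' path segment."""
    head, sep, tail = dashboard_url.partition("/d/")
    if not sep:
        raise ValueError("not a Grafana dashboard URL: missing '/d/'")
    return f"{head}/render/d/{tail}"


dashboard = "http://192.168.1.203:3000/d/ae0kg8ynstzb4f/openstack-resource-monitoring-dashboard"
print(to_render_url(dashboard))
```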

After that, we’ll define a function that retrieves all instances from a specific compute node, then builds and returns a dictionary that maps each libvirt domain to its instance name

def instances_name_map(node_name):
    # Query will be done to libvirt_exporter metrics result
    # We need to append the libvirt_exporter port to the node_name
    libvirt_exporter= node_name + ":9177"

    # Request parameter that will be sent to the Prometheus Query API
    request_params = {
        "query": f'sum by (domain, instance_name) (libvirt_domain_info_meta{{instance=~"{libvirt_exporter}"}})',
        # The Prometheus query API reads the evaluation timestamp from "time" (Unix seconds or RFC 3339)
        'time': datetime.now(timezone.utc).timestamp()
    }
    # Initiate empty dictionary that will store instance name to domain mapping
    instances_mapping = {}
    
    try:
        # Send request to Prometheus Query API and get the result JSON
        instance_map_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()
        #Iterate the JSON result to get the instance name to domain mapping
        for instance in instance_map_json["data"]["result"]:
            # Check if the instance name is defined in the JSON result
            if 'instance_name' in instance['metric']:
                instances_mapping[instance['metric']['domain']] = instance['metric']['instance_name']
            # If not, use the domain name as the instance name
            else:
                instances_mapping[instance['metric']['domain']] = instance['metric']['domain']
        # Return the dictionary with the instance name to domain mapping
        return instances_mapping
    # On any failure, print the error message and exit
    except Exception as e:
        print(f"Error in getting instances name map: {e}")
        exit(1)
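
To make the shape of the Prometheus response easier to follow, here is an offline sketch of the same mapping step applied to a hand-written response. The helper name build_instance_map and the sample payload are assumptions for illustration; only the data/result/metric structure reflects the instant-query response format.

```python
def build_instance_map(query_result):
    """Map each libvirt domain to its instance name, falling back to the domain."""
    mapping = {}
    for series in query_result["data"]["result"]:
        metric = series["metric"]
        mapping[metric["domain"]] = metric.get("instance_name", metric["domain"])
    return mapping


# Hypothetical instant-query response with one named and one unnamed instance
sample_result = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"domain": "instance-00000001", "instance_name": "web01"},
             "value": [1729307533.0, "1"]},
            {"metric": {"domain": "instance-00000002"},
             "value": [1729307533.0, "1"]},
        ],
    },
}

print(build_instance_map(sample_result))
```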

Then, define the cpu_util function to calculate the CPU usage for a specified node by constructing request parameters and querying the Prometheus API for both the node's overall CPU percentage and the top three instances with the highest CPU usage. Format the results into a summary message that includes the node name, overall CPU usage, and the top instances, and return this message from the function.

def cpu_util(node_name):
    # Define the node exporter and libvirt exporter hostname based on the node name
    node_exporter = node_name + ":9100"
    libvirt_exporter= node_name + ":9177"

    # Prepare the request parameters for querying CPU usage from Prometheus
    request_params = {
        "query": f'(1 - avg(irate(node_cpu_seconds_total{{mode="idle", instance=~"{node_exporter}"}}[1m])) by (instance)) * 100',
        # The Prometheus query API reads the evaluation timestamp from "time" (Unix seconds or RFC 3339)
        'time': datetime.now(timezone.utc).timestamp()
    }

    try:
        # Send a GET request to the Prometheus API to retrieve CPU usage data
        node_cpu_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()
        # Extract and round the CPU usage percentage from the response
        node_cpu_percent = round(float(node_cpu_json["data"]["result"][0]["value"][1]), 2)
    except:
        # Handle any exceptions by setting the CPU percentage to a default value
        node_cpu_percent = "[No Value]"
    
    # Initialize a string to hold the CPU usage information for instances
    instances_usage_text = ""
    # Get the mapping of instance names for the given node
    instances_map = instances_name_map(node_name)
    # Update the request parameters to query the top 3 instances by CPU usage
    request_params['query'] = f'topk(3, avg by (domain) (rate(libvirt_domain_vcpu_time_seconds_total{{instance=~"{libvirt_exporter}"}}[1m]))*100)'
    
    try:
        # Send a GET request to the Prometheus API to retrieve CPU usage data for instances
        instances_cpu_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()
        # Check if any results were returned
        if len(instances_cpu_json['data']['result']) == 0:
            instances_usage_text = "[No Value]"  # No instances found
        else:
            # Iterate through the results to format the CPU usage information for each instance
            for i in range(len(instances_cpu_json['data']['result'])):
                instance_cpu_percent = round(float(instances_cpu_json["data"]["result"][i]["value"][1]), 2)
                instance_name = instances_map[instances_cpu_json["data"]["result"][i]["metric"]["domain"]]
                instances_usage_text += f"{i+1}. {instance_name}: {instance_cpu_percent}%\n"
    except:
        # Handle any exceptions by setting the instances usage text to a default value
        instances_usage_text = "[No Value]"
    
    # Prepare the final result message with node name and CPU usage information
    result_message = f"""
Node name: {node_name}
CPU usage: {node_cpu_percent}% out of 100%\n
Top 3 instances with highest CPU usage inside node:
{instances_usage_text}
    """

    return result_message  # Return the formatted result message

Next, define the memory_util function to calculate the memory usage for a specified node by constructing request parameters to query the Prometheus API for both the total and used memory. Format the results into a summary message that includes the node name, memory usage, and the top instances, and return this message from the function.

def memory_util(node_name):
    # Define the node_exporter and libvirt_exporter variables with the appropriate ports for the specified node
    node_exporter = node_name + ":9100"
    libvirt_exporter = node_name + ":9177"

    # Create request parameters to calculate the memory usage by subtracting free, cached, buffers, and reclaimable memory from total memory
    request_params = {
        "query": f'sum by (instance) (avg_over_time(node_memory_MemTotal_bytes{{instance=~"{node_exporter}"}}[1m]) - (avg_over_time(node_memory_MemFree_bytes{{instance=~"{node_exporter}"}}[1m]) + avg_over_time(node_memory_Cached_bytes{{instance=~"{node_exporter}"}}[1m]) + avg_over_time(node_memory_Buffers_bytes{{instance=~"{node_exporter}"}}[1m]) + avg_over_time(node_memory_SReclaimable_bytes{{instance=~"{node_exporter}"}}[1m])))',
        # The Prometheus query API reads the evaluation timestamp from "time" (Unix seconds or RFC 3339)
        'time': datetime.now(timezone.utc).timestamp()
    }

    # Try to fetch the node's memory usage from the Prometheus API and convert it to gigabytes
    try:
        node_memory_usage_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()
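        # Prometheus returns sample values as strings that may contain decimals, so parse them with float()
        node_memory_usage_gb = round(float(node_memory_usage_json["data"]["result"][0]["value"][1]) / 1024 / 1024 / 1024, 2)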
    except:
        node_memory_usage_gb = "[No Value]"

    # Update request parameters to fetch the total memory available for the node
    request_params['query'] = f'sum by (instance) (avg_over_time(node_memory_MemTotal_bytes{{instance=~"{node_exporter}"}}[1m]))'
    # Try to fetch the total memory from the Prometheus API and convert it to gigabytes
    try:
        node_memory_total_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()
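        # Prometheus returns sample values as strings that may contain decimals, so parse them with float()
        node_memory_total_gb = round(float(node_memory_total_json["data"]["result"][0]["value"][1]) / 1024 / 1024 / 1024, 2)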
    except:  
        node_memory_total_gb = "[No Value]"
    
    # Calculate the percentage of memory usage based on the usage and total memory, handling potential division errors
    try:
        node_usage_percent = round(node_memory_usage_gb / node_memory_total_gb * 100, 2)
    except:
        node_usage_percent = "[No Value]"

    # Initialize an empty string to hold the memory usage details for individual instances
    instances_usage_text = ""
    # Get the mapping of instances for the specified node
    instances_map = instances_name_map(node_name)
    # Update request parameters to fetch the top three instances with the highest memory usage
    request_params['query'] = f'topk(3, (libvirt_domain_memory_stats_available_bytes{{instance=~"{libvirt_exporter}"}} - libvirt_domain_memory_stats_usable_bytes{{instance=~"{libvirt_exporter}"}}))'
    # Try to fetch the memory usage for the top instances from the Prometheus API
    try:
        instances_memory_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()
        # Check if any instances were returned and format their memory usage into the instances_usage_text
        if len(instances_memory_json['data']['result']) == 0:
            instances_usage_text = "[No Value]"
        else:
            for i in range(len(instances_memory_json['data']['result'])):
                instance_memory_gb = round(float(instances_memory_json["data"]["result"][i]["value"][1]) / 1024 / 1024 / 1024, 2)
                instance_name = instances_map[instances_memory_json["data"]["result"][i]["metric"]["domain"]]
                instances_usage_text += f"{i + 1}. {instance_name}: {instance_memory_gb} GB\n"
    except Exception as e:
        instances_usage_text = "[No Value]"

    # Construct a result message summarizing the node's memory usage and the top instances
    result_message = f"""
Node name: {node_name}
Memory usage: {node_memory_usage_gb} GB out of {node_memory_total_gb} GB ({node_usage_percent}% usage)\n
Top 3 instances with highest memory usage inside node:
{instances_usage_text}
"""
    
    # Return the formatted result message
    return result_message

Now, define the bandwidth_util function to calculate the bandwidth usage for a specified node and network device by constructing request parameters to query the Prometheus API for both received and transmitted bytes. Format the results into a summary message that includes the node name, bandwidth usage, and the top instances, and return this message from the function.

def bandwidth_util(node_name, net_dev):
    # Define the node_exporter and libvirt_exporter variables with the appropriate ports for the specified node
    node_exporter = node_name + ":9100"
    libvirt_exporter = node_name + ":9177"

    # Create request parameters to calculate the average bandwidth by summing the received and transmitted bytes for the specified network device
    request_params = {
        "query": f'avg by (instance) (rate(node_network_receive_bytes_total{{instance=~"{node_exporter}",device=~"{net_dev}"}}[1m]) + rate(node_network_transmit_bytes_total{{instance=~"{node_exporter}",device=~"{net_dev}"}}[1m]))',
        # The Prometheus query API reads the evaluation timestamp from "time" (Unix seconds or RFC 3339)
        'time': datetime.now(timezone.utc).timestamp()
    }

    # Try to fetch the node's bandwidth usage from the Prometheus API and convert it to kilobytes
    try:
        node_bandwidth_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()
        node_bandwidth_kb = round(float(node_bandwidth_json["data"]["result"][0]["value"][1]) / 1024, 3)
    except:
        node_bandwidth_kb = "[No Value]"
    
    # Initialize an empty string to hold the bandwidth usage details for individual instances
    instances_usage_text = ""
    # Get the mapping of instances for the specified node
    instances_map = instances_name_map(node_name)
    # Update request parameters to fetch the top three instances with the highest bandwidth usage
    request_params['query'] = f'topk(3, avg by (domain) (rate(libvirt_domain_interface_stats_receive_bytes_total{{instance=~"{libvirt_exporter}"}}[1m]) + rate(libvirt_domain_interface_stats_transmit_bytes_total{{instance=~"{libvirt_exporter}"}}[1m])))'
    # Try to fetch the bandwidth usage for the top instances from the Prometheus API
    try:
        instances_bandwidth_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()
        # Check if any instances were returned and format their bandwidth usage into the instances_usage_text
        if len(instances_bandwidth_json['data']['result']) == 0:
            instances_usage_text = "[No Value]"
        else:
            for i in range(len(instances_bandwidth_json['data']['result'])):
                instance_bandwidth_kb = round(float(instances_bandwidth_json["data"]["result"][i]["value"][1]) / 1024, 3)
                instance_name = instances_map[instances_bandwidth_json["data"]["result"][i]["metric"]["domain"]]
                instances_usage_text += f"{i + 1}. {instance_name}: {instance_bandwidth_kb} KB/s\n"
    except Exception as e:
        instances_usage_text = "[No Value]"

    # Construct a result message summarizing the node's bandwidth usage and the top instances
    result_message = f"""
Node name: {node_name}
Bandwidth of {net_dev}: {node_bandwidth_kb} KB/s\n
Top 3 instances with highest bandwidth inside node:
{instances_usage_text}
"""
    
    # Return the formatted result message
    return result_message
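
The three functions above repeat the same loop to format the top instances. As a possible simplification, that loop could be factored into one helper. This is only a sketch of the idea, not part of the original script; the name format_top_instances and its parameters are hypothetical.

```python
def format_top_instances(results, names, unit, scale=1.0, digits=2):
    """Render a ranked list such as '1. web01: 2.0 KB/s' from Prometheus results."""
    if not results:
        return "[No Value]"
    lines = []
    for rank, series in enumerate(results, start=1):
        domain = series["metric"]["domain"]
        # Sample values arrive as strings, so convert before scaling
        value = round(float(series["value"][1]) / scale, digits)
        lines.append(f"{rank}. {names.get(domain, domain)}: {value}{unit}")
    return "\n".join(lines) + "\n"


# Hypothetical query result and name mapping
results = [{"metric": {"domain": "instance-00000001"}, "value": [0, "2048"]}]
names = {"instance-00000001": "web01"}
print(format_top_instances(results, names, unit=" KB/s", scale=1024, digits=3))
```

Using names.get(domain, domain) also avoids a KeyError when a domain is missing from the mapping.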

As the last step in this part, implement the get_grafana_dashboard function, which retrieves and saves an image of a Grafana panel for a given resource type and node name. It builds the query parameters for the Grafana render request and handles errors during image retrieval. The function returns the path to the saved image, or exits the program if an error occurs.
Note that to retrieve the panel number/ID, view the panel and check the Grafana Panel URL, then use the viewPanel parameter value found in the URL. In this lab, the panels for CPU, memory, and bandwidth utilization have Panel IDs 1, 2, and 3, respectively

def get_grafana_dashboard(resource_type, node_name, net_dev=None):
    # Map resource types to their corresponding panel numbers in Grafana
    panel_map = {
        "cpu": 1,
        "memory": 2,
        "bandwidth": 3
    }
    
    # Construct the query parameters for the Grafana API request
    grafana_params = f'orgId=1&from=now-1m&var-host={node_name}&var-domain=All&var-instances=All&viewPanel={panel_map[resource_type]}&width=1366&height=1024&autofitpanels'
    
    # If a specific network device is provided, include it in the parameters
    if net_dev is not None:
        grafana_params += f'&var-netiface={net_dev}'
    
    # Create the full URL for the Grafana image request
    image_url = f'{GRAFANA_BASE_URL}?{grafana_params}'
    
    # Define the name and directory for saving the image
    image_name = f'{node_name}.png'
    image_directory = "images"  # no trailing slash; the path is joined with "/" below
    
    try:
        # Set up a URL opener with custom headers for the Grafana request
        opener = urllib.request.build_opener()
        opener.addheaders = GRAFANA_HEADERS
        urllib.request.install_opener(opener)
        
        # Retrieve the image from the constructed URL and save it to the specified directory
        urllib.request.urlretrieve(image_url, f'{image_directory}/{image_name}')
        
        # Return the path to the saved image
        return f'{image_directory}/{image_name}'
    
    except Exception as e:
        # Print an error message if the image retrieval fails and exit the program
        print(f"Error in retrieving image: {e}")
        exit(1)
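
As an alternative to assembling the query string by hand, the parameters can be kept in a dictionary and encoded with urllib.parse.urlencode, which also escapes unusual characters in host or interface names. This is a sketch rather than a drop-in replacement: build_render_url and PANEL_IDS are hypothetical names, the base URL in the example is a placeholder, and the value-less autofitpanels flag from the original string is left out for brevity.

```python
from urllib.parse import urlencode

# Panel IDs as used in this lab
PANEL_IDS = {"cpu": 1, "memory": 2, "bandwidth": 3}


def build_render_url(base_url, resource_type, node_name, net_dev=None):
    """Build a Grafana render URL for one panel of the dashboard."""
    params = {
        "orgId": 1,
        "from": "now-1m",
        "var-host": node_name,
        "var-domain": "All",
        "viewPanel": PANEL_IDS[resource_type],
        "width": 1366,
        "height": 1024,
    }
    if net_dev is not None:
        params["var-netiface"] = net_dev
    return f"{base_url}?{urlencode(params)}"


# Placeholder base URL for illustration only
print(build_render_url("http://grafana.example:3000/render/d/abc/dash", "bandwidth", "os-compute01", "br-ex"))
```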

Save the python3 script file and proceed to the next part

E. Create Python3 Script to Serve the Telegram Bot

Install pyTelegramBotAPI library

pip3 install pyTelegramBotAPI

Create a new text file that will store help message on how to use the bot

nano bot_util/help.txt

For the content, create a helpful explanation text on how to use the bot. Example:

Usage:  /node_util [host] [resource] [network_ifname]

Parameters:
  host - The node/host name. Make sure it exists in Prometheus.
  resource - Resource type. Either cpu/memory/bandwidth.
  network_ifname - Network interface name to check. Can only be used with the 'bandwidth' resource type.

After that, save the file and proceed to next step
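
The usage rules in the help text can also be validated in code before any query is sent. The sketch below is one way to do that, assuming the interface name is required for bandwidth and rejected otherwise. The function name parse_node_util is hypothetical and is not the handler built later in this part.

```python
VALID_RESOURCES = ("cpu", "memory", "bandwidth")


def parse_node_util(text):
    """Return (host, resource, net_if) or raise ValueError with a reason."""
    parts = text.split()[1:]  # drop the '/node_util' command itself
    if len(parts) not in (2, 3):
        raise ValueError("expected: host resource [network_ifname]")
    host, resource = parts[0], parts[1]
    if resource not in VALID_RESOURCES:
        raise ValueError(f"unknown resource: {resource}")
    net_if = parts[2] if len(parts) == 3 else None
    # Assumption: an interface name must accompany bandwidth, and only bandwidth
    if (resource == "bandwidth") != (net_if is not None):
        raise ValueError("network_ifname is required for, and only valid with, bandwidth")
    return host, resource, net_if


print(parse_node_util("/node_util os-compute01 cpu"))
print(parse_node_util("/node_util os-compute01 bandwidth br-ex"))
```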

Create the Python3 script file

cd bot_util
touch bot.py

Now we’ll proceed to the coding part. First, import needed script and library

# Import needed script and library
import lab_util
import telebot

Next, define BOT_API_TOKEN global variable that will store the Telegram bot token, and create a new Telegram bot instance using that token

# Define the API token for the bot, which is required to authenticate with the Telegram Bot API
# Replace  with appropriate Telegram Bot Token
BOT_API_TOKEN = ""

# Create an instance of the TeleBot class using the provided API token
# This instance will be used to interact with the Telegram Bot API and handle messages
bot = telebot.TeleBot(BOT_API_TOKEN)

After that, create a function that returns the text content of the help.txt file

def help_message():
    # This function attempts to read the content of the 'help.txt' file
    try:
        # Open the 'help.txt' file in read mode
        with open('help.txt', 'r') as file:
            # Read the entire content of the file
            content = file.read()
            # Return the content read from the file
            return content
    except OSError:
        # If an error occurs (e.g., file not found), return a default error message
        return "Can't retrieve help message"

Create message handlers that send the help message when the user sends the /start or /help command to the bot

# Send the help message when the user sends the '/start' command
@bot.message_handler(commands=['start'])
def send_welcome(message):
    bot.reply_to(message, help_message())

# Send the help message when the user sends the '/help' command
@bot.message_handler(commands=['help'])
def send_help(message):
    bot.reply_to(message, help_message())

Next, implement the message handler for the /node_util command and its parameters. The handle_node_util function processes user commands for retrieving node resource utilization metrics. It parses the incoming message to identify the node name and the requested resource type, then calls the matching utility functions to gather the data. Finally, the bot replies with both the summary text and the related Grafana panel image, and falls back to the help message for invalid commands.

@bot.message_handler(commands=['node_util'])
def handle_node_util(message):
    # Split the incoming message text into command parameters
    command_params = message.text.split()

    # Check if there are exactly 3 tokens (the command, node_name, and resource_type)
    if len(command_params) == 3:
        node_name = command_params[1]  # Extract the node name from the parameters
        resource_type = command_params[2]  # Extract the resource type from the parameters
        
        # If the resource type is 'cpu', retrieve CPU utilization and corresponding image
        if resource_type == 'cpu':
            cpu_util_text = lab_util.cpu_util(node_name)  # Get CPU utilization text
            cpu_util_image = lab_util.get_grafana_dashboard(resource_type, node_name)  # Get the Grafana dashboard image for CPU
            bot.reply_to(message, cpu_util_text)  # Send the CPU utilization text back to the user
            with open(cpu_util_image, 'rb') as image:  # Open the image file
                bot.send_photo(message.chat.id, image)  # Send the CPU utilization image to the user
        
        # If the resource type is 'memory', retrieve memory utilization and corresponding image
        elif resource_type == 'memory':
            memory_util_text = lab_util.memory_util(node_name)  # Get memory utilization text
            memory_util_image = lab_util.get_grafana_dashboard(resource_type, node_name)  # Get the Grafana dashboard image for memory
            bot.reply_to(message, memory_util_text)  # Send the memory utilization text back to the user
            with open(memory_util_image, 'rb') as image:  # Open the image file
                bot.send_photo(message.chat.id, image)  # Send the memory utilization image to the user
        
        # If the resource type is not recognized, send the help message
        else:
            bot.reply_to(message, help_message())
    
    # Check if there are exactly 4 tokens (the command, node_name, resource_type, and net_dev)
    elif len(command_params) == 4:
        node_name = command_params[1]  # Extract the node name from the parameters
        resource_type = command_params[2]  # Extract the resource type from the parameters
        net_dev = command_params[3]  # Extract the network device from the parameters
        
        # If the resource type is 'bandwidth', retrieve bandwidth utilization and corresponding image
        if resource_type == 'bandwidth':
            bandwidth_util_text = lab_util.bandwidth_util(node_name, net_dev)  # Get bandwidth utilization text
            bandwidth_util_image = lab_util.get_grafana_dashboard(resource_type, node_name, net_dev=net_dev)  # Get the Grafana dashboard image for bandwidth
            bot.reply_to(message, bandwidth_util_text)  # Send the bandwidth utilization text back to the user
            with open(bandwidth_util_image, 'rb') as image:  # Open the image file
                bot.send_photo(message.chat.id, image)  # Send the bandwidth utilization image to the user
        
        # If the resource type is not recognized, send the help message
        else:
            bot.reply_to(message, help_message())

    # If the number of parameters is not 3 or 4, send the help message
    else:
        bot.reply_to(message, help_message())
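
The parsing rules in the handler above can also be expressed as one small function that is easy to test without a running bot. This is a minimal sketch under that assumption; `parse_node_util` is a hypothetical helper, not part of pyTelegramBotAPI or of the original script.

```python
def parse_node_util(text):
    """Parse '/node_util <host> <resource> [network_ifname]'.

    Returns (node_name, resource_type, net_dev), or None when the
    message does not match the usage described in help.txt.
    """
    tokens = text.split()
    # cpu and memory take no interface argument
    if len(tokens) == 3 and tokens[2] in ("cpu", "memory"):
        return tokens[1], tokens[2], None
    # bandwidth requires the interface name as a fourth token
    if len(tokens) == 4 and tokens[2] == "bandwidth":
        return tokens[1], tokens[2], tokens[3]
    return None

print(parse_node_util("/node_util os-compute01 cpu"))
print(parse_node_util("/node_util os-compute01 bandwidth enp1s0"))
print(parse_node_util("/node_util os-compute01 bandwidth"))
```

A handler could then reply with the help message whenever the function returns None, which keeps the validation in a single place.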

Finally, start the bot polling mechanism to listen for user messages

bot.polling()

F. Wrap-Up and Testing

Node Exporter and Libvirt Exporter have been installed on the compute nodes and are exposing the related metrics
Prometheus has been configured to scrape metrics from the exporters installed on the compute nodes
Grafana has been configured to visualize resource utilization for the compute nodes and the instances running on them. The plugin that renders Grafana panels as images has also been installed, and the required Service Account Token has been created so the Python3 script can retrieve the rendered panel image
Python3 script to retrieve resource utilization and return the result as a human-readable summary text with the related Grafana panel image:

# Import needed libraries
from datetime import datetime, timezone
import requests
import urllib.request

# Prometheus API URL
PROMETHEUS_API_URL = "http://192.168.1.201:9090/api/v1/query"

# Grafana Render base URL
GRAFANA_BASE_URL = "http://192.168.1.203:3000/render/d/ae0kg8ynstzb4f/openstack-resource-monitoring-dashboard"

# Grafana authorization header. Append your Grafana Service Account Token after 'Bearer '
GRAFANA_HEADERS = [('Authorization', 'Bearer ')]

def instances_name_map(node_name):
    # Query will be done to libvirt_exporter metrics result
    # We need to append the libvirt_exporter port to the node_name
    libvirt_exporter= node_name + ":9177"

    # Request parameter that will be sent to the Prometheus Query API
    request_params = {
        "query": f'sum by (domain, instance_name) (libvirt_domain_info_meta{{instance=~"{libvirt_exporter}"}})',
        'time': datetime.now(timezone.utc).timestamp()  # Evaluation time as a Unix timestamp
    }

    # Initiate empty dictionary that will store instance name to domain mapping
    instances_mapping = {}

    
    try:
        # Send request to Prometheus Query API and get the result JSON
        instance_map_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()

        #Iterate the JSON result to get the instance name to domain mapping
        for instance in instance_map_json["data"]["result"]:
            # Check if the instance name is defined in the JSON result
            if 'instance_name' in instance['metric']:
                instances_mapping[instance['metric']['domain']] = instance['metric']['instance_name']

            # If not, use the domain name as the instance name
            else:
                instances_mapping[instance['metric']['domain']] = instance['metric']['domain']

        # Return the dictionary with the instance name to domain mapping
        return instances_mapping

    # On exception, print error message and exit
    except:
        print ("Error in getting instances name map")
        exit(1)

def cpu_util(node_name):
    # Define the node exporter and libvirt exporter hostname based on the node name
    node_exporter = node_name + ":9100"
    libvirt_exporter= node_name + ":9177"

    # Prepare the request parameters for querying CPU usage from Prometheus
    request_params = {
        "query": f'(1 - avg(irate(node_cpu_seconds_total{{mode="idle", instance=~"{node_exporter}"}}[1m])) by (instance)) * 100',
        'time': datetime.now(timezone.utc).timestamp()  # Evaluation time as a Unix timestamp
    }

    try:
        # Send a GET request to the Prometheus API to retrieve CPU usage data
        node_cpu_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()
        # Extract and round the CPU usage percentage from the response
        node_cpu_percent = round(float(node_cpu_json["data"]["result"][0]["value"][1]), 2)
    except:
        # Handle any exceptions by setting the CPU percentage to a default value
        node_cpu_percent = "[No Value]"
    
    # Initialize a string to hold the CPU usage information for instances
    instances_usage_text = ""
    # Get the mapping of instance names for the given node
    instances_map = instances_name_map(node_name)
    # Update the request parameters to query the top 3 instances by CPU usage
    request_params['query'] = f'topk(3, avg by (domain) (rate(libvirt_domain_vcpu_time_seconds_total{{instance=~"{libvirt_exporter}"}}[1m]))*100)'
    
    try:
        # Send a GET request to the Prometheus API to retrieve CPU usage data for instances
        instances_cpu_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()
        # Check if any results were returned
        if len(instances_cpu_json['data']['result']) == 0:
            instances_usage_text = "[No Value]"  # No instances found
        else:
            # Iterate through the results to format the CPU usage information for each instance
            for i in range(len(instances_cpu_json['data']['result'])):
                instance_cpu_percent = round(float(instances_cpu_json["data"]["result"][i]["value"][1]), 2)
                instance_name = instances_map[instances_cpu_json["data"]["result"][i]["metric"]["domain"]]
                instances_usage_text += f"{i+1}. {instance_name}: {instance_cpu_percent}%\n"
    except:
        # Handle any exceptions by setting the instances usage text to a default value
        instances_usage_text = "[No Value]"
    
    # Prepare the final result message with node name and CPU usage information
    result_message = f"""
Node name: {node_name}
CPU usage: {node_cpu_percent}% out of 100%\n
Top 3 instances with highest CPU usage inside node:
{instances_usage_text}
    """

    return result_message  # Return the formatted result message

def memory_util(node_name):
    # Define the node_exporter and libvirt_exporter variables with the appropriate ports for the specified node
    node_exporter = node_name + ":9100"
    libvirt_exporter = node_name + ":9177"

    # Create request parameters to calculate the memory usage by subtracting free, cached, buffers, and reclaimable memory from total memory
    request_params = {
        "query": f'sum by (instance) (avg_over_time(node_memory_MemTotal_bytes{{instance=~"{node_exporter}"}}[1m]) - (avg_over_time(node_memory_MemFree_bytes{{instance=~"{node_exporter}"}}[1m]) + avg_over_time(node_memory_Cached_bytes{{instance=~"{node_exporter}"}}[1m]) + avg_over_time(node_memory_Buffers_bytes{{instance=~"{node_exporter}"}}[1m]) + avg_over_time(node_memory_SReclaimable_bytes{{instance=~"{node_exporter}"}}[1m])))',
        'time': datetime.now(timezone.utc).timestamp()  # Evaluation time as a Unix timestamp
    }

    # Try to fetch the node's memory usage from the Prometheus API and convert it to gigabytes
    try:
        node_memory_usage_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()
        node_memory_usage_gb = round(float(node_memory_usage_json["data"]["result"][0]["value"][1]) / 1024 / 1024 / 1024, 2)
    except:
        node_memory_usage_gb = "[No Value]"

    # Update request parameters to fetch the total memory available for the node
    request_params['query'] = f'sum by (instance) (avg_over_time(node_memory_MemTotal_bytes{{instance=~"{node_exporter}"}}[1m]))'
    # Try to fetch the total memory from the Prometheus API and convert it to gigabytes
    try:
        node_memory_total_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()
        node_memory_total_gb = round(float(node_memory_total_json["data"]["result"][0]["value"][1]) / 1024 / 1024 / 1024, 2)
    except:  
        node_memory_total_gb = "[No Value]"
    
    # Calculate the percentage of memory usage based on the usage and total memory, handling potential division errors
    try:
        node_usage_percent = round(node_memory_usage_gb / node_memory_total_gb * 100, 2)
    except:
        node_usage_percent = "[No Value]"

    # Initialize an empty string to hold the memory usage details for individual instances
    instances_usage_text = ""
    # Get the mapping of instances for the specified node
    instances_map = instances_name_map(node_name)
    # Update request parameters to fetch the top three instances with the highest memory usage
    request_params['query'] = f'topk(3, (libvirt_domain_memory_stats_available_bytes{{instance=~"{libvirt_exporter}"}} - libvirt_domain_memory_stats_usable_bytes{{instance=~"{libvirt_exporter}"}}))'
    # Try to fetch the memory usage for the top instances from the Prometheus API
    try:
        instances_memory_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()
        # Check if any instances were returned and format their memory usage into the instances_usage_text
        if len(instances_memory_json['data']['result']) == 0:
            instances_usage_text = "[No Value]"
        else:
            for i in range(len(instances_memory_json['data']['result'])):
                instance_memory_gb = round(float(instances_memory_json["data"]["result"][i]["value"][1]) / 1024 / 1024 / 1024, 2)
                instance_name = instances_map[instances_memory_json["data"]["result"][i]["metric"]["domain"]]
                instances_usage_text += f"{i + 1}. {instance_name}: {instance_memory_gb} GB\n"
    except Exception as e:
        instances_usage_text = "[No Value]"

    # Construct a result message summarizing the node's memory usage and the top instances
    result_message = f"""
Node name: {node_name}
Memory usage: {node_memory_usage_gb} GB out of {node_memory_total_gb} GB ({node_usage_percent}% usage)\n
Top 3 instances with highest memory usage inside node:
{instances_usage_text}
"""
    
    # Return the formatted result message
    return result_message

def bandwidth_util(node_name, net_dev):
    # Define the node_exporter and libvirt_exporter variables with the appropriate ports for the specified node
    node_exporter = node_name + ":9100"
    libvirt_exporter = node_name + ":9177"

    # Create request parameters to calculate the average bandwidth by summing the received and transmitted bytes for the specified network device
    request_params = {
        "query": f'avg by (instance) (rate(node_network_receive_bytes_total{{instance=~"{node_exporter}",device=~"{net_dev}"}}[1m]) + rate(node_network_transmit_bytes_total{{instance=~"{node_exporter}",device=~"{net_dev}"}}[1m]))',
        'time': datetime.now(timezone.utc).timestamp()  # Evaluation time as a Unix timestamp
    }

    # Try to fetch the node's bandwidth usage from the Prometheus API and convert it to kilobytes
    try:
        node_bandwidth_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()
        node_bandwidth_kb = round(float(node_bandwidth_json["data"]["result"][0]["value"][1]) / 1024, 3)
    except:
        node_bandwidth_kb = "[No Value]"
    
    # Initialize an empty string to hold the bandwidth usage details for individual instances
    instances_usage_text = ""
    # Get the mapping of instances for the specified node
    instances_map = instances_name_map(node_name)
    # Update request parameters to fetch the top three instances with the highest bandwidth usage
    request_params['query'] = f'topk(3, avg by (domain) (rate(libvirt_domain_interface_stats_receive_bytes_total{{instance=~"{libvirt_exporter}"}}[1m]) + rate(libvirt_domain_interface_stats_transmit_bytes_total{{instance=~"{libvirt_exporter}"}}[1m])))'
    # Try to fetch the bandwidth usage for the top instances from the Prometheus API
    try:
        instances_bandwidth_json = requests.get(PROMETHEUS_API_URL, params=request_params).json()
        # Check if any instances were returned and format their bandwidth usage into the instances_usage_text
        if len(instances_bandwidth_json['data']['result']) == 0:
            instances_usage_text = "[No Value]"
        else:
            for i in range(len(instances_bandwidth_json['data']['result'])):
                instance_bandwidth_kb = round(float(instances_bandwidth_json["data"]["result"][i]["value"][1]) / 1024, 3)
                instance_name = instances_map[instances_bandwidth_json["data"]["result"][i]["metric"]["domain"]]
                instances_usage_text += f"{i + 1}. {instance_name}: {instance_bandwidth_kb} KB/s\n"
    except Exception as e:
        instances_usage_text = "[No Value]"

    # Construct a result message summarizing the node's bandwidth usage and the top instances
    result_message = f"""
Node name: {node_name}
Bandwidth of {net_dev}: {node_bandwidth_kb} KB/s\n
Top 3 instances with highest bandwidth inside node:
{instances_usage_text}
"""
    
    # Return the formatted result message
    return result_message

def get_grafana_dashboard(resource_type, node_name, net_dev=None):
    # Map resource types to their corresponding panel numbers in Grafana
    panel_map = {
        "cpu": 1,
        "memory": 2,
        "bandwidth": 3
    }
    
    # Construct the query parameters for the Grafana API request
    grafana_params = f'orgId=1&from=now-1m&var-host={node_name}&var-domain=All&var-instances=All&viewPanel={panel_map[resource_type]}&width=1366&height=1024&autofitpanels'
    
    # If a specific network device is provided, include it in the parameters
    if net_dev is not None:
        grafana_params += f'&var-netiface={net_dev}'
    
    # Create the full URL for the Grafana image request
    image_url = f'{GRAFANA_BASE_URL}?{grafana_params}'
    
    # Define the name and directory for saving the image
    image_name = f'{node_name}.png'
    image_directory = "images"
    
    try:
        # Set up a URL opener with custom headers for the Grafana request
        opener = urllib.request.build_opener()
        opener.addheaders = GRAFANA_HEADERS
        urllib.request.install_opener(opener)
        
        # Retrieve the image from the constructed URL and save it to the specified directory
        urllib.request.urlretrieve(image_url, f'{image_directory}/{image_name}')
        
        # Return the path to the saved image
        return f'{image_directory}/{image_name}'
    
    except Exception as e:
        # Print an error message if the image retrieval fails and exit the program
        print(f"Error in retrieving image: {e}")
        exit(1)
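
Every utility function above reads the result with the same `["data"]["result"][0]["value"][1]` lookup. Prometheus returns sample values as strings, and those strings can use decimal or exponent notation, so converting with `float()` is safer than `int()`. The following is a minimal sketch of that extraction as a reusable helper; `first_sample_value` and `bytes_to_gib` are hypothetical names, and the sample responses are made-up data in the shape of an instant-query result.

```python
def first_sample_value(response_json):
    """Return the first sample of an instant-query response as a float, or None."""
    try:
        # Each vector element looks like {"metric": {...}, "value": [timestamp, "123.4"]}
        return float(response_json["data"]["result"][0]["value"][1])
    except (KeyError, IndexError, TypeError, ValueError):
        return None

def bytes_to_gib(value):
    """Convert a byte count to GiB, rounded to two decimals."""
    return round(value / 1024 ** 3, 2)

# Made-up responses for illustration only
sample = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"instance": "os-compute01:9100"},
     "value": [1727840000.0, "8.589934592e+09"]}]}}
empty = {"status": "success", "data": {"resultType": "vector", "result": []}}

print(bytes_to_gib(first_sample_value(sample)))  # 8.0
print(first_sample_value(empty))                 # None
```

Returning None instead of a "[No Value]" string lets the caller decide how to format a missing value, and avoids arithmetic on a string later on.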

Python3 script to serve the Telegram bot:

# Import needed script and library
import lab_util
import telebot

# Define the API token for the bot, which is required to authenticate with the Telegram Bot API
# Replace the empty string below with your Telegram Bot Token
BOT_API_TOKEN = ""

# Create an instance of the TeleBot class using the provided API token
# This instance will be used to interact with the Telegram Bot API and handle messages
bot = telebot.TeleBot(BOT_API_TOKEN)

def help_message():
    # This function attempts to read the content of the 'help.txt' file
    try:
        # Open the 'help.txt' file in read mode
        with open('help.txt', 'r') as file:
            # Read the entire content of the file
            content = file.read()
            # Return the content read from the file
            return content
    except OSError:
        # If an error occurs (e.g., file not found), return a default error message
        return "Can't retrieve help message"

# Send the help message when the user sends the '/start' command
@bot.message_handler(commands=['start'])
def send_welcome(message):
    bot.reply_to(message, help_message())

# Send the help message when the user sends the '/help' command
@bot.message_handler(commands=['help'])
def send_help(message):
    bot.reply_to(message, help_message())

@bot.message_handler(commands=['node_util'])
def handle_node_util(message):
    # Split the incoming message text into command parameters
    command_params = message.text.split()

    # Check if there are exactly 3 tokens (the command, node_name, and resource_type)
    if len(command_params) == 3:
        node_name = command_params[1]  # Extract the node name from the parameters
        resource_type = command_params[2]  # Extract the resource type from the parameters
        
        # If the resource type is 'cpu', retrieve CPU utilization and corresponding image
        if resource_type == 'cpu':
            cpu_util_text = lab_util.cpu_util(node_name)  # Get CPU utilization text
            cpu_util_image = lab_util.get_grafana_dashboard(resource_type, node_name)  # Get the Grafana dashboard image for CPU
            bot.reply_to(message, cpu_util_text)  # Send the CPU utilization text back to the user
            with open(cpu_util_image, 'rb') as image:  # Open the image file
                bot.send_photo(message.chat.id, image)  # Send the CPU utilization image to the user
        
        # If the resource type is 'memory', retrieve memory utilization and corresponding image
        elif resource_type == 'memory':
            memory_util_text = lab_util.memory_util(node_name)  # Get memory utilization text
            memory_util_image = lab_util.get_grafana_dashboard(resource_type, node_name)  # Get the Grafana dashboard image for memory
            bot.reply_to(message, memory_util_text)  # Send the memory utilization text back to the user
            with open(memory_util_image, 'rb') as image:  # Open the image file
                bot.send_photo(message.chat.id, image)  # Send the memory utilization image to the user
        
        # If the resource type is not recognized, send the help message
        else:
            bot.reply_to(message, help_message())
    
    # Check if there are exactly 4 tokens (the command, node_name, resource_type, and net_dev)
    elif len(command_params) == 4:
        node_name = command_params[1]  # Extract the node name from the parameters
        resource_type = command_params[2]  # Extract the resource type from the parameters
        net_dev = command_params[3]  # Extract the network device from the parameters
        
        # If the resource type is 'bandwidth', retrieve bandwidth utilization and corresponding image
        if resource_type == 'bandwidth':
            bandwidth_util_text = lab_util.bandwidth_util(node_name, net_dev)  # Get bandwidth utilization text
            bandwidth_util_image = lab_util.get_grafana_dashboard(resource_type, node_name, net_dev=net_dev)  # Get the Grafana dashboard image for bandwidth
            bot.reply_to(message, bandwidth_util_text)  # Send the bandwidth utilization text back to the user
            with open(bandwidth_util_image, 'rb') as image:  # Open the image file
                bot.send_photo(message.chat.id, image)  # Send the bandwidth utilization image to the user
        
        # If the resource type is not recognized, send the help message
        else:
            bot.reply_to(message, help_message())

    # If the number of parameters is not 3 or 4, send the help message
    else:
        bot.reply_to(message, help_message())

bot.polling()

To test the Telegram bot, run the bot script

cd bot_util
python3 bot.py

Open Telegram and send a command to the bot as a message. For example, first we send the /start and /help commands to show the help message

Bot replying with Help Message

Then, we send a command to query the CPU utilization of the os-compute01 node and retrieve the top 3 instances with the highest CPU utilization on it

/node_util os-compute01 cpu

The bot sends the expected result along with the CPU utilization Grafana panel image

CPU utilization query result from the Bot

Next, we send a command to query the memory utilization of the os-compute01 node and retrieve the top 3 instances with the highest memory utilization on it

/node_util os-compute01 memory

The bot sends the expected result along with the memory utilization Grafana panel image

Memory utilization query result from the Bot

Last, we send a command to query the bandwidth utilization of the enp1s0 network interface on the os-compute01 node and retrieve the top 3 instances with the highest bandwidth utilization on it. Note that the per-instance bandwidth figures are not specific to a network interface

/node_util os-compute01 bandwidth enp1s0

The bot sends the expected result along with the bandwidth utilization Grafana panel image

Bandwidth utilization query result from the Bot

Author:
Kevin Timoteus Sirait — PT. Boer Technology | Medium | LinkedIn

Deploy Monitoring Server Using Zabbix With External PostgreSQL-16

Btech Engineering — Wed, 02 Oct 2024 03:20:47 GMT

In this scenario, the administrator wants to monitor the database server and router core and report monthly. The administrator planned to add a monitoring server, and Zabbix was chosen as the monitoring system.

Topology

Lab Specification:

Ubuntu Server 24.04 LTS (Zabbix only supports LTS versions)
Zabbix Server 7.0
Zabbix Agent
PostgreSQL-16
NGINX
SNMP

Lab Requirement:

Zabbix, Ubuntu Server 24.04 LTS, PostgreSQL-16, snmpwalk, zabbix agent.

Lab Technical Configuration

PostgreSQL As Database Server

> Installing PostgreSQL-16

Adding PostgreSQL-16 repository

First of all, add the PostgreSQL repository to your system and update the package index.

apt install curl ca-certificates
install -d /usr/share/postgresql-common/pgdg
curl -o /usr/share/postgresql-common/pgdg/apt.postgresql.org.asc --fail https://www.postgresql.org/media/keys/ACCC4CF8.asc
sudo sh -c 'echo "deb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.asc] https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
apt update

Installing PostgreSQL-16 Packages

Okay, now start installing the database on your system.

sudo apt -y install postgresql-16

> Creating Zabbix Database, User, And Password

Login into PostgreSQL-16 System

After the installation finishes, log in to the database system as the postgres user and open the psql shell:

su - postgres
psql

Creating user database

Create a user and a database, then make that user the owner of the database:

CREATE USER zabbix PASSWORD 'password';
CREATE DATABASE zabbix;
ALTER DATABASE zabbix OWNER TO zabbix;
\l
\q

> Exposing PostgreSQL To Allow Zabbix Access

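Configure PostgreSQL main config

By default, PostgreSQL listens only on localhost, so you need to change listen_addresses before remote hosts such as the Zabbix server can connect. The value is a list of addresses on the database server itself; '*' listens on all interfaces. Client restrictions in CIDR form, like 192.168.0.0/24 (for a network range) or 192.168.0.1/32 (for a single host), belong in pg_hba.conf, not here. For production, I recommend listening only on the address Zabbix will use and allowing only a specific range or single host in pg_hba.conf.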

nano /etc/postgresql/16/main/postgresql.conf
---
listen_addresses = '*'

Configure expose rule

pg_hba.conf is a second layer of access control: each rule states which client addresses may connect, to which database, as which user, and with which authentication method.

For this lab, I will allow all addresses, users, and databases. Again, for production, I recommend listing only the users, databases, and client addresses that need access.

nano /etc/postgresql/16/main/pg_hba.conf
---
host    all             all             0.0.0.0/0               md5
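
To see how much wider 0.0.0.0/0 is than the production-style values mentioned earlier, the address column can be checked with Python's standard `ipaddress` module. This is only a sketch of the CIDR matching on the address field; real pg_hba.conf matching also considers the connection type, database, and user, and `client_allowed` is a hypothetical helper name.

```python
import ipaddress

def client_allowed(client_ip, cidr):
    """True when client_ip falls inside the given CIDR range."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(cidr)

print(client_allowed("10.1.2.3", "0.0.0.0/0"))            # True
print(client_allowed("10.1.2.3", "192.168.0.0/24"))       # False
print(client_allowed("192.168.0.57", "192.168.0.0/24"))   # True
print(client_allowed("192.168.0.2", "192.168.0.1/32"))    # False
```

With 0.0.0.0/0 every IPv4 client matches the rule, so password authentication is the only remaining barrier.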

> Applying & Checking PostgreSQL Services

After configuration, you need to restart and check the service to make sure the service is okay.

systemctl restart postgresql
systemctl status postgresql

Zabbix Server As Monitoring Server

1. Installing Zabbix Server 7.0

> Adding Zabbix Repository

Add the Zabbix repository to the system and update the package index.

wget https://repo.zabbix.com/zabbix/7.0/ubuntu/pool/main/z/zabbix-release/zabbix-release_7.0-2+ubuntu24.04_all.deb
dpkg -i zabbix-release_7.0-2+ubuntu24.04_all.deb
apt update

> Installing Zabbix Packages And PostgreSQL Client

Installing Zabbix Server, Frontend, and Agent.

apt -y install zabbix-server-pgsql zabbix-frontend-php php8.3-pgsql zabbix-nginx-conf zabbix-sql-scripts zabbix-agent

Installing PostgreSQL Client On Zabbix Server.

apt install -y postgresql-client-common postgresql-client-16

> Importing Zabbix Server Database Script Into PostgreSQL

Import Zabbix Database From Script To Database Server

Import the initial Zabbix schema and data into the database server using the SQL script shipped with the Zabbix installation. Replace ip_server with the IP address of your database server.

zcat /usr/share/zabbix-sql-scripts/postgresql/server.sql.gz | psql -h ip_server -U zabbix -d zabbix

Checking zabbix database content

After the import, connect to the database again and check that the Zabbix tables exist and have content.

psql -h 192.168.11.99 -U zabbix -d zabbix
\c zabbix
\dt

2. Configure Zabbix Server & Connecting Into PostgreSQL

> Connecting Zabbix To PostgreSQL

Configure the Zabbix server to connect to the database.

nano /etc/zabbix/zabbix_server.conf
---
DBHost=ip_server
DBName=zabbix
DBUser=zabbix
DBPassword=password
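
A commented-out or empty database setting is an easy mistake to make in this file. As a minimal sketch, assuming the file uses the plain Key=Value format with # comments shown above, the following checks that the four settings are present; `parse_zabbix_conf` and `missing_db_keys` are hypothetical helpers, and the example text is illustrative only.

```python
REQUIRED_KEYS = ("DBHost", "DBName", "DBUser", "DBPassword")

def parse_zabbix_conf(text):
    """Parse 'Key=Value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings

def missing_db_keys(settings):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not settings.get(key)]

# Illustrative content; in practice read /etc/zabbix/zabbix_server.conf
example = """
# DBHost=localhost
DBHost=192.168.11.99
DBName=zabbix
DBUser=zabbix
# DBPassword=
"""
print(missing_db_keys(parse_zabbix_conf(example)))  # ['DBPassword']
```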

> Configure NGINX For Zabbix Dashboard

Configure nginx: remove the default site and link the nginx configuration shipped with Zabbix into the sites-enabled folder.

cd /etc/nginx/sites-enabled
rm default
ln -s /etc/zabbix/nginx.conf zabbix.conf

> Starting Zabbix Server Services

After configuring, restart the Zabbix, nginx, and PHP services and enable them at boot.

systemctl restart zabbix-server zabbix-agent nginx php8.3-fpm
systemctl enable zabbix-server zabbix-agent nginx php8.3-fpm

> Accessing & Configure Zabbix From Website

Configuring Zabbix Server

After restarting the service, go to the website and configure it.

http://your_ip_server

Accessing Zabbix Server

Username: Admin Password: zabbix

Author:

Muhammad Huda Fiqri — PT.Boer Technology

Reference:

Zabbix Server Installation Documentation
PostgreSQL Installation Documentation

Enhancing Kubernetes Observability with Pixie

Btech Engineering — Wed, 17 Jul 2024 10:04:43 GMT

Introduction

In the dynamic world of Kubernetes, observability is crucial. Pixie, an open-source observability tool, allows developers to monitor their Kubernetes applications seamlessly.

Leveraging eBPF, Pixie captures telemetry data automatically, eliminating the need for manual instrumentation. This post delves into Pixie’s features, components, advantages, and setup guide to help you get started with this powerful tool.

What is Pixie?

Pixie provides real-time visibility into your Kubernetes clusters. Pixie covers everything from high-level overviews of service maps and application traffic to detailed insights like pod states and flame graphs. The core concept involves Kubernetes clusters sending metrics via Vizier to Pixie Cloud, which processes the data through the Pixie API and generates insightful dashboard views.

Key Components

Pixie Edge Module (PEM): Pixie’s agent, installed per node, uses eBPF to collect data, which is stored locally on the node.

Vizier: Installed per cluster, Vizier is responsible for query execution and managing PEMs.

Pixie Cloud: Handles user management, authentication, and data proxying.

Pixie CLI: Deploys Pixie, runs queries, and manages resources like API keys.

Pixie Client API: This provides programmatic access to Pixie for integrations, Slackbots, and custom user logic that require Pixie data.

Pixie vs. Other Cloud-Native Monitoring Tools

Setup Pixie Server

Hardware Requirements

CPU: x86-64 architecture, >4 cores
Memory: 16 GB for workers

Note: Pixie uses Elastic, requesting a limit of 700m vCPU and 8 GB RAM (excluding other pods).

Software Prerequisites

Kubernetes Cluster v1.21+
MetalLB
Default StorageClass
mkcert
kustomize

Step 1: Clone Pixie Repository

git clone https://github.com/pixie-io/pixie.git
cd pixie

Step 2: Switch to Production-Ready Branch/Tag

export LATEST_CLOUD_RELEASE=$(git tag | grep 'release/cloud' | sort -r | head -n 1 | awk -F/ '{print $NF}')
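
One caveat with the pipeline above: `sort -r` compares tags as text, so a tag ending in v0.1.10 would sort below one ending in v0.1.9. As a minimal sketch of a numeric comparison, assuming the release tags follow a `release/cloud/vX.Y.Z` pattern, the version could be picked like this. The function name and the sample tag list are made up for illustration.

```python
import re

# Matches only final releases such as release/cloud/v0.1.10
TAG_RE = re.compile(r"^release/cloud/v(\d+)\.(\d+)\.(\d+)$")

def latest_cloud_release(tags):
    """Return the highest 'vX.Y.Z' among release/cloud tags, or None."""
    versions = []
    for tag in tags:
        match = TAG_RE.match(tag)
        if match:
            # Compare as integer tuples so 10 > 9
            versions.append(tuple(int(part) for part in match.groups()))
    if not versions:
        return None
    return "v{}.{}.{}".format(*max(versions))

# Made-up sample input in the shape of `git tag` output
sample_tags = [
    "release/cloud/v0.1.9",
    "release/cloud/v0.1.10",
    "release/cli/v0.8.2",
    "release/cloud/v0.1.10-pre-r1",
]
print(latest_cloud_release(sample_tags))  # v0.1.10
```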

Step 3: Update Image Version in Kustomization

perl -pi -e "s|newTag: latest|newTag: \"${LATEST_CLOUD_RELEASE}\"|g" k8s/cloud/public/kustomization.yaml

Step 4 (Optional): Change the default domain and use CA

Modify the following files if you wish to change the default domain or use a CA-authorized certificate:

k8s/cloud/public/proxy_envoy.yaml
k8s/cloud/public/domain_config.yaml
scripts/create_cloud_secrets.sh

Install the mkcert local CA into the system trust store:

mkcert -install

Step 5: Create Namespace

kubectl create namespace plc

Step 6: Create Secrets File

./scripts/create_cloud_secrets.sh

Step 7: Deploy Elastic and Postgres

kustomize build k8s/cloud_deps/base/elastic/operator | kubectl apply -f -

kustomize build k8s/cloud_deps/public | kubectl apply -f -

Step 8: Deploy Pixie Labs

kustomize build k8s/cloud/public/ | kubectl apply -f -

Step 9: Check Pods in plc Namespace

kubectl -n plc get pods

Result

Step 10: Check the External IP of Service and Setup Hosts

kubectl -n plc get svc

Add the IP and domain to `/etc/hosts`:

192.168.x.x dev.withpixie.dev work.dev.withpixie.dev

Step 11: Access the Pixie Site

Open `work.dev.withpixie.dev` or your custom domain.

Step 12: Login

Use the default credentials: admin@default.com and password: admin.

Note: Refer to Pixie Documentation for more details.

Setting Up Pixie Client

Hardware Requirements

Free Memory: ≥ 6GB RAM

Software Prerequisites

Kubernetes Cluster
Pixie Server in plc namespace

Step 1: Install Pixie CLI

bash -c "$(curl -fsSL https://work.dev.withpixie.dev/install.sh)"

Step 2: Set PL_CLOUD_ADDR

export PL_CLOUD_ADDR=dev.withpixie.dev

Step 3: Authenticate Pixie Client

px auth login

Step 4: Deploy Pixie Client

px deploy --dev_cloud_namespace plc --pem_memory_limit=1Gi

Note: The minimum pem_memory_limit is 1GB.

Step 5: Open Pixie Website

Once all pods are running, the data will be automatically displayed and updated.

Monitoring

Network Monitoring

With Pixie, you can monitor network traffic using the `px/net_flow_graph` script. This script displays connectivity between pods and clusters, showing data sent and received.

Key Metrics:

FROM_ENTITY: Source pod
TO_ENTITY: Target pod/service
BYTES_SENT: Total data sent
BYTES_RECV: Total data received
BYTES_TOTAL: Sum of sent and received data
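
To make the relationship between these columns concrete, here is a small sketch that appends BYTES_TOTAL to each row of a CSV. The CSV layout, the helper name `add_bytes_total`, and the sample row are assumptions for illustration, not Pixie's actual export format.

```shell
# Append a BYTES_TOTAL column (BYTES_SENT + BYTES_RECV) to a CSV whose
# columns are FROM_ENTITY,TO_ENTITY,BYTES_SENT,BYTES_RECV (hypothetical layout).
add_bytes_total() {
  awk -F, 'NR == 1 { print $0 ",BYTES_TOTAL"; next }
           { print $0 "," ($3 + $4) }'
}

# Demo with one made-up row.
printf '%s\n' \
  'FROM_ENTITY,TO_ENTITY,BYTES_SENT,BYTES_RECV' \
  'sock-shop/front-end,sock-shop/catalogue,1200,3400' \
  | add_bytes_total
```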

Infrastructure Health Monitoring

Monitor resource usage of nodes and pods using the `px/nodes` script. This script provides insights into CPU usage, network traffic, and data traffic.

Detailed Node Monitoring

You can view detailed usage grouped by pod, namespace, or service by clicking on a node.

Service Performance Monitoring

Monitor service performance using the `px/service` script. It displays HTTP performance metrics and traffic data.

Traffic Insights

The script also shows the traffic sent to the service and the request paths.

Database Query Profiling

Pixie supports database performance monitoring for MySQL and PostgreSQL. Use the `px/mysql_stats` script for database usage graphs and `px/mysql_data` for query monitoring.

Key Metrics:

Source and destination of database calls
Executed queries
Query performance

Example of Monitoring Database MySQL

Example of Monitoring Database MySQL for each query

Request Tracing

Pixie can trace HTTP requests using the `px/http_data_filtered` script. This script monitors request speed, data size, and latency based on the requested path.

Example of Request Tracing HTTP on sock-shop/catalogue service

Conclusion

Pixie offers a comprehensive solution for Kubernetes observability, providing detailed insights into your clusters’ performance and health. With its easy deployment, powerful monitoring capabilities, and user-friendly interface, Pixie is an invaluable tool for any Kubernetes environment.

Explore Pixie and enhance your Kubernetes observability today! For more information, visit the Pixie documentation.

Author:

Farih Nazihullah, DevSecOps Specialist | LinkedIn

Choosing the Right Machine Type on Google Cloud Platform (GCP): Comparison, Benchmark, and Cost

Btech Engineering — Fri, 12 Jul 2024 03:47:05 GMT

In this fast-paced digital era, choosing the right cloud infrastructure is an important decision for every organization. Google Cloud Platform (GCP) offers various machine types that can be customized to suit your workload needs. However, with so many options available, how can you choose the most suitable machine type?

This article will discuss three main aspects when choosing a machine type in GCP: comparison, benchmark, and cost. Through comparisons, we will see the key differences between the various machine types offered by GCP. Benchmarking will help us understand the performance of each machine type based on specific workloads. Lastly, cost analysis will allow us to consider budget efficiency in the long run.

By understanding these three factors, you will have a comprehensive guide to make better decisions in selecting machine types at GCP, to optimize performance and cost for your organization’s specific needs. Let’s start by looking at the comparisons between the machine types available at GCP.

Generation Table

2nd Generation (N2, N2D, C2, C2D)

N2 (Intel Cascade Lake)

The N2 series is suitable for workloads that utilize high clock frequencies, providing higher performance per thread. This benefits applications that require high responsiveness and fast computing.
Designed to deliver high CPU performance.
Ideal for workloads that require maximum computing power, such as high-traffic web applications, game servers, or complex data analysis.

N2D (AMD EPYC Rome)

N2D VMs are designed to provide a more affordable option without significantly compromising performance.
Suitable for workloads such as application development, testing, or development environments that do not require maximum CPU power.

C2

C2 VMs run on 2nd generation Intel Xeon Scalable processors (Cascade Lake) that offer a sustained single-core maximum turbo frequency of up to 3.9 GHz. C2 offers VMs with 4 to 60 vCPUs and 4 GB of memory per vCPU.
The C2 machine series provides full transparency into the underlying server platform architecture, so you can improve performance. The machines in this series offer significantly more computing power and are generally more robust for compute-intensive workloads compared to the N1 high CPU machines.
More suitable for applications that require high turbo frequency and good responsiveness. Ideal for compute-intensive workloads that require large computing power.

C2D (AMD EPYC Milan)

C2D VMs run on 3rd generation AMD EPYC Milan processors and offer increased frequency up to 3.5 GHz. C2D VMs are flexible in size, ranging from 2 to 112 vCPUs with 2 to 8 GB of memory per vCPU.
The C2D machine series provides the largest VM size and is best suited for high-performance computing (HPC). The C2D series also has the largest last-level cache (LLC) per available core.
More suitable for high-performance computing (HPC) and applications that require multiple vCPUs with higher VM size and configuration flexibility. Supports larger memory per vCPU and has the largest last-level cache (LLC) per core.
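
As a quick worked example of the vCPU and memory ratios quoted above, the sketch below multiplies the largest vCPU counts by the stated GB-per-vCPU figures. The helper name `vm_memory_gb` is hypothetical, and the results are simple arithmetic on the numbers in this article, not values read from GCP.

```shell
# Memory (GB) = vCPUs x GB-per-vCPU, using the ratios quoted in the text.
vm_memory_gb() {
  echo $(( $1 * $2 ))
}

echo "C2,  60 vCPUs at 4 GB/vCPU:  $(vm_memory_gb 60 4) GB"    # 240
echo "C2D, 112 vCPUs at 2 GB/vCPU: $(vm_memory_gb 112 2) GB"   # 224
echo "C2D, 112 vCPUs at 8 GB/vCPU: $(vm_memory_gb 112 8) GB"   # 896
```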

Performance

{N, C}2: Provides stronger single-thread performance, suitable for applications that require fast computation and high responsiveness.
{N, C}2D: Provides better multi-threaded performance with more vCPUs and memory, ideal for applications with high CPU intensity and high-performance computing workloads.

3rd Generation (C3, C3D)

C3 (Intel Sapphire Rapids)

C3 VMs are powered by 4th generation Intel Xeon Scalable processors (codenamed Sapphire Rapids), DDR5 memory, and Titanium. The C3 machine type is optimized for the underlying NUMA architecture to provide optimal, reliable, and consistent performance.

C3D (AMD EPYC Genoa)

C3D VMs are powered by 4th generation AMD EPYC™ (Genoa) processors with a maximum frequency of 3.7 GHz. The C3D machine type is optimized for the underlying hardware architecture to deliver optimal, reliable, and consistent performance.
The high CPU configuration offers the lowest price per performance for compute-bound workloads that do not require large amounts of memory.

Comparison Tables Benchmark

Recommendation Budget

Summary

The C series has limited regional availability and does not support custom specifications.
Best Performance CPU: For applications that require high-performance CPUs, the C3D series with 4th generation AMD EPYC™ (Genoa) processors offers optimal performance at the best price-per-performance for workloads that require a lot of CPU and little memory.
Web Server Applications: If your web server application is more CPU-intensive and requires high responsiveness, choose the C3 (Intel Sapphire Rapids) or C3D (AMD Genoa) series VM. However, if you need cost efficiency with a high number of vCPUs, choose N2D (AMD EPYC Rome).

Unlocking the Future of Container Management: Discover the Power of RKE2

Btech Engineering — Mon, 08 Jul 2024 08:43:23 GMT

Have you ever explored tools beyond traditional Kubernetes for managing your containers? Meet RKE2, an innovative solution for orchestrating containerized applications within Kubernetes clusters.

RKE2, also known as RKE Government or Rancher Kubernetes Engine 2, brings enhanced security, simplicity, and efficiency to your container management needs.

Comparison with RKE1 & K3s

RKE2 combines the best of both worlds from the 1.x version of RKE (hereafter referred to as RKE1) and K3s.

From K3s, it inherits the usability, ease-of-operations, and deployment model.

From RKE1, it inherits close alignment with upstream Kubernetes. In some places, K3s has diverged from upstream Kubernetes to optimize for edge deployments, but RKE1 and RKE2 can stay closely aligned with upstream.

Importantly, RKE2 does not rely on Docker as RKE1 does. RKE1 leveraged Docker for deploying and managing the control plane components and the container runtime for Kubernetes. RKE2 launches control plane components as static pods, managed by the kubelet. The embedded container runtime is containerd.

Key features of RKE2

Ease-of-operations
Data Center Optimization
Scalability

Table of contents

Topology of Lab
Deploying new RKE2 Cluster
Accessing & Knowing more
Managing RKE2 Certificate
Integrating with external storage (NFS)
Create a sample container app
Backup & Restore Cluster
Rancher UI & How it Works with RKE2

Let's get into the Lab!

Topology of Lab

The topology consists of three RKE2 Server nodes and one NFS server, which we will use later for the external storage scenario. Why three RKE2 Servers? RKE2 has two node types, Server and Agent. A node installed as an RKE2 Server acts as a controller/master node, while a node installed as an RKE2 Agent acts as a worker node. Today's lab shows how easily a highly available Kubernetes topology can be deployed with RKE2, in line with its "ease of operations" key feature. We will not test how it handles failures here; maybe in another post.

Deploying New RKE2 Cluster

Pre-Installation

1. Map each node's IP address to its hostname

cat << EOF >> /etc/hosts
172.16.90.10 rke2-node1 rke2-node1.btech.id
172.16.90.20 rke2-node2 rke2-node2.btech.id
172.16.90.30 rke2-node3 rke2-node3.btech.id
EOF
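
Because the heredoc above appends with `>>`, running it twice leaves duplicate entries. Below is a minimal sketch of an idempotent alternative. The helper name `add_host_entry` is hypothetical, and the demo writes to a temporary file rather than /etc/hosts.

```shell
# Append "IP NAME..." to a hosts-style file only if no line for that IP exists.
add_host_entry() {
  local hosts_file=$1 ip=$2
  shift 2
  # awk exits 0 when a line whose first field equals the IP is found.
  if ! awk -v ip="$ip" '$1 == ip { found = 1 } END { exit !found }' \
       "$hosts_file" 2>/dev/null; then
    printf '%s %s\n' "$ip" "$*" >> "$hosts_file"
  fi
}

# Demo on a temporary file: the second call is a no-op.
demo_hosts=$(mktemp)
add_host_entry "$demo_hosts" 172.16.90.10 rke2-node1 rke2-node1.btech.id
add_host_entry "$demo_hosts" 172.16.90.10 rke2-node1 rke2-node1.btech.id
cat "$demo_hosts"
```

Pointing the first argument at /etc/hosts (as root) would apply the same logic to the real file.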

2. Make sure each node can reach the others over passwordless SSH

# run this on every node
ssh-keygen -t rsa

# run these on every node
ssh-copy-id rke2-node1
ssh-copy-id rke2-node2
ssh-copy-id rke2-node3

3. Disable AppArmor, firewalld, and swap to avoid conflicts and incompatibilities while installing RKE2

systemctl disable apparmor.service
systemctl disable firewalld.service
systemctl stop apparmor.service
systemctl stop firewalld.service

systemctl disable swap.target
swapoff -a
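
Note that `swapoff -a` only lasts until the next reboot. If the nodes define swap in /etc/fstab (an assumption about this lab), commenting those entries out keeps swap off permanently. The sketch below assumes GNU `sed`; the helper name `comment_out_swap` is hypothetical and the demo edits a temporary copy, not the real fstab.

```shell
# Comment out active swap entries in an fstab-style file (GNU sed -i assumed).
# Lines that already start with '#' are left untouched.
comment_out_swap() {
  sed -i -E '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/#/' "$1"
}

# Demo on a temporary file with made-up entries.
demo_fstab=$(mktemp)
printf '%s\n' \
  'UUID=aaaa / ext4 defaults 0 1' \
  '/swap.img none swap sw 0 0' > "$demo_fstab"
comment_out_swap "$demo_fstab"
cat "$demo_fstab"
```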

Installing & Configuring RKE2 Server

1. Download and install RKE2 Server

curl -sfL https://get.rke2.io | sh -

2. Configure RKE2 Server on the first node so that the other nodes can join the cluster. The service takes about 2 minutes to come up, depending on your server specification.

mkdir -p /etc/rancher/rke2
nano /etc/rancher/rke2/config.yaml

---
token: my-shared-secret-token
tls-san:
  - rke2-node1.btech.id
  - rke2-node2.btech.id
  - rke2-node3.btech.id

3. Enable & verify service on the first node

systemctl enable --now rke2-server.service
journalctl -xe

4. Verify the cluster nodes. Only the first node appears, because the second and third nodes have not been configured to join the cluster yet. We will do that in the next step.

5. Configure the second and third nodes to join the cluster, then verify

mkdir -p /etc/rancher/rke2
nano /etc/rancher/rke2/config.yaml

server: https://rke2-node1.btech.id:9345
token: my-shared-secret-token
tls-san:
  - rke2-node1.btech.id
  - rke2-node2.btech.id
  - rke2-node3.btech.id

# then start the service on the second and third nodes
systemctl enable --now rke2-server.service
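
The first-node and joining-node config files differ only in the `server:` line. As a sketch, a small helper can print either variant so that the file does not have to be typed by hand in nano. The function name `render_rke2_config` and its argument order are hypothetical; the keys and values are the ones used in this lab.

```shell
# Print an RKE2 config.yaml to stdout.
# Usage: render_rke2_config TOKEN SERVER_URL SAN...
# Pass an empty SERVER_URL ("") for the first node, which has nothing to join.
render_rke2_config() {
  local token=$1 server=$2 san
  shift 2
  if [ -n "$server" ]; then
    printf 'server: %s\n' "$server"
  fi
  printf 'token: %s\n' "$token"
  printf 'tls-san:\n'
  for san in "$@"; do
    printf '  - %s\n' "$san"
  done
}

# Demo: the joining-node variant from this lab.
render_rke2_config my-shared-secret-token https://rke2-node1.btech.id:9345 \
  rke2-node1.btech.id rke2-node2.btech.id rke2-node3.btech.id
```

Redirecting the output to /etc/rancher/rke2/config.yaml would write the file in one step.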

6. After configuring the second and third servers, verify the members of the RKE2 cluster again. See the difference?
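
For a scripted version of this check, the sketch below counts nodes whose STATUS column is exactly `Ready` in the output of `kubectl get nodes --no-headers`. The helper name `count_ready_nodes` and the sample rows (including the version strings) are hypothetical.

```shell
# Count nodes whose STATUS column is exactly "Ready".
# Input: the output of `kubectl get nodes --no-headers` on stdin.
# Nodes reported as e.g. "Ready,SchedulingDisabled" are not counted.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Demo with made-up rows instead of a live cluster.
printf '%s\n' \
  'rke2-node1   Ready      control-plane,etcd,master   12m   v1.x.y' \
  'rke2-node2   Ready      control-plane,etcd,master   3m    v1.x.y' \
  'rke2-node3   NotReady   control-plane,etcd,master   1m    v1.x.y' \
  | count_ready_nodes
```

On a node with kubectl configured, `kubectl get nodes --no-headers | count_ready_nodes` should eventually report 3 for this lab.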

Accessing & Knowing more

Accessing the cluster

1. After installation, you can access the RKE2 cluster by exporting the KUBECONFIG environment variable, which points to the kubeconfig that RKE2 stores in /etc/rancher/rke2/rke2.yaml by default. RKE2 also ships kubectl, so you only have to copy the binary to a directory on your PATH.

# copy the kubectl binary to an executable path
cp /var/lib/rancher/rke2/bin/kubectl /usr/local/bin

# add the environment variables kubectl needs to reach the RKE2 cluster, and put the rke2 binaries on PATH

export PATH=$PATH:/opt/rke2/bin:/var/lib/rancher/rke2/bin
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml

Verify cluster component status & client version

2. By default, the ingress controller and the network component are installed with RKE2 Server, so you don't have to install them yourself. The network component installed is Canal, which, put simply, combines two CNI networking projects (Flannel and Calico), making it more powerful!

kubectl get pods -A | grep ingress
kubectl get pods -A | grep canal

Managing RKE2 Certificate

1. You can check the certificate expiration time and also rotate the certificates manually

#for checking
rke2 certificate check

2. Before renewal, the certificate expires on 25 June 2025, 14:05 WIB (after converting to UTC+7)

3. Let's try to rotate/renew

#execute on all nodes
#for renewing
systemctl stop rke2-server 
rke2 certificate rotate
systemctl start rke2-server

4. Verification: as you can see in the picture below, the certificate now expires on 30 June 2025, 21:00 WIB (after converting to UTC+7)
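
The UTC to WIB conversion mentioned in these steps can be scripted. This is a minimal sketch assuming GNU `date` (for `-d`). The helper name `utc_to_wib` is hypothetical, and the input 07:05 UTC is derived from the 14:05 WIB value quoted above, not read from the tool's output.

```shell
# Convert a UTC timestamp to WIB (UTC+7). Assumes GNU date for the -d option.
# "WIB-7" is a POSIX TZ string: zone name WIB, seven hours ahead of UTC.
utc_to_wib() {
  TZ='WIB-7' date -d "$1 UTC" '+%Y-%m-%d %H:%M %Z'
}

utc_to_wib '2025-06-25 07:05'   # prints: 2025-06-25 14:05 WIB
```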

Integrating with external storage (NFS)

1. Install nfs-server

apt -y install nfs-kernel-server

2. Create shared folder

mkdir -p /data/nfs-share
chmod 777 /data/nfs-share

3. Configure nfs-server to share folder

nano /etc/exports
---
/data/nfs-share 172.16.90.0/24(rw,sync,no_subtree_check,no_root_squash)

4. Restart services

sudo exportfs -av
systemctl restart nfs-server
systemctl status nfs-server

5. Then test the mount and create a file from one of the nodes (rke2-node-1)

apt install nfs-common -y
mkdir -p /mnt/nfs_clientshare
sudo mount -t nfs 192.168.100.10:/data/nfs-share /mnt/nfs_clientshare
touch /mnt/nfs_clientshare/index.php

6. When nfs is successfully configured, let's deploy the nfs provisioner on the controller node (rke2-node-1), which enables you to use nfs external storage easily for your application.

# make sure nfs-common is installed on all Kubernetes nodes
apt install nfs-common -y
# install Helm first if you have not done so already
curl https://baltocdn.com/helm/signing.asc | sudo apt-key add -
sudo apt-get install apt-transport-https --yes
echo "deb https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update
sudo apt-get install helm

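7. Deploy nfs provisioner using helm

# add the chart repository first so that helm can resolve the chart name
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm repo update

helm install nfs-cluster-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
    --set nfs.server=192.168.100.10 \
    --set nfs.path=/data/nfs-share \
    --set storageClass.name=nfs-cluster-client \
    --set storageClass.provisionerName=cluster.local/nfs-cluster-provisioner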

8. Make sure the nfs provisioner pods are running. (Here we use the default namespace; it is better to use a dedicated one.)

9. Create PVC (Persistent Volume Claim)

# pvc
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: sample-nfs-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: nfs-cluster-client
  resources:
    requests:
      storage: 10Gi

10. Apply it and verify. As the picture below shows, the Persistent Volume is created automatically. That is the advantage of using the nfs provisioner: you don't have to define a PV first. You can also see a default-blabla folder in our shared nfs folder. This folder stores the files and content created by our pods when they consume the NFS share volume.

That's all for the nfs provisioner. Next, we will create an application to check that the nfs-server works with it.

Create a sample container app

1. Create a Deployment that will consume a volume from the external NFS server.

# deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: web-nginx
  name: nfs-nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-nginx
  template:
    metadata:
      labels:
        app: web-nginx
    spec:
      volumes:
        - name: nfs-nginx
          persistentVolumeClaim:
            claimName: sample-nfs-pvc
      containers:
        - image: nginx
          name: nginx
          volumeMounts:
            - name: nfs-nginx
              mountPath: /usr/share/nginx/html

2. Create the Service

apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: web-nginx
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80

3. Create an Ingress to expose the service outside the cluster under a domain name.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sample-ingress
spec:
  ingressClassName: nginx
  rules:
    - host: rke2-test.btech.id
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: nginx-service
                port:
                  number: 80

4. For verification, open a shell in one of the nginx containers we created earlier, then create a file in /usr/share/nginx/html and check that it appears in the shared folder on the NFS server.

That's all for testing our application, which consumes the external nfs-server shared filesystem. It works as expected, so we can change the content of the nfs share folder dynamically without disturbing the containers.

Backup & Restore Cluster

Rancher provides functionality to back up our cluster, allowing us to restore it in the event of an incident. So, let's get into it!

Prerequisite

The new cluster is similar to the old cluster.
The new cluster nodes are prepared (SSH keys shared, system updated, new domains mapped in /etc/hosts).

1. We have three new servers similar to the old ones. Here are their domains.

#from old node
172.16.90.10 new-rke2-node-1.btech.id new-rke2-test.btech.id
172.16.90.20 new-rke2-node-2.btech.id 
172.16.90.30 new-rke2-node-3.btech.id

2. To back up the old cluster, these are the folders and files we have to save.

- /var/lib/rancher/rke2/server/cred
- /var/lib/rancher/rke2/server/tls
- /var/lib/rancher/rke2/server/token
- /etc/rancher/*
- Snapshot (/var/lib/rancher/rke2/server/db/snapshots)

3. Before backing up those folders, let's create a snapshot.

rke2 etcd-snapshot save --name backup-restore-rke2-testing

mkdir -p /root/backup

cp /var/lib/rancher/rke2/server/db/snapshots/backup-restore-rke2-testing-rke2-node-1-1720013739 /root/backup/
cp /var/lib/rancher/rke2/server/token /root/backup/
cp -r /etc/rancher /root/backup/
cp -r /var/lib/rancher/rke2/server/cred /root/backup/
cp -r /var/lib/rancher/rke2/server/tls /root/backup/
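
The snapshot file name above ends in a Unix timestamp, so it changes with every run. As a sketch, the helper below picks the newest snapshot for a given name instead of hardcoding the suffix. It assumes file names of the form NAME-NODE-TIMESTAMP, as in this lab; the function name `latest_snapshot` and the second timestamp in the demo are hypothetical.

```shell
# Print the path of the snapshot in DIR whose name starts with NAME-
# and has the largest trailing numeric timestamp.
# Returns non-zero when no matching snapshot exists.
latest_snapshot() {
  local dir=$1 name=$2 f ts best='' best_ts=0
  for f in "$dir/$name"-*; do
    [ -e "$f" ] || continue
    ts=${f##*-}                                   # text after the last '-'
    case $ts in ''|*[!0-9]*) continue ;; esac     # skip non-numeric suffixes
    if [ "$ts" -gt "$best_ts" ]; then
      best=$f
      best_ts=$ts
    fi
  done
  [ -n "$best" ] && printf '%s\n' "$best"
}

# Demo in a temporary directory with two made-up snapshot files.
demo_dir=$(mktemp -d)
touch "$demo_dir/backup-restore-rke2-testing-rke2-node-1-1720013739" \
      "$demo_dir/backup-restore-rke2-testing-rke2-node-1-1720020000"
latest_snapshot "$demo_dir" backup-restore-rke2-testing
```

In the lab, `cp "$(latest_snapshot /var/lib/rancher/rke2/server/db/snapshots backup-restore-rke2-testing)" /root/backup/` would copy whichever snapshot is newest.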

4. Then transfer /root/backup from the old node to the new node

#from old node
scp -r /root/backup root@172.16.90.11:/root/

5. Then restore the configuration files we backed up earlier

#from new node
mkdir -p /var/lib/rancher/rke2/server
mkdir -p /etc/rancher

cp -r /root/backup/rancher/* /etc/rancher/
cp -r /root/backup/token /var/lib/rancher/rke2/server/token
cp -r /root/backup/cred /var/lib/rancher/rke2/server/
cp -r /root/backup/tls /var/lib/rancher/rke2/server/

6. Next, we will install the RKE2 cluster from the old backup. First, install RKE2 Server on new-rke2-node-1.

nano /etc/rancher/rke2/config.yaml
---
token: my-shared-secret
tls-san:
  - new-rke2-node-1.btech.id
  - new-rke2-node-2.btech.id
  - new-rke2-node-3.btech.id
---
#save & exit ->

curl -sfL https://get.rke2.io | sh -

7. Stop rke2 server & restore backup data.

systemctl stop rke2-server
rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=/root/backup/backup-restore-rke2-testing-rke2-node-1-1720013739

8. Start the service after the restoration is done

systemctl start rke2-server

9. On the other new nodes, follow the instructions below:

curl -sfL https://get.rke2.io | sh -

nano /etc/rancher/rke2/config.yaml
---
server: https://new-rke2-node-1.btech.id:9345
token: my-shared-secret
tls-san:
  - new-rke2-node-1.btech.id
  - new-rke2-node-2.btech.id
  - new-rke2-node-3.btech.id
  
systemctl enable --now rke2-server.service

10. After the rke2 service is running on all the new servers, verify it using the following commands.

cp /var/lib/rancher/rke2/bin/kubectl /usr/local/bin
export PATH=$PATH:/opt/rke2/bin:/var/lib/rancher/rke2/bin
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
kubectl get componentstatuses
kubectl get nodes

11. Let's take a look at our application. You can see that the domain of ingress is still using the old one, so we have to change it.

kubectl edit ingress sample-ingress

before

after
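
The same host change can be made without an interactive editor by using a JSON Patch. The sketch below only builds and prints the patch document; the helper name `ingress_host_patch` is hypothetical, and the target host is taken from the new hosts mapping earlier in this section.

```shell
# Build a JSON Patch that replaces the host of the first Ingress rule.
ingress_host_patch() {
  printf '[{"op":"replace","path":"/spec/rules/0/host","value":"%s"}]\n' "$1"
}

patch=$(ingress_host_patch new-rke2-test.btech.id)
echo "$patch"

# Against a live cluster, the patch would be applied with:
#   kubectl patch ingress sample-ingress --type=json -p "$patch"
```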

12. Verify at the end. Perfect!

So we have backed up and restored our RKE2 cluster and, as you can see, the application still works properly. We had to adjust a few things, but this gives us a guarantee of server availability when an incident occurs.

Rancher UI & How it Works with RKE2

Rancher is a comprehensive container management platform for production environments, enabling organizations to run Kubernetes clusters efficiently.

It simplifies Kubernetes deployment and management, meets IT requirements, and empowers DevOps teams. You can import your existing Kubernetes cluster running internally or using a cloud provider and manage it in one platform with an intuitive user interface.

Let’s take a look at how it works with RKE2.

Prerequisite

Kubernetes cluster
Ingress Controller
Helm tools
Cert manager (Installed in Kubernetes)

1. We will use the new cluster from the last restore, so install Helm first.

curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
sudo apt-get install apt-transport-https --yes
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update
sudo apt-get install helm

2. Install cert-manager

#install cert manager

# If you have installed the CRDs manually instead of with the `--set installCRDs=true` option added to your Helm install command, you should upgrade your CRD resources before upgrading the Helm chart:
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download//cert-manager.crds.yaml

# Add the Jetstack Helm repository
helm repo add jetstack https://charts.jetstack.io

# Update your local Helm chart repository cache
helm repo update

# Install the cert-manager Helm chart
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true


kubectl get pods --namespace cert-manager

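3. Deploy rancher management/UI

# add the Rancher chart repository and create the namespace first
helm repo add rancher-stable https://releases.rancher.com/server-charts/stable
kubectl create namespace cattle-system

helm install rancher rancher-stable/rancher \
  --namespace cattle-system \
  --set hostname=rancher-rke2-test.btech.id \
  --set bootstrapPassword=btechid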

kubectl get pods -n cattle-system

4. Access the web dashboard:

5. You can manage your existing cluster from here and operate it easily.

6. But we only have one existing cluster. How do we import an external cluster? Go to the home dashboard -> Import

7. Give it a name. You can also specify the users that are allowed to operate this cluster, but in this case we will leave the defaults.

8. Then you'll see the picture below. Copy the second command. The first command always fails for us with a certificate authority error because we're using a self-signed certificate, so copy the second one and paste it on the controller node of your existing cluster.

9. Check the agent pods created in the cattle-system namespace. If a pod is in an Error or CrashLoopBackOff state and shows the following error log, do the following steps.

10. Edit deployment of cattle agent

kubectl edit deployment -n cattle-system cattle-cluster-agent

11. Save it and you'll see it running now.

12. Then you can return to the Rancher management UI and see that you can now easily manage your imported cluster from the UI.

With this guide, you can master Kubernetes with RKE2 and Rancher. You now have the tools to streamline your container operations, from setting up your cluster to integrating storage, managing certificates, and more. Dive into the world of Kubernetes with confidence, and enjoy the simplicity and power that Rancher brings to your DevOps journey.

Happy containerizing, folks!

See you in the next topic. Ciao!

Marvin — PT. Boer Technology.

Bonus

In March 2024, the Btech team embarked on an exciting journey with one of Indonesia’s prominent state-owned enterprises (BUMN). Their mission is to revolutionize the enterprise’s development lifecycle. With the implementation of Rancher management and an RKE2 cluster, Btech showcased its cutting-edge expertise, laying the foundation for a more efficient and scalable infrastructure. This bold move signifies a significant step towards modernizing the company’s IT operations, embracing the future of technology with open arms.

The Btech team is implementing a detailed plan to transition from a traditional development process to a modern DevSecOps environment using GitLab CI. This plan is not just a strategy; it represents a commitment to being agile, secure, and continuously innovative. The youthful and forward-thinking Btech team is leading this transformation with a clear vision and unwavering dedication to thriving in the constantly changing digital landscape.

Migrating Virtual Machines from VMware to AWS using AWS Application Migration Service

Btech Engineering — Wed, 26 Jun 2024 02:58:47 GMT

Many companies prioritize migrating virtual machine workloads from on-premises environments, such as VMware, to cloud environments to take advantage of cloud computing's benefits, including scalability, flexibility, and cost efficiency.

However, this migration process can be very complex and time-consuming. Fortunately, one solution is available: AWS Application Migration Service.

AWS Application Migration Service (AWS MGN) is a highly automated lift-and-shift (rehost) solution that simplifies physical, virtual, and cloud migration to AWS without compatibility issues, performance disruption, or long cutover windows. AWS MGN uses "Agents" to replicate source resources to AWS, installed directly on the VMs to be migrated.

Setup VM on VMware

Step 1 — VM to be migrated

Step 2 — Example of data and service for migration verification

Migrate VM to AWS

Step 1 — Go to AWS MGN

Step 2 — Create an IAM user with AWS MGN permissions

Step 3 — Create an access key for the user you created

Step 4 — Add source server on AWS MGN

Step 5 — Run command on VMware VM

Step 6 — Verify source server

Step 7 — Monitor the progress of the replication

Step 8 — Launch a test instance

Step 9 — Mark test instances as ready for cutover

Step 10 — Launch a cutover

Step 11 — Initiate cutover and finalize migration

Verify Migration

Step 1 — Allocate Elastic IP to migrated VM

Step 2 — Verify data and service

Migrating virtual machines from VMware to AWS using AWS Application Migration Service offers a range of benefits, including a more straightforward migration process, minimal disruption, and improved operational efficiency.

It not only eases the migration process but also ensures virtual machine workloads run safely and efficiently on AWS. If you are considering migrating virtual machines from an on-premises environment to AWS, AWS Application Migration Service is a solution you can use.

By Muhammad Alfian Tirta, CX Team Btech

Our Tagline

# Together is Better & Continuous Learning