<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Caleb Epelle on Medium]]></title>
        <description><![CDATA[Stories by Caleb Epelle on Medium]]></description>
        <link>https://medium.com/@calebepelle5?source=rss-46c7ffd9b1b2------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*FcLB8tee8YZ7lTEc11cghA.jpeg</url>
            <title>Stories by Caleb Epelle on Medium</title>
            <link>https://medium.com/@calebepelle5?source=rss-46c7ffd9b1b2------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 16 May 2026 14:53:07 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@calebepelle5/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Predictive Maintenance on Cement Kiln Machinery]]></title>
            <link>https://medium.com/@calebepelle5/predictive-maintenance-on-kiln-refractory-lining-ead7af86fade?source=rss-46c7ffd9b1b2------2</link>
            <guid isPermaLink="false">https://medium.com/p/ead7af86fade</guid>
            <dc:creator><![CDATA[Caleb Epelle]]></dc:creator>
            <pubDate>Mon, 26 Jan 2026 12:55:22 GMT</pubDate>
            <atom:updated>2026-02-02T19:44:13.780Z</atom:updated>
            <content:encoded><![CDATA[<p>In 2026, I started a job at a Cement Manufacturing Company, and I saw how simple failures of critical equipment could cause financial losses and product losses.</p><p>So I decided to build a system that could predict failures before they actually occur in order to save cost and valuable products.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/460/1*9kaW-vsFDylhyl-undrjZw.jpeg" /></figure><p>I tested multiple models on this project, but settled on 2.</p><p>By using these models, I believe manufacturers can be better positioned to maintain their equipment before they develop critical problems, as they would have a sort of “foresight”.</p><p>I was able to achieve the following results and identify and rank the most influential variables (features) involved in machinery failure.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/633/1*tHajhyefCg3fSXx0rxdLOw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/636/1*vy-0fs3jrN6ODbAV3Va2Hw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/634/1*8lFJebc2B_lmfhzjdj_KCQ.png" /><figcaption>Performance Metrics of Random Forest Model</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/635/1*xC082rXtxAbwboI8g6vSlQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/632/1*rYM8XDmlLduOPfXiQSXLGg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/632/1*5Vx9HQ7FSQnlmWosCvySdg.png" /><figcaption>Performance Metrics of XGBoost Model</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/602/1*AhyKfkvyjNgpMeMgK3BbAQ.png" /><figcaption>Performance Metrics of Random Forest &amp; XGBoost Models</figcaption></figure><p>With this, Plant Managers, Inspectors, and Maintenance Personnel would have insight into the <strong>“health status” </strong>of their equipment and machines.</p><p>****************************************************************************</p><p>Now for the technical bit.</p><p>I downloaded the data from Kaggle</p><p>It contained 10,000 observations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1008/1*HZKG25mPPJ7KWxnvyF4lUQ.png" /><figcaption>Dataset</figcaption></figure><p>I imported all necessary libraries and loaded the dataset.</p><pre>import pandas as pd<br>import numpy as np<br>import matplotlib.pyplot as plt<br>import seaborn as sns<br>from imblearn.over_sampling import SMOTE<br>from sklearn.ensemble import RandomForestClassifier<br>import xgboost as xgb<br>from tensorflow.keras.models import Sequential<br>from tensorflow.keras.layers import Dense, LSTM, Dropout<br>from tensorflow.keras.optimizers import Adam<br>from sklearn.model_selection import train_test_split<br>from sklearn.preprocessing import StandardScaler<br>from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve<br><br><br>df = pd.read_csv(&quot;C:/Users/USER/Desktop/port/predictive_maintenance.csv&quot;)<br>print(&quot;DATA LOADED&quot;)<br>print(df.head())<br>print(df)</pre><p>I cleaned the data, converted numbers that were strings into numbers, dropped irrelevant features and encoded features for easy processing.</p><pre>df = df.iloc[:, 0].str.split(&quot;\t&quot;, expand=True)<br><br>df.columns = [<br>    &quot;UDI&quot;,<br>    &quot;Product ID&quot;,<br>    &quot;Type&quot;,<br>    &quot;Air temperature K&quot;,<br>    &quot;Process temperature K&quot;,<br>    &quot;Rotational speed 
rpm&quot;,<br>    &quot;Torque Nm&quot;,<br>    &quot;Tool wear min&quot;,<br>    &quot;Target&quot;,<br>    &quot;Failure Type&quot;<br>]<br><br>df[&quot;Failure Type&quot;] = df[&quot;Failure Type&quot;].str.replace(&quot;...&quot;, &quot;&quot;, regex=False)<br>print(&quot;CLEANED DF&quot;)<br>print(df.head())<br><br>df = df.drop(columns=[&quot;Failure Type&quot;, &quot;UDI&quot;, &quot;Product ID&quot;])<br>print(&quot;DROPPED COLUMNS DF&quot;)<br>print(df.head())<br>print(df.describe())<br>print(df.value_counts())<br><br>df[&quot;Type&quot;], uniques = pd.factorize(df[&quot;Type&quot;])<br>print(&quot;ENCODED COLUMNS DF&quot;)<br>print(df.head())</pre><p>Then I made sure that both the features and the labels were numeric, split the data into training and testing sets, scaled the features for the neural networks, and oversampled the training data with SMOTE so that both classes were equally represented.</p><pre>x = df.drop(columns=[&quot;Target&quot;])<br>print(&quot;FEATURES&quot;)<br>print(x)<br><br>y = df[&quot;Target&quot;]<br>print(&quot;TARGET&quot;)<br>print(y)<br></pre><pre># Ensure x is numeric<br>for col in x.columns:<br>    x[col] = pd.to_numeric(x[col], errors=&#39;coerce&#39;)  # converts invalid values to NaN<br><br># Ensure y is numeric<br>y = y.astype(int)</pre><pre># Splitting data<br>x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)<br><br># Scaling for neural networks and LSTM<br>scaler = StandardScaler()<br>x_train_scaled = scaler.fit_transform(x_train)<br>x_test_scaled = scaler.transform(x_test)</pre><pre># Oversampling for uniformity<br>smote = SMOTE(random_state=42)<br><br>print(&quot;Original y_train distribution:&quot;)<br>print(y_train.value_counts())<br><br># Apply SMOTE to training data only<br>x_train, y_train = smote.fit_resample(x_train, y_train)<br><br># Check the new distribution<br>print(&quot;Resampled y_train distribution:&quot;)<br>print(pd.Series(y_train).value_counts())</pre><p>I used 4 models:</p><ol><li>Random Forest Classifier</li><li>XGBoost</li><li>Deep Learning Neural Network</li><li>LSTM</li></ol>
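<p>For reference, here is a minimal sketch of how the two tree-based models could be fitted on the resampled data to produce the y_pred_rf and y_pred_xgb arrays evaluated below (the xgb_model variable name and the hyperparameters are illustrative assumptions, not necessarily the exact ones used; the neural network and the LSTM would be fitted on the scaled arrays instead).</p><pre># Illustrative sketch: fit the tree-based models on the SMOTE-resampled training set<br>rf = RandomForestClassifier(n_estimators=100, random_state=42)<br>rf.fit(x_train, y_train)<br>y_pred_rf = rf.predict(x_test)<br>y_prob_rf = rf.predict_proba(x_test)[:, 1]  # failure probabilities, reused later for thresholding<br><br>xgb_model = xgb.XGBClassifier(eval_metric=&quot;logloss&quot;, random_state=42)<br>xgb_model.fit(x_train, y_train)<br>y_pred_xgb = xgb_model.predict(x_test)</pre>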
<p>After fitting and training these models, I evaluated them to see how well they performed against each other.</p><pre>print(f&quot;*******************RANDOM FOREST Classification Report*******************&quot;)<br>print(classification_report(y_test, y_pred_rf))<br><br>print(f&quot;*******************XGBOOST Classification Report*******************&quot;)<br>print(classification_report(y_test, y_pred_xgb))<br><br>print(f&quot;*******************DEEP NEURAL NETWORK Classification Report*******************&quot;)<br>print(classification_report(y_test, y_pred_nn))<br><br>print(f&quot;*******************LSTM Classification Report*******************&quot;)<br>print(classification_report(y_test, y_pred_lstm))</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/602/1*AhyKfkvyjNgpMeMgK3BbAQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/680/1*CKWWc_YwRM_jdK7-haR7UA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/520/1*gp_-d9mmEbjyW262Q22roA.png" /><figcaption>Model Performance of Random Forest, XGBoost, Deep Learning Neural Network and LSTM Models</figcaption></figure><p>Since my desired model would predict when a failure is about to occur, the case where it predicts “<strong>no failure” </strong>when there is in fact a “<strong>failure” (false negative)</strong> would be a more serious problem than the case where it predicts a “<strong>failure” </strong>when there is in fact “<strong>no failure” (false positive)</strong>.</p><p>So I decided to focus my model appraisal on the “<strong>RECALL” </strong>metric, which is the one most sensitive to <strong>“FALSE NEGATIVES”, </strong>as opposed to<strong> “PRECISION”, </strong>which is most sensitive to “<strong>FALSE POSITIVES”.</strong></p><p>From this, I chose the models with the best recall:</p><ol><li>Random Forest</li><li>XGBoost</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/633/1*tHajhyefCg3fSXx0rxdLOw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/636/1*vy-0fs3jrN6ODbAV3Va2Hw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/634/1*8lFJebc2B_lmfhzjdj_KCQ.png" /><figcaption>Confusion Matrix, AUC-ROC &amp; Feature Importance Plots of Random Forest Model</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/635/1*xC082rXtxAbwboI8g6vSlQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/632/1*rYM8XDmlLduOPfXiQSXLGg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/632/1*5Vx9HQ7FSQnlmWosCvySdg.png" /><figcaption>Confusion Matrix, AUC-ROC &amp; Feature Importance Plots of XGBoost Model</figcaption></figure><p>I intend to further refine these models and increase their baseline recall by fine-tuning their hyperparameters.</p><p>****************************************************************************</p><p>Now that I had narrowed down my models and analyzed the data to find out which factors contribute the most to equipment failure, I carried out predictions on random data to determine whether there would be equipment failure.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/547/1*RuFqD_xOph96YzBaEhqyRg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/598/1*vuiORMO28S1SGrjnF3qpUw.png" /><figcaption>Predictions from Random Forest and XGBoost Models</figcaption></figure><p>From the pictures above, it is clear that there are a few false negatives: the average predicted probability of failure on these false negatives is 30%, but the lowest is around 10%.</p><p>One solution to the false-negative problem could be lowering the positive detection threshold from 50% to 10%.</p><p>This would make the model predictions <strong>VERY STRICT </strong>and would <strong>decrease </strong>the<strong> chances of the model predicting “no failure” when there is in fact a “failure”.</strong></p><pre>threshold = 0.1   # lower than 0.5 = fewer missed failures<br><br># Convert probabilities to predictions<br>y_pred_rf_thresh = (y_prob_rf &gt;= threshold).astype(int)<br><br>print(&quot;Failure probabilities:&quot;)<br>print(y_prob_rf)<br><br>print(f&quot;Prediction of failure using threshold = {threshold}:&quot;)<br>print(y_pred_rf_thresh)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/686/1*tyUV5KDRzxQ5ciE-rAOfvw.png" /></figure><p>By <strong>“tightening” </strong>the decision threshold on the Random Forest model, I was able to <strong>all but eliminate</strong> the model’s <strong>false-negative </strong>problem, increasing the <strong>recall from 73.5% to 95.9% </strong>while maintaining an <strong>accuracy of 86.8%, </strong>which makes for a <strong>system better equipped to predict equipment failure beforehand.</strong></p><p>In the future, I hope to do the following:</p><p><strong>FUTURE VERSION UPGRADES</strong></p><ol><li>Model hyperparameters to be 
finetuned</li><li>The model will monitor real-time events to know when failure is imminent (possibly by using Autoencoders, which are good for systems without data from use [new equipment])</li><li>The model should detect anomalies using autoencoders</li><li>The model should be able to discern the type of failure</li></ol><p><strong><em>THANK YOU FOR READING THIS FAR!!</em></strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ead7af86fade" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AI Job Application Automation]]></title>
            <link>https://medium.com/@calebepelle5/ai-job-application-automation-a84e6caf66d7?source=rss-46c7ffd9b1b2------2</link>
            <guid isPermaLink="false">https://medium.com/p/a84e6caf66d7</guid>
            <dc:creator><![CDATA[Caleb Epelle]]></dc:creator>
            <pubDate>Fri, 02 Jan 2026 17:08:53 GMT</pubDate>
            <atom:updated>2026-01-03T04:24:10.430Z</atom:updated>
            <content:encoded><![CDATA[<h3>AI Agent Job Application Automation</h3><p>In 2024, I left my job in the oil and gas industry to pursue a life and a job in the tech industry.</p><p>I quickly realized that getting these jobs was not so easy and the application process was monotonous, tedious, boring and tiring.</p><p>So I created an AI Agent to handle all my job applications and schedule interviews.</p><p>This was the result.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/720/1*0vaM2XS32D09FZsYGmfdCA.jpeg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/720/1*OoWeMjpP9bLzL999hzVV3Q.jpeg" /></figure><p>Automating my job application process</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a84e6caf66d7" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Customer Churn Prediction]]></title>
            <link>https://medium.com/@calebepelle5/customer-churn-1646b873df37?source=rss-46c7ffd9b1b2------2</link>
            <guid isPermaLink="false">https://medium.com/p/1646b873df37</guid>
            <dc:creator><![CDATA[Caleb Epelle]]></dc:creator>
            <pubDate>Fri, 31 Oct 2025 04:37:12 GMT</pubDate>
            <atom:updated>2026-01-06T05:59:43.875Z</atom:updated>
<content:encoded><![CDATA[<p>I started a business a little while after I left school and I saw and felt how disheartening it is when customers and clients leave or when they don’t patronize you as much as they used to <em>(or as much as you’d like them to). </em>So, I used AI to predict what kind of customers are likely to leave (churn) you, your service or your product, as well as to identify the factors, qualities and traits that heavily influence client retention.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*b26FsGUkgC2acMW5NJ7oCg.jpeg" /></figure><p>I used a couple of models to predict this. By comparing the predictions of these models, I believe business owners can improve their customer retention as well as customer satisfaction.</p><p>The data used here was from a telecom company. It comprised variables like the products and services the customer paid for, their total bills, and whether they had churned or not.</p><p>By carrying out my analysis and prediction, I was able to identify the key factors that contributed to customer retention.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*AaoO2NyGMPLQqDkSG_XoHw.png" /><figcaption>Bar plot showing the 10 most important factors that contribute to customer retention across all models</figcaption></figure><p>With this, business owners would have insight into their customers’ minds to know what they consider “<strong>important</strong>”.</p><p><strong>What services to offer and cut to improve customer retention</strong></p><p>Now for the technical bit.</p><p>I imported all necessary libraries and loaded the dataset.</p><pre>import pandas as pd<br>import numpy as np<br>import seaborn as sns<br>import matplotlib.pyplot as plt<br>from sklearn.model_selection import train_test_split<br>from sklearn.preprocessing import StandardScaler<br>from sklearn.metrics import (<br>    accuracy_score,<br>    classification_report,<br>    confusion_matrix,<br>    roc_curve,<br>    auc<br>)<br>from sklearn.ensemble import RandomForestClassifier<br>from sklearn.linear_model import LogisticRegression<br>from xgboost import XGBClassifier<br><br>df = pd.read_csv(&quot;C:/Users/USER/Desktop/port/WA_Fn-UseC_-Telco-Customer-Churn.csv&quot;)<br>print(&quot;Data loaded successfully!&quot;)<br>print(df.head())</pre><p>I cleaned the data, converted numbers that were stored as strings into numeric values, removed irrelevant features, filled in missing values and encoded categorical features for easy processing.</p><pre>df[&quot;TotalCharges&quot;] = df[&quot;TotalCharges&quot;].replace(&quot; &quot;, np.nan)<br>df[&quot;TotalCharges&quot;] = pd.to_numeric(df[&quot;TotalCharges&quot;], errors=&quot;coerce&quot;)<br><br>df[&quot;TotalCharges&quot;] = df[&quot;TotalCharges&quot;].fillna(df[&quot;TotalCharges&quot;].median())<br><br>if &quot;customerID&quot; in df.columns:<br>    df = df.drop(&quot;customerID&quot;, axis=1)<br><br>cat_cols = df.select_dtypes(include=[&quot;object&quot;]).columns.tolist()<br><br>for col in cat_cols:<br>    if df[col].nunique() == 2:<br>        df[col] = df[col].map({&quot;Yes&quot;: 1, &quot;No&quot;: 0})<br>    else:<br>        df = pd.get_dummies(df, columns=[col], drop_first=True)</pre><p>Then I split the data into training and testing sets and scaled them for easy training.</p><pre>X = df.drop(&quot;Churn&quot;, axis=1)<br>y = df[&quot;Churn&quot;]<br><br>X = X.fillna(X.median())<br><br>X_train, X_test, y_train, y_test = train_test_split(<br>    X, y, test_size=0.2, random_state=42, stratify=y<br>)<br><br>scaler = StandardScaler()<br>X_train = X_train.fillna(X_train.median())<br>X_test = X_test.fillna(X_test.median())<br><br>X_train_scaled = scaler.fit_transform(X_train)<br>X_test_scaled = scaler.transform(X_test)</pre><p>I used 3 models:</p><ol><li>Random Forest Classifier</li><li>Logistic Regression</li><li>XGBoost</li></ol>
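<p>For reference, here is a minimal sketch of how the three models could be fitted to produce the rf_pred, lr_pred and xgb_pred labels and the rf_prob, lr_prob and xgb_prob probabilities used below (which model sees the scaled features is inferred from how they are used later in the post, and max_iter is an illustrative choice).</p><pre># Illustrative sketch: Random Forest and Logistic Regression on the scaled features, XGBoost on the raw ones<br>rf = RandomForestClassifier(n_estimators=100, random_state=42)<br>rf.fit(X_train_scaled, y_train)<br>rf_pred = rf.predict(X_test_scaled)<br>rf_prob = rf.predict_proba(X_test_scaled)[:, 1]<br><br>lr = LogisticRegression(max_iter=1000)<br>lr.fit(X_train_scaled, y_train)<br>lr_pred = lr.predict(X_test_scaled)<br>lr_prob = lr.predict_proba(X_test_scaled)[:, 1]<br><br>xgb = XGBClassifier(eval_metric=&quot;logloss&quot;, random_state=42)<br>xgb.fit(X_train, y_train)<br>xgb_pred = xgb.predict(X_test)<br>xgb_prob = xgb.predict_proba(X_test)[:, 1]</pre>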
<p>After fitting and training these models, I evaluated them to see how well they performed against each other.</p><pre>def evaluate_model(name, y_true, y_pred):<br>    print(f&quot;******************** {name} Performance ********************&quot;)<br>    print(&quot;Accuracy:&quot;, round(accuracy_score(y_true, y_pred), 4))<br>    print(classification_report(y_true, y_pred))<br><br>evaluate_model(&quot;Random Forest&quot;, y_test, rf_pred)<br>evaluate_model(&quot;Logistic Regression&quot;, y_test, lr_pred)<br>evaluate_model(&quot;XGBoost&quot;, y_test, xgb_pred)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/628/1*0vDNcVuisWL3NVEMxGIKBQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/634/1*8F3DKNXcWkjhegWtqtyIlA.png" /><figcaption>Results from model evaluation</figcaption></figure><p>I also made confusion matrices and ROC-AUC curves to see which model had better success dealing with positives and negatives, both true and false, and to analyze model performance.</p><pre># RANDOM FOREST CONFUSION MATRIX<br>cm = confusion_matrix(y_test, rf_pred)<br>sns.heatmap(cm, annot=True, fmt=&quot;d&quot;, cmap=&quot;Blues&quot;)<br>plt.title(&quot;Confusion Matrix - Random Forest&quot;)<br>plt.xlabel(&quot;Predicted&quot;)<br>plt.ylabel(&quot;Actual&quot;)<br>plt.show()<br><br># XGBOOST CONFUSION MATRIX<br>cm = confusion_matrix(y_test, xgb_pred)<br>sns.heatmap(cm, annot=True, fmt=&quot;d&quot;, cmap=&quot;Blues&quot;)<br>plt.title(&quot;Confusion Matrix - XG Boost&quot;)<br>plt.xlabel(&quot;Predicted&quot;)<br>plt.ylabel(&quot;Actual&quot;)<br>plt.show()<br><br># LOGISTIC REGRESSION CONFUSION MATRIX<br>cm = confusion_matrix(y_test, lr_pred)<br>sns.heatmap(cm, annot=True, fmt=&quot;d&quot;, cmap=&quot;Blues&quot;)<br>plt.title(&quot;Confusion Matrix - Logistic Regression&quot;)<br>plt.xlabel(&quot;Predicted&quot;)<br>plt.ylabel(&quot;Actual&quot;)<br>plt.show()<br><br># ROC-AUC<br>rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_prob)<br>lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_prob)<br>xgb_fpr, xgb_tpr, _ = roc_curve(y_test, xgb_prob)<br><br>plt.figure(figsize=(8,6))<br>plt.plot(rf_fpr, rf_tpr, label=f&quot;Random Forest (AUC = {auc(rf_fpr, rf_tpr):.3f})&quot;)<br>plt.plot(lr_fpr, lr_tpr, label=f&quot;Logistic Regression (AUC = {auc(lr_fpr, lr_tpr):.3f})&quot;)<br>plt.plot(xgb_fpr, xgb_tpr, label=f&quot;XGBoost (AUC = {auc(xgb_fpr, xgb_tpr):.3f})&quot;)<br>plt.plot([0,1], [0,1], &#39;k--&#39;, label=&quot;Random Guess&quot;)<br>plt.xlabel(&quot;False Positive Rate&quot;)<br>plt.ylabel(&quot;True Positive Rate&quot;)<br>plt.title(&quot;ROC Curves Comparison&quot;)<br>plt.legend()<br>plt.show()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*aq-hUhDAp2GGNzX6NrqXjg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*3gRbZYtQsMDDDkI1C37oAQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*qX2a3WgJUnPAQyXyuDXSbg.png" /><figcaption>Confusion matrix for each model, showing how well they make predictions</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*d8C95NVncMKUXdYJuxwYGg.png" /><figcaption>ROC-AUC curve for all 
models</figcaption></figure><p>From these evaluations I saw that the models were good and that The Logistic Regression model seemed to be making the best predictions.</p><p>I carried out feature importance tests on each model to get the features that contributed the most to the prediction of the model as I saw that this could help provide valuable insight into what makes a customer leave.</p><pre># XGBoost Feature Importance<br>xgb_importance = pd.Series(xgb.feature_importances_, index=X.columns).sort_values(ascending=False)<br>top_features_xgb = xgb_importance.head(15)<br><br>plt.figure(figsize=(10,6))<br>sns.barplot(x=top_features_xgb.values, y=top_features_xgb.index, palette=&quot;viridis&quot;)<br>plt.title(&quot;Top 15 Feature Importances (XGBoost)&quot;)<br>plt.xlabel(&quot;Importance Score&quot;)<br>plt.ylabel(&quot;Feature&quot;)<br>plt.tight_layout()<br>plt.show()<br><br># Random Forest Feature Importance<br>rf_importance = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)<br>top_features_rf = rf_importance.head(15)<br><br>plt.figure(figsize=(10,6))<br>sns.barplot(x=top_features_rf.values, y=top_features_rf.index, palette=&quot;coolwarm&quot;)<br>plt.title(&quot;Top 15 Feature Importances (Random Forest)&quot;)<br>plt.xlabel(&quot;Importance Score&quot;)<br>plt.ylabel(&quot;Feature&quot;)<br>plt.tight_layout()<br>plt.show()<br><br># Logistic Regression Feature Importance<br>lr_importance = pd.Series(lr.coef_[0], index=X.columns).abs().sort_values(ascending=False)<br>top_features_lr = lr_importance.head(15)<br><br>plt.figure(figsize=(10,6))<br>sns.barplot(x=top_features_lr.values, y=top_features_lr.index, palette=&quot;magma&quot;)<br>plt.title(&quot;Top 15 Feature Importances (Logistic Regression)&quot;)<br>plt.xlabel(&quot;Coefficient Magnitude&quot;)<br>plt.ylabel(&quot;Feature&quot;)<br>plt.tight_layout()<br>plt.show()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*eYatbn8Ah4AQumC8tsdukQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*6Kr5dbmhGUJHe4TD5ZmDzw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*Aps65rMdsPR4NYur0OPSUA.png" /><figcaption>Bar plot showing 15 most influential features in each model</figcaption></figure><p>Then I condensed this list by getting the top 10 most influential features across all three models.</p><pre>no_feat = 10<br><br># Random Forest importances<br>rf_importance = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)<br><br># XGBoost importances <br>xgb_importance = pd.Series(xgb.feature_importances_, index=X.columns).sort_values(ascending=False)<br><br># Logistic Regression importances <br>lr_importance = pd.Series(abs(lr.coef_[0]), index=X.columns).sort_values(ascending=False)<br><br># Combine top 20 features from each model<br>combined_features = pd.concat([rf_importance.head(20), xgb_importance.head(20),lr_importance.head(20)])<br><br># Remove duplicates and sort by mean importance<br>combined_mean_importance = combined_features.groupby(combined_features.index).mean().sort_values(ascending=False)<br><br># Select top 10 overall<br>top_features = combined_mean_importance.head(no_feat)<br><br>print(&quot;Top Overall Important Features (Across All Models):&quot;)<br>print(top_features)<br><br>plt.figure(figsize=(10,6))<br>sns.barplot(x=top_features.values, y=top_features.index, palette=&quot;magma&quot;)<br>plt.title(f&quot;Top {no_feat} Most Important Features (Combined 
Models)&quot;)<br>plt.xlabel(&quot;Average Importance Score&quot;)<br>plt.ylabel(&quot;Feature&quot;)<br>plt.tight_layout()<br>plt.show()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*AaoO2NyGMPLQqDkSG_XoHw.png" /><figcaption>Bar plot showing 10 most influential features across all models</figcaption></figure><p>Now that I had analyzed the data to find out why people left, I carried out predictions on random data to know if a person would leave and what the chances of that person leaving were.</p><pre>random_idx = np.random.randint(0, len(X_test))<br>sample = X_test.iloc[random_idx:random_idx+1]<br><br># Ensure there are no NaNs<br>sample = sample.fillna(X_train.median())<br><br># Scale for LR &amp; RF<br>sample_scaled = scaler.transform(sample)<br><br># Make predictions<br>pred_rf = rf.predict(sample_scaled)[0]<br>pred_xgb = xgb.predict(sample)[0]<br><br># Logistic Regression (handle any NaNs)<br>if np.isnan(sample_scaled).any():<br>    print(&quot;NaNs found in sample_scaled — filling with 0 for Logistic Regression.&quot;)<br>    sample_scaled = np.nan_to_num(sample_scaled, nan=0.0)<br><br>pred_lr = lr.predict(sample_scaled)[0]<br><br>prob_rf = rf.predict_proba(sample_scaled)[0][1] * 100<br>prob_xgb = xgb.predict_proba(sample)[0][1] * 100<br>prob_lr = lr.predict_proba(sample_scaled)[0][1] * 100<br><br>prob_rf = round(prob_rf, 2)<br>prob_xgb = round(prob_xgb, 2)<br>prob_lr = round(prob_lr, 2)<br><br># Show actual value from test set<br>actual_value = y_test.iloc[random_idx]<br>print(f&quot;Actual Value (1=Churn, 0=No): {actual_value}&quot;)<br><br># shows the chances of the customer leaving/churnning btwn 0 and 1<br>print(f&quot;Random Forest Prediction (1=Churn, 0=No):, {pred_rf}, Probability: {prob_rf}% chance of leaving&quot;)<br>print(f&quot;XGBoost Prediction (1=Churn, 0=No):, {pred_xgb}, Probability: {prob_xgb}% chance of leaving&quot;)<br>print(f&quot;Logistic Regression Prediction (1=Churn, 0=No):, {pred_lr}, Probability: {prob_lr}% chance of leaving&quot;)<br><br># OUPUT RESULT<br>&quot;&quot;&quot;<br><br>Actual Value (1=Churn, 0=No): 0<br><br>Random Forest Prediction (1=Churn, 0=No):, 0, Probability: 1.0% chance of leaving<br>XGBoost Prediction (1=Churn, 0=No):, 0, Probability: 0.6000000238418579% chance of leaving<br>Logistic Regression Prediction (1=Churn, 0=No):, 0, Probability: 0.56% chance of leaving<br><br>&quot;&quot;&quot;</pre><p>With this, anyone would be able to know:</p><ol><li>What services to offer and cut to improve customer retention</li><li>Why their customers left</li><li>What makes their customers stay or leave</li><li>Whether a certain customer would leave</li><li>What to do to make customers stay <br><em>(look at feature importance plot [</em><strong><em>FIRST PLOT]</em></strong><em>)</em></li></ol><p><strong><em>THANK YOU FOR READING THIS FAR!!</em></strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1646b873df37" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AI Quality Control and Inspection]]></title>
            <link>https://medium.com/@calebepelle5/ai-quality-control-and-inspection-a1433982a132?source=rss-46c7ffd9b1b2------2</link>
            <guid isPermaLink="false">https://medium.com/p/a1433982a132</guid>
            <dc:creator><![CDATA[Caleb Epelle]]></dc:creator>
            <pubDate>Tue, 28 Oct 2025 09:53:25 GMT</pubDate>
            <atom:updated>2025-10-29T09:02:27.767Z</atom:updated>
            <content:encoded><![CDATA[<p>I worked as an Inspection Engineer in the Oil and Gas Industry.</p><p>One day, on an oil rig, I saw a team inspect a structure that was at a height difficult to reach. They used a drone to do this.</p><p>It got my team and I talking about the possibility of machines replacing us in the near future. Some said it was possible, others said it wasn’t, others said it would take time. <em>(can you guess what I said?)</em></p><p>I decided to try something out.</p><p>I wanted to see if I could use machines to help us rather than to replace us <em>(just like the team with the drone did)</em></p><p>A key aspect of my job as an inspector was to confirm if equipment met the standards they claimed to be designed after. One repetitive task we often underwent was measuring.</p><p>We measured things like webbing slings, wire rope slings, cargo carrying units (CCUs), chain hoists and many more.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CsKu0FYI1yySRJCPaMxxCg.jpeg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/850/1*ujeoUJe8CwY-MXbrk-DP3w.png" /><figcaption>Cargo Carrying Unit and Webbing Slings</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/474/0*6qBdOuJiF0MoA5hm" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/0*uO2Owj-m9qfwRnDs.jpg" /><figcaption>Chain Hoist and Wire Rope Sling</figcaption></figure><p>I decided to take on that part of the job with Artificial Intelligence.</p><p>I designed a Computer Vision System that could get the dimensions of any object it saw.</p><p>This was the result.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*AHOq__T1hUnsB8Chj1c1tQ.png" /><figcaption>Result displaying the length, width and area or an object</figcaption></figure><p>I was able to accurately measure the dimensions of an object and display these values to the user.</p><p>This could help cut down on the time spent inspecting individual items which would in turn improve the efficiency of an inspector.</p><p>To do this I had to generate live video feed</p><pre>cam = True#this could be used to toggle camera on/off(true = on)<br><br># static image<br>ruler = cv2.imread(&quot;pictures/ruler 2.jpg&quot;)<br><br>cap1 = cv2.VideoCapture(0)<br>frame_width = 640<br>frame_height = 480<br>cap1.set(3,frame_width)#width<br>cap1.set(4,frame_height)#height<br>cap1.set(10,100)#brightness/exposure<br><br>while True:<br>    if cam:<br>        success, img = cap1.read()<br>    else:<br>        img = ruler<br><br>    cv2.imshow(&quot;video&quot;,img)<br>    if cv2.waitKey(1) &amp; 0xFF ==ord(&#39;s&#39;):<br>        break</pre><p>Find contours/edges of objects of objects using the canny edge detection and a variable filter for different number of shapes</p><pre>img, final_contours = functions.get_contours(<br>    img, show_canny=False, draw=True, min_area=1000, filter=4<br>)</pre><p>Flatten the Image so it can be measured</p><pre>imgwarp = functions.get_warp(img, biggest, 360, 480)</pre><p>Carry out the measurement</p><pre>ordered_points = functions.order_point(biggest)<br>obj_width = round(<br>    (functions.find_length(ordered_points[0][0] // scale,<br>                           ordered_points[1][0] // scale) / 10), 1<br>)<br>obj_length = round(<br>    (functions.find_length(ordered_points[0][0] // scale,<br>                           ordered_points[2][0] // scale) / 10), 1<br>)<br>obj_area = obj_length * obj_width</pre><p>and finally, raw lines for 
smooth user interface</p><pre>cv2.polylines(img, [biggest], True, (0, 255, 0), 2)<br>cv2.putText(img, f&#39;Width: {obj_width} in&#39;, (x, y - 10),<br>            cv2.FONT_HERSHEY_COMPLEX, 0.8, (0, 255, 0), 2)<br>cv2.putText(img, f&#39;Length: {obj_length} in&#39;, (x, y + h + 30),<br>            cv2.FONT_HERSHEY_COMPLEX, 0.8, (0, 255, 0), 2)<br>cv2.putText(img, f&#39;Area: {obj_area} in²&#39;, (x, y + h + 60),<br>            cv2.FONT_HERSHEY_COMPLEX, 0.8, (0, 255, 0), 2)</pre><p>I believe something like this could really help inspectors do their jobs better and faster.</p><p><strong><em>Thank you for reading this far!</em></strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a1433982a132" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Automated Data Logging]]></title>
            <link>https://medium.com/@calebepelle5/automated-data-fcb0ebd18260?source=rss-46c7ffd9b1b2------2</link>
            <guid isPermaLink="false">https://medium.com/p/fcb0ebd18260</guid>
            <dc:creator><![CDATA[Caleb Epelle]]></dc:creator>
            <pubDate>Tue, 28 Oct 2025 09:50:25 GMT</pubDate>
            <atom:updated>2025-10-28T09:54:50.035Z</atom:updated>
            <content:encoded><![CDATA[<img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=fcb0ebd18260" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AI Safety Officer]]></title>
            <link>https://medium.com/@calebepelle5/ai-safety-officer-7aa787e0d1ef?source=rss-46c7ffd9b1b2------2</link>
            <guid isPermaLink="false">https://medium.com/p/7aa787e0d1ef</guid>
            <dc:creator><![CDATA[Caleb Epelle]]></dc:creator>
            <pubDate>Tue, 28 Oct 2025 09:48:43 GMT</pubDate>
            <atom:updated>2026-01-06T06:00:25.728Z</atom:updated>
            <content:encoded><![CDATA[<p>I have worked in the Engineering Industry for about 6 years. Throughout those years, I witnessed and heard about various incidents that could have been prevented if someone had been wearing the correct PPE (Personal Protective Equipment).</p><p>I created a system that tracks individuals and identifies whether they are properly dressed in PPE before entering the field to work.</p><p>This system can be paired with my previous project on access control (<a href="https://medium.com/@calebepelle5/hands-free-access-a-new-standard-e6427630a366"><em>https://medium.com/@calebepelle5/hands-free-access-a-new-standard-e6427630a366</em></a>) to create a more robust system that denies access to anyone not properly dressed for the task at hand.</p><p><a href="https://medium.com/@calebepelle5/hands-free-access-a-new-standard-e6427630a366">Hands-Free Access: A New Standard</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/720/1*-XZ21qyvTtaDlHCYq8hAuQ.jpeg" /><figcaption>picture by Vitaliy Yashar</figcaption></figure><p>This was the result</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/636/1*m_KRP_-IgekHaYa73CO4Fg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/642/1*WvDhcW8DsIaDzlSZkCJA9g.png" /><figcaption>Successful identification of an individual with and without PPE (limited to coveralls/vests), respectively</figcaption></figure><p>The first picture above shows an individual without protective equipment, while the second shows an individual with protective equipment.</p><p>To be able to do this, I had to create a custom object detection model that was trained on protective equipment commonly used in the engineering industry.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CF60feKuvZVOcPbPODmhEA.png" /><figcaption>Training data with over 1,000 training images</figcaption></figure><pre>from ultralytics import YOLO  <br><br>data_yaml_path = &quot;C:/Users/USER/Desktop/port/yolo/PPE_DATASET/data.yaml&quot; <br>base_model = &quot;yolov8n.pt&quot; # chosen for its speed in real-time detection<br><br>epochs = 40<br>img_size = 480<br>batch_size = 16<br>device = &quot;cpu&quot;<br><br># LOAD BASE MODEL<br>model = YOLO(base_model)<br><br># MODEL TRAINING<br>model.train(<br>    data=data_yaml_path,<br>    epochs=epochs,<br>    imgsz=img_size,<br>    batch=batch_size,<br>    device=device,<br>    verbose=True,<br>)<br>trained_model_path = &quot;runs/detect/train5/weights/best.pt&quot;<br><br>print(f&quot;Training finished! 
Trained model saved at: {trained_model_path}&quot;)</pre><p>I immediately ran into a problem.</p><p>The model was overfitted.</p><p>It only recognized protective equipment, which led it to classify anything and everything as such.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/642/1*7wK5jwIVGpwAK66fQG4nRw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/632/1*Fahdu8tmjzjt4BpTPRTSaw.png" /><figcaption>Flawed Detections</figcaption></figure><p>To fix this, I had to train the model on everyday objects as well.</p><p>Objects like shirts, trousers, glasses, and so on.</p><p>I was able to get the following results</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Tj-hR8qehqqzieK2fz0gWw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*c2Exsc-brUz578mCLwVfEQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*57HjvQAN1nnLQpCQXv8CsA.png" /><figcaption>Performance curves showing that the model leans more towards 1.0(100%)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2e-WP7jj6djKLcnzTPLiZA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nRC2IPDgbsPcZZLm_eUsJQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*u2CPt_oyJT06-c6wmJTMwg.png" /><figcaption>Confusion matrix showing suitable model performance</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Gaw6V0XJ6V_clsKTKYeUTg.png" /><figcaption>Plots of showing the model’s loss reduction as well as precision and recall increment over a training period of 40 epochs</figcaption></figure><h3><strong>Key elements of my code are below.</strong></h3><p>I set the confidence ratio to 51% to reduce false predictions</p><p>I limited my test to the “vest” class because I didn’t have a helmet lying around.</p><pre>confidence_threshold = 0.51   # Minimum confidence to show detection<br>ppe_class = &quot;Vest&quot;       # Class name for PPE coverall<br>person_class = &quot;Person&quot;      # Class name for person</pre><p>Then I used an overlap function to check when the object and person are in the same “vicinity”.</p><pre>def boxes_overlap(boxA, boxB):<br>    xA = max(boxA[0], boxB[0])<br>    yA = max(boxA[1], boxB[1])<br>    xB = min(boxA[2], boxB[2])<br>    yB = min(boxA[3], boxB[3])<br>    inter_area = max(0, xB - xA) * max(0, yB - yA)<br>    if inter_area == 0:<br>        return False<br>    boxA_area = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1])<br>    boxB_area = (boxB[2] - boxB[0]) * (boxB[3] - boxB[1])<br>    iou = inter_area / float(boxA_area + boxB_area - inter_area)<br>    return iou &gt; 0.1  # Adjust for more precision and to avoid fakeouts</pre><p>I also added bounding boxes around objects for improved user interaction and interface.</p><pre>def draw_detections(img, detections):<br>    &quot;&quot;&quot;Draw bounding boxes and perform per-person PPE check&quot;&quot;&quot;<br>    persons = [d for d in detections if d[&quot;class_name&quot;] == person_class]<br>    ppes = [d for d in detections if d[&quot;class_name&quot;] == ppe_class]<br><br>    for det in detections:<br>        x1, y1, x2, y2 = det[&#39;box&#39;]<br>        cls = det[&#39;class_name&#39;]<br>        conf = det[&#39;confidence&#39;]<br><br>        if cls == person_class:<br>            color = (255, 255, 0)  # Cyan for person<br>        elif cls == ppe_class:<br>            color = (0, 255, 0)    # Green 
for PPE<br>        else:<br>            color = (0, 0, 255)    # Red for others<br><br>        cv2.rectangle(img, (x1, y1), (x2, y2), color, 2)<br>        cv2.putText(img, f&quot;{cls} {int(conf*100)}%&quot;, (x1, y1-10),<br>                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)<br></pre><p>Finally, I ran the object detection model on the camera output</p><pre>results = model.predict(img_display, conf=confidence_threshold, verbose=False)<br>detections = parse_yolo_results(results)<br>img_display = draw_detections(img_display, detections)</pre><p>In the future, I intend to link this to a magnetic lock system that only permits individuals when they are properly dressed.</p><p>I believe this can help reduce incidents in the workplace, as complacency tends to set in when there has been a lack of incidents.</p><p>This system can help keep workers “on their toes” as they will know that they will be denied access without putting on their PPE.</p><p>The system can also help monitor workers that take off their PPE after access has been granted. A safety officer can then radio that worker or the team to help prevent an incident</p><h3>I also believe that this <strong>system will not replace Safety Officers</strong> but will rather <strong>give them a bird’s eye view</strong> of the site. Enabling them to spot risks while absent or far away.</h3><p>THANK YOU FOR MAKING IT THIS FAR!!!!!!!!!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7aa787e0d1ef" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Business Insights with Artificial Intelligence]]></title>
            <link>https://medium.com/@calebepelle5/business-insights-8aa8ed793cac?source=rss-46c7ffd9b1b2------2</link>
            <guid isPermaLink="false">https://medium.com/p/8aa8ed793cac</guid>
            <dc:creator><![CDATA[Caleb Epelle]]></dc:creator>
            <pubDate>Mon, 27 Oct 2025 05:05:22 GMT</pubDate>
            <atom:updated>2025-10-27T17:55:06.821Z</atom:updated>
            <content:encoded><![CDATA[<p>I started a business a little while after I left school and I quickly found out how daunting it could be to please clients. So, I thought, “why not have AI help me with that?” and I did just that.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*EEH6wNAGSi45sEp1.jpg" /></figure><p>I set out to use sentiment analysis as a tool to identify 2 things</p><ol><li>What my clients LOVE <em>(things/standards to maintain and/or improve on)</em></li><li>What my clients HATE <em>(things/standards to eliminate and/or improve)</em></li></ol><p>For sample data, I scraped the data of a product on Jumia’s website using the Beautiful Soup library to get data like the “reviews”, “number of stars” and so on.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dGWqZNL75Amas4l9-QVyLg.png" /><figcaption>Excel spreadsheet of scraped data</figcaption></figure><p>On carrying out some Exploratory Data Analysis I discovered that the product I chose was a good product, BUT it still had room to improve.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/831/1*0F3Hc0KclzuyxLB535r5vw.png" /><figcaption>Image showing the distribution of product rating</figcaption></figure><p>Using the Natural Language ToolKit library, I began carrying out my sentiment analysis.</p><pre>example = (df[&quot;review&quot;][499]).lower()<br>print(example)<br><br>#output: it’s good sleek and works fine</pre><p>Computers can’t understand punctuations so i went on to tokenize the review. This basically breaks the sentence into it’s component words and punctuations and classifies each word to the Part-of-Speech such as:</p><p>PRP: personal Pronoun, JJ: Adjective and so on.</p><pre>tokens = nltk.word_tokenize(example)<br>tokens[:200]<br> <br># output: [&#39;it&#39;, &#39;&#39;&#39;, &#39;s&#39;, &#39;good&#39;, &#39;sleek&#39;, &#39;and&#39;, &#39;works&#39;, &#39;fine&#39;]<br><br>tagged = nltk.pos_tag(tokens)<br>tagged[:200]<br><br>&quot;&quot;&quot;<br>output: <br>[(&#39;it&#39;, &#39;PRP&#39;),<br> (&#39;’&#39;, &#39;VBZ&#39;),<br> (&#39;s&#39;, &#39;RB&#39;),<br> (&#39;good&#39;, &#39;JJ&#39;),<br> (&#39;sleek&#39;, &#39;NN&#39;),<br> (&#39;and&#39;, &#39;CC&#39;),<br> (&#39;works&#39;, &#39;VBZ&#39;),<br> (&#39;fine&#39;, &#39;JJ&#39;)<br>]<br>&quot;&quot;&quot;<br><br>entities = nltk.chunk.ne_chunk(tagged)<br>entities.pprint()<br><br>#output: (S it/PRP ’/VBZ s/RB good/JJ sleek/NN and/CC works/VBZ fine/JJ)</pre><p>Then using 3 methods; VADER, ROBERTA and TRANSFORMER, I was able to get the overall sentiment of the review. 
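</p><p>For example, the VADER step can be run with NLTK&#39;s built-in analyzer; a minimal sketch on the same example review (assuming the vader_lexicon resource has been downloaded, and that this mirrors the setup used) looks like this:</p><pre>from nltk.sentiment import SentimentIntensityAnalyzer<br><br># nltk.download(&quot;vader_lexicon&quot;)  # one-time download of the VADER lexicon<br>sia = SentimentIntensityAnalyzer()<br>print(sia.polarity_scores(example))<br># returns neg/neu/pos scores plus a compound score (&gt; 0 leans positive, &lt; 0 leans negative)</pre><p>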
I settled on using the transformer model solely as I observed that it had the highest accuracy and was the fastest to use.</p><pre>from transformers import pipeline<br><br>sent_pipeline = pipeline(&quot;sentiment-analysis&quot;)<br><br>print(sent_pipeline(&#39;Power surge fried all my ports&#39;))<br>print(sent_pipeline(&#39;Power surge fried all my ports&#39;)[0])<br>print(sent_pipeline(&#39;Power surge fried all my ports&#39;)[0][&quot;label&quot;])<br><br>&quot;&quot;&quot;<br>output: [{&#39;label&#39;: &#39;NEGATIVE&#39;, &#39;score&#39;: 0.9846639037132263}]<br>{&#39;label&#39;: &#39;NEGATIVE&#39;, &#39;score&#39;: 0.9846639037132263}<br>NEGATIVE<br>&quot;&quot;&quot;<br><br># applying transformer pipeline to &quot;review&quot; column and making a new column for transformer output<br>results_df[&quot;transformer&quot;] =(results_df[&#39;review&#39;]).apply(lambda x: sent_pipeline(x)[0][&quot;label&quot;])<br>results_df<br><br># isolating transformer output, review and stars cloums<br>pipeline_df = results_df[[&quot;review&quot;, &quot;transformer&quot;, &quot;stars&quot;]]<br>pipeline_df<br></pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/622/1*51FqlFXzEvBBJBmUmro_Xw.jpeg" /><figcaption>Isolated transformer output</figcaption></figure><p>The next step was to isolate the positive and negative reviews in order to identify areas of improvement.</p><pre>#NEGATIVE REVIEWS<br>negatives = results_df[results_df[&quot;transformer&quot;] == &quot;NEGATIVE&quot;]<br><br>negatives = negatives[[&quot;Id&quot;, &quot;name&quot;, &quot;review&quot;, &quot;stars&quot;, &quot;transformer&quot;]]<br>negatives<br><br>#POSITIVE REVIEWS<br>positives= results_df[results_df[&quot;transformer&quot;] == &quot;POSITIVE&quot;]<br><br>positives = positives[[&quot;Id&quot;, &quot;name&quot;, &quot;review&quot;, &quot;stars&quot;, &quot;transformer&quot;]]<br>positives</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/721/1*ItMDdezMk3RX2c1VWQfplA.png" /><figcaption>Negative reviews</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/737/1*SM0ot6hMFS-FDQNE__OanA.png" /><figcaption>Positive reviews</figcaption></figure><p>I then made a spreadsheet with the sentiments and reviews and a filter function. This made it easy for users who may not be Data Analyst sto share and utilize this data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sE-ZDkglRQkX9BGQSH-d-w.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zdjNp5GnZ45UZpaO4snccw.png" /><figcaption>Negative and Positive filters being used in spreadsheet</figcaption></figure><p>This spreadsheet would help provide actionable insight into the pain points of product users.</p><p><strong>CONCLUSION</strong></p><p>Businesses can streamline their progress and growth as well as make themselves more efficient by using Artificial Intelligence in the form of Sentiment Analysis.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8aa8ed793cac" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Price Prediction of the Financial Markets]]></title>
            <link>https://medium.com/@calebepelle5/price-prediction-of-the-financial-markets-09b44fa5a509?source=rss-46c7ffd9b1b2------2</link>
            <guid isPermaLink="false">https://medium.com/p/09b44fa5a509</guid>
            <dc:creator><![CDATA[Caleb Epelle]]></dc:creator>
            <pubDate>Mon, 27 Oct 2025 05:04:24 GMT</pubDate>
            <atom:updated>2025-11-04T07:19:42.070Z</atom:updated>
            <content:encoded><![CDATA[<p>In 2020, I learned how to trade the forex markets, and in 2024, I left my job to develop my skills in machine learning. Along the line I found myself in need of money, so I fell back on what I knew about the forex market and started trading. As my knowledge and skill in machine learning grew, I thought to myself:<br>“Why not let the computer do this?”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*6Bm4IMukBipyrmN_" /></figure><p>So, I did</p><p>and it was quite the journey.</p><p>I was able to achieve a system capable of predicting the price of an asset 1 day in advance <em>(with some degree of error of course)</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DiZ7ZHj4IZq0Mc70ZjLcfw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Vv3CehKNNHbWudHJ421mnA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*R03Vcw4A16bSipK7SYHc-A.png" /><figcaption>Prediction of price data 1 day prior (28/10/2025 7:01pm), 1 day after (29/10/2025 5:54pm) and 1 week after (04/11/2025 8:11am)</figcaption></figure><p>The first picture above shows the real time price in green with the predicted price in black. The prediction indicated that the price of the asset would go down by the next day.</p><p>The second picture shows that the real time price in green actually did go down <em>(although not to the degree indicated by the prediction), </em>it would still be able to give an individual a directional bias for the day ahead or even some form of directive on what action to take in the present <em>(whether to buy, sell or wait).</em></p><p>The third picture shows that the real time price in green actually did go down well beyond the initial prediction <em>(to a degree that surprised even me)</em></p><p>However, the model did not start out this way.</p><p>When I started, I had the <em>strangest </em>results</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ajm6da_eBnokqDYKytEJRw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eWULfdTrWiJbMeDrW-_BEw.png" /><figcaption>Noisy Predictions</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*H4DzssEdxclewT4aEQuYIw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*T7vENu-gLaqk-DgBEk2r6Q.png" /><figcaption>Noisy Predictions</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*B0lcprlSdsiS4efXSNnIvA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bvsHUpUJ47h3hvzKzRDDWw.png" /><figcaption>Noisy Predictions</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vw9PPJrQ3mKvmjjpISIQYg.png" /><figcaption>Noisy Predictions</figcaption></figure><p>These results sent me down a spiral of variable manipulations and loops to automate the process of variable manipulation to be able to find the best set of variables and hyperparameters to accomplish this task.</p><p>In the end I had a bit of an epiphany.</p><p>I set out to improve 2 things:</p><ol><li>Increase the memory of the system</li><li>Reduce the prediction timeline to 1 point ahead</li></ol><p>The previous models had poor memory and would try to predict up to 100 points into the future. 
This made them decompose or repeat previous predictions which led to decomposition and failure.</p><p>After accomplishing these 2 things, I was finally able to achieve the results I showed at first.</p><p>I learnt a few things from this project:</p><ol><li>Sometimes less is more</li><li>Never give up</li></ol><p><em>note: the code for this endeavor is kept private for specific reasons</em></p><p><strong><em>Thank you for reading this far!</em></strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=09b44fa5a509" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Effortless Attendance Tracking: Facial Recognition & Excel Insights]]></title>
            <link>https://medium.com/@calebepelle5/effortless-attendance-tracking-facial-recognition-excel-insights-843022573cf3?source=rss-46c7ffd9b1b2------2</link>
            <guid isPermaLink="false">https://medium.com/p/843022573cf3</guid>
            <dc:creator><![CDATA[Caleb Epelle]]></dc:creator>
            <pubDate>Mon, 27 Oct 2025 02:59:39 GMT</pubDate>
            <atom:updated>2025-10-27T04:58:36.221Z</atom:updated>
<content:encoded><![CDATA[<h3>Project</h3><p>Creating a sign-in/attendance system that uses facial recognition to know the attendance of people and then autonomously log this attendance data into an Excel spreadsheet daily. The Excel spreadsheet then uses charts to track individual attendance performance by visual representation of the data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rhk8EmEOWN7tENAbdj9nVA.jpeg" /></figure><h3>Inspiration:</h3><p>I got this idea when I was in a meeting and I noticed someone having to write down the names of people who were in attendance. I thought it would be more effective to have an automated way of taking this attendance and tracking individuals to know their consistency per season.</p><h3>DESIGN STEPS</h3><ol><li>Webcam Display on Background</li><li>Create Database</li><li>Get Facial Landmark Encodings of Database Specimen</li><li>Compare Facial Landmark Encodings of Database Specimen with that from Webcam</li><li>Display Individual</li><li>Update Excel Spreadsheet</li><li>Visualize Data</li></ol><h3>THE CODE</h3><p><strong>STEP 1: WEBCAM DISPLAY ON BACKGROUND</strong></p><p>The first thing that had to be done was to add a background image and create 2 sections on this background image: one for my live video output and the other for the image that would be displayed on facial recognition. (this code was written in Python, using the computer&#39;s camera as the video source)</p><pre># image/display set up<br>while True:<br>    success, img = cap.read()<br><br>    img_small = cv2.resize(img, (0,0), None, 0.25, 0.25)#scales image down to 1/4th the original size to reduce computational power<br>    img_small = cv2.cvtColor(img_small, cv2.COLOR_BGR2RGB)<br><br>    face_current_frame = face_recognition.face_locations(img_small)#picks out only the face in the video<br>    encode_currrent_frame = face_recognition.face_encodings(img_small,face_current_frame)#does encodings on face from camera<br><br>    if face_current_frame == []:<br>        img_background = cv2.imread(&quot;C:/Users/USER/Desktop/CHECK2/resources/background1.png&quot;)  # [vertical,horizontal]says that the space defined should show the picture chosen<br>        img_background[100:100 + 512, 735:735 + 421] = cv2.resize(image_mode_list[3], dsize=(421, 512),interpolation=cv2.INTER_CUBIC)  # [vertical,horizontal]says that the space defined should show the picture chosen<br><br>    img_background[162:162+480, 55:55+640] = img #says that the space defined should show the webcam footage<br>    img_background[162:162+480, 55:55+640] = cv2.flip(img,1)#horizontal flip = 1, vertical flip = 0<br>    imgg = cv2.flip(img,1)<br><br>    cv2.imshow(&quot;background&quot;,img_background)<br><br>    if cv2.waitKey(1) &amp; 0xFF == ord(&#39;s&#39;):<br>        break</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tUqsnjyj8bvJATu8Yfa5ww.png" /><figcaption>Output showing live feed and section for confirmation image</figcaption></figure><p><strong>STEP 2: CREATE DATABASE</strong></p><p>A database of multiple people with different pictures of them was created so the encodings for their faces could be made <em>(these were taken with consent)</em>. The folders were named after each person to make it easy to update the database when adding new members.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cJMs3PgwBou2mjvpm2eCUQ.png" /><figcaption>Database for multiple people (each folder has multiple images for each person)</figcaption></figure>
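<p>For reference, here is a minimal sketch of how this folder structure could be read into the image_list, worker_IDs and image_worker_list variables used in the next steps (the database path and the exact list layout are assumptions based on how those names are used later).</p><pre># Illustrative sketch: walk the person-named folders and build parallel lists<br>import os<br><br>database_path = &quot;C:/Users/USER/Desktop/CHECK2/database&quot;  # assumed location of the folders shown above<br>image_list = []         # every picture, used to compute the encodings<br>worker_IDs = []         # the folder (person) name for each picture, doubles as the ID<br>image_worker_list = []  # picture to display when a matching index is found<br><br>for person in os.listdir(database_path):<br>    person_dir = os.path.join(database_path, person)<br>    for file_name in os.listdir(person_dir):<br>        picture = cv2.imread(os.path.join(person_dir, file_name))<br>        image_list.append(picture)<br>        worker_IDs.append(person)<br>        image_worker_list.append(picture)</pre>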
<p><strong>STEP 3: GET FACIAL LANDMARK ENCODINGS OF DATABASE SPECIMEN</strong></p><p>A function was made to be able to get the facial landmarks <em>(similar to a fingerprint for the face) </em>and save them to an encode file.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/474/0*1T1J4DOPm4W0aqS7" /><figcaption>Stock image of facial landmarks</figcaption></figure><pre>def find_encodings(image_list):<br>    encode_list = []<br><br>    for img in image_list:<br>        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)<br>        encodings = face_recognition.face_encodings(img)<br>        if len(encodings) &gt; 0:<br>            encode_list.append(encodings[0])  # ✅ take the first face only<br>        else:<br>            print(&quot;No face found in one of the images, skipping.&quot;)<br><br>    for i, enc in enumerate(encode_list):<br>        print(&quot;TESTING ENCODE LIST&quot;)# OUTPUT: 0 &lt;class &#39;numpy.ndarray&#39;&gt; (128,)<br>        print(i, type(enc), enc.shape)<br><br>    return encode_list<br><br>print(&quot;ENCODING STARTED...&quot;)<br>encode_list_known = find_encodings(image_list)<br>print(&quot;ENCODE LIST KNOWN&quot;)<br>print(encode_list_known)<br>encode_list_known_with_IDs = [encode_list_known, worker_IDs]<br>print(&quot;ENCODE LIST KNOWN WITH IDs&quot;)<br>print(encode_list_known_with_IDs)<br><br>print(&quot;ENCODING COMPLETE&quot;)<br><br>file = open(&quot;C:/Users/USER/Desktop/CHECK2/venv/encodefilee.p&quot;, &#39;wb&#39;)<br><br>pickle.dump(encode_list_known_with_IDs, file)<br>file.close()<br>print(&quot;FILE SAVED&quot;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MNdpX6Q5RN11wyhehmnM5Q.png" /><figcaption>encoding file</figcaption></figure><p><strong>STEP 4: COMPARE FACIAL LANDMARK ENCODINGS OF DATABASE SPECIMEN WITH THAT OF WEBCAM</strong></p><p>The live feed was then used to get the facial landmark encodings of whoever is visible in the feed. The code then compares the live feed encodings with the encodings stored in the file to find the closest match.</p><pre># face matching section<br>for encode_face, face_location in zip(encode_currrent_frame, face_current_frame):<br>    matches = face_recognition.compare_faces(encode_list_known, encode_face)<br>    facee_distance = face_recognition.face_distance(encode_list_known, encode_face)#lower face distance gives best matches<br><br>    matches_index = np.argmin(facee_distance)#gives the index no. 
<pre># face matching section<br>for encode_face, face_location in zip(encode_current_frame, face_current_frame):<br>    matches = face_recognition.compare_faces(encode_list_known, encode_face)<br>    face_distances = face_recognition.face_distance(encode_list_known, encode_face)  # a lower face distance means a better match<br><br>    matches_index = np.argmin(face_distances)  # index of the database picture that best matches the face on camera<br>    matches_index_list.append(matches_index)<br><br>    if matches[matches_index]:<br>        ID = worker_IDs[matches_index]<br>        print(&quot;KNOWN FACE DETECTED&quot;)<br>        print(&quot;ID: &quot;, ID)  # the ID or name of the worker (the name the images are saved under)<br>        time()  # gives the time of detection for logging purposes (helper not shown in the post)<br>        # [vertical, horizontal] slice: show the matched database picture in the confirmation section<br>        img_background[100:100 + 512, 735:735 + 421] = cv2.resize(image_worker_list[matches_index], dsize=(421, 512), interpolation=cv2.INTER_CUBIC)</pre><p><strong>STEP 5: DISPLAY INDIVIDUAL</strong></p><p>After a face has been matched to a face in the database, the ID photograph of the recognized person is displayed. The lines below, from the matching loop above, handle that display.</p><pre># display section (runs inside the matching loop above)<br>if matches[matches_index]:<br>    ID = worker_IDs[matches_index]<br>    # the matched database picture fills the confirmation section of the background<br>    img_background[100:100 + 512, 735:735 + 421] = cv2.resize(image_worker_list[matches_index], dsize=(421, 512), interpolation=cv2.INTER_CUBIC)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kj-PE3YJWq8oP3ixufUsug.png" /><figcaption>Database picture is displayed</figcaption></figure><p><strong>STEP 6: UPDATE EXCEL SPREADSHEET</strong></p><p>A spreadsheet that marks each individual as present or absent per day was made. The value for the day is updated to &quot;1&quot; when a face is detected, and the totals of present and absent days are then computed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1ujuBw19IszSEnPKhGAbQw.png" /><figcaption>spreadsheet for attendance showing the dispersion of attendance across individuals</figcaption></figure><pre>attendance_data = pd.read_excel(&#39;C:/Users/USER/Desktop/CHECK2/List1.xlsx&#39;)<br><br># row of the recognized person in the attendance sheet<br>row_index = attendance_data[attendance_data.WORKER_NAME == ID].index[0]<br><br># mark the person as present (&quot;1&quot;) for today<br>attendance_data.loc[[row_index], [&#39;TODAY&#39;]] = 1<br><br># recount the days present: sum of the daily columns (skipping the name and summary columns)<br>attendance_data.loc[[row_index], [&#39;PRESENT&#39;]] = attendance_data.loc[[row_index]].iloc[:, 1:-3].sum(axis=1, numeric_only=True)<br><br># days absent = total number of days so far minus days present<br>print(&quot;no. of days:&quot;, num_of_days)<br>calc = (num_of_days - attendance_data.loc[[row_index], [&#39;PRESENT&#39;]].iloc[0]).iloc[0]<br>attendance_data.loc[[row_index], [&#39;ABSENTT&#39;]] = calc<br><br># mirror the totals into the sheet used for the chart<br>chart_row = chart_data[chart_data.WORKER_NAME == ID].index[0]<br>chart_data.loc[[chart_row], [&#39;ABSENTT&#39;]] = calc<br>chart_data.loc[[chart_row], [&#39;PRESENT&#39;]] = attendance_data.loc[[row_index]].iloc[:, 1:-2].sum(axis=1, numeric_only=True)<br><br># write both workbooks back to disk<br>attendance_data.to_excel(&#39;C:/Users/USER/Desktop/CHECK2/List1.xlsx&#39;, index=None)<br>chart_data.to_excel(&#39;C:/Users/USER/Desktop/CHECK2/Chart1.xlsx&#39;, index=None)</pre><p><strong>STEP 7: VISUALIZE DATA</strong></p><p>A second spreadsheet is built from the values of the first and is used to make a chart for assessment.</p>
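<p><em>The post does not show how the charts themselves are produced. If they were drawn from Chart1.xlsx in Python rather than directly in Excel, a minimal matplotlib sketch (the column names come from the update code above; the threshold value is an assumption) could look like this:</em></p><pre># NOTE: rough sketch only; the charting code is not part of the post<br>import pandas as pd<br>import matplotlib.pyplot as plt<br><br>chart_data = pd.read_excel(&quot;C:/Users/USER/Desktop/CHECK2/Chart1.xlsx&quot;)<br><br># bar chart of days present per person, with a minimum-requirement line<br>plt.figure(figsize=(10, 5))<br>plt.bar(chart_data[&quot;WORKER_NAME&quot;], chart_data[&quot;PRESENT&quot;], color=&quot;steelblue&quot;)<br>plt.axhline(y=10, color=&quot;red&quot;, linestyle=&quot;--&quot;, label=&quot;minimum requirement&quot;)  # threshold value assumed<br>plt.ylabel(&quot;Days present&quot;)<br>plt.xticks(rotation=45, ha=&quot;right&quot;)<br>plt.legend()<br>plt.tight_layout()<br>plt.show()</pre>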
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VbsUa78Gg8i027a7uyOJpA.png" /><figcaption>spreadsheet used for the chart</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bfRwg0teuwHmJZpvGB8wbw.png" /><figcaption>chart made to assess attendance, with a red dashed line representing the minimum requirement</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*djAULuQFHPe2NoIrB1v42Q.png" /><figcaption>second chart used for an easier-to-read visualization</figcaption></figure><h3>FINAL PRODUCT</h3><p>After all processes were completed, a fully autonomous attendance-tracking system was built, capable of logging and visualizing attendance data.</p><p>Thank you for reading this far.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=843022573cf3" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[My Attempt at Algorithmic Trading]]></title>
            <link>https://medium.com/@calebepelle5/my-attempt-at-algorithmic-trading-04dd3cb63212?source=rss-46c7ffd9b1b2------2</link>
            <guid isPermaLink="false">https://medium.com/p/04dd3cb63212</guid>
            <dc:creator><![CDATA[Caleb Epelle]]></dc:creator>
            <pubDate>Mon, 08 Sep 2025 18:11:14 GMT</pubDate>
            <atom:updated>2025-10-29T08:12:00.620Z</atom:updated>
            <content:encoded><![CDATA[<p>I started learning to trade the forex markets in 2020, and in 2024 I left my job to develop skills in machine learning in an attempt to get into a new field of work.</p><p>One day, on my self-development journey, I thought to myself:</p><p>&quot;Why not combine these two things?&quot;</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kiVa3bSceqvV8LCJ9ZlZ1Q.jpeg" /></figure><p>So I did</p><p>and</p><p>here are my results.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/720/1*MJm3gcgZcQPGNmvCiW4QWw.jpeg" /><figcaption>1 Month Trading Report</figcaption></figure><p>In the short space of a month, I had <strong>increased my initial deposit by 76%, moving my balance from $500 to $880.</strong><br>(<em>The code for this endeavor is kept private for specific reasons</em>.)</p><p>But this was just the initial testing phase.<br>I kept going.<br>I kept testing my model on days, weeks, months and years of data from the forex markets.<br>I even made models to predict future prices to help in my manual trading (<em>post coming soon</em>).</p><p>But.</p><p>The more data I tested it on,<br>the worse the results got.</p><p>Either little to no gain was made, or the model accrued losses and eventually lost the capacity to trade.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LAsaXotouN_aWIBwxL5xXQ.png" /><figcaption>A ranging account balance before a sudden loss</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XQ3CMSfs7QzNPRePDDgtAw.png" /><figcaption>A continuously decreasing account balance</figcaption></figure><p>At one point, the model was so bad that it even managed to achieve an exponentially decreasing account balance.</p><p>But these setbacks didn’t shake me.<br>I had seen positive results on less data, so I knew I could get positive results on more data.<br>I just needed to fine-tune my model.</p><p>And then, I got it.</p><p>A model with an exponentially increasing account balance.</p><p>I was able to achieve a <strong>profit of 154%, increasing my deposit of $500 to $1270 over the span of 6 months, averaging about 25% each month.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gtPyLu0NRqjcua6BVNwg3w.png" /><figcaption>Exponentially increasing account balance</figcaption></figure><p>What I learnt from this were the crucial parts of data science and machine learning:<br>1. Understanding and knowing how to work with and manipulate data<br>2. Analyzing data<br>3. Understanding problems<br>4. Working through problems<br>5. Never giving up</p><p><em>Note: the model still has its drawbacks. It is still being fine-tuned to achieve better results.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=04dd3cb63212" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>