jjpp43/HackAI2026

OSU Campus Energy Investment Prioritization Tool

A data-driven decision-support platform for identifying which campus buildings should be prioritized for energy-efficiency capital investment. Built for the OSU AI Hackathon 2025.

The tool analyzes 60 days of smart meter data (September–October 2025) across 286 OSU campus buildings, combines it with weather and building metadata, and produces a transparent, ranked shortlist with plain-language explanations and estimated savings opportunities.


Project Structure

osu_energy/
├── app.py                  # 4-page Streamlit application
├── data_loader.py          # Data ingestion, joining, and daily aggregation
├── feature_engineering.py  # Investment signal computation
├── models.py               # Portfolio clustering and per-building time-series regression
├── scoring.py              # Composite score, normalization, ranking, action recommendations
├── utils.py                # Shared utility functions (infrastructure filter)
├── requirements.txt        # Python dependencies
└── PRD.md                  # Product Requirements Document

Requirements

  • Python 3.10+
  • Dependencies:
streamlit>=1.35.0
pandas>=2.0.0
numpy>=1.26.0
scikit-learn>=1.4.0
plotly>=5.20.0
scipy>=1.13.0

Install with:

pip install -r requirements.txt

Data Setup

Place the following four files in your ~/Downloads/ directory before running the app:

| File | Description |
|---|---|
| meter-data-sept-2025.csv | 15-minute interval meter readings, September 2025 |
| meter-data-oct-2025.csv | 15-minute interval meter readings, October 2025 |
| weather-sept-oct-2025.csv | Hourly weather data from Open-Meteo API |
| building_metadata.csv | SIMS building database (size, age, location) |

On startup, the data pipeline automatically joins the three data sources (meter readings, weather, and building metadata). No manual preprocessing is required.


Running the App

streamlit run app.py

The first load takes approximately 30 seconds while the full pipeline runs. Subsequent page navigation is instant due to Streamlit's @st.cache_data caching.


Application Pages

1. Portfolio Overview

A campus-wide snapshot of energy performance across all 286 buildings.

  • Summary metrics: buildings analyzed, meter readings processed, data window
  • Top 20 buildings by priority score, color-coded by confidence tier
  • Energy intensity vs. building age scatter plot (bubble size = gross square footage)
  • Portfolio cluster map: KMeans archetypes projected to 2D via PCA

2. Building Prioritization

The primary decision-support page. Produces a ranked shortlist with adjustable signal weights.

  • Signal weight sliders (auto-renormalized to 100%)
  • Utility selector (ELECTRICITY, HEAT, GAS)
  • Infrastructure building filter (enabled by default)
  • Ranked table with priority score, confidence tier, age, area, and recommended next step
  • Expandable "Why This Building?" section per building — signal breakdown chart and plain-language explanation
  • Estimated savings opportunity (assumes 20% load reduction at $0.13/kWh for electricity)
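
The savings figure in the last bullet is simple arithmetic. A minimal sketch follows, assuming the estimate is annualized from a building's mean daily electricity use; the function name and the 365-day annualization are illustrative assumptions, and only the 20% reduction and $0.13/kWh rate come from this README.

```python
def estimated_annual_savings(mean_daily_kwh: float,
                             reduction: float = 0.20,
                             rate_per_kwh: float = 0.13,
                             days_per_year: int = 365) -> float:
    """Illustrative dollar savings per year for one electricity meter."""
    return mean_daily_kwh * days_per_year * reduction * rate_per_kwh


# A hypothetical building averaging 1,000 kWh/day
savings = estimated_annual_savings(1_000)
print(round(savings, 2))  # about 9490 dollars per year
```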

3. Building Deep Dive

Per-building time-series analysis.

  • 60-day time-series of actual vs. weather-predicted energy use
  • Temperature overlay on secondary axis
  • Isolation Forest anomaly detection — flags statistically unusual days on the chart
  • Signal scorecard and recommended action
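
The anomaly flagging can be sketched as below. This is an illustration on synthetic data, not the code in app.py: the choice of features (daily kWh and mean temperature) is an assumption, and only the Isolation Forest model and contamination=0.10 are taken from this README.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for one building: 60 days of daily kWh and mean temperature.
temp_f = rng.normal(65, 8, size=60)
daily_kwh = 500 + 6 * (temp_f - 65) + rng.normal(0, 15, size=60)
daily_kwh[[10, 40]] += 400  # inject two obviously unusual days

X = np.column_stack([daily_kwh, temp_f])
model = IsolationForest(contamination=0.10, random_state=0)
labels = model.fit_predict(X)  # -1 = anomalous day, 1 = normal day

anomalous_days = np.flatnonzero(labels == -1)
print(anomalous_days)
```

With contamination=0.10, roughly one day in ten is flagged regardless of how unusual it is, so the flags are best read as "most unusual days for this building" rather than confirmed faults.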

4. Methodology & Limitations

Full transparency page documenting signals, models, assumptions, and limitations.

  • Signal definitions and model descriptions
  • Confidence tier criteria
  • KMeans elbow curve (k=2 through k=10) justifying k=5
  • Weight sensitivity analysis: Spearman rank correlation across 300 random weight sets
  • Stated assumptions and known limitations
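
The weight sensitivity analysis can be approximated as follows. This sketch assumes random weights are drawn from a uniform Dirichlet distribution and compared against the default-weight ranking; the README does not say how the 300 weight sets are sampled, and the signal matrix here is synthetic.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
n_buildings, n_signals = 50, 5
signals = rng.random((n_buildings, n_signals))  # stand-in for normalized signals

default_w = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
baseline = signals @ default_w

correlations = []
for _ in range(300):
    w = rng.dirichlet(np.ones(n_signals))  # random non-negative weights summing to 1
    rho, _pvalue = spearmanr(baseline, signals @ w)
    correlations.append(rho)

print(f"median Spearman rho: {np.median(correlations):.2f}")
```

A median correlation close to 1 would suggest the ranking is robust to the choice of weights; a low value would suggest the shortlist depends heavily on them.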

Data Pipeline

Stage 1 — Ingestion (data_loader.py)

  • Loads and concatenates September and October meter CSVs
  • Filters to energy utilities only: ELECTRICITY, HEAT, GAS, STEAM
  • Joins meter data to hourly weather on truncated timestamp
  • Joins meter data to building metadata on SIMS building number
  • Computes vintage_age = 2025 − construction_year
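
The ingestion steps above might look like this in pandas. Apart from readingvalue, construction_year, and vintage_age, every column name here (simscode, utility, readingtime, hour, temperature_f, gross_sqft) is a placeholder, since the actual CSV schemas are not shown in this README.

```python
import pandas as pd

# Tiny in-memory stand-ins for the three sources.
meter = pd.DataFrame({
    "simscode": [101, 101, 202],
    "utility": ["ELECTRICITY", "ELECTRICITY", "WATER"],
    "readingtime": pd.to_datetime(
        ["2025-09-01 00:15", "2025-09-01 01:45", "2025-09-01 00:30"]),
    "readingvalue": [12.0, 14.5, 3.0],
})
weather = pd.DataFrame({
    "hour": pd.to_datetime(["2025-09-01 00:00", "2025-09-01 01:00"]),
    "temperature_f": [68.0, 66.5],
})
buildings = pd.DataFrame({
    "simscode": [101, 202],
    "construction_year": [1975, 2010],
    "gross_sqft": [50_000, 20_000],
})

ENERGY_UTILITIES = {"ELECTRICITY", "HEAT", "GAS", "STEAM"}

# Keep energy utilities only, then truncate timestamps to the hour for the weather join.
energy = meter[meter["utility"].isin(ENERGY_UTILITIES)].copy()
energy["hour"] = energy["readingtime"].dt.floor("h")

joined = (energy.merge(weather, on="hour", how="left")
                .merge(buildings, on="simscode", how="left"))
joined["vintage_age"] = 2025 - joined["construction_year"]
print(joined[["simscode", "hour", "temperature_f", "vintage_age"]])
```

Left joins keep every meter reading even when weather or metadata is missing, which makes gaps visible as NaN values instead of silently dropping rows.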

Stage 2 — Daily Aggregation (data_loader.py)

  • Aggregates 15-minute readings to one row per building, utility, and date
  • Uses readingwindowsum (the true daily total across all 96 intervals) — not readingvalue
  • Applies IQR outlier clipping per building+utility (3×IQR fence) to limit the effect of sensor faults
  • Computes daily_kwh_per_sqft for size-normalized comparisons
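
The clipping and normalization steps can be sketched as below, starting from one row per building, utility, and date. The column names simscode, daily_kwh, and gross_sqft are assumptions; the 3×IQR fence and daily_kwh_per_sqft come from the list above.

```python
import pandas as pd


def clip_iqr(values: pd.Series, k: float = 3.0) -> pd.Series:
    """Clip values to the fence [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


daily = pd.DataFrame({
    "simscode": [101] * 6,
    "utility": ["ELECTRICITY"] * 6,
    # The last value mimics a sensor fault.
    "daily_kwh": [100.0, 110.0, 105.0, 95.0, 102.0, 5000.0],
    "gross_sqft": [50_000] * 6,
})

# Apply the fence separately to each building + utility series.
daily["daily_kwh"] = (daily.groupby(["simscode", "utility"])["daily_kwh"]
                           .transform(clip_iqr))
daily["daily_kwh_per_sqft"] = daily["daily_kwh"] / daily["gross_sqft"]
print(daily["daily_kwh"].max())  # the 5000 spike is capped at the upper fence, 133.5
```

Clipping caps extreme values at the fence instead of deleting the day, so the row count per building stays the same.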

Stage 3 — Signal Engineering (feature_engineering.py)

Five signals are computed for each building × utility combination:

| Signal | What It Measures |
|---|---|
| Energy Intensity (kWh/sqft) | Baseline consumption normalized for building size |
| Unexplained Deviation (RMSE) | Energy use not explained by weather or building age |
| Peer Group Excess (z-score) | How far above similar-sized, similar-aged buildings |
| Weather Sensitivity (kWh/sqft/°F) | Strength of HVAC response to temperature |
| Load Variability (CV) | Erratic or unstable daily consumption patterns |
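
Three of these signals fall out of a single per-building regression. The sketch below uses synthetic data and outdoor temperature as the only regressor, which is a simplifying assumption: the README does not list the exact weather variables used in feature_engineering.py.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic building: 60 days, true slope of 0.0004 kWh/sqft per °F.
temp_f = rng.uniform(50, 85, size=60)
intensity = 0.010 + 0.0004 * temp_f + rng.normal(0, 0.001, size=60)

X = temp_f.reshape(-1, 1)
model = LinearRegression().fit(X, intensity)
predicted = model.predict(X)

weather_sensitivity = model.coef_[0]                               # kWh/sqft/°F
unexplained_rmse = np.sqrt(np.mean((intensity - predicted) ** 2))  # residual RMSE
r_squared = model.score(X, intensity)                              # fit quality
load_variability = intensity.std() / intensity.mean()              # coefficient of variation

print(f"slope={weather_sensitivity:.5f}, rmse={unexplained_rmse:.5f}, "
      f"r2={r_squared:.2f}, cv={load_variability:.2f}")
```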

Each building also receives a confidence tier:

| Tier | Criteria |
|---|---|
| High | ≥ 45 days of data, R² > 0.3, < 10% missing readings |
| Medium | ≥ 25 days of data OR R² > 0.1 |
| Low | Anything below Medium |
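
The tier criteria translate directly into a small function. The function and parameter names are hypothetical, and the sketch assumes the three High conditions must all hold at once.

```python
def confidence_tier(days_of_data: int, r_squared: float, missing_fraction: float) -> str:
    """Map data-quality measures to a confidence tier using the criteria above."""
    if days_of_data >= 45 and r_squared > 0.3 and missing_fraction < 0.10:
        return "High"
    if days_of_data >= 25 or r_squared > 0.1:
        return "Medium"
    return "Low"


print(confidence_tier(60, 0.5, 0.02))   # High
print(confidence_tier(30, 0.05, 0.20))  # Medium
print(confidence_tier(10, 0.05, 0.50))  # Low
```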

A plausibility check removes buildings with a median daily intensity above 50 kWh/sqft/day — a threshold that indicates a likely unit error (e.g., Wh logged as kWh) rather than genuine consumption.
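
A minimal sketch of that plausibility filter, assuming a daily table with hypothetical simscode and daily_kwh_per_sqft columns:

```python
import pandas as pd

daily = pd.DataFrame({
    "simscode": [101, 101, 303, 303],
    # Building 303 looks like Wh logged as kWh.
    "daily_kwh_per_sqft": [0.04, 0.05, 62.0, 58.0],
})

MAX_PLAUSIBLE_INTENSITY = 50  # kWh/sqft/day, threshold stated above

median_intensity = daily.groupby("simscode")["daily_kwh_per_sqft"].median()
implausible = median_intensity[median_intensity > MAX_PLAUSIBLE_INTENSITY].index
plausible = daily[~daily["simscode"].isin(implausible)]
print(plausible["simscode"].unique())
```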

Stage 4 — Portfolio Clustering (models.py)

  • KMeans (k=5) on all five normalized signals, one row per building
  • PCA (2 components) for visualization
  • Cluster archetypes assigned by centroid ranking:
    • High Load + Weather-Driven (Priority)
    • High Baseline Load (Priority)
    • Erratic / Unstable Load (Investigate)
    • Efficient / Low Load (Reference)
    • Moderate Load (Monitor)
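
The clustering and projection steps can be sketched as follows on a synthetic signal matrix. Ranking clusters by the mean of their centroid is an assumption made for illustration; the README does not spell out how models.py maps centroids to the archetype labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
signals = rng.random((286, 5))  # stand-in: 286 buildings x 5 normalized signals

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(signals)

# PCA is used only to draw the clusters in 2D; clustering runs on all five signals.
coords_2d = PCA(n_components=2).fit_transform(signals)

# Order clusters from highest to lowest average centroid value.
centroid_rank = np.argsort(-kmeans.cluster_centers_.mean(axis=1))
print(centroid_rank)
```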

Stage 5 — Scoring (scoring.py)

Each signal is min-max normalized to [0, 1] within utility type. Infrastructure buildings are excluded from the normalization scale to prevent them from compressing campus building scores.

Composite score formula:

score = 100 × (
    0.30 × norm(energy_intensity)
  + 0.25 × norm(unexplained_deviation)
  + 0.20 × norm(peer_excess)
  + 0.15 × norm(|weather_sensitivity|)
  + 0.10 × norm(load_variability)
)

Weights are user-adjustable via the Building Prioritization sliders.
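
The formula and the slider renormalization can be sketched as below for a single utility. The function names and the three-building table are illustrative; in the real pipeline the min-max scale is computed within each utility type and excludes infrastructure buildings, which this sketch does not model.

```python
import pandas as pd

DEFAULT_WEIGHTS = {
    "energy_intensity": 0.30,
    "unexplained_deviation": 0.25,
    "peer_excess": 0.20,
    "weather_sensitivity": 0.15,
    "load_variability": 0.10,
}


def min_max(col: pd.Series) -> pd.Series:
    """Scale a column to [0, 1]; a constant column maps to all zeros."""
    span = col.max() - col.min()
    return (col - col.min()) / span if span > 0 else col * 0.0


def composite_score(signals: pd.DataFrame, weights: dict[str, float]) -> pd.Series:
    """Weighted sum of normalized signals on a 0-100 scale."""
    total = sum(weights.values())  # renormalize so the weights sum to 1
    signals = signals.assign(weather_sensitivity=signals["weather_sensitivity"].abs())
    score = sum((w / total) * min_max(signals[name]) for name, w in weights.items())
    return 100 * score


signals = pd.DataFrame({
    "energy_intensity": [0.02, 0.05, 0.08],
    "unexplained_deviation": [0.001, 0.004, 0.007],
    "peer_excess": [-1.0, 0.0, 1.0],
    "weather_sensitivity": [-0.0002, 0.0003, 0.0004],
    "load_variability": [0.1, 0.2, 0.3],
}, index=["A", "B", "C"])

scores = composite_score(signals, DEFAULT_WEIGHTS)
print(scores.round(1))  # A = 0.0, B = 50.0, C = 100.0
```

Because the weights are divided by their total, passing slider values such as 30, 25, 20, 15, 10 gives the same scores as the fractional defaults.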

Action recommendations by score:

| Score | Recommendation |
|---|---|
| >= 70 (High/Medium confidence) | Full Energy Audit |
| >= 50 | Targeted Investigation |
| >= 30 | Monitor |
| < 30 | No Immediate Action |
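
As a sketch, the thresholds map to a function like the one below. The function name is hypothetical, and one behavior is an assumption: a Low-confidence building scoring 70 or above is treated here as falling through to Targeted Investigation, which the table implies but does not state.

```python
def recommend_action(score: float, confidence: str) -> str:
    """Map a composite score and confidence tier to a recommended next step."""
    if score >= 70 and confidence in ("High", "Medium"):
        return "Full Energy Audit"
    if score >= 50:
        return "Targeted Investigation"
    if score >= 30:
        return "Monitor"
    return "No Immediate Action"


print(recommend_action(82, "High"))  # Full Energy Audit
print(recommend_action(82, "Low"))   # Targeted Investigation (assumed fall-through)
print(recommend_action(12, "High"))  # No Immediate Action
```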

AI / ML Methods

| Method | Library | Purpose |
|---|---|---|
| OLS Regression (per building) | sklearn.linear_model.LinearRegression | Model expected energy from weather + age; extract RMSE and sensitivity |
| KMeans (Peer Grouping) | sklearn.cluster.KMeans (k=5) | Group buildings by size and age for fair peer comparison |
| KMeans (Portfolio Clustering) | sklearn.cluster.KMeans (k=5) | Group buildings by all 5 signals into behavioral archetypes |
| PCA | sklearn.decomposition.PCA (n=2) | Project portfolio clusters to 2D for visualization |
| Isolation Forest | sklearn.ensemble.IsolationForest (contamination=0.10) | Flag anomalous days per building in the Deep Dive |
| Min-Max Normalization | Manual (per utility) | Normalize signals to [0, 1] before scoring |
| Spearman Correlation | scipy.stats.spearmanr | Validate ranking stability across 300 random weight sets |
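
The peer-grouping row feeds the Peer Group Excess signal. A minimal sketch on synthetic buildings follows; standardizing size and age before clustering and the column names are assumptions, not confirmed details of the project code.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
buildings = pd.DataFrame({
    "gross_sqft": rng.uniform(5_000, 300_000, size=100),
    "vintage_age": rng.integers(1, 120, size=100),
    "energy_intensity": rng.gamma(2.0, 0.02, size=100),
})

# Cluster on size and age only, so peers are physically similar buildings.
features = StandardScaler().fit_transform(buildings[["gross_sqft", "vintage_age"]])
buildings["peer_group"] = (
    KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)
)

# z-score of each building's intensity relative to its own peer group.
grouped = buildings.groupby("peer_group")["energy_intensity"]
buildings["peer_excess_z"] = (
    (buildings["energy_intensity"] - grouped.transform("mean"))
    / grouped.transform("std")
)
print(buildings.nlargest(3, "peer_excess_z")[["peer_group", "peer_excess_z"]])
```

A positive z-score means the building uses more energy per square foot than buildings of comparable size and age.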

Key Design Decisions

Why OLS regression instead of a more complex model? Interpretability. OLS coefficients have a direct meaning: a facilities manager can understand that a building's energy use changes by X kWh/sqft per degree Fahrenheit. A black-box model might predict more accurately, but it could not explain its reasoning in these terms.

Why normalize by square footage? Ranking by raw kWh pushes large buildings to the top simply because they are large. kWh/sqft makes comparisons fair regardless of building size.

Why score utilities separately? Electricity and heat operate on different absolute scales and cannot be meaningfully compared directly. Each utility is scored independently; scores are then averaged at the building level.

Why exclude infrastructure from normalization? Substations and chiller plants consume orders of magnitude more energy than campus buildings. Including them in the min-max scale would compress all campus building scores into a narrow band near zero.


Limitations

  • 60-day window only: September–October is a shoulder season. Full-year data would capture heating and cooling cycles more completely.
  • No occupancy or use-type data: A research lab and a lecture hall of the same size and age are treated identically. The peer grouping partially mitigates this but does not fully resolve it.
  • Single weather station: All buildings use the same campus weather feed. Microclimatic variation is not captured.
  • Savings estimates are illustrative: The 20% load reduction assumption and $0.13/kWh rate are representative starting points, not guarantees. Dollar figures are shown for ELECTRICITY only.
  • No discount rate on 3-year savings: The 3-year projection is a simple linear extrapolation and should not be used as a capital budgeting input without further financial analysis.

Data Source

Smart meter data provided by the OSU Energy Research Data Hub for the OSU AI Hackathon 2025. Weather data sourced from the Open-Meteo API. Building metadata from the OSU SIMS database.
