
Calculating RMSE Using Scikit-learn

Last Updated : 17 Sep, 2025

Root Mean Square Error (RMSE) is a metric used for evaluating the accuracy of regression models. It measures the typical size of the errors between predicted and actual values by taking the square root of the mean of the squared differences. RMSE indicates how close the model's predictions are to the real outcomes, with lower values indicating better prediction accuracy.

The formula for RMSE is:

\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}

Here,

  • n: the total number of data points.
  • y_i: the actual (observed) value for the i^{th} data point.
  • \hat{y}_i: the predicted value for the i^{th} data point.
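
To make the formula concrete, here is a minimal pure-Python sketch that applies it term by term. The helper name rmse and the four sample values are illustrative assumptions, not part of the California Housing example used later.

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean of squared differences between two sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # (y_hat_i - y_i)^2 for every data point
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    # mean of the squared errors, then the square root
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical actual and predicted values
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# Squared errors are 0.25, 0.25, 0.0 and 1.0, so the mean is 0.375
example_rmse = rmse(y_true, y_pred)
print(example_rmse)
```

The result is the square root of 0.375, roughly 0.61, expressed in the same units as the values being predicted.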

Implementing RMSE Using Scikit-learn

We will use the California Housing dataset (which Scikit-learn downloads and caches through fetch_california_housing) to predict house prices using Linear Regression and then calculate the Root Mean Square Error (RMSE).

1. Import Required Libraries

We will import NumPy, pandas and the required scikit-learn modules.

Python
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

2. Load the Dataset

Here we will load the California Housing dataset from scikit-learn:

  • fetch_california_housing(as_frame=True): loads the dataset as a pandas DataFrame.
  • data.frame: gives the complete dataset with features and target.
  • drop('MedHouseVal', axis=1): removes target column to get only features (X).
  • df['MedHouseVal']: target column containing house values (y).
Python
data = fetch_california_housing(as_frame=True)

df = data.frame

X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

3. Split the Dataset

Here we will split data into 80% training and 20% testing data.

  • random_state=42: ensures reproducibility of results.
Python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
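
As a quick sanity check of what test_size and random_state do, the following sketch splits a tiny synthetic array instead of the housing data. The names X_demo and y_demo and the 10-row shape are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# 20% of 10 samples goes to the test set, the rest to training
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2))
```

With 10 rows, the split yields 8 training rows and 2 test rows, and the second call returns the same rows because the seed is fixed.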

4. Train the Linear Regression Model

Here we will train a linear regression model:

  • LinearRegression(): creates a linear regression model object.
  • model.fit(X_train, y_train): trains the model on training data.
Python
model = LinearRegression()
model.fit(X_train, y_train)

Output:

[Screenshot: Model Training]
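
To see what fit actually learns, this sketch trains LinearRegression on a small synthetic dataset where the true relationship is known. The data (y = 3x + 2) and the names X_toy, y_toy and toy_model are assumptions for illustration, not values from the housing model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from y = 3x + 2
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_toy = 3.0 * X_toy.ravel() + 2.0

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

# After fitting, the learned slope and intercept are stored on the model
print("slope:", toy_model.coef_[0])
print("intercept:", toy_model.intercept_)
```

Because the toy data has no noise, the fitted slope and intercept should match 3 and 2 up to floating-point error. On real data such as the housing set, coef_ holds one weight per feature column.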

5. Make Predictions and Calculate RMSE

  • model.predict(X_test): generates predictions for the test data.
  • mean_squared_error(y_test, y_pred): computes mean squared error.
  • np.sqrt(...): takes the square root of MSE to get RMSE.
Python
y_pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("Root Mean Square Error (RMSE):", rmse)

Output:

Root Mean Square Error (RMSE): 0.7455813830127763

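A lower RMSE indicates that the model's predictions are closer to the real values. In this case the RMSE is about 0.75. Because MedHouseVal is expressed in units of $100,000, this corresponds to a typical prediction error of roughly $75,000. Whether that is acceptable depends on the application, so the value is best judged against a simple baseline or against other models evaluated on the same test set.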
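
Recent scikit-learn releases (1.4 and later) also provide root_mean_squared_error, which returns the RMSE directly. The sketch below is hedged for older installations: if the import fails, it falls back to a small wrapper around np.sqrt and mean_squared_error. The sample arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

try:
    # Available in scikit-learn 1.4 and later
    from sklearn.metrics import root_mean_squared_error
except ImportError:
    # Fallback for older versions: same computation as in step 5
    def root_mean_squared_error(y_true, y_pred):
        return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Hypothetical actual and predicted values
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

rmse_direct = root_mean_squared_error(y_true_demo, y_pred_demo)
rmse_manual = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
print(rmse_direct, rmse_manual)
```

Both approaches compute the same quantity, so either can be used with y_test and y_pred from the steps above.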

Advantages of RMSE

RMSE is often preferred over metrics like Mean Absolute Error (MAE) when large errors are particularly undesirable, because squaring the errors penalizes large ones more heavily. The same property makes it sensitive to outliers, so a few extreme errors can dominate the score. Its main advantages are:

  • Penalizes Large Errors: RMSE gives higher weight to larger errors by squaring them, which is useful when large mistakes are especially costly.
  • Intuitive Interpretation: It summarizes the typical size of prediction errors in the same units as the target variable, making results easy to understand.
  • Highlights Significant Errors: Its sensitivity to large discrepancies helps identify major prediction issues that may need further investigation.
  • Consistent Scale: RMSE is expressed on the same scale as the predicted values, which makes interpretation straightforward in real-world contexts.
  • Good for Model Comparison: A lower RMSE on the same data indicates better performance, making it a convenient benchmark for comparing different models.
  • Widely Accepted: RMSE is a standard metric used across fields, which helps maintain consistency when reporting model performance.
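
The effect of squaring can be seen in a small pure-Python sketch that compares MAE and RMSE on two hypothetical sets of errors with the same total absolute error. The helper names and error values are assumptions made for illustration.

```python
import math

def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse_from_errors(errors):
    """Square root of the mean of the squared errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Same total absolute error (4.0), distributed differently
even_errors = [1.0, 1.0, 1.0, 1.0]      # four small errors
one_large_error = [0.0, 0.0, 0.0, 4.0]  # one large error

print(mae(even_errors), rmse_from_errors(even_errors))
print(mae(one_large_error), rmse_from_errors(one_large_error))
```

MAE is 1.0 for both sets, while RMSE rises from 1.0 to 2.0 when the error is concentrated in a single prediction. This is the sense in which RMSE penalizes large errors more heavily.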
