Calculating RMSE Using Scikit-learn
Last Updated: 17 Sep, 2025
Root Mean Square Error (RMSE) is a metric used to evaluate the accuracy of regression models. It measures the typical size of the errors between predicted and actual values by taking the square root of the mean of the squared differences. RMSE indicates how close the model’s predictions are to the real outcomes, with lower values indicating better prediction accuracy.
The formula for RMSE is:
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}
Here,
- n: total number of data points
- y_i: the actual (observed) value for the i^{th} data point
- \hat{y}_i: the predicted value for the i^{th} data point
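To make the formula concrete, here is a minimal sketch that applies it step by step with NumPy. The four values in `actual` and `predicted` are made up for illustration and are not taken from any real dataset.

```python
import numpy as np

# Hypothetical actual and predicted values (illustrative only)
actual = np.array([3.0, 5.0, 2.0, 7.0])
predicted = np.array([2.0, 5.0, 4.0, 8.0])

# Step 1: squared differences (y_hat_i - y_i)^2 -> [1, 0, 4, 1]
squared_errors = (predicted - actual) ** 2

# Step 2: mean of the squared differences -> 6 / 4 = 1.5
mse = squared_errors.mean()

# Step 3: square root of the mean -> sqrt(1.5), roughly 1.2247
rmse_manual = np.sqrt(mse)

print("Squared errors:", squared_errors)
print("MSE:", mse)
print("RMSE:", rmse_manual)
```

Note that the result (about 1.22) is in the same units as the values being predicted, which is what makes RMSE easier to read than the raw mean squared error.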
Implementing RMSE Using Scikit-learn
We will use the California Housing dataset (available through Scikit-learn's datasets module) to predict median house values using Linear Regression and then calculate the Root Mean Square Error (RMSE).
1. Import Required Libraries
We will import NumPy, pandas and the required scikit-learn modules for this.
Python
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
2. Load the Dataset
Here we will load the California Housing dataset from scikit-learn:
- fetch_california_housing(as_frame=True): loads the dataset as a pandas DataFrame.
- data.frame: gives the complete dataset with features and target.
- drop('MedHouseVal', axis=1): removes target column to get only features (X).
- df['MedHouseVal']: target column containing house values (y).
Python
data = fetch_california_housing(as_frame=True)
df = data.frame
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']
3. Split the Dataset
Here we will split data into 80% training and 20% testing data.
- random_state=42: ensures reproducibility of results.
Python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
4. Train the Linear Regression Model
Here we will train a linear regression model:
- LinearRegression(): creates a linear regression model object.
- model.fit(X_train, y_train): trains the model on training data.
Python
model = LinearRegression()
model.fit(X_train, y_train)
5. Make Predictions and Calculate RMSE
- model.predict(X_test): generates predictions for the test data.
- mean_squared_error(y_test, y_pred): computes mean squared error.
- np.sqrt(...): takes the square root of MSE to get RMSE.
Python
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Square Error (RMSE):", rmse)
Output:
Root Mean Square Error (RMSE): 0.7455813830127763
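A lower RMSE indicates that the model’s predictions are closer to the real values. Because the target in this dataset is expressed in units of $100,000, an RMSE of about 0.75 corresponds to a typical prediction error of roughly $75,000. Whether that is acceptable depends on the application; for a plain Linear Regression model with no feature engineering, it is a reasonable starting point.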
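As an aside, scikit-learn 1.4 and later also provide a `root_mean_squared_error` function in `sklearn.metrics` that returns RMSE directly. The sketch below assumes such a version may or may not be installed, so it falls back to the `np.sqrt` approach shown above when the import fails. The small arrays are made-up values used only to show that both routes agree.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

try:
    # Available in scikit-learn 1.4+ (assumption about the installed version)
    from sklearn.metrics import root_mean_squared_error
except ImportError:
    # Fallback for older versions: square root of the mean squared error
    def root_mean_squared_error(y_true, y_pred):
        return np.sqrt(mean_squared_error(y_true, y_pred))

# Made-up values for illustration only
actual = np.array([3.0, 5.0, 2.0, 7.0])
predicted = np.array([2.0, 5.0, 4.0, 8.0])

rmse_direct = root_mean_squared_error(actual, predicted)
rmse_via_sqrt = np.sqrt(mean_squared_error(actual, predicted))

print("RMSE (direct):", rmse_direct)
print("RMSE (sqrt of MSE):", rmse_via_sqrt)
```

With the model trained in this article, the equivalent call would be `root_mean_squared_error(y_test, y_pred)`.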
Advantages of RMSE
RMSE is often preferred over metrics such as Mean Absolute Error (MAE) when large errors are particularly undesirable, because squaring the errors penalizes large ones more heavily. The same property makes RMSE sensitive to outliers, so it should be used with care on noisy data. Its main advantages are:
- Penalizes Large Errors: RMSE gives higher weight to larger errors by squaring them, making it useful when large mistakes are especially undesirable.
- Intuitive Interpretation: It represents the typical size of prediction errors in the same units as the target variable, making results easy to understand.
- Highlights Significant Errors: Its sensitivity to large discrepancies helps identify major prediction issues that may need further investigation.
- Consistent Scale: RMSE is expressed in the same scale as the predicted values, which makes interpretation straightforward in real-world contexts.
- Good for Model Comparison: On the same dataset, a lower RMSE indicates better model performance, making it suitable for comparing different models.
- Widely Accepted: RMSE is a standard metric used across fields, which helps maintain consistency when reporting model performance.
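The difference between RMSE and MAE in how they treat large errors can be seen in a small sketch. The helper names `rmse_score` and `mae_score` and all the numbers are hypothetical, chosen so that both sets of predictions have the same total absolute error.

```python
import numpy as np

def rmse_score(actual, predicted):
    """Square root of the mean squared error."""
    return np.sqrt(np.mean((predicted - actual) ** 2))

def mae_score(actual, predicted):
    """Mean of the absolute errors."""
    return np.mean(np.abs(predicted - actual))

# Made-up target values
actual = np.array([10.0, 10.0, 10.0, 10.0])

# Case A: every prediction is off by 2
pred_even = np.array([12.0, 12.0, 12.0, 12.0])

# Case B: same total absolute error (8), concentrated in one prediction
pred_outlier = np.array([10.0, 10.0, 10.0, 18.0])

print("Case A  MAE:", mae_score(actual, pred_even),
      " RMSE:", rmse_score(actual, pred_even))
print("Case B  MAE:", mae_score(actual, pred_outlier),
      " RMSE:", rmse_score(actual, pred_outlier))
```

MAE is 2.0 in both cases, while RMSE rises from 2.0 in Case A to 4.0 in Case B, because the single error of 8 contributes 64 to the sum of squares.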