What is Query Optimization?

Query optimization is the process of finding the most efficient way to execute a database query.

When you write a SQL query, you’re basically telling the database what data you want, but the database has to figure out how to actually retrieve it. That’s the main job of the query optimizer. The query optimizer is a dedicated component of the database management system (DBMS) that evaluates various possible execution paths and selects the most efficient one.

But there are also things we can do to help the query optimizer, such as writing efficient SQL, indexing tables properly, and keeping statistics up to date.

Understanding how the optimizer works and knowing how to steer it toward better execution plans is what we mean by query optimization.

How the Query Optimizer Works

When you submit a query, the database doesn’t just blindly execute it as written. Instead, it goes through several steps:

  1. The optimizer first parses your SQL to understand what you’re asking for.
  2. Then it generates multiple possible execution plans. These are different ways it could retrieve and process the data. Each plan might use different indexes, join methods, or access patterns.
  3. The optimizer estimates the cost of each plan based on factors like the number of rows it needs to scan, available indexes, and data distribution.
  4. Finally, it picks the plan with the lowest estimated cost.

This all happens in milliseconds, which is pretty remarkable when you consider how many possibilities the optimizer might be evaluating.

It’s a bit like a navigation app that calculates the best route based on real-time traffic, except the query optimizer is calculating the most efficient query plan.

Why Query Optimization Matters

Bad queries can absolutely tank your application’s performance if you’re not careful. A query that takes 10 seconds instead of 100 milliseconds might not seem like a big deal when you’re testing with a small dataset, but multiply that across thousands of users hitting your database simultaneously, and you’ve got a serious problem.

The optimizer can help by:

  • Reducing the amount of data scanned by using indexes effectively
  • Choosing efficient join algorithms based on table sizes
  • Reordering operations to filter data early and reduce intermediate result sets
  • Deciding when to use parallel processing

Even a well-written query can perform poorly if the optimizer makes bad decisions, which is why understanding optimization is important.

Viewing Query Execution Plans

Different DBMSs provide different ways to see what the optimizer is doing. Understanding how to view execution plans in your specific database is essential for diagnosing performance issues.

MySQL uses the EXPLAIN command to show the query plan, as does PostgreSQL:

EXPLAIN SELECT * FROM users WHERE premium_member = true;

For more detailed information including actual execution times and row counts, use EXPLAIN ANALYZE, which actually runs the query:

EXPLAIN ANALYZE SELECT * FROM users WHERE premium_member = true;
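
PostgreSQL also accepts options to EXPLAIN in parentheses. For example, adding BUFFERS to EXPLAIN ANALYZE reports how much data the query read from cache and disk, which is handy when comparing plans (this parenthesized syntax is PostgreSQL-specific; MySQL’s EXPLAIN doesn’t take these options):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE premium_member = true;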

SQL Server takes a different approach. You can enable the execution plan display before running your query:

SET SHOWPLAN_ALL ON;
GO
SELECT * FROM users WHERE premium_member = 1;
GO
SET SHOWPLAN_ALL OFF;
GO

Alternatively, use SET STATISTICS PROFILE ON for actual execution statistics.
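
That alternative follows the same pattern as the SHOWPLAN example above; a minimal sketch:

SET STATISTICS PROFILE ON;
GO
SELECT * FROM users WHERE premium_member = 1;
GO
SET STATISTICS PROFILE OFF;
GO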

Oracle uses a two-step process. First, you generate the plan, then you query it:

EXPLAIN PLAN FOR
SELECT * FROM users WHERE premium_member = 1;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
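
If you want Oracle to report what actually happened rather than the estimated plan, DBMS_XPLAN.DISPLAY_CURSOR can display the plan of the most recently executed statement in your session (exact behavior depends on your client settings and privileges):

SELECT * FROM users WHERE premium_member = 1;

-- Shows the plan of the last statement executed in this session
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR);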

Beyond command-line tools, most DBMSs offer graphical interfaces that display execution plans visually. SQL Server Management Studio, pgAdmin, MySQL Workbench, and Oracle SQL Developer all have built-in plan viewers that show query execution as flowcharts or tree diagrams. These GUI tools often make it easier to spot bottlenecks by highlighting expensive operations with visual cues like thicker lines or warning icons.

The execution plan output varies between databases, but they all show similar information, such as which tables are being accessed, what indexes are being used, the order of operations, and estimated costs. Learning to read these plans for your specific database is one of the most valuable skills for query optimization.

Example of Checking the Execution Plan

Suppose you’re working with the following database for a music streaming service with users, songs, and listening history:

-- Create sample tables
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    username VARCHAR(50),
    country VARCHAR(50),
    premium_member BOOLEAN
);

CREATE TABLE songs (
    song_id INT PRIMARY KEY,
    title VARCHAR(100),
    artist VARCHAR(100),
    duration_seconds INT,
    genre VARCHAR(50)
);

CREATE TABLE listening_history (
    history_id INT PRIMARY KEY,
    user_id INT,
    song_id INT,
    played_at TIMESTAMP,
    completed BOOLEAN,
    FOREIGN KEY (user_id) REFERENCES users(user_id),
    FOREIGN KEY (song_id) REFERENCES songs(song_id)
);

-- Insert sample data
INSERT INTO users VALUES
(1, 'alex_music', 'USA', true),
(2, 'beats_lover', 'Canada', false),
(3, 'melody_fan', 'UK', true),
(4, 'rhythm_king', 'USA', false),
(5, 'sound_seeker', 'Germany', true);

INSERT INTO songs VALUES
(101, 'Midnight Drive', 'The Wanderers', 245, 'Rock'),
(102, 'Digital Dreams', 'Synth Collective', 198, 'Electronic'),
(103, 'Ocean Waves', 'Calm Vibes', 312, 'Ambient'),
(104, 'City Lights', 'Urban Poets', 187, 'Hip Hop'),
(105, 'Mountain Echo', 'The Wanderers', 276, 'Rock');

INSERT INTO listening_history VALUES
(1, 1, 101, '2024-11-20 14:30:00', true),
(2, 1, 102, '2024-11-20 14:35:00', true),
(3, 2, 101, '2024-11-21 09:15:00', false),
(4, 3, 103, '2024-11-21 16:45:00', true),
(5, 1, 104, '2024-11-22 12:00:00', true),
(6, 4, 102, '2024-11-22 18:30:00', true),
(7, 5, 105, '2024-11-23 10:15:00', true),
(8, 2, 103, '2024-11-23 20:00:00', true),
(9, 3, 101, '2024-11-24 11:30:00', true);

Now let’s say you want to find all premium users who listened to rock songs. Here’s a straightforward query:

SELECT DISTINCT u.username, s.title
FROM users u
JOIN listening_history lh ON u.user_id = lh.user_id
JOIN songs s ON lh.song_id = s.song_id
WHERE u.premium_member = true
  AND s.genre = 'Rock'
  AND lh.completed = true;

The optimizer has several decisions to make here. Should it start by filtering users for premium members, or start with rock songs? Which tables should it join first?

The optimizer considers multiple factors. These could include:

  • the size of each table
  • the selectivity of the filters (what percentage of rows match)
  • available indexes
  • the cost of different join algorithms
  • statistics about data distribution

For example, if only 5% of users are premium but 40% of songs are rock, and there’s a good index on the users table, filtering users first might be the better choice. But with a small dataset like our example, the optimizer might choose a different approach since all the options are relatively cheap. The execution plan you see will vary based on your specific data characteristics and available indexes.
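
If you want a rough feel for that selectivity in your own data, a couple of quick aggregates against the sample tables show how many rows each filter would keep:

-- How selective is each filter?
SELECT premium_member, COUNT(*) FROM users GROUP BY premium_member;
SELECT genre, COUNT(*) FROM songs GROUP BY genre;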

Let’s go ahead and find out which execution plan is chosen for our query.

PostgreSQL’s Execution Plan

MySQL and PostgreSQL both provide the EXPLAIN command to view the query plan:

EXPLAIN SELECT DISTINCT u.username, s.title
FROM users u
JOIN listening_history lh ON u.user_id = lh.user_id
JOIN songs s ON lh.song_id = s.song_id
WHERE u.premium_member = true
  AND s.genre = 'Rock'
  AND lh.completed = true;

Here’s the output for PostgreSQL:

                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Unique (cost=41.82..41.85 rows=3 width=336)
   -> Sort (cost=41.82..41.83 rows=3 width=336)
        Sort Key: u.username, s.title
        -> Nested Loop (cost=11.79..41.80 rows=3 width=336)
             -> Hash Join (cost=11.64..40.46 rows=6 width=222)
                  Hash Cond: (lh.song_id = s.song_id)
                  -> Seq Scan on listening_history lh (cost=0.00..26.60 rows=830 width=8)
                       Filter: completed
                  -> Hash (cost=11.62..11.62 rows=1 width=222)
                       -> Seq Scan on songs s (cost=0.00..11.62 rows=1 width=222)
                            Filter: ((genre)::text = 'Rock'::text)
             -> Index Scan using users_pkey on users u (cost=0.15..0.22 rows=1 width=122)
                  Index Cond: (user_id = lh.user_id)
                  Filter: premium_member
(14 rows)

This shows you exactly how the database plans to execute your query: the order it’ll join tables in, which indexes it’ll use, and the estimated costs.

Reading PostgreSQL’s query plan can take a bit of getting used to if you haven’t done it before. You need to read bottom-up and inside-out (most indented operations happen first).

So looking at the execution plan, PostgreSQL is:

  1. Seq Scan on songs s (most indented, happens first) – filters for genre = 'Rock'
  2. Hash – builds a hash table from those rock songs
  3. Seq Scan on listening_history lh – filters for completed = true
  4. Hash Join – joins the listening history with the hash table of rock songs
  5. Index Scan on users u – for each result, looks up the user and filters for premium_member = true
  6. Nested Loop – performs the join between the hash join result and users
  7. Sort – sorts the results by username and title
  8. Unique – removes duplicates (for the DISTINCT)

MySQL’s Execution Plan

Here’s the output for MySQL:

+----+-------------+-------+------------+--------+-----------------+---------+---------+-----------------------------------+------+----------+------------------------------+
| id | select_type | table | partitions | type   | possible_keys   | key     | key_len | ref                               | rows | filtered | Extra                        |
+----+-------------+-------+------------+--------+-----------------+---------+---------+-----------------------------------+------+----------+------------------------------+
|  1 | SIMPLE      | u     | NULL       | ALL    | PRIMARY         | NULL    | NULL    | NULL                              |    5 |    20.00 | Using where; Using temporary |
|  1 | SIMPLE      | lh    | NULL       | ref    | user_id,song_id | user_id | 5       | sandbox_db.u.user_id              |    1 |    11.11 | Using where                  |
|  1 | SIMPLE      | s     | NULL       | eq_ref | PRIMARY         | PRIMARY | 4       | db_445xmv67t_445yrmzdp.lh.song_id |    1 |    20.00 | Using where                  |
+----+-------------+-------+------------+--------+-----------------+---------+---------+-----------------------------------+------+----------+------------------------------+

MySQL’s execution plan shows a completely different approach from PostgreSQL’s.

With MySQL execution plans, we read top-down (tables are listed in join order):

  1. users (u) – full table scan (type: ALL), filtering for premium_member = true (filtered = 20.00 means the optimizer estimates about 20% of the scanned rows will satisfy that condition)
  2. listening_history (lh) – uses the user_id index to find matching rows for each user from step 1, filters for completed = true (11.11% filtered)
  3. songs (s) – uses PRIMARY key lookup (type: eq_ref) for each song_id from step 2, filters for genre = 'Rock' (20.00% filtered)

The id column being 1 for all rows indicates they’re all part of the same SELECT, and the order listed is the join order.
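
If the tabular output is hard to follow, MySQL 8.0.16 and later can also print the plan as a tree with EXPLAIN FORMAT=TREE, which reads much more like PostgreSQL’s output:

EXPLAIN FORMAT=TREE
SELECT DISTINCT u.username, s.title
FROM users u
JOIN listening_history lh ON u.user_id = lh.user_id
JOIN songs s ON lh.song_id = s.song_id
WHERE u.premium_member = true
  AND s.genre = 'Rock'
  AND lh.completed = true;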

Two Different Approaches

So each DBMS took a different approach.

PostgreSQL’s approach:

  1. Filter songs for genre = 'Rock'
  2. Build a hash table from those songs
  3. Scan listening_history for completed = true
  4. Hash join listening_history with the rock songs
  5. For each result, look up the user and filter for premium_member = true

MySQL’s approach:

  1. Scan users and filter for premium_member = true
  2. For each premium user, look up their listening history (where completed = true)
  3. For each listening record, look up the song and filter for genre = 'Rock'

So PostgreSQL starts by filtering songs, while MySQL starts by filtering users. They’re processing the same query in essentially opposite directions.

This is a perfect illustration of how different database systems can choose different execution plans for the exact same query, even with identical data. And if we’d had more data in the tables, we might’ve seen different query plans.

However, the important thing to remember is that both DBMSs return the same correct result, just via different paths.

Helping the Optimizer Do Its Job

The query optimizer is sophisticated, but it relies on certain pieces of information to make good decisions. While it can analyze your query and examine your data, it doesn’t automatically know which columns you’ll search on most frequently, or whether your data distribution has changed since it last checked. Without this information, even a smart optimizer can choose inefficient execution plans.

You have control over several factors that directly influence the optimizer’s decisions. Creating appropriate indexes, maintaining up-to-date statistics, and writing clear queries can dramatically improve performance. Most databases also support query hints, which are explicit instructions that let you override the optimizer’s choices (although these should be used sparingly and only when you’ve identified a specific case where the optimizer consistently makes poor decisions).

Here are some of the main ways you can help with query optimization:

  • Create indexes. If you’re frequently filtering or joining on certain columns, create indexes on them. For our music streaming example, indexes on users.premium_member, songs.genre, and the foreign key columns would significantly speed up the query above, especially when there’s a lot of data (see the sketch after this list).
  • Keep statistics updated. Databases maintain statistics about data distribution, which can help the query optimizer make choices about which approach to take for a given query. Statistics include things like how many rows are in each table, how many distinct values are in each column, and what the data ranges are. If your data changes significantly and statistics are stale, the optimizer might make poor choices. Most databases have commands like ANALYZE, ANALYZE TABLE, or UPDATE STATISTICS to refresh this information.
  • Write clear queries. Sometimes the way you write a query can accidentally prevent the optimizer from doing its job. Using functions on indexed columns in WHERE clauses can disable index usage. For example, WHERE YEAR(played_at) = 2024 might not use an index on played_at, but WHERE played_at >= '2024-01-01' AND played_at < '2025-01-01' can.
  • Use query hints cautiously. Most databases allow you to provide hints that influence or override the optimizer’s decisions. This includes things like forcing a specific index or join method. Some DBMSs also have commands (like SQL Server’s FORCEPLAN) that let you override the query optimizer. Basically, you’re telling the optimizer how to do its job. However, these should be a last resort. They’re brittle because they lock in assumptions about your data that might not hold true as it changes over time. A hint that improves performance today might make things worse next month when your data distribution shifts. It’s almost always better to help the optimizer by adding indexes or updating statistics rather than bypassing it with hints. If you do use hints, document why you added them and review them periodically to ensure they’re still beneficial.
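
To make the first three points concrete, here’s a rough sketch of what they might look like for the music streaming schema above. The index names are just illustrative, and the statistics command shown is PostgreSQL’s ANALYZE (MySQL uses ANALYZE TABLE and SQL Server uses UPDATE STATISTICS):

-- Index the columns the example query filters and joins on (index names are arbitrary)
CREATE INDEX idx_users_premium ON users (premium_member);
CREATE INDEX idx_songs_genre ON songs (genre);
CREATE INDEX idx_history_user ON listening_history (user_id);
CREATE INDEX idx_history_song ON listening_history (song_id);

-- Refresh statistics so the optimizer sees the current data distribution
ANALYZE users;
ANALYZE songs;
ANALYZE listening_history;

-- Write predicates the optimizer can match to an index:
-- a range on played_at rather than a function wrapped around it
SELECT * FROM listening_history
WHERE played_at >= '2024-01-01' AND played_at < '2025-01-01';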

When the Optimizer Gets It Wrong

Despite its sophistication, the optimizer sometimes chooses a suboptimal plan. This can happen when statistics are outdated, when the query is very complex, or when the optimizer’s cost model doesn’t match reality for your specific workload.

As mentioned, most databases let you provide hints to influence the optimizer’s decisions, but this should be a last resort. It’s usually better to fix the root cause than to rely on hints that might become outdated as your data changes.
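
If you do reach that last resort, here’s roughly what a hint looks like in MySQL. This forces the optimizer to use a specific index instead of letting it choose (idx_history_user is the illustrative index name from the sketch in the previous section):

-- idx_history_user is a hypothetical index from the earlier sketch
SELECT *
FROM listening_history FORCE INDEX (idx_history_user)
WHERE user_id = 1
  AND completed = true;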

Query optimization is an ongoing process. As your data grows and changes, what worked yesterday might not work tomorrow. Monitoring query performance and understanding how the optimizer makes decisions will help you keep your database running smoothly.