
MySQL Optimizer Statistics: Why Your Queries Choose the Wrong Index

·1600 words·8 mins
NeatGuyCoding

Recently, I encountered another slow query issue that caught my attention. After a quick investigation, it turned out to be yet another case of MySQL’s optimizer making inaccurate estimations in its query planning. Here’s the problematic SQL:

SELECT * FROM t_pay_record
WHERE (
    user_id = 'user_id1'
    AND is_del = 0
)
ORDER BY id DESC
LIMIT 20

This query took a whopping 20 minutes to return results! However, when we switched to a different user_id, the execution was lightning fast. From our production environment observations, most users experienced normal performance. We even tested with users who had similar data distribution patterns, and those queries ran smoothly too.

Let’s start by examining the EXPLAIN output for our original problematic SQL:

+----+-------------+--------------+------------+-------+-----------------------------------------------------------------------------------------+---------+---------+------+-------+----------+-------------+
| id | select_type | table        | partitions | type  | possible_keys                                                                           | key     | key_len | ref  | rows  | filtered | Extra       |
+----+-------------+--------------+------------+-------+-----------------------------------------------------------------------------------------+---------+---------+------+-------+----------+-------------+
|  1 | SIMPLE      | t_pay_record | NULL       | index | idx_user_id,idx_user_status_pay,idx_user_id_trade_code_status_amount_create_time_is_del | PRIMARY | 8       | NULL | 22593 |     0.01 | Using where |
+----+-------------+--------------+------------+-------+-----------------------------------------------------------------------------------------+---------+---------+------+-------+----------+-------------+

Now, when we tested with users having similar data distribution but normal response times, we got different EXPLAIN results. Some showed:

+----+-------------+--------------+------------+-------+-----------------------------------------------------------------------------------------+---------------------------------------------------------+---------+------+-------+----------+-------------+
| id | select_type | table        | partitions | type  | possible_keys                                                                           | key                                                     | key_len | ref  | rows  | filtered | Extra       |
+----+-------------+--------------+------------+-------+-----------------------------------------------------------------------------------------+---------------------------------------------------------+---------+------+-------+----------+-------------+
|  1 | SIMPLE      | t_pay_record | NULL       | index | idx_user_id,idx_user_status_pay,idx_user_id_trade_code_status_amount_create_time_is_del | idx_user_id_trade_code_status_amount_create_time_is_del | 195     | NULL | 107561|     10.00| Using where |
+----+-------------+--------------+------------+-------+-----------------------------------------------------------------------------------------+---------------------------------------------------------+---------+------+-------+----------+-------------+

While others displayed:

+----+-------------+--------------+------------+-------+-----------------------------------------------------------------------------------------+-------------+---------+------+-------+----------+-------------+
| id | select_type | table        | partitions | type  | possible_keys                                                                           | key         | key_len | ref  | rows  | filtered | Extra       |
+----+-------------+--------------+------------+-------+-----------------------------------------------------------------------------------------+-------------+---------+------+-------+----------+-------------+
|  1 | SIMPLE      | t_pay_record | NULL       | index | idx_user_id,idx_user_status_pay,idx_user_id_trade_code_status_amount_create_time_is_del | idx_user_id | 195     | NULL |  87514|     10.00| Using where |
+----+-------------+--------------+------------+-------+-----------------------------------------------------------------------------------------+-------------+---------+------+-------+----------+-------------+

Based on these observations, it’s clear that the query was using the wrong index. But why would MySQL choose an inappropriate index? This happens due to multiple interconnected factors, and this article will dive deep into analyzing these causes while providing practical solutions.

Analyzing MySQL Slow Queries
#

In my previous articles, I’ve mentioned that SQL optimization typically relies on three essential tools:

  1. EXPLAIN: This provides a surface-level analysis without actually executing the SQL. While it might not always be completely accurate or detailed, it can reveal critical issues.

  2. PROFILING: Enabled with set profiling = 1, this tool samples SQL execution and breaks down the query into different phases with their respective timing. It requires actual SQL execution and success, though the phase breakdown isn’t always granular enough. It’s mainly useful for identifying and avoiding certain phases (like preventing in-memory sorting).

  3. OPTIMIZER TRACE: This tool provides a detailed view of every step the optimizer takes, requiring actual SQL execution and success. MySQL’s optimizer considers numerous factors through multiple iterations, making its configuration quite complex. While default settings work fine in most scenarios, special cases require manual intervention.
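For reference, here is a minimal sketch of how the latter two tools are enabled in a session (standard MySQL syntax; note that profiling is deprecated in MySQL 8.0 but still works):

-- PROFILING: phase-by-phase timing for statements run in this session
SET profiling = 1;
-- ... run the SQL under investigation ...
SHOW PROFILES;                -- list the profiled statements and their ids
SHOW PROFILE FOR QUERY 1;     -- phase breakdown for the statement with id 1

-- OPTIMIZER TRACE: detailed record of the optimizer's decisions
SET optimizer_trace = 'enabled=on';
-- ... run the SQL under investigation ...
SELECT * FROM information_schema.OPTIMIZER_TRACE;
SET optimizer_trace = 'enabled=off';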

It’s worth noting that EXPLAIN and OPTIMIZER TRACE don’t always agree, and their output can vary across MySQL versions due to inherent design limitations. EXPLAIN is closer to the plan that actually gets executed, while OPTIMIZER TRACE is more like checkpoint sampling of the optimization process; given MySQL’s continuous development iterations, some inconsistencies between the two are inevitable.

For our specific SQL case, EXPLAIN already reveals that it’s using the wrong index. However, to understand why this happens, we need to dig deeper with OPTIMIZER TRACE. Before we proceed with that analysis, let me explain MySQL’s InnoDB query optimizer statistics configuration.

MySQL InnoDB Optimizer Statistics Configuration
#

Official documentation: https://dev.mysql.com/doc/refman/8.0/en/innodb-persistent-stats.html

To optimize a user’s SQL query, MySQL performs SQL parsing, rewriting, and query plan optimization. For the InnoDB engine, building a query plan roughly involves:

  1. Estimating the cost of a full table scan
  2. Determining which indexes are usable for the WHERE and ORDER BY conditions
  3. Estimating the query cost of each candidate index
  4. Selecting and executing the plan with the lowest cost

Each index’s query cost is determined through InnoDB optimizer statistics. This data is collected by sampling table and index data—not through complete collection, but through statistical sampling. Several configuration parameters control this process:

  1. innodb_stats_persistent: This global variable controls whether statistics are persisted by default (ON by default). We generally can’t accept in-memory storage since database restarts would require table re-analysis, slowing startup times. Individual table control uses STATS_PERSISTENT (in CREATE TABLE and ALTER TABLE statements).

  2. innodb_stats_auto_recalc: This global variable controls automatic updates by default (ON by default), triggering background asynchronous updates when more than 10% of table rows are modified. Individual table control uses STATS_AUTO_RECALC (in CREATE TABLE and ALTER TABLE statements).

  3. innodb_stats_persistent_sample_pages: This global variable controls the number of pages sampled by default (20 by default). Each update randomly samples 20 pages from the table and each index to estimate query costs for each index and full table scans. Individual table control uses STATS_SAMPLE_PAGES (in CREATE TABLE and ALTER TABLE statements).
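For reference, here is a minimal way to inspect these settings and the persisted statistics themselves (the table name t_pay_record is the one from this article; the rest is standard MySQL 8.0 syntax):

-- Current global settings for InnoDB optimizer statistics
SHOW VARIABLES LIKE 'innodb_stats%';

-- Persisted statistics are stored in the mysql schema and can be queried directly
SELECT * FROM mysql.innodb_table_stats WHERE table_name = 't_pay_record';
SELECT * FROM mysql.innodb_index_stats WHERE table_name = 't_pay_record';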

Root Cause Analysis of the Slowest SQL Execution
#

From our previous EXPLAIN results, we know the final query used the PRIMARY key index. This means the entire SQL execution process involved: traversing each row in the table in reverse primary key order until finding 20 matching records. Given the execution time, we know this process examined many records before collecting 20 matches, making it extremely inefficient. But why did this happen?

Looking at our SQL statement, in the second step mentioned earlier, the considered indexes include those related to user_id and is_del from the WHERE conditions (as shown in EXPLAIN: idx_user_id,idx_user_status_pay,idx_user_id_trade_code_status_amount_create_time_is_del), as well as the id index from the ORDER BY condition (the primary key index). Suppose the randomly sampled pages looked like this:

[Diagram: the table's primary key and secondary indexes, with the randomly sampled pages highlighted in blue]

The blue sections represent sampled pages; by default, 20 pages are sampled from each index on the table. Assuming the sampling came out as in the diagram, the other indexes were sampled fairly evenly, so the optimizer estimated that using any of them would require scanning tens of thousands of rows. Meanwhile, the last page sampled from the primary key happened to contain this specific user's records at the very end of the table. Because the statement includes LIMIT 20, if the sample suggests there are 20 matching records right at the end, the optimizer concludes that scanning backwards through the primary key will find 20 matches almost immediately, and therefore that the primary key scan has the lowest cost. In reality this is wrong, because the sample doesn't represent the full picture: in a large table there may be an enormous stretch of records that don't belong to this user at all, which the reverse scan has to wade through first.

If we remove the LIMIT clause, EXPLAIN shows the correct index is chosen, because without limiting results, the primary key index would need to scan the entire table, making it impossible to have lower cost than user_id-related indexes.
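You can verify this on the same table by comparing the two plans directly; the statements below are simply the article's query with and without the LIMIT clause:

-- With LIMIT: the optimizer may favor a reverse scan of the primary key
EXPLAIN SELECT * FROM t_pay_record
WHERE user_id = 'user_id1' AND is_del = 0
ORDER BY id DESC
LIMIT 20;

-- Without LIMIT: the full reverse scan can no longer look cheap,
-- so a user_id-related index is chosen instead
EXPLAIN SELECT * FROM t_pay_record
WHERE user_id = 'user_id1' AND is_del = 0
ORDER BY id DESC;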

Why Different user_ids with Normal Execution Times Show Different Index Choices
#

Similarly, because all optimizer statistics are randomly sampled, the variance in estimated index costs grows as the table gets larger, the indexes get wider, and more complex composite indexes are added. With different parameters (different user_ids in our case), that variance shows up as different index choices.

This brings up another issue you might encounter: when adding a composite index on top of existing indexes (for example, originally having only idx_user_id, then adding idx_user_status_pay), SQL queries that previously only searched by user_id might sometimes use idx_user_id and sometimes use idx_user_status_pay. Using idx_user_status_pay will likely be slower than using idx_user_id. Therefore, adding new composite indexes might slow down other business SQL queries that weren’t meant to be optimized by the composite index, so this should be done carefully.

Problems This Design Creates as Data Volume Grows and Tables Become More Complex
#

  1. Since statistics aren’t updated in real-time but only when modified rows exceed a certain percentage, and statistics are sampled rather than comprehensive, these statistics can be quite inaccurate when table data volumes are large.

  2. Since statistics are inherently inaccurate, complex table designs with diverse data types, many fields, and especially various composite indexes make statistics even more inaccurate.

  3. As a side note: MySQL tables shouldn’t be too large and need proper horizontal partitioning, while fields shouldn’t be too numerous, requiring good vertical partitioning. Indexes shouldn’t be added carelessly—adding too many exacerbates statistical inaccuracy, leading to wrong index choices.

  4. Manual ANALYZE TABLE adds read locks to the table, blocking updates and transactions, so it can't be run freely against critical online business tables. For those tables, schedule ANALYZE during low-traffic periods instead.

  5. Relying on the automatic refresh mechanism makes the parameters hard to tune (mainly STATS_SAMPLE_PAGES; we generally won't change STATS_PERSISTENT, since in-memory statistics would force re-analysis and slow restarts, and we won't disable STATS_AUTO_RECALC, since the optimizer's estimates would only drift further from reality). It's hard to predict an optimal value in advance, and business growth and user-behavior-driven data skew are equally unpredictable. Moreover, using ALTER TABLE to change a table's STATS_SAMPLE_PAGES has the same effect as ANALYZE TABLE: it takes read locks and blocks updates and transactions, so it can't be run freely on critical online business tables either. The best option is to estimate a large table's eventual scale from the beginning, though this is admittedly difficult.
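As a sketch of the maintenance options discussed above (the table name is from this article, and 64 is an arbitrary example value, not a recommendation):

-- Re-collect statistics manually; schedule this during low-traffic periods,
-- since it blocks updates and transactions on the table
ANALYZE TABLE t_pay_record;

-- Raise the sampling depth for a single table; per the discussion above,
-- this ALTER has the same locking impact as ANALYZE TABLE
ALTER TABLE t_pay_record STATS_SAMPLE_PAGES = 64;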

Conclusions and Recommendations
#

In summary, for production tables with large data volumes, I recommend proactively controlling each table’s data volume through database and table partitioning. However, business growth and product requirements continuously iterate and become more complex, making it difficult to guarantee we won’t end up with large tables having complex indexes. In such situations, we need to appropriately increase STATS_SAMPLE_PAGES while using FORCE INDEX to guide critical user-triggered queries toward the correct indexes. This prevents the issues described in this article where inaccurate MySQL optimizer statistics cause certain user IDs to use wrong indexes.
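As a concrete illustration of the FORCE INDEX approach, here is the article's query pinned to the composite index that appeared in the EXPLAIN output above (a sketch only; the right index to force depends on your own schema and query):

SELECT * FROM t_pay_record
FORCE INDEX (idx_user_id_trade_code_status_amount_create_time_is_del)
WHERE user_id = 'user_id1'
  AND is_del = 0
ORDER BY id DESC
LIMIT 20;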
