Overview
Profiling, analyzing, and then fixing queries is probably the most often-repeated part of a DBA's job, and one that keeps evolving: as new features are added to the application, new queries pop up that need to be analyzed and fixed. There are not many tools out there that can make this easy. One that can is pt-query-digest (from Percona Toolkit), which provides all the data points you need to attack the right query in the right way.

Vanilla MySQL does have its limitations here: it reports only a small subset of statistics for each query. Percona Server, by comparison, can report extra statistics about the query's execution plan (whether the query cache was used, whether a filesort was needed, whether a temporary table was created in memory or on disk, whether a full scan was done, and so on) as well as InnoDB statistics (IO read operations, the number of unique pages the query accessed, how long the query waited for row locks, etc.). So Percona Server reports a plethora of useful information. Another great thing about Percona Server is the ability to enable slow query logging for all running connections, not just for new connections as in MySQL. This is very helpful for measurement, because otherwise we might not catch some long-running connections.
Now let’s get started.
Before We Start!
Before we start, make sure you have enabled slow query logging and set a low enough value for long_query_time. We normally use long_query_time=0, because any higher value, say 0.1 seconds, will miss all queries shorter than that, which may well be the majority of your workload. So if you want to analyze what causes load on your server, you are better off with a value of 0. But only keep it at 0 for a period long enough to gather sufficient statistics about the queries; after that, remember to set it back to a value greater than 0, because otherwise you can end up with a really large log file. Another thing we normally do is set log_slow_verbosity to 'full'. This variable is available in Percona Server and logs all the extra statistics mentioned in the overview above.
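As a rough sketch, the setup described above comes down to three variables (log_slow_verbosity exists only in Percona Server; check the documentation for the exact variable names in your version, and remember to restore long_query_time afterwards):

```sql
-- Log every query (threshold 0) with full per-query statistics.
SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 0;
SET GLOBAL log_slow_verbosity = 'full';  -- Percona Server only

-- ...gather statistics for a while, then restore a saner threshold
-- so the log file does not grow without bound:
SET GLOBAL long_query_time = 1;
```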
If you are using the vanilla MySQL server, you will see an entry like this in the slow query log:
# Time: 111229  3:02:26
# User@Host: msandbox[msandbox] @ localhost []
# Query_time: 2.365434  Lock_time: 0.000000  Rows_sent: 1  Rows_examined: 655360
use test;
SET timestamp=1325145746;
select count(*) from auto_inc;
Compare that to Percona Server with log_slow_verbosity=full:
# Time: 111229  3:11:26
# User@Host: msandbox[msandbox] @ localhost []
# Thread_id: 1  Schema: test  Last_errno: 0  Killed: 0
# Query_time: 0.117904  Lock_time: 0.002886  Rows_sent: 1  Rows_examined: 655360  Rows_affected: 0  Rows_read: 655361
# Bytes_sent: 68  Tmp_tables: 0  Tmp_disk_tables: 0  Tmp_table_sizes: 0
# InnoDB_trx_id: F00
# QC_Hit: No  Full_scan: Yes  Full_join: No  Tmp_table: No  Tmp_table_on_disk: No
# Filesort: No  Filesort_on_disk: No  Merge_passes: 0
# InnoDB_IO_r_ops: 984  InnoDB_IO_r_bytes: 16121856  InnoDB_IO_r_wait: 0.001414
# InnoDB_rec_lock_wait: 0.000000  InnoDB_queue_wait: 0.000000
# InnoDB_pages_distinct: 973
SET timestamp=1325146286;
select count(*) from auto_inc;
Note that, unlike the general query log, logging queries this way makes statistics available that are collected after the query has actually executed; no such statistics exist for queries logged via the general query log, which records them before execution.
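pt-query-digest parses these entries for you, but the attribute format itself is simple: each "#" comment line is a series of "Name: value" pairs. A toy sketch (sample values copied from the Percona Server entry above):

```python
import re

# A few attribute lines from the Percona Server slow-log entry shown above.
entry = """\
# Query_time: 0.117904  Lock_time: 0.002886  Rows_sent: 1  Rows_examined: 655360
# QC_Hit: No  Full_scan: Yes  Full_join: No  Tmp_table: No  Tmp_table_on_disk: No
# InnoDB_IO_r_ops: 984  InnoDB_IO_r_bytes: 16121856  InnoDB_IO_r_wait: 0.001414
"""

# Every attribute is a "Name: value" pair; pull them all into a dict.
attrs = dict(re.findall(r"(\w+): (\S+)", entry))

print(attrs["Query_time"], attrs["Rows_examined"], attrs["Full_scan"])
# prints: 0.117904 655360 Yes
```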
Installing the pt-query-digest tool (as well as the other tools from Percona Toolkit) is very easy and is explained in the Percona Toolkit installation documentation.
Before we move forward, I would like to point out that the results shown in this blog post come from queries gathered from a Percona Server instance running on my personal Amazon micro instance.
Using pt-query-digest
Using pt-query-digest is pretty straightforward:
pt-query-digest /path/to/slow-query.log
Note that executing pt-query-digest can be quite CPU- and memory-intensive, so ideally you should download the slow query log to another machine and run the tool there.
Analyzing pt-query-digest Output
Now let's see what output it returns. The first part of the output is an overall summary:
# 8.1s user time, 60ms system time, 26.23M rss, 62.49M vsz
# Current date: Thu Dec 29 07:09:32 2011
# Hostname: somehost.net
# Files: slow-query.log.1
# Overall: 20.08k total, 167 unique, 16.04 QPS, 0.01x concurrency ________
# Time range: 2011-12-28 18:42:47 to 19:03:39
# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time             8s     1us    44ms   403us   541us     2ms    98us
# Lock time          968ms       0    11ms    48us   119us   134us    36us
# Rows sent        105.76k       0    1000    5.39    9.83   32.69       0
# Rows examine     539.46k       0  15.65k   27.52   34.95  312.56       0
# Rows affecte       1.34k       0      65    0.07       0    1.35       0
# Rows read        105.76k       0    1000    5.39    9.83   32.69       0
# Bytes sent        46.63M      11 191.38k   2.38k   6.63k  11.24k  202.40
# Merge passes           0       0       0       0       0       0       0
# Tmp tables         1.37k       0      61    0.07       0    0.91       0
# Tmp disk tbl         490       0      10    0.02       0    0.20       0
# Tmp tbl size      72.52M       0 496.09k   3.70k       0  34.01k       0
# Query size         3.50M      13   2.00k  182.86  346.17  154.34   84.10
# InnoDB:
# IO r bytes        96.00k       0  32.00k   20.86       0  816.04       0
# IO r ops               6       0       2    0.00       0    0.05       0
# IO r wait           64ms       0    26ms    13us       0   530us       0
# pages distin      28.96k       0      48    6.29   38.53   10.74    1.96
# queue wait             0       0       0       0       0       0       0
# rec lock wai           0       0       0       0       0       0       0
# Boolean:
# Filesort        4% yes,  95% no
# Filesort on     0% yes,  99% no
# Full scan       4% yes,  95% no
# QC Hit          0% yes,  99% no
# Tmp table       4% yes,  95% no
# Tmp table on    2% yes,  97% no
It tells you that a total of 20.08k queries were captured, which are actually invocations of 167 distinct queries. Following that are summaries of various data points, such as total and average query execution time, the number of temporary tables created in memory vs. on disk, the percentage of queries that needed a full scan, InnoDB IO statistics, and so on. One suggestion here: give more weight to the values reported in the 95% (95th percentile) column, as they give a more accurate picture than averages, which can be skewed by outliers. For example, the summary shows that at the 95th percentile a query reads approximately 38.53 distinct InnoDB pages (meaning 616.48K of data), yet 95% of the time InnoDB IO r ops is 0, which means those pages are being accessed in memory. What it does tell us, though, is that if these queries ran on a cold MySQL instance, they would require reading nearly 40 pages from disk.
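The 616.48K figure follows directly from the page count, assuming InnoDB's default 16KB page size:

```python
# 95th-percentile distinct InnoDB pages touched per query, from the
# "pages distin" row of the summary above.
pages_distinct = 38.53
PAGE_SIZE_KB = 16  # InnoDB default page size is 16KB

data_kb = pages_distinct * PAGE_SIZE_KB
print(f"{data_kb:.2f}K of data")  # 38.53 pages * 16KB = 616.48K
```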
Let's analyze the next part of the output produced by pt-query-digest.
# Profile
# Rank Query ID           Response time Calls R/Call Apdx V/M   Item
# ==== ================== ============= ===== ====== ==== ===== ==========
#    1 0x92F3B1B361FB0E5B  4.0522 50.0%   312 0.0130 1.00  0.00 SELECT wp_options
#    2 0xE71D28F50D128F0F  0.8312 10.3%  6412 0.0001 1.00  0.00 SELECT poller_output poller_item
#    3 0x211901BF2E1C351E  0.6811  8.4%  6416 0.0001 1.00  0.00 SELECT poller_time
#    4 0xA766EE8F7AB39063  0.2805  3.5%   149 0.0019 1.00  0.00 SELECT wp_terms wp_term_taxonomy wp_term_relationships
#    5 0xA3EEB63EFBA42E9B  0.1999  2.5%    51 0.0039 1.00  0.00 SELECT UNION wp_pp_daily_summary wp_pp_hourly_summary wp_pp_hits wp_posts
#    6 0x94350EA2AB8AAC34  0.1956  2.4%    89 0.0022 1.00  0.01 UPDATE wp_options
#    7 0x7AEDF19FDD3A33F1  0.1381  1.7%   909 0.0002 1.00  0.00 SELECT wp_options
#    8 0x4C16888631FD8EDB  0.1160  1.4%     5 0.0232 1.00  0.00 SELECT film
#    9 0xCFC0642B5BBD9AC7  0.0987  1.2%    50 0.0020 1.00  0.01 SELECT UNION wp_pp_daily_summary wp_pp_hourly_summary wp_pp_hits
#   10 0x88BA308B9C0EB583  0.0905  1.1%     4 0.0226 1.00  0.01 SELECT poller_item
#   11 0xD0A520C9DB2D6AC7  0.0850  1.0%   125 0.0007 1.00  0.00 SELECT wp_links wp_term_relationships wp_term_taxonomy
#   12 0x30DA85C940E0D491  0.0835  1.0%   542 0.0002 1.00  0.00 SELECT wp_posts
#   13 0x8A52FE35D340A347  0.0767  0.9%     4 0.0192 1.00  0.00 TRUNCATE TABLE poller_time
#   14 0x3E84BF7C0C2A3005  0.0624  0.8%   272 0.0002 1.00  0.00 SELECT wp_postmeta
#   15 0xA01053DA94ED829E  0.0567  0.7%   213 0.0003 1.00  0.00 SELECT data_template_rrd data_input_fields
#   16 0xBE797E1DD5E4222F  0.0524  0.6%    79 0.0007 1.00  0.00 SELECT wp_posts
#   17 0xF8EC4434E0061E89  0.0475  0.6%    62 0.0008 1.00  0.00 SELECT wp_terms wp_term_taxonomy
#   18 0xCDFFAD848B0C1D52  0.0465  0.6%     9 0.0052 1.00  0.01 SELECT wp_posts wp_term_relationships
#   19 0x5DE709416871BF99  0.0454  0.6%   260 0.0002 1.00  0.00 DELETE poller_output
#   20 0x428A588445FE580B  0.0449  0.6%   260 0.0002 1.00  0.00 INSERT poller_output
# MISC 0xMISC              0.8137 10.0%  3853 0.0002   NS   0.0 <147 ITEMS>
The above part of the output ranks the queries, showing first the ones with the largest impact: the longest total run time, which typically (though not always) identifies the queries causing the highest load on the server. As we can see, the one causing the highest load is "SELECT wp_options". This fingerprint is simply a unique way of identifying the query; it means a SELECT executed against the wp_options table. Another thing to note is the last line of the output, the # MISC part: it tells you how much of the "load" is not covered by the top queries. Here MISC accounts for only 10%, which means that by reviewing these top 20 queries we have essentially reviewed most of the load.
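Under the hood, this ranking is simply "group queries by fingerprint, sum the response time, sort descending". A toy sketch with a few numbers copied from the profile above (the R/Call values are rounded in the report, so the computed shares will not exactly match the report's percentages, which are taken out of the full 8.1s across all 167 queries):

```python
# (fingerprint, calls, response time per call in seconds) for the
# top three queries from the profile above.
profile = [
    ("SELECT poller_output poller_item", 6412, 0.0001),
    ("SELECT wp_options", 312, 0.0130),
    ("SELECT poller_time", 6416, 0.0001),
]

# Rank by total response time (calls x R/Call), descending.
ranked = sorted(profile, key=lambda q: q[1] * q[2], reverse=True)
total = sum(calls * r_call for _, calls, r_call in profile)
for name, calls, r_call in ranked:
    share = 100 * calls * r_call / total
    print(f"{name}: {calls * r_call:.4f}s ({share:.1f}%)")
```

Even though "SELECT wp_options" is called far less often than the poller queries, its high per-call cost puts it at the top of the ranking.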
Now let's take a look at the most important part of the output:
# Query 1: 0.26 QPS, 0.00x concurrency, ID 0x92F3B1B361FB0E5B at byte 14081299
# This item is included in the report because it matches --limit.
# Scores: Apdex = 1.00 [1.0], V/M = 0.00
# Query_time sparkline: |   _^   |
# Time range: 2011-12-28 18:42:47 to 19:03:10
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          1     312
# Exec time     50      4s     5ms    25ms    13ms    20ms     4ms    12ms
# Lock time      3    32ms    43us   163us   103us   131us    19us    98us
# Rows sent     59  62.41k     203     231  204.82  202.40    3.99  202.40
# Rows examine  13  73.63k     238     296  241.67  246.02   10.15  234.30
# Rows affecte   0       0       0       0       0       0       0       0
# Rows read     59  62.41k     203     231  204.82  202.40    3.99  202.40
# Bytes sent    53  24.85M  46.52k  84.36k  81.56k  83.83k   7.31k  79.83k
# Merge passes   0       0       0       0       0       0       0       0
# Tmp tables     0       0       0       0       0       0       0       0
# Tmp disk tbl   0       0       0       0       0       0       0       0
# Tmp tbl size   0       0       0       0       0       0       0       0
# Query size     0  21.63k      71      71      71      71       0      71
# InnoDB:
# IO r bytes     0       0       0       0       0       0       0       0
# IO r ops       0       0       0       0       0       0       0       0
# IO r wait      0       0       0       0       0       0       0       0
# pages distin  40  11.77k      34      44   38.62   38.53    1.87   38.53
# queue wait     0       0       0       0       0       0       0       0
# rec lock wai   0       0       0       0       0       0       0       0
# Boolean:
# Full scan    100% yes,   0% no
# String:
# Databases    wp_blog_one (264/84%), wp_blog_tw… (36/11%)... 1 more
# Hosts
# InnoDB trxID 86B40B (1/0%), 86B430 (1/0%), 86B44A (1/0%)... 309 more
# Last errno   0
# Users        wp_blog_one (264/84%), wp_blog_two (36/11%)... 1 more
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms  #################
#  10ms  ################################################################
# 100ms
#   1s
#  10s+
# Tables
#    SHOW TABLE STATUS FROM `wp_blog_one ` LIKE 'wp_options'\G
#    SHOW CREATE TABLE `wp_blog_one `.`wp_options`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes'\G
This is the part of the output dealing with the analysis of the query taking up the longest total run time, the query ranked #1. The first row shows the Count, the number of times this query was executed. Now look at the values in the 95% column: this query takes 20ms at the 95th percentile, sends 202 rows and 83.83k of data per execution, and examines 246 rows each time. Another important thing shown here is that every execution reads approximately 38.53 distinct InnoDB pages (meaning 616.48k of data), and you can also see that this query does a full scan every time it runs. The "Databases" section shows the names of the databases where this query was executed. Next, the "Query_time distribution" section shows how this query's execution times are distributed; the majority lie in the range >= 10ms and < 100ms. Finally, the "Tables" section lists the statements you can use to gather more data about the underlying tables involved and the query execution plan used by MySQL.
The end result might be that you limit the number of rows the query returns, either with a LIMIT clause or by filtering on the option_name column, or you might even compress the values stored in the option_value column so that less data is read and sent.
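To get a feel for the possible win from compressing option_value, you can check how well a typical value compresses. The sample value here is made up (real values would come from SELECT option_value FROM wp_options), but wp_options rows are often serialized PHP, which is highly repetitive and compresses well:

```python
import zlib

# Hypothetical serialized-PHP-style option value, repeated to mimic a
# large, repetitive wp_options row.
value = ('a:3:{s:6:"widget";a:2:{s:5:"title";s:7:"Example";s:4:"text";'
         's:5:"hello";}' * 40).encode()

compressed = zlib.compress(value)
ratio = len(compressed) / len(value)
print(len(value), "->", len(compressed), f"({ratio:.0%} of original)")
```

If the ratio is small, compressing the column (or compressing in the application before storing) directly reduces the bytes read and sent per execution.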
Let’s analyze another query, this time query ranked #4 by pt-query-digest.
# Query 4: 0.12 QPS, 0.00x concurrency, ID 0xA766EE8F7AB39063 at byte 4001761
# This item is included in the report because it matches --limit.
# Scores: Apdex = 1.00 [1.0], V/M = 0.00
# Query_time sparkline: |   .^_  |
# Time range: 2011-12-28 18:42:47 to 19:02:57
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          0     149
# Exec time      3   281ms   534us    17ms     2ms     4ms     2ms     1ms
# Lock time      1    14ms    56us   179us    92us   159us    29us    80us
# Rows sent      8   9.01k       0     216   61.90  202.40   59.80   24.84
# Rows examine   8  45.05k       0   1.05k  309.59 1012.63  299.07  124.25
# Rows affecte   0       0       0       0       0       0       0       0
# Rows read      8   9.01k       0     216   61.90  202.40   59.80   24.84
# Bytes sent     1 622.17k     694  13.27k   4.18k  11.91k   3.35k   2.06k
# Merge passes   0       0       0       0       0       0       0       0
# Tmp tables    10     149       1       1       1       1       0       1
# Tmp disk tbl  30     149       1       1       1       1       0       1
# Tmp tbl size  16  12.00M       0 287.72k  82.47k 270.35k  79.86k  33.17k
# Query size     1  45.68k     286     345  313.91  329.68   13.06  313.99
# InnoDB:
# IO r bytes     0       0       0       0       0       0       0       0
# IO r ops       0       0       0       0       0       0       0       0
# IO r wait      0       0       0       0       0       0       0       0
# pages distin   2     683       2       7    4.58    6.98    1.29    4.96
# queue wait     0       0       0       0       0       0       0       0
# rec lock wai   0       0       0       0       0       0       0       0
# Boolean:
# Filesort     100% yes,   0% no
# Tmp table    100% yes,   0% no
# Tmp table on 100% yes,   0% no
# String:
# Databases    wp_blog_one (105/70%), wp_blog_tw... (34/22%)... 1 more
# Hosts
# InnoDB trxID 86B40F (1/0%), 86B429 (1/0%), 86B434 (1/0%)... 146 more
# Last errno   0
# Users        wp_blog_one (105/70%), wp_blog_two (34/22%)... 1 more
# Query_time distribution
#   1us
#  10us
# 100us  ###################
#   1ms  ################################################################
#  10ms  #
# 100ms
#   1s
#  10s+
# Tables
#    SHOW TABLE STATUS FROM `wp_blog_one ` LIKE 'wp_terms'\G
#    SHOW CREATE TABLE `wp_blog_one `.`wp_terms`\G
#    SHOW TABLE STATUS FROM `wp_blog_one ` LIKE 'wp_term_taxonomy'\G
#    SHOW CREATE TABLE `wp_blog_one `.`wp_term_taxonomy`\G
#    SHOW TABLE STATUS FROM `wp_blog_one ` LIKE 'wp_term_relationships'\G
#    SHOW CREATE TABLE `wp_blog_one `.`wp_term_relationships`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT t.*, tt.*, tr.object_id FROM wp_terms AS t INNER JOIN wp_term_taxonomy AS tt ON tt.term_id = t.term_id INNER JOIN wp_term_relationships AS tr ON tr.term_taxonomy_id = tt.term_taxonomy_id WHERE tt.taxonomy IN ('category', 'post_tag', 'post_format') AND tr.object_id IN (733) ORDER BY t.name ASC\G
Let's again take a look at the 95% column in the above output. The query execution time is 4ms, and the query sends 202 rows while having to examine 1012 rows per execution. What is interesting here is that this query needs to do a filesort 100% of the time and also creates an on-disk temporary table on every execution. Those are the two things you would most likely want to fix about this query. The temporary table size needed per query is 270.35k, which is not much considering that tmp_table_size is set to 32M on this server, so the on-disk tables are probably being created because the query accesses BLOB or TEXT columns, which MySQL cannot store in an in-memory temporary table. A quick fix could therefore be to select only the needed columns, excluding the BLOB/TEXT ones, instead of selecting every column from all the tables involved in the query.
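As an illustrative sketch, the wp_terms query from above could be rewritten with an explicit column list. The column names below are from a stock WordPress schema (verify against SHOW CREATE TABLE on your install); the point is to leave out wp_term_taxonomy.description, a LONGTEXT column that forces the implicit temporary table to disk:

```sql
-- Instead of t.*, tt.*: name only the columns the application needs,
-- leaving out tt.description (LONGTEXT) so the temporary table used
-- for the sort can stay in memory.
SELECT t.term_id, t.name, t.slug, t.term_group,
       tt.term_taxonomy_id, tt.taxonomy, tt.parent, tt.count,
       tr.object_id
FROM wp_terms AS t
INNER JOIN wp_term_taxonomy AS tt
        ON tt.term_id = t.term_id
INNER JOIN wp_term_relationships AS tr
        ON tr.term_taxonomy_id = tt.term_taxonomy_id
WHERE tt.taxonomy IN ('category', 'post_tag', 'post_format')
  AND tr.object_id IN (733)
ORDER BY t.name ASC;
```

After a change like this, re-run pt-query-digest and check that "Tmp table on disk" for this fingerprint has dropped to 0%.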
Conclusion
The only conclusion I can draw is: get yourself Percona Server, turn on log_slow_verbosity, and start using pt-query-digest. Your job of identifying the queries producing the most load will become that much simpler.
The post Identifying the load with the help of pt-query-digest and Percona Server appeared first on MySQL Performance Blog.