MySQL 监控与运维
监控体系概述
数据库监控是确保 MySQL 系统稳定运行的重要手段,通过实时收集、分析和展示各种性能指标,帮助运维人员及时发现问题、预防故障,并为性能优化提供数据支撑。完善的监控体系不仅能够提高系统可用性,还能显著降低运维成本和业务风险。
监控的重要性
故障预防与快速响应
早期预警机制:
- 通过监控关键指标的变化趋势,在问题恶化前发出预警
- 识别性能瓶颈和资源不足的早期征象
- 预测系统容量需求,提前进行扩容规划
- 发现异常访问模式和潜在的安全威胁
故障快速定位:
- 提供详细的性能数据和历史趋势分析
- 帮助快速识别故障根因和影响范围
- 缩短平均故障恢复时间(MTTR)
- 减少故障对业务的影响时间和范围
性能优化指导:
- 识别系统性能瓶颈和优化机会
- 提供数据驱动的优化决策支持
- 验证优化措施的效果
- 建立性能基线和对比标准
业务价值体现
服务质量保障:
- 确保数据库服务的稳定性和可用性
- 维护用户体验和客户满意度
- 支撑业务的持续增长和发展
- 降低因系统故障导致的业务损失
运维效率提升:
- 自动化监控减少人工巡检工作量
- 智能告警避免无效报警和漏报
- 统一监控平台提高运维效率
- 历史数据分析支持运维决策
成本控制优化:
- 合理规划资源配置,避免过度投资
- 及时发现资源浪费和优化机会
- 延长硬件设备的使用寿命
- 降低因故障导致的损失和恢复成本
监控层次架构
基础设施监控
硬件资源监控:
- CPU 使用率:整体使用率、各核心负载分布、上下文切换频率
- 内存使用情况:物理内存、虚拟内存、缓存和缓冲区使用
- 磁盘 I/O 性能:读写速度、IOPS、队列深度、响应时间
- 网络流量:带宽使用率、数据包丢失率、连接数统计
操作系统监控:
- 系统负载:平均负载、进程数量、线程状态分布
- 文件系统:磁盘空间使用率、inode 使用情况
- 系统进程:MySQL 进程状态、资源占用情况
- 系统日志:内核日志、系统错误信息
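上述操作系统层面的指标大多可以直接从 Linux 内核接口读取。下面是一个仅依赖标准库的示意脚本(非完整监控实现,其中的路径和阈值均为假设),演示如何解析 /proc/loadavg 获取平均负载,以及计算文件系统空间使用率:

```python
#!/usr/bin/env python3
# 示意:用标准库采集部分操作系统指标(仅适用于 Linux,路径为假设值)
import os
import shutil

def parse_loadavg(text):
    """解析 /proc/loadavg 内容,返回 1/5/15 分钟平均负载"""
    fields = text.split()
    return tuple(float(x) for x in fields[:3])

def disk_usage_percent(path):
    """计算指定路径所在文件系统的空间使用率(%)"""
    usage = shutil.disk_usage(path)
    return round(usage.used / usage.total * 100, 2)

if __name__ == "__main__" and os.path.exists('/proc/loadavg'):
    with open('/proc/loadavg') as f:
        load1, load5, load15 = parse_loadavg(f.read())
    print(f"load average: {load1} {load5} {load15}")
    # MySQL 数据目录通常为 /var/lib/mysql,此处以当前目录示意
    print(f"disk usage: {disk_usage_percent('.')}%")
```

实际部署时可将采集结果写入时序数据库,或由 node_exporter 等成熟采集器完成。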
数据库服务监控
连接管理监控:
- 连接数统计:当前连接数、最大连接数、连接使用率
- 连接质量:连接建立速度、连接超时情况
- 用户会话:活跃会话数、长时间运行的会话
- 连接池状态:连接池大小、空闲连接数、等待队列长度
查询性能监控:
- 查询执行统计:查询数量、平均执行时间、慢查询数量
- SQL 语句分析:最频繁的查询、最耗时的查询
- 索引使用情况:索引命中率、全表扫描次数
- 锁等待情况:锁等待时间、死锁发生频率
存储引擎监控:
- InnoDB 状态:缓冲池命中率、日志写入速度、页面读写统计
- 事务处理:事务数量、回滚率、长事务监控
- 表空间使用:数据文件大小、自动扩展情况
- 复制状态:主从延迟、复制错误、二进制日志状态
应用层监控
业务指标监控:
- 关键业务操作:订单处理量、用户登录次数、支付成功率
- 数据一致性:数据校验结果、主从数据一致性检查
- 业务流程监控:关键业务流程的执行时间和成功率
- 用户体验指标:页面响应时间、接口调用成功率
应用性能监控:
- 数据库连接池:连接获取时间、连接泄漏检测
- ORM 性能:SQL 生成效率、N+1 查询检测
- 缓存效果:缓存命中率、缓存更新频率
- 异常监控:数据库异常、连接超时、查询错误
监控指标体系
可用性指标
服务可用性:
- 服务状态:MySQL 服务是否正常运行
- 连接可用性:是否能够正常建立数据库连接
- 响应能力:基本查询的响应时间
- 故障恢复时间:从故障发生到服务恢复的时间
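服务可用性与故障恢复时间可以直接量化:可用性 = (总时长 − 停机时长) / 总时长。下面的小例子(数值为假设)演示这一计算,并换算常见 SLA 等级允许的月度停机时长:

```python
# 示意:由累计停机时长计算可用性,并换算 SLA 允许的月度停机分钟数
def availability(total_seconds, downtime_seconds):
    """可用性 = (总时长 - 停机时长) / 总时长"""
    return (total_seconds - downtime_seconds) / total_seconds

def allowed_monthly_downtime_minutes(sla):
    """给定 SLA(如 0.999),返回每月(按 30 天计)允许的停机分钟数"""
    month_minutes = 30 * 24 * 60
    return round(month_minutes * (1 - sla), 1)

# 假设一个月内累计停机 45 分钟
a = availability(30 * 24 * 3600, 45 * 60)
print(f"可用性: {a * 100:.3f}%")               # 约 99.896%
print(allowed_monthly_downtime_minutes(0.999))  # 43.2
```

由此可见,"三个 9"(99.9%)的目标意味着每月停机不能超过约 43 分钟,这也是制定 MTTR 目标时的常用参照。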
数据可用性:
- 数据完整性:数据是否完整无损
- 数据一致性:主从数据是否一致
- 备份可用性:备份文件是否可用
- 恢复能力:数据恢复的速度和成功率
性能指标
吞吐量指标:
- QPS(每秒查询数):系统每秒处理的查询数量
- TPS(每秒事务数):系统每秒处理的事务数量
- 连接吞吐量:每秒建立和关闭的连接数
- 数据传输量:每秒传输的数据量
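QPS/TPS 并不能从单次 SHOW GLOBAL STATUS 直接读出——Questions、Com_commit 等都是自启动以来的累计计数器,需要对两次采样做差分。下面是一个示意计算(采样数值为假设):

```python
# 示意:由两次 SHOW GLOBAL STATUS 采样的累计计数器差分计算 QPS/TPS
def rate_per_second(prev, curr, interval_seconds):
    """由累计计数器的两次采样计算每秒速率"""
    return (curr - prev) / interval_seconds

def qps_tps(prev_status, curr_status, interval_seconds):
    qps = rate_per_second(prev_status['Questions'], curr_status['Questions'], interval_seconds)
    # TPS 常用 Com_commit + Com_rollback 估算
    tps = rate_per_second(
        prev_status['Com_commit'] + prev_status['Com_rollback'],
        curr_status['Com_commit'] + curr_status['Com_rollback'],
        interval_seconds
    )
    return qps, tps

# 两次间隔 10 秒的采样(数值为示例)
prev = {'Questions': 100000, 'Com_commit': 5000, 'Com_rollback': 50}
curr = {'Questions': 103000, 'Com_commit': 5200, 'Com_rollback': 52}
print(qps_tps(prev, curr, 10))  # (300.0, 20.2)
```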
响应时间指标:
- 平均响应时间:查询的平均执行时间
- 95% 响应时间:95% 的查询在此时间内完成
- 99% 响应时间:99% 的查询在此时间内完成
- 最大响应时间:最长的查询执行时间
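P95/P99 这类分位数指标比平均值更能反映尾部延迟。以下示意代码用最简单的最近邻秩(nearest-rank)法从一组耗时样本计算分位数,生产环境的监控系统通常采用直方图或近似算法:

```python
# 示意:最近邻秩法计算响应时间分位数(样本数值为假设)
import math

def percentile(samples, p):
    """取排序后第 ceil(p/100 * n) 个样本作为第 p 百分位"""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

latencies_ms = [12, 15, 11, 240, 18, 14, 16, 13, 95, 17]
print(percentile(latencies_ms, 95))  # 240
print(percentile(latencies_ms, 50))  # 15
```

这个例子也说明了为什么要看 P95/P99:平均值约 45 ms,但 5% 的请求耗时高达 240 ms。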
资源利用率指标:
- CPU 利用率:数据库进程的 CPU 使用情况
- 内存利用率:缓冲池和其他内存结构的使用情况
- 磁盘利用率:磁盘空间和 I/O 资源的使用情况
- 网络利用率:网络带宽的使用情况
错误指标
错误率统计:
- 查询错误率:失败查询占总查询的比例
- 连接错误率:连接失败占总连接尝试的比例
- 事务回滚率:回滚事务占总事务的比例
- 复制错误率:复制过程中发生错误的频率
异常事件监控:
- 慢查询数量:执行时间超过阈值的查询数量
- 死锁发生次数:系统中死锁的发生频率
- 连接超时次数:连接建立或查询超时的次数
- 内存不足事件:内存分配失败的次数
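错误率同样需要基于计数器差分计算。以下示意代码(状态变量名为真实的 MySQL 状态变量,采样数值与告警阈值为假设)演示如何由两次采样计算事务回滚率与连接失败率:

```python
# 示意:由状态计数器差分计算事务回滚率与连接失败率
def ratio(part, total):
    """计算百分比,分母为 0 时返回 0.0"""
    return round(part / total * 100, 2) if total else 0.0

def error_metrics(prev, curr):
    commits = curr['Com_commit'] - prev['Com_commit']
    rollbacks = curr['Com_rollback'] - prev['Com_rollback']
    conns = curr['Connections'] - prev['Connections']
    aborted = curr['Aborted_connects'] - prev['Aborted_connects']
    return {
        'rollback_rate_pct': ratio(rollbacks, commits + rollbacks),
        'connect_error_rate_pct': ratio(aborted, conns),
    }

prev = {'Com_commit': 9000, 'Com_rollback': 80, 'Connections': 500, 'Aborted_connects': 3}
curr = {'Com_commit': 9900, 'Com_rollback': 100, 'Connections': 600, 'Aborted_connects': 5}
print(error_metrics(prev, curr))  # {'rollback_rate_pct': 2.17, 'connect_error_rate_pct': 2.0}
```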
性能监控
关键性能指标
查询性能指标
查询执行统计:
MySQL 提供了丰富的性能统计信息,通过 SHOW STATUS 命令可以查看各种查询执行统计:
sql
-- 查看查询执行统计
SHOW GLOBAL STATUS LIKE 'Com_%';
-- 重要指标说明:
-- Com_select: SELECT 语句执行次数
-- Com_insert: INSERT 语句执行次数
-- Com_update: UPDATE 语句执行次数
-- Com_delete: DELETE 语句执行次数
-- Com_commit: COMMIT 操作次数
-- Com_rollback: ROLLBACK 操作次数
-- 查看查询缓存统计(查询缓存已在 MySQL 8.0 中移除,以下仅适用于 5.7 及更早版本)
SHOW GLOBAL STATUS LIKE 'Qcache_%';
-- 查看慢查询统计
SHOW GLOBAL STATUS LIKE 'Slow_queries';
-- 查看连接统计
SHOW GLOBAL STATUS LIKE 'Connections';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
SHOW GLOBAL STATUS LIKE 'Threads_%';
性能模式监控:
MySQL 5.5 引入(5.6 起默认启用)的 Performance Schema 提供了更详细的性能监控信息:
sql
-- performance_schema 是只读变量,不能用 SET GLOBAL 动态开启,
-- 需在 my.cnf 的 [mysqld] 段中设置 performance_schema=ON 并重启生效
-- 确认是否已启用
SHOW VARIABLES LIKE 'performance_schema';
-- 查看最耗时的 SQL 语句
SELECT
DIGEST_TEXT,
COUNT_STAR as exec_count,
-- TIMER_WAIT 系列字段以皮秒(10^-12 秒)为单位
AVG_TIMER_WAIT/1000000000000 as avg_time_sec,
MAX_TIMER_WAIT/1000000000000 as max_time_sec,
SUM_TIMER_WAIT/1000000000000 as total_time_sec
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
-- 查看表的 I/O 统计
SELECT
OBJECT_SCHEMA,
OBJECT_NAME,
COUNT_READ,
COUNT_WRITE,
COUNT_FETCH,
COUNT_INSERT,
COUNT_UPDATE,
COUNT_DELETE
FROM performance_schema.table_io_waits_summary_by_table
WHERE OBJECT_SCHEMA NOT IN ('mysql', 'performance_schema', 'information_schema')
ORDER BY COUNT_READ + COUNT_WRITE DESC
LIMIT 10;
-- 查看索引使用统计
SELECT
OBJECT_SCHEMA,
OBJECT_NAME,
INDEX_NAME,
COUNT_FETCH,
COUNT_INSERT,
COUNT_UPDATE,
COUNT_DELETE
FROM performance_schema.table_io_waits_summary_by_index_usage
WHERE OBJECT_SCHEMA NOT IN ('mysql', 'performance_schema', 'information_schema')
ORDER BY COUNT_FETCH DESC
LIMIT 10;
实时性能监控:
sql
-- 查看当前正在执行的查询
SELECT
ID,
USER,
HOST,
DB,
COMMAND,
TIME,
STATE,
INFO
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND != 'Sleep'
ORDER BY TIME DESC;
-- 查看锁等待情况(MySQL 5.7;8.0 中 information_schema.innodb_lock_waits 已移除,请改用 sys.innodb_lock_waits 或 performance_schema.data_lock_waits)
SELECT
r.trx_id waiting_trx_id,
r.trx_mysql_thread_id waiting_thread,
r.trx_query waiting_query,
b.trx_id blocking_trx_id,
b.trx_mysql_thread_id blocking_thread,
b.trx_query blocking_query
FROM information_schema.innodb_lock_waits w
INNER JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
INNER JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;
-- 查看 InnoDB 状态
SHOW ENGINE INNODB STATUS\G
资源使用监控
内存使用监控:
sql
-- 查看缓冲池状态
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_%';
-- 重要指标:
-- Innodb_buffer_pool_pages_total: 缓冲池总页数
-- Innodb_buffer_pool_pages_free: 空闲页数
-- Innodb_buffer_pool_pages_data: 数据页数
-- Innodb_buffer_pool_pages_dirty: 脏页数
-- Innodb_buffer_pool_read_requests: 读请求次数
-- Innodb_buffer_pool_reads: 物理读次数
-- 计算缓冲池命中率
SELECT
ROUND(
(1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)) * 100, 2
) AS buffer_pool_hit_rate
FROM (
SELECT
VARIABLE_VALUE AS Innodb_buffer_pool_reads
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads'
) AS reads,
(
SELECT
VARIABLE_VALUE AS Innodb_buffer_pool_read_requests
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests'
) AS requests;
-- 注:MySQL 5.6 及更早版本使用 INFORMATION_SCHEMA.GLOBAL_STATUS
-- 查看内存使用详情(MySQL 5.7+)
SELECT
EVENT_NAME,
CURRENT_NUMBER_OF_BYTES_USED/1024/1024 AS current_mb,
HIGH_NUMBER_OF_BYTES_USED/1024/1024 AS high_mb
FROM performance_schema.memory_summary_global_by_event_name
WHERE CURRENT_NUMBER_OF_BYTES_USED > 0
ORDER BY CURRENT_NUMBER_OF_BYTES_USED DESC
LIMIT 10;
磁盘 I/O 监控:
sql
-- 查看文件 I/O 统计
SELECT
FILE_NAME,
EVENT_NAME,
COUNT_READ,
COUNT_WRITE,
SUM_NUMBER_OF_BYTES_READ/1024/1024 AS read_mb,
SUM_NUMBER_OF_BYTES_WRITE/1024/1024 AS write_mb
FROM performance_schema.file_summary_by_instance
WHERE EVENT_NAME LIKE '%innodb%'
ORDER BY SUM_NUMBER_OF_BYTES_READ + SUM_NUMBER_OF_BYTES_WRITE DESC
LIMIT 10;
-- 查看表空间使用情况
SELECT
TABLE_SCHEMA,
TABLE_NAME,
ROUND(DATA_LENGTH/1024/1024, 2) AS data_mb,
ROUND(INDEX_LENGTH/1024/1024, 2) AS index_mb,
ROUND((DATA_LENGTH + INDEX_LENGTH)/1024/1024, 2) AS total_mb
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA NOT IN ('mysql', 'performance_schema', 'information_schema', 'sys')
ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC
LIMIT 10;
监控工具
原生监控工具
mysqladmin 工具:
bash
# 查看服务器状态
mysqladmin -u root -p status
# 持续监控服务器状态
mysqladmin -u root -p -i 5 status
# 查看进程列表
mysqladmin -u root -p processlist
# 查看扩展状态
mysqladmin -u root -p extended-status
# 查看变量
mysqladmin -u root -p variables
# 刷新日志
mysqladmin -u root -p flush-logs
# 检查服务器是否存活
mysqladmin -u root -p ping
MySQL Workbench:
- 图形化的性能监控界面
- 实时性能仪表板
- 查询分析和优化建议
- 服务器配置管理
- 用户和权限管理
MySQL Enterprise Monitor:
- 企业级监控解决方案
- 自动化监控和告警
- 性能趋势分析
- 查询分析器
- 配置建议和最佳实践
第三方监控工具
Prometheus + Grafana:
yaml
# prometheus.yml 配置
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'mysql'
    static_configs:
      - targets: ['localhost:9104']
    scrape_interval: 5s
    metrics_path: /metrics
bash
# 安装 MySQL Exporter
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.14.0/mysqld_exporter-0.14.0.linux-amd64.tar.gz
tar xvf mysqld_exporter-0.14.0.linux-amd64.tar.gz
# 创建监控用户
mysql -u root -p -e "
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'password';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
"
# 启动 MySQL Exporter
./mysqld_exporter --config.my-cnf=/etc/mysql/my.cnf
Zabbix 监控:
xml
<!-- MySQL 监控模板 -->
<template>
    <name>MySQL Server</name>
    <items>
        <item>
            <name>MySQL Status</name>
            <key>mysql.ping</key>
            <type>0</type>
            <value_type>3</value_type>
            <delay>60</delay>
        </item>
        <item>
            <name>MySQL Queries per second</name>
            <key>mysql.qps</key>
            <type>0</type>
            <value_type>0</value_type>
            <delay>60</delay>
        </item>
        <item>
            <name>MySQL Connections</name>
            <key>mysql.connections</key>
            <type>0</type>
            <value_type>3</value_type>
            <delay>60</delay>
        </item>
    </items>
    <triggers>
        <trigger>
            <expression>{MySQL Server:mysql.ping.last()}=0</expression>
            <name>MySQL is down</name>
            <priority>5</priority>
        </trigger>
    </triggers>
</template>
Nagios 监控:
bash
# 安装 MySQL 插件
wget https://github.com/nagios-plugins/nagios-plugins/archive/release-2.3.3.tar.gz
tar xzf release-2.3.3.tar.gz
cd nagios-plugins-release-2.3.3
./configure --with-mysql
make && make install
# 配置监控项
define service {
use generic-service
host_name mysql-server
service_description MySQL Status
check_command check_mysql!username!password
normal_check_interval 5
retry_check_interval 1
}
define service {
use generic-service
host_name mysql-server
service_description MySQL Slave Status
check_command check_mysql_slave!username!password
normal_check_interval 5
retry_check_interval 1
}
自定义监控脚本
Python 监控脚本:
python
#!/usr/bin/env python3
import mysql.connector
import time
import json
import logging
from datetime import datetime

class MySQLMonitor:
    def __init__(self, config):
        self.config = config
        self.connection = None
        self.logger = self._setup_logger()

    def _setup_logger(self):
        logger = logging.getLogger('mysql_monitor')
        logger.setLevel(logging.INFO)
        handler = logging.FileHandler('/var/log/mysql_monitor.log')
        formatter = logging.Formatter(
            '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
        )
        handler.setFormatter(formatter)
        logger.addHandler(handler)
        return logger

    def connect(self):
        try:
            self.connection = mysql.connector.connect(
                host=self.config['host'],
                user=self.config['user'],
                password=self.config['password'],
                database=self.config['database']
            )
            return True
        except Exception as e:
            self.logger.error(f"Connection failed: {e}")
            return False

    def get_status_variables(self):
        cursor = self.connection.cursor()
        cursor.execute("SHOW GLOBAL STATUS")
        status = {}
        for name, value in cursor.fetchall():
            try:
                status[name] = int(value)
            except ValueError:
                status[name] = value
        cursor.close()
        return status

    def get_innodb_status(self):
        cursor = self.connection.cursor()
        cursor.execute("SHOW ENGINE INNODB STATUS")
        result = cursor.fetchone()
        innodb_status = result[2] if result else ""
        cursor.close()
        return innodb_status

    def get_processlist(self):
        cursor = self.connection.cursor(dictionary=True)
        cursor.execute("""
            SELECT ID, USER, HOST, DB, COMMAND, TIME, STATE, INFO
            FROM INFORMATION_SCHEMA.PROCESSLIST
            WHERE COMMAND != 'Sleep'
            ORDER BY TIME DESC
        """)
        processes = cursor.fetchall()
        cursor.close()
        return processes

    def get_slow_queries(self):
        cursor = self.connection.cursor(dictionary=True)
        # TIMER_WAIT 以皮秒为单位,除以 10^12 转换为秒
        cursor.execute("""
            SELECT
                DIGEST_TEXT,
                COUNT_STAR as exec_count,
                AVG_TIMER_WAIT/1000000000000 as avg_time_sec,
                MAX_TIMER_WAIT/1000000000000 as max_time_sec,
                SUM_TIMER_WAIT/1000000000000 as total_time_sec
            FROM performance_schema.events_statements_summary_by_digest
            WHERE AVG_TIMER_WAIT/1000000000000 > 1
            ORDER BY SUM_TIMER_WAIT DESC
            LIMIT 10
        """)
        slow_queries = cursor.fetchall()
        cursor.close()
        return slow_queries

    def calculate_metrics(self, status):
        metrics = {
            'timestamp': datetime.now().isoformat(),
            'connections': {
                'current': status.get('Threads_connected', 0),
                'max_used': status.get('Max_used_connections', 0),
                'total': status.get('Connections', 0)
            },
            'queries': {
                'total': status.get('Queries', 0),
                'select': status.get('Com_select', 0),
                'insert': status.get('Com_insert', 0),
                'update': status.get('Com_update', 0),
                'delete': status.get('Com_delete', 0)
            },
            'innodb': {
                'buffer_pool_hit_rate': self._calculate_buffer_pool_hit_rate(status),
                'pages_read': status.get('Innodb_buffer_pool_reads', 0),
                'pages_written': status.get('Innodb_buffer_pool_pages_flushed', 0)
            },
            'slow_queries': status.get('Slow_queries', 0)
        }
        return metrics

    def _calculate_buffer_pool_hit_rate(self, status):
        reads = status.get('Innodb_buffer_pool_reads', 0)
        read_requests = status.get('Innodb_buffer_pool_read_requests', 0)
        if read_requests > 0:
            hit_rate = (1 - (reads / read_requests)) * 100
            return round(hit_rate, 2)
        return 0

    def monitor(self):
        if not self.connect():
            return None
        try:
            status = self.get_status_variables()
            metrics = self.calculate_metrics(status)
            # 检查告警条件
            self._check_alerts(metrics)
            # 记录监控数据
            self.logger.info(f"Metrics: {json.dumps(metrics)}")
            return metrics
        except Exception as e:
            self.logger.error(f"Monitoring failed: {e}")
            return None
        finally:
            if self.connection:
                self.connection.close()

    def _check_alerts(self, metrics):
        # 连接数告警
        if metrics['connections']['current'] > 80:
            self.logger.warning(
                f"High connection count: {metrics['connections']['current']}"
            )
        # 缓冲池命中率告警
        if metrics['innodb']['buffer_pool_hit_rate'] < 95:
            self.logger.warning(
                f"Low buffer pool hit rate: {metrics['innodb']['buffer_pool_hit_rate']}%"
            )
        # 慢查询告警
        if metrics['slow_queries'] > 100:
            self.logger.warning(
                f"High slow query count: {metrics['slow_queries']}"
            )

# 使用示例
if __name__ == "__main__":
    config = {
        'host': 'localhost',
        'user': 'monitor_user',
        'password': 'monitor_password',
        'database': 'mysql'
    }
    monitor = MySQLMonitor(config)
    while True:
        metrics = monitor.monitor()
        if metrics:
            print(json.dumps(metrics, indent=2))
        time.sleep(60)  # 每分钟监控一次
Shell 监控脚本:
bash
#!/bin/bash
# MySQL 监控脚本

# 配置参数
MYSQL_USER="monitor_user"
MYSQL_PASS="monitor_password"
MYSQL_HOST="localhost"
MYSQL_PORT="3306"
LOG_FILE="/var/log/mysql_monitor.log"
ALERT_EMAIL="admin@example.com"

# 日志函数
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> $LOG_FILE
}

# 发送告警
send_alert() {
    local subject="$1"
    local message="$2"
    echo "$message" | mail -s "$subject" $ALERT_EMAIL
    log_message "ALERT: $subject - $message"
}

# 检查 MySQL 服务状态
check_mysql_status() {
    if ! mysqladmin -h $MYSQL_HOST -P $MYSQL_PORT -u $MYSQL_USER -p$MYSQL_PASS ping >/dev/null 2>&1; then
        send_alert "MySQL Service Down" "MySQL service is not responding on $MYSQL_HOST:$MYSQL_PORT"
        return 1
    fi
    return 0
}

# 检查连接数
check_connections() {
    local current_connections=$(mysql -h $MYSQL_HOST -P $MYSQL_PORT -u $MYSQL_USER -p$MYSQL_PASS -e "SHOW STATUS LIKE 'Threads_connected';" | awk 'NR==2 {print $2}')
    local max_connections=$(mysql -h $MYSQL_HOST -P $MYSQL_PORT -u $MYSQL_USER -p$MYSQL_PASS -e "SHOW VARIABLES LIKE 'max_connections';" | awk 'NR==2 {print $2}')
    local usage_percent=$((current_connections * 100 / max_connections))
    log_message "Connections: $current_connections/$max_connections ($usage_percent%)"
    if [ $usage_percent -gt 80 ]; then
        send_alert "High Connection Usage" "Connection usage is $usage_percent% ($current_connections/$max_connections)"
    fi
}

# 检查慢查询
check_slow_queries() {
    local slow_queries=$(mysql -h $MYSQL_HOST -P $MYSQL_PORT -u $MYSQL_USER -p$MYSQL_PASS -e "SHOW STATUS LIKE 'Slow_queries';" | awk 'NR==2 {print $2}')
    # 读取上次的慢查询数量
    local last_slow_queries=0
    if [ -f "/tmp/last_slow_queries" ]; then
        last_slow_queries=$(cat /tmp/last_slow_queries)
    fi
    local new_slow_queries=$((slow_queries - last_slow_queries))
    log_message "Slow queries: $slow_queries (new: $new_slow_queries)"
    if [ $new_slow_queries -gt 10 ]; then
        send_alert "High Slow Query Rate" "$new_slow_queries new slow queries detected"
    fi
    echo $slow_queries > /tmp/last_slow_queries
}

# 检查复制状态
check_replication() {
    local slave_status=$(mysql -h $MYSQL_HOST -P $MYSQL_PORT -u $MYSQL_USER -p$MYSQL_PASS -e "SHOW SLAVE STATUS\G" 2>/dev/null)
    if [ -n "$slave_status" ]; then
        local io_running=$(echo "$slave_status" | grep "Slave_IO_Running:" | awk '{print $2}')
        local sql_running=$(echo "$slave_status" | grep "Slave_SQL_Running:" | awk '{print $2}')
        local seconds_behind=$(echo "$slave_status" | grep "Seconds_Behind_Master:" | awk '{print $2}')
        log_message "Replication - IO: $io_running, SQL: $sql_running, Lag: $seconds_behind seconds"
        if [ "$io_running" != "Yes" ] || [ "$sql_running" != "Yes" ]; then
            send_alert "Replication Error" "Slave IO: $io_running, SQL: $sql_running"
        fi
        if [ "$seconds_behind" != "NULL" ] && [ $seconds_behind -gt 300 ]; then
            send_alert "High Replication Lag" "Replication lag is $seconds_behind seconds"
        fi
    fi
}

# 检查磁盘空间
check_disk_space() {
    local mysql_datadir=$(mysql -h $MYSQL_HOST -P $MYSQL_PORT -u $MYSQL_USER -p$MYSQL_PASS -e "SHOW VARIABLES LIKE 'datadir';" | awk 'NR==2 {print $2}')
    local disk_usage=$(df $mysql_datadir | tail -1 | awk '{print $5}' | sed 's/%//')
    log_message "Disk usage: $disk_usage%"
    if [ $disk_usage -gt 85 ]; then
        send_alert "High Disk Usage" "Disk usage is $disk_usage% for MySQL data directory"
    fi
}

# 主监控函数
main() {
    log_message "Starting MySQL monitoring check"
    # 检查 MySQL 服务状态
    if ! check_mysql_status; then
        exit 1
    fi
    # 执行各项检查
    check_connections
    check_slow_queries
    check_replication
    check_disk_space
    log_message "MySQL monitoring check completed"
}

# 执行主函数
main
日志管理
日志类型
错误日志
错误日志记录了 MySQL 服务器启动、运行和关闭过程中的重要信息,包括错误、警告和注意事项。这是故障诊断的重要信息源。
错误日志配置:
ini
# my.cnf 错误日志配置
[mysqld]
# 启用错误日志
log-error=/var/log/mysql/error.log
# 日志级别设置
log-error-verbosity=3 # 1=错误, 2=错误+警告, 3=错误+警告+信息
# 日志时间戳格式
log-timestamps=SYSTEM # SYSTEM 或 UTC
错误日志分析:
bash
# 查看最近的错误
tail -f /var/log/mysql/error.log
# 搜索特定错误
grep -i "error" /var/log/mysql/error.log
grep -i "warning" /var/log/mysql/error.log
grep -i "crash" /var/log/mysql/error.log
# 统计错误类型
grep -i "error" /var/log/mysql/error.log | awk '{print $4}' | sort | uniq -c | sort -nr
# 查看启动信息
grep -i "ready for connections" /var/log/mysql/error.log
# 查看关闭信息
grep -i "shutdown" /var/log/mysql/error.log
慢查询日志
慢查询日志记录执行时间超过指定阈值的 SQL 语句,是性能优化的重要工具。
慢查询日志配置:
ini
# my.cnf 慢查询日志配置
[mysqld]
# 启用慢查询日志
slow-query-log=1
slow-query-log-file=/var/log/mysql/slow.log
# 慢查询时间阈值(秒)
long-query-time=2
# 记录未使用索引的查询
log-queries-not-using-indexes=1
# 限制未使用索引查询的记录频率
log-throttle-queries-not-using-indexes=10
# 记录管理语句
log-slow-admin-statements=1
# 记录从库上的慢查询
log-slow-slave-statements=1
慢查询日志分析:
bash
# 使用 mysqldumpslow 分析慢查询日志
mysqldumpslow -s t -t 10 /var/log/mysql/slow.log # 按时间排序,显示前10条
mysqldumpslow -s c -t 10 /var/log/mysql/slow.log # 按次数排序,显示前10条
mysqldumpslow -s l -t 10 /var/log/mysql/slow.log # 按锁定时间排序
# 分析特定时间段的慢查询
mysqldumpslow -s t -t 10 -g "2023-12-01" /var/log/mysql/slow.log
# 过滤特定用户的慢查询
mysqldumpslow -s t -t 10 -g "user_name" /var/log/mysql/slow.log
# 使用 pt-query-digest 进行详细分析
pt-query-digest /var/log/mysql/slow.log > slow_query_report.txt
# 实时监控慢查询
tail -f /var/log/mysql/slow.log | grep -A 10 "Query_time"
慢查询分析脚本:
python
#!/usr/bin/env python3
import re
import sys
from collections import defaultdict

class SlowQueryAnalyzer:
    def __init__(self, log_file):
        self.log_file = log_file
        self.queries = []
        self.stats = defaultdict(list)

    def parse_log(self):
        with open(self.log_file, 'r') as f:
            content = f.read()
        # 分割每个查询记录
        query_blocks = re.split(r'# Time: ', content)[1:]
        for block in query_blocks:
            query_info = self._parse_query_block(block)
            if query_info:
                self.queries.append(query_info)
                self._update_stats(query_info)

    def _parse_query_block(self, block):
        lines = block.strip().split('\n')
        if len(lines) < 3:
            return None
        # 解析时间
        time_match = re.match(r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z)', lines[0])
        if not time_match:
            return None
        timestamp = time_match.group(1)
        # 解析查询统计信息
        user_host_line = None
        query_time_line = None
        sql_text = []
        for i, line in enumerate(lines[1:], 1):
            if line.startswith('# User@Host:'):
                user_host_line = line
            elif line.startswith('# Query_time:'):
                query_time_line = line
            elif not line.startswith('#'):
                sql_text.extend(lines[i:])
                break
        if not query_time_line:
            return None
        # 解析查询时间和其他统计信息
        query_time_match = re.search(r'Query_time: ([\d.]+)', query_time_line)
        lock_time_match = re.search(r'Lock_time: ([\d.]+)', query_time_line)
        rows_sent_match = re.search(r'Rows_sent: (\d+)', query_time_line)
        rows_examined_match = re.search(r'Rows_examined: (\d+)', query_time_line)
        return {
            'timestamp': timestamp,
            'query_time': float(query_time_match.group(1)) if query_time_match else 0,
            'lock_time': float(lock_time_match.group(1)) if lock_time_match else 0,
            'rows_sent': int(rows_sent_match.group(1)) if rows_sent_match else 0,
            'rows_examined': int(rows_examined_match.group(1)) if rows_examined_match else 0,
            'user_host': user_host_line,
            'sql_text': ' '.join(sql_text).strip()
        }

    def _update_stats(self, query_info):
        # 简化 SQL 语句用于分组
        normalized_sql = self._normalize_sql(query_info['sql_text'])
        self.stats[normalized_sql].append({
            'query_time': query_info['query_time'],
            'lock_time': query_info['lock_time'],
            'rows_sent': query_info['rows_sent'],
            'rows_examined': query_info['rows_examined']
        })

    def _normalize_sql(self, sql):
        # 移除具体的值,保留 SQL 结构
        sql = re.sub(r'\b\d+\b', 'N', sql)
        sql = re.sub(r"'[^']*'", "'S'", sql)
        sql = re.sub(r'"[^"]*"', '"S"', sql)
        sql = re.sub(r'\s+', ' ', sql)
        return sql.strip()[:200]  # 限制长度

    def generate_report(self):
        print("=== 慢查询分析报告 ===")
        print(f"总慢查询数量: {len(self.queries)}")
        print(f"不同查询模式数量: {len(self.stats)}")
        print()
        # 按总执行时间排序
        sorted_stats = sorted(
            self.stats.items(),
            key=lambda x: sum(q['query_time'] for q in x[1]),
            reverse=True
        )
        print("=== 最耗时的查询模式(前10个)===")
        for i, (sql, queries) in enumerate(sorted_stats[:10], 1):
            total_time = sum(q['query_time'] for q in queries)
            avg_time = total_time / len(queries)
            max_time = max(q['query_time'] for q in queries)
            count = len(queries)
            print(f"{i}. 执行次数: {count}, 总时间: {total_time:.2f}s, 平均时间: {avg_time:.2f}s, 最大时间: {max_time:.2f}s")
            print(f"   SQL: {sql[:100]}...")
            print()
        # 按执行次数排序
        sorted_by_count = sorted(
            self.stats.items(),
            key=lambda x: len(x[1]),
            reverse=True
        )
        print("=== 最频繁的查询模式(前10个)===")
        for i, (sql, queries) in enumerate(sorted_by_count[:10], 1):
            count = len(queries)
            avg_time = sum(q['query_time'] for q in queries) / count
            print(f"{i}. 执行次数: {count}, 平均时间: {avg_time:.2f}s")
            print(f"   SQL: {sql[:100]}...")
            print()

# 使用示例
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python3 slow_query_analyzer.py <slow_query_log_file>")
        sys.exit(1)
    analyzer = SlowQueryAnalyzer(sys.argv[1])
    analyzer.parse_log()
    analyzer.generate_report()
二进制日志
二进制日志记录了所有修改数据的语句,主要用于复制和数据恢复。
二进制日志配置:
ini
# my.cnf 二进制日志配置
[mysqld]
# 启用二进制日志
log-bin=mysql-bin
# 二进制日志格式
binlog-format=ROW # ROW, STATEMENT, MIXED
# 二进制日志同步策略
sync-binlog=1 # 每次事务提交都同步到磁盘
# 二进制日志过期时间(MySQL 8.0 起此参数已废弃,改用 binlog_expire_logs_seconds)
expire-logs-days=7
# 最大二进制日志文件大小
max-binlog-size=1G
# 二进制日志缓存大小
binlog-cache-size=1M
# 启用 GTID
gtid-mode=ON
enforce-gtid-consistency=ON
二进制日志管理:
sql
-- 查看二进制日志文件
SHOW BINARY LOGS;
-- 查看当前二进制日志状态
SHOW MASTER STATUS;
-- 查看二进制日志事件
SHOW BINLOG EVENTS IN 'mysql-bin.000001';
-- 刷新二进制日志
FLUSH LOGS;
-- 清理二进制日志
PURGE BINARY LOGS TO 'mysql-bin.000010';
PURGE BINARY LOGS BEFORE '2023-12-01 00:00:00';
-- 重置二进制日志(谨慎使用)
RESET MASTER;
二进制日志分析:
bash
# 使用 mysqlbinlog 查看二进制日志内容
mysqlbinlog mysql-bin.000001
# 查看指定时间范围的日志
mysqlbinlog --start-datetime="2023-12-01 10:00:00" \
--stop-datetime="2023-12-01 12:00:00" \
mysql-bin.000001
# 查看指定位置范围的日志
mysqlbinlog --start-position=1000 --stop-position=5000 mysql-bin.000001
# 基于行的日志解析(显示具体的数据变更)
mysqlbinlog --base64-output=DECODE-ROWS -v mysql-bin.000001
# 生成可执行的 SQL 语句
mysqlbinlog mysql-bin.000001 > binlog_statements.sql
# 分析特定数据库的变更
mysqlbinlog --database=mydb mysql-bin.000001
通用查询日志
通用查询日志记录所有的 SQL 语句和连接信息,主要用于审计和调试。
通用查询日志配置:
ini
# my.cnf 通用查询日志配置
[mysqld]
# 启用通用查询日志
general-log=1
general-log-file=/var/log/mysql/general.log
# 或者输出到表中
# log-output=TABLE
动态启用/禁用:
sql
-- 动态启用通用查询日志
SET GLOBAL general_log = 'ON';
SET GLOBAL general_log_file = '/var/log/mysql/general.log';
-- 动态禁用通用查询日志
SET GLOBAL general_log = 'OFF';
-- 查看日志设置
SHOW VARIABLES LIKE 'general_log%';
-- 如果输出到表中,可以查询日志内容
SELECT * FROM mysql.general_log ORDER BY event_time DESC LIMIT 10;
日志轮转
自动日志轮转
使用 logrotate:
bash
# 创建 MySQL 日志轮转配置
sudo vim /etc/logrotate.d/mysql
# 配置内容
/var/log/mysql/*.log {
daily
missingok
rotate 52
compress
delaycompress
notifempty
create 640 mysql mysql
sharedscripts
postrotate
if test -x /usr/bin/mysqladmin && \
/usr/bin/mysqladmin ping &>/dev/null
then
/usr/bin/mysqladmin flush-logs
fi
endscript
}
# 测试配置
sudo logrotate -d /etc/logrotate.d/mysql
# 强制执行轮转
sudo logrotate -f /etc/logrotate.d/mysql
MySQL 内置日志轮转:
sql
-- 刷新所有日志文件
FLUSH LOGS;
-- 刷新错误日志
FLUSH ERROR LOGS;
-- 刷新慢查询日志
FLUSH SLOW LOGS;
-- 刷新通用查询日志
FLUSH GENERAL LOGS;
-- 刷新二进制日志
FLUSH BINARY LOGS;
日志清理脚本
bash
#!/bin/bash
# MySQL 日志清理脚本
# 配置参数
LOG_DIR="/var/log/mysql"
RETENTION_DAYS=30
MYSQL_USER="root"
MYSQL_PASS="password"
# 清理错误日志
clean_error_logs() {
echo "Cleaning error logs older than $RETENTION_DAYS days..."
find $LOG_DIR -name "error.log.*" -mtime +$RETENTION_DAYS -delete
}
# 清理慢查询日志
clean_slow_logs() {
echo "Cleaning slow query logs older than $RETENTION_DAYS days..."
find $LOG_DIR -name "slow.log.*" -mtime +$RETENTION_DAYS -delete
}
# 清理通用查询日志
clean_general_logs() {
echo "Cleaning general logs older than $RETENTION_DAYS days..."
find $LOG_DIR -name "general.log.*" -mtime +$RETENTION_DAYS -delete
}
# 清理二进制日志
clean_binary_logs() {
echo "Cleaning binary logs older than $RETENTION_DAYS days..."
# 计算清理日期
PURGE_DATE=$(date -d "$RETENTION_DAYS days ago" '+%Y-%m-%d %H:%M:%S')
# 执行清理
mysql -u $MYSQL_USER -p$MYSQL_PASS -e \
"PURGE BINARY LOGS BEFORE '$PURGE_DATE';"
}
# 压缩旧日志文件
compress_old_logs() {
echo "Compressing old log files..."
find $LOG_DIR -name "*.log.*" -not -name "*.gz" -mtime +1 -exec gzip {} \;
}
# 主函数
main() {
echo "Starting MySQL log cleanup - $(date)"
clean_error_logs
clean_slow_logs
clean_general_logs
clean_binary_logs
compress_old_logs
echo "MySQL log cleanup completed - $(date)"
}
# 执行清理
main
故障诊断
常见故障类型
连接问题
连接数耗尽:
sql
-- 查看当前连接数
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Max_used_connections';
SHOW VARIABLES LIKE 'max_connections';
-- 查看连接详情
SELECT
USER,
HOST,
DB,
COMMAND,
TIME,
STATE,
COUNT(*) as connection_count
FROM INFORMATION_SCHEMA.PROCESSLIST
GROUP BY USER, HOST, DB, COMMAND, STATE
ORDER BY connection_count DESC;
-- 查找长时间运行的连接
SELECT
ID,
USER,
HOST,
DB,
COMMAND,
TIME,
STATE,
INFO
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE TIME > 300 -- 超过5分钟的连接
ORDER BY TIME DESC;
解决方案:
bash
# 临时增加最大连接数
mysql -u root -p -e "SET GLOBAL max_connections = 500;"
# 杀死空闲连接
mysql -u root -p -e "
SELECT CONCAT('KILL ', ID, ';') as kill_command
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND = 'Sleep' AND TIME > 3600;
" | grep KILL | mysql -u root -p
# 永久修改配置
echo "max_connections = 500" >> /etc/mysql/my.cnf
sudo systemctl restart mysql
连接超时问题:
sql
-- 查看超时相关参数
SHOW VARIABLES LIKE '%timeout%';
-- 重要参数:
-- connect_timeout: 连接建立超时时间
-- interactive_timeout: 交互式连接超时时间
-- wait_timeout: 非交互式连接超时时间
-- net_read_timeout: 网络读取超时时间
-- net_write_timeout: 网络写入超时时间
-- 调整超时参数
SET GLOBAL wait_timeout = 28800; -- 8小时
SET GLOBAL interactive_timeout = 28800;
SET GLOBAL net_read_timeout = 60;
SET GLOBAL net_write_timeout = 60;
性能问题
慢查询诊断:
sql
-- 查看当前正在执行的慢查询
SELECT
ID,
USER,
HOST,
DB,
COMMAND,
TIME,
STATE,
LEFT(INFO, 100) as QUERY_TEXT
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE TIME > 10 AND COMMAND = 'Query'
ORDER BY TIME DESC;
-- 使用 Performance Schema 分析慢查询
SELECT
DIGEST_TEXT,
COUNT_STAR as exec_count,
AVG_TIMER_WAIT/1000000000000 as avg_time_sec,
MAX_TIMER_WAIT/1000000000000 as max_time_sec,
SUM_TIMER_WAIT/1000000000000 as total_time_sec,
SUM_ROWS_EXAMINED/COUNT_STAR as avg_rows_examined
FROM performance_schema.events_statements_summary_by_digest
WHERE AVG_TIMER_WAIT/1000000000000 > 1
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
-- 分析表的访问模式
SELECT
OBJECT_SCHEMA,
OBJECT_NAME,
COUNT_READ,
COUNT_WRITE,
COUNT_FETCH,
SUM_TIMER_WAIT/1000000000000 as total_time_sec
FROM performance_schema.table_io_waits_summary_by_table
WHERE OBJECT_SCHEMA NOT IN ('mysql', 'performance_schema', 'information_schema')
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
锁等待诊断:
sql
-- 查看当前锁等待情况(MySQL 5.7;8.0 中请改用 sys.innodb_lock_waits 或 performance_schema.data_lock_waits)
SELECT
r.trx_id waiting_trx_id,
r.trx_mysql_thread_id waiting_thread,
r.trx_query waiting_query,
b.trx_id blocking_trx_id,
b.trx_mysql_thread_id blocking_thread,
b.trx_query blocking_query,
w.requested_lock_id,
w.blocking_lock_id
FROM information_schema.innodb_lock_waits w
INNER JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
INNER JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;
-- 查看死锁信息
SHOW ENGINE INNODB STATUS\G
-- 查看锁等待超时设置
SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';
-- 查看当前事务
SELECT
trx_id,
trx_state,
trx_started,
trx_requested_lock_id,
trx_wait_started,
trx_weight,
trx_mysql_thread_id,
trx_query
FROM information_schema.innodb_trx
ORDER BY trx_started;
内存问题诊断:
sql
-- 查看内存使用情况
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_%';
-- 计算缓冲池命中率
SELECT
ROUND(
(1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)) * 100, 2
) AS buffer_pool_hit_rate
FROM (
SELECT
VARIABLE_VALUE AS Innodb_buffer_pool_reads
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads'
) AS reads,
(
SELECT
VARIABLE_VALUE AS Innodb_buffer_pool_read_requests
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests'
) AS requests;
-- 注:MySQL 5.6 及更早版本使用 INFORMATION_SCHEMA.GLOBAL_STATUS
-- 查看内存分配详情(MySQL 5.7+)
SELECT
EVENT_NAME,
CURRENT_NUMBER_OF_BYTES_USED/1024/1024 AS current_mb,
HIGH_NUMBER_OF_BYTES_USED/1024/1024 AS high_mb
FROM performance_schema.memory_summary_global_by_event_name
WHERE CURRENT_NUMBER_OF_BYTES_USED > 0
ORDER BY CURRENT_NUMBER_OF_BYTES_USED DESC
LIMIT 20;
-- 查看临时表使用情况
SHOW GLOBAL STATUS LIKE 'Created_tmp_%';
复制问题
主从复制诊断:
sql
-- 在从库上检查复制状态
SHOW SLAVE STATUS\G
-- 重要字段说明:
-- Slave_IO_Running: IO线程是否运行
-- Slave_SQL_Running: SQL线程是否运行
-- Seconds_Behind_Master: 复制延迟秒数
-- Last_IO_Error: 最后的IO错误
-- Last_SQL_Error: 最后的SQL错误
-- Exec_Master_Log_Pos: 执行到的主库日志位置
-- Read_Master_Log_Pos: 读取到的主库日志位置
-- 在主库上检查复制状态
SHOW MASTER STATUS;
SHOW PROCESSLIST;
-- 查看从库连接信息
SELECT
USER,
HOST,
DB,
COMMAND,
TIME,
STATE
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND = 'Binlog Dump';
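监控脚本经常需要把 SHOW SLAVE STATUS\G 的文本输出转换成可判断的结构化数据。下面是一个极简的解析示意(示例输出为手工构造,字段名为真实的复制状态字段):

```python
# 示意:解析 SHOW SLAVE STATUS\G 的 "Key: Value" 文本输出
def parse_slave_status(text):
    """将 \\G 格式输出解析为字典,忽略分隔行"""
    status = {}
    for line in text.splitlines():
        if ':' in line:
            key, _, value = line.partition(':')
            status[key.strip()] = value.strip()
    return status

sample = """\
*************************** 1. row ***************************
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 12
Last_SQL_Error:
"""
s = parse_slave_status(sample)
print(s['Slave_IO_Running'], s['Seconds_Behind_Master'])  # Yes 12
healthy = s['Slave_IO_Running'] == 'Yes' and s['Slave_SQL_Running'] == 'Yes'
print(healthy)  # True
```

注意 Seconds_Behind_Master 可能是字符串 "NULL"(复制中断时),判断延迟前需要先做类型检查,这也是下文 Shell 脚本中单独处理 NULL 的原因。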
复制延迟分析:
bash
# 监控复制延迟脚本
#!/bin/bash
MYSQL_USER="repl_monitor"
MYSQL_PASS="password"
SLAVE_HOST="slave_server"
MASTER_HOST="master_server"
# 检查复制延迟
check_replication_lag() {
local lag=$(mysql -h $SLAVE_HOST -u $MYSQL_USER -p$MYSQL_PASS -e \
"SHOW SLAVE STATUS\G" | grep "Seconds_Behind_Master:" | awk '{print $2}')
if [ "$lag" = "NULL" ]; then
echo "ERROR: Replication is broken"
return 1
elif [ $lag -gt 300 ]; then
echo "WARNING: High replication lag: $lag seconds"
return 2
else
echo "OK: Replication lag: $lag seconds"
return 0
fi
}
# 检查复制线程状态
check_replication_threads() {
local io_running=$(mysql -h $SLAVE_HOST -u $MYSQL_USER -p$MYSQL_PASS -e \
"SHOW SLAVE STATUS\G" | grep "Slave_IO_Running:" | awk '{print $2}')
local sql_running=$(mysql -h $SLAVE_HOST -u $MYSQL_USER -p$MYSQL_PASS -e \
"SHOW SLAVE STATUS\G" | grep "Slave_SQL_Running:" | awk '{print $2}')
if [ "$io_running" != "Yes" ] || [ "$sql_running" != "Yes" ]; then
echo "ERROR: Replication threads not running - IO: $io_running, SQL: $sql_running"
return 1
else
echo "OK: Replication threads running"
return 0
fi
}
# 主函数
main() {
echo "Checking replication status..."
check_replication_threads
check_replication_lag
}
main
诊断工具
系统诊断工具
pt-stalk 工具:
bash
# 安装 Percona Toolkit
sudo apt-get install percona-toolkit
# 使用 pt-stalk 监控 MySQL
pt-stalk --function=processlist --variable=Threads_running --threshold=25 \
--match-command=Query --collect-oprofile --collect-strace \
--collect-tcpdump --collect-gdb
# 当 Threads_running 超过 25 时自动收集诊断信息
pt-summary 工具:
bash
# 生成系统摘要报告
pt-summary
# 生成 MySQL 摘要报告
pt-mysql-summary -- --user=root --password=password
# 生成磁盘摘要报告
pt-diskstats
innotop 工具:
bash
# 安装 innotop
sudo apt-get install innotop
# 启动 innotop
innotop -u root -p password
# 常用快捷键:
# T - 查看 InnoDB 事务
# L - 查看锁等待
# Q - 查看查询列表
# I - 查看 InnoDB I/O 信息
# B - 查看缓冲池信息
性能分析工具
pt-query-digest:
bash
# 分析慢查询日志
pt-query-digest /var/log/mysql/slow.log > slow_query_report.txt
# 分析二进制日志(需先用 mysqlbinlog 解码后再交给 pt-query-digest)
mysqlbinlog mysql-bin.000001 | pt-query-digest --type=binlog
# 分析通用查询日志
pt-query-digest --type=genlog /var/log/mysql/general.log
# 实时分析查询
pt-query-digest --processlist h=localhost,u=root,p=password
# 过滤特定时间段
pt-query-digest --since='2023-12-01 10:00:00' --until='2023-12-01 12:00:00' slow.log
pt-index-usage:
bash
# 分析索引使用情况
pt-index-usage /var/log/mysql/slow.log
# 输出未使用的索引
pt-index-usage --host=localhost --user=root --password=password \
--database=mydb --create-save-results-database
pt-duplicate-key-checker:
bash
# 检查重复索引
pt-duplicate-key-checker --host=localhost --user=root --password=password
# 检查特定数据库
pt-duplicate-key-checker h=localhost,u=root,p=password,D=mydb
故障恢复
数据恢复
基于二进制日志的恢复:
bash
# 场景:需要恢复到特定时间点
# 1. 停止 MySQL 服务
sudo systemctl stop mysql
# 2. 恢复最近的完整备份
mysql -u root -p < /backup/full_backup_2023-12-01.sql
# 3. 应用二进制日志到故障时间点
mysqlbinlog --start-datetime="2023-12-01 00:00:00" \
--stop-datetime="2023-12-01 14:30:00" \
mysql-bin.000001 mysql-bin.000002 | mysql -u root -p
# 4. 启动 MySQL 服务
sudo systemctl start mysql
表级恢复:
sql
-- 恢复单个表
-- 1. 创建临时数据库
CREATE DATABASE temp_recovery;
-- 2. 在临时数据库中恢复表
USE temp_recovery;
source /backup/table_backup.sql;
-- 3. 将数据插入到原表
INSERT INTO production.users SELECT * FROM temp_recovery.users;
-- 4. 清理临时数据库
DROP DATABASE temp_recovery;
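上面的表级恢复假定已有单表备份文件;若手头只有全库逻辑备份,也可以先从中抽取单表的 SQL。下面是一个示意脚本(`extract_table` 为本文虚构的函数名,依赖 mysqldump 默认的表分隔注释格式,表名含正则特殊字符时需额外处理):

```shell
#!/bin/bash
# 从 mysqldump 全量备份中抽取指定表的建表与数据语句(示意,
# 依赖 mysqldump 默认生成的 "-- Table structure for table `xxx`" 分隔注释)
extract_table() {
    local dump="$1" table="$2"
    awk -v t="$table" '
        # 命中目标表的分隔注释后开始输出
        $0 ~ ("^-- Table structure for table `" t "`") { found = 1 }
        # 遇到下一张表的分隔注释即停止
        found && /^-- Table structure for table `/ && $0 !~ ("`" t "`") { exit }
        found { print }
    ' "$dump"
}

# 用法示例:extract_table /backup/full_backup.sql users > /backup/table_backup.sql
```

抽取出的文件即可按上文流程在临时数据库中恢复。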
复制恢复
重建从库复制:
sql
-- 在主库上
-- 1. 锁定表(可选,用于一致性)
FLUSH TABLES WITH READ LOCK;
-- 2. 记录当前位置
SHOW MASTER STATUS;
-- 3. 创建备份
-- mysqldump -u root -p --all-databases --master-data=2 > master_backup.sql
-- 4. 解锁表
UNLOCK TABLES;
-- 在从库上
-- 1. 停止复制
STOP SLAVE;
-- 2. 恢复备份
-- mysql -u root -p < master_backup.sql
-- 3. 配置复制(MySQL 8.0.23+ 推荐使用 CHANGE REPLICATION SOURCE TO 语法)
CHANGE MASTER TO
MASTER_HOST='master_ip',
MASTER_USER='repl_user',
MASTER_PASSWORD='repl_password',
MASTER_LOG_FILE='mysql-bin.000001',
MASTER_LOG_POS=12345;
-- 4. 启动复制
START SLAVE;
-- 5. 检查复制状态
SHOW SLAVE STATUS\G
跳过复制错误:
sql
-- 跳过单个错误
STOP SLAVE;
SET GLOBAL sql_slave_skip_counter = 1;
START SLAVE;
-- 跳过特定错误类型:slave_skip_errors 是只读变量,不能用 SET GLOBAL 修改,
-- 需在 my.cnf 的 [mysqld] 段配置后重启服务:
-- slave_skip_errors = 1062,1053
-- 基于 GTID 跳过事务
STOP SLAVE;
SET GTID_NEXT = 'uuid:transaction_id';
BEGIN; COMMIT;
SET GTID_NEXT = 'AUTOMATIC';
START SLAVE;
运维最佳实践
预防性维护
定期维护任务
数据库优化:
sql
-- 分析表统计信息
ANALYZE TABLE table_name;
-- 优化表结构
OPTIMIZE TABLE table_name;
-- 检查表完整性
CHECK TABLE table_name;
-- 修复表(REPAIR TABLE 仅适用于 MyISAM 等引擎,InnoDB 不支持)
REPAIR TABLE table_name;
-- 批量优化所有表
SELECT CONCAT('OPTIMIZE TABLE ', table_schema, '.', table_name, ';') as optimize_sql
FROM information_schema.tables
WHERE table_schema NOT IN ('mysql', 'performance_schema', 'information_schema', 'sys')
AND engine = 'InnoDB';
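上面的查询只负责生成 OPTIMIZE 语句,还需要逐条交给 mysql 执行。下面是一个示意包装(`optimize_all` 为虚构函数名;第一个参数可注入自定义执行命令便于干跑测试,默认调用 mysql 并假设凭据来自 ~/.my.cnf;InnoDB 上 OPTIMIZE 实际等价于重建表加 ANALYZE,建议在低峰期执行):

```shell
#!/bin/bash
# 批量执行生成的 OPTIMIZE 语句(示意)
optimize_all() {
    local runner="${1:-mysql}"   # 可注入的执行命令,默认 mysql
    "$runner" -N -e "SELECT CONCAT('OPTIMIZE TABLE \`', table_schema, '\`.\`', table_name, '\`;')
        FROM information_schema.tables
        WHERE table_schema NOT IN ('mysql','performance_schema','information_schema','sys')
          AND engine = 'InnoDB';" |
    while IFS= read -r stmt; do
        echo "Running: $stmt"
        "$runner" -e "$stmt"
    done
}

# 用法示例:optimize_all    # 对所有 InnoDB 表执行 OPTIMIZE
```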
索引维护:
sql
-- 查找未使用的索引
SELECT
OBJECT_SCHEMA,
OBJECT_NAME,
INDEX_NAME
FROM performance_schema.table_io_waits_summary_by_index_usage
WHERE INDEX_NAME IS NOT NULL
AND INDEX_NAME != 'PRIMARY'
AND COUNT_STAR = 0
AND OBJECT_SCHEMA NOT IN ('mysql', 'performance_schema', 'information_schema');
-- 查找重复索引
SELECT
a.TABLE_SCHEMA,
a.TABLE_NAME,
a.INDEX_NAME as index1,
b.INDEX_NAME as index2,
a.COLUMN_NAME
FROM information_schema.STATISTICS a
JOIN information_schema.STATISTICS b ON (
a.TABLE_SCHEMA = b.TABLE_SCHEMA
AND a.TABLE_NAME = b.TABLE_NAME
AND a.COLUMN_NAME = b.COLUMN_NAME
AND a.INDEX_NAME != b.INDEX_NAME
)
WHERE a.TABLE_SCHEMA NOT IN ('mysql', 'performance_schema', 'information_schema')
ORDER BY a.TABLE_SCHEMA, a.TABLE_NAME;
统计信息更新:
bash
#!/bin/bash
# 自动更新统计信息脚本
MYSQL_USER="admin"
MYSQL_PASS="password"
DATABASE="production"
# 获取需要更新统计信息的表
TABLES=$(mysql -u $MYSQL_USER -p$MYSQL_PASS -D $DATABASE -e "
SELECT table_name
FROM information_schema.tables
WHERE table_schema = '$DATABASE'
AND engine = 'InnoDB'
AND (update_time < DATE_SUB(NOW(), INTERVAL 7 DAY) OR update_time IS NULL);
" | tail -n +2)
# 更新统计信息
for table in $TABLES; do
echo "Analyzing table: $table"
mysql -u $MYSQL_USER -p$MYSQL_PASS -D $DATABASE -e "ANALYZE TABLE $table;"
done
echo "Statistics update completed"
容量规划
存储空间监控:
sql
-- 查看数据库大小
SELECT
table_schema AS 'Database',
ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)'
FROM information_schema.tables
GROUP BY table_schema
ORDER BY SUM(data_length + index_length) DESC;
-- 查看表大小增长趋势
SELECT
TABLE_SCHEMA,
TABLE_NAME,
ROUND(DATA_LENGTH/1024/1024, 2) AS data_mb,
ROUND(INDEX_LENGTH/1024/1024, 2) AS index_mb,
ROUND((DATA_LENGTH + INDEX_LENGTH)/1024/1024, 2) AS total_mb,
TABLE_ROWS,
CREATE_TIME,
UPDATE_TIME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA NOT IN ('mysql', 'performance_schema', 'information_schema', 'sys')
ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC
LIMIT 20;
-- 预测存储需求
SELECT
TABLE_SCHEMA,
TABLE_NAME,
TABLE_ROWS,
ROUND((DATA_LENGTH + INDEX_LENGTH)/TABLE_ROWS, 2) AS avg_row_size,
ROUND((DATA_LENGTH + INDEX_LENGTH)/1024/1024, 2) AS current_mb,
ROUND((DATA_LENGTH + INDEX_LENGTH)/1024/1024 * 1.5, 2) AS projected_mb_50_growth
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA NOT IN ('mysql', 'performance_schema', 'information_schema', 'sys')
AND TABLE_ROWS > 0
ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC;
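除了上面固定乘 1.5 的静态估算,也可以用两次采样做线性外推,估算距离磁盘容量耗尽还有多少天。以下是一个最小示意(`days_until_full` 为虚构函数名,线性增长只是简化假设,实际增长往往非线性):

```python
#!/usr/bin/env python3
# 容量线性外推示意:由两次采样的数据库总大小估算日增长量,
# 再推算达到磁盘容量上限所需的天数

def days_until_full(size_mb_then, size_mb_now, days_between, capacity_mb):
    """线性外推;增长为零或负时返回 None(容量不会因增长耗尽)。"""
    growth_per_day = (size_mb_now - size_mb_then) / days_between
    if growth_per_day <= 0:
        return None
    return (capacity_mb - size_mb_now) / growth_per_day

# 例:7 天前 9000 MB,现在 10000 MB,磁盘容量上限 50000 MB
print(round(days_until_full(9000, 10000, 7, 50000), 2))  # → 280.0
```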
性能基线建立:
python
#!/usr/bin/env python3
# 性能基线收集脚本
# 注意:示例查询使用 INFORMATION_SCHEMA.GLOBAL_STATUS,
# MySQL 5.7+ 中应改用 performance_schema.global_status / global_variables
import mysql.connector
import json
import time
from datetime import datetime, timedelta


class PerformanceBaseline:
    def __init__(self, config):
        self.config = config
        self.baseline_data = []

    def collect_metrics(self):
        connection = mysql.connector.connect(**self.config)
        cursor = connection.cursor(dictionary=True)
        # 收集关键性能指标
        metrics = {
            'timestamp': datetime.now().isoformat(),
            'queries_per_second': self._get_qps(cursor),
            'connections': self._get_connection_stats(cursor),
            'innodb_stats': self._get_innodb_stats(cursor),
            'slow_queries': self._get_slow_query_stats(cursor),
            'table_stats': self._get_table_stats(cursor)
        }
        cursor.close()
        connection.close()
        return metrics

    def _get_qps(self, cursor):
        # 注意:Queries 是自启动以来的累计计数器,
        # 真实 QPS 需用相邻两次采样的差值除以采样间隔换算
        cursor.execute("SHOW GLOBAL STATUS LIKE 'Queries'")
        result = cursor.fetchone()
        return int(result['Value']) if result else 0

    def _get_connection_stats(self, cursor):
        cursor.execute("""
            SELECT
                (SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'Threads_connected') as current_connections,
                (SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'Max_used_connections') as max_used_connections,
                (SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_VARIABLES WHERE VARIABLE_NAME = 'max_connections') as max_connections
        """)
        return cursor.fetchone()

    def _get_innodb_stats(self, cursor):
        cursor.execute("""
            SELECT
                (SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests') as read_requests,
                (SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads') as physical_reads,
                (SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'Innodb_buffer_pool_pages_total') as total_pages,
                (SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'Innodb_buffer_pool_pages_free') as free_pages
        """)
        return cursor.fetchone()

    def _get_slow_query_stats(self, cursor):
        cursor.execute("SHOW GLOBAL STATUS LIKE 'Slow_queries'")
        result = cursor.fetchone()
        return int(result['Value']) if result else 0

    def _get_table_stats(self, cursor):
        cursor.execute("""
            SELECT
                COUNT(*) as total_tables,
                SUM(TABLE_ROWS) as total_rows,
                ROUND(SUM(DATA_LENGTH + INDEX_LENGTH)/1024/1024, 2) as total_size_mb
            FROM INFORMATION_SCHEMA.TABLES
            WHERE TABLE_SCHEMA NOT IN ('mysql', 'performance_schema', 'information_schema', 'sys')
        """)
        return cursor.fetchone()

    def collect_baseline(self, duration_hours=24, interval_minutes=15):
        """收集指定时间段的性能基线数据"""
        end_time = datetime.now() + timedelta(hours=duration_hours)
        while datetime.now() < end_time:
            try:
                metrics = self.collect_metrics()
                self.baseline_data.append(metrics)
                print(f"Collected metrics at {metrics['timestamp']}")
                # 保存到文件
                with open('performance_baseline.json', 'w') as f:
                    json.dump(self.baseline_data, f, indent=2)
                time.sleep(interval_minutes * 60)
            except Exception as e:
                print(f"Error collecting metrics: {e}")
                time.sleep(60)  # 出错时等待1分钟后重试

    def analyze_baseline(self):
        """分析基线数据"""
        if not self.baseline_data:
            print("No baseline data available")
            return
        # 计算平均值、最大值、最小值
        qps_values = [m['queries_per_second'] for m in self.baseline_data]
        conn_values = [int(m['connections']['current_connections']) for m in self.baseline_data]
        analysis = {
            'collection_period': {
                'start': self.baseline_data[0]['timestamp'],
                'end': self.baseline_data[-1]['timestamp'],
                'samples': len(self.baseline_data)
            },
            'qps_stats': {
                'avg': sum(qps_values) / len(qps_values),
                'max': max(qps_values),
                'min': min(qps_values)
            },
            'connection_stats': {
                'avg': sum(conn_values) / len(conn_values),
                'max': max(conn_values),
                'min': min(conn_values)
            }
        }
        print(json.dumps(analysis, indent=2))
        return analysis


# 使用示例
if __name__ == "__main__":
    config = {
        'host': 'localhost',
        'user': 'monitor_user',
        'password': 'monitor_password',
        'database': 'mysql'
    }
    baseline = PerformanceBaseline(config)
    # 收集24小时的基线数据,每15分钟采样一次
    baseline.collect_baseline(duration_hours=24, interval_minutes=15)
    # 分析基线数据
    baseline.analyze_baseline()
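需要注意,上面脚本中 'queries_per_second' 字段记录的是累计的 Queries 计数器,并不是瞬时 QPS;分析时要用相邻两次采样的差值换算。换算逻辑可以示意如下(`qps_from_samples` 为本文虚构的函数名):

```python
#!/usr/bin/env python3
# 由两次累计 Queries 采样计算区间平均 QPS(计数器为累计值,需取差分)

def qps_from_samples(queries_then, queries_now, interval_seconds):
    """返回采样间隔内的平均 QPS;间隔必须为正。"""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # 服务重启会使计数器归零,出现负差值时应丢弃该样本
    delta = queries_now - queries_then
    return delta / interval_seconds if delta >= 0 else None

# 例:15 秒内计数从 120000 增至 123000
print(qps_from_samples(120000, 123000, 15))  # → 200.0
```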
自动化运维
自动化脚本
健康检查脚本:
bash
#!/bin/bash
# MySQL 健康检查脚本
# 配置参数
CONFIG_FILE="/etc/mysql-monitor.conf"
LOG_FILE="/var/log/mysql-health-check.log"
ALERT_EMAIL="dba@company.com"
SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
# 加载配置(文件缺失时直接退出,避免后续变量为空)
if [ -f "$CONFIG_FILE" ]; then
source "$CONFIG_FILE"
else
echo "Config file $CONFIG_FILE not found" >&2
exit 1
fi
# 日志函数
log() {
echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a $LOG_FILE
}
# 发送告警
send_alert() {
local level="$1"
local message="$2"
# 发送邮件
echo "$message" | mail -s "MySQL Alert [$level]" $ALERT_EMAIL
# 发送 Slack 通知
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"MySQL Alert [$level]: $message\"}" \
$SLACK_WEBHOOK
log "ALERT [$level]: $message"
}
# 检查 MySQL 服务状态
check_service_status() {
if ! systemctl is-active --quiet mysql; then
send_alert "CRITICAL" "MySQL service is not running"
return 1
fi
if ! mysqladmin -h $MYSQL_HOST -u $MYSQL_USER -p$MYSQL_PASS ping >/dev/null 2>&1; then
send_alert "CRITICAL" "MySQL is not responding to ping"
return 1
fi
log "Service status: OK"
return 0
}
# 检查连接数
check_connections() {
local current=$(mysql -h $MYSQL_HOST -u $MYSQL_USER -p$MYSQL_PASS -e \
"SHOW STATUS LIKE 'Threads_connected';" | awk 'NR==2 {print $2}')
local max=$(mysql -h $MYSQL_HOST -u $MYSQL_USER -p$MYSQL_PASS -e \
"SHOW VARIABLES LIKE 'max_connections';" | awk 'NR==2 {print $2}')
local usage=$((current * 100 / max))
if [ "$usage" -gt "${CONNECTION_CRITICAL_THRESHOLD:-90}" ]; then
send_alert "CRITICAL" "Connection usage is $usage% ($current/$max)"
elif [ "$usage" -gt "${CONNECTION_WARNING_THRESHOLD:-80}" ]; then
send_alert "WARNING" "Connection usage is $usage% ($current/$max)"
fi
log "Connections: $current/$max ($usage%)"
}
# 检查磁盘空间
check_disk_space() {
local datadir=$(mysql -h $MYSQL_HOST -u $MYSQL_USER -p$MYSQL_PASS -e \
"SHOW VARIABLES LIKE 'datadir';" | awk 'NR==2 {print $2}')
local usage=$(df $datadir | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$usage" -gt "${DISK_CRITICAL_THRESHOLD:-90}" ]; then
send_alert "CRITICAL" "Disk usage is $usage% for MySQL data directory"
elif [ "$usage" -gt "${DISK_WARNING_THRESHOLD:-85}" ]; then
send_alert "WARNING" "Disk usage is $usage% for MySQL data directory"
fi
log "Disk usage: $usage%"
}
# 检查复制状态
check_replication() {
local slave_status=$(mysql -h $MYSQL_HOST -u $MYSQL_USER -p$MYSQL_PASS -e \
"SHOW SLAVE STATUS\G" 2>/dev/null)
if [ -n "$slave_status" ]; then
local io_running=$(echo "$slave_status" | grep "Slave_IO_Running:" | awk '{print $2}')
local sql_running=$(echo "$slave_status" | grep "Slave_SQL_Running:" | awk '{print $2}')
local lag=$(echo "$slave_status" | grep "Seconds_Behind_Master:" | awk '{print $2}')
if [ "$io_running" != "Yes" ] || [ "$sql_running" != "Yes" ]; then
send_alert "CRITICAL" "Replication error - IO: $io_running, SQL: $sql_running"
elif [ "$lag" != "NULL" ] && [ "$lag" -gt "${REPLICATION_LAG_WARNING:-300}" ]; then
send_alert "WARNING" "High replication lag: $lag seconds"
fi
log "Replication: IO=$io_running, SQL=$sql_running, Lag=$lag"
fi
}
# 检查慢查询
check_slow_queries() {
local slow_queries=$(mysql -h $MYSQL_HOST -u $MYSQL_USER -p$MYSQL_PASS -e \
"SHOW STATUS LIKE 'Slow_queries';" | awk 'NR==2 {print $2}')
# 读取上次检查的慢查询数量
local last_count=0
if [ -f "/tmp/last_slow_queries" ]; then
last_count=$(cat /tmp/last_slow_queries)
fi
local new_slow=$((slow_queries - last_count))
if [ "$new_slow" -gt "${SLOW_QUERY_WARNING:-50}" ]; then
send_alert "WARNING" "$new_slow new slow queries detected"
fi
echo $slow_queries > /tmp/last_slow_queries
log "Slow queries: $slow_queries (new: $new_slow)"
}
# 检查缓冲池命中率
check_buffer_pool() {
# 注意:MySQL 5.7+ 中 INFORMATION_SCHEMA.GLOBAL_STATUS 已迁移到
# performance_schema.global_status,需相应调整查询
local hit_rate=$(mysql -h $MYSQL_HOST -u $MYSQL_USER -p$MYSQL_PASS -e "
SELECT ROUND(
(1 - (
(SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads') /
(SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests')
)) * 100, 2
) AS hit_rate;
" | awk 'NR==2 {print $1}')
if (( $(echo "$hit_rate < ${BUFFER_POOL_WARNING:-95}" | bc -l) )); then
send_alert "WARNING" "Low buffer pool hit rate: $hit_rate%"
fi
log "Buffer pool hit rate: $hit_rate%"
}
# 主检查函数
main() {
log "Starting MySQL health check"
check_service_status || exit 1
check_connections
check_disk_space
check_replication
check_slow_queries
check_buffer_pool
log "MySQL health check completed"
}
# 执行检查
main
配置文件示例:
bash
# /etc/mysql-monitor.conf
MYSQL_HOST="localhost"
MYSQL_USER="monitor_user"
MYSQL_PASS="monitor_password"
MYSQL_PORT="3306"
# 告警阈值
CONNECTION_WARNING_THRESHOLD=80
CONNECTION_CRITICAL_THRESHOLD=90
DISK_WARNING_THRESHOLD=85
DISK_CRITICAL_THRESHOLD=90
REPLICATION_LAG_WARNING=300
SLOW_QUERY_WARNING=50
BUFFER_POOL_WARNING=95
定时任务配置
Crontab 配置:
bash
# 编辑 crontab
crontab -e
# 添加定时任务
# 每5分钟执行健康检查
*/5 * * * * /usr/local/bin/mysql-health-check.sh
# 每小时收集性能指标
0 * * * * /usr/local/bin/mysql-performance-collect.sh
# 每天凌晨2点执行备份
0 2 * * * /usr/local/bin/mysql-backup.sh
# 每周日凌晨3点执行表优化
0 3 * * 0 /usr/local/bin/mysql-optimize-tables.sh
# 每月1号清理旧日志
0 4 1 * * /usr/local/bin/mysql-log-cleanup.sh
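crontab 中引用的 mysql-log-cleanup.sh 可参考下面的最小示意实现(脚本职责沿用上文;`cleanup_binlogs` 为虚构函数名,假设凭据来自 ~/.my.cnf,第一个参数可注入执行命令便于干跑测试):

```shell
#!/bin/bash
# 二进制日志清理示意:使用官方的 PURGE BINARY LOGS 语句,
# 切勿直接 rm 二进制日志文件(会导致 index 文件与磁盘内容不一致)
cleanup_binlogs() {
    local runner="${1:-mysql}" days="${2:-7}"
    "$runner" -e "PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL ${days} DAY);"
}

# 用法示例:cleanup_binlogs           # 清理 7 天前的 binlog
#           cleanup_binlogs mysql 14  # 保留 14 天
```

清理周期应大于从库可能的最大落后时间,避免删除从库尚未读取的日志。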
Systemd 定时器配置:
ini
# /etc/systemd/system/mysql-health-check.service
[Unit]
Description=MySQL Health Check
After=mysql.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/mysql-health-check.sh
User=mysql
Group=mysql
# /etc/systemd/system/mysql-health-check.timer
[Unit]
Description=Run MySQL Health Check every 5 minutes
Requires=mysql-health-check.service
[Timer]
OnCalendar=*:0/5
Persistent=true
[Install]
WantedBy=timers.target
bash
# 启用定时器
sudo systemctl enable mysql-health-check.timer
sudo systemctl start mysql-health-check.timer
# 查看定时器状态
sudo systemctl status mysql-health-check.timer
sudo systemctl list-timers
总结
MySQL 监控与运维是确保数据库系统稳定运行的关键环节。通过建立完善的监控体系、实施有效的故障诊断流程、遵循运维最佳实践,可以显著提高数据库系统的可用性和性能。
关键要点:
- 全面监控:建立涵盖硬件、操作系统、数据库服务和应用层的多层次监控体系
- 主动预警:设置合理的告警阈值,实现故障的早期发现和预防
- 快速诊断:掌握常用的诊断工具和方法,能够快速定位和解决问题
- 自动化运维:通过脚本和工具实现运维任务的自动化,提高效率和可靠性
- 持续优化:基于监控数据进行性能分析和优化,不断提升系统性能
通过系统性的监控运维实践,可以构建一个高可用、高性能的 MySQL 数据库环境,为业务发展提供可靠的数据支撑。