点击上方“IT那活儿”公众号--专注于企业全栈运维技术分享,不管IT什么活儿,干就完了!!!
本文简介
常见故障案例
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/usr/local/mysql.sock' (2)
ERROR 1045 (28000): Access denied for user 'root'@'localhost'(using password: NO)
mysql>set password for root@localhost=password('123456');
mysql> grant all on *.* to 'root'@'mysql-server' identified by'123456';
ERROR1129 (HY000): Host 'xxx.xxx.xxx.xxx' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'
./blade create network delay --time 3000 --offset 1000 --interface ens33 --local-port 3306 --timeout 30
mysqladmin -uroot -p flush-hosts Enter password
Too many connections
(连接数过多,导致连接不上数据库,业务无法正常进行)
Warning: World-writable config file'/etc/my.cnf'is ignored ERROR! MySQL is running but PID file could not be found
InnoDB: Error: page 14178log sequence number29455369832
InnoDB: is in the future! Current systemlog sequence number29455369832
innodb_force_recovery=4,
org.hibernate.util.JDBCExceptionReporter - The last packet
successfully received from the server was43200 milliseconds
ago.The last packet sent successfully to the server was
43200 milliseconds ago, which is longer than the server
configured value of 'wait_timeout'. You should consider
either expiring and/or testing connection validity before
use in your application, increasing the server configured
valuesfor client timeouts, or using the Connector/J
connection 'autoReconnect=true'to avoid this problem.
Can't open file: 'xxx_forums.MYI'
MySQL server has gone away
Lost connection to MySQL server during query
mysql>callcreate_no_by_day('STUDENT','CREATE_TIME');ERROR1449 (HY000):Theuserspecifiedasadefiner ('TEST_111'@'172.%.%.%') doesnotexist
-
第一种:重建存储过程; -
第二种:在新机器上建立这个用户’TEST_111’@‘172.%.%.%’。 -
第三种:修改定义者的IP地址为新机器。
The slave I/O thread stops because master and slave have
equal MySQL server ids; these ids must be different for
replication to work (or the --replicate-same-server-id
option must be used on slave but this does not always make
sense; please check the manual before using it).
mysql> stop slave;
mysql>set GLOBAL SQL_SLAVE_SKIP_COUNTER=1;
-
mysql> start slave;(跳过错误,保持主从同步,前提是知道某个特定错误不会对数据的一致性造成严重影响)。
setglobal read_only=true;
Error initializing relay log position: I/O error reading the header from the binary log
mysql> CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.xxx',MASTER_LOG_POS=xxx;
解决方案
# -*- coding: utf-8 -*-
import mysql.connector
from mysql.connector import Error
defcheck_database():
connection = None# 初始化connection变量
try:
connection = mysql.connector.connect(
host='localhost',
database='employees',
user='root',
password='123456'
)
if connection.is_connected():
print("Database is up and running")
returnTrue
except Error as e:
print("Error: {}".format(e))
returnFalse
finally:
if connection isnotNoneand connection.is_connected(): # 安全检查
connection.close()
if __name__ == "__main__":
check_database()
#!/bin/bash
SERVICE="mysql"
if ! systemctl is-active --quiet $SERVICE; then
echo"$SERVICE is down! Restarting..."
/etc/init.d/mysqld start
echo"$SERVICE has been restarted."
else
echo"$SERVICE is running."
fi
import logging
logging.basicConfig(filename='db_monitor.log', level=logging.INFO)
deflog_error(message):
logging.error(message)
deflog_recovery(message):
logging.info(message)
# 在检查数据库时记录异常
ifnot check_database():
log_error("Database connection failed!")
import mysql.connector
from mysql.connector import Error
import logging
import os
# 配置日志
logging.basicConfig(filename='db_monitor.log', level=logging.INFO)
defcheck_database():
connection = None# 初始化connection变量
try:
connection = mysql.connector.connect(
host='localhost',
database=os.getenv('DB_NAME', 'employees'), # 使用环境变量获取数据库名
user=os.getenv('DB_USER', 'root'), # 使用环境变量获取用户名
password=os.getenv('DB_PASS', '123456') # 使用环境变量获取密码
)
if connection.is_connected():
print("Database is up and running")
logging.info("Database checked successfully.")
returnTrue
except Error as e:
log_error(f"Error: {e}")
returnFalse
finally:
if connection isnotNoneand connection.is_connected():
connection.close()
deflog_error(message):
logging.error(message)
deflog_recovery(message):
logging.info(message)
defrestart_mysql():
# 先检查MySQL服务状态
status = os.system("/etc/init.d/mysqld status > /dev/null 2>&1")
if status != 0: # 如果服务不活跃
start_status = os.system("/etc/init.d/mysqld start")
if start_status == 0:
log_recovery("MySQL was inactive and has been restarted successfully.")
else:
log_error("Failed to restart MySQL.")
else:
log_recovery("MySQL is active, no action needed.")
if __name__ == "__main__":
ifnot check_database():
print("Trying to restart MySQL...")
restart_mysql()
#!/bin/bash
exec 2>/dev/null
basedir=/usr/local/mysql
datadir=/usr/local/mysql/data
db_user=root
db_passwd=123456
db_post=3306
DATABASE_NAME=employees
#=================登陆时产生的故障=======================
# 配置信息
SOCKET_PATH="/usr/local/mysql/mysql.sock"
LOG_FILE="/usr/local/mysql/warn.txt"
# 尝试连接MySQL
mysql_test=$(mysql -u"$db_user" -p"$db_passwd" -S "$SOCKET_PATH" -e "SELECT 1;" 2>&1)
if [[ "$mysql_test" == *"ERROR 2002 (HY000): Can't connect to local MySQL server through socket '$SOCKET_PATH' (2)"* ]]; then
echo"无法连接到MySQL服务器,检查防火墙状态..."
# 检查并关闭防火墙服务
FIREWALLD_STATUS=$(systemctl is-active firewalld)
if [ "$FIREWALLD_STATUS" = "active" ]; then
echo"防火墙处于开启状态,正在关闭..."
systemctl stop firewalld
echo"防火墙已关闭。"
else
echo"防火墙未开启。"
fi
# 尝试启动MySQL服务
echo"尝试启动MySQL服务..."
/etc/init.d/mysqld start
# 等待几秒确保MySQL有足够的时间启动
sleep 5
# 再次尝试连接MySQL以确认是否成功启动
mysql_test_after_start=$(mysql -u"$db_user" -p"$db_passwd" -S "$SOCKET_PATH" -e "SELECT 1;" 2>&1)
if [[ "$mysql_test_after_start" != *"ERROR 2002 (HY000)"* ]]; then
echo"MySQL服务已成功启动。"
else
echo"MySQL服务启动失败,请检查日志文件或手动处理。"
echo"[$(date '+%Y-%m-%d %H:%M:%S')] - MySQL服务启动失败" >> "$LOG_FILE"
fi
else
echo"MySQL服务正常运行。"
fi
#======================内存==============================
MEMORY_USAGE=$(free | grep Mem | awk '{print $3/$2 * 100.0}');
if (( $(echo"$MEMORY_USAGE > 95" | bc -l) )); then
echo"警告: 内存使用率($MEMORY_USAGE%)超过了95%,开始尝试释放内存..."
# 示例:清除缓存
sync; echo 3 > /proc/sys/vm/drop_caches
echo"已尝试释放内存。"
else
echo"内存使用正常,无需释放。"
fi
#==============Thread_Running阻塞========================
MAX_Thread_Running=100 # Threads_running的最大阈值
AUTO_HEALING_ENABLED=true# 默认开启自动修复
SLOW_QUERY_TIME=10 # 慢查询的时间阈值(秒)
ALLOWED_USERS=("root") #允许的用户列表
# 获取当前Threads_running的数量
Threads_running=$(mysql -u$db_user -p$db_passwd -s -N -e "SHOW GLOBAL STATUS LIKE 'Threads_running';" | awk '{print $2}')
# 如果Threads_running超过阈值,则查找并终止符合条件的慢查询
if [ "$Threads_running" -gt "$MAX_Thread_Running" ] && [ "$AUTO_HEALING_ENABLED" = true ]; then
echo"警告: Threads_running ($Threads_running) 超过了 $MAX_Thread_Running,开始查找慢查询..."
# 查找符合条件的慢查询
SLOW_QUERIES=$(mysql -u$db_user -p$db_passwd -s -N -e \
"SELECT id, user, time, info FROM information_schema.processlist WHERE command='Query' AND info LIKE 'SELECT%' AND time > $SLOW_QUERY_TIME AND user IN ('${ALLOWED_USERS[@]}')")
if [ -n "$SLOW_QUERIES" ]; then
echo"找到慢查询,准备终止它们..."
while IFS= read -r line; do
QUERY_ID=$(echo"$line" | awk '{print $1}')
USER=$(echo"$line" | awk '{print $2}')
TIME=$(echo"$line" | awk '{print $3}')
INFO=$(echo"$line" | cut -d ' ' -f4-)
echo"正在终止查询ID: $QUERY_ID, 用户: $USER, 时间: $TIME, 查询: $INFO"
mysql -u$db_user -p$db_passwd -e "KILL $QUERY_ID;"
done <<< "$SLOW_QUERIES"
else
echo"没有找到符合条件的慢查询。"
fi
else
echo"Threads_running 正常,无需处理。"
fi
#======================连接数============================
# 获取当前max_connections和Threads_running的数量
MAX_CONNECTIONS=$(mysql -u$db_user -p$db_passwd -s -N -e "SELECT @@max_connections;")
CURRENT_CONNECTIONS=$(mysql -u$db_user -p$db_passwd -s -N -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';" | awk '{print $2}')
# 扩容增量
INCREMENT=1000
# 告警阈值状态ID
STATUS_ID=13
# 判断是否需要扩容
if [ "$CURRENT_CONNECTIONS" -ge "$MAX_CONNECTIONS" ] && [ "$Threads_running" -lt 100 ]; then
echo"警告: 当前连接数($CURRENT_CONNECTIONS)已达到最大值($MAX_CONNECTIONS),准备扩容..."
# 计算新的max_connections值
NEW_MAX_CONNECTIONS=$((MAX_CONNECTIONS + INCREMENT))
# 更新max_connections
mysql -u$db_user -p$db_passwd -e "SET GLOBAL max_connections=$NEW_MAX_CONNECTIONS;"
echo"max_connections 已更新为 $NEW_MAX_CONNECTIONS"
# 同步告警阈值
# 根据实际情况修改SQL语句
mysql -u$db_user -p$db_passwd$DATABASE_NAME -e "UPDATE alert_thresholds SET threshold_value=$NEW_MAX_CONNECTIONS WHERE status_id=$STATUS_ID;"
echo"告警阈值已同步更新为 $NEW_MAX_CONNECTIONS"
else
if [ "$CURRENT_CONNECTIONS" -lt "$MAX_CONNECTIONS" ]; then
echo"当前连接数($CURRENT_CONNECTIONS)未达到最大值($MAX_CONNECTIONS),无需扩容。"
elif [ "$Threads_running" -ge 100 ]; then
echo"活动线程数(Threads_running)大于或等于100,不满足扩容前置条件。"
fi
fi
#======================磁盘==============================
DISK_PATH="/usr/local/mysql/"# 指定要监控的磁盘路径
BINLOG_DIR="/usr/local/mysql/"# MySQL BINLOG所在的目录,需根据实际情况调整
WARN_FILE="/usr/local/mysql/warn.txt"# 告警信息存储文件
# 获取指定路径的磁盘使用百分比
DISK_USAGE=$(df $DISK_PATH --output=pcent | tail -1 | tr -d ' %')
# 清理BINLOG的函数
purge_binlogs() {
local cutoff=$1# 截止时间参数
echo"正在清理$cutoff前的所有BINLOG..."
mysql -u$db_user -p$db_passwd -e "PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL $cutoff);"
}
# 记录告警信息的函数
log_warning() {
local message="$1"
echo"$(date '+%Y-%m-%d %H:%M:%S') - $message" >> $WARN_FILE
}
# 判断并执行相应的操作
if [ "$DISK_USAGE" -ge 90 ]; then
echo"警告: 磁盘使用率($DISK_USAGE%)已达到或超过90%,开始清理2小时前的所有BINLOG..."
purge_binlogs '2 hour'
log_warning "磁盘使用率已达到或超过90%"
elif [ "$DISK_USAGE" -ge 85 ]; then
echo"警告: 磁盘使用率($DISK_USAGE%)已达到或超过85%,开始清理1天前的所有BINLOG..."
purge_binlogs '1 day'
log_warning "磁盘使用率已达到或超过85%"
elif [ "$DISK_USAGE" -ge 80 ]; then
echo"警告: 磁盘使用率($DISK_USAGE%)已达到或超过80%,开始清理3天前的所有BINLOG..."
purge_binlogs '3 day'
log_warning "磁盘使用率已达到或超过80%"
else
echo"磁盘使用正常,无需清理BINLOG"
fi
#=================主键自增ID溢出=========================
LOG_FILE="/usr/local/mysql/warn.txt"
THRESHOLD=1000000000 # 主键ID阈值
# 获取当前日期时间
CURRENT_DATETIME=$(date '+%Y-%m-%d %H:%M:%S')
# 创建日志条目头部
LOG_ENTRY="[$CURRENT_DATETIME] - 检查数据库主键ID是否接近上限:\n"
# 查询所有自增主键并检查其值
WARNINGS=$(mysql -u"$db_user" -p"$db_passwd" -e "
SELECT
CONCAT(TABLE_SCHEMA, '.', TABLE_NAME) AS table_name,
COLUMN_NAME AS column_name,
COLUMN_TYPE AS column_type,
AUTO_INCREMENT AS current_value
FROM
INFORMATION_SCHEMA.COLUMNS
INNER JOIN
INFORMATION_SCHEMA.TABLES ON COLUMNS.TABLE_SCHEMA = TABLES.TABLE_SCHEMA AND COLUMNS.TABLE_NAME = TABLES.TABLE_NAME
WHERE
EXTRA='auto_increment'
AND
DATA_TYPE IN ('int', 'integer')
AND
AUTO_INCREMENT > $THRESHOLD;")
# 如果有满足条件的记录,则添加到日志文件,并尝试修改数据类型
if [ -n "$WARNINGS" ]; then
LOG_ENTRY="$LOG_ENTRY$WARNINGS\n"
# 将警告信息写入日志文件
echo -e "$LOG_ENTRY" >> "$LOG_FILE"
echo"警告信息已记录到 $LOG_FILE"
# 处理每一行警告信息
echo"$WARNINGS" | whileread -r line; do
table_name=$(echo"$line" | awk '{print $1}' | tr -d '`')
column_name=$(echo"$line" | awk '{print $2}' | tr -d '`')
# 构造 ALTER TABLE 语句
alter_sql="ALTER TABLE $table_name MODIFY $column_name BIGINT NOT NULL AUTO_INCREMENT;"
# 执行 ALTER TABLE 操作
mysql -u"$db_user" -p"$db_passwd" -e "$alter_sql"
if [ $? -eq 0 ]; then
echo"成功将表 $table_name 的列 $column_name 修改为 BIGINT"
LOG_ENTRY="[$(date '+%Y-%m-%d %H:%M:%S')] - 成功将表 $table_name 的列 $column_name 修改为 BIGINT\n"
echo -e "$LOG_ENTRY" >> "$LOG_FILE"
else
echo"修改表 $table_name 的列 $column_name 失败"
LOG_ENTRY="[$(date '+%Y-%m-%d %H:%M:%S')] - 修改表 $table_name 的列 $column_name 失败\n"
echo -e "$LOG_ENTRY" >> "$LOG_FILE"
fi
done
else
echo"没有发现接近上限的主键ID"
-
检测MySQL是否存在宕机情况,并且在服务宕机时自动重启; -
检测服务端是否保持足够的内存容量,当内存即将占用满时,尝试释放页面缓存、目录项(dentries)和 inode以释放内存; -
应对Thread_running阻塞问题,会尝试获取当前Thread_running的数量,在其超过阈值的时候,查找并终止运行中的慢查询; -
应对连接数过高情况,会获取当前max_connections和Threads_running的数量,并尝试添加最大连接数,在生产环境中应当谨慎对待添加最大连接数的情况,可以设置最大阈值,即将到达阈值时则不再添加最大连接数而是通过webhook、邮件等方式进行告警,人工处理; -
应对磁盘容量不足情况,会检测当前MySQL目录所在磁盘容量,当磁盘不足时按照不同的阈值对bin_log日志进行清理; -
应对主键自增ID溢出情况,会检测数据库主键ID是否即将达到INT上限,在超过设定的阈值后将主键的类型更改为BIGINT类型,并且将过程记录到日志文件中。
本文作者:周登晖(上海新炬中北团队)
本文来源:“IT那活儿”公众号

