ETCD No Space Error
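Background Analysis
A customer site runs a three-node primary/replica replication cluster built on patroni + etcd + postgres + haproxy + keepalived (for deployment details, see the PostgreSQL high-availability cluster installation guide). After about a year in operation, the customer reported that the replica nodes were endlessly trying to log in to the primary node, which looked like a brute-force attack.
So I logged in to the cluster and checked the postgres cluster status, and found nothing abnormal.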
systemctl status patroni.service
● patroni.service - patroni
Loaded: loaded (/usr/lib/systemd/system/patroni.service; enabled; vendor preset: disabled)
Active: active (running) since 日 2022-11-20 16:17:52 CST; 10 months 5 days ago
Main PID: 11292 (patroni)
Tasks: 280
CGroup: /system.slice/patroni.service
├─ 11292 /usr/bin/python3 /usr/local/bin/patroni /data/app/patroni/patroni.yml
├─ 11316 postgres -D /data/pgdata --config-file=/data/pgdata/postgresql.conf --listen_addresses=0.0.0.0 --port=5432 --cluster_name=pg_cluster --wal_level=logical --hot_standby=on --max_connections=500 --max_wal_senders=24 --max_prepared_transactions=0 --max_locks_per_transaction=128 --track_commit_timestamp=True --max_replication_slots=16 --max_worker_processes=32 --wal_log_hints=on
├─ 11321 postgres: pg_cluster: logger
├─ 11331 postgres: pg_cluster: checkpointer
├─ 11332 postgres: pg_cluster: background writer
├─ 11342 postgres: pg_cluster: stats collector
├─ 11345 postgres: pg_cluster: postgres postgres [local] idle
11月 20 16:17:52 postgres-3 systemd[1]: Started patroni.
11月 20 16:17:52 postgres-3 patroni[11292]: 2022-11-20 16:17:52.851 CST [11316] LOG: redirecting log output to logging collector process
11月 20 16:17:52 postgres-3 patroni[11292]: 2022-11-20 16:17:52.851 CST [11316] HINT: Future log output will appear in directory "log".
11月 20 16:17:53 postgres-3 patroni[11292]: /var/run/postgresql:5432 - 接受连接
11月 20 16:17:53 postgres-3 patroni[11292]: /var/run/postgresql:5432 - 接受连接
11月 20 16:34:37 postgres-3 patroni[11292]: 服务器重新加载中
12月 26 14:59:33 postgres-3 patroni[11292]: 服务器进程发出信号
12月 26 15:01:13 postgres-3 patroni[11292]: 服务器进程发出信号
patronictl -c patroni.yml list
+-----------------+--------------+---------+---------+----+-----------+-----------------+
| Member | Host | Role | State | TL | Lag in MB | Pending restart |
+ Cluster: pg_cluster (7146121599824209406) ----+----+-----------+-----------------+
| postgres-1 | 10.1.1.17 | Replica | running | 12 | 0 | |
| postgres-2 | 10.1.1.18 | Replica | running | 12 | 0 | |
| postgres-3 | 10.1.1.19 | Leader | running | 12 | | |
+-----------------+--------------+---------+---------+----+-----------+-----------------+
Next I checked the patroni logs and found a large number of errors:
2023-09-24 04:30:23 +0800 ERROR: <Unknown error: 'etcdserver: mvcc: database space exceeded', code: 2>
2023-09-24 04:30:23 +0800 INFO: no action. I am (postgres-3), the leader with the lock
2023-09-24 04:30:33 +0800 INFO: Lock owner: postgres-3; I am postgres-3
2023-09-24 04:30:33 +0800 ERROR:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 566, in wrapper
retval = func(self, *args, **kwargs) is not None
File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 769, in _write_status
return self._client.put(self.status_path, value)
File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 294, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 337, in put
return self.call_rpc('/kv/put', fields, retry)
File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 558, in call_rpc
ret = super(PatroniEtcd3Client, self).call_rpc(method, fields, retry)
File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 262, in call_rpc
return self.api_execute(self.version_prefix + method, self._MPOST, fields)
File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 257, in api_execute
return self._handle_server_response(response)
File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 220, in _handle_server_response
_raise_for_data(data, response.status)
File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 136, in _raise_for_data
raise err(code, error, status_code)
The errors point to etcd, most likely a full backend database. Checking the etcd cluster status confirmed that it had run out of space.
etcdctl endpoint health --cluster -w table
+--------------------------+--------+------------+---------------------------+
| ENDPOINT | HEALTH | TOOK | ERROR |
+--------------------------+--------+------------+---------------------------+
| http://10.1.1.19:2379 | false | 1.185127ms | Active Alarm(s): NOSPACE |
| http://10.1.1.17:2379 | false | 1.257693ms | Active Alarm(s): NOSPACE |
| http://10.1.1.18:2379 | false | 894.418µs | Active Alarm(s): NOSPACE |
+--------------------------+--------+------------+---------------------------+
etcdctl endpoint status --cluster -w table
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| http://10.1.1.18:2379 | 194439f4d645e54f | 3.5.1 | 2.1 GB | false | false | 80 | 11549461 | 11549461 | memberID:1820643873094231375 |
| | | | | | | | | | alarm:NOSPACE |
| http://10.1.1.19:2379 | 6d4e068b57712188 | 3.5.1 | 2.1 GB | true | false | 80 | 11549461 | 11549461 | memberID:1820643873094231375 |
| | | | | | | | | | alarm:NOSPACE |
| http://10.1.1.17:2379 | a9efa80d3e8e6681 | 3.5.1 | 2.1 GB | false | false | 80 | 11549461 | 11549461 | memberID:1820643873094231375 |
| | | | | | | | | | alarm:NOSPACE |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
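The etcd logs contain a large number of errors like the following:
{"level":"warn","ts":"2023-09-26T15:42:23.663+0800","caller":"etcdserver/util.go:123","msg":"failed to apply request","took":"3.163µs","request":"header:<ID:2416326952385748711 > put:<key:\"/pgcluster/pg_cluster/members/postgres-2\" value_size:218 lease:7386330616106706594 >","response":"","error":"etcdserver: no space"}
This confirms that the problem is in etcd. etcd 3.5 ships with a default backend storage quota of 2 GB (the official default). The output above shows DB SIZE at 2.1 GB, which means the quota has been reached (etcdctl prints decimal units, so a 2 GiB quota shows up as roughly 2.1 GB). As a result, patroni can no longer update the database node information.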
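To make the quota comparison concrete, here is a minimal sketch that reads the raw byte count instead of the rounded table value. The helper names db_size_bytes and quota_usage_percent are hypothetical, and the sketch assumes the JSON output of `etcdctl endpoint status` carries a `dbSize` field in bytes, as it does on the 3.5 cluster shown here.

```shell
#!/bin/sh
# Default etcd backend quota: 2 GiB, expressed in bytes.
DEFAULT_QUOTA_BYTES=$((2 * 1024 * 1024 * 1024))

# db_size_bytes: read `etcdctl endpoint status -w json` output on stdin
# and print the first "dbSize" value (bytes). The pattern requires the
# closing quote right after dbSize, so "dbSizeInUse" is not matched.
db_size_bytes() {
    grep -Eo '"dbSize":[0-9]+' | head -n 1 | grep -Eo '[0-9]+'
}

# quota_usage_percent SIZE [QUOTA]: print the integer percentage of the
# quota that SIZE occupies. QUOTA defaults to the 2 GiB default quota.
quota_usage_percent() {
    size=$1
    quota=${2:-$DEFAULT_QUOTA_BYTES}
    echo $((size * 100 / quota))
}
```

A possible use is `quota_usage_percent "$(etcdctl endpoint status -w json | db_size_bytes)"`, which could feed a monitoring check long before the NOSPACE alarm fires.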
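Repairing the etcd Nodes
patroni uses etcd to store postgres member information, leader election state, and postgres configuration, so in principle it should not need much storage. Why did the space fill up? The log above shows patroni writing the members key to etcd, so we can conclude that patroni keeps updating the postgres member entries, which makes the revision history of those keys grow without bound. Because etcd had no automatic compaction policy configured, the space kept growing until it hit the 2 GB limit and triggered the etcd failure. The repair steps follow.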
- Backup
etcdctl snapshot save etcd_bak.db
etcdctl -w table snapshot status etcd_bak.db
+----------+----------+------------+------------+
| HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| da76f994 | 6662600 | 23 | 2.1 GB |
+----------+----------+------------+------------+
- Cleanup
Log in to each node in turn and run the cleanup.
# Check alarms
etcdctl alarm list
memberID:1820643873094231375 alarm:NOSPACE
# Get the current revision
etcdctl --endpoints=:2379 endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*'
6662600
# Compact away old revisions
etcdctl compact 6662600
compacted revision 6662600
# Defragment to release the freed space
etcdctl defrag
# Check the status
etcdctl endpoint status --cluster -w table
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| http://10.1.1.18:2379 | 194439f4d645e54f | 3.5.1 | 2.1 GB | false | false | 80 | 11549268 | 11549268 | memberID:1820643873094231375 |
| | | | | | | | | | alarm:NOSPACE |
| http://10.1.1.19:2379 | 6d4e068b57712188 | 3.5.1 | 33 kB | true | false | 80 | 11549268 | 11549268 | memberID:1820643873094231375 |
| | | | | | | | | | alarm:NOSPACE |
| http://10.1.1.17:2379 | a9efa80d3e8e6681 | 3.5.1 | 2.1 GB | false | false | 80 | 11549268 | 11549268 | memberID:1820643873094231375 |
| | | | | | | | | | alarm:NOSPACE |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
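The compaction and defragmentation steps above could be wrapped into a small script along these lines. This is a sketch, not the procedure the author ran: the function names current_revision and compact_and_defrag are hypothetical, and it assumes etcdctl is on the PATH and that every endpoint is passed explicitly.

```shell
#!/bin/sh
# current_revision: read `etcdctl endpoint status -w json` output on
# stdin and print the first "revision" value found.
current_revision() {
    grep -Eo '"revision":[0-9]+' | head -n 1 | grep -Eo '[0-9]+'
}

# compact_and_defrag ENDPOINT...: compact the key history up to the
# current revision (asked of the first endpoint), then defragment each
# endpoint that was passed in, one at a time.
compact_and_defrag() {
    first=$1
    rev=$(etcdctl --endpoints="$first" endpoint status -w json | current_revision)
    if [ -z "$rev" ]; then
        echo "could not read the current revision from $first" >&2
        return 1
    fi
    etcdctl --endpoints="$first" compact "$rev" || return 1
    for ep in "$@"; do
        etcdctl --endpoints="$ep" defrag || return 1
    done
}
```

For this cluster the call would be `compact_and_defrag http://10.1.1.17:2379 http://10.1.1.18:2379 http://10.1.1.19:2379`. Defragmenting one member at a time keeps the others available while each database file is rewritten.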
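- Clear the alarm
Run the cleanup above on all three nodes in turn until DB SIZE is back to normal on each of them. Compaction applies cluster-wide, while defrag only affects the member it is run against, which is why only one node has shrunk in the output above. The ERRORS column still shows the alarm at this point, so it has to be cleared explicitly.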
etcdctl alarm disarm
# Check the status again
etcdctl endpoint status --cluster -w table
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://10.1.1.18:2379 | 194439f4d645e54f | 3.5.1 | 41 kB | false | false | 80 | 11549621 | 11549621 | |
| http://10.1.1.19:2379 | 6d4e068b57712188 | 3.5.1 | 41 kB | true | false | 80 | 11549621 | 11549621 | |
| http://10.1.1.17:2379 | a9efa80d3e8e6681 | 3.5.1 | 41 kB | false | false | 80 | 11549621 | 11549621 | |
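Since the root cause was a missing automatic compaction policy, one way to keep the problem from coming back is to enable etcd's built-in auto-compaction. The fragment below is a sketch using etcd's environment-variable form of the --auto-compaction-mode, --auto-compaction-retention and --quota-backend-bytes flags. The values are illustrative assumptions, not settings taken from this deployment.

```shell
# Illustrative etcd settings (values are assumptions, tune for your cluster).
# Compact the revision history automatically, keeping about the last hour.
export ETCD_AUTO_COMPACTION_MODE=periodic
export ETCD_AUTO_COMPACTION_RETENTION=1h
# Optionally raise the backend quota from the 2 GiB default to 8 GiB (bytes).
export ETCD_QUOTA_BACKEND_BYTES=$((8 * 1024 * 1024 * 1024))
```

Auto-compaction only discards old revisions. The database file itself shrinks only after a defrag, so an occasional `etcdctl defrag` on each member is still worth scheduling.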