ETCD No Space Error

Sep 26, 2023 4 min read

Background Analysis

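A customer site runs a three-node primary/replica replication cluster built on patroni + etcd + postgres + haproxy + keepalived; for deployment details, see the post on installing a PostgreSQL high-availability cluster. After about a year in operation, the customer reported that the replica nodes were endlessly attempting to log in to the primary node, which looked like a brute-force attack.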

So I logged in to the cluster and checked the postgres cluster status, which showed nothing abnormal.

systemctl status patroni.service
● patroni.service - patroni
   Loaded: loaded (/usr/lib/systemd/system/patroni.service; enabled; vendor preset: disabled)
   Active: active (running) since 日 2022-11-20 16:17:52 CST; 10 months 5 days ago
 Main PID: 11292 (patroni)
    Tasks: 280
   CGroup: /system.slice/patroni.service
           ├─ 11292 /usr/bin/python3 /usr/local/bin/patroni /data/app/patroni/patroni.yml
           ├─ 11316 postgres -D /data/pgdata --config-file=/data/pgdata/postgresql.conf --listen_addresses=0.0.0.0 --port=5432 --cluster_name=pg_cluster --wal_level=logical --hot_standby=on --max_connections=500 --max_wal_senders=24 --max_prepared_transactions=0 --max_locks_per_transaction=128 --track_commit_timestamp=True --max_replication_slots=16 --max_worker_processes=32 --wal_log_hints=on
           ├─ 11321 postgres: pg_cluster: logger 
           ├─ 11331 postgres: pg_cluster: checkpointer 
           ├─ 11332 postgres: pg_cluster: background writer 
           ├─ 11342 postgres: pg_cluster: stats collector 
           ├─ 11345 postgres: pg_cluster: postgres postgres [local] idle

11月 20 16:17:52 postgres-3 systemd[1]: Started patroni.
11月 20 16:17:52 postgres-3 patroni[11292]: 2022-11-20 16:17:52.851 CST [11316] LOG:  redirecting log output to logging collector process
11月 20 16:17:52 postgres-3 patroni[11292]: 2022-11-20 16:17:52.851 CST [11316] HINT:  Future log output will appear in directory "log".
11月 20 16:17:53 postgres-3 patroni[11292]: /var/run/postgresql:5432 - 接受连接
11月 20 16:17:53 postgres-3 patroni[11292]: /var/run/postgresql:5432 - 接受连接
11月 20 16:34:37 postgres-3 patroni[11292]: 服务器重新加载中
12月 26 14:59:33 postgres-3 patroni[11292]: 服务器进程发出信号
12月 26 15:01:13 postgres-3 patroni[11292]: 服务器进程发出信号
patronictl -c patroni.yml  list
+-----------------+--------------+---------+---------+----+-----------+-----------------+
| Member          | Host         | Role    | State   | TL | Lag in MB | Pending restart |
+ Cluster: pg_cluster (7146121599824209406) ----+----+-----------+-----------------+
| postgres-1      | 10.1.1.17    | Replica | running | 12 |         0 |                 |
| postgres-2      | 10.1.1.18    | Replica | running | 12 |         0 |                 |
| postgres-3      | 10.1.1.19    | Leader  | running | 12 |           |                 |
+-----------------+--------------+---------+---------+----+-----------+-----------------+

Next I checked the patroni logs and found a large number of errors:

2023-09-24 04:30:23 +0800 ERROR: <Unknown error: 'etcdserver: mvcc: database space exceeded', code: 2>
2023-09-24 04:30:23 +0800 INFO: no action. I am (postgres-3), the leader with the lock
2023-09-24 04:30:33 +0800 INFO: Lock owner: postgres-3; I am postgres-3
2023-09-24 04:30:33 +0800 ERROR:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 566, in wrapper
    retval = func(self, *args, **kwargs) is not None
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 769, in _write_status
    return self._client.put(self.status_path, value)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 294, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 337, in put
    return self.call_rpc('/kv/put', fields, retry)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 558, in call_rpc
    ret = super(PatroniEtcd3Client, self).call_rpc(method, fields, retry)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 262, in call_rpc
    return self.api_execute(self.version_prefix + method, self._MPOST, fields)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 257, in api_execute
    return self._handle_server_response(response)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 220, in _handle_server_response
    _raise_for_data(data, response.status)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 136, in _raise_for_data
    raise err(code, error, status_code)

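Judging from the error, the problem lies in etcd, most likely because its storage space is full. Checking the etcd cluster status confirmed that the space was indeed exhausted.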

etcdctl endpoint health --cluster -w table
+--------------------------+--------+------------+---------------------------+
|         ENDPOINT         | HEALTH |    TOOK    |           ERROR           |
+--------------------------+--------+------------+---------------------------+
| http://10.1.1.19:2379    |  false | 1.185127ms | Active Alarm(s): NOSPACE  |
| http://10.1.1.17:2379    |  false | 1.257693ms | Active Alarm(s): NOSPACE  |
| http://10.1.1.18:2379    |  false |  894.418µs | Active Alarm(s): NOSPACE  |
+--------------------------+--------+------------+---------------------------+
etcdctl endpoint status --cluster -w table
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
|         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |             ERRORS             |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| http://10.1.1.18:2379    | 194439f4d645e54f |   3.5.1 |  2.1 GB |     false |      false |        80 |   11549461 |           11549461 |   memberID:1820643873094231375 |
|                          |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
| http://10.1.1.19:2379    | 6d4e068b57712188 |   3.5.1 |  2.1 GB |      true |      false |        80 |   11549461 |           11549461 |   memberID:1820643873094231375 |
|                          |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
| http://10.1.1.17:2379    | a9efa80d3e8e6681 |   3.5.1 |  2.1 GB |     false |      false |        80 |   11549461 |           11549461 |   memberID:1820643873094231375 |
|                          |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+

The etcd logs contain a large number of error entries like {"level":"warn","ts":"2023-09-26T15:42:23.663+0800","caller":"etcdserver/util.go:123","msg":"failed to apply request","took":"3.163µs","request":"header:<ID:2416326952385748711 > put:<key:\"/pgcluster/pg_cluster/members/postgres-2\" value_size:218 lease:7386330616106706594 >","response":"","error":"etcdserver: no space"}, which confirms that the problem is in etcd. In etcd 3.5 the default storage quota is 2 GB (per the official configuration), and the output above shows DB SIZE at 2.1 GB, so the database has hit that default limit. As a result, patroni can no longer update the database node information.
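A quick unit check helps relate the two numbers. The sketch below assumes the 2 GB default is 2 GiB (2 * 1024^3 bytes) and that etcdctl prints DB SIZE in decimal units; the second point is my understanding rather than something stated in the post. Under those assumptions, a database sitting exactly at the quota would be displayed as roughly 2.1 GB.

```shell
# Default etcd backend quota: 2 GiB expressed in bytes.
quota_bytes=$((2 * 1024 * 1024 * 1024))

# Convert to decimal gigabytes with one decimal place (truncated),
# using integer arithmetic only so no external tools are needed.
tenths=$((quota_bytes / 100000000))          # units of 0.1 GB
quota_gb="$((tenths / 10)).$((tenths % 10))"

echo "quota: ${quota_bytes} bytes, about ${quota_gb} GB (decimal)"
```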

Repairing the etcd Nodes

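patroni uses etcd to store postgres member information, leader-election information, and postgres configuration, which in principle should not take much storage. So why did the space fill up? The log above shows patroni writing to the members keys in etcd, so patroni is evidently updating the postgres member nodes continuously, and each update adds another revision of the same keys. etcd had no automatic compaction policy configured, so the space kept growing until it reached the 2 GB limit and triggered the etcd failure. The repair follows.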
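To keep old revisions from piling up again, etcd can be started with an automatic compaction policy. The following is only a sketch: the flag names are etcd's own (`--auto-compaction-mode`, `--auto-compaction-retention`, `--quota-backend-bytes`), but the retention period and the 4 GiB quota are assumed values chosen for illustration, not settings taken from this cluster. The snippet only prints the flags; it does not apply them.

```shell
# Hypothetical values; tune them for the real workload.
retention="1h"                               # keep roughly one hour of key history
quota_bytes=$((4 * 1024 * 1024 * 1024))      # 4 GiB; an assumption, not a recommendation

# Assemble the extra etcd flags (printed here, not applied).
etcd_flags="--auto-compaction-mode=periodic --auto-compaction-retention=${retention} --quota-backend-bytes=${quota_bytes}"
echo "$etcd_flags"
```

Compaction alone frees space inside the database file for reuse; shrinking the file on disk still takes a defrag.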

Backup

etcdctl snapshot save etcd_bak.db
etcdctl -w table snapshot status etcd_bak.db
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| da76f994 |  6662600 |         23 |     2.1 GB |
+----------+----------+------------+------------+

Cleanup

Log in to each node in turn and run the cleanup.

# Check active alarms
etcdctl alarm list
memberID:1820643873094231375 alarm:NOSPACE 
# Look up the current revision
etcdctl --endpoints=:2379 endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*'
6662600
# Compact away the old revisions
etcdctl compact 6662600
compacted revision 6662600
# Defragment to reclaim the space
etcdctl defrag
# Check the status
etcdctl endpoint status --cluster -w table
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
|         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |             ERRORS             |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| http://10.1.1.18:2379    | 194439f4d645e54f |   3.5.1 |  2.1 GB |     false |      false |        80 |   11549268 |           11549268 |   memberID:1820643873094231375 |
|                          |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
| http://10.1.1.19:2379    | 6d4e068b57712188 |   3.5.1 |   33 kB |      true |      false |        80 |   11549268 |           11549268 |   memberID:1820643873094231375 |
|                          |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
| http://10.1.1.17:2379    | a9efa80d3e8e6681 |   3.5.1 |  2.1 GB |     false |      false |        80 |   11549268 |           11549268 |   memberID:1820643873094231375 |
|                          |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
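In the table above only the node where defrag ran has dropped to 33 kB, which fits defragmentation being a per-member operation while compaction is applied across the cluster. The steps can be sketched as a small script. The sample JSON is a made-up payload so the parsing can be tried without a cluster, and `RUN_CLEANUP` is a hypothetical guard variable introduced here so that nothing touches a live cluster by accident; the endpoints are the ones from the tables above.

```shell
# Pull the numeric revision out of `endpoint status --write-out=json` output.
extract_revision() {
  grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+' | head -n 1
}

# Made-up sample payload, only for trying the parser offline.
sample='[{"Endpoint":"127.0.0.1:2379","Status":{"header":{"revision":6662600,"raft_term":80},"dbSize":2147483648}}]'
printf '%s\n' "$sample" | extract_revision

# Against a live cluster the same steps would look like this.
# RUN_CLEANUP is a hypothetical opt-in switch, not an etcd setting.
if command -v etcdctl >/dev/null 2>&1 && [ "${RUN_CLEANUP:-0}" = "1" ]; then
  rev=$(etcdctl endpoint status --write-out=json | extract_revision)
  etcdctl compact "$rev"                     # compaction applies cluster-wide
  for ep in http://10.1.1.17:2379 http://10.1.1.18:2379 http://10.1.1.19:2379; do
    etcdctl --endpoints="$ep" defrag         # defrag one member at a time
  done
fi
```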

Clearing the Alarm

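Repeat the cleanup above on all three nodes in turn until DB SIZE is back to normal everywhere. The ERRORS column still shows the alarm at this point, so the alarm has to be cleared.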

etcdctl alarm disarm
# Check the status again
etcdctl endpoint status --cluster -w table
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://10.8.0.18:2379    | 194439f4d645e54f |   3.5.1 |   41 kB |     false |      false |        80 |   11549621 |           11549621 |        |
| http://10.8.0.19:2379    | 6d4e068b57712188 |   3.5.1 |   41 kB |      true |      false |        80 |   11549621 |           11549621 |        |
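| http://10.8.0.17:2379    | a9efa80d3e8e6681 |   3.5.1 |   41 kB |     false |      false |        80 |   11549621 |           11549621 |        |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+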
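To notice the problem before the quota is hit next time, the database size can be tracked as a share of the quota. This is a minimal sketch: `usage_percent` is a hypothetical helper, the 2 GiB quota is the default discussed earlier, and the byte counts are illustrative. On a live cluster the size would presumably come from the `dbSize` field of `etcdctl endpoint status --write-out=json`.

```shell
# Print DB usage as a whole-number percentage of the quota.
# $1 = database size in bytes, $2 = quota in bytes
usage_percent() {
  echo $(( $1 * 100 / $2 ))
}

quota=$((2 * 1024 * 1024 * 1024))   # default quota, 2 GiB

usage_percent 41000 "$quota"        # roughly the size after cleanup
usage_percent 2147483648 "$quota"   # a database sitting at the quota
```

An alert threshold well below 100 leaves time to compact and defragment before writes start failing.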

Infee Fang