Handling an Elasticsearch Disk-Full Incident

Aug 18, 2023

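Fault analysis

An Elasticsearch instance that stores application logs is deployed on node-3. The node is supposed to have a scheduled cleanup job, implemented with elasticsearch-curator. Check whether this job actually runs and whether its configuration is correct.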

Green - everything is good (cluster is fully functional)
Yellow - all data is available but some replicas are not yet allocated (cluster is fully functional)
Red - some data is not available for whatever reason (cluster is partially functional)

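Troubleshooting process

Investigation

1. Check filesystem space. No data disk is mounted, the data lives on the system root filesystem, and root usage has reached 92%.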

[root@node-1 ~]# df -h
Filesystem               Size  Used Avail Use% Mounted on
devtmpfs                 3.9G     0  3.9G   0% /dev
tmpfs                    3.9G     0  3.9G   0% /dev/shm
tmpfs                    3.9G  281M  3.6G   8% /run
tmpfs                    3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/mapper/centos-root   83G   76G    7G  92% /
/dev/sda1               1014M  149M  866M  15% /boot
tmpfs                    783M     0  783M   0% /run/user/0
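
The manual check above can be turned into a small filter. This is a minimal sketch, not part of the original procedure: the function name `fs_over_threshold` is hypothetical, and the 85% threshold in the usage line is only an assumption chosen to warn before the 90% watermark seen later in this incident. It reads `df -P` output on stdin and assumes mount points contain no spaces.

```shell
# Print "<mount point> <usage>%" for every filesystem at or above a threshold.
# Reads `df -P` (or `df -h`) output on stdin, so it can be tried on saved output.
fs_over_threshold() {
  threshold=$1
  awk -v t="$threshold" '
    NR > 1 {                      # skip the header line
      pct = $5                    # 5th column is the usage percentage
      sub(/%/, "", pct)
      if (pct + 0 >= t) print $6, pct "%"
    }'
}

# Usage on a live host:
#   df -P | fs_over_threshold 85
```

Fed the `df` output shown above, the filter reports only the root filesystem.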

2. Check the block devices. /dev/sdb is unused.

[root@node-1 ~]# lsblk
NAME            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda               8:0    0  100G  0 disk 
├─sda1            8:1    0    1G  0 part /boot
└─sda2            8:2    0   99G  0 part 
  ├─centos-root 253:0    0   83G  0 lvm  /
  └─centos-swap 253:1    0   16G  0 lvm  
sdb               8:16   0  500G  0 disk 
sr0              11:0    1  4.4G  0 rom

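3. Check the cluster status through the ES REST API.

curl -XGET 'http://localhost:9200/_cat/health?v'
# The status is RED and node.total is 2
# Check the node details
curl -XGET 'http://localhost:9200/_cat/nodes?v'
# Only node-1 and node-2 are listed; it was confirmed that only 2 ES nodes are deployed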
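
For scripted checks, the status can be pulled out of the `_cat/health?v` output by column name. The sketch below assumes the header line is present (the `?v` flag) and uses a hypothetical function name, `es_health_status`.

```shell
# Extract the "status" column from `_cat/health?v` output read on stdin.
# The column is located by its header name rather than by a fixed position.
es_health_status() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "status") col = i; next }
    col     { print $col; exit }'
}

# Usage on a live node (URL taken from the step above):
#   curl -s 'http://localhost:9200/_cat/health?v' | es_health_status
```

The result is one of green, yellow, or red, matching the three states described at the top of this post.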

4. Check the es-curator cleanup process. A quick ps -ef | grep curator found no matching process; after asking around, it turned out that es-curator runs its cleanup as a cron job.

ps -ef | grep curator
[root@node-2 ~]# crontab -l -u root
#Ansible: None
00 01 * * * /usr/bin/curator --config /opt/logging/elasticsearch-curator/config.yml /opt/logging/elasticsearch-curator/action.yml
#Ansible: None
00 01 * * * /usr/bin/curator --config /opt/logging/elasticsearch-curator/config.yml /opt/logging/elasticsearch-curator/action.yml

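# Note: the crontab contains two identical jobs, both carrying the Ansible marker, probably because the Ansible task was run twice. One of them was later removed by hand.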
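
Duplicate entries like the ones above are easy to miss by eye. A minimal sketch for spotting them, using a hypothetical helper name `dup_cron_entries`; it ignores comment and blank lines and prints each job line that occurs more than once.

```shell
# Print crontab job lines that appear more than once.
# Reads `crontab -l` output on stdin; comments and blank lines are ignored.
dup_cron_entries() {
  grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' | sort | uniq -d
}

# Usage on a live host:
#   crontab -l -u root | dup_cron_entries
```

An empty result means no job line is repeated verbatim; jobs that differ only in whitespace are not detected by this sketch.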

5. Check the ES service status and confirm that the disk is full.

systemctl status -l elasticsearch.service
# The service is active, but the log shows: high disk watermark [90%] exceeded on [xbxg8Qc3RXedVUjBWArVxQ][node-1][/data/es_data/data/nodes/0] free: 6.9gb[8.3%]
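
Disk usage as Elasticsearch itself sees it is also available from the `_cat/allocation` API, which avoids reading the service log on each node. The sketch below is an illustration under assumptions: the function name `nodes_over_watermark` is hypothetical, and the 90 passed in the usage line simply mirrors the watermark value from the log line above.

```shell
# Print "<node> <disk.percent>%" for nodes at or above a usage limit.
# Reads `_cat/allocation?v` output on stdin and finds columns by header name.
nodes_over_watermark() {
  limit=$1
  awk -v limit="$limit" '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "disk.percent") p = i
        if ($i == "node") n = i
      }
      next
    }
    # Rows with missing columns (for example the UNASSIGNED row) evaluate to 0
    p && n && $p + 0 >= limit { print $n, $p "%" }'
}

# Usage on a live node:
#   curl -s 'http://localhost:9200/_cat/allocation?v' | nodes_over_watermark 90
```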

Resolution

Method 1: clean up data with es-curator

1. Check the curator configuration and confirm that the retention period is the most recent 5 weeks.

[root@node-2 ~]# cat /opt/logging/elasticsearch-curator/config.yml 
client:
  hosts:
    - 192.168.1.100:9200
    - 192.168.1.101:9200
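        url_prefix:  # Found after the storage problem was solved: this line is over-indented, which makes curator fail. Once the indentation was fixed, es-curator worked normally again.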
  use_ssl: False
  certificate:
  client_cert:
  client_key:
  ssl_no_validate: False
  http_auth:
  timeout: 30
  master_only: False

logging:
  loglevel: INFO
  logfile: /opt/logging/logs/curator/curator.log
  logformat: default
  blacklist: ['elasticsearch', 'urllib3']
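
Since the over-indented url_prefix line went unnoticed for a day, a crude pre-flight check may be worth having. The following is only a heuristic sketch, not a YAML parser: the function name `check_client_indent` is hypothetical, and it assumes every key directly under client: should share the indentation of the first key, which holds for the flat layout shown above.

```shell
# Flag keys in the `client:` section whose indentation differs from the
# first key of that section. Reads the config file on stdin.
# Exit status is 1 when at least one suspicious line is found.
check_client_indent() {
  awk '
    /^client:/ { in_client = 1; want = -1; next }
    /^[^ #]/   { in_client = 0 }              # another top-level key ends the section
    in_client && /^ *[A-Za-z_]+:/ {
      match($0, /^ */)                         # RLENGTH = number of leading spaces
      if (want < 0) want = RLENGTH
      else if (RLENGTH != want) { printf "line %d: %s\n", NR, $0; bad = 1 }
    }
    END { exit bad + 0 }'
}

# Usage:
#   check_client_indent < /opt/logging/elasticsearch-curator/config.yml
```

Running curator with its --dry-run flag is the more reliable test, because it exercises the real configuration loader without deleting anything.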

[root@node-2 ~]# cat /opt/logging/elasticsearch-curator/action.yml
actions:
  1:
    action: delete_indices
...
      exclude: False
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%W'
      unit: weeks
      unit_count: 5
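
To get a rough idea of which indices the age filter above targets, the week stamp at the end of each index name can be compared with a cutoff stamp. This is an approximation under stated assumptions, not curator's actual algorithm: curator converts the timestring to an epoch before comparing, so the boundary week may be treated differently, and the function name `indices_older_than` is hypothetical. It assumes index names end in a %Y.%W stamp such as 2021.47.

```shell
# Print index names whose trailing YYYY.WW stamp sorts before a cutoff stamp.
# Reads one index name per line on stdin.
indices_older_than() {
  cutoff=$1
  awk -v cutoff="$cutoff" '{
    stamp = substr($0, length($0) - 6)         # last 7 characters, e.g. 2021.47
    if (stamp ~ /^[0-9][0-9][0-9][0-9]\.[0-9][0-9]$/ && (stamp "") < (cutoff ""))
      print
  }'
}

# Usage on a live node (GNU date is assumed for the relative date):
#   cutoff=$(date -d '5 weeks ago' +%Y.%W)
#   curl -s 'http://localhost:9200/_cat/indices?h=index' | indices_older_than "$cutoff"
```

For the authoritative list, run curator with --dry-run against the same action file.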

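2. Reduce the retention to 4 weeks in the configuration, then run the cleanup manually to release old indices. The ES status was still RED afterwards.

/usr/bin/curator --config /opt/logging/elasticsearch-curator/config.yml /opt/logging/elasticsearch-curator/action.yml
# Nothing looked wrong on the first pass. A more detailed check the next day found the broken config.yml; see the comment in the configuration file above.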

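Method 2: extend the partition

1. Investigation showed that the ES data is not on a separately mounted volume but directly on the system root partition. The root partition uses LVM, so after confirmation the data disk can be added to it to extend the root partition.

2. Extend the LVM volume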

[root@node-1 ~]# fdisk /dev/sdb
Welcome to fdisk (util-linux 2.23.2).

Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.


Command (m for help): n

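# After entering n, press Enter to accept each default, then enter w to write the partition table.

# Make the kernel re-read the partition table
partprobe
# Create the PV
pvcreate /dev/sdb1
# Extend the VG
vgextend centos /dev/sdb1
# Extend the LV
lvresize -l +100%FREE /dev/centos/root
# Grow the XFS filesystem to fill the LV
xfs_growfs /dev/centos/root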
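
Because the same steps have to be repeated on the other node, they can be collected into one function. The sketch below is an assumption-laden illustration rather than the procedure that was actually run: it swaps the interactive fdisk session for a non-interactive parted call, the names `run` and `extend_root_with_disk` are hypothetical, and it only prints the commands unless DRY_RUN=0 is set. The volume group and logical volume names are the ones from this post.

```shell
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 and run as root to execute them for real.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Add a whole disk to VG "centos" and grow the root LV and its XFS filesystem.
extend_root_with_disk() {
  disk=$1                                   # e.g. /dev/sdb
  part="${disk}1"
  # parted replaces the interactive fdisk session: one partition spanning the disk
  run parted -s "$disk" mklabel msdos mkpart primary 0% 100% set 1 lvm on &&
  run partprobe "$disk" &&
  run pvcreate "$part" &&
  run vgextend centos "$part" &&
  run lvresize -l +100%FREE /dev/centos/root &&
  run xfs_growfs /
}

# Usage: review the printed commands first, then rerun with DRY_RUN=0
#   extend_root_with_disk /dev/sdb
```

The && chain stops at the first failing step, so a failed pvcreate does not lead to a vgextend against a missing PV.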

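3. Repeat the extension on the other node

4. After both nodes were extended, the cluster status changed to yellow (primary shards available, replica shards not yet allocated)

5. Log in to node-2 and, following the operations manual, try a manual shard reallocation

curl -s -XGET 'http://localhost:9200/_cat/shards' | grep "r UNASSIGNED"

# The Content-Type header is required by Elasticsearch 6.0 and later
curl -XPOST 'http://localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{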
    "commands": [
        {
            "allocate_replica": {
                "index": "xxl-job-logfile-2021.47",
                "shard": 0,
                "node": "node-2"
            }
        }
    ]
}'
# After the request completes, wait a moment; ES then reallocates the remaining shards on its own
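
If several replicas stay unassigned, writing each reroute command by hand gets tedious. The following sketch builds one request body from `_cat/shards` output. The function name `reroute_body_for_unassigned` is hypothetical, and sending every replica to a single target node is an assumption that only makes sense when the matching primaries live on the other node, since Elasticsearch rejects a replica placed on the same node as its primary.

```shell
# Build a _cluster/reroute body with one allocate_replica command per
# unassigned replica. Reads `_cat/shards` output on stdin
# (columns: index shard prirep state ...). Prints nothing if none are found.
reroute_body_for_unassigned() {
  node=$1
  awk -v node="$node" '
    $3 == "r" && $4 == "UNASSIGNED" {
      cmds = cmds sep sprintf("{\"allocate_replica\":{\"index\":\"%s\",\"shard\":%d,\"node\":\"%s\"}}", $1, $2, node)
      sep = ","
    }
    END { if (cmds != "") printf "{\"commands\":[%s]}\n", cmds }'
}

# Usage on a live node:
#   body=$(curl -s 'http://localhost:9200/_cat/shards' | reroute_body_for_unassigned node-2)
#   [ -n "$body" ] && curl -s -XPOST 'http://localhost:9200/_cluster/reroute' \
#     -H 'Content-Type: application/json' -d "$body"
```

All commands go out in one request, so a single invalid command makes the whole request fail; sending one command per request is the more forgiving variant.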

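6. Once all shards had been reallocated, the ES status was green

Summary

The immediate cause of this incident was that disk usage reached Elasticsearch's high disk watermark (90%), which left Elasticsearch able to serve reads but unable to write new data. There were two root causes:

1. The curator scheduled job was misconfigured, so old indices were never deleted and the accumulated data eventually reached the watermark, triggering this incident.

2. The data disk was never mounted, so Elasticsearch used the much smaller system root partition for its data.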