Entity extraction integrating lexical information for coal mine safety accidents
Abstract: Named entity recognition (NER) is a foundational task in building a knowledge graph for coal mine safety accidents. Chinese text, however, lacks explicit word boundaries, so existing entity extraction models make insufficient use of lexical information. To address this, an entity extraction model for coal mine safety accidents is proposed: a RoBERTa-BiLSTM-CRF model that integrates lexical information. First, a domain lexicon for coal mine safety is constructed; RoBERTa produces character feature vectors, the Aho-Corasick (AC) automaton matches characters against the lexicon to find the candidate words for each character, and GloVe produces word feature vectors. Next, a self-attention mechanism assigns weights to fuse the RoBERTa-based character vectors with the GloVe-based word vectors, giving a fused vector that carries lexical information. Finally, the fused vector is fed into BiLSTM-CRF to obtain the optimal predicted label sequence, which completes entity extraction for coal mine safety accidents. Experimental results show that: (1) the proposed model reaches an F1 of 91.63% on 12 entity types in the coal mine safety domain, 1.63 percentage points higher than the RoBERTa-BiLSTM-CRF model; (2) its overall performance is better than that of the comparison models, both on the overall extraction task and across the individual entity types, which suggests that the architecture applies broadly to different entity types.
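The character-word matching step described in the abstract can be illustrated with a small Aho-Corasick automaton. This is a minimal sketch, not the authors' implementation: the function names `build_automaton` and `match_words`, the toy lexicon, and the sample sentence are all assumptions made for illustration (the lexicon terms are borrowed from the examples in Table 1).

```python
from collections import deque


def build_automaton(words):
    """Build an Aho-Corasick automaton: goto edges, failure links, outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # lexicon words that end at each state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: set failure links and inherit outputs from them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def match_words(text, automaton):
    """Return, for each character position, the lexicon words covering it."""
    goto, fail, out = automaton
    covering = [[] for _ in text]
    state = 0
    for end, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            start = end - len(word) + 1
            for i in range(start, end + 1):
                covering[i].append(word)
    return covering


if __name__ == "__main__":
    # Hypothetical mini-lexicon and sentence, for illustration only.
    lexicon = ["掘进", "掘进机", "掘进工作面", "工作面", "瓦斯"]
    sentence = "掘进工作面瓦斯超限"
    automaton = build_automaton(lexicon)
    for ch, words in zip(sentence, match_words(sentence, automaton)):
        print(ch, words)
```

In this toy sentence the character 工 is covered by both 掘进工作面 and 工作面, which is the kind of overlapping candidate set that a fusion layer would then have to weight. The single left-to-right scan is the reason an automaton is attractive here: every lexicon entry is matched in one pass over the sentence.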
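
Table 1 Named entity annotation scheme

| No. | Entity | Label | Example | Remarks |
|---|---|---|---|---|
| 1 | Accidents and disasters | Accident | Flood accident (水灾事故) | Research object |
| 2 | Coal mining operations | Method | Tunneling operation (掘进作业) | Personnel operation |
| 3 | Prevention and control measures | Prevention | Roof maintenance (顶板维护) | Personnel operation |
| 4 | Rescue and aftermath | Rescue | Emergency drainage (抢排水) | Personnel operation |
| 5 | Personnel | Person | Mining and tunneling worker (采掘工) | Personnel |
| 6 | Electromechanical equipment | Facility | Roadheader (掘进机) | Machine |
| 7 | Spatial location | Place | Heading face (掘进工作面) | Environment |
| 8 | Atmospheric environment | Atmospheric | Gas (瓦斯) | Environment |
| 9 | Geological conditions | Geology | Coal seam thickness (煤层厚度) | Environment |
| 10 | Data parameters | Parameters | Per shift, per week (每班,每周) | Management |
| 11 | Safety management | Management | Comprehensive emergency plan (综合应急预案) | Management |
| 12 | Organization | Organization | Emergency rescue command headquarters (抢险救援指挥部) | Management |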
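
Table 2 Parameter settings of the RoBERTa-BiLSTM-CRF entity extraction model

| Parameter | RoBERTa layer | Character-word fusion layer | BiLSTM layer | CRF layer |
|---|---|---|---|---|
| Batch size | 32 | - | - | - |
| Maximum sentence length | 256 | - | - | - |
| Number of labels | 12 | - | - | 12 |
| Transition matrix dimensions | - | - | - | 14×14 |
| Embedding vector dimension | 1024 | 1024 | 1024 | 1024 |
| Transformer layers | 12 | - | - | - |
| Hidden layer size | 768 | 768 | 128 | - |
| Multi-head attention heads | 12 | 12 | - | - |
| Word vector dimension | - | 100 | - | - |
| LSTM layers | - | - | 2 | - |
| Dropout | 0.1 | 0.1 | 0.5 | - |
| Learning rate | 3×10⁻⁵ | 3×10⁻⁵ | 1.5×10⁻³ | - |
| Normalization parameter | - | 0.7 | - | - |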
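
Table 2 lists a transition matrix for the CRF layer, and the abstract states that the BiLSTM-CRF stage yields the optimal predicted sequence. Decoding of this kind is commonly done with the Viterbi algorithm; the sketch below shows the idea on a toy problem and is not the authors' code. The three tags, all scores, and the use of separate start and end score vectors are assumptions for illustration (one possible reading of the 14×14 matrix is 12 labels plus start and end states, but the table does not say so).

```python
def viterbi(emissions, transitions, start, end):
    """Return the highest-scoring tag path and its score.

    emissions[t][j]   : score of tag j at position t (e.g. from a BiLSTM)
    transitions[i][j] : score of moving from tag i to tag j
    start[j], end[j]  : scores for beginning / finishing in tag j
    """
    n_tags = len(start)
    score = [start[j] + emissions[0][j] for j in range(n_tags)]
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for arriving at tag j.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    last = max(range(n_tags), key=lambda j: score[j] + end[j])
    best_score = score[last] + end[last]
    # Follow the back-pointers from the last position to the first.
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best_score


if __name__ == "__main__":
    # Hypothetical tags and scores; O -> I and start -> I are heavily penalised.
    tags = ["O", "B", "I"]
    emissions = [[1.0, 0.5, 0.0], [0.0, 0.0, 1.0]]
    transitions = [[0.0, 0.0, -100.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    start = [0.0, 0.0, -100.0]
    end = [0.0, 0.0, 0.0]
    path, best = viterbi(emissions, transitions, start, end)
    print([tags[i] for i in path], best)
```

Picking the best tag at each position independently would give O followed by I, which the transition scores forbid; scoring whole paths instead lets the decoder choose B followed by I. That path-level constraint is what a CRF layer adds on top of per-token classification.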

Table 3 Entity extraction performance of different models (%)

| Model | F1 | Precision | Recall |
|---|---|---|---|
| BiLSTM-CRF | 70.83 | 71.53 | 70.14 |
| RoBERTa-Softmax | 84.91 | 85.64 | 84.19 |
| RoBERTa-CRF | 86.52 | 87.46 | 85.60 |
| RoBERTa-BiLSTM-CRF | 90.00 | 91.91 | 88.17 |
| Proposed model | 91.63 | 92.38 | 90.89 |
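
Table 4 F1-scores for 12 entity categories

| Entity category | Count | BiLSTM-CRF F1/% | RoBERTa-Softmax F1/% | RoBERTa-CRF F1/% | RoBERTa-BiLSTM-CRF F1/% | Proposed model F1/% |
|---|---|---|---|---|---|---|
| Accidents and disasters | 524 | 64.50 | 78.82 | 80.53 | 84.16 | 85.69 |
| Coal mining operations | 613 | 63.78 | 77.98 | 79.77 | 83.20 | 84.83 |
| Prevention and control measures | 515 | 65.83 | 80.00 | 81.75 | 87.38 | 86.99 |
| Rescue and aftermath | 209 | 70.33 | 87.56 | 86.60 | 90.43 | 91.87 |
| Personnel | 185 | 76.22 | 85.41 | 92.43 | 98.38 | 97.84 |
| Electromechanical equipment | 1721 | 76.06 | 84.72 | 93.72 | 94.19 | 95.41 |
| Spatial location | 1127 | 72.94 | 90.68 | 88.73 | 92.28 | 93.88 |
| Atmospheric environment | 158 | 67.09 | 81.65 | 83.54 | 87.34 | 89.24 |
| Geological conditions | 253 | 69.96 | 84.58 | 86.56 | 92.09 | 91.70 |
| Data parameters | 758 | 71.11 | 88.65 | 87.07 | 90.50 | 92.22 |
| Safety management | 432 | 73.15 | 91.90 | 92.59 | 92.82 | 94.44 |
| Organization | 69 | 73.91 | 86.96 | 89.86 | 98.55 | 100.00 |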
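
The figures reported in Tables 3 and 4 can be cross-checked with a few lines of code. The sketch below recomputes F1 as the harmonic mean of precision and recall for each model in Table 3, and lists the Table 4 categories where the proposed model is not the top scorer. The numbers are copied from the tables; the English category names are translations, and the helper names `f1_from_pr` and `not_best` are assumptions for illustration.

```python
# Values copied from Table 3 (percent): model -> (reported F1, precision, recall).
TABLE3 = {
    "BiLSTM-CRF": (70.83, 71.53, 70.14),
    "RoBERTa-Softmax": (84.91, 85.64, 84.19),
    "RoBERTa-CRF": (86.52, 87.46, 85.60),
    "RoBERTa-BiLSTM-CRF": (90.00, 91.91, 88.17),
    "Proposed": (91.63, 92.38, 90.89),
}

# Values copied from Table 4: category -> (entity count, F1 per model in the
# order BiLSTM-CRF, RoBERTa-Softmax, RoBERTa-CRF, RoBERTa-BiLSTM-CRF, proposed).
TABLE4 = {
    "Accidents and disasters": (524, [64.50, 78.82, 80.53, 84.16, 85.69]),
    "Coal mining operations": (613, [63.78, 77.98, 79.77, 83.20, 84.83]),
    "Prevention and control measures": (515, [65.83, 80.00, 81.75, 87.38, 86.99]),
    "Rescue and aftermath": (209, [70.33, 87.56, 86.60, 90.43, 91.87]),
    "Personnel": (185, [76.22, 85.41, 92.43, 98.38, 97.84]),
    "Electromechanical equipment": (1721, [76.06, 84.72, 93.72, 94.19, 95.41]),
    "Spatial location": (1127, [72.94, 90.68, 88.73, 92.28, 93.88]),
    "Atmospheric environment": (158, [67.09, 81.65, 83.54, 87.34, 89.24]),
    "Geological conditions": (253, [69.96, 84.58, 86.56, 92.09, 91.70]),
    "Data parameters": (758, [71.11, 88.65, 87.07, 90.50, 92.22]),
    "Safety management": (432, [73.15, 91.90, 92.59, 92.82, 94.44]),
    "Organization": (69, [73.91, 86.96, 89.86, 98.55, 100.00]),
}


def f1_from_pr(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def not_best(table):
    """Categories where the last column (proposed model) is not the top score."""
    return [cat for cat, (_, scores) in table.items() if scores[-1] < max(scores)]


if __name__ == "__main__":
    for model, (reported, p, r) in TABLE3.items():
        print(model, "recomputed:", round(f1_from_pr(p, r), 2), "reported:", reported)
    print("Proposed model not top in:", not_best(TABLE4))
```

The recomputed F1 values agree with the reported ones to two decimal places. On Table 4, the proposed model has the highest score in nine of the twelve categories; RoBERTa-BiLSTM-CRF is slightly ahead on prevention and control measures, personnel, and geological conditions.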