Abstract:
To address challenges such as multi-scale variation of underground targets, occlusion of moving objects, and the strong visual similarity between targets and their surroundings, a deep learning-based method is proposed for detecting and recognizing unsafe behaviours of underground coal miners. A top-down approach was adopted: personnel are detected first, human keypoints are then extracted for each detected person, and behaviour is finally recognized from the keypoint sequence. For detection, a YOLOv5s_swin model based on a self-attention mechanism was constructed. A sliding-window operation was introduced into the Transformer self-attention mechanism to obtain the Swin Transformer, which was then used to enhance the conventional YOLOv5s model, yielding YOLOv5s_swin. Because the distance between underground personnel and surveillance cameras varies, the detected human bounding boxes vary in scale; to handle this, a high-resolution feature extraction network (HRNet) was employed to extract human keypoints after personnel detection. A spatiotemporal graph convolutional network (ST-GCN) was then used for behaviour recognition. Experimental results showed that YOLOv5s_swin achieved an accuracy of 98.9%, an improvement of 1.5% over YOLOv5s, at an inference speed of 102 frames per second (fps), which meets real-time detection requirements. The high-resolution feature extraction network effectively extracted human keypoints at different scales, and HRNet_w48, which has more feature channels, outperformed HRNet_w32. Under complex industrial and mining conditions, the ST-GCN model achieved high accuracy and recall, classified miners' behaviours precisely, and ran at an inference speed of 31 fps, thereby meeting underground monitoring requirements.
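The three-stage, top-down flow described above (person detection, keypoint extraction, skeleton-based behaviour recognition) can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function `run_top_down_pipeline`, the three callables standing in for YOLOv5s_swin, HRNet, and ST-GCN, the `clip_len` parameter, and the choice to follow only the first detected person per frame are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels
Keypoints = List[Tuple[float, float]]     # one (x, y) pair per joint


def run_top_down_pipeline(
    frames: Sequence[object],
    detect_person: Callable[[object], List[Box]],            # stand-in for the detector
    estimate_keypoints: Callable[[object, Box], Keypoints],  # stand-in for the pose network
    classify_sequence: Callable[[List[Keypoints]], str],     # stand-in for the action model
    clip_len: int = 30,                                      # hypothetical clip length
) -> List[str]:
    """Return one behaviour label per full clip of `clip_len` skeleton frames.

    Simplification (assumption): only the first detected person in each
    frame is followed; frames without a detection are skipped.
    """
    clip: Deque[Keypoints] = deque(maxlen=clip_len)
    labels: List[str] = []
    for frame in frames:
        boxes = detect_person(frame)                      # stage 1: person detection
        if not boxes:
            continue
        clip.append(estimate_keypoints(frame, boxes[0]))  # stage 2: keypoints inside the box
        if len(clip) == clip_len:
            labels.append(classify_sequence(list(clip)))  # stage 3: behaviour recognition
            clip.clear()                                  # use non-overlapping clips
    return labels
```

Passing the three stages as callables keeps the sketch independent of any particular model library; in practice each callable would wrap a trained network.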