Automated detection of safety factors and risks relies on recognising, in visual material, the conditions and situations that may contribute to a risk. Digital photographs or video collected from the work site can be analysed with a suitable deep learning based computer vision solution to detect potential risks and related factors. Before such a capability is reached, a deep neural network model must be trained on a suitable training data set, as in any other data-driven machine learning solution. For deep learning purposes, the training data needs to be gathered, stored, quality controlled and annotated with suitable labels.

The machine learning based computer vision model is trained to detect risks or risk countermeasures at construction sites. In general, this requires substantial amounts of training data in the form of digital images and/or video containing the features to be detected. The quality of the images matters in two senses. Firstly, semantically, they should contain what is to be detected; secondly, their technical quality, such as resolution, should be reasonable. Resolution may affect the final detection precision, but in practice images are almost always scaled down for training and detection to a fixed size dictated by the detection algorithm. However, if the final detection system is to be robust and able to detect in less than perfect conditions, some imperfections, even artificially introduced ones, are desirable in the training data. The raw digital material alone is not enough to reach adequate detection precision in the final model within a reasonable computing time. The training data therefore needs to be refined manually in an annotation process that marks on the images the risk objects identified in the corresponding risk model. In this case, the annotation classes include safety barriers and safety nets as risk countermeasures and piles of garbage as a fire risk factor, each located and labelled with a bounding box. Figure 1 shows an original image and Figure 2 its annotated counterpart. The sample images in this document are provided by courtesy of HRS, which holds all rights to the original images.
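When an image is scaled to the fixed input size of a detection algorithm, the bounding box coordinates of its annotations have to be scaled with it. The following is a minimal sketch of that step. The function name scale_box, the 4000x3000 photo size and the 416x416 target size are assumptions chosen for illustration only; the actual input size depends on the chosen detection algorithm.

```python
def scale_box(box, orig_size, target_size):
    """Map a box (x1, y1, x2, y2) from the original image to the resized image.

    Width and height are scaled independently, i.e. the image is assumed to be
    stretched to the target size without preserving its aspect ratio.
    """
    x1, y1, x2, y2 = box
    orig_w, orig_h = orig_size
    target_w, target_h = target_size
    return (
        x1 * target_w / orig_w,
        y1 * target_h / orig_h,
        x2 * target_w / orig_w,
        y2 * target_h / orig_h,
    )


# Hypothetical example: a 4000x3000 photo resized to 416x416 pixels.
box = (1000, 750, 3000, 2250)
scaled = scale_box(box, (4000, 3000), (416, 416))
print(scaled)  # -> (104.0, 104.0, 312.0, 312.0)
```

If the resizing instead preserves the aspect ratio and pads the image, the same idea applies, but a single scale factor and the padding offsets would have to be used.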

Figure 1 Original raw image from the construction site. (Copyright HRS)

Figure 2 Labelled and annotated image from the construction site. Annotation here contains barrier and garbage labels with bounding boxes and their x and y coordinates for illustrative purposes only. (Copyright HRS)

The annotation process produces training data containing samples of the needed classes, each with a class label and the location of the item given as a bounding box. A box's location in the coordinate space of the original image consists of the x and y coordinates of its upper left and lower right corners for each labelled instance. Once operational, the detection system will be able to predict these same attributes from construction site images for further use in risk analysis. The annotation and the output of the detection system will thus contain the following:

Class – object class: barrier | net | garbage

x1 – upper left corner x-coordinate of the bounding box

y1 – upper left corner y-coordinate of the bounding box

x2 – lower right corner x-coordinate of the bounding box

y2 – lower right corner y-coordinate of the bounding box
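
The record listed above could be represented in code roughly as follows. This is a minimal sketch, not the project's actual data format: the names Annotation and CLASSES, the field name label (used because class is a reserved word in Python) and the validation rules are assumptions made for illustration.

```python
from dataclasses import dataclass, asdict

# The three object classes named in the text.
CLASSES = ("barrier", "net", "garbage")


@dataclass(frozen=True)
class Annotation:
    """One labelled bounding box in the coordinate space of the original image."""
    label: str   # object class: barrier | net | garbage
    x1: float    # upper left corner x-coordinate
    y1: float    # upper left corner y-coordinate
    x2: float    # lower right corner x-coordinate
    y2: float    # lower right corner y-coordinate

    def __post_init__(self):
        # Reject unknown classes and boxes with corners in the wrong order.
        if self.label not in CLASSES:
            raise ValueError(f"unknown class: {self.label!r}")
        if not (self.x1 < self.x2 and self.y1 < self.y2):
            raise ValueError("upper left corner must be above and left of lower right corner")


# Hypothetical example values, not taken from the figures.
example = Annotation("barrier", 10, 20, 110, 70)
record = asdict(example)  # plain dict, e.g. for writing to JSON or CSV
```

Using the same structure for both the annotations and the detector output keeps the training data and the predictions directly comparable.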

Many freely available tools are suitable for the annotation work and produce annotations in a form usable as training data.

The iterative model creation process may include several training and verification cycles. In each cycle we look for a suitable model architecture, together with its training settings and hyperparameters, to improve the overall detection precision. The current trend in visual object detection and image classification is towards deep convolutional neural networks. These pose a challenge not only for the quantity and quality of the available data but also for the available computing resources. In practice this usually means GPU resources, which are well suited to the parallel floating point operations typical of ML processes. To keep the training time reasonable, fine-tuning of pre-existing models and transfer learning should be favoured. Fine-tuning and transfer learning take an existing model and adapt it to detect new classes of objects in a new domain. The results of the training cycles may also reveal deficiencies in the data, and corrective requirements may then be communicated to the training data gathering and creation processes, which are, however, outside the scope of this text. If the training is successful, the system should in the end be able to produce detections following the same scheme as the training data described earlier, that is, the object label and its approximate location in the provided image.
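
In a verification cycle, predicted boxes have to be compared against the annotated ones. The text does not state which metric is used; one widely used option is intersection over union (IoU), sketched below under that assumption. The function names, the greedy matching strategy and the 0.5 threshold are illustrative choices, not the project's actual evaluation procedure.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def precision(predictions, truths, threshold=0.5):
    """Share of predictions that match an unused annotation of the same class.

    predictions and truths are lists of (label, box) pairs. Each annotation
    can be matched by at most one prediction (simple greedy matching).
    """
    matched = set()
    true_positives = 0
    for label, box in predictions:
        for i, (t_label, t_box) in enumerate(truths):
            if i in matched or t_label != label:
                continue
            if iou(box, t_box) >= threshold:
                matched.add(i)
                true_positives += 1
                break
    return true_positives / len(predictions) if predictions else 0.0


# Hypothetical example: one correct barrier detection, one mislabelled net.
predictions = [("barrier", (0, 0, 10, 10)), ("garbage", (50, 50, 60, 60))]
truths = [("barrier", (1, 1, 10, 10)), ("net", (50, 50, 60, 60))]
print(precision(predictions, truths))  # -> 0.5
```

Tracking such a figure from one cycle to the next gives a simple indication of whether a change in architecture or hyperparameters actually improved detection.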

In the object detection application domain, various models based on convolutional networks have proved suitable and precise enough. Solutions such as Faster R-CNN (Ren 2017) and various versions of YOLO (Redmon 2018) have been shown to perform well even on video feeds while maintaining adequate detection precision. Figure 3 shows a ResNet-101 (He 2016) based architecture for object classification and localization.

Figure 3 Convolutional network based ResNet with convolutional layers C1-C5 consisting of 64, 128, 256, 512 and 512 internal layers respectively, completed with two fully connected (fc) layers of size 2048 neurons responsible for outputting the detected object classification and bounding box dimensions. Image source (Ding 2019).

As the current tendency is towards ever deeper and more complex machine learning model architectures, the performance costs of such solutions cannot be ignored. It would be desirable for the created model to maintain a high level of detection precision while also allowing detection on handheld devices from live video streams, since many handheld devices now have substantial graphics and AI targeted GPU processing power. Initially, however, the detection is intended to be provided as a server based service contributing visual analysis results to the knowledge based final risk analysis processes as described later.

Ding, R., Dai, L., Li, G., Liu, H. (2019). TDD-Net: A Tiny Defect Detection Network for Printed Circuit Boards. CAAI Transactions on Intelligence Technology, 4. doi:10.1049/trit.2019.0019.

Redmon, J., Farhadi, A. (2018). YOLOv3: An Incremental Improvement. Tech report, arXiv:1804.02767.

Ren, S., He, K., Girshick, R., et al. (2017). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), pp. 1137–1149.