Moving Video Object Detection and Segmentation using YOLOv8 and SAM
Introduction
The detection and segmentation of moving objects is a challenging task in computer vision. It must cope with object occlusion, variations in lighting, camera angle, distortion, and similar difficulties. On top of that, it is computationally expensive, because the algorithm has to process every frame of the video in real time.
Despite these challenges, the detection and segmentation of moving objects is a valuable task with many potential applications. For example, it can be used in self-driving cars to detect and track other vehicles, pedestrians, and cyclists. It can also be used in video surveillance to detect and identify criminals.
I used state-of-the-art models that have been pre-trained on extensive datasets and are publicly available: YOLOv8 and the Segment Anything Model (SAM), both of which I elaborate on below.
Initially, I attempted to use YOLOv8 for both object detection and segmentation; however, the results fell short of expectations. Specifically, the quality of the mask edges was notably poor. To obtain better masks, I opted for a combination of YOLOv8 and the Segment Anything Model.
YOLOv8
YOLOv8, developed by Ultralytics, is the latest iteration of the YOLO (You Only Look Once) series of object detection models, which are renowned for predicting all objects within an image in a single forward pass. YOLOv8 incorporates significant enhancements over its predecessors, including an anchor-free detection head, the C2f block (which replaces the C3 module used in YOLOv5), and mosaic augmentation during training. The model excels in real-time object detection, tracking, and segmentation tasks, offering better accuracy, greater flexibility, and improved efficiency than earlier YOLO versions.
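As a quick illustration of the Ultralytics API, the sketch below runs a pretrained YOLOv8 model on a single image and prints the detections (the weight file yolov8n.pt and the image path are placeholders):

from ultralytics import YOLO

# Load a pretrained YOLOv8 nano model (weights download on first use)
model = YOLO("yolov8n.pt")

# Run inference on one image; the call returns a list of Results objects
results = model("input.jpg")

# Inspect the detections of the first (and only) image
for box in results[0].boxes:
    cls_id = int(box.cls)                   # class index (0 = person in COCO)
    score = float(box.conf)                 # confidence score
    x1, y1, x2, y2 = box.xyxy[0].tolist()   # corner coordinates
    print(f"class={cls_id} conf={score:.2f} box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")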
Segment-Anything (SAM)
Segment Anything (SAM) is a state-of-the-art image segmentation model released by Meta AI Research in April 2023. It is trained on SA-1B, a massive dataset of over 1 billion masks from 11 million images. SAM is designed to be promptable, so it can generalize to new image distributions and tasks without additional training. The released model accepts prompts such as a set of points, a bounding box, or a rough mask, and the paper also explores free-form text prompts.
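For reference, here is a minimal sketch of prompting SAM with a bounding box using Meta's segment-anything package (the checkpoint filename matches the released ViT-H weights; the image path and box coordinates are placeholders):

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load the ViT-H SAM checkpoint and wrap it in a predictor
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB image; OpenCV reads BGR, so convert first
image = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt with a single bounding box in (x1, y1, x2, y2) pixel coordinates
box = np.array([100, 100, 400, 400])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape)  # (1, H, W) boolean mask for the prompted box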
Combination of YOLOv8 and SAM for moving object detection and segmentation
I used a combination of YOLOv8 and SAM to improve the detection and segmentation of moving and static objects in videos: YOLOv8 is fast and accurate at detecting objects, while SAM specializes in producing high-quality segmentation masks.
I started by processing the video frame by frame. To simplify the problem, I limited detection to the person class only. For each frame, I predicted the bounding boxes of all persons with YOLOv8 and transformed those boxes into the input format SAM expects. Finally, I prompted the SAM predictor with the transformed boxes and the current frame to segment every person present in the frame. Here is the code for it:
def process_frame(frame, yolo_model, predictor):
    # Predict person detections with YOLOv8 (class 0 = person in COCO)
    results = yolo_model(frame, conf=0.25, classes=[0])
    # Get the bounding boxes in (x1, y1, x2, y2) format
    bbox = results[0].boxes.xyxy
    # If no person is detected, return None
    if len(bbox) == 0:
        return None
    # Set the current frame for segmentation (OpenCV frames are BGR)
    predictor.set_image(frame, image_format="BGR")
    input_boxes = bbox.to(predictor.device)
    # Transform the boxes to the resolution SAM works at
    transformed_boxes = predictor.transform.apply_boxes_torch(input_boxes, frame.shape[:2])
    # Predict one segmentation mask per person box
    masks, _, _ = predictor.predict_torch(
        point_coords=None,
        point_labels=None,
        boxes=transformed_boxes,
        multimask_output=False,
    )
    return masks
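To show how process_frame fits into a full video pipeline, below is a minimal driver loop, assuming OpenCV for video I/O and the model setup from the snippets above (the video path and weight filenames are placeholders):

import cv2
import torch
from ultralytics import YOLO
from segment_anything import sam_model_registry, SamPredictor

# Load both models once, outside the frame loop
device = "cuda" if torch.cuda.is_available() else "cpu"
yolo_model = YOLO("yolov8n.pt")
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
predictor = SamPredictor(sam)

# Read the video frame by frame and segment the persons in each frame
cap = cv2.VideoCapture("input.mp4")
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    masks = process_frame(frame, yolo_model, predictor)
    if masks is None:
        continue  # no person detected in this frame
    # masks is a (num_persons, 1, H, W) boolean tensor; overlay or save it here
cap.release()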
Here is the code for the complete pipeline on GitHub. Below is an example of a processed video demonstrating the detection and segmentation of objects (persons) using YOLOv8 and SAM.