Belegarth Video Analysis

The Idea

My idea for this project was to create an app that can take a video of Belegarth combat and, on device and in real time, annotate it with the locations and movements of all players and their gear, including weapons, shields, and armor, and then combine the visual information with the audio track to make heralding calls.

The proof-of-concept version of this project is a program on my computer that can run over a few hundred frames of video and produce high-accuracy annotations of the players and their equipment.

First Attempts with MediaPipe

Tracking human poses and motion is well established, and I had adequate success using the MediaPipe PoseLandmarker to track 33-point models of human movement during fights. There are blips in accuracy, especially behind shields and when the number of players changes, but overall it runs in approximately real time on my laptop and annotates almost every frame with reasonable accuracy.
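
As a rough illustration of that setup, here is a minimal sketch using the MediaPipe Tasks video running mode (the model file, number-of-poses cap, and video path are placeholders, not my actual configuration):

    import cv2
    import mediapipe as mp
    from mediapipe.tasks import python as mp_python
    from mediapipe.tasks.python import vision

    # Assumes a pose landmarker .task model bundle has been downloaded locally.
    options = vision.PoseLandmarkerOptions(
        base_options=mp_python.BaseOptions(model_asset_path="pose_landmarker_full.task"),
        running_mode=vision.RunningMode.VIDEO,
        num_poses=4,  # cap on simultaneous fighters; placeholder value
    )
    landmarker = vision.PoseLandmarker.create_from_options(options)

    cap = cv2.VideoCapture("duel.mp4")
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB,
                            data=cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        result = landmarker.detect_for_video(mp_image, int(1000 * frame_index / fps))
        # result.pose_landmarks holds one 33-landmark list per detected person.
        frame_index += 1
    cap.release()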


Tracking object motion is also well established, but “boffer weapons” are not a natural category with large existing datasets describing them. My first hope was to use GroundingDINO to match a natural-language description of the items I want to identify and produce bounding boxes. This was largely unsuccessful: the annotations were spotty at best and actively terrible at worst, and generating them took far too long (multiple hours to annotate a 40-second video).

My next attempt was using MediaPipe object detection to detect “baseball bats,” which I figured might be similar enough to identify boffers; this worked with about the same (unusably bad) accuracy as GroundingDINO, but at least ran at 5-10 FPS, allowing me to iterate several times.
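
For reference, a minimal sketch of what that detector setup can look like with the MediaPipe Tasks API, assuming an EfficientDet-Lite COCO model downloaded locally (the model path, threshold, and frame file are illustrative):

    import mediapipe as mp
    from mediapipe.tasks import python as mp_python
    from mediapipe.tasks.python import vision

    options = vision.ObjectDetectorOptions(
        base_options=mp_python.BaseOptions(model_asset_path="efficientdet_lite0.tflite"),
        running_mode=vision.RunningMode.IMAGE,
        category_allowlist=["baseball bat"],  # COCO proxy category for boffer weapons
        score_threshold=0.3,
    )
    detector = vision.ObjectDetector.create_from_options(options)

    image = mp.Image.create_from_file("frame_0001.jpg")
    for detection in detector.detect(image).detections:
        box = detection.bounding_box
        print(detection.categories[0].category_name,
              box.origin_x, box.origin_y, box.width, box.height)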

Hand Annotation

Finally, I broke down, learned enough to use CVAT online, and used it to annotate about 160 images I had taken myself, providing bounding boxes for all of the people, armor, weapons, and shields. In hindsight I regret having included the people, since the PoseLandmarker is already tracking them well enough, and having my weapon tracking model try to do the same is a distraction.

I then used mediapipe-model-maker to fine-tune a MobileNet object detection model on my annotated data. This was sufficiently successful that I got passably precise (though low-recall) annotations from the model on some videos, especially after a small amount of manual cleanup of inaccurate annotations. I then wrote some code to translate the generated annotations back into the PASCAL VOC training data format, so that I can bootstrap the amount of data I have by keeping only high-confidence model predictions and using them as additional training data.
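
For context, the fine-tuning step with mediapipe-model-maker looks roughly like the following; the directory layout, hyperparameters, and the exact dataset loader are assumptions that may differ by Model Maker version:

    from mediapipe_model_maker import object_detector

    # Assumes PASCAL VOC-style folders containing images and XML annotations.
    train_data = object_detector.Dataset.from_pascal_voc_folder("data/train")
    validation_data = object_detector.Dataset.from_pascal_voc_folder("data/val")

    options = object_detector.ObjectDetectorOptions(
        supported_model=object_detector.SupportedModels.MOBILENET_MULTI_AVG,
        hparams=object_detector.HParams(epochs=30, batch_size=8, export_dir="exported_model"),
    )
    model = object_detector.ObjectDetector.create(
        train_data=train_data,
        validation_data=validation_data,
        options=options,
    )
    loss, coco_metrics = model.evaluate(validation_data)
    model.export_model()  # writes a .tflite usable by the MediaPipe object detector task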

Problems with MediaPipe

This process is so far limited in several important ways.

  1. The first is that the bootstrap data has only really been successful on videos of single blue duels. Larger battles haven’t worked very well, and while the model sometimes recognizes polearms, it struggles with shields.

  2. The second is that this is very computationally expensive. This is exacerbated by the fact that my system is unable to use my local GPU for training: the only version of mediapipe available on pip is 10.21 (vs. the current dev version 10.30), which requires tensorflow >=2.15 (vs. the current pip version 2.20); mediapipe-model-maker is similarly behind, and the compatible versions of CUDA predate my laptop.

  3. Third, this substantially changes the distribution of my training data, since videos have many frames with many commonalities that the hand-picked photo annotations did not share.

  4. Fourth, the model seems to have learned that foam swords should be held in a passive guard at about a 30-degree angle, and it fails to recognize them when they spin around too much.

This fourth point led me to my next step, which was to augment my training data with transposed versions of the frames and bounding boxes. This turned out not to help at all.
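
The augmentation itself is simple; a minimal sketch of transposing a frame and its boxes, assuming HxWx3 arrays and (xmin, ymin, xmax, ymax) boxes:

    import numpy as np

    def transpose_example(frame: np.ndarray, boxes: list[tuple[int, int, int, int]]):
        # Transposing swaps the spatial axes, mirroring the image across its
        # main diagonal, so x and y simply swap roles in each bounding box.
        transposed_frame = np.transpose(frame, (1, 0, 2))
        transposed_boxes = [(ymin, xmin, ymax, xmax) for (xmin, ymin, xmax, ymax) in boxes]
        return transposed_frame, transposed_boxes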

Transition to YOLO

After consulting Gemini about my library choices, I realized that because training an object detector from annotations has been a solved problem for about ten years, Gemini had recommended a ten-year-old, now-deprecated library, leading to the GPU versioning issues described above. This makes some amount of sense in retrospect, but it is frustrating: looking back on my work, most of the effort was wasted recreating scaffolding that already existed in other libraries I could have worked with from the beginning. I would say I should take a lesson from this to search for proper libraries first, but I actually did do a serious search, which is how I found the GroundingDINO project that initially seemed to meet my needs, and I often spend too long wishing for proper tools to already exist rather than building them myself. So taking that lesson seems like overadjusting.

Regardless, for this project I switched to the ultralytics YOLO library. It has all of the things I expected built in: training, saving trained models, producing annotated videos, automatically versioning models and predictions, and handling object detection, segmentation, oriented bounding boxes, and poses through compatible APIs and formats that are easily exported from CVAT (though working from smaller datasets meant I handled the training/validation split manually).

Combining this with some hand-annotated data (163 images by M, 555 frames of a video by Aur, and 37 frames of high-action video by M) trains a sufficiently good oriented-bounding-box model for both weapons and shields to progress to other steps.

Training epochs take seconds instead of minutes.
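
A rough sketch of that workflow, assuming a pretrained YOLO11 OBB checkpoint and a dataset YAML pointing at the CVAT exports (the paths and hyperparameters here are placeholders):

    from ultralytics import YOLO

    # Fine-tune a pretrained oriented-bounding-box checkpoint on the
    # weapon/shield dataset described by a dataset YAML file.
    model = YOLO("yolo11n-obb.pt")
    model.train(data="belegarth_obb.yaml", epochs=100, imgsz=640)

    # Run the trained model over a video; with save=True ultralytics writes
    # an annotated copy of the video into its runs/ directory.
    for result in model.predict("duel.mp4", save=True, conf=0.25, stream=True):
        # result.obb.xywhr holds the oriented boxes (center, size, rotation).
        pass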

While investigating and improving tracking accuracy will be necessary for use in practice, the current state is sufficient for me to start building out the rest of the pipeline in code and search for other high-level blockers.

Next Steps

Rather than fully trusting an end-to-end model, I wanted to decompose this problem into the following steps:

  1. Use an audio filter to select short (<1 second) clips containing action

  2. Create a geometric model of persons and equipment within the clip

  3. Search for geometric intersections between tracked objects, interpolating between frames

  4. Combine audio data and intersection data to train an outcome classifier

So there are three directions, separate from the video annotation work, in which to proceed:

  1. Creating audio annotations for the classifiers in steps 1 and 4

  2. Translating oriented bounding box annotations (plus possibly some occlusion information? and assumptions about object rigidity) into a 3D geometric model

  3. Building a finder for interpolation intersections

Step 3 I worked out on paper: using quadratic interpolation means intersections would simply be roots of a quadratic equation in two variables (time and position along the line segment), findable via the quadratic formula. Because of the second variable, the intersections form a curve, with the intersection value being the length of that curve. This was fun, and since the quadratic formula makes it purely algebraic it would also be very fast to run, but it is not an exciting engineering challenge.
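
As a small illustration of the interpolation half of this (not the intersection search itself), fitting a quadratic through three consecutive tracked positions gives a closed-form position at any fractional frame time; the function below is a sketch, not code from the project:

    import numpy as np

    def quadratic_position(p0: np.ndarray, p1: np.ndarray, p2: np.ndarray, t: float) -> np.ndarray:
        # p0, p1, p2 are positions sampled at frame times t = 0, 1, 2; fit
        # x(t) = a*t^2 + b*t + c exactly through them and evaluate at t.
        c = p0
        a = (p2 - 2 * p1 + p0) / 2  # from p1 = a + b + c and p2 = 4a + 2b + c
        b = p1 - p0 - a
        return a * t**2 + b * t + c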

Since pose landmark models can automatically create 3D predictions, and the nature of camera images plus object rigidity heavily constrains the outputs, I expect that adding a 3D prediction element to oriented bounding box predictions will be simple (though the annotation process for 3D boxes in CVAT looked painful).

Therefore I’m now looking into audio classification, and how to write annotations for audio tracks associated with videos.
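
One possible starting point for the audio side, sketched under the assumption of using librosa for onset detection (the file name and clip radius are placeholders), is to propose candidate sub-second clips around audio onsets such as weapon impacts:

    import librosa

    # Load the audio track previously extracted from the video (e.g. with ffmpeg).
    audio, sample_rate = librosa.load("duel.wav", sr=None)

    # Detect onsets (sharp energy changes such as weapon impacts), in seconds.
    onset_times = librosa.onset.onset_detect(y=audio, sr=sample_rate, units="time")

    # Propose a short clip around each onset as a candidate "action" window.
    half_window = 0.4  # seconds; placeholder clip radius
    candidate_clips = [(max(0.0, t - half_window), t + half_window) for t in onset_times]
    print(candidate_clips[:5])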
