Cleaner Pose Estimation using Open Pose

Adityaojas Sharma
9 min read · Sep 23, 2020
Pose Estimation

Here it is, finally! One of the most awaited articles of all time, and that's apparently the first one I am writing! Eh, just kidding. This is just another of my nerd-attempts to share some knowledge, but this time on a new platform. (I've always wanted to write on Medium.)

Anyway, enough with the introduction. This article is essentially about one of the hottest topics in Computer Vision these days, and that's Pose Estimation! And if you have ever worked with it, you might have realized that the raw output isn't always very clean. In this article, we will also be fixing that using some post-processing!

Human Pose Estimation (HPE) is an important problem that has enjoyed the attention of the Computer Vision community for the past few decades.

It primarily involves localization of human joints (also known as keypoints — elbows, wrists, etc) in images or videos, and hence, analyzing the presence of specific poses in a space of all articulated poses. It is a crucial step towards understanding people in images and videos.

HPE has been used as the backbone of pretty cool applications like Action Recognition, Workout Form Correction, etc., and plays a vital role in a range of fields like Augmented Reality, Animation, Gaming, Robotics, etc.

And these are some of the coolest fields to get your hands on, which gives HPE all the spotlight!

I won't really be going into the details of how HPE works or the different architectures/approaches, but here's a snippet of what it looks like. I am pretty sure you'd love it!

A simple Jump Rope Application I made using HPE. Let’s save this topic for the next article.

Anyway, if you’re interested in having a detailed understanding of HPE, here are a few sources I would recommend:

Let’s jump straight into the action and see how we can apply pose estimation on our own videos!

Workflow

I will be using 'Open Pose', an open-source pose estimation model by the CMU Perceptual Computing Lab, which is compatible with the OpenCV DNN module. So in essence, we will mainly be using OpenCV for the whole project, topped with some post-processing using the SciPy module.

I assume that you already have a basic knowledge of Python, or at least have Python 3.x (x >= 6 preferably) installed on your system.

I highly recommend setting up an environment for all projects involving basic image processing and computer vision. I prefer, and am pretty comfortable with, the conda interface for my projects.

Even if you use a different interface, I'd recommend that you get comfortable with Anaconda at some point down the line, as it is hands down one of the most convenient interfaces for practicing Machine Learning and Computer Vision. To get started with conda, here is the link I would suggest you go through.

Let’s get started!

The libraries we'll be using are mainly OpenCV, SciPy, NumPy, and Pandas, along with some helpers like progressbar, os, csv, etc.

Download the Repo and the Models

We are using models trained on Caffe Deep Learning Framework. Caffe models have 2 files:

  • .prototxt file which specifies the architecture of the neural network — how the different layers are arranged etc.
  • .caffemodel file which stores the weights of the trained model

To download the .caffemodel file, click on this link. Next, simply clone/download my GitHub Repo here. Once downloaded, jump into the models directory and paste the downloaded .caffemodel file there. That's it, we are done with the 'downloading the model' part.

Additionally, for the sake of simplicity and cleanliness, you can also make 2 new directories, samples and outputs, which, as the names suggest, will contain the input video and the output results respectively.

Let’s jump into the code.

What we have in the repo are 2 .py files.

  • The first one, open_pose.py, is essentially the whole backbone. This file will parse through each frame of your video, extract the coordinates of each joint visible in the frame, and save them in a .csv file. Some of you might ask a nice question here, and the answer is: yes, we can already get a stick figure composed on top of the video from this script alone, but it has a few shortcomings you'll understand once we study the code. Hence, we have,
  • clean_output.py, which will get us a cleaner output from the .csv produced by the previous script.

I'll use the code gists here to explain what happens in both scripts before getting to the results, so bear with me if you wanna peep at what the .py files are doing behind your back.

First up — open_pose.py

The first 6 lines here import the needed libraries. Lines 9–14 define the paths that we'll be using in the code. (You can alternatively use argparse to pass these paths directly from the terminal when running the code.)

Line 17 loads the model architecture and the weights into memory using the OpenCV DNN module.

(I wish there was a way to continue the code block from line 18, but there isn’t. Sucks! Let me know if there is any, btw. I am a noob on Medium.)
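For reference, that first block boils down to something like the sketch below. The file names and paths here are placeholders of mine, so adjust them to the actual repo layout:

```python
import cv2
import numpy as np
import pandas as pd
import progressbar
import os
import csv

# Paths (placeholders; point these at the repo's actual files)
proto_path = 'models/pose.prototxt'       # network architecture
model_path = 'models/pose.caffemodel'     # trained weights
input_path = 'samples/input.mp4'
output_path = 'outputs/output.mp4'
csv_path = 'outputs/keypoints.csv'

# Load the Caffe architecture and weights with the OpenCV DNN module
net = cv2.dnn.readNetFromCaffe(proto_path, model_path)
```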

In lines 1–8 above, we initialize the input video from the path defined previously and store the video's specifics: the fps, the number of frames, the height and width, and the required output height and width (h, w). This is followed by defining the input height and width required by the model. (We'll be resizing each frame to these dimensions later.)
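A rough sketch of that setup; the variable names and the 368-pixel input height are my own choices, not necessarily the repo's:

```python
cap = cv2.VideoCapture(input_path)

fps = cap.get(cv2.CAP_PROP_FPS)
n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
vid_w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
vid_h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

# Required output dimensions (h, w); 500 px height is an arbitrary choice
h = 500
w = int((h / vid_h) * vid_w)

# Input dimensions the network will receive; 368 is a common choice
inHeight = 368
inWidth = int((inHeight / vid_h) * vid_w)
```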

Lines 1–7 in this block initialize the output video writer.

Now here’s an important catch! This particular part might not work on your systems.

Yes, this can be a huge pain in the ass, and it sucks! The reason is that you might have different codecs installed on your system. That is, the output extension (.mp4 here, defined above in the first block) might not be compatible with the codec (*'MP4V') on your system. What you have to do is a bit of trial and error at your end.

Let me help you a bit. Here are a few FOURCC codes to experiment with: 'mp4v', 'MP4V', 'MJPG', 'CVID', 'MSVC', 'X264', 'XVID', paired with either the .mp4 or the .avi extension. Once you figure out a working pair, you'll save yourself a lot of time whenever you need to write a video with OpenCV later. If you figure it out early, you're lucky; otherwise you gotta take the pain, bois.
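As a sketch, the writer initialization looks something like this; swap the FOURCC string and the file extension until your system's codecs accept the pair:

```python
# 'MP4V' with an .mp4 container works on many systems; 'MJPG' with .avi
# is a common fallback if the output comes out empty or unplayable
fourcc = cv2.VideoWriter_fourcc(*'MP4V')
vid_writer = cv2.VideoWriter(output_path, fourcc, fps, (w, h))
```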

Lines 9–10 initialize the data that we'll be feeding to our .csv file. previous_x and previous_y make sure that if a few joints get cropped out of a frame, their values are filled in from the previous frame's data. (I know you'll go for another read of that last line, but leave it, it doesn't matter.)

Lines 13–17; pairs holds the pairs of joints that together make up a body part, e.g. the line drawn across the shoulder, between the neck joint and the shoulder joint.

Lines 20–21; thresh is the probability threshold above which a detected joint is considered valid. The rest are the colors we want to give to our 'stick figure'. I might just call it a stick figure in the rest of the article :P
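Roughly, those initializations look like this. The pair list follows the standard 15-keypoint MPI layout, and the threshold and colours are my guesses, so check the repo for the author's actual values:

```python
# Rows that will eventually be written to the .csv file, plus the previous
# frame's coordinates used to fill in cropped or missing joints
data = []
previous_x, previous_y = [], []

# Joint-index pairs connected to form the stick figure (MPI layout:
# 0 head, 1 neck, 2-4 right arm, 5-7 left arm, 8-10 right leg,
# 11-13 left leg, 14 chest)
pairs = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
         (1, 14), (14, 8), (8, 9), (9, 10), (14, 11), (11, 12), (12, 13)]

# Confidence threshold for accepting a joint, and BGR drawing colours
thresh = 0.1
circle_color = (0, 255, 255)
line_color = (0, 255, 0)
```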

(Here you go, a bigger gist, cuz I am sick of the non-continuous code blocks)

The first 6 lines above initialize your progressbar, to keep track of the progress of your, umm, progress. We start iterating through the frames of the input video on line 9. And the rest is history. JK, nothing special.

Lines 10–22; we read a frame, break the loop if there's no frame left (this is really important; skip this part and you might lose all your progress at 99% of the progressbar), and preprocess the frame into the format the model expects. On line 18, we feed that preprocessed image (inpBlob) to the model and get the output.

NOTE: Intelligent question? The answer is no, we do not change the colorspace from BGR to RGB, since Caffe models accept BGR input themselves!
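A minimal sketch of that read-and-forward step (note swapRB=False, which keeps the BGR order; the progressbar call assumes the progressbar2 package):

```python
bar = progressbar.ProgressBar(max_value=n_frames)

frame_idx = 0
while True:
    ret, frame = cap.read()
    if not ret:
        break  # no frames left; stop cleanly instead of failing at the end
    frame_idx += 1
    bar.update(frame_idx)

    # Scale pixels to [0, 1], resize to the network's input size, and keep
    # the BGR channel order (swapRB=False), since Caffe models expect BGR
    inpBlob = cv2.dnn.blobFromImage(frame, 1.0 / 255, (inWidth, inHeight),
                                    (0, 0, 0), swapRB=False, crop=False)
    net.setInput(inpBlob)
    output = net.forward()  # 4-D: (1, n_maps, out_h, out_w)
```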

The output is a 4-D matrix with:

  • The First Dimension is the image ID, in case we input multiple images. In our case it's a single frame, so it's not relevant.
  • The Second Dimension is the index of a keypoint map. The model produces Confidence Maps and Part Affinity Maps, which are all concatenated. The MPI model that we are using outputs 44 such maps, and we are interested only in the first 15, which correspond to the joints.
  • The Third Dimension is the Height of the output map.
  • The Fourth Dimension is the Width of the output map.

On line 30, we find the location (coordinates) of each keypoint by finding the maximum in that keypoint's output map. Following that, on lines 31 & 32, we rescale the coordinates to the output dimensions we need and append them to a list that will be fed to the output CSV file later.

Lines 43–48; we draw circles on the frame at each detected coordinate and add the connecting lines to create the stick figure! Lines 50–54 write the frame to the output video.
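Condensed, and with my own variable names, that part of the loop body looks roughly like this (it continues the while loop from the previous sketch):

```python
    # still inside the per-frame while loop
    out_h, out_w = output.shape[2], output.shape[3]
    x_coords, y_coords = [], []

    for i in range(15):  # the first 15 maps correspond to the MPI joints
        probMap = output[0, i, :, :]
        _, prob, _, point = cv2.minMaxLoc(probMap)

        # Rescale from the output map to the output frame size (w, h)
        x = int((w * point[0]) / out_w)
        y = int((h * point[1]) / out_h)

        if prob > thresh:
            x_coords.append(x)
            y_coords.append(y)
        elif previous_x:
            # Joint missing or cropped: reuse the previous frame's value
            x_coords.append(previous_x[i])
            y_coords.append(previous_y[i])
        else:
            x_coords.append(0)
            y_coords.append(0)

    data.append(x_coords + y_coords)
    previous_x, previous_y = x_coords, y_coords

    # Draw the stick figure on the resized frame and write it out
    frame = cv2.resize(frame, (w, h))
    for a, b in pairs:
        pt_a = (x_coords[a], y_coords[a])
        pt_b = (x_coords[b], y_coords[b])
        cv2.circle(frame, pt_a, 5, circle_color, -1)
        cv2.circle(frame, pt_b, 5, circle_color, -1)
        cv2.line(frame, pt_a, pt_b, line_color, 2)

    vid_writer.write(frame)
```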

This while loop takes a while to complete, depending on the computing power of your system. The progress bar gives you an idea of when the loop will end; you might go out, have a round of beer, and come back.

In the end, lines 68–71 feed the data collected through the whole iteration into a single .csv file. Now, this is an important step in order to get a cleaner output using the next code, that is, clean_output.py.
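After the loop, the dump is a couple of pandas calls, roughly:

```python
# One row per frame: the 15 x-coordinates followed by the 15 y-coordinates
df = pd.DataFrame(data)
df.to_csv(csv_path, index=False)

cap.release()
vid_writer.release()
```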

Get a cleaner output

Bored? Well, this part might be an important value addition to what you've already obtained. Let's move on to understand what clean_output.py does!

The first eight lines import the essential libraries that we need. Note that we will mainly be using the signal module's (line 2) Savgol filter to smooth the .csv file that we obtained in the last part. (See lines 17–19.)

The window size parameter (window_length) specifies how many data points will be used to fit a polynomial regression function. The second parameter specifies the degree of the fitted polynomial function (if we choose 1 as the polynomial degree, we end up using a linear regression function).

In every window, a new polynomial is fitted, which gives us the effect of smoothing the input data. To understand more about how the Savgol filter works, check this link out.
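As an illustration of the smoothing step (column handling and parameter values here are my assumptions, not necessarily those in clean_output.py):

```python
import pandas as pd
from scipy.signal import savgol_filter

df = pd.read_csv(csv_path)

# Smooth every coordinate column over time. window_length must be odd,
# smaller than the number of frames, and larger than polyorder; 17 and 2
# are just plausible defaults, not the repo's exact values
smoothed = df.apply(lambda col: savgol_filter(col, window_length=17,
                                              polyorder=2))
```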

Another problem with raw pose estimation models (especially Open Pose) is that the detected coordinates sometimes get mirrored from frame to frame, which gives a very, umm, non-aesthetic experience. To solve that, we define the next code block.

The block above essentially takes the shoulders of the person in the frame as a reference and checks whether the left shoulder is on the left side of the frame and the right one on the right. If not, all the points are reflected back to their original positions. Pretty simple, and effective!
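One plausible way to implement that check, assuming the MPI indexing (2 = right shoulder, 5 = left shoulder) and the x-then-y column layout from the earlier sketch; the exact comparison direction and the reflection itself depend on the conventions used in the repo:

```python
import numpy as np

coords = smoothed.values
xs = coords[:, :15].copy()   # x-coordinates, one row per frame
ys = coords[:, 15:].copy()   # y-coordinates

frame_width = w  # output frame width from the earlier sketch
for t in range(len(xs)):
    # If the shoulders land on the wrong sides of each other, the detection
    # is mirrored for that frame, so reflect the x-coordinates back
    if xs[t, 5] < xs[t, 2]:
        xs[t, :] = frame_width - xs[t, :]
```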

Next comes a code block that is more or less similar to the one in open_pose.py, where we initialize the video capture, save the specifics in some variables, initialize the writer, etc.

NOTE AGAIN! Please use the CODEC and the Extension pair specific to your systems, as I explained before!

Now finally, the moment you’ve been waiting for (hopefully not) the last code block is here!

Needless to explain, I guess; this one iterates through the input video (the same one as before) and composes the cleaned-up points on each frame, iteratively!
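Roughly, it is the drawing half of open_pose.py again, but fed from the smoothed coordinates instead of fresh network outputs (the output file name below is a placeholder):

```python
cap = cv2.VideoCapture(input_path)
clean_writer = cv2.VideoWriter('outputs/clean_output.mp4',
                               cv2.VideoWriter_fourcc(*'MP4V'), fps, (w, h))

for t in range(len(xs)):
    ret, frame = cap.read()
    if not ret:
        break
    frame = cv2.resize(frame, (w, h))

    # Same stick-figure drawing as before, with the cleaned-up points
    for a, b in pairs:
        pt_a = (int(xs[t, a]), int(ys[t, a]))
        pt_b = (int(xs[t, b]), int(ys[t, b]))
        cv2.circle(frame, pt_a, 5, circle_color, -1)
        cv2.circle(frame, pt_b, 5, circle_color, -1)
        cv2.line(frame, pt_a, pt_b, line_color, 2)

    clean_writer.write(frame)

cap.release()
clean_writer.release()
```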

And you open the outputs directory, and there you see the results!

This is the difference between what you got before and what you get now...

Do you see the difference? The left video is the unclean output by the raw pose estimation model, and the right one is the output after applying the Savgol function. I get it, this is not real-time. But nevertheless, it’s efficient for building simple applications!

I won't give much of an outro, since the post has already been pretty long, but if you liked this tutorial, I'll definitely make my next one about the simple jump rope application I demonstrated above!

Let me know what you think in the responses and comments! (First try, so I need some motivation, guys, TBH.)
