Do you dance? Do you have a favourite dancer or performer whose moves you would love to copy? Well, now you can!
Imagine having a full-body picture of yourself. Just a still image. Then all you need is a solo video of your favourite dancer performing some moves. Not that hard now that TikTok is taking over the world…
Image animation uses a video sequence to drive the motion of an object in a picture. In this story, we see how image animation technology is now ridiculously easy to use, and how you can animate almost anything you can think of. To this end, I transformed the source code of a relevant publication into a simple script, creating a thin wrapper that anyone can use to produce DeepFakes. With a source image and the right driving video, everything is possible.
Learning Rate is my weekly newsletter for those who are curious about the world of AI and MLOps. You’ll hear from me every Friday with updates and thoughts on the latest AI news, research, repos and books. Subscribe here!
How it Works
In this article, we talk about a 2019 publication, part of Advances in Neural Information Processing Systems 32 (NeurIPS 2019), called “First Order Motion Model for Image Animation” [1]. In this paper, the authors, Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci and Nicu Sebe, present a novel way to animate a source image given a driving video, without any additional information or annotation about the object to animate.
Under the hood, they use a neural network trained to reconstruct a video, given a source frame (still image) and a latent representation of the motion in the video, which is learned during training. At test time, the model takes as input a new source image and a driving video (e.g. a sequence of frames) and predicts how the object in the source image moves according to the motion depicted in these frames.
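To make this more concrete, here is a minimal sketch of the test-time loop, assuming hypothetical motion_estimator and generator callables (the names are mine, not the repository's actual API): the source image is animated frame by frame, driven by the motion extracted from each driving frame.

def animate(source_image, driving_frames, motion_estimator, generator):
    # Extract the motion representation (e.g. key points) of the source image once.
    source_motion = motion_estimator(source_image)
    generated = []
    for frame in driving_frames:
        # Motion representation of the current driving frame.
        driving_motion = motion_estimator(frame)
        # The generator moves the object in the source image according to
        # the motion of the driving frame, relative to the source.
        generated.append(generator(source_image, source_motion, driving_motion))
    return generated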
The model tracks everything that matters in an animation: head movements, talking, eye movements and even full-body motion. For example, let us look at the GIF below: President Trump drives the cast of Game of Thrones to talk and move like him.
Methodology and Approach
Before creating our own sequences, let us explore this approach a bit further. First, the training data set is a large collection of videos. During training, the authors extract frame pairs from the same video and feed them to the model. The model tries to reconstruct the video by learning which key points appear in the pairs and how to represent the motion between them.
To this end, the framework consists of two models: the motion estimator and the video generator. Initially, the motion estimator tries to learn a latent representation of the motion in the video. This is encoded as motion-specific key point displacements (where key points can be the position of eyes or mouth) and local affine transformations. This combination can model a larger family of transformations instead of only using the key point displacements. The output of the model is two-fold: a dense motion field and an occlusion mask. This mask defines which parts of the driving video can be reconstructed by warping the source image, and which parts should be inferred by the context because they are not present in the source image (e.g. the back of the head). For instance, consider the fashion GIF below. The back of each model is not present in the source picture, thus, it should be inferred by the model.
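As a rough sketch of what the motion estimator produces, assuming the 10 key points used in the configuration further below and a frame of size H x W (the class and field names are illustrative, not the repository's actual code):

from dataclasses import dataclass
import numpy as np

@dataclass
class MotionEstimate:
    keypoints: np.ndarray       # (10, 2)    key point positions (e.g. eyes, mouth)
    jacobians: np.ndarray       # (10, 2, 2) local affine transformation around each key point
    dense_motion: np.ndarray    # (H, W, 2)  per-pixel displacement field
    occlusion_mask: np.ndarray  # (H, W)     close to 1 where the driving frame can be
                                #            reconstructed by warping the source image,
                                #            close to 0 where it must be inferred from context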
Next, the video generator takes as input the output of the motion estimator and the source image, and animates the latter according to the driving video; it warps the source image in ways that resemble the driving video and inpaints the parts that are occluded. Figure 1 depicts the framework architecture.
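The core idea of the generator can be sketched as follows, assuming PyTorch-style tensors and a placeholder inpainting_net standing in for the learned network that fills occluded regions; this is a simplified illustration, not the actual architecture from the paper:

import torch.nn.functional as F

def generate_frame(source, dense_motion, occlusion_mask, inpainting_net):
    # source:         (1, 3, H, W) source image tensor
    # dense_motion:   (1, H, W, 2) sampling grid in normalised [-1, 1] coordinates
    # occlusion_mask: (1, 1, H, W) values in [0, 1]
    warped = F.grid_sample(source, dense_motion, align_corners=True)  # warp the source image
    visible = warped * occlusion_mask                                 # keep the visible regions
    return inpainting_net(visible)                                    # fill in the occluded parts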
Code Example
The source code of this paper is on GitHub. I created a simple script, a thin wrapper around that code, that anyone can use for quick experimentation.
To use it, first, you need to install the module. Run pip install deep-animator to install the library in your environment. Then, we need four items:
- The model weights; of course, we do not want to train the model from scratch. Thus, we need the weights to load a pre-trained model.
- A YAML configuration file for our model.
- A source image; this could be for example a portrait.
- A driving video; to start with, it is best to use a video with a clearly visible face.
To get some results quickly and test the performance of the algorithm you can use this source image and this driving video. The model weights can be found here. A simple YAML configuration file is given below. Open a text editor, copy and paste the following lines and save it as conf.yml.
model_params:
  common_params:
    num_kp: 10
    num_channels: 3
    estimate_jacobian: True
  kp_detector_params:
    temperature: 0.1
    block_expansion: 32
    max_features: 1024
    scale_factor: 0.25
    num_blocks: 5
  generator_params:
    block_expansion: 64
    max_features: 512
    num_down_blocks: 2
    num_bottleneck_blocks: 6
    estimate_occlusion_map: True
    dense_motion_params:
      block_expansion: 64
      max_features: 1024
      num_blocks: 5
      scale_factor: 0.25
  discriminator_params:
    scales: [1]
    block_expansion: 32
    max_features: 512
    num_blocks: 4
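If you want a quick sanity check that the file parses correctly, you can load it with PyYAML (this check is my own addition; the deep-animator command reads the file for you):

import yaml  # pip install pyyaml

with open("conf.yml") as f:
    conf = yaml.safe_load(f)

# Inspect a couple of the values defined above.
print(conf["model_params"]["common_params"]["num_kp"])  # 10
print(list(conf["model_params"]["generator_params"]))   # parameter names of the generator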
Now, we are ready to have a statue mimic Leonardo DiCaprio! To get your results just run the following command.
deep_animate <path_to_the_source_image> <path_to_the_driving_video> <path_to_yaml_conf> <path_to_model_weights>
For example, if you have downloaded everything in the same folder, cd to that folder and run:
deep_animate 00.png 00.mp4 conf.yml deep_animator_model.pth.tar
On my CPU, it takes around five minutes to get the generated video. This will be saved into the same folder unless specified otherwise by the --dest option. Also, you can use GPU acceleration with the --device cuda option. Finally, we are ready to see the result. Pretty awesome!
Conclusion
In this story, we presented the work done by A. Siarohin et al. and how to use it to obtain great results with no effort. Finally, we used deep-animator, a thin wrapper, to animate a statue.
Although there are some concerns about such technologies, they can have various applications, and they also show how easy it is nowadays to generate fake stories, raising awareness about the issue.
References
[1] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe, “First-order motion model for image animation,” in Conference on Neural Information Processing Systems (NeurIPS), December 2019.