Using AI generated images as a video style guide for VFX

This volume of our AI journal explores an experimental workflow for motion VFX. The process begins with real-world footage captured on a high-resolution cinema camera and transforms it into vibrant animations by applying AI-generated effects with Midjourney and Runway.

From what I've observed, not many artists are digging into the technical side of their workflows, especially when it involves using multiple generative platforms in professional practice.

The approach outlined below is intended to help concept artists, editors, and storytellers see these AI platforms not as rivals replacing creative talent, but as collaborative partners that can contribute to and enhance their thought processes.

Eco city generated by Midjourney

Getting to know the AI-powered video-editing software

Runway is an AI-driven platform built for creatives, with a particular focus on video generation and editing. The user interface is far more approachable than spinning up your own instance of Stable Diffusion on a local machine, and the pace of its development updates is impressive.

I’ve been diving deeper into the capabilities of Runway’s Gen-1 video tools and eagerly awaiting the release of Gen-2. (Gen-2 came out while I was writing this post; stay tuned for a future Gen-2 update!)

Gen-1 video example with a text-prompt style guide; the prompt was “multidimensional shading, minimalistic forms, layered mesh, linear simplicity, bold, graphic lines”

For those transitioning from a "traditional" media production and VFX practice to a wider generative toolset, one Runway feature stands out: its capacity to generate video effects from a static reference image or a simple text prompt is both innovative and practical.

Gen-2 offers a text-to-video feature: users simply describe their desired outcome and the system generates a brief 4-second snippet based on the provided description and parameters, eliminating the need for any source footage.

 

The clips above were created from Gen-2 text prompts submitted by the Instagram community, with three iterations per prompt.

Nonetheless, the output from these features comes at a cost: $10 will afford you 71 seconds of rendered footage, and clips are restricted to a maximum of 4 seconds. The path to making these features broadly desirable for commercial scaling is still a long one. For this experiment, the total expenditure amounted to approximately $50.
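
For a rough sense of what that pricing means in practice, here is a quick back-of-the-envelope calculation based on the figures above. The numbers are approximate and pricing may well change, so treat this as an estimate only:

```python
# Rough cost math for Runway renders, using the figures quoted above.
# Approximations only; actual pricing depends on your plan and may change.
price_per_second = 10 / 71      # ~$0.14 per rendered second
max_clip_seconds = 4            # current cap per clip

cost_per_clip = price_per_second * max_clip_seconds   # ~$0.56 per 4-second clip
budget = 50                     # approximate spend on this experiment

clips_in_budget = budget / cost_per_clip              # ~89 full-length clips
print(f"${price_per_second:.2f}/s, ${cost_per_clip:.2f} per clip, "
      f"roughly {clips_in_budget:.0f} clips for ${budget}")
```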

Harnessing Runway’s editing and Midjourney’s capabilities

These capabilities can inspire new thinking about how to better craft images and videos during the pre-production and filming stages, and the lessons learned from this process are worth sharing.

To illustrate, consider a recent experimental workflow that leverages both Midjourney and Runway. Gen-1 offers image-to-video and text-to-video styling: you can shape your output footage either by describing it with a text prompt or by uploading an image or illustration to guide the aesthetic of your video.

Futuristic sci-fi ice city generated by Midjourney

Eco city generated by Midjourney

Midjourney was used to generate two distinctly different city skylines: one echoing the aesthetics of a green, eco-brutalist city, the other portraying an icy polar landscape. These were then applied as style guides to a short series of clips filmed at LACMA of Chris Burden’s “Metropolis II” sculpture. (image via LACMA)

Chris Burden's Metropolis II at LACMA

After reviewing all the captured footage, the process started with traditional techniques: cutting and color editing in Premiere. The clips were intentionally kept to a maximum duration of 4 seconds, since that is the current limit Runway can handle for this particular rendering process.

Traditional digital timeline editing in Adobe Premiere
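
As a side note for anyone preparing clips outside of Premiere, the 4-second cap can also be enforced with a small script. The sketch below is not part of the workflow described here; it simply assumes ffmpeg is installed and that the exported clips sit in a local exports/ folder:

```python
# Minimal sketch (an assumed alternative to trimming in Premiere):
# cap every clip in ./exports at Runway's 4-second limit using ffmpeg.
import subprocess
from pathlib import Path

MAX_SECONDS = 4  # the current clip limit mentioned above

for clip in sorted(Path("exports").glob("*.mp4")):
    out = clip.with_name(f"{clip.stem}_4s.mp4")
    # -t caps the output duration; re-encoding keeps the cut clean.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(clip), "-t", str(MAX_SECONDS),
         "-c:v", "libx264", "-pix_fmt", "yuv420p", str(out)],
        check=True,
    )
```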

Once a satisfactory sequence was assembled, all the color-graded and nested footage was batch-exported using Adobe Media Encoder. In addition, a collection of stills was extracted from each clip in the sequence, directly from the Premiere timeline.

Bridge view of footage captured at LACMA
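
The stills here were pulled straight from the Premiere timeline; if you prefer a scriptable route, a reference frame can also be grabbed from each exported clip with ffmpeg. The snippet below makes the same assumptions as the trim sketch above (ffmpeg installed, clips in exports/):

```python
# Minimal sketch: save one reference frame per clip for the Midjourney step that follows.
import subprocess
from pathlib import Path

for clip in sorted(Path("exports").glob("*.mp4")):
    still = clip.with_suffix(".png")
    # Seek one second in and write a single frame out as a PNG.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", "1", "-i", str(clip),
         "-frames:v", "1", str(still)],
        check=True,
    )
```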

After export, the stills were uploaded to Midjourney using the /describe feature (check out our other blog post all about /describe!). The aim was to establish a baseline reference for how the AI model would assess the framing and technical qualities of the source footage. If you want to learn more about getting started with Midjourney, check out our blog post here! Below are the results from these descriptions, along with a sample version generated from one of the suggested prompts:

  
 
 
 
Blue trains traversing a track in a built-up city - AI-generated photographs made in Midjourney
 

Once all the descriptions and re-imaginings of the source footage were generated, the next step was to look for patterns in the AI's assessments of the images. It was intriguing to observe how Midjourney reinterpreted the prompt into three distinct scenarios: a train track, a factory, and an artist’s studio.

Source footage of the mini-city sculpture was chosen in the belief that the distorted scale of motion and reference objects in the frame could serve as an effective litmus test for the current capabilities of these AI models.

 

 

A subfolder was created within the Runway library to store all the source footage, and clips were then selected individually to work on. Displayed below is what the style reference menu box looks like, along with the image selected to serve as a style guide for the footage.

Runway Gen-1 input prompt box

 

Before generating the effect, it's possible to preview the styles and experiment with the sliders and prompt inputs to iterate on style types. Below is an example of the preview thumbnails. They require a few seconds to generate, and users can click on an image to select it as a style guide for the uploaded footage. 

Thumbnails generated by Runway Gen-1

 

After about 3 minutes of generation time, here’s what the far-right thumbnail yields:
 

 

As is clearly evident, VFX studios won't be going out of business anytime soon.

The footage, while conveying a frosty urban setting, suffers from sloppy cartoonishness and mixed lighting effects that prevent it from being rooted in any recognizable or believable reality.

After much experimentation, a style guide was chosen to start producing the desired output footage:

Settings and preferences for Runway Gen-1

The creation process was slow and iterative, resulting in two versions of each clip from the original Premiere timeline. The final output can be seen below: the top frame uses the arctic cloud city image as a style guide, while the bottom frame has an animation styled after a brutalist, sci-fi eco-city applied to the source footage.

 

The composition, framing, and action remain consistent across all the clips, and the miniatures in Chris Burden's piece serve as a compositional anchor, providing a steady base when envisaging alternative universes.

Though the VFX may currently appear somewhat rough, this is undeniably a promising start; these clips could potentially serve as a concept storyboard for larger projects.

Remember, these tools - Midjourney and Runway - are only just emerging. To put it in perspective, consider Adobe, a company established over 40 years ago, in 1982. This is merely the dawn of what such AI technologies can achieve.

Thanks so much for reading. If you know of anyone who you think would like this kind of learning out loud in the AI space, please share this newsletter and check out Echo Echo Studio for more videos, designs, client work, and experiments.

 
