
LAVE, An Agent-assisted Video Editing Tool That Utilizes LLM



3 main points
✔️ Proposes LAVE, a new agent-assisted video editing tool that leverages large-scale language models
✔️ Offers a choice between large-scale language model-assisted and manual editing, allowing users to adjust according to their own editing style
✔️ User testing results provide useful suggestions for the design of future systems that incorporate large-scale language models into video editing

LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing
written by Bryan Wang, Yuliang Li, Zhaoyang Lv, Haijun Xia, Yan Xu, Raj Sodhi
(Submitted on 15 Feb 2024)
Comments: Paper accepted to the ACM Conference on Intelligent User Interfaces (ACM IUI) 2024
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)


code: 

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Video is a very powerful medium for communication and storytelling. With the advent of social media and video sharing platforms, its popularity has skyrocketed, and many people now create and share their own content. However, video editing can be difficult for beginners and poses a major barrier. It is a particularly hard task for those unfamiliar with conceptualizing a video in the planning stages. The editing process also demands a great deal of detailed work, such as selecting clips, trimming, and creating sequences, on top of which a variety of skills are needed to craft a coherent story as envisioned. Moreover, performing these tasks requires not only learning how to use versatile and complex editing software, but also manual work and storytelling skills. In some cases, you may even have to wonder which editing software to use in the first place.

Recently, attempts have been made to address the challenges of video editing by leveraging natural language processing. By using natural language as an interface for video editing, users can communicate their intentions directly and reduce manual work. For example, recent products incorporating AI technology allow users to edit videos using models that generate video from text, and voice-based video navigation lets users manipulate video with voice commands. Natural language can also be used to represent video content and streamline the manual editing process: text-based editing allows users to edit videos efficiently by adjusting time-aligned transcripts.

However, despite these product improvements, many video editing tools still rely on manual editing and lack assistance tailored to users' specific conditions. Many users still struggle with the complexity of video editing.

This paper presents LAVE, a new video editing tool that provides language augmentation through a large-scale language model. LAVE understands the user's free-form linguistic commands and plans and executes relevant actions to efficiently accomplish the desired edits. To do so, it introduces a large-scale language model-based agent for planning and execution.

Actions performed by LAVE include brainstorming ideas, summarizing a video corpus, semantics-based video search, storyboarding, clip trimming, and more. LAVE uses a visual language model (VLM) to automatically generate visual summaries of the videos, which allows the large-scale language model to understand the video content and apply its natural language processing capabilities to assist the user in the editing process. LAVE provides two methods of operation, agent-assisted and direct manipulation, giving users the flexibility to adjust the agent's involvement to their situation.

A user test was also conducted with eight participants (including both novice and proficient video editors) to evaluate LAVE's usefulness as a video editing tool. The results show that participants were able to create satisfactory videos using LAVE, and they rated the system's features as easy to use and useful for video creation. The paper also proposes design implications for the development of future multimedia content editing tools that integrate large-scale language models and agents.

Design Policy for the Video Editing Tool "LAVE"

This paper explores the potential of collaborative video editing between humans and large-scale language model agents through the design, implementation, and evaluation of the LAVE video editing tool. To explore this potential, two main design principles are established.

The first is to leverage natural language processing to lower the barriers to editing (D1). LAVE aims to improve on conventional manual video editing by harnessing the power of natural language and large-scale language models, and is designed to help users move smoothly from ideation to actual editing using natural language. By doing so, it aims to significantly lower the barriers to editing and make it easier for anyone to produce high-quality videos.

The second is to maintain user autonomy in the editing process (D2). There is concern that AI-assisted content editing may erode user autonomy. To address this concern, LAVE offers both AI-assisted and manual editing options. Users can adjust the AI assistance as needed or opt out altogether. This ensures that the final work reflects the user's vision and that decision-making authority remains with the user.

LAVE is designed to respect the creative initiative of users while utilizing natural language and AI technology to make the video editing process more intuitive and user-friendly. This design policy is expected to allow users to freely express their ideas without worrying about technical barriers.

Interface of the Video Editing Tool "LAVE"

LAVE is a new video editing tool that provides agent assistance and language augmentation through a large-scale language model, offering video editing functions that are intuitive and efficient for users. The figure below shows the user interface of the LAVE video editing tool.

LAVE's UI consists of five components. Here we specifically describe three of the main ones.

  • (A) Video editing agent
  • (B) Language-augmented video gallery
  • (C) Automatic title generation
  • (D) Video summary display
  • (E) Video editing timeline

The (A) video editing agent assists the user in the editing process through conversation. The user interacts with the agent using free-form language, and the agent provides customized responses. The agent supports the following features:

  • Footage overview: summarize and categorize video clips
  • Idea brainstorming: propose ideas for video editing
  • Video search: find relevant videos based on language queries
  • Storyboarding: order clips based on a provided storyline

The agent operates in two states, planning and execution, as shown in the figure below. In the planning state (left), the user enters an editing command. The agent then confirms the user's goal and clarifies the specific objective. In addition, the agent proposes concrete steps to achieve the goal. If the user is not satisfied with these steps, they can modify the plan.

Once the user approves the plan, the agent moves to the execution state. In this state, the user approves each of the agent's actions in turn, and the result of each action is presented to the user. If the plan calls for a further action, the agent notifies the user of the next action and waits for approval. The LAVE video editing agent thus supports the user's editing process smoothly through step-by-step planning and execution, making video editing more efficient and effective.
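Conceptually, this interaction can be reduced to a simple two-state loop. The Python sketch below is our own minimal illustration of the plan-then-execute protocol described above; every helper function is a hypothetical stand-in, not LAVE's actual implementation.

```python
# Minimal sketch of the two-state agent loop (plan -> approve -> execute).
# All helpers below are hypothetical stand-ins, not LAVE's actual code.

def plan_actions(command: str) -> list[str]:
    """Stand-in for the LLM planning call: map a command to editing actions."""
    return ["video search", "storyboarding"]  # a plausible plan for a story request

def execute_action(action: str) -> str:
    """Stand-in for the back-end function dispatched for each action."""
    return f"result of {action}"

def approved(prompt: str) -> bool:
    """Gate each step on explicit user approval."""
    return input(f"{prompt} Approve? [y/n] ").strip().lower() == "y"

def run_agent(command: str) -> None:
    # Planning state: draft a plan and let the user review it.
    plan = plan_actions(command)
    if not approved(f"Plan: {plan}."):
        return  # in LAVE the user could instead revise the plan here
    # Execution state: run actions one at a time, each gated by approval.
    for action in plan:
        if not approved(f"Next action: {action}."):
            break  # the user stays in control and can stop at any step
        print(execute_action(action))  # result surfaces in the front-end UI

run_agent("make a travel vlog from my clips")
```

The key design point this loop captures is that the agent never acts without user approval, which is how LAVE balances automation against the autonomy principle (D2).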

Next, the (B) language-augmented video gallery provides a natural language description of each clip, making its content easy to understand. The title and summary indicate the content of the clip, and users can easily add a clip to the editing timeline with the "Add to Timeline" button. Additionally, natural language queries can be used to search for videos in the gallery and display them in order of relevance.

Finally, the (E) video editing timeline displays the selected video clips. Each clip is represented by three thumbnails (start, middle, and end), allowing its content to be grasped at a glance. The timeline supports two main functions.

The first lets users set the order of clips by drag and drop; alternatively, clips can be ordered automatically using the large language model-based storyboarding feature. The second lets users trim clips, either by manually selecting start and end points or by using the large language model-based trimming function to extract specific segments.

LAVE can assist in a wide range of editing processes, from idea generation to planning to editing operations. However, it does not force users to follow a strict process. Users are free to select and use the functions they need according to their own editing goals.

For example, users with a clear editorial policy and story line can skip the idea generation stage and start editing immediately. This flexibility is a key feature of LAVE.

Currently, LAVE is optimized primarily for casual editing for social media platforms. Integration of large-scale language model agents in professional editing, where accuracy is required, remains a future challenge, but it is expected that these needs will be addressed in the future.

LAVE's flexible methodology allows users to tailor video editing to their own style and needs, enabling a wide range of users, from beginners to experienced editors, to work efficiently.

Back-End System - Agent Design

The LAVE agent leverages the diverse language capabilities of large-scale language models, including reasoning, planning, and storytelling. The agent has two states, "planning" and "execution," a design with two advantages. The first is high-level goal-setting: users can set high-level goals that encompass multiple actions, eliminating the need to specify detailed commands one by one. The second is plan review and revision: the agent presents its plan prior to execution and gives the user the opportunity to modify it, thus preserving sufficient user control.

To assist this planning and execution agent, a back-end pipeline is designed. As shown in the figure below, this pipeline creates an action plan based on user input and translates it from text to function calls to execute the corresponding functions.

LAVE's video editing agent plans actions using a prompting technique for large-scale language models. The prompt decomposes complex tasks into subtasks and lays out concrete steps to achieve the user's goal. To decompose a complex task (the user's goal) into subtasks (editing functions), it utilizes chain-of-thought prompting, which takes advantage of the reasoning capabilities of large-scale language models. The first half of the prompt is structured as follows.

  • Role assignment: instruct the agent to serve as a video editing assistant
  • Action descriptions: detail the list of actions the agent can perform, so it can select the appropriate response to a user's command
  • Formatting instructions: direct the agent to output the action plan in a consistent format, explicitly listing the user's editing goal and the steps to achieve it

The conversation history and the most recent user input are then appended, forming the complete prompt used to generate an action plan. The system maintains a message history of up to 6,000 tokens, tuned to fit within the context window of the large language model.
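As an illustration, a planning prompt of this shape could be assembled as in the sketch below. The exact wording of LAVE's prompts is not published, so the system prompt text and the simplified word-based token budgeting are our assumptions; only the overall structure (role assignment, action descriptions, format instructions, then history and the latest input) follows the paper.

```python
# Sketch of assembling a LAVE-style planning prompt (wording is assumed).
SYSTEM_PROMPT = """You are a video editing assistant.

You can perform these actions:
- footage overview: summarize and categorize the user's clips
- idea brainstorming: propose editing ideas
- video search: find clips matching a language query
- storyboarding: order clips to fit a storyline

For each user request, output an action plan in this format:
Goal: <the user's editing goal>
Steps: <numbered list of actions to achieve it>"""

def build_prompt(history: list[dict], user_input: str,
                 max_history_tokens: int = 6000) -> list[dict]:
    """Assemble the chat messages sent to the LLM for planning."""
    # Keep only the most recent turns within the token budget
    # (token counting is simplified here to a word count).
    kept, used = [], 0
    for msg in reversed(history):
        used += len(msg["content"].split())
        if used > max_history_tokens:
            break
        kept.append(msg)
    kept.reverse()
    return ([{"role": "system", "content": SYSTEM_PROMPT}]
            + kept
            + [{"role": "user", "content": user_input}])
```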

Once the action plan is formulated, each action is executed sequentially with user approval, so the user can observe the results of each action while deciding on the next step. LAVE parses the description of each action from the action plan and translates it into the corresponding back-end function call, using a GPT-4 checkpoint fine-tuned specifically for function calling. The result of the function execution is reflected in the front-end UI and presented to the user.
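A hedged sketch of this translation step follows, using the function-calling interface of the pre-1.0 OpenAI Python SDK with the gpt-4-0613 checkpoint the paper names. The function schema (video_search and its query parameter) is purely our illustrative assumption.

```python
# Sketch: translating one planned action into a back-end function call
# via OpenAI function calling (pre-1.0 SDK style). Schema is assumed.
import json
import openai

FUNCTIONS = [{
    "name": "video_search",
    "description": "Find clips in the gallery that match a language query.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

def dispatch(action_description: str):
    """Ask the model to pick a function and arguments for one action."""
    response = openai.ChatCompletion.create(
        model="gpt-4-0613",
        messages=[{"role": "user", "content": action_description}],
        functions=FUNCTIONS,
        function_call="auto",
    )
    message = response["choices"][0]["message"]
    if message.get("function_call"):
        name = message["function_call"]["name"]
        args = json.loads(message["function_call"]["arguments"])
        return name, args  # e.g. ("video_search", {"query": "sunset shots"})
    return None, None
```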

Back-End System - Implementation of an Editing Function that Leverages a Large-Scale Language Model

LAVE provides the following five features, which leverage the large-scale language model to assist users in editing their videos:

  1. Footage overview
  2. Idea brainstorming
  3. Video search
  4. Storyboarding
  5. Clip trimming

The first four functions are available through the agent, and the last one is available by double-clicking on a clip on the editing timeline. All functions are built on automatically generated linguistic descriptions of the unedited footage, including a title and summary for each clip.

To generate these descriptions, video frames are sampled once per second and captioned using the LLaVA model. Based on the captions, GPT-4 generates a title and summary and assigns a unique ID to each video. This ID is used by subsequent functions such as storyboarding.
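A minimal sketch of this preprocessing pipeline is shown below. The per-second frame sampling with OpenCV is concrete, but caption_with_llava and summarize_with_gpt4 are hypothetical stand-ins for the LLaVA and GPT-4 calls, since the paper does not publish that code.

```python
import cv2  # OpenCV, used here for per-second frame sampling

def caption_with_llava(frame) -> str:
    """Hypothetical stand-in for a LLaVA captioning call on one frame."""
    return "a person walking on a beach at sunset"

def summarize_with_gpt4(captions: list[str]) -> tuple[str, str]:
    """Hypothetical stand-in for the GPT-4 call producing a title/summary."""
    return "Beach sunset walk", " ".join(captions[:3])

def sample_frames(path: str, every_sec: float = 1.0) -> list:
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * every_sec))  # frames between samples
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            frames.append(frame)  # roughly one frame per second of video
        i += 1
    cap.release()
    return frames

def describe_video(path: str, video_id: int) -> dict:
    captions = [caption_with_llava(f) for f in sample_frames(path)]
    title, summary = summarize_with_gpt4(captions)
    # The unique ID is what later functions (e.g. storyboarding) refer to.
    return {"id": video_id, "title": title, "summary": summary}
```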

LAVE's video search function embeds these descriptions using OpenAI's text-embedding-ada-002 and stores them in a vector database. At search time, the user's query is embedded with the same model, and clips are ranked by the cosine distance between their embeddings and the query's. This ensures that the most relevant videos are displayed in the UI.
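The following sketch illustrates this retrieval step with the pre-1.0 OpenAI Python SDK. LAVE stores embeddings in a vector database (Chroma via LangChain, as noted later); for brevity this example ranks an in-memory list by cosine similarity instead.

```python
import numpy as np
import openai  # pre-1.0 OpenAI Python SDK style

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

def search(videos: list[dict], query: str, top_k: int = 5) -> list[dict]:
    q = embed(query)
    def score(v: dict) -> float:
        # Cosine similarity between the query and a clip's description.
        e = embed(v["title"] + " " + v["summary"])
        return float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
    return sorted(videos, key=score, reverse=True)[:top_k]
```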

The first of the five features leveraging the large-scale language model, footage overview, categorizes videos based on common themes within the user's video collection and provides an overview. The prompt includes the generated visual narrations and is sent to the large-scale language model, and the resulting overview is presented in the chat UI.

The second, "Idea Brainstorming," generates creative editing ideas based on the user's video. Prompts include function instructions and additional creative guidance as needed. The generated ideas are displayed in the chat UI.

The fourth feature, storyboarding, sequences video clips based on a user-supplied narrative (the third, video search, was described above). Guided by the user's storyline, the large-scale language model creates a storyboard and updates the order of the videos in the timeline. The output is provided in JSON format for easy downstream processing.
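The paper states that the output is JSON but does not publish the schema, so the field names in the sketch below are purely our assumption of a plausible shape.

```python
# Hypothetical shape of a storyboarding result; all field names are assumed.
storyboard = {
    "story": "a day trip from morning departure to sunset",
    "ordered_clips": [3, 1, 7, 4],  # unique video IDs from preprocessing
    "rationale": "opens with packing and ends on the beach at dusk",
}
```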

The fifth feature, clip trimming, uses the reasoning capabilities of the large-scale language model to identify segments of a clip that match the user's trimming command. The trimming results are presented to the user in JSON format, and trimming precision depends on the frame sampling rate.

LAVE is designed as a full-stack web application. The front-end UI is developed using React.js, providing an intuitive and easy-to-use interface, while the back-end server is built with Flask and works smoothly with the front end.

For large language model inference, LAVE primarily uses OpenAI's state-of-the-art GPT-4 model. When mapping action plans to functions, it uses the gpt-4-0613 checkpoint, which has been fine-tuned specifically for function calling. GPT-4's maximum context window is 8,192 tokens, within which the agent can process approximately 40 video descriptions.

The vector store for video search is built with Chroma through LangChain's wrapper, providing efficient and fast retrieval. In addition, video preprocessing is performed on a Linux machine equipped with an Nvidia V100 GPU for fast data processing and caption generation. The final video editing results are synthesized using ffmpeg, a very powerful tool for video editing and encoding.
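As one concrete possibility, the final synthesis step could invoke ffmpeg's concat demuxer from Python, as sketched below. The paper does not specify LAVE's actual ffmpeg invocation, so this is illustrative only (note that "-c copy" joins without re-encoding and assumes the clips share codecs and parameters).

```python
import subprocess
import tempfile

def concat_clips(clip_paths: list[str], output: str = "final.mp4") -> None:
    # Write the playlist file expected by ffmpeg's concat demuxer.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for p in clip_paths:
            f.write(f"file '{p}'\n")
        playlist = f.name
    # '-safe 0' permits absolute paths; '-c copy' avoids re-encoding.
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", playlist, "-c", "copy", output],
        check=True,
    )
```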

LAVE combines these technological elements to provide a high-performance, user-friendly video editing experience. The entire system works seamlessly together to efficiently support users' editing tasks.

User Testing - Overview

User testing was conducted to obtain user feedback on LAVE. The tests aim to assess the extent to which LAVE's language augmentation contributes to the video editing process and to understand users' reactions to agents that leverage large-scale language models. In particular, we investigate how the agent affects users' sense of agency and ownership.

Eight participants with varying levels of video editing experience took part in the user test. Three of them were female, and the average age was 27.6 years (standard deviation = 3.16). The participants were drawn from technology companies and ranged from beginners to proficient editors.

  • Beginners (P4, P5, P7, P8): little to moderate video editing experience; P8 has the least experience and last edited several years ago
  • Proficient (P1-P3, P6): proficient with video editing tools; P1 is a designer who edits occasionally for work, P2 minored in film studies and has been editing since high school, P3 runs a YouTube channel, and P6 is a PhD student who edits lifelog videos once a week

This diverse group of participants allows LAVE to be evaluated across a variety of editing situations. The day before the user test, participants are asked to submit a set of videos for preprocessing, providing at least 20 clips, each less than one minute long. The user test takes 1 to 1.5 hours and is conducted in a quiet environment.

When participants arrive at the test site, they are given an overview of the test and an explanation of LAVE for approximately 15-20 minutes. They are then asked to produce a video with LAVE using their own footage, which takes about 20-30 minutes. After using LAVE, participants complete a questionnaire covering their perceptions of usefulness, ease of use, trust, agency, and the agent's role, both for each feature and for the system as a whole. They are also asked whether they prefer agent assistance or manual operation for each editing feature. All survey questions follow a 7-point Likert scale.

This is followed by a semi-structured interview lasting approximately 20-30 minutes, during which participants can share their thoughts and ask any questions they may have. During the user test, we do not instruct users to prioritize speed; instead, we observe how they use LAVE to edit their videos and keep the environment conducive to gathering feedback.

User Testing - Results and Discussion

Here are some results and observations from user testing.

All participants were able to create a satisfactory video using LAVE, reporting a low degree of dissatisfaction (mean = 2, standard deviation = 1.3). Seven of the participants rated their satisfaction with the final result as 6 out of 7, with one (P2) giving it a 5. The results indicate that many people find LAVE enjoyable to use and would like to use it regularly. In particular, LAVE is valued for lowering the barriers to video editing for beginners.

While participants generally found LAVE's design useful and easy to use, they were divided in their evaluation of some features. In particular, participants who valued originality tended to dislike suggestions from agents. It was also pointed out that the probabilistic nature of the large-scale language model sometimes resulted in trimming and storyboarding results that differed from expectations.

Many participants also found LAVE's automation reliable and easy to control, and they felt a strong sense of their own contribution to the final work.


No one viewed the AI agent as a leader: half of the participants perceived the agent as an "assistant" and the other half as a "partner." Even those who saw it as a partner felt that the work was their own and that they were the ones editing it, with the agent providing support. Additionally, many participants felt that LAVE contributed especially to creativity, and those who viewed the agent as a partner reported a particularly strong sense of co-creation with the AI.

Discussion of User Test Results

User testing has shown that using natural language, both as a means of interacting with the system and as a representation of multimedia content, is highly effective: it reduces manual work and makes editing easier to understand. In the future, this approach is expected to extend beyond video editing, enabling editing of a wider range of multimedia content by converting voice, motion, and other sensory input into text.

User testing has also shown that incorporating agents that leverage large-scale language models can improve the content editing experience, but that preferences for agent assistance vary by user and by the nature of the task. Users who value original ideas tend to avoid brainstorming with agents, while others prefer it. In the future, users will likely require agent assistance that automatically adapts to their preferences and tasks, with the ability to enable, disable, or customize assistance as needed. It will also be necessary to provide flexibility between agent assistance and manual editing, allowing users to fine-tune AI predictions and correct inaccuracies.

Furthermore, we found that users' prior knowledge of and experience with large-scale language models affects how well they can make use of the editing system. Users with a deep understanding of large-scale language models quickly grasp the agent's functionality and use it efficiently, while users unfamiliar with such models may not be able to take full advantage of the system. More onboarding support for novice users is therefore likely to be required.

Based on these suggestions, efforts should be made to improve the design of content editing systems that utilize large-scale language models to provide assistance that is more adaptable to user needs.

Summary

This paper proposes LAVE, a new agent-assisted video editing tool that leverages large-scale language models. The system supports video editing by making the most of natural language with state-of-the-art technology.

The paper details LAVE's key features, demonstrates its effectiveness through user testing, and organizes users' perceptions of and reactions to a large-scale language model agent that assists in video editing. In addition, it shares insights from this study that can inform the design of similar systems in the future.

This paper provides a new perspective on the future of agent-assisted media content editing tools and demonstrates their potential.
