RAG for Video: Searchable Timelines and Shot Libraries
Video RAG needs transcripts, frames, timecodes, speaker labels, shot metadata, and edit decisions, not just embeddings over text.
RAG for video is not just RAG with bigger files.
Text RAG retrieves chunks. Video RAG retrieves moments. A moment has speech, visuals, sound, speaker identity, camera angle, timecode, scene context, and sometimes an edit decision attached to it. If you flatten that into plain transcript text, you lose the thing that makes video searchable.
The goal is not "ask questions about a video." The goal is to find the exact shot, quote, scene, or timeline segment that helps an editor make a decision.
Video Has Multiple Timelines
A video file has at least three timelines.
- Media time is the source file's timecode.
- Transcript time is where words appear.
- Edit time is where clips sit in an edited timeline.
These are not the same. A clip from 00:42 in the source file might appear at 01:10 in the final edit. A quote might be split across cuts. B-roll might cover an interview line.
A useful video RAG system preserves all of this.
```json
{
  "moment_id": "m_9182",
  "asset_id": "interview_a_cam_03",
  "source_start": 42.1,
  "source_end": 57.8,
  "transcript": "The hard part was not writing code. It was making the workflow reliable.",
  "speaker": "founder",
  "visual_tags": ["talking_head", "office", "medium_shot"],
  "timeline_refs": [
    {"timeline_id": "launch_v04", "start": 18.2, "end": 31.7}
  ]
}
```

This object is much more useful than a text chunk.
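The timeline_refs field is what lets a retrieval hit become an edit. As a minimal sketch, assuming the moment shape above and constant playback speed (no retiming), a match at a given source time can be projected into each edit timeline; source_to_edit_time is a name invented for illustration:

```python
# Sketch: project a source-time hit into edit time via timeline_refs.
# Assumes the moment object shown above and normal playback speed.

def source_to_edit_time(moment: dict, source_t: float) -> list[dict]:
    """Return each edit-timeline position where source_t appears."""
    if not (moment["source_start"] <= source_t <= moment["source_end"]):
        return []  # hit falls outside this moment
    offset = source_t - moment["source_start"]
    return [
        {"timeline_id": ref["timeline_id"], "edit_time": ref["start"] + offset}
        for ref in moment["timeline_refs"]
        if ref["start"] + offset <= ref["end"]  # ref may be trimmed shorter
    ]
```

With the example above, a transcript hit at 45.0 seconds of source time lands at 21.1 seconds in launch_v04.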
Retrieval Should Be Multimodal and Structured
A video search query can mean different things.
"Find the clip where she talks about reliability" is transcript retrieval.
"Find the shot with the product dashboard" is visual retrieval.
"Find the energetic moment after the demo" is timeline-aware retrieval.
"Find unused B-roll of the office" requires asset metadata and edit history.
| Query Type | Best Signal |
|---|---|
| Quote search | Transcript embeddings and keyword search |
| Visual search | Frame embeddings and shot tags |
| Speaker search | Diarization |
| Version search | Timeline references |
| Unused footage | Asset index minus timeline usage |
The mistake is relying on one embedding index. Video needs multiple indexes joined by timecode.
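As a hedged sketch of that join, suppose each index returns hits shaped as {"asset_id", "start", "end", "score"} in source time (the shape and the merge rule here are illustrative, not a standard API). Hits from different indexes that overlap on the same asset can be grouped so that multi-signal moments outrank single-signal ones:

```python
from collections import defaultdict

# Sketch: group hits from separate indexes that land on the same asset
# within a small time tolerance. The hit shape is assumed, not standard.

def merge_hits(hits_per_index: dict[str, list[dict]], tolerance: float = 2.0):
    by_asset = defaultdict(list)
    for index_name, hits in hits_per_index.items():
        for hit in hits:  # hit: {"asset_id", "start", "end", "score"}
            by_asset[hit["asset_id"]].append({**hit, "index": index_name})

    merged = []
    for asset_id, hits in by_asset.items():
        hits.sort(key=lambda h: h["start"])
        group = [hits[0]]
        for hit in hits[1:]:
            if hit["start"] <= group[-1]["end"] + tolerance:
                group.append(hit)  # overlaps the running group
            else:
                merged.append({"asset_id": asset_id, "hits": group})
                group = [hit]
        merged.append({"asset_id": asset_id, "hits": group})
    return merged
```

A moment confirmed by both the transcript index and the frame index is usually a stronger answer than one found by either alone.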
The Indexing Pipeline
A practical video RAG pipeline runs in stages:

1. Probe the media and create stable asset IDs.
2. Extract audio and generate a transcript with speaker labels.
3. Detect shots and sample frames.
4. Create visual tags and embeddings.
5. Attach edit history when the asset appears in a timeline.
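The sketch below strings those stages together; every helper in it (probe_media, transcribe_with_speakers, detect_shots, embed_frames, lookup_timeline_usage) is a hypothetical placeholder for real tools such as ffprobe, an ASR service with diarization, and a shot detector:

```python
import hashlib

# Orchestration sketch only: all helpers below are hypothetical stand-ins.

def index_asset(path: str) -> dict:
    media = probe_media(path)                   # stage 1: duration, codecs
    # Stable ID; in production, hash file bytes rather than the path.
    asset_id = hashlib.sha256(path.encode()).hexdigest()[:12]
    segments = transcribe_with_speakers(path)   # stage 2: words + speakers
    shots = detect_shots(path)                  # stage 3: shot boundaries
    embed_frames(path, shots)                   # stage 4: tags + embeddings
    usage = lookup_timeline_usage(asset_id)     # stage 5: edit history
    return {
        "asset_id": asset_id,
        "indexes": {"transcript": "ready", "frames": "ready",
                    "speakers": "ready",
                    "timeline_usage": "ready" if usage is not None else "pending"},
        "coverage": {"duration_seconds": media["duration"],
                     "transcript_segments": len(segments),
                     "shot_segments": len(shots)},
    }
```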
The output should be an index manifest, not a pile of unrelated vectors.
```json
{
  "asset_id": "demo_day_interview_cam_a",
  "indexes": {
    "transcript": "ready",
    "frames": "ready",
    "speakers": "ready",
    "timeline_usage": "ready"
  },
  "coverage": {
    "duration_seconds": 1842,
    "transcript_segments": 312,
    "shot_segments": 94
  }
}
```

This makes the system debuggable. If visual search fails, you can inspect the frame index. If quote search fails, you can inspect transcript coverage. Without this manifest, retrieval quality becomes guesswork.
Chunking Video by Scene Beats
Text RAG often chunks by tokens. Video should chunk by scene beats.
A beat can be:
- A sentence or complete answer.
- A shot boundary.
- A change in speaker.
- A change in scene.
- A meaningful action.
The right chunk depends on the use case. For interviews, answer-level chunks work well. For sports footage, action-level chunks matter. For tutorials, step-level chunks are better.
Do not blindly split every 30 seconds. That creates chunks that start and end in the middle of meaning.
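A minimal sketch of beat-based chunking, assuming transcript segments shaped as {"start", "end", "text", "speaker"} and a list of shot-boundary times (neither shape comes from a specific library): start a new chunk whenever the speaker changes, a shot boundary falls between segments, or the previous segment ends a sentence.

```python
# Sketch: merge transcript segments into beat-level chunks.
# Cuts on speaker change, shot boundary, or sentence-final punctuation.

def chunk_by_beats(segments: list[dict], shot_starts: list[float]) -> list[dict]:
    chunks, current = [], []
    for seg in segments:
        boundary = current and (
            seg["speaker"] != current[-1]["speaker"]
            or any(current[-1]["end"] <= s <= seg["start"] for s in shot_starts)
            or current[-1]["text"].rstrip().endswith((".", "?", "!"))
        )
        if boundary:
            chunks.append(_close(current))
            current = []
        current.append(seg)
    if current:
        chunks.append(_close(current))
    return chunks

def _close(segs: list[dict]) -> dict:
    return {"start": segs[0]["start"], "end": segs[-1]["end"],
            "speaker": segs[0]["speaker"],
            "text": " ".join(s["text"] for s in segs)}
```

Swap the boundary conditions per use case: answer-level for interviews, step-level for tutorials.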
The Edit Decision Layer
The most powerful part of video RAG is not search. It is edit memory.
The system should know:
- Which clips were used.
- Which clips were rejected.
- Which notes caused which changes.
- Which exports were approved.
- Which moments appeared in high-performing versions.
This turns the RAG system into an editing brain. It can answer:
> Show me unused clips where the founder explains the core problem in under 15 seconds.

That query requires transcript search, duration filtering, speaker labels, and timeline exclusion. A plain vector database cannot answer it alone.
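As a sketch of how those four signals combine (transcript_search and the used_moment_ids set are hypothetical placeholders for your search index and edit-history store):

```python
# Sketch: combine transcript search, speaker labels, duration filtering,
# and timeline exclusion. transcript_search is a hypothetical helper.

def unused_clips(query: str, speaker: str, max_seconds: float,
                 moments: list[dict], used_moment_ids: set[str]) -> list[dict]:
    hits = transcript_search(query, moments)  # semantic + keyword search
    return [
        m for m in hits
        if m["speaker"] == speaker                              # diarization
        and m["source_end"] - m["source_start"] <= max_seconds  # duration
        and m["moment_id"] not in used_moment_ids               # exclusion
    ]
```

For example: unused_clips("explains the core problem", "founder", 15.0, moments, used_ids).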
RAG Results Should Return Artifacts
A video RAG result should return more than text.
It should return:
- Source clip link
- Timecode
- Transcript
- Thumbnail
- Confidence
- Related timeline usage
- Suggested edit action
For agents, artifact links are essential. The agent should not paste a 200MB video into context. It should receive a signed clip preview and structured JSON.
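As an illustrative shape only, reusing the moment from earlier (field names and URLs are made up, not a fixed schema):

```python
# Illustrative result payload; every field name and URL is hypothetical.
result = {
    "moment_id": "m_9182",
    "clip_url": "https://media.example.com/clips/m_9182?sig=abc123",  # signed preview
    "source_timecode": {"start": 42.1, "end": 57.8},
    "transcript": "The hard part was not writing code. It was making the workflow reliable.",
    "thumbnail_url": "https://media.example.com/thumbs/m_9182.jpg",
    "confidence": 0.87,
    "timeline_usage": [{"timeline_id": "launch_v04", "start": 18.2, "end": 31.7}],
    "suggested_action": "insert_after_current_clip",
}
```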
Key Takeaways
- Video RAG retrieves moments, not text chunks.
- Preserve media time, transcript time, and edit time.
- Use multiple indexes: transcript, visual frames, speakers, shots, and timeline references.
- Chunk video by scene beats or answers, not arbitrary time windows.
- Return artifact links and structured edit metadata, not raw media dumps.
FAQ
Can I build video RAG with transcripts only?
Yes, for quote search and interview editing. But transcripts alone cannot find visual B-roll, unused shots, camera angles, or edit-version history.
What is the most important metadata in video RAG?
Timecode. Without reliable timecode, retrieval results cannot become edits. Every transcript, shot, tag, and timeline reference should map back to source time.
Should video RAG use one vector database?
Usually no. Use separate indexes or fields for transcript, visual frames, speakers, and metadata, then merge results through timecode and asset IDs.
Written & published by Chaitanya Prabuddha