Talking Head

People speaking to camera or across from an interviewer, framed tightly with controlled lighting that isolates face and voice from surroundings. The footage captures clean audio alongside unobstructed front-of-face framing across a wide range of speakers.

Hours: 449K+
Countries: 30+
Languages: 10+

Training Use Cases

Audio-driven facial animation and lip-sync
Avatar and digital human generation
Visual speech recognition and audio-visual ASR
Expression and affect recognition during natural speech

Key Highlights

30+ countries of origin and 10+ languages including English, Spanish, Mandarin, French, German, Hindi, Japanese, Arabic
5+ framing conventions from tight close-up through medium, two-shot, over-the-shoulder, and cut coverage
Solo direct-to-camera and two-person sit-down conversation formats across the set
Controlled lighting throughout, with face and voice prioritized over surroundings

Metadata Fields

duration: Length of clip in HH:MM:SS
resolution: Pixel dimensions (e.g., 1920x1080, 3840x2160)
frame_rate: Frames per second (e.g., 24, 30, 60, 120)
contains_audio: Whether the clip carries an audio track (boolean)
primary_category: Dominant content category assigned to a video
style: tight_close_up | medium_close | two_shot | over_the_shoulder | multi_camera_cut
conversation_format: solo_to_camera | one_on_one_interview | sit_down_two_shot | hosted_segment
speaker_count: 1 | 2
language: Primary spoken language (ISO 639-1 code)
country_of_origin: Country where footage was produced (ISO 3166 code)
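
The fields above could be loaded and checked along the following lines. This is a minimal Python sketch, not an official loader: the ClipMetadata class, the helper functions, the sample record, and the "talking_head" category value are all hypothetical, and the delivery format of the metadata is not specified here. It assumes the ISO 639-1 code is a two-letter string and leaves country_of_origin unvalidated beyond being non-empty, since the ISO 3166 variant is not stated.

```python
from dataclasses import dataclass

# Allowed values copied from the field list above.
STYLES = {"tight_close_up", "medium_close", "two_shot",
          "over_the_shoulder", "multi_camera_cut"}
FORMATS = {"solo_to_camera", "one_on_one_interview",
           "sit_down_two_shot", "hosted_segment"}


def duration_to_seconds(value: str) -> int:
    """Convert an HH:MM:SS string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    if hours < 0 or not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError(f"invalid duration: {value!r}")
    return hours * 3600 + minutes * 60 + seconds


def parse_resolution(value: str) -> tuple[int, int]:
    """Split a 'WIDTHxHEIGHT' string into integer pixel dimensions."""
    width, height = value.lower().split("x")
    return int(width), int(height)


@dataclass(frozen=True)
class ClipMetadata:
    """Hypothetical record type mirroring the documented fields."""
    duration: str
    resolution: str
    frame_rate: float
    contains_audio: bool
    primary_category: str
    style: str
    conversation_format: str
    speaker_count: int
    language: str
    country_of_origin: str

    def __post_init__(self) -> None:
        duration_to_seconds(self.duration)   # raises on malformed input
        parse_resolution(self.resolution)    # raises on malformed input
        if self.frame_rate <= 0:
            raise ValueError("frame_rate must be positive")
        if not isinstance(self.contains_audio, bool):
            raise ValueError("contains_audio must be a boolean")
        if self.style not in STYLES:
            raise ValueError(f"unknown style: {self.style!r}")
        if self.conversation_format not in FORMATS:
            raise ValueError(f"unknown format: {self.conversation_format!r}")
        if self.speaker_count not in (1, 2):
            raise ValueError("speaker_count must be 1 or 2")
        # Assumption: ISO 639-1 codes are two letters, e.g. "en".
        if not (len(self.language) == 2 and self.language.isalpha()):
            raise ValueError(f"unexpected language code: {self.language!r}")
        if not self.country_of_origin:
            raise ValueError("country_of_origin must not be empty")


def is_lipsync_candidate(clip: ClipMetadata) -> bool:
    """Illustrative filter: single speaker, facing camera, with audio."""
    return (clip.contains_audio
            and clip.speaker_count == 1
            and clip.conversation_format == "solo_to_camera")


# Made-up sample record for illustration only.
record = {
    "duration": "00:01:30",
    "resolution": "1920x1080",
    "frame_rate": 30,
    "contains_audio": True,
    "primary_category": "talking_head",  # hypothetical value
    "style": "tight_close_up",
    "conversation_format": "solo_to_camera",
    "speaker_count": 1,
    "language": "en",
    "country_of_origin": "US",
}
clip = ClipMetadata(**record)
```

A filter such as is_lipsync_candidate shows how the conversation_format, speaker_count, and contains_audio fields might be combined to select clips for a use case like audio-driven lip-sync; the actual selection criteria would depend on the training task.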