Artificial intelligence meets WGBH's archives


An effort started by computational linguist James Pustejovsky aims to index and catalog some of the most famous programs in public television and radio’s history — no humans needed.


The stacks in the WGBH archives.

WGBH Boston is one of the nation’s most celebrated public television and radio stations, a top producer of blockbuster programming. Over the decades, its prodigious output has created an equally sizable problem. How to keep track of the roughly 400,000 audio, video and film recordings in its archives?

Programs go back more than half a century and include iconic ones like “American Experience,” “Frontline,” and “NOVA,” and more obscure ones, like “Gallimaufry,” “Hot Nights” and a 1961 lecture by Harvard philosophy professor Gabriel Marcel on “The Existential Backgrounds of Human Dignity.”

The station’s archival vault contains row after row of storage shelves, all piled high with audio and video recordings. Last spring, computer scientist Kelley Lynch, MS’17, visited the vault with a WGBH archivist who retrieved a heavy metal box. “Beef bourguignon” was scrawled on masking tape affixed to the box.

There was no way to know the box held a recording of one of television’s most historic programs — the 1963 debut episode of Julia Child’s first television show, “The French Chef.”

Thousands of other films, videos and audiotapes in the vault, many of them also landmark moments in television history, bear similarly slim identifying information. WGBH hoped to identify all the tapes’ content.

But “having a librarian sit down and catalog every single item would have been insane,” said Karen Cariani, the David O. Ives Executive Director of WGBH’s Media Library and Archives. “We needed something faster.”

Lynch; James Pustejovsky, the TJX Feldberg Professor of Computer Science; and several of his other students proposed to bring order to the chaos, volunteering to program computers to analyze the material and generate identifying information using AI algorithms.

Over time, the AI programming would make the computers smarter — better able to recognize show hosts, discern scene breaks, identify background scenery and even generate basic content summaries.

Eventually, anyone interested in tracking down old programs would be able to search the station’s archives. Researchers could use the database to find historic interviews and recordings.

WGBH eagerly accepted Pustejovsky’s offer to help. Last April, his lab and WGBH began collaborating. “These materials are part of our national heritage,” says Pustejovsky. “It’s critical they’re widely available.”

It’s a huge undertaking. In some cases, a show’s video includes outtakes. Cariani says there’s an extended interview with Henry Kissinger from the 1989 PBS series “War and Peace in the Nuclear Age” that will doubtless interest historians.

WGBH also wants to catalog the American Archive of Public Broadcasting — some 100,000 digitized programs dating back to the 1940s from more than 100 public television and radio stations.

Six years ago, the Corporation for Public Broadcasting awarded joint stewardship of the archive to WGBH and the Library of Congress. The materials range from a Hawaii public television program on indigenous peoples to every episode of “Sesame Street.”

Why weren’t these shows properly tagged and identified when they were made? Cariani says staff rarely had the time due to the hectic pace of production. “The first thing on a producer or writer’s mind is getting the next show on the air, not archiving the one you just completed,” she said.

In 1979, WGBH began following the Modern Language Association’s guidelines for cataloging materials, but implementation was still spotty.

Pustejovsky and his lab started their work last spring and have been taking a slow, steady approach to indexing the materials. They began with the easiest part.

Many shows begin with a film slate, or clapboard, with the broadcast date, producer and title. Pustejovsky’s team used optical character recognition (OCR) to extract this text, which is often handwritten.
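Once OCR (using a tool such as Tesseract) has extracted the raw text from a slate frame, that text still has to be split into metadata fields. The sketch below is purely illustrative — the function name, the field labels, and the sample slate are assumptions, not the lab’s actual code — and real handwritten slates would need far fuzzier matching:

```python
import re

def parse_slate_text(ocr_text: str) -> dict:
    """Pull title, date, and producer out of raw OCR text from a film slate.

    A toy sketch: real slates are handwritten and noisy, so production code
    would fuzzy-match field labels rather than rely on exact regexes.
    """
    fields = {"title": None, "date": None, "producer": None}
    # Broadcast dates on slates are often written numerically, e.g. 2/11/1963.
    date = re.search(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", ocr_text)
    if date:
        fields["date"] = date.group()
    # Producer credits tend to carry a label like "PROD:" or "PRODUCER".
    prod = re.search(r"PROD(?:UCER)?[:.]?\s*(.+)", ocr_text, re.IGNORECASE)
    if prod:
        fields["producer"] = prod.group(1).strip()
    # Treat the first line that is neither a date nor a labeled field as the title.
    for line in ocr_text.splitlines():
        line = line.strip()
        if line and not re.match(r"\d|PROD", line, re.IGNORECASE):
            fields["title"] = line
            break
    return fields

# Hypothetical slate, loosely modeled on the box described above.
slate = "THE FRENCH CHEF\n2/11/1963\nPROD: R. MORASH"
print(parse_slate_text(slate))
```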

Where there’s a program transcript, Pustejovsky and his collaborators use timestamps to align the transcript’s words with the spoken dialogue down to the millisecond.
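The alignment idea can be sketched roughly like this: let a speech recognizer produce word-level timings, then match its output against the clean transcript and carry the timestamps over. The function and its inputs are assumptions for illustration; production forced alignment typically works at the phoneme level, not by matching word sequences:

```python
from difflib import SequenceMatcher

def align_transcript(asr_words, transcript):
    """Attach per-word start times (ms) from a recognizer to a clean transcript.

    asr_words: list of (word, start_ms) pairs as a recognizer might emit them.
    Returns a list of (transcript_word, start_ms or None) pairs, with None
    where no timing could be transferred.
    """
    words = transcript.split()
    ref = [w.lower().strip(".,?!") for w in words]   # normalized transcript
    hyp = [w.lower() for w, _ in asr_words]          # normalized ASR output
    aligned = [(w, None) for w in words]
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    # Copy a timestamp onto every transcript word inside a matching run.
    for block in matcher.get_matching_blocks():
        for i in range(block.size):
            aligned[block.a + i] = (words[block.a + i],
                                    asr_words[block.b + i][1])
    return aligned
```

Using `SequenceMatcher` keeps the sketch robust to small ASR errors: mismatched words simply keep a `None` timestamp instead of derailing the rest of the alignment.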

Facial recognition is trickier. Algorithms developed by Pustejovsky’s team will look for instances where an on-screen name identifies an interviewee. The next time the person appears, even if unidentified, the computer will recognize them. That information will become part of the transcript, marking exactly when the individual speaks.
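The re-identification step can be sketched as nearest-neighbor matching over face embeddings: the first time a chyron names someone, their embedding is stored under that name, and later faces are compared against the stored set. The embeddings and threshold below are stand-ins — a real pipeline would get vectors from a face-recognition model — so treat this as a sketch of the matching logic only:

```python
import numpy as np

def identify_face(embedding, known_faces, threshold=0.8):
    """Match a face embedding against previously named faces.

    known_faces: dict mapping name -> embedding vector, captured the first
    time an on-screen name identified the person. Uses cosine similarity;
    returns the best-matching name, or None if nothing clears the threshold.
    """
    best_name, best_score = None, threshold
    for name, known in known_faces.items():
        # Cosine similarity between the query and a stored embedding.
        score = float(np.dot(embedding, known) /
                      (np.linalg.norm(embedding) * np.linalg.norm(known)))
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```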

Then there’s identifying the location. Pustejovsky wants the computer program to determine whether a segment in a television show was shot indoors or outdoors, in a forest or on top of a mountain, inside a house or office. The computer must search for clues in the transcript or identify visual elements, such as a tree or water, in the video.
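The transcript side of that search for clues can be sketched as a simple keyword vote — a deliberately crude stand-in for the lab's approach, which also inspects visual elements in the frames. The cue lists and function name here are invented for illustration:

```python
def guess_setting(transcript_segment: str) -> str:
    """Guess indoors vs. outdoors from words in a transcript segment.

    A toy heuristic: count hits against two small cue vocabularies and
    report whichever side wins, or "unknown" on a tie.
    """
    outdoor_cues = {"forest", "mountain", "river", "street", "field", "sky"}
    indoor_cues = {"kitchen", "studio", "office", "room", "stage", "house"}
    words = set(transcript_segment.lower().split())
    outdoor_hits = len(words & outdoor_cues)
    indoor_hits = len(words & indoor_cues)
    if outdoor_hits == indoor_hits:
        return "unknown"
    return "outdoors" if outdoor_hits > indoor_hits else "indoors"
```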

Perhaps the most challenging task will be generating summaries, says Pustejovsky.

These descriptions will be basic for now; for example, “storm in Louisiana” for one segment. Over time, with enough refinements, the algorithms may be able to produce more complete summaries, such as, “The banks of the river are overflowing during the hurricane.”
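The "basic for now" stage might amount to little more than filling a template from detected tags, as in this sketch. The slot names are hypothetical; a real system would fill them from speech, captions, and visual detection:

```python
def summarize_segment(tags: dict) -> str:
    """Compose a bare-bones segment summary from detected tags.

    tags might look like {"event": "storm", "location": "Louisiana"} --
    slots a fuller system would populate automatically.
    """
    parts = []
    if tags.get("event"):
        parts.append(tags["event"])
    if tags.get("location"):
        parts.append(f"in {tags['location']}")
    return " ".join(parts) if parts else "unlabeled segment"
```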

Some of the tools Pustejovsky and his lab are developing already exist, but they’re proprietary and cost money. Pustejovsky plans to create open-source software so it’s available free to libraries and TV and radio stations.

WGBH’s Cariani said it may be years or even a decade before all the work is completed. “There’s a huge amount of really great content that’s been produced and created over the years that we need to preserve and make accessible to the American people,” she said. “We have to start somewhere.”