How important is a word to a particular genre?
Who initiates violence more often: protesters or police?
What if we could search for things based on shape, rather than keywords?
At a conference for the digital humanities hosted by UC Berkeley, computer scientists and humanists gathered from around the U.S. to discuss bold research questions like these, made possible by growing stores of data in digital libraries and a few new machine learning tricks.
One such library is HathiTrust, a digital database of 16 million volumes. The organization also has a research arm: the HathiTrust Research Center, or HTRC, co-located at Indiana University and the University of Illinois, which offers tools and guidance for researchers wanting to mine the collection for new discoveries in human language and history.
In late January, the center held its 2018 HTRC UnCamp, filling the fifth floor of Moffitt Library with project presentations and crash courses on textual analysis. The conference also included break-out sessions throughout campus, in the D-Lab and the Berkeley Institute for Data Science, or BIDS.
The goal of the UnCamp was to pull together the diverse group of researchers using HathiTrust, from educators and librarians to community members, explained Robert McDonald, associate dean for research & technology strategies at Indiana University.
This conference in particular was exciting, McDonald said, because of a surge in community engagement and attendance as people have become more familiar with the database. About 150 people registered for this conference, he said, compared with about 30 at the last UnCamp, in 2015.
On its website, HathiTrust boasts several built-in algorithms that help researchers learn new things about texts based on their metadata and extracted features, such as word usage and page counts. Most of the digitized texts in the collection are still under copyright, so researchers are cut off from studying them in traditional ways.
The benefit of HathiTrust’s database is that computers, not humans, are searching the texts, so researchers can still discover important linguistic clues without violating copyright.
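To make that concrete, here is a minimal sketch of what such copyright-safe analysis (HathiTrust calls it “non-consumptive” research) can look like: counting a volume’s most frequent words from per-page token tallies rather than from the full text. The JSON layout and field names are illustrative assumptions loosely modeled on the page-level feature files HTRC distributes, not the exact schema.

```python
import json
from collections import Counter

def top_words(features_path, n=10):
    """Tally word counts for a volume without ever touching its full text.

    Assumes a JSON layout loosely modeled on HTRC's page-level
    extracted-features files; the field names ("features", "pages",
    "tokenCount") are illustrative, not the exact schema.
    """
    with open(features_path) as f:
        volume = json.load(f)

    counts = Counter()
    for page in volume["features"]["pages"]:
        for token, tally in page["tokenCount"].items():
            counts[token.lower()] += tally
    return counts.most_common(n)

# Example: the ten most frequent words in one digitized volume.
# print(top_words("volume_features.json"))
```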
The web-based tools on the site radically expand what researchers can do with their material. But perhaps more significantly, those capabilities also widen circles in the humanities by demanding new skills and sparking unlikely collaborations.
“Most humanities people, we just work alone — we sit in a room and write, or read,” said Loren Glass, an English professor at the University of Iowa who is using the database to study the relationship between where a writer is from and what they go on to write about. “I have enormously welcomed this collaborative laboratory dynamic where, instead, you sit in a room with other people with different skill sets and you’re able to all benefit from each other’s work.”
“The more of that, the better,” he said.
University of Nebraska researchers Leen-Kiat Soh and Elizabeth Lorang, who gave one of the keynote talks at the conference, are a good example. Soh is a computer scientist; Lorang, a poetry-loving librarian. Together, they created AIDA, a tool that searches digitized images for specific types of literary content. At the conference, they showed how they’re using machine learning to find poems buried in historic newspapers.
Tens of millions of poems have been published in historic newspapers, but not all of them end up in the “poet’s corner.” They’re sprinkled throughout obituaries, marriage announcements, and advertisements. You’d have to comb through each newspaper by hand to find them — an impossible task.
Instead, the team tried to think about what a poem looks like. They measured the spaces between stanzas and the jaggedness of the right margins, and trained an algorithm to detect similar patterns across endless fields of black and white.
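Those two cues translate naturally into code. The Python sketch below measures them on a hypothetical binarized page image; the array layout and feature choices are illustrative assumptions, not AIDA’s actual implementation.

```python
import numpy as np

def layout_features(page):
    """Measure the two cues the team describes on a binarized page
    image (a 2-D array where 1 = ink, 0 = background): whitespace
    between stanzas and the jaggedness of the right margin.
    """
    ink_per_row = page.sum(axis=1)
    text_rows = np.flatnonzero(ink_per_row > 0)

    # Runs of blank rows between lines of text approximate the
    # whitespace separating stanzas.
    gaps = np.diff(text_rows)
    gap_spread = float(gaps.std()) if gaps.size > 1 else 0.0

    # Rightmost ink column of each text line: prose justifies to a
    # straight margin, while verse leaves a ragged edge.
    right_edges = [np.flatnonzero(page[r])[-1] for r in text_rows]
    jaggedness = float(np.std(right_edges)) if len(right_edges) else 0.0

    return [gap_spread, jaggedness]

# Vectors like these, computed over hand-labeled pages, could train
# any off-the-shelf classifier to flag poem-like regions.
```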
“The original idea was to find the poems, and then think about how to analyze the text,” Lorang said. “But now it’s become, let’s find them in order to make this possible for other people to do.”
“We could pursue this as a research project for years and years, but ultimately if there’s not uptake in the community, it’s not going to matter,” she continued. The conference, she said, was a chance to get feedback on their project, as well as get a better feel for where to go in the future.
The wider goal, she said, is to bring attention to lesser-known poems and correct some historical oversights. With our current search tools, we’re only ever looking for names and lines we already know about, she said.
Many of the projects discussed focused on this kind of recovery work in our collective canon. Textual analysis and big data make lesser-known voices easier to find, giving researchers the chance to reshape the cultural record.
One conference guest, Annie Swafford, a digital humanities specialist at Tufts University, is curating a corpus of works by a group of British women who, in the 1880s, formed the first women’s literary dinner club. “Women didn’t just want to talk about clothes — they wanted intense, philosophical discussion,” Swafford said. She’s interested in how the vocabulary and themes of women’s writing of the time differed from their male counterparts.
Swafford came to the conference to discover new research tools for her work, but also to learn how to support others’ work. She is Tufts University’s first digital humanities specialist, and next month she’ll lead an introductory workshop on textual analysis. She said she’s excited to show people some of the HathiTrust tools. She particularly liked Bookworm, a simple program that compares the popularity of a word across place and time and can help teach students that language is a changing phenomenon.
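The statistic behind a Bookworm-style chart is easy to sketch. The toy version below tallies a word’s relative frequency per year from (year, text) pairs; the corpus and interface are invented for illustration and bear no relation to Bookworm’s real code or API.

```python
from collections import defaultdict

def frequency_by_year(corpus, word):
    """Relative frequency of `word` per year, the statistic a
    Bookworm-style chart plots. `corpus` is assumed to be an
    iterable of (year, text) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, text in corpus:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(word.lower())
    return {year: hits[year] / totals[year] for year in sorted(totals)}

corpus = [(1851, "the whale sounded beneath the ship"),
          (1951, "the reactor hummed beneath the city")]
print(frequency_by_year(corpus, "whale"))  # {1851: 0.1666..., 1951: 0.0}
```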
Audience members played with Bookworm on their personal computers during the conference. They also tried their hand at creating worksets from the HathiTrust database and running simple text analyses such as topic modeling, in which a computer sorts through word patterns and clusters related words together to surface a text’s major themes.
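Topic modeling itself takes only a few lines with standard tools. The example below runs latent Dirichlet allocation, a common topic-modeling technique, in scikit-learn on an invented four-document corpus; the documents and the choice of two topics are purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the ship sailed the harbor under heavy sail",
    "the court ruled the statute unconstitutional",
    "sailors rigged the mast before the storm",
    "the judge cited the appellate ruling",
]

# Turn documents into word-count vectors, then let LDA cluster
# co-occurring words into a fixed number of "topics."
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the most heavily weighted words in each discovered topic.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-4:][::-1]]
    print(f"topic {i}: {top}")
```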
A major focus of the UnCamp was educating people about how to take advantage of HathiTrust’s digital collection. During the hands-on sessions, Chris Hench, a postdoc at the D-Lab and BIDS, presented an instructive module he built with Cody Hennesy, the campus’s information studies librarian, to teach people how to build worksets from the database. Teammate Alex Chan, a third-year computer science student, then showed attendees an example of the kinds of programs users can build to investigate those collections. He presented an algorithm he built that, after a bit of training, can automatically sort volumes into genres based on similarities in language.
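A classifier of that general shape can be sketched quickly: represent each volume’s language as TF-IDF word weights and fit an off-the-shelf model. The snippet below is a generic illustration with invented training texts and labels, not Chan’s actual algorithm.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented training set; a real model would train on whole
# volumes drawn from HathiTrust worksets.
texts = [
    "the detective examined the body in the locked room",
    "her heart raced as he took her hand at the ball",
    "the inspector traced the poison to the butler",
    "she wept, for love had come at last to the manor",
]
genres = ["mystery", "romance", "mystery", "romance"]

# TF-IDF weighting plus a linear classifier: a standard baseline
# for sorting texts by similarities in language.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, genres)

print(model.predict(["a clue hidden in the locked study"]))  # likely "mystery"
```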
The educational HTRC module, Hench said, was an extension of some of the data analysis training that Berkeley’s Division of Data Sciences has been offering around campus. Hench and the data science modules team visit a range of courses, working with students to answer relevant questions with crunchable data.
In an International and Area Studies course, for example, students investigated different measurements of social inequality. The data team helped the class quantify the weight of societal factors such as education, wealth, and income, and combine them into an overall inequality assessment.
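An assessment like that often reduces to a weighted combination of normalized indicators. The sketch below shows one hypothetical way to compute it; the factor values and weights are stand-ins, not the class’s data.

```python
def inequality_index(factors, weights):
    """Combine normalized social indicators into one score.

    Assumes factor values are already scaled to 0-1, where higher
    means more unequal; both values and weights are hypothetical.
    """
    assert set(factors) == set(weights)
    total_weight = sum(weights.values())
    return sum(factors[k] * weights[k] for k in factors) / total_weight

factors = {"education": 0.4, "wealth": 0.7, "income": 0.5}
weights = {"education": 1.0, "wealth": 2.0, "income": 1.5}
print(round(inequality_index(factors, weights), 3))  # 0.567
```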
With all of the exciting content, most speakers barely finished their presentations in time, hurrying through their last slides, anxious to share final details.
Nick Adams, who works in BIDS, presented the web interface he developed to crowdsource the arduous hand-labeling work needed to train algorithms. Right now, he’s examining newspapers in 184 cities for stories on protests to analyze why and how police and protesters initiate violence.
In the last seconds of his talk, he turned to acknowledge his collaborator, Norman Gilmore.
“I’m a sociologist,” Adams said. “I’ve gotten into text analysis in the last few years … but I am not a software engineer. This would not have happened without Norman.”