This story originally appeared in Yale Engineering magazine.
Whether you’re figuring out the best place to catch an Uber ride or mapping the human brain, there’s a better, faster way to do it. Amin Karbasi, assistant professor of electrical engineering and computer science, is working on it.
Working at the intersection of learning theory, optimization, and information processing, Karbasi’s research focuses on developing ways to better navigate our increasingly data-filled world. There’s a greater need than ever for this kind of research. Thanks to the Internet and social media in particular, a tremendous amount of data is generated every second by millions of users. Every minute Instagram users post nearly 220,000 new photos, YouTube users upload 72 hours of video, and Facebook users share nearly 2.5 million pieces of content.
“It’s no secret that data’s getting bigger and bigger, and one way or another, we need to deal with that,” said Karbasi, who is also a faculty member at the Yale Institute for Network Science.
How we organize and make sense of all this information is an ongoing challenge in computing. In many cases, he said, we just discard the data. It’s easy, but not the best way to deal with it. Using data-driven algorithms and other techniques, such as summarization methods that find the right representative subset to get a clear picture of the whole set, Karbasi wants to find a better way. One approach is sampling.
“You have a huge amount of data and you want to sample the most important points,” said Karbasi, who was listed this year by the prestigious International Conference on Machine Learning as one the most prolific researchers in the field. “If you sample and find the most important points, you’re going to have a much smaller data set, but hopefully the quality is going to be similar.”
Much of Karbasi’s work involves finding the sweet spot between accuracy and speed in data searches. In each case, they need to figure out how comprehensive the list needs to be.
“Do you want to be exact or do you want to be fast? You can’t have both,” Karbasi said. “That’s the trade-off. In medical applications, you want to be really accurate, but in mundane mission-learning applications, deciding if this image is a cat or a dog — the stakes are lower — I can make mistakes once in a while. But if it’s a doctor trying to tell whether it’s a tumor or not, you have to be very, very careful.”
Once they assess the proper speed/accuracy ratio, Karbasi and his research team can develop the right methods for extracting the data. “Tell me how much wiggle room I have and I can tell you how fast I can compute,” he said.
Karbasi’s work is filled with high-level mathematics, seemingly endless equations and many abstract concepts. But the results play out in some of the most everyday ways — a better online system for making movie recommendations, for instance, or finding the best place to catch a ride. A recent study of his aimed to create the most efficient system for finding waiting locations for an Uber.
“If you’re in Manhattan, every centimeter, ever corner can be a pickup location,” he said. “The question is which corner should I pick?”
For this, a mathematical strategy known as discrete optimization, comes into play.
“You have 10 million data points in front of you but you want to choose only 100 of those,” he said. “If you want to find the representative points or intelligently summarize the data, you have to maximize these discrete functions. Our group has focused a lot on the discrete optimization problems — how fast and accurate we can compute them.”
A fully comprehensive algorithm that factors in every street corner would take a prohibitively long time to compute. But an algorithm that takes only a few seconds to provide results would likely be welcomed by consumers, even if it required them to to walk a block to catch their rides. Throughout the year, the best spots change — areas near ice rinks are popular in winter, for instance. A way to update the best locations each day would be invaluable to the company.
To find the optimum waiting spots for Uber drivers, Karbasi and his fellow researchers analyzed a dataset of 100,000 Uber pickups in Manhattan from April 2014 (just a fraction of potential pickup locations in Manhattan). They developed an algorithm that reduces the set of 100,000 to 30 spots representative of the larger set, and then chooses three different waiting locations within
And if you get antsy waiting for a ride, try completely mapping the neural connections in the human brain — an endeavor estimated to take some 14 billion years. That is (obviously) a long time, but identifying the topology of the brain’s network could tell us a lot about the physiological basis for how we process information. Fortunately, Karbasi is working on other less time-consuming methods to do so.
Working with researchers from the Ecole Polytechnique Federale de Lausanne in Switzerland, Karbasi helped develop an algorithm that scales to large datasets of recorded neural activities. By mathematically analyzing this mapping, the researchers could tell the conditions under which the algorithm successfully identified the type of synaptic connections within the available data.
Another brain-related study brought Karbasi in collaboration with two other researchers at Yale, Todd Constable and Dustin Scheinost in the departments of neurosurgery and radiology and biomedical imaging. The researchers analyzed the fMRI scans of more than 100 subjects from the Human Connectome Project, a five-year effort to create a network map of the human brain. Doing so allowed them to develop a method of analyzing the neuronal connections of individual brains that allow them to successfully predict the subjects’ IQs, their sex, and even tasks they were performing at the time of the brain scan.
The researchers focused on what’s known as voxels. Analogous to a pixel, a voxel is the lowest resolution achievable in the scans, and each can represent up to millions of neurons. Researchers cluster voxels into different areas called nodes or parcels, a process known as parcellating. A universal atlas of the brain has been developed through traditional methods of parcellating the brain, but these methods don’t factor the many inter-individual variations and the unique nature of the neural connections. Because a single functional atlas may not apply to all individuals or conditions, these variations are particularly important for patient and developmental studies.
“Traditional approaches to human brain parcellation collapse data from all the subjects in the group and then they cluster the average,” said Mehraveh Salehi, a Ph.D. candidate in the the labs of both Constable and Karbasi. “But we’ve shown that if you do this at the individual level, each individual has a different parcellation.”
To individualize the existing parcellations, the team used a method of summarizing large amounts of data known as exemplar-based clustering, which seeks the most representative elements of the data.
“If we account for those variations, we can build up better models from the functional connectivity analysis, and those models are better at predicting behaviors, such as IQ,” she said.
Karbasi said it was remarkable how much information they could get from the network of voxels.
“What was very fascinating was that the shape of the network tells a lot of stories,” Karbasi said. “For example, we can say whether this person in the scanner is a male or a female. It also tells us that these people are performing different types of tasks. It’s like reading the brain.”
He added that they’re just “scratching the surface” of the technology’s potential.
“Just imagine what we might do in 20 years if we can really read the brain, and understand what people are thinking,” Karbasi said. For example, he said, it could potentially lead to a better understanding of how the brain makes the transition from one emotional state to another and new treatments for depression.
In focusing on these problems, Karbasi’s group needs to factor in which platforms they’re designing their solutions. For instance, a typical home computer has a much more limited capacity than a company such as Google, which has massive computing power. So it makes sense that the Internet search giant has sought Karbasi’s expertise. His upcoming sabbatical will be spent doing research for the company, which recently awarded him with funding to help turn the tens of millions of data points into something manageable. One method Karbasi is looking at involves choosing elements from a particular dataset that fall into a category, but aren’t overly similar.
“We are trying to come up with algorithms that can do this kind of thing fast,” he said. “What we do is we represent every image by a data point, or a vector, and then we can define distances between the vectors.” He compares the data points to molecules of a gas — they’re far from each other, but fill the entire space.
To do this, Karbasi’s research team applied their method on a publicly available dataset, called “tiny images,” which contains 80 million images crawled from the web. “What we wanted to do was summarize this data — if you want to pick 100 images, which ones? We came up with algorithms that can do this very fast.”
They developed a distributed algorithm that chops the data into small pieces so that each piece can be performed on a single computer. “And then we merge the results, and do something intelligent with them,” he said.
Using classical algorithms would take an extremely long time to essentially perform the same task. Using his algorithms, the computers in his lab finished in only a few hours. Google — with all of its resources — might take only a few seconds.
Karbasi’s work has made him a sought-after expert. In addition to Google, the U.S. Air Force and Microsoft are among those to have also funded his research. He recently completed an online training program for the Defense Advanced Research Projects Agency (DARPA) that aims to create a better data-driven online education system that interacts with humans. Recruiting subjects from Amazon’s Mechanical Turk platform, the test trains users to distinguish between three types of woodpeckers.
As part of the test, users try to identify a specific bird and then receive another example based on that answer. The system monitors the responses and adjusts its teaching approach to the test taker’s learning style. While computer training programs often take a “one size fits all” approach, Karbasi’s program personalizes massive online courses. The general idea behind the test has applications well beyond wildlife.
“At DARPA, they need to know ‘Is this person OK, is this person not OK?’” Karbasi said. “It’s very hard for people to read and understand the cues, so they want an automated machine learning system that learns about humans.”
From our mundane tasks to critical medical procedures, data is becoming ever more present in our lives and serves as a common thread through seemingly disparate problems. Karbasi is among the researchers making the daunting amount of information a little more manageable.
“These are all very different applications, but at end of day, these are the same problems — and we can solve them,” he said.