For self-driving-car developers, like many iPhone and Google Photos users, the growing cost of storing files on the cloud has become a nagging headache.

Early on, robocar companies pursued a brute-force approach to maximize miles and data. “We could take all the data the cars have seen over time, the hundreds of thousands of pedestrians, cyclists, and vehicles, [and] take from that a model of how we expect them to move,” said Chris Urmson, an early leader of Google’s self-driving project, in a 2015 TED Talk.

Urmson spoke at a time when autonomous vehicle prototypes were relatively few and the handful of companies testing them could afford to keep almost every data point they scooped up from the road. But nearly a decade later, Google’s project and many others have fallen far behind their own predictions of the timeline for success. Growing fleets, fancier sensors, and tighter budgets are forcing companies working on robotaxi and robofreight services to get pickier about what stays on their servers. 

The newfound restraint is a sign of maturity for an industry that has begun moving people and goods without drivers in a few cities when the weather’s good and streets are relatively clear, but has yet to generate profits. Figuring out which data to keep and which to discard could be key to expanding service to more locations as companies train their technology on the nuances of new areas.

“Having tons and tons more data is valuable to some extent,” says Andrew Chatham, who oversees the computing infrastructure at the Google driverless tech spinout Waymo. “But at some point, having more interesting data is important.” Rivals including Aurora, Cruise, Motional, and TuSimple are also keeping closer watch on their data stores.

The trend could spread at a time when driverless projects are facing pressure to control spending after years of losses. Companies ranging from General Motors, which owns robotaxi service Cruise, to Waymo-owner Alphabet are in the midst of wide-ranging cost-cutting this year—including mass layoffs—as sales in core businesses slow due to a shaky economy. Meanwhile, cheap and easy funding is drying up for autonomous vehicle startups.

Naturally, all spending is under scrutiny. Amazon Web Services charges about 2 cents per gigabyte monthly for its popular S3 cloud storage service, a price that adds up quickly on data-intensive projects, and doubles in some cases when factoring in bandwidth costs to transfer data. Intel estimated in 2016 that each autonomous vehicle would generate 4,000 gigabytes of data per day, a volume that would cost about $350,000 to store for a year at Amazon’s current prices.
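The arithmetic behind that estimate can be reproduced in a few lines (a rough sketch; the rate and billing model are simplified assumptions, and bandwidth charges are ignored):

```python
GB_PER_DAY = 4_000          # Intel's 2016 per-vehicle estimate
PRICE_PER_GB_MONTH = 0.02   # approximate S3 standard rate, USD
MONTHS = 12

# One year of driving yields ~1.46 petabytes per vehicle; holding
# all of it in standard storage for twelve months costs roughly
# the figure cited above.
year_of_data_gb = GB_PER_DAY * 365
annual_cost = year_of_data_gb * PRICE_PER_GB_MONTH * MONTHS
print(f"${annual_cost:,.0f}")  # → $350,400
```

Even at pennies per gigabyte, the petabyte scale of a single vehicle's output is what drives the bill, and a fleet multiplies it further.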

Chucking data might sound perverse for the tech industry. Companies like Google and Meta have long been ridiculed and even penalized for collecting everything they can—including users’ locations, clicks, and searches—with the idea that greater understanding of behavior leads to better designed services. The mantra created a culture of collecting data without any clear application. For instance, Google CEO Sundar Pichai acknowledged in 2019 that only “a small subset of data helps serve ads.”

Self-driving-car developers initially held a similar philosophy of data maximization. They generate video from arrays of cameras inside and outside the vehicles, audio recordings from microphones, point clouds mapping objects in space from lidar and radar, diagnostic readings from vehicle parts, GPS readings, and much more.

Some assumed that the more data collected, the smarter the self-driving system could get, says Brady Wang, who studies automotive technologies at market researcher Counterpoint. But the approach didn’t always work because the volume and complexity of the data made them difficult to organize and understand, Wang says.

In more recent years, companies have started holding on to only data believed to be specifically useful, and have also focused on organizing them well. Practically speaking, data from driving on a sunny day in the desert for an hour might start looking repetitive, so the utility of keeping them all has come into question.

Limits aren’t entirely new. Chatham, the distinguished software engineer at Waymo, says getting access to more digital storage wasn’t simple when the company was a tiny project inside Google over a decade ago and he was a one-person team. Data that had no clear use was deleted, like recordings of failed driverless maneuvers. “If we treated storage as infinite, the costs would be astronomical,” Chatham says.

After Waymo became an independent company with significant outside investment, the project gobbled data storage more freely. For instance, when Waymo started testing the Jaguar I-Pace in late 2019, the crossover SUV came with more powerful sensors that generated a bigger stream of information—to the point that full logs for an hour’s driving equated to more than 1,100 gigabytes, enough to fill 240 DVDs. Waymo increased its storage capacity significantly at the time, and teams got less picky about what they kept, Chatham says.

More recently, Chatham’s team began setting strict quotas and asking people across the company to be more judicious. Waymo now keeps only some of its newly generated data and more recently began deleting saved data as it becomes outdated compared to current technology, conditions, and priorities. Chatham says that strategy is working well. “We have to start discarding data fast as our service grows,” he says.

Waymo carried paying passengers more than 23,000 miles in California between September and November of last year, up from about 13,000 miles over a similar timeframe just six months earlier, according to disclosures to state regulators.

In some cases, data caps have come to reflect the priorities of autonomous vehicle companies. With some negotiation allowed, Chatham’s team allots quarterly storage allowances to groups of engineers working on different tasks, such as developing AI to identify what’s around a vehicle (perception) or testing planned software updates against past rides (evaluation). Those teams decide what’s worth keeping—say, data on the actions of emergency vehicles—and an automated system filters out everything else. “That becomes a business decision,” Chatham says. “Is snow or rain data more important to the business?”
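The process Chatham describes—per-team allowances plus an automated filter—might look something like this in miniature (a hypothetical sketch; team names, sizes, and the filtering rule are all illustrative, not Waymo's actual system):

```python
# Quarterly storage allowances per engineering group, in terabytes.
# Figures are made up for illustration.
quotas_tb = {"perception": 500, "evaluation": 300}
used_tb = {"perception": 0, "evaluation": 0}

def try_keep(team: str, size_tb: float, interesting: bool) -> bool:
    """Keep a log segment only if the team flagged it as interesting
    and the team's quarterly quota has room for it."""
    if not interesting:
        return False                       # automated filter drops the rest
    if used_tb[team] + size_tb > quotas_tb[team]:
        return False                       # over quota: a business decision
    used_tb[team] += size_tb
    return True

kept = try_keep("perception", 10, interesting=True)      # emergency vehicles
dropped = try_keep("evaluation", 5, interesting=False)   # routine driving
print(kept, dropped)  # → True False
```

The point of such a design is that the hard call—what counts as "interesting"—is pushed to the teams closest to the data, while enforcement stays mechanical.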

Snow has won out for now, because Waymo so far has only limited data from driving in it. “We’re keeping every piece,” Chatham says. Rain has gotten less interesting. “We’ve gotten better at rain, so we don’t need to go to infinity.” Being data-thrifty can sometimes prompt creativity or valuable discoveries, he says. Waymo learned at one point that its rain data needlessly included all the sensor readings its cars had collected while parked.

Across self-driving projects, data from busier, crazier times has the best chance of surviving: “rare objects and unusual scenarios, such as obstacles in the roadway or cyclists with surfboards,” says Balajee Kannan, vice president of autonomy at driverless tech maker Motional, a joint venture between Hyundai and automotive supplier Aptiv.

The quickly growing Cruise has said that less than 1 percent of the data it generates from driving in San Francisco contains what its teams view as useful information, so it, too, no longer stores everything. Its autonomous Chevy Bolt cars drove paying passengers over 13,000 miles in the city last fall, compared with 3,400 miles when it kicked off service during the summer. With its deployment growing, Cruise is working on improvements to its data storage systems that make it easier and more affordable to expand service, though spokesperson Rachel Holm declines to share details.

Deletion isn’t the only solution. Moving data to “cold” storage, which at AWS costs as little as one-tenth of a cent per gigabyte per month, can also cut costs, but cold data can be retrieved only slowly, limiting its usefulness.

Aurora, which is testing driverless trucks on freeways in Texas, uses an automated system to sort the terabytes of data generated by driving about 50 loads per week for pilot customers across the state. Engineers flag crucial data, such as recent incidents involving dangerous road debris or aggressive drivers, to ensure it is saved in regular storage. Anything unprotected or unused is automatically put on a death watch, moving to successively colder storage every month until, after three months, a substantial amount starts getting deleted. Measurements calculated from the raw data are the only bits kept.
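The "death watch" lifecycle described above—flagged data protected, everything else demoted monthly until deletion—can be sketched as a simple state machine (a hypothetical illustration; the tier names and timings are assumptions modeled on the description, not Aurora's actual system):

```python
from dataclasses import dataclass

# Successively cheaper, slower storage tiers; purely illustrative names.
TIERS = ["hot", "warm", "cold"]

@dataclass
class LogSegment:
    age_months: int = 0
    flagged: bool = False      # engineers flag crucial data to protect it
    tier: str = "hot"
    deleted: bool = False
    metrics_kept: bool = True  # derived measurements survive deletion

def age_one_month(seg: LogSegment) -> None:
    """Advance a segment one month through the aging policy."""
    if seg.deleted or seg.flagged:
        return                 # protected data stays in regular storage
    seg.age_months += 1
    if seg.age_months >= 3:
        seg.deleted = True     # raw logs go; only derived metrics remain
        seg.tier = "none"
    else:
        seg.tier = TIERS[seg.age_months]

seg = LogSegment()
for _ in range(3):             # three months on the death watch
    age_one_month(seg)
print(seg.tier, seg.deleted, seg.metrics_kept)  # → none True True
```

The appeal of this kind of policy is that it requires no per-file decisions after the initial flagging: unprotected data drifts toward deletion by default, and keeping something costs an explicit action.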

“It’s like trimming your fingernails,” says Tim Kelton, who runs Aurora’s infrastructure. “You have to do it every week. It’s not something you can ignore.” The company also ditches data from sessions when its technology is driving really well or running on outdated sensors, because there’s less to learn from. Overall, only about 15 percent of Aurora’s data are in its most accessible tier of storage.

Not everyone is at their limits just yet. TuSimple, another driverless trucking company, has collected, compressed, cataloged, and stored all the data from each of the tens of thousands of drives since its founding in 2015. But the company, which conducted its first driverless route in December 2021, is keeping an eye on its 50 petabytes of capacity, and moves most data to cold storage after four years, says Robert Rossi, its vice president of operations. 

AI software that can extract valuable data from compressed files could eventually help companies keep more logs without breaking the data bank, says Weisong Shi, a computer scientist at the University of Delaware who has worked with automakers to cut data storage and transmission.

But he points out that if Waymo and its competitors finally manage to reach wide deployment, with large fleets of vehicles, they’ll have to junk a lot more data. “Once you go into mass production, cost will be a big deal,” Shi says. “We haven’t reached the point where we desperately need more storage, but this day will be coming soon.”