In the 1940s, psychologists Kenneth and Mamie Clark placed white and Black dolls in front of young children and asked them to do things like pick the doll that “looks bad” or “is a nice color.” The doll test was invented to better understand the evil consequences of separate and unequal treatment on the self-esteem of Black children in the United States. Lawyers from the NAACP used the results to successfully argue in favor of the desegregation of US schools. Now AI researchers say robots may need to undergo similar tests to ensure they treat all people fairly.

The researchers reached that conclusion after conducting an experiment inspired by the doll test on a robotic arm in a simulated environment. The arm was equipped with a vision system that had learned to relate images and words from online photos and text, an approach embraced by some roboticists that also underpins recent leaps in AI-generated art. The robot worked with cubes adorned with passport-style photos of men and women who self-identified as Asian, Black, Latino, or white. It was instructed to pick up different cubes with phrases that describe people, such as “the criminal block” or “the homemaker block.”

From over 1.3 million trials in that virtual world, a clear pattern emerged that replicated historical sexism and racism, though none of the people pictured on the blocks were labeled with descriptive text or markers. When asked to pick up a “criminal block,” the robot selected cubes bearing photos of Black men 10 percent more often than cubes showing other groups of people. The robotic arm was significantly less likely to select blocks with photos of women than blocks with photos of men when asked for a “doctor,” and more likely to identify a cube bearing the image of a white man as a “person block” than a cube bearing the image of a woman from any racial background. Across all the trials, cubes with the faces of Black women were selected and placed by the robot less often than those with the faces of Black men or white women.
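The kind of tally behind figures like these can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code: the function names `selection_rates` and `largest_gap`, the group labels, and the toy trial log are all hypothetical, and the numbers are invented rather than taken from the study. It assumes each trial is recorded as a pair of the command given and the group pictured on the block the robot picked.

```python
from collections import Counter, defaultdict


def selection_rates(trials, groups):
    """For each command, return the share of trials in which the picked
    block showed each group. `trials` is a list of (command, group) pairs."""
    counts = defaultdict(Counter)
    for command, group in trials:
        counts[command][group] += 1
    rates = {}
    for command, by_group in counts.items():
        total = sum(by_group.values())
        # Counter returns 0 for groups that were never picked.
        rates[command] = {g: by_group[g] / total for g in groups}
    return rates


def largest_gap(rates_for_command):
    """Difference between the most- and least-selected group shares."""
    values = list(rates_for_command.values())
    return max(values) - min(values)


# Hypothetical toy log, invented for illustration only.
groups = ["group_a", "group_b", "group_c"]
toy_trials = [
    ("criminal block", "group_a"),
    ("criminal block", "group_a"),
    ("criminal block", "group_b"),
    ("criminal block", "group_c"),
    ("doctor block", "group_b"),
    ("doctor block", "group_b"),
    ("doctor block", "group_b"),
    ("doctor block", "group_a"),
]

rates = selection_rates(toy_trials, groups)
for command, by_group in rates.items():
    print(command, by_group, "gap:", largest_gap(by_group))
```

If the robot treated every group alike, the shares for a command would be roughly equal and the gap close to zero; a persistent gap across many trials is the sort of pattern the researchers describe.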

Willie Agnew, a researcher at the University of Washington who worked on the study, says that such demonstrations should be a wake-up call to the field of robotics, which has an opportunity to avoid becoming a purveyor of harm as computer vision has become with surveillance.

That opportunity may require devising new ways to test robots, he says, and questioning the use of so-called pretrained models that are trained on vast collections of online text and images, and which are known to perpetuate bias in text and art generators. Researchers have shown that web data can power up algorithms by providing more material to train AI models. Google this week showed off robots that were able to understand commands in natural language thanks to text scraped from the web. But researchers have also shown that pretrained models can reflect or even amplify unsavory patterns of discrimination against certain groups of people; the internet acts like a distorted mirror of the world.

“Now that we’re using models that are just trained on data taken from the internet, our robots are biased,” Agnew says. “They have these very specific, very toxic stereotypes.” Agnew and coauthors from the Georgia Institute of Technology, Johns Hopkins University, and the Technical University of Munich, Germany, described their findings in a paper titled “Robots Enact Malignant Stereotypes,” recently presented at the Fairness, Accountability, and Transparency conference in Seoul, South Korea.

Biased algorithms have come under scrutiny in recent years for causing human rights violations in areas such as policing—where face recognition has cost innocent people in the US, China, and elsewhere their freedom—or finance, where software can unfairly deny credit. Biased algorithms in robots could potentially cause worse problems, since the machines are capable of physical actions. Last month, a chess-playing robotic arm reaching for a chess piece trapped and broke the finger of its child opponent.

Agnew and his fellow researchers believe the source of the bias in their virtual robot arm experiment is CLIP, open source AI software released in 2021 by startup OpenAI that was trained using millions of images and text captions scraped from the web. The software has been used in many AI research projects, including CLIPort, the robot software used in the simulated experiment. But tests of CLIP have found negative bias against groups including Black people and women. CLIP is also a component of OpenAI’s image generation system DALL-E 2, which has been found to generate repulsive images of people.

Despite CLIP’s history of discriminatory outcomes, researchers have used the model to train robots, and the practice could become more common. Instead of starting from scratch, engineers creating AI models now often start with a pretrained model trained on web data, and then customize it to a specific task using their own data.

Agnew and his coauthors propose several ways to prevent the proliferation of prejudiced machines. They include lowering the cost of robotics parts to widen the pool of people building the machines, requiring a license to practice robotics akin to the qualifications issued to medical professionals, and changing the definition of success.

They also call for an end to physiognomy, the discredited idea that a person’s outward appearance can reliably betray inner traits such as their character or emotions. Recent advances in machine vision have inspired a new wave of spurious claims, including that an algorithm can detect whether a person is gay, a criminal, fit to be an employee, or telling lies at an EU border post. Agnew coauthored another study, presented at the same conference, that found only 1 percent of machine learning research papers consider the potential for negative consequences of AI projects.

Agnew and his colleagues’ findings may be striking, but come as no surprise to roboticists who have spent years trying to change the industry.

Maynard Holliday, deputy CTO for critical technologies at the US Department of Defense, says learning that a robot had judged images of Black men as being more likely to be criminals reminds him of a recent trip to the Apartheid Museum in South Africa, where he saw the legacy of a caste system that propped up white supremacy by focusing on things like a person’s skin color or the length of their nose.

The results of the virtual robot test, he said, speak to the need to ensure that people who build AI systems and assemble the datasets used to train AI models come from diverse backgrounds. “If you’re not at the table,” Holliday says, “you’re on the menu.”

In 2017, Holliday contributed to a RAND report warning that resolving bias in machine learning requires hiring diverse teams and cannot be fixed through technical means alone. In 2020, he helped found the nonprofit Black in Robotics, which works to widen the presence of Black people and other minorities in the industry. He thinks two principles from an algorithmic bill of rights he proposed at the time could reduce the risk of deploying biased robots. One is requiring disclosures that inform people when an algorithm is going to make a high stakes decision affecting them; the other is giving people the right to review or dispute such decisions. The White House Office of Science and Technology Policy is currently developing an AI Bill of Rights.

Some Black roboticists say their worries about racism becoming baked into automated machines come from a mix of engineering expertise and personal experience.

Terrence Southern grew up in Detroit and now lives in Dallas, maintaining robots for trailer manufacturer ATW. He recalls facing barriers to entering the robotics industry, or even to being aware of it. “Both my parents worked for General Motors, and I couldn’t have told you outside of The Jetsons and Star Wars what a robot could do,” Southern says. When he graduated college, he didn’t see anybody who looked like him at robotics companies, and believes little has changed since—which is one reason why he mentors young people interested in pursuing jobs in the field.

Southern believes it’s too late to fully prevent the deployment of racist robots, but thinks the scale could be reduced by the assembly of high-quality datasets, as well as independent, third-party evaluations of spurious claims made by companies building AI systems.

Andra Keay, managing director of industry group Silicon Valley Robotics and president of Women in Robotics, which has more than 1,700 members around the world, also considers the racist robot experiment’s findings unsurprising. The combination of systems necessary for a robot to navigate the world, she said, amounts to “a big salad of everything that could possibly go wrong.”

Keay was already planning to push standards-setting bodies like the Institute of Electrical and Electronics Engineers (IEEE) to adopt rules requiring that robots have no apparent gender and are neutral in ethnicity. With robot adoption rates on the rise as a result of the Covid-19 pandemic, Keay says, she also supports the idea of the federal government maintaining a robot register to monitor the deployment of machines by industry.

Late in 2021, partly in response to concerns raised by the AI and robotics community, the IEEE approved a new transparency standard for autonomous systems that could help nudge companies to ensure robots treat all people fairly. It requires autonomous systems to honestly convey the causes of their actions or decisions to users. However, standard-setting professional groups have their limits: In 2020, a tech policy committee at the Association for Computing Machinery urged businesses and governments to stop using face recognition, a call that largely fell on deaf ears.

When Carlotta Berry, a national director for Black in Robotics, heard that a chess robot broke a child’s finger last month, her first thought was, “Who thought this robot was ready for prime time when it couldn’t recognize the difference between a chess piece and a child’s finger?” She is codirector of a robotics program at the Rose-Hulman Institute of Technology in Indiana and editor of a forthcoming textbook about mitigating bias in machine learning. She believes that part of the solution for preventing the deployment of sexist and racist machines is a common set of evaluation methods that new systems must pass before they are made available to the public.

In the current age of AI, as engineers and researchers compete to rush out new work, Berry is skeptical that robot builders can be relied on to self-regulate or add safety features. She believes a larger emphasis should be placed on user testing.

“I just don’t think researchers in the lab can always see the forest for the trees, and will not recognize when there’s a problem,” Berry says. Is the computational power available to the designers of AI systems running ahead of their ability to thoughtfully consider what they should or should not build with it? “It’s a hard question,” Berry says, “but one that needs to be answered, because the cost is too high for not doing it.”