The CREW platform helps humans train AI agents to become better at tasks while researchers study how best to optimize this type of training and human-AI partnership. (Image: The researchers)
Tech Briefs: What got you started on this project?

Boyuan Chen: Advanced AI technologies have become so popular that almost any random person you grab on the street is either a heavy user of AI, an AI developer, or at least has heard a lot about it. Although most people engage with AI in some form, it's not clear how we can best interact with AI agents or how interacting with AI will affect humans. This triggered us to think: practically speaking, can we design a system that studies how humans and AI interact, and then use it to study how to improve AI systems to better augment human intelligence?

Philosophically, I’ve always been interested in understanding and creating intelligence. Being a researcher in robotics and AI is unique in that it’s a field where you can actually create artificial intelligence. So, that directed me to this project, thinking about the role of humans and the role of AI in the road map for the future development of artificial intelligence. The paradigm I would advocate for is human-AI teaming. Humans and AI will be teammates. The final form of super-intelligence will be neither AI alone nor humans alone; it will be the collective intelligence of humans and AI together. Teaming is a very old phenomenon among human beings, and even animals: we always team with others.

Tech Briefs: Isn’t AI just a tool for humans to use?

Chen: I don’t simply see AI as a tool, because tools are passive. In the long term, I see AI becoming almost a new species that will surpass any tool we have been using. When you use tools, they don't interact with you, and they don't generate new things. They basically follow a strict pipeline: you feed in certain inputs, and they generate certain outputs.

Computers, in some ways, are tools. For example, you type things into a Word document and the words will be there. A large language model like ChatGPT is, underneath, a static model hosted on the cloud; it doesn't change itself, yet it acts as an AI agent. It reacts very differently when used by different humans, which is a chain effect of interaction. The first sentence I say to an AI shapes how I will interact with it in the future, and that becomes a chain of interactions. There are also individual differences among the different large language models. They each have their own biases, something like a social identity; and of course, humans have all kinds of biases too. AI agents are biased because of the way they're designed and the training data they have seen. They also develop biases based on how different users interact with them.

I think these are the critical reasons we need to study future AI systems as human-AI teaming. I'm not saying this because I want to put AI on the same level as humans. Humans are strong in many respects but not so good at many things, and AI suffers in many respects as well and probably has fundamental limitations that will not be easy to resolve.

But in many aspects AI is already much better than humans: for example, precise manipulation of numbers, vast memory, and faster reading and writing of information. AI systems learn a lot from data. A human can take an hour to read a paper, while an AI can read it in a second. So, I think the best of both worlds comes when you combine the two. Our perspective is that you have to interact with the AIs, but how we interact with them is the key.

Tech Briefs: Could you tell me something about how your platform works: what it does and what it actually is?

Chen: We're deeply into this human-AI teaming problem, but we're not alone. Researchers from many different communities and areas, not only computer scientists, are focusing on this problem. There are lots of researchers from neuroscience, cognitive science, ecology, social science, network science, security, privacy — all of these sub-domains of science, engineering, and technology are excited about it.

But the issue, and where I struggled a lot a few years back even though I was very excited about this problem, is that there was no platform I could use to study human-AI teaming. Let’s say I'm a researcher who studies humans and I want to focus on how the individual differences among humans lead to differences in the AI agents they interact with. There's been no tool I could use to do that. Or say I'm a neuroscientist and I want to measure human physiological responses, for example brain waves or eye-tracking data, the signals neuroscientists use to investigate how humans change physiologically while interacting with AI; I don't have a tool for that either.

There are lots of tools you can use to measure a single human, and some state-of-the-art tools in neuroscience can measure two or three humans together as a team, but there’s no way you can easily inject an AI agent in there. It just doesn't exist as a tool. And every group of researchers uses their own methods. That triggers a lot of issues: How do you reproduce results? How do you compare with each other? How do we even know we're making measurable progress?

In fields like neuroscience, cognitive science, or psychology, many Ph.D. students spend their first two years building a platform, another year collecting data, and then the next few years analyzing the data and publishing manuscripts.

If you’re an AI researcher, you know a lot about how to design a simulation environment. You know how to train machine-learning agents, say to play games, but there was no platform that enabled a human to either control the AI agents directly or provide feedback about how the AI agent is working. A platform that supports both human researchers and AI researchers collaborating to study different aspects of human-AI teams didn’t exist.

Our platform, CREW, is the first of its kind that provides the capability to support research from all the areas that I mentioned. We built this platform to support multiple different areas of research and, in fact, multiple universities and government agencies have already been using it for their active research on different aspects of human-AI teaming.

A second capability of CREW touches on the problem of scale. It’s natural to think about studying one human and one AI interacting with each other. But it's not easy to know the collective effects: what happens, for example, if you have 50 humans and 50 AI agents all coexisting in the same working environment? What the final effect is as the number of agents increases, especially when different agents are running different algorithms, is not clear.

Our CREW platform can support dozens, even hundreds, of AI agents all being online, and not necessarily with the same human. They don't even have to be the same AI, they can be different AI agents that are trained by different companies, or different research labs. All coexist in this ecosystem. We think of it almost like a mini world that hosts dozens to hundreds of AIs and humans.

CREW unifies the interface for many different researchers to use, and it also supports scalable research. One of the most important lessons we learned about this new wave of AI research is that scale matters a lot. When you go to any AI research talk by the field leaders, you almost always hear this one question from the audience: “How well can this method scale? Will this scale to more data, more humans, more AI agents?”

As a concrete example, imagine you have a platform that can host dozens to hundreds of humans and AI agents. You can customize the environment you're interested in. For example, you might be interested in a 2D Atari game as the testbed; the groundbreaking DeepMind Nature paper in 2015 reported achieving superhuman performance playing Atari games. You can use that environment, and you can use all the infrastructure we provide to study that problem.

If you're interested in robotic tasks, say studying how a robot navigates a complex environment to search for objects in a search-and-rescue operation, we provide infrastructure to create that environment. That infrastructure will enable you to design your own task, design your own agents, and focus on the specific parts you're interested in.

Furthermore, we have an interface for you to record human physiological signals, and we also have an interface to record the signals and decision-making processes from the AI. That really enables you to focus on your own area without spending time building everything yourself. The impact is real: even in my lab, just with that platform, we've already built a few new projects within months.
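
To make the idea of those recording interfaces concrete, here is a minimal sketch in Python of logging human physiological signals and an agent's decisions on a shared clock so they can be aligned later. This is an illustration only, not CREW's actual API; the function names, fields, and signal sources are hypothetical placeholders.

```python
import time
import json
import random
from dataclasses import dataclass, asdict

@dataclass
class TimestampedRecord:
    """One synchronized sample: elapsed time, human signal, and agent decision."""
    t: float
    eye_gaze_xy: tuple            # stand-in for an eye-tracker sample
    agent_action: int             # stand-in for the agent's chosen action
    agent_value_estimate: float   # stand-in for the agent's internal decision signal

def read_eye_tracker():
    # Placeholder: a real study would read from eye-tracking hardware here.
    return (random.random(), random.random())

def step_agent():
    # Placeholder: a real agent would return its action and any internal signals.
    return random.randrange(4), random.uniform(-1.0, 1.0)

def record_session(duration_s=2.0, hz=10, path="session_log.jsonl"):
    """Log human and agent streams on one shared clock so they can be aligned later."""
    records = []
    t_start = time.time()
    while time.time() - t_start < duration_s:
        action, value = step_agent()
        records.append(TimestampedRecord(
            t=time.time() - t_start,
            eye_gaze_xy=read_eye_tracker(),
            agent_action=action,
            agent_value_estimate=value,
        ))
        time.sleep(1.0 / hz)
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(asdict(rec)) + "\n")
    return records

if __name__ == "__main__":
    print(f"Recorded {len(record_session())} synchronized samples.")
```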

Another example is a human-subject study in which we were able to run 50 human subjects within just a week on this platform. The previous state of the art was fewer than 10 humans, and if you go to a top machine-learning conference, most studies use 3 to 5 human subjects, most of whom are the authors of the paper. Those are the people who created the algorithm and know how it works, and they conclude: “Our algorithm performs best.” But when we try to reproduce the results, most of them don’t work as expected, and since most are not open-sourced, we don't know how they were actually implemented. We have to go to their paper, implement their algorithm ourselves, add it to our library, and create comparisons.

Our goal is to design a human-AI teaming system that works for all of us, not just for a select group of experts who design the algorithms. The goal is to remove the disparities in how well AI works across different individuals: rather than requiring people to adapt, we will enable AI to adapt to all kinds of different human individuals.

Tech Briefs: Well, that very point is what I'm unsure about. If you have 100 different people, how do you generalize when each of those people is different? How do you get a general answer from a particular group of people? People vary, they're unpredictable, and even one person can respond differently from one day to the next depending on whether they have a stomachache that day.

Chen: Exactly, that's the super question. First of all, we don't have an answer for that, and nobody knows how to do it. But wouldn't it be great if we had adaptive systems that could adapt to different humans and give the best outcomes for each? And this is the big change, because technology advances have been great, but if you observe all the great advances in technology, we have been the ones who adapted to it. We have to learn how to use these technologies; we have to find a particular pattern to adapt our own minds to them.

I think the opportunity here with AI is unique because it is an adaptive system. So, how do we build that adaptive AI system so that it can also adapt to humans? And if we can’t, I think the middle ground is co-adaptation between humans and technology. Most tools, for example computers and the Internet, don't adapt to us — we adapt to them. So, this is where AI comes in — the whole promise of AI systems is that they will adapt.

But, I can confidently say we are on the path to answering that question. The first step in designing a system that can adapt to individual humans is to understand the differences between humans when they interact with AI. That's why this platform includes a series of cognitive studies, for example measuring human response time, spatial reasoning capability, theory of mind capability for predicting agents’ behavior, and so on.

Through this series of cognitive capability measurements, we noticed that some capabilities were especially important. It turns out that AI trained and guided by people who are strong in those capabilities achieves higher performance in a particular task setting. To create something, the first step is to understand it. We want to understand the differences between humans and which differences lead to different effects on the AI system they're using. The next step is to create a system based on this understanding: we can design a system that adapts to those differences.
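
As a rough illustration of the kind of analysis this involves, the sketch below correlates each participant's cognitive-test scores with the performance of the agent that participant guided. The data, variable names, and scores are invented for illustration; they are not the study's actual measurements or results.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: one row per participant.
# Each participant took cognitive tests, then guided an agent; we record the
# final task score of the agent they trained.
participants = [
    # (reaction_time_ms, spatial_reasoning_score, trained_agent_score)
    (210, 0.82, 0.91),
    (260, 0.74, 0.78),
    (300, 0.55, 0.60),
    (240, 0.90, 0.88),
    (280, 0.60, 0.65),
]

reaction = [p[0] for p in participants]
spatial  = [p[1] for p in participants]
agent    = [p[2] for p in participants]

print("reaction time vs. agent score:", round(pearson_r(reaction, agent), 2))
print("spatial reasoning vs. agent score:", round(pearson_r(spatial, agent), 2))
```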

Tech Briefs: Could you give me an example of a use case?

Chen: For sure. An example comes from our own follow-up research. We used the CREW platform to conduct experiments and design an algorithm we call GUIDE. It's an algorithm that enables a human to watch an AI agent learn and provide guidance when needed. For example, if the AI agent is learning to play Hide-and-Seek, or playing a game where it must search for an object in an unknown environment, it needs to explore the map and then figure out where that object is from the camera video feed. If the AI were practicing on its own, it would only get feedback from the environment when the task is completed. When the task is not completed, when the object has not yet been found, it gets no feedback because it doesn’t know how far away it is from the target.

Our GUIDE algorithm, however, enables a human, while watching the AI practice, to provide feedback on its performance. There is a panel where the human moves a mouse up and down to indicate good or bad behavior. If you see the agent get stuck, you move your mouse down. If you see it exploring the map, which is something it should do before it sees the target object, you give it a higher reward. The human watches the AI and gives continuous feedback about good and bad behavior, just the way we train pets to learn new skills. When we used this algorithm to train the AI agents, it drastically improved their performance after only 10 minutes of human feedback, on a task that would typically require millions of training samples with AI alone. GUIDE improves the efficiency of how the AI learns: it learns faster with less data and fewer interactions with the simulation.
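
A minimal sketch of the core idea, assuming a simple reward-shaping scheme: the human's continuous feedback is added to the sparse environment reward, so the learner hears something on every step instead of only at task completion. This illustrates the general technique, not the published GUIDE implementation; the weighting and function names here are assumptions.

```python
import random

def env_reward(task_completed):
    """Sparse environment reward: the agent only hears back when it finds the object."""
    return 1.0 if task_completed else 0.0

def human_feedback():
    """Stand-in for the mouse panel: a continuous value in [-1, 1].
    In a real study this would come from the human watching the agent."""
    return random.uniform(-1.0, 1.0)

def shaped_reward(task_completed, feedback_weight=0.1):
    """Combine the sparse task reward with the human's continuous guidance signal."""
    return env_reward(task_completed) + feedback_weight * human_feedback()

# Toy training loop: the dense human signal gives the learner something to follow
# on every step, instead of waiting for the rare completion reward.
total = 0.0
for step in range(100):
    found_object = (step == 99)   # pretend the object is found on the last step
    total += shaped_reward(found_object)
print("accumulated shaped reward:", round(total, 2))
```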

None of the 50 human subjects were experts in AI or machine learning, and they knew nothing about the algorithm’s design, yet the overall results showed that they could still enable the AI to learn much faster.

When we looked into individual differences among the humans, we saw that some of them trained better-performing AI than others. We also did a study to understand the significance of the differences among these humans and to determine which capabilities were important. For example, for certain tasks, how fast you can react to a change in the environment matters more; in other tasks, your spatial navigation capability matters more.

Tech Briefs: I'm not sure how you can train the human-AI team on a simple video game. How do you take that and use it to train human-AI interaction on a practical real-world task?

Chen: I have two answers for that. First, the majority of the progress made in AI was actually developed on these kinds of seemingly simple computer games. One example is AlphaGo, which beat human Go players. Game-playing AI was developed first on chess and then eventually on Go and poker.

Those exact same algorithms have now been used for traffic planning, for energy management, and for the recent large language models. ChatGPT, DeepSeek, all of these large language models rely on these same algorithms. So, it's convenient for us too: it's really hard and very risky to, say, manipulate a building’s energy system in order to develop a new algorithm. When we develop algorithms, it's much safer to use something that is very difficult even for humans; most people cannot be pro players of computer games, of Go, or of poker. If a task is very hard for humans, developing AI on that task has repeatedly been shown to be an effective path to the best outcome. And the best algorithms on these tasks are usually also among the best algorithms that are actually useful in the real world; it's the exact same algorithm.

The underlying reason is that these computer games, although seemingly simple, reflect the fundamental challenges in real-world tasks. For example, the Hide-and-Seek computer game is similar to, say, a wildfire response. If you think about it, what are the challenges of wildfire response? Resource management; knowing how many agents are available; how the fire is spreading; how we perceive the environment with the sensors we have; how we coordinate many different humans, firefighters, tools, and helicopters, all accomplishing the same task: to put out the fire. These are the exact same things, if you think about it, in most computer games. You manage resources, coordinate multiple agents, and perceive the environment you have to explore with sensors.

So that's one thing: we have learned that if you make progress on computer games, you will eventually make progress in the real world. The second answer is that, since we're a robotics lab, we care a lot about real-world problems. We're now working to design a new interface to extend CREW to more realistic tasks that humans care about and to demonstrate that this will actually be helpful for them.

We have already done some work on physical robots performing search-and-rescue tests. It's the Hide-and-Seek algorithm transferred directly to the physical world. We have eight ground-vehicle robots that can be coordinated for search and rescue, and we're using CREW as the platform to develop the algorithm, which is then transferred directly to the physical robots. We call this algorithm HUMAC.

So, stay tuned and we'll have a lot more results to share — hopefully in the coming year.

Tech Briefs: How do you incorporate psychology and social science?

Chen: I'm a huge fan of human intelligence — I see this as my lifelong mission, understanding and creating intelligence.

Human intelligence is a super-critical aspect of the things we look at because it is what we see as the most advanced form of intelligence. If we're going to build intelligence, we're going to need to understand it. In our research we refer a lot to neuroscience, psychology, cognitive science, and social science. For example, a portion of my Ph.D. work focused on enabling robots to have theory of mind capability. This is a cognitive capability that humans develop at an early age, when children start to be able to predict the behavior of other humans. It is the key to social intelligence: being able to understand that other people may see the world differently from yourself. This is why Hide-and-Seek is so important: it turns out to be one of the most essential and classical tasks that scientists use to study human intelligence, especially social intelligence and multi-agent collaboration. If you play Hide-and-Seek with children before they have theory of mind, they will cover their own eyes because they think, “If I cannot see you, then you cannot see me.” Once they have developed theory of mind, they are able to play Hide-and-Seek by hiding behind objects, because they have learned that just because they cannot see you does not mean you cannot see them.

The goal for CREW is to enable researchers from different fields to collaborate. We have collaborators from neuroscience, psychology, and social science already joining our efforts. So, we're not simply just doing this from an engineering perspective or a computer science perspective, we're looking at this as a holistic research domain that requires collaboration from many different areas.



This article first appeared in the April 2025 issue of Tech Briefs Magazine (Vol. 49, No. 4).
