[00:00:36] Camille Morhardt: Hi, and welcome to this episode of Cyber Security Inside. This is actually Part 2, where we’re specifically talking about intelligent systems research out of Intel Labs. We talked with Lama Nachman already, and she gave us an overview of some of the research that’s happening within her lab. Now we’re going to dive in with a subset of the lab and three PhD research scientists. We have an anthropologist, a computer scientist, and a user research scientist, all three with us. Dawn Nafus, Saurav Sahay, and Sinem Aslan, welcome the three of you to this conversation.
We’re going to talk about emotion recognition today, and specifically how you get there with context-aware multimodal AI systems. And if that doesn’t scare anybody who’s listening, then you’re probably not paying enough attention. We should be doing these things carefully and with a lot of insight, with ethicists and anthropologists as well as computer scientists all working together. So this is why we have a multidisciplinary crew on the call with us today.
Let’s start with Dawn, actually. Can you just tell us what is emotion recognition?
[00:01:55] Dawn Nafus: Essentially it is the idea that we could build Artificial Intelligence systems to recognize different emotional states. And typically, there’s a notion of input that’s external, right? So, um, most of the attempts seem to be around facial expression, gaze. Um, there’s also some things that you can do linguistically to sort of see in text, are people writing in a positive manner or a negative manner. Those are the kinds of typical scenarios I see, but I leave it to the real technical experts to refine that definition.
[00:02:34] Camille Morhardt: So then I’ll ask Saurav actually to just take us one level down on the computer science side and just at a high level–and we’ll work our way down to more depth here–but what are some of the different kinds of multimodal systems that we’re using to recognize these things?
[00:02:52] Saurav Sahay: Sure, sure, Camille. So emotion recognition technology is a technology that uses things like, as Dawn mentioned, facial expressions, physiological sensing, audio sensing and acoustic context, even, say, how you type on the keyboard, your typing speed and things like that, and takes all of these sensors into account to compute emotional states.
And emotion is a topic of research that has several flavors and several theories. For example, there’s the famous scientist Paul Ekman, who came up with a set of basic emotions, happy, sad, angry and so on. Many of the systems that we develop and model today try to have data for these emotion categories as the output classes, and then the models can detect these states.
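To make that concrete, here is a minimal sketch of the idea of emotion categories as output classes for a supervised classifier. The label set, the toy features, and the model choice are all illustrative placeholders, not the lab's actual data or system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative label set; real systems pick a taxonomy such as
# Ekman-style basic emotions.
EMOTIONS = ["happy", "sad", "angry", "neutral"]

# Toy stand-ins for fused sensor features (facial-expression embeddings,
# acoustic features, typing speed, ...) with human-annotated labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))
y_train = rng.integers(0, len(EMOTIONS), size=200)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

new_sample = rng.normal(size=(1, 8))
print(EMOTIONS[int(clf.predict(new_sample)[0])])  # the model's guess at a state
```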
Now, going a level deeper: emotion recognition is, as I said, the technology, but then there are the actual use cases. For example, in the driving scenario, my vehicle has this amazing attention-assist feature that tells me when I’m drowsy. So this system is also using some flavor of sensing to detect whether I’m alert or not.
[00:04:11] Camille Morhardt: Okay. And then Sinem, let’s hear from you. Can you describe one of the projects that you’re working on in the lab, in the classroom?
[00:04:19] Sinem Aslan: Yeah, sure. So in one of the projects, using context-aware, multimodal AI technologies, we are trying to understand student engagement while students are working with digital content on their laptops. When you look at the state of the art, the majority of researchers are looking at the emotional understanding problem from the perspective of facial emotions, right? The expressions that you are making, or your eyebrow movements, all of those different things.
But here at our lab, we are investigating this problem from more of a multimodal perspective, because we know that we don’t only show our emotions through our facial expressions. And especially when you think about a scenario where a child is watching an instructional video on a laptop, right? Probably most of the time his or her reactions, in terms of facial expressions, might be subtle. So this is where we really need complementary data sources, right?
At a high level, we try to understand engagement, but engagement is a multi-componential research construct for us, because we know that learning is emotional as much as it is intellectual. So we are trying to understand whether a student is on task or off task during learning, but at the same time, the other level of engagement is emotional engagement: whether a student is confused, bored, or satisfied at any point in their learning. So if you combine these two different pillars, we come up with a final engagement state, right?
So if a student is on task and satisfied, then we map it as engaged. But if a student is on task but having some emotional problems, like maybe they are confused or bored, then we map it as maybe engaged. And if a student is off task, then we map it as not engaged at all. So these two dimensions give us this broader perspective.
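As a rough sketch of that mapping (the function and label names here are illustrative, not the lab's code), the two dimensions combine like this:

```python
def engagement_state(on_task: bool, emotion: str) -> str:
    """Combine behavioral and emotional engagement into one label.

    `emotion` is assumed to be one of "satisfied", "confused", "bored".
    """
    if not on_task:
        return "not engaged"
    if emotion == "satisfied":
        return "engaged"
    return "maybe engaged"  # on task, but confused or bored

print(engagement_state(True, "satisfied"))   # engaged
print(engagement_state(True, "confused"))    # maybe engaged
print(engagement_state(False, "bored"))      # not engaged
```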
And if you look at how we evaluate emotional states, first of all, we use data from the laptop camera as the vision modality, and we also use context and performance data from the content platform. We extract features from these two modalities and feed those features into our classifiers, which are then fused to provide us with the final engagement state. So this is how we are handling emotional understanding in education research.
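A hedged sketch of that two-branch pipeline, assuming placeholder feature extractors, stand-in classifiers, and a simple averaging fusion rule rather than the actual system:

```python
import numpy as np

STATES = ["engaged", "maybe engaged", "not engaged"]

def vision_classifier(vision_features: np.ndarray) -> np.ndarray:
    # Stand-in for a trained model over camera-derived features;
    # returns one probability per engagement state.
    return np.array([0.6, 0.3, 0.1])

def context_classifier(context_features: np.ndarray) -> np.ndarray:
    # Stand-in for a trained model over content-platform context
    # and performance features.
    return np.array([0.5, 0.4, 0.1])

def fused_state(vision_features, context_features) -> str:
    # Late fusion: average the per-class probabilities from the two
    # modality-specific classifiers and pick the most likely state.
    probs = (vision_classifier(vision_features) +
             context_classifier(context_features)) / 2
    return STATES[int(np.argmax(probs))]

print(fused_state(np.zeros(128), np.zeros(16)))  # "engaged"
```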
[00:06:51] Camille Morhardt: I’ve got to take a step back early in this conversation, and, you know, anybody feel free to answer. I’m kind of looking at Dawn maybe to kick us off on this one, but yikes. I mean, there’s got to be a little bit of a yikes factor with that, because it’s maybe scary enough just to be on camera, or to feel like there’s a camera that can monitor your facial expressions, which you may or may not be able to conceal.
Now we’re talking about all different kinds of sensors that, you know, let’s just assume ultimately you can’t really have a poker face through all of them. So, Dawn, help me understand, you know, what’s kind of going to happen with this culturally and societally and ethically.
[00:07:34] Dawn Nafus: Yeah. I mean, you nailed the issue on the head. There’s a yikes factor in a number of ways. And I’ll start personally, right? So I work in this lab. I enjoy working with my colleagues. I respect what they do. And, you know, I believe Sinem when she says that we can support students in making sure that their emotional needs are met, which is, you know, how I understand what Sinem’s doing, or at least part of it.
At the same time, you know, I’ll go off to anthropology conferences and I’ll talk about emotion recognition, and the first thing people will say is, “but you cannot fundamentally make a claim about somebody’s emotional state for them.” Right? I mean, there’s a kind of a power relationship here.
And, you know, they’ll go on to say that actually there’s a lot more cultural variation than psychologists might have it, and, you know, on and on it goes, right. So we have here a scientific controversy, right? And I think it’s a meaningful one, and actually, working with people who sit on both sides of the controversy, for me personally, it’s whiplash, if I’m honest. But then the yikes comes in. So when Sinem says, “well, we use multimodality for a reason,” right, that reason in part is to get over that anthropologist’s sense of, “but you can’t just read the faces.” Right. And I think part of what we’re saying here in our lab is, no, you can’t. Right. You have to do something more to try to understand that context.
Then the next layer, you know, sort of where my work comes in is to ask, “okay, who’s in control here.” Right. “And who gets to make that claim about is the student distressed? Is that student not distressed? To what end are we, you know, saying, look, the student is on task or not.” Right. “Who’s gonna win and who’s gonna lose in that scenario?” And how are students going to grow up feeling watched in that way? Right?
Now, that’s not to say that there’s not a history of schools and surveillance. That experience of being in class and being watched is not an unfamiliar one, even without my computer also being a problem. Right. But if we ask who’s in charge, who benefits, who doesn’t, and who gets to make these technologies at all, then we can start to unpick, “okay, where are the benefits and where’s the real risk?”
[00:10:09] Sinem Aslan: So, just to add to this kind of perspective, I totally understand the concerns around emotional understanding, and I think it’s important to contextualize emotional understanding in the usage itself. Right? If you look at how teachers interact with children, they interact with them based mostly on their emotional state, so if they see them confused, they need to take an action. Right. And it’s part of their day-to-day job. And when we give children laptops, teachers lose all of the data that they used to have, which is the body language. Even if the classroom is full of children and there is one teacher, it’s really hard for them (this is what we identified from the ethnographic research that we have done), it’s really hard for them to understand and monitor engagement in real time.
So, in a way, we are making it more efficient; we are not creating some sort of new data for them. This is data that they already use on a day-to-day basis, but we are making it more effective and efficient for them.
If you look at, for instance, a work environment, right? I mean, even I would not be okay with sharing my emotional states with my manager, because that’s not what she does on a day-to-day basis; that’s extra information for her. But for a concrete classroom scenario, it’s already part of that context. And what we are doing is really making it more efficient.
[00:11:43] Camille Morhardt: Okay. So maybe I’ll buy that on the small scale, and the individual, I guess, autonomy for a parent or a person to say, “okay, yes, I think this is going to help me learn better, I’m in,” or “no, thank you.” Right. But I guess now we have people in distributed physical environments, and we have AI kind of on the backend or in the cloud that can process this, not on the one-on-one level like you’re saying, where it’s the teacher getting a ping, like, “Hey, you’ve completely lost the back row.” Now we’re saying, well, AI could process people in their individual homes and provide information to some broader platform or application or whoever is in control, I guess, whatever Dawn’s kind of referencing. How do we deal with that kind of a scenario?
[00:12:36] Sinem Aslan: Again, it depends on the usage, right? How you use these analytics. So in our case, we are not using it as a summative evaluation, like, “okay, this student is 90% engaged, she will get an A in this class,” right? That’s not the usage that we are going for.
What we are doing is sharing these learning analytics with the teacher so that the teacher can use them as a baseline for starting conversations with individual students. And from our own pilots, we have seen how the teachers are utilizing these things. Like in one of the pilots, for instance, what we saw was very interesting, so I would like to share it if that’s okay. The teacher saw that a child had not been so engaged recently, and she went and approached the student and said, “I see that you are not really engaged these last couple of days.” And she realized that the student had started having a headache, kind of an eye problem, and that was the reason he wasn’t watching the video, but he was still listening to it. Right? Because with an instructional video, you can either watch it or listen to it. But the model said the student was not engaged in this context.
So if the teacher just relied on what the models say and made an evaluation, “okay, then this student is really problematic,” then it’s a problem. Right. But here, in our usage, we are asking the teacher to use that information as a starting point to create that conversation. The other way around, she doesn’t know anything about the students, right? She doesn’t know, are they struggling? Are they engaged? Are they following me? She doesn’t have anything to act on. But now she has baseline information. Of course, there is always a risk, right? Even with the most, how to say, “useful” technologies we develop, people can use them for negative purposes, but that’s not our intent.
[00:14:38] Saurav Sahay: This field is currently evolving very fast, and people are aware that we don’t want our AI to make decisions or create biases in the minds of, say, the decision makers for children in the case when they are learning in a classroom. One of the collaborators we are working with, a university, is creating a digital platform where they’re trying to use non-intrusive technology, not a camera but some other sensing modality, to try to detect whether students are engaged or not. And based on that, they’re not informing the teachers about what the student’s issue is, but about how they can improve, how they can be better educators, how they can teach in a better way. What’s the idea of active listening? Where are kids getting stuck? If we can surface those issues there as well, that’s actually very interesting.
And there are other areas. For example, postpartum depression is an active research area where people have done a lot of research using sentiment analysis and emotion recognition to try to detect whether someone is depressed or going through postpartum depression or not. It’s all usage based, and there are definitely many good uses of these technologies.
[00:15:58] Dawn Nafus: If I could jump in for a second, you know, in a sense, what we’re talking about is really what makes responsible AI so hard, and it’s really pushing at the cutting edge of what responsible AI is, because some of my colleagues right now are asking this question of when do things start to become unethical? And when is intent, in a sense, both the right place to start but also not the whole story? Right. So, I mean, there’s a couple of things here. One is, if we can satisfy ourselves that, in the example of the technology that Saurav and Sinem were talking about, the way we’ve got it designed is doing the right thing and we’re going to be net positive here, then part of our job is also to do that almost adversarial scenario writing, where we put on the red hat in exactly the same way you would in security and say, “okay, now what? Now it goes to the customer,” or whatever it is, and you can start to fill in the blanks. And you can start to see how what starts as a good idea can shift: somebody might actually, for totally benign reasons, say, “all right, well, let’s just expand the scope a little bit, let’s just add one more feature,” or they don’t even necessarily think about it, and it’s like, “well, of course the school is going to want the data, right? That’s transparency.” And then all of a sudden, what started out pretty good and carefully done ends up being something that really is quite problematic.
And so that’s where I think the second thing comes in, which is, as a society, you know, we need to be much tougher customers. When schools are starting to purchase this stuff, right, we need to, as the responsible AI community, be supporting them and asking the really hard questions about how does this work? How doesn’t it work? What don’t we want? What can we switch off? Right. And with that kind of more skeptical customer base, then we can start to make sure that things land where they want them to land and don’t end up having some mission creep into some territory that I don’t think anybody wants.
[00:18:12] Camille Morhardt: Right. And it might be useful to note there is not a universal standard for privacy at this point, legally, or, I mean, literally from a standards perspective. There isn’t one. But I did have a conversation with Claire Vishik, who’s a Fellow at Intel, specifically around privacy and privacy policy around the world.
And it’s not the same everywhere. And there’s actually not a lot of common standards. There’ll be very specific standards for specific things or regional standards, but you’re kind of up against that as well.
[00:18:41] Dawn Nafus: There’s a lot of cultural variation in what we mean by privacy, and when and where you would even want it. I have colleagues who do research in Papua New Guinea who literally cannot get away for even an hour, because the villagers think, “he’s lonely, and why would you ever do that?” Right. So it is highly relative.
[00:19:04] Sinem Aslan: I feel like it’s important to give users the option, right? Like, at this stage, if they believe that, for instance, this feature is useful for them.
[00:19:13] Camille Morhardt: Right. Opt-in versus universal. Yeah,
[00:19:15] Sinem Aslan: Exactly. Exactly. I mean, obviously I, for instance, have a dialogue system at home, and it just listens to me all the time. I know that, but I don’t have any other options, at least none that I know of, to decide how my data is used. So I think it’s about giving that kind of agency to the users, individual users making those decisions, right: if I see a benefit, I will opt in, but if I don’t, then I can opt out. Right.
[00:19:49] Camille Morhardt: So Saurav, can you ground us a little bit more? I know you work on some of the underlying algorithms, being a computer scientist. What actually are some of these systems? When I talked with Lama, she did talk about looking at how human bodies interfere with wifi signals, and that we can actually detect movement and presence through wifi signals.
I think everybody’s heard of, you know, cameras, there’s gesture recognition. There’s this new kind of natural language processing. So can you tell us what the different kinds of sensors are that are getting at emotion recognition?
[00:20:24] Saurav Sahay: Sure. So I work in the area of multimodal dialogue and interactions, where we work on conversational AI technology. We work on things like multimodal language understanding and generation, dialogue management, and then creating solutions that can make these interactions, either human-AI interaction or AI-AI interaction, very efficient.
So for emotion recognition, we are looking at the audio, vision and text modalities primarily, and also heart rate, BCI (brain-computer interface), and typing speed. I mentioned context awareness: how do you interact with your machine, and how can you capture various things, like what application you’re using, what your speaking rate is, what your typing speed is? All of these are relevant to detecting a signal like emotion, which in the use case could become, say, your frustration or your confusion, or some other signal that is fine-tuned to the actual use case that you are trying to develop.
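As an illustration of what one window of such context-aware, multimodal signals might contain, here is a hypothetical snapshot; the field names and values are invented for the example, not a real API:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class InteractionSnapshot:
    """One time window of multimodal and contextual signals (illustrative)."""
    audio_features: List[float]       # e.g. pitch, energy, speaking rate
    vision_features: List[float]      # e.g. a facial-expression embedding
    text: str                         # what the user typed or said
    heart_rate_bpm: Optional[float]   # physiological sensing, if available
    typing_speed_cpm: float           # characters per minute
    active_application: str           # which app the user is interacting with

snapshot = InteractionSnapshot(
    audio_features=[0.2, 0.7],
    vision_features=[0.1, 0.4, 0.3],
    text="why is this not working",
    heart_rate_bpm=88.0,
    typing_speed_cpm=310.0,
    active_application="email client",
)
# Downstream, fields like these would be turned into features and fed to
# use-case-specific detectors (e.g. frustration or confusion).
print(snapshot.active_application, snapshot.typing_speed_cpm)
```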
[00:21:33] Camille Morhardt: So you’re talking about if I start hitting the keys really hard and typing really fast, then you can assume, I feel really compelled or maybe angry, or you might have some subset of emotions. You’re narrowing it down? This is not a calm, happy space based on how you’re typing?
[00:21:50] Saurav Sahay: You could, or you could not, depending on your baseline. So if you have been typing calmly most of the time, but then suddenly your typing speed goes up quite a lot or you’re punching the keys hard, it’s a signal that, together with the other context, lets you come up with some inferences. And more often than not, we still use supervised learning methods, so you have the signal, you have the training data to create those systems today. So then, yeah, you can make fairly good conclusions about how your typing speed is influencing your affective state, in a way.
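A toy illustration of that baseline idea, with made-up numbers and an arbitrary threshold, just to show the shape of the computation:

```python
from statistics import mean, stdev

def typing_anomaly(history_cpm, current_cpm, z_threshold=2.0):
    """Flag typing speed that departs sharply from this user's own baseline.

    history_cpm: recent typing speeds (characters per minute) for the user.
    Returns the z-score and whether it crosses an illustrative threshold.
    """
    baseline = mean(history_cpm)
    spread = stdev(history_cpm) or 1.0  # guard against a zero spread
    z = (current_cpm - baseline) / spread
    return z, abs(z) > z_threshold

history = [240, 255, 250, 238, 246, 252]   # calm, steady typing
z, flagged = typing_anomaly(history, 410)  # a sudden burst of fast typing
print(round(z, 1), flagged)  # a large positive z-score is a signal to combine
                             # with other context, not a conclusion by itself
```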
[00:22:28] Camille Morhardt: Okay. One more question on this front: when you talk about a brain-computer interface, you’re going to have to help me out. I understand eye movement and gestures of my hands, but now you’re connecting sort of directly or indirectly through the brain. Tell us what that is exactly. How do you go from, you know, a sensor on a brain to “I can finish that sentence for you”?
[00:22:54] Saurav Sahay: I remember a demo that happened more than 10 years ago, when I was at Georgia Tech. There was a person sitting at a machine, and he was thinking about getting a mug of coffee, and there was this robot that, just by magic, gave that coffee to the person. So, just like that, neural interfaces are getting mature enough, with a lot of sensing that happens. With EEG sensors, you can now create systems that can detect single words and some characters that you’re thinking about.
So this is a nice link to the language modeling work that’s happening today, which is already very powerful and can help you generate predictive text, say the next word or auto-complete, which we have all seen in certain commercial email systems. So a lot of interesting work is also happening in our lab, where we’re trying to connect EEG signals with word prediction technology to help patients, people who are in locked-in states, complete sentences even faster.
[00:24:01] Camille Morhardt: So I can think the word coffee at this point, maybe that’s from a filtered set of words or a provided set of words that the computer can recognize, but I don’t know. Could I just think “unicorn” and it would get it? Or is it more like it’s going to anticipate that I’m at work and my thoughts are going to be around something from, you know, cyber security to coffee, and it can recognize that word because I’m lighting up certain signals in my brain when I think coffee?
[00:24:32] Saurav Sahay: So today’s technology is limited. Mostly it’s a limited vocabulary of words that we can predict, but there’s a lot of work happening that keeps expanding the list of words that the system can guess. And now, linking it with context, linking it with what you said just before, can allow you to expand that vocabulary even further.
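A highly simplified sketch of that combination, with invented probabilities and a tiny vocabulary; in practice an EEG decoder and a language model would supply these scores:

```python
# Hypothetical per-word scores from an EEG decoder, limited to a small
# vocabulary the interface can currently distinguish.
eeg_scores = {"coffee": 0.40, "water": 0.35, "help": 0.25}

# Hypothetical next-word probabilities from a language model,
# conditioned on what the user has already composed.
context = "please bring me a cup of"
lm_scores = {"coffee": 0.60, "water": 0.30, "tea": 0.10}

def predict_word(eeg_scores, lm_scores):
    # Combine the two sources: words outside the decodable vocabulary are
    # ignored, and linguistic context re-weights the EEG decoder's guesses.
    combined = {w: eeg_scores[w] * lm_scores.get(w, 0.01) for w in eeg_scores}
    return max(combined, key=combined.get)

print(predict_word(eeg_scores, lm_scores))  # "coffee"
```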
[00:24:56] Dawn Nafus: As an anthropologist, and again, this is a little bit far afield, but one of the things that anthropology really does is try to understand what metaphors people are using to understand what it is they’re doing. And I do think it’s notable that we’re in a time where notions of the brain are everywhere: computer-brain interfaces, neuromorphic computing (which is a way of taking inspiration from one understanding of what a brain is to sort of do the hardware), labeling everything “smart,” right?
It’s sort of interesting that we’re in this moment where there’s a sort of preoccupation with the human brain; it’s never anybody else’s brain, it’s always the human, right? Which is an evolution from earlier, more machinic notions, right, that, you know, the economy is a machine, everything is quite mechanical. So we can see this as kind of a cultural moment, but it’s also a bias in a way, right.
Because if you think about, you know, other values outside of Silicon Valley, you might actually arrive at different metaphors that might be more inspirational: ecological metaphors, metaphors to do with kinship or family, or, you know, we could have a nice big laundry list. But it’s just notable that there’s this cultural preoccupation that is driving computing in some directions and not necessarily others.
[00:26:22] Camille Morhardt: You mean driving computing toward the human brain and how we’re processing and our emotional recognition, versus some other value that might be out there? Sinem, what would you like to add as kind of a final thought here? Give us something that you think is sort of a hot topic that maybe we haven’t touched on enough in this conversation.
[00:26:45] Sinem Aslan: Maybe this bias discussion, because of course we are also talking about the bias that these machine learning models can produce, but there is the human bias as well, right? From my own research, I know from the literature that teachers are, for instance, biased towards some kids, right? They always give verbal interventions to those kids, or they always praise certain kids, some of them at least, right? So there is also human bias involved in these things.
And to me, it might be more dangerous, right? We can potentially control the bias in machine learning models by controlling the data set that we are training them on. But on the other hand, there is also the bias that humans bring on a day-to-day basis. So how do we balance these two? Are we maybe giving more data to get rid of some of the human bias, right? That’s another thing that I think we should think about.
[00:27:48] Camille Morhardt: Interesting, you guys. Oh my God, such a fascinating conversation. It’s so interesting, I think, really: psychology, anthropology, computer science all sitting down together in the same lab, but also, like you’re saying, Dawn, you’re attending anthropology conferences where whole other conversations are happening and bringing that back into the lab. I think it’s really interesting.
And of course, you three are a subset of this lab, too. I just want to mention this is the Intelligent Systems Research Lab, which has many other disciplines as well. So I just really wanted to give a little bit more insight into some of the people on the team and what they’re thinking and studying.
So Sinem, Saurav and Dawn, thank you so much for joining today.