Multi Agent AI and Network Knowledge Graphs for Change — Ola Mabadeje, Cisco
Channel: aiDotEngineer
Published at: 2025-08-22
YouTube video id: m0dxZ-NDKHo
Source: https://www.youtube.com/watch?v=m0dxZ-NDKHo
Good afternoon everyone. My name is Ola Mabadeje. I'm a product guy from Cisco, so my presentation is going to be a little more product-focused than technical, but I think you're going to enjoy it. I've been at Cisco working on AI for the last three years, in a group called Outshift. Outshift is Cisco's incubation group; our charter is to help Cisco look at emerging technologies and see how they can accelerate the roadmaps of our traditional business units. By training I'm an electrical engineer. I dabbled in network engineering, enjoyed it, and did that for a while, but over the last three years I've focused on AI. Our group also works on quantum technology, so quantum networking is something we're focused on. If you want to learn more about what we do, look up Outshift at Cisco.

For today, we're going to dive right in. As I said, I'm a product guy, so I usually start with my customers' problems, trying to understand what they're solving for, and then work backwards toward a solution. As part of that process we go through an incubation phase where we ask customers a lot of questions, come up with prototypes, do A/B testing, and then deliver an MVP into a production environment. Once we get product-market fit, that product graduates into the Cisco business units. This customer had an issue. They said: when we do change management, we have a lot of challenges with failures in production. How can we reduce that? Can we use AI to reduce that problem? We double-clicked on that problem statement and realized it was a major problem across the industry. I won't go into the details here, but it's a big problem.
Now, for us to solve the problem, we wanted to understand: does AI really have a place here, or would rule-based automation be enough? We looked at the workflow and realized there are specific spots where AI agents can actually help. We highlighted steps three, four, and five as the places where we believe AI agents can increase value for customers and reduce the pain points they were describing. So we sat down with the teams and said, let's figure out a solution. The solution consists of three big buckets. The first is a natural language interface where network operations teams can interact with the system — and not just engineers but also other systems. For example, we built this system to talk to an ITSM tool such as ServiceNow, so we actually have agents on the ServiceNow side talking to agents on our side. The second piece is the multi-agent system that sits within the application. We have agents tasked with specific things: an agent for impact assessment, for testing, for reasoning about potential failures that could happen in the network. The third piece, where we're going to spend some of the time today, is the network knowledge graph. We have the concept of a digital twin here: we're trying to build a twin of the actual production network, and that twin includes a knowledge graph plus a set of tools to execute testing. We'll dive into that in a bit. But before that, we faced this challenge: we want to build a faithful representation of the actual network — how are we going to do it?
If you know networking well, it's a very complex technology. You have a variety of vendors in a customer's environment and a variety of devices — firewalls, switches, routers, and so on — and all of these devices emit data in different formats. The challenge for us was: how can we create a representation of this real-world network, using knowledge graphs, in a data schema that agents can understand? The goal was to create an ingestion pipeline that represents the network in such a way that agents can take the right actions in a meaningful, predictive way. To proceed, we had three big buckets of considerations. First, the data sources: in networking there are controllers, the devices themselves, agents on the devices, and configuration management systems — all of them collect or hold data about the network, and they emit it in different languages, such as YANG and JSON. Second, how the data actually arrives: it could be streaming telemetry, configuration files in JSON, or some other form of data. How can we take all of these considerations and come up with a set of requirements that lets us build a system that addresses the customer's pain point? From the product side, our first requirement was a knowledge graph with multi-model flexibility — it can handle key-value pairs, it understands JSON files, and it understands relationships across different entities in a network. The second requirement is performance.
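A minimal sketch of the ingestion idea described above, in Python: heterogeneous per-vendor records get mapped onto one common shape. The field names here (`hostname`, `vendor`, `interfaces`) and the input formats are illustrative assumptions, not the actual Cisco pipeline or schema.

```python
import json

def normalize_record(raw: str, source_format: str) -> dict:
    """Parse one device record and map it onto a single common shape.

    Hypothetical normalizer: real ingestion would handle YANG, streaming
    telemetry, and many vendor-specific config dialects.
    """
    if source_format == "json_config":
        doc = json.loads(raw)
        return {
            "hostname": doc.get("host"),
            "vendor": doc.get("vendor", "unknown"),
            "interfaces": doc.get("ifaces", []),
        }
    if source_format == "key_value_telemetry":
        # e.g. "host=fw1 if=eth0 rx_bps=1200"
        kv = dict(pair.split("=", 1) for pair in raw.split())
        return {
            "hostname": kv.get("host"),
            "vendor": "unknown",
            "interfaces": [{"name": kv.get("if"),
                            "rx_bps": int(kv.get("rx_bps", 0))}],
        }
    raise ValueError(f"unsupported source format: {source_format}")
```

The point is only that everything downstream — the graph, the agents — sees one schema regardless of which device or format produced the data.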
If an engineer is querying the knowledge graph, we want instant access to information about a node no matter where that node is located. That was important for our customers. The third requirement was operational flexibility: the schema has to be such that we can consolidate everything into one schema framework. The fourth piece is where the RAG piece comes in — we've been hearing about graph RAG for a bit today. We wanted a system with vector indexing built in, so that when you want to do semantic searches at some point, you can. And then, in terms of ecosystem stability, we wanted to make sure that when we put this in customers' environments, there isn't a lot of heavy lifting for the customer to integrate with their systems — and again, it has to support multiple vendors. Those were the requirements from the product side, and then our engineering teams considered the options on the table: Neo4j, obviously the market leader, and various other open-source tools. The engineering teams did some analysis — I'm showing a table on the right-hand side. It's not an exhaustive list of what they considered, but these are the criteria they used to decide which solution best addresses the product requirements. We all centered on the first two, Neo4j and ArangoDB, but for historical reasons the team decided to go with ArangoDB, because we had some recommendation-system use cases in the security space that we wanted to keep supporting. We are still exploring the use of Neo4j for some of the use cases coming up as part of this project.
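The vector-indexing requirement above is about semantic search over graph entities. Here is a minimal illustration of the underlying idea — cosine similarity over precomputed embeddings — in plain Python; a real deployment would use the database's own vector index rather than this linear scan, and the two-dimensional vectors are toy stand-ins for real embeddings.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_search(query_vec: list[float], docs: list[tuple], top_k: int = 1) -> list:
    """docs: list of (doc_id, embedding). Return the top_k ids by similarity."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]
```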
So we settled on ArangoDB, and we eventually came up with a solution that looks like this — this is an overview of the knowledge graph solution. On the left-hand side we have the production environment: the controllers, Splunk (which is a SIEM system), and traffic telemetry coming in. All of it flows into an ingestion service, which does an ETL transform of all this information into one schema: OpenConfig. OpenConfig is a schema designed primarily around networking, and it helps us because there's a lot of documentation about it on the internet, so LLMs understand it very well. This setup is essentially a database of networking information with the OpenConfig schema as the primary way to communicate with it — natural language communication, whether from an individual engineer or from the agents interacting with the system. We built it in the form of layers. If you're into networking, there is a set of entities in the network you want to interact with, and we have layered the graph so that when there's a tool call or a decision to be made about a test, you only touch the layers you need. For example, if you want to test for configuration drift, you don't need to go through all the layers of the graph — you go straight down to the raw configuration files and do your comparisons there. If you're testing reachability, you need a couple of layers: maybe the raw configuration layer, the data plane layer, and the control plane layer. It's structured so that when the agents make their calls to this system, they understand what the request is and go to the right layer to pick up the information they need to execute on it.
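The layer-routing behavior described above can be sketched in a few lines. The layer and collection names here are assumptions for illustration, and the AQL is only the kind of scoped query a fine-tuned agent might emit — in a real deployment it would be executed through an ArangoDB driver such as python-arango, with bind variables rather than string interpolation.

```python
# Illustrative mapping from test type to the graph layers it needs.
LAYERS_FOR_TEST = {
    "config_drift": ["raw_config"],
    "reachability": ["raw_config", "control_plane", "data_plane"],
}

def layers_needed(test_type: str) -> list[str]:
    """Return the graph layers an agent should query for a given test."""
    try:
        return LAYERS_FOR_TEST[test_type]
    except KeyError:
        raise ValueError(f"unknown test type: {test_type}")

def scoped_aql(layer: str, hostname: str) -> tuple[str, dict]:
    """Build an AQL query that reads a single layer collection, instead of
    traversing every layer of the graph. Returns (query, bind_vars)."""
    if layer not in {"raw_config", "control_plane", "data_plane"}:
        raise ValueError(f"unknown layer: {layer}")
    query = (
        f"FOR doc IN {layer} "
        "FILTER doc.hostname == @hostname "
        "RETURN doc"
    )
    return query, {"hostname": hostname}
```

A config-drift check would then only ever issue one query against `raw_config`, which is exactly what keeps the agent from walking every layer.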
So that's a high-level view of what the graph system looks like in layers. Now I'm going to switch gears and go back to the system. Remember, I described a system with agents, a knowledge graph and digital twin, and a natural language interface. Let's talk about the agentic layer. Before I get to the specific agents in this application: we are looking at how to build a system based on open standards for the whole internet, and this is one of the challenges we have within Cisco. We are part of an open-source collective that includes all of the partners you see down here — Outshift by Cisco, LangChain, Galileo, and other members who support the collective. What we are trying to do is set up a system that allows agents from across the world — it's a big vision — to talk to each other without the heavy lifting of reconstructing your agents every time you want to integrate them with another agent.
The collective covers identity; a schema framework for defining an agent's skills and capabilities; the directory where you store those agents; how you compose agents, at both the semantic and the syntactic layer; and how you observe the agents in production. All of these are part of the collective's vision, and if you want to learn more, it's at agntcy.org. There's real code you can leverage today: there's a GitHub repo you can go to if you want to contribute or use the code, there's documentation available, and there are sample applications that let you see how this works in real life. We know that MCP, A2A, and all these protocols are becoming very popular; we integrate those protocols as well, because the goal again is not to create something bespoke. We want to make it open so everyone can create agents and make those agents work in production environments. Back to the specific application we're talking about: based on this framework, we delivered a set of agents — five agents right now as part of this application. There's an assistant agent that acts as the planner and orchestrates work across all of the other agents, and the other agents are all based on ReAct reasoning loops. There's one particular agent I want to call out here: the query agent. The query agent is the one that interacts directly with the knowledge graph on a regular basis. We had to fine-tune this agent, because we initially attempted to use RAG to query the knowledge graph, and that was not working out well.
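To make the directory and skills framework mentioned above concrete, here is a heavily hedged sketch of what registering and discovering agents by skill could look like. The record fields and function names are invented for illustration — the actual schema and APIs live in the collective's GitHub repo and documentation, not here.

```python
# Hypothetical in-memory agent directory; the real directory is a shared
# service with identity and verification, not a Python dict.
DIRECTORY: dict[str, dict] = {}

def make_agent_record(name: str, skills: list[str], protocols: list[str]) -> dict:
    """Build a directory record declaring an agent's skills and the
    protocols (e.g. MCP, A2A) it speaks. Field names are illustrative."""
    record = {"name": name, "skills": list(skills), "protocols": list(protocols)}
    if not record["skills"]:
        raise ValueError("an agent record must declare at least one skill")
    return record

def register(record: dict) -> None:
    DIRECTORY[record["name"]] = record

def find_agents_with_skill(skill: str) -> list[str]:
    """Discover agents by capability instead of by hard-coded integration."""
    return sorted(r["name"] for r in DIRECTORY.values() if skill in r["skills"])
```

The design point is that a planner asks the directory "who can run tests?" rather than being wired to a specific peer, which is what makes cross-organization agent composition possible.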
So we decided that, for immediate results, we would fine-tune it. We did some fine-tuning of this agent with schema information as well as example queries, and that reduced two things. The first was the number of tokens we were burning: before, the AQL queries were going through all the layers of the knowledge graph in a reasoning loop, consuming lots of tokens and taking a long time to return results. After fine-tuning, we saw a drastic reduction in the number of tokens consumed as well as the time it took to come back with results. I'm going to pause here — I've been talking through a lot of slideware — and show a quick demo of what this actually looks like, tying together everything from the natural-language interaction with an ITSM system, to how the agents interact, to how the system collects information from the knowledge graph and delivers results to the customer. The scenario: a network engineer wants to make a change to a firewall rule to accommodate a new server in the network. The first thing they do is start from ITSM — they submit a ticket in ServiceNow. The UI I'm showing you right here is the UI of the application we built. We have ingested information about the ticket here in natural language, so the agents can start to work on it. I'm going to play a video to make it more relatable. The first thing that happens is that the first agent is asked to synthesize the ticket information into a summary, so the team can quickly understand what to do.
The next action is to create an impact assessment. Impact assessment here just means: will this change have any implications beyond the immediate target area? That gets summarized, and we then ask the agent responsible for this task to attach the impact assessment to the ITSM ticket. That's been done. The next step is to create a test plan. Test planning is one of the biggest problems our customers face: they run a lot of tests, but they miss the right tests to run. These agents can reason through a lot of information about test plans and, based on the intent collected from the ServiceNow ticket, come up with a list of tests you have to run to make sure this firewall rule change doesn't create problems in the production environment. As you can see here, the agent has listed all of the test cases that need to be run and the expected result for each test. We then ask the agent to attach this information back to the ITSM ticket, because that's where the approval board needs to see it before they approve implementation of the change in production. You can see that the information has now been attached back to the ITSM ticket by the agent — two separate systems, but agents talking to each other. The next step is to actually run one of these test cases. In this case, the configuration file that will be used to make the change on the firewall is sitting in a GitHub repo, so we're going to use a pull request for that config file.
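The intent-to-test-plan step just described might be sketched like this. The catalog of tests and the expected-result strings are invented for illustration — the real agent reasons over far more material than a lookup table — but the shape of the output (test case plus expected result, attached to the ticket) matches what the demo shows.

```python
# Hypothetical catalog mapping a change intent to a test plan.
TEST_CATALOG = {
    "firewall_rule_change": [
        {"name": "rule syntax valid", "expected": "config parses cleanly"},
        {"name": "new flow permitted", "expected": "server reachable on allowed ports"},
        {"name": "no collateral block", "expected": "existing flows still pass"},
    ],
}

def build_test_plan(intent: str) -> list[dict]:
    """Return the test cases (with expected results) for a change intent."""
    plan = TEST_CATALOG.get(intent)
    if plan is None:
        raise ValueError(f"no tests known for intent: {intent}")
    return plan
```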
This is the GitHub repo where we open the pull request. We take the link for that pull request and paste it into the ticket, so that when the execution agent starts its job, it pulls the change from there and uses it to run the tests. At this point we ask the agent to run the tests: "I have attached my change candidate to the ticket. Can you go ahead and run the tests?" If you look on the right-hand side of the screen, a series of things happen. First, the executor agent looks at the test cases, then goes into the knowledge graph and takes a snapshot of the most recent state of the network. It then takes the change candidate it pulled from GitHub and the snapshot it just took from the knowledge graph, combines them, and runs the individual tests one at a time — you can see it running test one, test two, test three, test four. All of this is happening in what we call the digital twin. Again, a digital twin is a combination of the knowledge graph and a set of tools you can use to run the tests; an example of such a tool could be Batfish, or RouteNet, or some other tool used for network engineering purposes. Once all of the tests are completed, this agent generates a report about the test results. We give it some time to run through this — it's still running the tests — but once it concludes, it reports what the test results are.
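A minimal sketch of the executor loop just described: snapshot the twin, apply the candidate change to a copy (never to the snapshot itself), run each test case, and collect a report. The snapshot and test-case structures are hypothetical; in the real system a network analysis tool such as Batfish would evaluate the network model rather than simple Python predicates.

```python
import copy

def run_tests(snapshot: dict, change: dict, test_cases: list[dict]) -> list[dict]:
    """Apply `change` to a deep copy of `snapshot`, then run each test case
    (a name plus a predicate over the changed network) and record results."""
    candidate = copy.deepcopy(snapshot)   # the twin's snapshot stays untouched
    candidate.update(change)
    report = []
    for case in test_cases:
        passed = case["check"](candidate)
        report.append({"test": case["name"],
                       "result": "pass" if passed else "fail"})
    return report
```

The report list is what the agent would then attach back to the ITSM ticket, pass/fail per test case.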
It shows which tests passed and which failed, and for the ones that failed, it makes recommendations on what you can do to fix the problem. I'm going to skip ahead here because of time. It has attached the results to the ticket, and this is the report it produces for the tests that were run — the execution agent created a report covering all of the test cases run by the system. So that's a very quick, short demo. There's a lot of detail behind the scenes, but I can answer questions offline. A couple of things I want to leave you with before I close. Evaluation is critical for us to understand how this delivers value to customers. We're looking at a variety of things — the agents themselves, the knowledge graph, the digital twin — and at what we can actually measure quantifiably. For the knowledge graph in particular, we're looking at extrinsic metrics rather than intrinsic ones, because we want to map the evaluation back to the customer's use case. This slide is a summary of our evaluation metrics. We are still learning — for now this is an MVP — but what we have learned so far is that those two key building blocks, the knowledge graph and the open framework for building agents, are critical for building a scalable system for our customers. And with that, I'm going to stop — eight seconds to go. Thank you for listening, and if you have questions, I'll be out there.