Original source: Outshift by Cisco
This video from Outshift by Cisco covered a lot of ground. 17 segments stood out as worth your time. Everything below links directly to the timestamp in the original video.
Cisco's move to AI agents could transform how organizations handle system diagnostics and compliance, potentially reducing the time and resources spent on identifying complex technical issues.
Cisco Develops AI Agents for Troubleshooting and Compliance in Cisco IQ
Cisco is integrating advanced AI capabilities into its Cisco IQ platform, transitioning from traditional workflow-based solutions to sophisticated deep agent systems. These agents leverage inductive, data-driven reasoning to identify root causes in complex, previously unseen scenarios across diverse domains and vendors. This approach moves beyond classical expert systems that are limited to known issues, enabling the AI to troubleshoot much like a human, even if the results are probabilistic and may involve some speculation.
"I can go and do inductive reasoning, just data-driven, much like I do deductive reasoning using classic expert systems in the past."
Time Series Foundational Models Revolutionize Forecasting with Zero-Shot Capabilities
Time series foundational models are bringing a "massive" transformation to data analysis, enabling zero-shot forecasting across multiple domains without the need for dataset-specific tuning. These models utilize a decoder-only transformer architecture, enhanced with patching and long context windows, to efficiently learn and predict temporal patterns. This advancement eliminates the tedious, dataset-by-dataset tuning that previously consumed significant time for analysts.
"This is really massive guys. This helps us big time. And it's this one thing when I saw this where I said, 'Boy, I wasted 10 years of my life fiddling around with all these individual like per data set tuning things.'"
Cisco's New Time Series Model Boosts Observability Benchmarks
Cisco has introduced a new time series model that significantly lowers the mean absolute error on observability benchmarks while maintaining comparable performance on general benchmarks. This model, which will be available in Splunk with a single command, integrates both coarse and fine time blocks with resolution embeddings to handle varying data frequencies and ensure accurate forecasting. The system aims to simplify complex time series analysis by providing a highly accurate, streamlined solution.
"We come in lower than anybody else. So that means if you're coming up with the right architecture and the right training data, we can undercut everybody."
Agent-Driven System Resolves 'Vote App' Issue, Provides Audit Trail
A prototype system demonstrated an agent-driven investigation successfully resolving a "vote app not reachable" issue, tracing the root cause to disk pressure and associated image pull back-off (ImagePullBackOff) errors. The system started from the original problem statement, then iteratively generated and evaluated hypotheses, gathered evidence, and converged on a solution, much like a human troubleshooter. The process produced an audit trail of its investigative steps and could generate a detailed report for management.
"We have an investigation for exactly this vote app. And um, what the system came back with here is like this is the boring piece. This is the ultimate resolution."
Deep Agent System Diagnoses 'Vote App' Outage to Disk Pressure
A deep agent system successfully diagnosed a "vote application not reachable" issue, tracing the root cause to disk pressure and ephemeral storage problems within a Kubernetes environment. The system operates with a planner agent and multiple evidence agents that can generate code to ingest and analyze data, identify dependencies, and iteratively refine hypotheses. An investigation graph visualized the process, demonstrating how the agents moved from initial symptoms to the ultimate problem through a series of logical steps.
"So I'm forming, I gather new data and I'm reformulating things... Then I find something like the vote app failed to be admitted to a pod due to disk pressure. Ah, we're getting close, right?"
AWS and Datadog Advance Time Series Forecasting with Novel AI Architectures
AWS and Datadog are employing distinct, advanced approaches to time series forecasting. AWS quantizes time series data into a fixed vocabulary of tokens, treating them like words for transformer model training, and has integrated group attention blocks to capture correlations across different time series. Datadog, meanwhile, has scaled its training data to over two trillion points and developed models that prioritize time attention over group attention, reflecting the belief that historical trends within a single series are more indicative of future behavior.
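The quantization idea can be illustrated with a small sketch. It is not the AWS implementation: the mean-absolute scaling, the uniform bins, and the names `tokenize`, `detokenize`, `vocab_size`, and `limit` are all assumptions chosen only to show how real-valued points can be mapped to a fixed vocabulary of integer tokens and back.

```python
def tokenize(series, vocab_size=256, limit=5.0):
    """Map real values to integer tokens (assumed scheme: scale, clip, uniform bins)."""
    # Scale by the mean absolute value so different series share one vocabulary.
    scale = sum(abs(x) for x in series) / len(series) or 1.0
    width = 2 * limit / vocab_size  # width of one bin in scaled units
    tokens = []
    for x in series:
        scaled = max(-limit, min(limit, x / scale))  # clip outliers to the bin range
        tokens.append(min(vocab_size - 1, int((scaled + limit) / width)))
    return tokens, scale


def detokenize(tokens, scale, vocab_size=256, limit=5.0):
    """Map tokens back to approximate values using each bin's center."""
    width = 2 * limit / vocab_size
    return [((t + 0.5) * width - limit) * scale for t in tokens]


series = [10.0, 12.0, 11.0, 50.0, 9.0]  # made-up values for illustration
tokens, scale = tokenize(series)
restored = detokenize(tokens, scale)
print(tokens)
print([round(v, 2) for v in restored])
```

Once a series is a sequence of integers, a transformer can be trained on it much as it would be on words, at the price of a small rounding error per point.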
"DataDog went crazy like well they have data like we have with Splunk they have their own observability system so they said well we have that data let's go and use it so they trained on more than two trillion data points so size matters right."
Cisco Develops Time Series Model for Observability with Massive Data Scale
Cisco has developed a new time series foundational model, leveraging a decoder-only approach and training on 300 billion time points, combined with extensive data cleaning. This model specifically addresses observability challenges by integrating both hourly rollups and granular observations, mimicking human memory by summarizing historical data without retaining every detail. The architecture is designed to handle different input frequencies and focus on predicting future events in dynamic environments.
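A rough sketch of the coarse-plus-fine idea follows. It is an illustration only, not Cisco's architecture: it assumes one-minute samples, a hypothetical `build_context` helper, and a simple integer flag (0 for hourly rollup, 1 for raw observation) standing in for whatever the model uses to tell resolutions apart.

```python
from statistics import fmean


def build_context(samples, per_hour=60, fine_points=120):
    """Summarize older history as hourly means and keep recent points raw.

    Returns (value, resolution_id) pairs, where 0 marks an hourly rollup
    and 1 marks a raw observation (an assumed encoding).
    """
    fine = samples[-fine_points:]
    older = samples[:-fine_points]
    # Skip a partial leading hour so every rollup covers a full hour.
    start = len(older) % per_hour
    coarse = [
        fmean(older[i:i + per_hour])
        for i in range(start, len(older), per_hour)
    ]
    return [(v, 0) for v in coarse] + [(v, 1) for v in fine]


# One day of made-up one-minute samples: 1,440 points.
samples = [float(i) for i in range(24 * 60)]
context = build_context(samples)
print(len(samples), "raw points ->", len(context), "context entries")
```

In this toy case, 22 hours of history collapse into 22 entries while the last two hours keep full detail, so the context stays short without dropping the long-range picture.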
"We went up to 300 billion individual time points and we cleaned the data big time to go and make sure that the entropy is nicely balanced."
Patching Method Reduces Transformer Computational Cost for Time Series
A significant architectural advancement for transformers involves treating groups of data points, known as patches, as single tokens to manage the high computational cost associated with long time series. Self-attention compares every token with every other token, so its cost grows with the square of the sequence length; with patching, it grows with the square of the number of patches instead. For instance, processing 10,000 data points as individual tokens involves 100 million pairwise computations, while grouping them into 32-point patches leaves about 313 tokens and roughly 98,000 computations (approximately 100,000), a reduction of about a thousandfold that applies to every attention head.
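The arithmetic can be checked with a few lines of code. This sketch only counts token-to-token comparisons in one attention pass; the function name `attention_pairs` is made up for illustration, and real costs also depend on model width, number of heads, and layers.

```python
import math


def attention_pairs(num_points: int, patch_size: int = 1) -> int:
    """Count token-to-token comparisons for one self-attention pass."""
    num_tokens = math.ceil(num_points / patch_size)
    return num_tokens * num_tokens


points = 10_000
per_point = attention_pairs(points)                # every point is its own token
patched = attention_pairs(points, patch_size=32)   # 32 points share one token

print(per_point)                     # 100000000
print(patched)                       # 97969
print(round(per_point / patched))    # 1021
```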
"So it's the sequence length times the sequence length. So it's the square of the sequence length i.e. the number of tokens in your context window."
AI Agents Proposed for Collaborative Troubleshooting, Mimicking Human Problem-Solving
Traditional troubleshooting methods, often involving data consolidation in data lakes or cross-departmental meetings, struggle with the complexity of modern IT environments. A new approach proposes an agentic system where AI agents collaborate, mirroring human problem-solving. These agents form hypotheses, gather evidence like logs and commands, evaluate relationships between data points, and iteratively refine their understanding to pinpoint root causes, moving away from rigid, computer-centric processes.
"How about we do this like humans do? So we bring these different domains together and we have these guys behave like humans but this is no longer humans but this is agents that would discuss and troubleshoot like humans would do."
ChatGPT Inspires New Era of Generic Time Series Foundational Models
Inspired by the success of ChatGPT's scaling, the field of time series analysis is shifting towards foundational models designed to learn generic temporal patterns from massive datasets. First explored by Nixtla in late 2023 using transformer architectures, these models aim to predict across diverse scenarios, from networking data to financial markets, using a single, pre-trained model. This approach promises to replace repetitive, one-off analyses with a more cost-effective, inference-only process after initial training.
"It was scale. It was really the number of parameters plus the size of the training set that made transformers perform really well. We all remember that, right? The ChatGPT moment happened because some engineers decided to go and scale the overall thing."
New AI Forecasting Models Enhance Accuracy with Larger Patches and Synthetic Data
To improve forecasting accuracy and mitigate accumulated error, new AI models are adopting larger output patches, which reduces the number of prediction steps required to forecast future points. Additionally, these models complement real-world training data with synthetic time series to fill gaps and enhance data balance, totaling an extra three billion data points across three million synthetic series. The architecture leverages a decoder-only, auto-regressive transformer, similar to GPT, optimizing it for efficient sequential prediction in time series analysis.
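The effect of larger output patches can be sketched as follows. The 128-point patch size is inferred from the "512 points in four steps" figure rather than stated in the talk, and `repeat_last` is a made-up stand-in for a real model; the only point is how the number of autoregressive steps, and with it the number of chances for error to compound, shrinks as the output patch grows.

```python
import math


def rollout_steps(horizon: int, output_patch: int) -> int:
    """Number of autoregressive steps needed to cover the horizon."""
    return math.ceil(horizon / output_patch)


def forecast(history, horizon, output_patch, predict_patch):
    """Autoregressive rollout: predict a patch, append it to the context, repeat."""
    context = list(history)
    predicted = []
    while len(predicted) < horizon:
        patch = predict_patch(context, output_patch)
        predicted.extend(patch)
        context.extend(patch)  # each prediction becomes input to the next step
    return predicted[:horizon]


def repeat_last(context, n):
    """Placeholder model (assumption): repeats the last observed value."""
    return [context[-1]] * n


print(rollout_steps(512, 1))     # 512 steps when predicting point by point
print(rollout_steps(512, 128))   # 4 steps with 128-point output patches
print(len(forecast([1.0, 2.0], 512, 128, repeat_last)))  # 512
```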
"How about I don't do that and I do like bigger steps. So I've taken larger output patches and I can reach 512 points into the future with only four steps."
Researchers Develop AI Method to Identify Key System Features from Metric Changes and Sensor Names
Researchers have developed an AI-driven method to automatically identify representative features in complex systems by analyzing metric changes and the rarity of tokens in sensor path names. The approach considers how metrics change—such as step functions, spikes, or variance shifts—and scores the uniqueness of terms within sensor names, inspired by TF-IDF and entropy concepts. This allows the system to highlight highly important tokens that signal significant system events, like an "admin up down" status, which is rare but critical when it occurs.
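The token-rarity half of this idea can be sketched in a few lines. It covers only an inverse-document-frequency style score, not the full method from the talk, and the sensor paths and the function names `tokenize_path` and `token_rarity` are hypothetical examples rather than real Cisco sensor paths or APIs.

```python
import math
import re
from collections import Counter


def tokenize_path(path):
    """Split a sensor path into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", path.lower()) if t]


def token_rarity(paths):
    """Score each token by log(N / number of paths containing it)."""
    doc_freq = Counter()
    for path in paths:
        doc_freq.update(set(tokenize_path(path)))
    total = len(paths)
    return {tok: math.log(total / count) for tok, count in doc_freq.items()}


# Hypothetical sensor paths for illustration only.
paths = [
    "interfaces/ge0/counters/in-octets",
    "interfaces/ge0/counters/out-octets",
    "interfaces/ge1/counters/in-octets",
    "interfaces/ge1/admin-state",
    "bfd/session/ge0/counters/down-events",
]

rarity = token_rarity(paths)
for token in ("counters", "interfaces", "admin", "bfd"):
    print(token, round(rarity[token], 3))
```

Tokens such as "counters" appear in almost every path and score near zero, while "bfd" and "admin" appear once and score highest, which is the property that lets rare but meaningful names stand out.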
"Something that is frequently happening in this document but hardly ever used outside or very rarely used that sounds like important and I want to go and put these up as keywords."
Time Series Foundational Models Evolve to Address Real-World Complexities
Early time series foundational models, while performing well on benchmarks, faced significant challenges with real-world complexities such as local seasonality, non-stationary behavior, and heterogeneous inputs. These limitations prompted a wave of architectural and training data advancements aimed at improving generalization, rather than relying on time-consuming fine-tuning for each dataset. Major companies, including Cisco, have since joined the effort to develop their own time series models, introducing innovations to better handle diverse input frequencies and data scales.
"What they've proven is it can be made to work. And then yeah, how do you generalize this? You can obviously say, well, let's go and, um, fix this by fine-tuning. Well, fine-tuning just means you're again tuning things on a per case basis. So, you're back to square one. Not a good idea."
Cisco's AI-Driven Telemetry Prioritizes Relevant Time Series with Combined Metric Analysis
Cisco has integrated a feature ranking mechanism into its AI-driven telemetry in IOS XR 7.3.1 that combines metric behavior changes with token rarity from names to identify and stream only relevant time series data. This system can hierarchically rank features, for example, highlighting BFD-related counters when a BFD session breaks, even if it's a rare event. By analyzing both the change in metric behavior (like spikes or step functions) and the uniqueness of terms in sensor names, the system filters data so that only important information is streamed for monitoring.
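A minimal sketch of such a combined ranking is shown below. It is not the IOS XR implementation: the change score only detects a shift in the mean (not spikes or variance changes), the rarity weights are supplied by hand, and the feature names and the functions `change_score` and `rank` are hypothetical.

```python
from statistics import fmean, pstdev


def change_score(series):
    """Shift in mean between the two halves, in units of the first half's spread."""
    mid = len(series) // 2
    before, after = series[:mid], series[mid:]
    spread = pstdev(before) or 1.0  # avoid dividing by zero on flat series
    return abs(fmean(after) - fmean(before)) / spread


def rank(features):
    """features maps name -> (series, rarity). Sort names by change * rarity."""
    scored = {
        name: change_score(series) * rarity
        for name, (series, rarity) in features.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Made-up series and rarity weights for illustration only.
features = {
    "bfd/session/down-events": ([0, 0, 0, 0, 5, 5, 5, 5], 1.6),
    "interfaces/ge0/in-octets": ([100, 102, 98, 101, 99, 103, 100, 102], 0.2),
    "cpu/utilization": ([40, 41, 40, 42, 41, 40, 42, 41], 0.9),
}

for name in rank(features):
    print(name)
```

Multiplying the two scores means a feature must both change its behavior and carry an unusual name to reach the top, which matches the BFD example from the talk.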
"If I do this for like where I broke a BFD session and I hierarchically rank all the features then all counters are spit out that have the name BFD in there. Not a surprise, but this is a cool feature well filtering mechanism that can help me identify what matters in my system if I have no idea."
Time Series Foundational Models Offer Generic Solution for Data Analysis
Inspired by the success of models like ChatGPT, time series foundational models are emerging as a generic solution to simplify the repetitive and costly process of analyzing time series data. These models promise a one-time training investment followed by only inference costs, aiming to provide a "one-size-fits-all" approach that can generalize across various datasets and prediction scenarios. This represents a significant shift from the traditional method of building and tuning individual models for each specific time series problem.
"Can't we have this one-size-fits-all rather than I do this over and over and over again? Yeah, time series foundational models. There is light at the end of the tunnel."
Vote App Outage Highlights Challenges of Root Cause Analysis in Data-Rich Environments
After predicting future events with time series models, the critical next step is root cause analysis, which often involves sifting through vast amounts of data like logs and show commands. The scenario presented is a vote application outage in a Kubernetes environment, caused by disk space issues on a worker node. The sheer volume of data, including 127 files across 65 directories of Kubernetes, system, and router logs, underscores the overwhelming challenge faced by IT operations engineers when trying to identify the source of such intermittent outages.
"So, imagine you're an IT ops engineer and you receive the troubleshooting ticket saying, well, my vote application is not reachable anymore."
Researchers Face Practical Hurdles in Advanced Time Series Forecasting Experiments
Researchers have explored various experimental approaches to handle diverse time series problems, including different projection systems for multi-frequency inputs and concatenating multiple time series into one long string for multivariate forecasting. They also experimented with using different output distributions to better match varied data characteristics. However, many of these advanced techniques, while promising in theory, proved difficult to implement effectively in practice, leading some features to be scaled back in subsequent model versions.
"So they tried a load of things, and if you look at Moirai 2.0, um, multi-frequency pro support gone, um, multivariate support gone."
Also mentioned in this video
- The presentation on cross-domain diagnostics using AI and agentic systems,… (0:09)
- Telemetry is used for two main purposes (0:27)
- The presentation will begin with a level-setting to understand current and past… (1:28)
- The classic approach to time series processing involves collecting data,… (2:28)
- Data collection for time series can come from various sources like Splunk or… (3:17)
- (e.g., BFD session counts vs. interface counters) are crucial for comparability… (5:36)
- After pre-processing, tools like Splunk's AI Toolkit (AITK) are used to further… (8:01)
- Transitions, often achieved using methods like Principal Component Analysis… (9:38)
- Finding the right projection for data is not a one-size-fits-all solution, but… (11:21)
- Identifying the most relevant sensor paths among thousands of constantly… (13:39)
- Time series data, especially network time series, is challenging due to… (19:50)
Summarised from Outshift by Cisco · 59:56. All credit belongs to the original creators. Streamed.News summarises publicly available video content.