PhD project

Detection and diagnosis of system issues (e.g. outages, faults, degraded service levels) is important in large-scale systems because of the impact on end-user experience and brand reputation.
Besides the accompanying loss of revenue, considerable effort is required to identify potential root causes before they can be resolved. However, observing system behaviour and performing diagnostics is a challenge for large-scale systems, and even more so for geographically distributed systems like telecom networks and mobile edge clouds.
The research will take an analytics-driven approach to the observability problem, improving system visibility and troubleshooting in large and dynamic systems like mobile edge clouds and mobile networks. It will focus on instrumentation, data collection, system modeling and analytics to drive automation of anomaly detection and diagnosis by leveraging machine learning techniques and big data platforms. The goal is to shorten lead times, improve user experience, and minimize the need for experts in problem diagnosis.
The project's initial phase will focus on visibility, that is, addressing the fundamental questions of what to instrument and how, how frequently data should be collected and how it should be aggregated, and how the overall system should be modeled. The level of decentralization required to support distributed storage and analytics will also be examined. With distributed storage and streaming frameworks like Apache Storm, Spark and Hadoop, it is now possible to process large datasets from diverse sources in real time with low compute and storage overhead.
The project's second phase addresses observability through proactive autonomous anomaly detection. According to our comprehensive survey of the research area, the current research focus is shifting away from simple threshold-based alerting and application-specific modeling approaches towards sophisticated data-driven techniques that account for many more KPIs and their inherent temporal behaviour. However, techniques based on supervised learning perform poorly in dynamic environments, as they may not recognize new system behaviour or work with unlabeled traces.
The vision of the proactive approach is to combine machine learning and forecasting techniques to support anomaly detection. Time-series analysis (e.g. ARIMA) and probabilistic models (e.g. Bayesian networks, Hidden Markov Models) will be explored to predict future states of KPIs, impending anomalies and system-level issues (e.g. bottlenecks and faults). Continuous benchmarking will be performed to produce baseline profiles of the system under varying contexts, while appropriate unsupervised learning techniques will be used to detect changes in relevant KPIs and to determine when to update system models.
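To illustrate the idea of forecast-driven detection, the following is a minimal sketch rather than the project's method. It assumes a single KPI sampled at regular intervals and uses an exponentially weighted moving average as a simple stand-in for the ARIMA-style forecasters named above. The function name `detect_anomalies` and the parameters `alpha`, `k` and `warmup` are hypothetical, chosen only for this example.

```python
from math import sqrt


def detect_anomalies(series, alpha=0.3, k=3.0, warmup=5):
    """Flag indices whose value deviates strongly from a one-step forecast.

    The forecast is an exponentially weighted moving average (EWMA), a
    simple stand-in for ARIMA-style models. All parameters are illustrative.
    """
    anomalies = []
    forecast = series[0]  # initial forecast: the first observation
    var = 0.0             # running estimate of the residual variance
    for i, x in enumerate(series[1:], start=1):
        residual = x - forecast
        # After a warm-up period, flag residuals larger than k standard deviations.
        if i > warmup and abs(residual) > k * sqrt(var):
            anomalies.append(i)
            continue  # keep the anomalous point out of the baseline
        # Otherwise fold the observation into the baseline profile.
        var = (1 - alpha) * var + alpha * residual ** 2
        forecast = forecast + alpha * residual
    return anomalies


# Hypothetical KPI trace with one obvious spike at index 8.
kpi = [10, 11, 10, 9, 10, 11, 10, 10, 50, 10, 11]
print(detect_anomalies(kpi))
```

Skipping the baseline update for flagged points keeps a single spike from inflating the variance estimate. A persistent shift in the KPI would instead be flagged repeatedly, which could serve as one possible signal that the system model needs updating.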
The project's third phase addresses automated diagnostics. While existing approaches mostly focus on detecting abnormal changes in metric values to pinpoint suspicious metrics, the main challenge lies in identifying the actual components or nodes of the infrastructure that are at fault. Since problems manifest differently depending on the execution context and workload, it is important to distinguish between potential causes in order to recommend the right corrective action. The focus is to address the diagnostics problem through automatic multi-layer root-cause attribution and root-cause analysis, using graph theory techniques to explore spatial dependencies across the network and AI (e.g. fuzzy logic and probabilistic reasoning) to exploit expert and domain knowledge.
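As a rough illustration of how a dependency graph can narrow a set of suspicious components down to likely root causes, the sketch below applies one simple heuristic: an anomalous component is a candidate only if none of its direct dependencies are anomalous. This is an assumption made for the example, not the attribution method the project will develop. The graph, the component names and the function `root_cause_candidates` are all hypothetical.

```python
def root_cause_candidates(depends_on, anomalous):
    """Return anomalous components whose direct dependencies all look healthy.

    depends_on maps each component to the components it relies on.
    Only direct dependencies are inspected in this simplified heuristic.
    """
    anomalous = set(anomalous)
    candidates = []
    for node in sorted(anomalous):
        deps = depends_on.get(node, [])
        # If a dependency is also anomalous, the fault likely lies deeper.
        if not any(dep in anomalous for dep in deps):
            candidates.append(node)
    return candidates


# Hypothetical service dependency graph.
depends_on = {
    "web": ["api"],
    "api": ["db", "cache"],
    "db": ["storage"],
    "cache": [],
    "storage": [],
}

# Anomalies observed on web, api and db point to db as the deepest faulty node.
print(root_cause_candidates(depends_on, {"web", "api", "db"}))
```

A heuristic like this ignores execution context and workload, which is where the probabilistic reasoning and expert knowledge mentioned above would come in.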