Ontology Is Trending Again. Can It Improve My AI Agent's Performance?

This article introduces how ontology provides structured domain knowledge to enhance AI agent accuracy and explainability in enterprise O&M scenarios.

By Wang Chen

Ontology, a concept that sounds a bit philosophical, is increasingly gaining attention from Agent Builders.

Ontology may not sound as sexy as buzzwords such as Prompt Engineering and Harness, but it is more concrete and offers a clearer path to implementation.

1. Ontology: From Philosophical Definition to Machine Cognition

The word Ontology originates from the Greek "ontos" (being) and "logos" (study), literally translating to the theory of being. In plain terms, an ontology is a unified, unambiguous cognitive map drawn for the domain you intend to study.

From the first framework of existential analysis that Aristotle constructed in his Metaphysics to today's observability modeling of enterprise IT systems, ontology has a history of more than two thousand years. It has evolved step by step from a core branch of metaphysics into an underlying methodology for digital transformation across industries of all kinds.

Whether it is ontology in philosophy or in the field of artificial intelligence, key questions must be addressed at its core: What actually exists in this world? How should these things be categorized? What relations exist among these things, and how do they interact? How do these relations and interactions change with environmental variables?

In the field of artificial intelligence, an ontology is an explicit specification of the concepts, entities, properties, and relationships in a specific domain. For example, Alibaba Cloud's UModel, which we shared previously, builds a cognitive map that makes complex IT systems easier to understand in the specific domain of Operations and Maintenance (O&M). This map needs to answer:

What entities exist in the system, such as services, Pods, databases, networks, CPUs, GPUs, memory, etc.
What properties each type of entity has, such as metrics, logs, traces, events, etc.
What relationships exist between entities, such as calling, deployment, dependency, inclusion, running, etc.
How these relationships change over time, and so forth.

Ontology, in fact, has long existed in all aspects of our daily lives.

For example, when we go to a hospital to register, we must first choose internal medicine, surgery, or orthopedics. The correspondence between diseases and departments is one of the most basic ontologies. Let's look at a more complex ontology, taking medical diagnosis as an example. The first category is "concepts," such as diseases, symptoms, medications, body parts, and demographic characteristics. The second category is "relationships," such as "manifests as," "used to treat," "acts on," "contraindicated in," and "belongs to." The third category is "instances," which connect concepts and relationships using specific medical knowledge. For instance, a cold is a disease, fever is a symptom, Ibuprofen and Acetaminophen are antipyretic drugs, and Acetaminophen is contraindicated in patients with liver disease. By linking these instances and relationships, when a patient with mild liver disease catches a cold, presents with a fever, and requires an antipyretic drug, Ibuprofen will be recommended instead of Acetaminophen.
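The medical example above can be sketched as a tiny triple store. This is only an illustration of how concepts, relationships, and instances combine into an inference, not a clinical rule base. The relation names and the `recommend` helper are assumptions made for this sketch.

```python
# Toy ontology from the example above; every name here is an illustrative assumption.
CONCEPT_OF = {  # instance -> concept
    "cold": "disease",
    "liver disease": "disease",
    "fever": "symptom",
    "ibuprofen": "antipyretic",
    "acetaminophen": "antipyretic",
}

# Instances linked by relationships, stored as (subject, relation, object) triples.
TRIPLES = {
    ("cold", "manifests_as", "fever"),
    ("ibuprofen", "used_to_treat", "fever"),
    ("acetaminophen", "used_to_treat", "fever"),
    ("acetaminophen", "contraindicated_in", "liver disease"),
}


def recommend(symptom, patient_conditions):
    """Return drugs that treat `symptom` and are not contraindicated for the patient."""
    candidates = sorted(
        subj for (subj, rel, obj) in TRIPLES
        if rel == "used_to_treat" and obj == symptom
    )
    return [
        drug for drug in candidates
        if not any((drug, "contraindicated_in", cond) in TRIPLES
                   for cond in patient_conditions)
    ]


print(recommend("fever", {"liver disease"}))
print(recommend("fever", set()))
```

The recommendation comes from following explicit relationships rather than from co-occurrence in text, which is the property the rest of this article relies on.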

Therefore, the core purpose of ontology is to eliminate ambiguity and establish consensus: ensuring that all participants share a consistent understanding of "what this is," "which category it belongs to," and "what relationship it has with other things."

2. In the O&M Domain, What Problems Does Ontology Solve?

Through long-term practice, Alibaba Cloud's Observability team discovered that AIOps in the era of Large Language Models (LLMs) faces two core challenges: cognitive challenges and data challenges.

Cognitive Challenge: The Semantic Gap Between General LLMs and the O&M Domain

General LLMs encountered massive amounts of internet text during their pre-training phase, including technical documentation, blog posts, and open-source projects. They have indeed internalized a vast amount of common knowledge in the O&M domain, such as "502 errors are usually related to gateways or downstream services," "OOM Killed means insufficient container memory," and "high CPU spikes may cause slow service response." However, this knowledge is statistical, probabilistic, and generalized. When facing the specific IT architecture of a business, the model does not know the calling relationships between services or whether those calls are synchronous or asynchronous. It does not know that the billing service encounters IO contention every Wednesday at midnight due to database backup tasks, let alone that a core microservice was migrated from virtual machines to a container cluster just last week.

General LLMs will not build a system topology for IT systems either. Modern enterprise IT architectures are often hybrid-deployed, microservice-oriented, and containerized, making the relationships between components extremely complex. Without an explicit topology map, the LLM is highly likely to fail to see the forest for the trees. The model can see database latency rising and also see error logs at the application layer, but it cannot determine the causal propagation path between the two. Therefore, constructing a cognitive map based on ontology is the only way to bridge the semantic gap between general LLMs and the O&M domain.

Data Challenge: Silos and Semantic Fragmentation of Heterogeneous Data

Data in O&M scenarios is naturally multi-source and heterogeneous: Metrics are time-series values, Logs are unstructured text, Traces are tree-like calling chains, and Events are discrete status changes. These data are stored in different systems, using different naming conventions, different data formats, and different query syntaxes. O&M personnel must master multiple query languages like PromQL, SQL, and SPL, jump back and forth between multiple consoles, and manually correlate and analyze them.

After the introduction of LLMs, this problem did not disappear automatically. Instead, it manifested in a new form: facing massive raw data of low value density, the model does not know which metrics belong to which service, which Pod a log corresponds to, or how to associate a span in a trace with its container instance. The implicit correlations between data lack an explicit semantic-level definition. As a result, even if the model understands the meaning of a single data point, it cannot comprehend the business logic that relates those data points.

Ontology addresses these two challenges by shifting the focus from data-oriented to object-oriented. In a traditional observability system, O&M starts from "what data do I have": logs, metrics, traces, and events each exist independently. In the ontology view, the starting point is "what entities do I have": services, Pods, databases, network devices, and storage, together with the data these objects generate, the relationships between them, and the knowledge attached to them. Data becomes observation attributes bound to specific entities, and relationships become inference paths explicitly defined in the knowledge graph.
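A minimal sketch of this shift, assuming every raw record carries a label that names the entity it came from. The record layout, the `entity` label, and the `Entity` class are hypothetical and do not reflect UModel's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Entity:
    """An observed object with its telemetry attached (illustrative only)."""
    name: str
    telemetry: dict = field(default_factory=lambda: defaultdict(list))


# Data-oriented view: one flat stream of records, distinguished only by type.
raw_records = [
    {"type": "metric", "entity": "order-service", "name": "latency_p99_ms", "value": 2300},
    {"type": "log", "entity": "order-service", "text": "timeout calling mysql-master"},
    {"type": "event", "entity": "mysql-master", "text": "weekly backup started"},
    {"type": "metric", "entity": "mysql-master", "name": "disk_io_util", "value": 0.97},
]


def to_entities(records):
    """Object-oriented view: regroup every record under the entity that produced it."""
    entities = {}
    for rec in records:
        ent = entities.setdefault(rec["entity"], Entity(rec["entity"]))
        ent.telemetry[rec["type"]].append(rec)
    return entities


entities = to_entities(raw_records)
for name, ent in entities.items():
    print(name, {kind: len(items) for kind, items in ent.telemetry.items()})
```

Once records hang off an entity, the question "what do we know about the order service?" becomes a single lookup instead of a join across several stores.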

For a medium-sized enterprise, the number of entities can reach thousands, and the topology formed by these entities and their relationships is highly complex. Presenting this data to the model in a structured way is both the prerequisite for useful AIOps and its most difficult part.

Figure: Topology exploration feature provided by Alibaba Cloud UModel

UModel grew out of exactly this kind of practical experience. Alibaba Cloud's Observability team began observability data modeling in 2019. Through successive stages covering data standardization, entities and relationships, knowledge expression, and action capabilities, the work has gradually evolved from collecting and displaying scattered data such as logs, metrics, and traces to unified modeling oriented towards objects, relationships, and time series.

3. Are Foundation Models Combined with Individual Skills Enough?

First, let's talk about foundation models. In a world where models swallow everything, it is natural to wonder: if the pre-training set already contains the relevant knowledge, will the model itself replace the cognitive map that ontology constructs, thereby undermining the long-term value of ontology?

For general and simple scenarios, such as the hospital registration example above, there is indeed no need to build an additional ontology. LLMs bring powerful natural language understanding, but in specific domains, especially rigorous business scenarios like O&M that rely on private data, private concepts, and private experience, an ontology is the fastest way to constrain the model toward deterministic output.

Second, regarding individual skills: O&M engineers design Skills by drawing on their own domain experience and use LLMs to complete tasks such as intelligent data querying, troubleshooting, and alert analysis. This approach is indeed effective in simple scenarios. An experienced O&M engineer, through carefully designed Skills, can have the LLM help them write query statements, explain error logs, and generate troubleshooting steps. However, when we expand our view from single tasks to enterprise-level O&M, the limitations of this approach become apparent.

Limitation 1: Enterprise Private Architecture Is a Knowledge Blind Spot of the Model

The knowledge of pre-trained LLMs is limited to the cutoff date of the training data, while enterprise IT architectures are constantly evolving. More importantly, internal microservice topologies, service dependencies, custom monitoring metric naming conventions, and private cloud architecture designs will mostly not appear in open internet corpora. The model has never seen this private knowledge during pre-training, so no matter how refined the Skills are, the model cannot accurately answer questions like "In my system, does the call from service A to service B go through the internal network or the public network?" or "Which business module does this custom metric biz_order_latency_p99 correspond to?". While individual skills can fill some gaps, each engineer only knows their own area of responsibility. When facing complex cross-team or cross-department failures, what the system can do is capped by individual experience.

Limitation 2: The Gap from Correlation to Causality

What the model learns from the corpus is mainly statistical correlation. It finds that "disk usage exceeding 90%" and "application crash" often appear together in the training data, so it tends to correlate them. However, O&M scenarios require causal inference: a full disk prevents logs from being written; the inability to write logs triggers a health check failure; a health check failure causes the orchestration system to restart the Pod. Each link in this causal chain has clear logical rules, rather than simple co-occurrence frequencies. Without an ontology to define this causal propagation relationship, the model can only give a guess of "seems related," rather than a definitive conclusion of "this is exactly what happened."
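The causal chain described above can be written down as explicit cause-to-effect rules and then followed mechanically. This is a minimal sketch: the rule table and the `explain` helper are assumptions for illustration, and a real ontology would attach such rules to entity and relationship types rather than to plain strings.

```python
# Explicit causal rules (cause -> effect), taken from the chain described above.
CAUSES = {
    "disk full": "log writes fail",
    "log writes fail": "health check fails",
    "health check fails": "Pod restarted by orchestrator",
}


def explain(root_condition, rules):
    """Follow cause -> effect rules from a root condition and return the full chain."""
    chain = [root_condition]
    seen = {root_condition}
    while chain[-1] in rules:
        effect = rules[chain[-1]]
        if effect in seen:  # guard against cyclic rule sets
            break
        chain.append(effect)
        seen.add(effect)
    return chain


print(" -> ".join(explain("disk full", CAUSES)))
```

Each hop in the returned chain is backed by a named rule, so the conclusion can be checked link by link instead of being accepted as "seems related."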

Limitation 3: Lack of Explainability and Auditability

When an AIOps system wakes up an on-call engineer in the middle of the night, the engineer will definitely ask, "On what basis did you determine that this service had a problem?" If the system's response relies on the activation weight of some neuron inside the LLM, the engineer cannot verify the reliability of this conclusion. Conversely, the system's reasoning can be based on an explicit ontology knowledge graph, for example: "According to the topology ontology, gateway-level 5xx errors rose. Tracing down the call chain, the query latency from the order service to the database rose from 20ms to 2s, and the database ontology shows that the database is currently in a backup window, where backup tasks are consuming a large amount of IO." This is a verifiable and auditable reasoning process. In high-risk O&M scenarios, explainability is not a bonus point, but a hard requirement.

What is the Difference After Having an Ontology?

An AIOps system supported by an ontology essentially gains a semantic navigation system. It knows which entity each data point belongs to, knows the dependencies between entities, knows how failures propagate in the topology, and knows which knowledge applies to which scenarios. In scenarios like root-cause analysis, blast radius estimation, and automated remediation suggestions, this capability translates into higher accuracy, lower false positive rates, faster localization speeds, and, most importantly, human engineers' trust in system decisions. This is particularly crucial in rigorous scenarios like enterprise O&M.

4. UModel: Implementation of Ontology in the O&M Domain

To address the aforementioned challenges, the Alibaba Cloud Observability team released UModel last year. It is a unified IT world modeling framework built with ontological concepts. It is now live in CloudMonitor 2.0, free to use, and has been open-sourced to the public.

Figure: CloudMonitor 2.0 console interface (Playground environment)

From Data-Oriented to Object-Oriented

The core philosophy of UModel is to drive the shift of the observability framework from "data-oriented" to "object-oriented". In traditional observability platforms, logs, metrics, traces, and events are independent data types, and O&M personnel need to manually associate them with specific services or resources. In contrast, UModel models with Entity at the center: all observed objects in the system, including services, Pods, databases, Redis instances, K8s nodes, and CI/CD tasks, are defined as specific types inside EntitySet. Each entity has its own properties, state, observation data, and association relationships.

This shift brings a fundamental cognitive upgrade. When an alert occurs, instead of giving a bunch of isolated metric curves and log snippets, UModel directly locates the "order service" entity, and then automatically aggregates all logs, metrics, traces, events associated with this entity, along with upstream/downstream entities having calling, deployment, or dependency relationships with it according to the defined relation chain. When an O&M engineer asks, "What happened to the order service?", the system does not answer with "Here are some abnormal logs and high latency curves," but rather: "Pod-3 of the order service runs on Node-5. The disk IO of Node-5 spiked starting at 2:00, and at the same time, the P99 query latency of the order service to the MySQL master database rose from 20ms to 2.3s, while the MySQL master database started its weekly backup task at 2:00."

Graph-Centered Modeling Framework

UModel adopts a graph-centered modeling approach, defining four core node types and four core relationship types.

Core nodes include: EntitySet (a collection of observable entities that defines the primary keys, properties, and states of a category of entity resources), TelemetryDataSet (a generic representation of observable data, covering LogSet, MetricSet, TraceSet, EventSet, etc.), Storage (storage abstraction), and Explorer (visualization abstraction).
Core relationships include: EntitySetLink (relationships between entities, such as "serves," "calls," "contains," "runs on," "is identical to"), DataLink (association relationships between data and entities, e.g., a certain category of logs belongs to a certain service), StorageLink (association between modeling abstractions and concrete storage), and ExplorerLink (relationships between data nodes and visualization nodes).

The power of this graph model lies in making complex topology queries and causal reasoning possible. For example, when troubleshooting a Pod anomaly, a graph query can progressively expand the context along relation chains like "Pod runs on Node," "Node belongs to K8s cluster," and "Cluster is associated with storage volume." It can also execute backward tracing along the call chain from affected downstream services to locate the root cause upstream.
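The two traversals described above can be sketched with a plain breadth-first search over typed edges. The edge list, entity names, and relation names below are hypothetical, and the code does not use UModel's query language. It only shows the shape of the computation.

```python
from collections import deque

# (source, relation, target) edges; entity and relation names are illustrative.
EDGES = [
    ("pod-3", "runs_on", "node-5"),
    ("node-5", "belongs_to", "k8s-cluster-a"),
    ("k8s-cluster-a", "associated_with", "volume-1"),
    ("gateway", "calls", "order-service"),
    ("order-service", "calls", "mysql-master"),
]


def expand_context(start, relations, max_hops):
    """Breadth-first expansion along the given relation types, up to max_hops.

    Returns a dict mapping each reached entity to its hop distance from start.
    """
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue
        for src, rel, dst in EDGES:
            if src == node and rel in relations and dst not in hops:
                hops[dst] = hops[node] + 1
                queue.append(dst)
    return hops


def trace_call_chain(service, max_hops=10):
    """Walk 'calls' edges from an affected service toward the services it depends on."""
    reached = expand_context(service, {"calls"}, max_hops)
    return [name for name in reached if name != service]


print(expand_context("pod-3", {"runs_on", "belongs_to", "associated_with"}, 3))
print(trace_call_chain("gateway"))
```

Restricting the traversal to a set of relation types is what keeps the expanded context relevant: infrastructure relations answer "where does this run?", while call relations answer "what does this depend on?".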

Unified Query Language and Multi-Modal Data Fusion

Another crucial design of UModel is the unified query language layer. In traditional O&M toolchains, PromQL is used for metrics, SPL for logs, Cypher for topology, and SQL for resources. O&M personnel have to switch between multiple syntaxes, and LLMs also face a high comprehension cost when processing these heterogeneous queries. Through a unified query abstraction, UModel brings query capabilities such as SPL, SQL, PromQL, and Cypher into a single query engine, enabling upper-layer applications to access metrics, logs, traces, events, and graph data in a consistent manner.

Furthermore, UModel supports multi-modal data fusion analysis. In actual troubleshooting, a single data type is often insufficient to locate the root cause. UModel allows combining multiple data sources in a single query: first retrieving alert events from event storage, then querying associated entities within a 5-hop range in the topology graph based on entity IDs in the events, next extracting error keywords from the logs of the associated entities, and finally performing anomaly detection on the metrics of these entities. This cross-data-type fusion analysis is precisely what is difficult to achieve purely relying on the text comprehension capabilities of LLMs.
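The four-step flow described above (events, then topology, then logs, then metrics) can be sketched over in-memory stand-ins for each store. Everything here is an assumption for illustration: the sample data, the helper names, and the simple sigma-based anomaly check are not UModel's query engine or its detection algorithm.

```python
from statistics import mean, pstdev

# In-memory stand-ins for the event, topology, log, and metric stores (hypothetical).
EVENTS = [{"entity": "gateway", "alert": "5xx rate high"}]
NEIGHBORS = {"gateway": ["order-service"], "order-service": ["mysql-master"], "mysql-master": []}
LOGS = {
    "gateway": ["upstream returned 502"],
    "order-service": ["ERROR timeout waiting for mysql-master", "request ok"],
    "mysql-master": ["backup task started"],
}
LATENCY_MS = {
    "order-service": [20, 21, 19, 20, 22, 2300],
    "mysql-master": [5, 6, 5, 5, 6, 5],
}


def within_hops(start, max_hops):
    """Entities reachable from start in at most max_hops steps."""
    seen, frontier = {start}, [start]
    for _ in range(max_hops):
        frontier = [n for cur in frontier for n in NEIGHBORS.get(cur, []) if n not in seen]
        seen.update(frontier)
    return seen


def is_anomalous(series, threshold=3.0):
    """Flag the last point if it is more than `threshold` sigmas from the earlier points."""
    history, last = series[:-1], series[-1]
    sigma = pstdev(history) or 1.0  # avoid dividing by zero on a flat series
    return abs(last - mean(history)) / sigma > threshold


def investigate(max_hops=5):
    """Chain the four steps and return (entity, error_logs, metric_anomaly) findings."""
    findings = []
    for event in EVENTS:                                              # 1. alert events
        for entity in sorted(within_hops(event["entity"], max_hops)):  # 2. topology
            errors = [line for line in LOGS.get(entity, []) if "ERROR" in line]  # 3. logs
            anomalous = entity in LATENCY_MS and is_anomalous(LATENCY_MS[entity])  # 4. metrics
            if errors or anomalous:
                findings.append((entity, errors, anomalous))
    return findings


print(investigate())
```

The entity ID is the join key at every step, which is why the fusion only works once data has been bound to entities in the first place.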

"Contextualized" Accumulation of Knowledge

UModel also proposes an innovative layered design at the knowledge management level. It divides O&M knowledge into three layers: General Knowledge Base (documents and FAQs generic across entities), Agent Rules (instructions and preferences telling the Agent "how to do it"), and UModel Knowledge (SOPs, Runbooks, and best practices strongly coupled with specific entities and topologies).

The key insight of this layering is that the value of much O&M knowledge depends heavily on which entity it describes. A "Database Slow Query Troubleshooting Guide" is just an ordinary document in a general knowledge base, but when bound to the specific entity of "MySQL master database that the order service depends on," it becomes an action guide with precise context. By sinking knowledge to the entity level, UModel achieves contextualization. When handling failures, the Agent does not search for needles in a haystack in general documents, but directly retrieves verified O&M knowledge related to the current entity.
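A minimal sketch of entity-scoped retrieval, assuming each knowledge item optionally records the entity it is bound to. The item fields, titles, and the `knowledge_for` helper are hypothetical and only illustrate the lookup order described above.

```python
# Knowledge items; "entity" is None for general documents. All content is illustrative.
KNOWLEDGE = [
    {"title": "Database Slow Query Troubleshooting Guide", "layer": "general", "entity": None},
    {"title": "Runbook: backup window IO contention", "layer": "umodel", "entity": "mysql-master"},
    {"title": "SOP: order-service rollback", "layer": "umodel", "entity": "order-service"},
]


def knowledge_for(entity):
    """Return entity-bound knowledge first, then general documents as a fallback."""
    bound = [item["title"] for item in KNOWLEDGE if item["entity"] == entity]
    general = [item["title"] for item in KNOWLEDGE if item["entity"] is None]
    return bound + general


print(knowledge_for("mysql-master"))
```

Because the lookup starts from the entity under investigation, the Agent sees the runbook written for that database before any generic guide.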

5. STAROps Built on UModel is Now Online

STAROps is an AIOps Agent built on UModel and a foundation LLM, which has now gone online on Alibaba Cloud.

Among its components, UModel is responsible for providing the infrastructure for context-awareness and semantic understanding, while the LLM is responsible for autonomous planning and reasoning capabilities. When a user asks "Why is my service slow?" in natural language, the LLM first understands the user's intent and then calls UModel's API to get the current service's real-time topology, related metrics, and recent events. What UModel returns is a structured context modeled semantically, indicating which entities are involved, what the relationships between them are, and which metrics are abnormal in the corresponding time window. The LLM performs reasoning based on this precise context to output explainable root-cause analysis and remediation suggestions. STAROps offers three core capabilities: intelligent data querying, fault localization, and proactive O&M.
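The division of labor described above can be sketched as two steps: fetch structured context, then hand it to the model together with the question. The `fetch_context` function below is a hypothetical stand-in that returns canned data; it is not UModel's API, and no model is actually called.

```python
import json


def fetch_context(service):
    """Hypothetical stand-in for a context lookup; returns canned data, not a real API call."""
    return {
        "entity": service,
        "relations": [{"type": "calls", "target": "mysql-master"}],
        "abnormal_metrics": [{"name": "latency_p99_ms", "from": 20, "to": 2300}],
        "recent_events": [{"entity": "mysql-master", "text": "weekly backup started"}],
    }


def build_prompt(question, context):
    """Combine the user question with structured context for the model to reason over."""
    return (
        "Answer using only the context below and cite the entities you rely on.\n"
        f"Question: {question}\n"
        f"Context: {json.dumps(context, sort_keys=True)}"
    )


prompt = build_prompt("Why is my service slow?", fetch_context("order-service"))
print(prompt)
```

The point of the design is that the model reasons over named entities, relations, and abnormal metrics it was handed, so each statement in its answer can be traced back to a field in the context.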
