ARI solves a structured prediction problem over a bipartite graph between source and target field paths. The objective is to infer a globally consistent mapping via maximum a posteriori (MAP) inference in a constrained graphical model.
- Problem setup
- MAP objective
- Energy decomposition
- Feature representation
- Unary scoring
- Pairwise / structured scoring
- Constrained optimization
  - One-to-one constraints
  - Type / ontology constraints
  - Structural constraints
- Solution
- Training objective
- References
Problem setup
Let the sets of source and target field paths be:

$$S = \{s_1, \dots, s_m\}, \qquad T = \{t_1, \dots, t_n\}.$$

A mapping is a binary relation $M \subseteq S \times T$.

Define indicator variables:

$$x_{ij} \in \{0, 1\}, \qquad x_{ij} = 1 \iff (s_i, t_j) \in M,$$

where $x_{ij} = 1$ means that the source field $s_i$ and the target field $t_j$ are compatible and matched. Then, equivalently, we can define $M$ as:

$$M = \{(s_i, t_j) \in S \times T : x_{ij} = 1\}.$$
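The indicator-variable setup can be made concrete in code. A minimal sketch, where the field paths and the chosen matches are hypothetical examples:

```python
from itertools import product

# Toy source and target field paths (hypothetical examples).
S = ["user.name", "user.email", "user.age"]
T = ["person.full_name", "person.contact.email"]

# One binary indicator x[(i, j)] per candidate pair (s_i, t_j).
x = {(i, j): 0 for i, j in product(range(len(S)), range(len(T)))}

# Setting an indicator to 1 asserts that s_i maps to t_j.
x[(0, 0)] = 1  # user.name  -> person.full_name
x[(1, 1)] = 1  # user.email -> person.contact.email

# The mapping M is exactly the set of pairs whose indicator is 1.
M = {(S[i], T[j]) for (i, j), v in x.items() if v == 1}
```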
MAP objective
ARI defines a Gibbs distribution over mappings:

$$P(M) = \frac{1}{Z} \exp\big(-E(M)\big), \qquad Z = \sum_{M'} \exp\big(-E(M')\big),$$

and seeks the MAP solution:

$$M^{*} = \arg\max_{M} P(M) = \arg\min_{M} E(M).$$
Energy decomposition
The energy decomposes into unary and pairwise terms:

$$E(M) = \sum_{(i,j)} \psi_u(s_i, t_j)\, x_{ij} \;+\; \sum_{(i,j) \neq (k,l)} \psi_p\big((s_i, t_j), (s_k, t_l)\big)\, x_{ij}\, x_{kl},$$

where:

- $\psi_u$ encodes local compatibility of a single source/target pair
- $\psi_p$ encodes structural consistency between pairs of candidate matches
This corresponds to a pairwise Markov random field over candidate matches.
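The decomposition can be sketched as a small energy function. The potential values below are illustrative placeholders, not learned potentials:

```python
def energy(x, psi_u, psi_p):
    """E(M) = sum of unary potentials over active indicators, plus
    pairwise potentials over distinct pairs of active indicators."""
    active = [p for p, v in x.items() if v == 1]
    unary = sum(psi_u[p] for p in active)
    pairwise = sum(
        psi_p.get((p, q), 0.0)
        for a, p in enumerate(active)
        for q in active[a + 1:]
    )
    return unary + pairwise

# Illustrative potentials: negative = favorable (lower energy).
psi_u = {(0, 0): -2.0, (0, 1): 0.5, (1, 1): -1.5}
psi_p = {((0, 0), (1, 1)): -0.5}  # matching both together is consistent
x = {(0, 0): 1, (0, 1): 0, (1, 1): 1}
E = energy(x, psi_u, psi_p)  # -2.0 + -1.5 + -0.5 = -4.0
```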
Feature representation
Each candidate pair $(s_i, t_j)$ is mapped to a feature vector in $\mathbb{R}^d$, denoted by $\phi_{ij} = \phi(s_i, t_j)$ for pair $(i, j)$.
Typical features include:
- Lexical similarity
- Structural context
- Ontology compatibility
- Embedding similarity, e.g. $\cos\big(e(s_i), e(t_j)\big)$ for field embeddings $e(\cdot)$
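A toy feature extractor illustrates the flavor of these signals; the function name `phi` and the exact features are illustrative assumptions, not ARI's actual feature set:

```python
import difflib

def phi(s: str, t: str) -> list[float]:
    """Map a candidate pair of dotted field paths to a small feature vector."""
    s_leaf, t_leaf = s.split(".")[-1], t.split(".")[-1]
    lexical = difflib.SequenceMatcher(None, s_leaf, t_leaf).ratio()  # lexical similarity
    depth_gap = float(abs(s.count(".") - t.count(".")))              # structural context
    type_ok = 1.0  # placeholder for an ontology/type-compatibility flag
    return [lexical, depth_gap, type_ok]

v = phi("user.contact.email", "person.email")  # identical leaf names, depth gap 1
```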
Unary scoring
Unary potentials (scores) are parameterized as:

$$\psi_u(s_i, t_j) = -f_\theta\big(\phi(s_i, t_j)\big),$$

so that a higher score $f_\theta$ corresponds to lower energy. Examples:

- Linear: $f_\theta(\phi) = w^\top \phi$
- Tree/MLP models (GBDT, neural scoring)

Candidate pruning retains only the highest-scoring candidates, e.g. the top-$K$ targets per source field:

$$\mathcal{C}(s_i) = \operatorname*{top\text{-}K}_{t_j \in T}\; f_\theta\big(\phi(s_i, t_j)\big).$$
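A minimal sketch of a linear unary scorer with top-$K$ pruning; the weights and features below are made up for illustration:

```python
def linear_score(w, feats):
    """Linear unary score f_theta(phi) = w . phi."""
    return sum(wi * fi for wi, fi in zip(w, feats))

def prune_candidates(features, w, k=2):
    """Keep the top-k target indices per source field by unary score.
    `features[i][j]` holds phi(s_i, t_j)."""
    kept = {}
    for i, row in features.items():
        ranked = sorted(row, key=lambda j: linear_score(w, row[j]), reverse=True)
        kept[i] = ranked[:k]
    return kept

w = [2.0, -0.5, 1.0]
features = {
    0: {0: [0.9, 0.0, 1.0],   # score 2.8
        1: [0.1, 1.0, 1.0],   # score 0.7
        2: [0.4, 0.0, 0.0]},  # score 0.8
}
kept = prune_candidates(features, w, k=2)  # source 0 keeps targets [0, 2]
```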
Pairwise / structured scoring
Pairwise potentials capture dependencies between candidate matches:

$$\psi_p\big((s_i, t_j), (s_k, t_l)\big) = -g_\theta(s_i, t_j, s_k, t_l).$$

Examples:

- Cross-encoder: $g_\theta$ jointly encodes the contexts of both candidate pairs
- Structured models (CRF / GNN): potentials produced by message passing over the graph of candidate matches
These enforce:
- Structural alignment
- Co-occurrence patterns
- Ontological consistency
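One concrete pairwise signal is structural alignment between parent and child field paths. A toy sketch, where the scoring rule is an assumption for illustration, not ARI's learned $g_\theta$:

```python
def parent(path: str) -> str:
    """Parent of a dotted field path ('' for a root field)."""
    return path.rsplit(".", 1)[0] if "." in path else ""

def pairwise_score(m1, m2):
    """Reward two matches when one match's fields are the parents of the
    other's, i.e. the matches are structurally consistent."""
    (s1, t1), (s2, t2) = m1, m2
    if (parent(s2), parent(t2)) == (s1, t1) or (parent(s1), parent(t1)) == (s2, t2):
        return 1.0
    return 0.0

bonus = pairwise_score(("user", "person"), ("user.email", "person.email"))
```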
Constrained optimization
The MAP problem can be written as an integer quadratic program. Using $\theta_{ij} = -\psi_u(s_i, t_j)$ and $\theta_{ij,kl} = -\psi_p\big((s_i, t_j), (s_k, t_l)\big)$, the MAP objective becomes (equivalently)

$$\max_{x \in \{0,1\}^{m \times n}} \;\sum_{i,j} \theta_{ij}\, x_{ij} \;+\; \sum_{(i,j) \neq (k,l)} \theta_{ij,kl}\, x_{ij}\, x_{kl}$$

subject to:

One-to-one constraints

$$\sum_{j} x_{ij} \le 1 \quad \forall i, \qquad \sum_{i} x_{ij} \le 1 \quad \forall j.$$

Type / ontology constraints

$$x_{ij} = 0 \quad \text{whenever the types of } s_i \text{ and } t_j \text{ are incompatible.}$$

Structural constraints

- Mutual exclusion: $x_{ij} + x_{kl} \le 1$ for conflicting pairs of matches
- Hierarchical consistency: $x_{ij} \le x_{\mathrm{pa}(i),\mathrm{pa}(j)}$, i.e. matching child fields requires matching their parent fields
This yields an ILP / quadratic optimization problem.
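For toy problem sizes the one-to-one constrained objective can be solved by brute force, which makes the program concrete. This sketch drops the pairwise terms for brevity; a real system would hand the full quadratic program to an ILP/QP solver:

```python
from itertools import product

def solve_assignment(theta):
    """Maximize sum_ij theta[i][j] * x_ij subject to one-to-one
    constraints, by enumerating all feasible assignments (toy sizes only)."""
    m = len(theta)
    n = len(theta[0]) if m else 0
    best, best_val = set(), 0.0
    # Each source picks one target index, or -1 for "unmatched".
    for choice in product(range(-1, n), repeat=m):
        used = [j for j in choice if j >= 0]
        if len(used) != len(set(used)):
            continue  # two sources chose the same target: infeasible
        val = sum(theta[i][j] for i, j in enumerate(choice) if j >= 0)
        if val > best_val:
            best_val, best = val, {(i, j) for i, j in enumerate(choice) if j >= 0}
    return best, best_val

theta = [[3.0, 1.0],
         [2.0, 2.5]]
M_star, val = solve_assignment(theta)  # picks (0,0) and (1,1): 3.0 + 2.5
```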
Solution
The optimal mapping is:

$$M^{*} = \big\{(s_i, t_j) : x^{*}_{ij} = 1\big\},$$

where $x^{*}$ is the optimizer of the constrained program above.
Training objective
The models are trained over heterogeneous datasets of schema pairs with reference mappings:

$$\mathcal{D} = \big\{\big(S^{(n)}, T^{(n)}, M^{(n)}\big)\big\}_{n=1}^{N}.$$

Optimize:

$$\min_{\theta} \; \sum_{n=1}^{N} \mathcal{L}\big(M^{(n)}; S^{(n)}, T^{(n)}, \theta\big) \;+\; \lambda \lVert \theta \rVert^{2},$$

e.g. with a structured hinge or negative log-likelihood loss.