<schemaHandling>
<objectType>
...
<attribute>
<ref>icfs:name</ref>
<correlator/> (2)
<inbound> (1)
<target>
<path>name</path>
</target>
</inbound>
</attribute>
...
</objectType>
</schemaHandling>
Correlation
| Correlation feature: This page is an introduction to the midPoint Correlation feature. Please see the feature page for more details. |
Introduction
Correlation (also known as smart correlation) is a mechanism used to correlate identity data to existing focus objects in the repository. It is typically used during the synchronization process to match newly discovered accounts on a resource with midPoint focus objects, or during a manual or automated registration of new users (including self-registration).
The goal is a configurable mechanism that supports approximate matching. A match can then be resolved automatically if it meets a defined confidence threshold, or manually by a human operator.
To see how to configure correlation in the GUI, refer to Resource wizard: Object type correlation.
Configuration
The correlation mechanism is based on correlation rules, technically called correlators. For example, a rule can state that "if the family name, date of birth, and national ID all match, then the identity is the same". Another rule can state that "if (only) the national ID matches, then the identity is the same with a confidence level of 0.7" (i.e., 70% confidence).
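The two rules quoted above can be pictured with a small Python sketch. It is a conceptual illustration only, not midPoint code: the `Rule` class, the `decide` helper, the sample records, and the choice to let the highest-confidence matching rule win are all assumptions made for this example. How midPoint actually combines rule results is described under Rule Composition.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Record = Dict[str, str]


@dataclass(frozen=True)
class Rule:
    items: Tuple[str, ...]  # properties that must all match exactly
    confidence: float       # confidence assigned when they do


# The two rules from the text (hypothetical in-memory representation).
RULES = (
    Rule(("familyName", "dateOfBirth", "nationalId"), 1.0),
    Rule(("nationalId",), 0.7),
)


def rule_matches(rule: Rule, incoming: Record, candidate: Record) -> bool:
    # Every listed item must be present and equal on both sides.
    return all(
        incoming.get(item) is not None and incoming.get(item) == candidate.get(item)
        for item in rule.items
    )


def confidence(incoming: Record, candidate: Record) -> float:
    # Assumption of this sketch: the best matching rule wins; 0.0 means no rule matched.
    return max(
        (r.confidence for r in RULES if rule_matches(r, incoming, candidate)),
        default=0.0,
    )


def decide(conf: float, threshold: float = 1.0) -> str:
    # At or above the threshold the match is resolved automatically,
    # otherwise a human operator has to look at it.
    if conf >= threshold:
        return "automatic"
    return "manual" if conf > 0.0 else "no match"


incoming = {"familyName": "Sparrow", "dateOfBirth": "1690-01-01", "nationalId": "123"}
same = dict(incoming)
renamed = {**incoming, "familyName": "Sparow"}

print(decide(confidence(incoming, same)))     # automatic
print(decide(confidence(incoming, renamed)))  # manual
```

In the second call only the national ID rule matches, so the candidate ends up with confidence 0.7 and is left for manual resolution.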
| In the future, we plan to provide AI-assisted correlation that will also suggest correlation candidates based on how humans resolved previously disputed correlation cases. At that point, correlation rules will no longer be the only, or even the primary, source of correlation suggestions. Currently, however, they are the only driver of the correlation algorithm. |
Correlation rule types
There are the following types of correlation rules:
| Type | Meaning |
|---|---|
| items | Item-based correlation rule (recommended). |
| filter | Legacy filter-based correlation rule. |
| expression | Experimental rule, based on an evaluation of a custom expression. |
| idMatch | Rule that uses an external ID Match service. See Identity Matching (Correlation) Implementation for more information. |
| Strictly speaking, there is also a composite rule that aggregates the results of its children. However, it is currently supported only as a top-level rule, i.e., it is present automatically, without the possibility (or need) to specify it explicitly. |
Correlation configuration placement
Correlation configuration can reside in the following places:
- A resource object type definition: either in a top-level correlation item, or distributed into individual attribute definitions.
- An object template, currently in a top-level correlation item. [1]
The reason for such flexibility is that in some scenarios, correlation is bound to a certain type of focus objects, regardless of the origin of identity data we need to correlate. They can come from any resource or (in the future) they may come from registration or self-registration processes. In other scenarios, though, correlation rules are specific to a resource object type.
When present, the configuration attached to the resource object type takes precedence over the one connected to the object template.
| The configuration attached to an object template requires the use of archetypes. See Limitations. |
Configuration examples
Example 1: Attribute-bound definition
The following is the most basic example: an attribute is mapped to a focus property that serves as a correlation item.
icfs:name serving as a correlation attribute
| 1 | Means that the icfs:name attribute is mapped to the name focus property. |
| 2 | Means that the account is correlated to the focus objects by searching for the corresponding value of the name property. |
If multiple attributes are marked as correlator, any of them matching is enough for an overall match.
Technically, correlators are evaluated separately; see rule composition for details.
If you need to evaluate two attributes together (i.e., they both have to match), you need to use the explicit items correlator.
| Correlation takes place before the regular inbound mappings are evaluated. That is why there is a special inbound mapping evaluation mode: correlation-time evaluation. Even though it is turned off by default, the attribute-level correlator element automatically turns it on for the selected inbound mapping. |
Example 2: Resource object type bound definition
Here we show the same logic defined at the level of the resource object type:
icfs:name serving as a correlation attribute (defined at the level of resource object type)
<schemaHandling>
<objectType>
...
<attribute>
<ref>icfs:name</ref>
<inbound>
<target>
<path>name</path>
</target>
</inbound>
</attribute>
...
<correlation>
<correlators>
<items>
<item>
<ref>name</ref> (1)
</item>
</items>
</correlators>
</correlation>
...
</objectType>
</schemaHandling>
| 1 | Declaring the name to be the correlation item. |
| As in Example 1, mentioning name as a correlation item enables the correlation-time inbound processing. |
Example 3: Object template based correlation definition
Finally, this is how the correlation can be defined at the level of an object template. Here we show a rule requiring that both given name and family name match.
<objectTemplate oid="6eb46cb4-d707-4d91-a4ae-1a081bcfe16d" xmlns="...">
...
<correlation>
<correlators>
<items>
<item>
<ref>givenName</ref>
</item>
<item>
<ref>familyName</ref>
</item>
</items>
</correlators>
</correlation>
</objectTemplate>
The correlation-time inbound processing is automatically enabled also in this case. The object template must be connected to the resource object type via the archetype declared in the object type definition.[2] An example:
<resource oid="..." xmlns="...">
...
<schemaHandling>
<objectType>
...
<focus>
<type>UserType</type>
<archetypeRef oid="36d04df1-8f81-4442-b576-97b54c716245" />
</focus>
...
</objectType>
</schemaHandling>
</resource>
<archetype oid="36d04df1-8f81-4442-b576-97b54c716245" xmlns="...">
...
<archetypePolicy>
<objectTemplateRef oid="6eb46cb4-d707-4d91-a4ae-1a081bcfe16d"/>
</archetypePolicy>
...
</archetype>
Example 4: Correlation for outbound resources
Correlation relies on an inbound mapping that converts a resource attribute to a property of a midPoint object. This approach suits inbound resources well because it simplifies the configuration. However, there are use cases in which a strictly outbound resource has existing accounts that need to be correlated. In such cases, a regular inbound mapping is not desired.
For this situation, midPoint allows you to configure a mapping that is used only for correlation and not for "standard" processing (by the clockwork).
<schemaHandling>
<objectType>
...
<attribute>
<ref>icfs:name</ref>
<correlator/>
<inbound>
<target>
<path>name</path>
</target>
<use>correlation</use> (1)
</inbound>
<outbound> (2)
...
</outbound>
</attribute>
...
</objectType>
</schemaHandling>
| 1 | Means that the inbound mapping will be used only for correlation and will not be processed otherwise. |
| 2 | Represents the outbound mapping as usual. |
Advanced concepts
Multiple correlation rules
In more complex deployments, there may be multiple correlation rules. For example, we may want to correlate by given name, family name, date of birth, and national ID using the following rules:
| Rule# | Situation | Resulting confidence |
|---|---|---|
| 1 | Family name, date of birth, and national ID exactly match. | 1.0 |
| 2 | Given name, family name, and date of birth exactly match. | 0.4 |
| 3 | The national ID exactly matches. | 0.4 |
| For details on confidence values, see Rule Composition. |
These rules can be configured like this:
<objectTemplate>
...
<correlation>
<correlators>
<items>
<item>
<ref>familyName</ref>
</item>
<item>
<ref>extension/dateOfBirth</ref>
</item>
<item>
<ref>extension/nationalId</ref>
</item>
<composition>
<weight>1.0</weight> <!-- this is the default -->
</composition>
</items>
<items>
<item>
<ref>givenName</ref>
</item>
<item>
<ref>familyName</ref>
</item>
<item>
<ref>extension/dateOfBirth</ref>
</item>
<composition>
<weight>0.4</weight>
</composition>
</items>
<items>
<item>
<ref>extension/nationalId</ref>
</item>
<composition>
<weight>0.4</weight>
</composition>
</items>
</correlators>
</correlation>
</objectTemplate>
Many more configuration options are available here.
For example, we can specify the order in which rules are evaluated, as well as "A implies B" relations that ensure the confidence is computed correctly when rule A implies rule B.
For details, see Rule Composition.
Custom indexing
| This feature is available only when using the native repository implementation. |
Sometimes, we need to base the search on data indexed in a specific way. For example, we may need to match only the first five normalized characters of surnames. Or, when searching for a national ID, we may want to take only digits into account.
These requirements can be configured like this:
<objectTemplate>
...
<item>
<ref>familyName</ref>
<indexing>
<normalization>
<steps>
<polyString> (1)
<order>1</order>
</polyString>
<prefix> (2)
<order>2</order>
<length>5</length>
</prefix>
</steps>
</normalization>
</indexing>
</item>
<item>
<ref>extension/nationalId</ref>
<indexing>
<normalization>
<name>digits</name> (3)
<steps>
<custom>
<expression>
<script>
<code>
basic.stringify(input).replaceAll("[^\\d]", "") (4)
</code>
</script>
</expression>
</custom>
</steps>
</normalization>
</indexing>
</item>
...
</objectTemplate>
| 1 | Applies the default PolyString normalizer to the original value. |
| 2 | Takes the first 5 characters of the normalized value. |
| 3 | Name by which this normalization can be referenced. |
| 4 | Removes everything except for digits. |
These indexes are then used automatically when correlating according to familyName and extension/nationalId, respectively.
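To make the effect of these normalization steps more tangible, here is a rough Python sketch of the values such indexes might hold. It is an illustration under assumptions, not midPoint's implementation: `polystring_like` only approximates a PolyString-style normalizer (diacritics stripped, lowercased, whitespace collapsed), and all function names and sample values are hypothetical.

```python
import re
import unicodedata


def polystring_like(value: str) -> str:
    # Assumption: approximates a PolyString-style normalizer by stripping
    # diacritics, collapsing whitespace, and lowercasing.
    decomposed = unicodedata.normalize("NFKD", value)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", without_marks).strip().lower()


def prefix(value: str, length: int) -> str:
    # Keeps only the first `length` characters.
    return value[:length]


def family_name_index(value: str) -> str:
    # Step 1: normalize; step 2: take the first five characters.
    return prefix(polystring_like(value), 5)


def digits_only(value: str) -> str:
    # Same idea as the custom step in the listing: drop everything except digits.
    return re.sub(r"[^\d]", "", value)


print(family_name_index("Špárrow"))   # sparr
print(family_name_index("SPARROWS"))  # sparr
print(digits_only("900101/1234"))     # 9001011234
```

Because both spellings of the surname produce the same indexed value, an exact-match search on the index treats them as equal.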
If there are multiple normalizations defined for a given focus item (and none is defined as the default one), we can select the one to be used by mentioning it within the correlation item definition:
<objectTemplate>
...
<correlation>
<correlators>
<items>
<item>
<ref>extension/nationalId</ref>
<search> (1)
<index>digits</index>
</search>
</item>
</items>
</correlators>
</correlation>
</objectTemplate>
| 1 | Points to the digits normalization for the extension/nationalId property. |
See Custom Indexing and The Items Correlator for more information.
Fuzzy searching
By default, searching is done using "exact match" criteria, either on original values or on values that have gone through the standard or custom normalization. Sometimes, however, we want to search for objects that have a property value similar to the value we have at hand. For example, we get an account for Jack Sparrow, but besides matching users with the surname Sparrow, we may also want to consider users named Sparow, Sparrou, and so on, although potentially with a lower confidence value.
To do this, a fuzzy search logic was implemented. There are two methods available:
| Method | Description |
|---|---|
| Levenshtein edit distance | Matches according to the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. (From Wikipedia.) |
| Trigram similarity | Matches using the ratio of common trigrams to all trigrams in the compared strings. (See the PostgreSQL documentation.) |
| The fuzzy search is available only when using the native repository implementation. |
The example below searches for users whose given name and family name are close to the provided ones. The given name must be within a Levenshtein edit distance of 3 from the provided one. The family name must have a trigram similarity of at least 0.8 to the provided one.
<objectTemplate>
...
<correlation>
<correlators>
<items>
<item>
<ref>givenName</ref>
<search>
<fuzzy>
<levenshtein>
<threshold>3</threshold>
</levenshtein>
</fuzzy>
</search>
</item>
<item>
<ref>familyName</ref>
<search>
<fuzzy>
<similarity>
<threshold>0.8</threshold>
</similarity>
</fuzzy>
</search>
</item>
</items>
</correlators>
</correlation>
</objectTemplate>
See Fuzzy Searching for more information.
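For intuition about the two measures, the following Python sketch computes both for the Sparrow example. It is a conceptual illustration only. In midPoint the comparison is carried out by the native repository, and the padding used in `trigrams` (two leading spaces and one trailing space per word) is an assumption modeled on how PostgreSQL's trigram matching is commonly described.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, kept to two rows.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def trigrams(text: str) -> set:
    # Assumption: lowercase each word and pad it with two leading spaces
    # and one trailing space before cutting it into 3-character pieces.
    result = set()
    for word in text.lower().split():
        padded = "  " + word + " "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def trigram_similarity(a: str, b: str) -> float:
    # Ratio of shared trigrams to all distinct trigrams of both strings.
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(levenshtein("Sparrow", "Sparow"))                   # 1
print(levenshtein("Sparrow", "Sparrou"))                  # 1
print(round(trigram_similarity("Sparrow", "Sparow"), 3))  # 0.667
```

Under these assumptions, Sparow is within an edit distance of 3 from Sparrow, but it would not pass a trigram similarity threshold of 0.8. The two methods can therefore accept different candidates, and thresholds are best tuned against real data.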
Multiple identity data sources
Advanced correlation often needs to go hand in hand with situations when there are multiple sources of identity data. For example, a university may have its Student Information System (SIS) providing data on students and faculty, a Human Resources (HR) System keeping records of all staff - faculty and others, and an "External persons" (EXT) system for maintaining data about visitors and other persons related to the university in a way other than being a student or employee.
While the data about a person are usually consistent, there may be situations when they differ. For example, the given name may be recorded differently in the SIS and HR systems. Or a title may not be updated in HR. An old record in the "external persons" system may be out-of-date altogether.
This situation leads to the following kinds of requirements:
- When processing data from these systems, midPoint has to be able to decide which ones are "authoritative", that is, which ones to propagate to the "official" user data stored in the repository.
- When correlating, we may want to match data from all systems for the candidate owners. (Not only the "official" user data.)
MidPoint supports these requirements. For the first one, the engineer must provide an algorithm for determining the authoritative data source. The second one is provided transparently, by indexing the data from all identity sources.
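The two requirements can be sketched in a few lines of Python. This is a conceptual model only, not midPoint's data structures: the `Identities` mapping, the source labels, and both helper functions are hypothetical names introduced for this illustration.

```python
from typing import Dict, Optional, Sequence

# Hypothetical model: source name -> property name -> value recorded by that source.
Identities = Dict[str, Dict[str, str]]


def matches_any_source(identities: Identities, prop: str, value: str) -> bool:
    # Correlation looks at the data from every source,
    # not only at the "official" value stored on the user.
    return any(data.get(prop) == value for data in identities.values())


def authoritative_source(
    identities: Identities,
    order: Sequence[str] = ("SIS", "HR", "EXT"),
) -> Optional[str]:
    # The first source in the configured order that has data for this user wins.
    return next((source for source in order if source in identities), None)


user = {
    "HR": {"givenName": "John", "familyName": "Sparrow"},
    "SIS": {"givenName": "Jack", "familyName": "Sparrow"},
}

print(matches_any_source(user, "givenName", "John"))  # True
print(authoritative_source(user))                     # SIS
```

Here the incoming given name John matches the HR record even though SIS, the more authoritative source, records the name as Jack.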
The following example shows how to configure the givenName, familyName, dateOfBirth, and nationalId as "multi-source" properties.
They are kept separately for each source: SIS, HR, and "external persons" system.
The order of "authoritativeness" is: SIS, HR, external, as can be seen in the defaultAuthoritativeSource mapping.
<objectTemplate>
...
<item>
<ref>givenName</ref>
<multiSource/> (1)
</item>
<item>
<ref>familyName</ref>
<multiSource/>
</item>
<item>
<ref>extension/dateOfBirth</ref>
<multiSource/>
</item>
<item>
<ref>extension/nationalId</ref>
<multiSource/>
</item>
...
<multiSource>
<defaultAuthoritativeSource> (2)
<expression>
<script>
<code>
import com.evolveum.midpoint.util.MiscUtil
def RESOURCE_SIS_OID = '...'
def RESOURCE_HR_OID = '...'
def RESOURCE_EXT_OID = '...'
// The order of authoritativeness is: SIS, HR, external
if (identity == null) {
return null
}
def sources = identity
.collect { it.source }
.findAll { it != null }
def sis = sources.find { it.resourceRef?.oid == RESOURCE_SIS_OID }
def hr = sources.find { it.resourceRef?.oid == RESOURCE_HR_OID }
def external = sources.find { it.resourceRef?.oid == RESOURCE_EXT_OID }
MiscUtil.getFirstNonNull(sis, hr, external)
</code>
</script>
</expression>
</defaultAuthoritativeSource>
</multiSource>
</objectTemplate>
| 1 | Marks a property as "multi-source". |
| 2 | A mapping that selects the most authoritative data source for a given user. |
See Multiple Identity Data Sources for more information.
Limitations
As a general rule, when referencing a configuration related to correlation (including custom indexing or multi-source processing) in an object template, the configuration must be bound to the resource object type in question via a statically defined archetype (see Listings 3 and 4 in Example 3: Object template based correlation definition).
Other limitations are mentioned on pages for individual sub-features:
Compliance
This feature is related to the following compliance frameworks: