Aggregating Results

Results from content and context classifications can be aggregated to provide the highest scoring labels for a given column's data.

In a production environment where input data might not always be clean, aggregation increases confidence in category suggestions by combining multiple classification perspectives.

Content Aggregation

Content aggregation provides PII category suggestions for an individual object or column cell, which are then compiled to arrive at a single category label for the column itself.

To aggregate a series of content entries, send a POST request to the /text/classify endpoint with the list of values to be aggregated:

POST /text/classify
{
    "data": [
        {
            "content": {
                "data": [
                    "sample@emailaddress.com",
                    "data.com",
                    "test@mail.com"
                ],
                "method_params": {
                    "decision_method": "direct-mapping",
                    "aggregation_method": "mean"
                }
            }
        }
    ]
}
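
The request above could be assembled and sent from Python roughly as follows. This is a minimal sketch using only the standard library. The base URL and the helper names build_content_payload and post_classify are assumptions made for illustration and are not part of Fidescls.

```python
import json
import urllib.request

# Assumption: address where the classify API is served; adjust for your deployment.
BASE_URL = "http://localhost:8765"


def build_content_payload(values, decision_method="direct-mapping", aggregation_method="mean"):
    """Build the request body shown above for one column's cell values."""
    return {
        "data": [
            {
                "content": {
                    "data": list(values),
                    "method_params": {
                        "decision_method": decision_method,
                        "aggregation_method": aggregation_method,
                    },
                }
            }
        ]
    }


def post_classify(payload, base_url=BASE_URL):
    """POST the payload to /text/classify and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/text/classify",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Build and inspect the body locally; call post_classify(payload) against a running server.
payload = build_content_payload(["sample@emailaddress.com", "data.com", "test@mail.com"])
print(json.dumps(payload, indent=4))
```

Keeping payload construction separate from the network call makes it possible to check the request body without a running service.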

Sample Response
[
    {
        "content": [
            {
                "input": [
                    "sample@emailaddress.com",
                    "data.com",
                    "test@mail.com"
                ],
                "labels": [
                    {
                        "label": "user.contact.email",
                        "score": 1.0,
                        "weighted_score": null
                    },
                    {
                        "label": null,
                        "score": 0.333,
                        "weighted_score": null
                    }
                ],
                "aggregation_method": "mean"
            }
        ]
    }
]
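
A client will usually want the single best label from a response like this one. The sketch below is one possible approach, assuming the response has exactly the shape of the sample above. The helper name top_content_label is hypothetical, and skipping entries whose label is null is a choice made here rather than documented Fidescls behavior.

```python
def top_content_label(response):
    """Return (label, score) for the highest scoring non-null label, or None."""
    labels = response[0]["content"][0]["labels"]
    # Assumption: a null label means "no category matched", so it is skipped.
    named = [entry for entry in labels if entry["label"] is not None]
    if not named:
        return None
    best = max(named, key=lambda entry: entry["score"])
    return best["label"], best["score"]


# Trimmed copy of the sample response above.
sample_response = [
    {
        "content": [
            {
                "labels": [
                    {"label": "user.contact.email", "score": 1.0, "weighted_score": None},
                    {"label": None, "score": 0.333, "weighted_score": None},
                ],
                "aggregation_method": "mean",
            }
        ]
    }
]

print(top_content_label(sample_response))  # ('user.contact.email', 1.0)
```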

System Aggregation

Fidescls can aggregate the results of both Content and Context classification to provide you with a system-level suggestion for a given column's contents and metadata.

To aggregate system data, send a POST request to the /text/classify endpoint with both content and context data, along with a data_aggregation object:

POST /text/classify
{
    "data": [
        {
            "content": {
                "data": [
                    "sample@aol.com",
                    "mail.com"
                ],
                "method_params": {
                    "decision_method": "direct-mapping",
                    "aggregation_method": "mean"
                }
            },
            "context": {
                "data": "email_address",
                "method": "similarity",
                "method_params": {
                    "possible_targets": [
                        "user.device.ip_address",
                        "user.financial.account_number",
                        "user.contact.email",
                        "user.contact.phone_number",
                        "user.contact.address.street",
                        "user.contact.address.city",
                        "user.contact.address.state",
                        "user.contact.address.country",
                        "user.contact.address.postal_code"
                    ],
                    "top_n": 2,
                    "remove_stop_words": false,
                    "pii_threshold": 0.6
                }
            }
        }
    ],
    "data_aggregation": {
        "method": "weighted",
        "method_params": {
            "context_weight": 0.6,
            "content_weight": 0.4,
            "top_n": 3
        }
    }
}

Sample Response
[
    {
        "content_input": [
            "sample@aol.com",
            "mail.com"
        ],
        "context_input": [
            "email_address"
        ],
        "aggregation_method": "weighted",
        "aggregation_params": {
            "context_weight": 0.6,
            "content_weight": 0.4,
            "top_n": 3
        },
        "aggregated_labels": [
            {
                "label": "user.contact.email",
                "score": 0.7914,
                "weighted_score": 0.4748,
                "position_start": null,
                "position_end": null
            },
            {
                "label": "user.contact.address.postal_code",
                "score": 0.7403,
                "weighted_score": 0.4442,
                "position_start": null,
                "position_end": null
            },
            {
                "label": "user.contact.email",
                "score": 1.0,
                "weighted_score": 0.4
            }
        ]
    }
]

The data_aggregation object uses a weighted scale to compile Content and Context results. The weights can be adjusted in the context_weight and content_weight fields.

You can specify the number of results you would like returned with top_n.

Classifier Weights

When dealing with system aggregation, Fidescls uses a weighted scale to accommodate the measurement differences between content and context classification methods.

The weights are fractions that must add up to 1. They are adjustable via the method_params field of data_aggregation, and act as multiplication factors applied to the classification scores.

By default, context is weighted more heavily than content.
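
The arithmetic behind the weighted scores in the System Aggregation sample response can be reproduced with a short sketch. It assumes that each weighted_score is the raw score multiplied by the weight for its classifier, rounded to four decimal places, and that results are ranked by weighted_score before top_n is applied. The function names and the rounding are assumptions for illustration, not the Fidescls implementation.

```python
import math


def apply_weight(labels, weight):
    """Return copies of the label entries with weighted_score = score * weight."""
    return [{**entry, "weighted_score": round(entry["score"] * weight, 4)} for entry in labels]


def aggregate(context_labels, content_labels, context_weight, content_weight, top_n):
    """Combine context and content labels and keep the top_n by weighted score."""
    # The two weights must add to 1.
    if not math.isclose(context_weight + content_weight, 1.0):
        raise ValueError("context_weight and content_weight must add to 1")
    combined = apply_weight(context_labels, context_weight) + apply_weight(content_labels, content_weight)
    combined.sort(key=lambda entry: entry["weighted_score"], reverse=True)
    return combined[:top_n]


# Raw scores taken from the sample response above.
context_labels = [
    {"label": "user.contact.email", "score": 0.7914},
    {"label": "user.contact.address.postal_code", "score": 0.7403},
]
content_labels = [{"label": "user.contact.email", "score": 1.0}]

for entry in aggregate(context_labels, content_labels, context_weight=0.6, content_weight=0.4, top_n=3):
    print(entry["label"], entry["weighted_score"])
```

With a context_weight of 0.6 and a content_weight of 0.4, this yields weighted scores of 0.4748, 0.4442, and 0.4, matching the sample response. Lowering top_n to 1 would keep only the first entry.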