Query Execution

Graphs and Traversals

Fidesops uses your Datasets to generate a graph of the resources. Based on the identity data you provide, fidesops then generates a specific traversal, which is the order of steps that will be taken to fulfill a specific request.

The graph supports both directed and non-directed edges using the optional direction parameter on the relation (non-directional edges may be traversed in either direction). You can preview the queries that will be generated or manually control the order of operations by making relations explicitly directional and with the after Collection parameters. If you specify a Collection that can't be reached, fidesops generates an error.

An example graph

In this example there are three databases: a mysql database that stores users and their comments, a postgres DB that stores purchase information, and a mongoDB that stores user accounts. Each of them may have related data that we'd like to retrieve.

Example tables

The Dataset specification looks like this:

dataset:
  - fides_key: mongo_1
    collections:
      - name: users
        fields:
          - name: _id
            fidesops_meta:
              primary_key: True
          - name: user_name
            fidesops_meta:
              identity: username
           - name: full_name

      - name: accounts
        fields:
          - name: _id
            fidesops_meta:
              primary_key: True
          - name: name
            fidesops_meta:
                references:
                    - dataset: mongo_1
                      name: users.full_name
                      direction: from
          - name: comments

dataset:
  - fides_key: mysql_1
    collections:
      - name: users
        fields:
          - name: id
            fidesops_meta:
              primary_key: True
              references:
                - dataset: postgres_1
                  field: users.id
                  direction: from
          - name: internal_id
          - name: comment

dataset:
  - fides_key: postgres_1
    collections:
      - name: purchase_items
        fields:
          - name: id
            fidesops_meta:
              primary_key: True
          - name: purchase_id
            fidesops_meta:
              references:
                - dataset: postgres_1
                  field: purchases.id
                  direction: from
          - name: amount
          - name: rating

      - name: purchases
        fields:
          - name: id
            fidesops_meta:
              primary_key: True
          - name: user_id
            fidesops_meta:
              references:
                - dataset: postgres_1
                  field: users.id
          - name: amount
      - name: users
        fields:
          - name: id
            fidesops_meta:
              primary_key: True
          - name: email
              fidesops_meta:
                identity: email
          - name: address_id

We trigger a retrieval with identity data, such as an email address or user ID, that's provided by the user. What we do is...

Identify the collections that contain the identity data that belong to the user.
Find all related records.
Use the data to find all connected data.
Continue until we've found all related data.

For the first step, we use the concept of an identity. In the fidesops Dataset specification, any field may be marked with an identity notation:

collection:
  - name: foo
    fields:
    - name: bar
        fidesop_meta:
           identity: email 

What this means is that we will initiate the data retrieval process with provided data that looks like {"email": "user@example.com", "username": "someone"} by looking for values in the collection users where email == user@example.com. Note that the names of the provided starter data do not need to match the field names we're going to use this data to search. Also note that in this case, since we're providing two pieces of data, we can also choose to start a search using the username provided value. In the above diagram, this means we have enough data to search in both postgres_1.users.email and mongo_1.users.user_name.

How does fidesops execute queries?

The next step is to follow any links provided in field relationship information. In the abbreviated dataset declarations below, you can see that since we know that mongo_1.accounts data contains data related to mongo_1.users, we can retrieve data from mongo_1.accounts by generating this set of queries:

# mongo_1
1. db.users.find({"user_name":"someone"},{"_id":1, "full_name":1}) 
2. db.accounts.find({"name":{"$in":[  <full_name value from (1) > ]}},{"_id":1, "comments":1})

# postgres_1
3. select id, address_id from users where email = 'user@example.com';
4. select id, amount from purchases where user_id in [ <id values from (3) >] 
5. select id, amount, rating from purchase_items where purchase_id in [ <id values from (4)> ]

# mysql_1
6. select internal_id, comment from users where id in [ <id values from (3) >]

Logically, we are creating a linked graph using the connections you've specified between your collections to retrieve your data.

Example graph

Notes about Dataset traversals

You can define multiple links between collections, which will generate OR queries like SELECT a,b,c from TABLE_1 where name in (values from TABLE\_2) OR email in (values from TABLE\_3).
It's an error to specify a collection in your Dataset can't be reached through the relations you've specified.
Fidesops uses your Datasets and your input data to "solve" the graph of your collections and how it is traversed. If your Dataset has multiple identity values, you can create a situation where the query behavior depends on the values you provide. In the example above, starting the graph traversal with {"email": "value1", "username":" value2"} is valid, but starting with {"email": "value1"} fails because mongo_1.users is no longer reachable.
As shown in the example, you can create queries between Datasets.