What is a Dataset?
A fidesops Dataset is the configuration you provide for a database or other queryable datastore. We use the term Dataset rather than database to emphasize that this will ultimately apply to a wide variety of datastores beyond traditional databases. Within a Dataset, a collection is the term used for a SQL table, a MongoDB collection, or any other single coherent set of values.
Configure a Dataset
Beyond collection and field names, fidesops needs some additional information to fully configure a Dataset. Let's look at a simple example database, and how it would be translated into a configuration in fidesops.
An example database
Here we have a database of customers and addresses (the example is a bit simplified from an actual SQL schema). We have a customer table that has a foreign key of address_id to an address table.
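As a rough illustration of a schema like the one described here, the sketch below creates two tables in an in-memory SQLite database using Python's built-in sqlite3 module. Only the customer.address_id foreign key and the address.id column come from the text; every other column name is an assumption made for the example.

```python
import sqlite3

# Hypothetical columns: only customer.address_id and address.id come from the text.
SCHEMA = """
CREATE TABLE address (
    id     INTEGER PRIMARY KEY,
    street TEXT,
    city   TEXT
);
CREATE TABLE customer (
    id         INTEGER PRIMARY KEY,
    email      TEXT,
    address_id INTEGER REFERENCES address (id)
);
"""


def create_example_db() -> sqlite3.Connection:
    """Create the two example tables in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


conn = create_example_db()

# Each row of PRAGMA foreign_key_list is
# (id, seq, table, from, to, on_update, on_delete, match).
# Keep (source column, referenced table, referenced column).
foreign_keys = [
    (row[3], row[2], row[4])
    for row in conn.execute("PRAGMA foreign_key_list(customer)")
]
```

The foreign_keys list is the kind of relationship information that a Dataset declaration has to state explicitly, since fidesops does not discover the schema on its own.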
A fidesops Dataset consists of a declaration of fields, with metadata describing how those fields are related. We use the information about their relationship to navigate between different collections. The Dataset declaration for the above schema looks like:
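To make the shape of such a declaration concrete, here is a minimal sketch that models it as a nested Python dictionary. The key names (fides_key, collections, fields, data_categories, fidesops_meta, references, identity, primary_key, data_type) follow the field descriptions in this section. The category labels, the value given to identity, the email and street fields, and the exact keys of a reference entry are assumptions for illustration, not a verified fidesops configuration.

```python
from typing import Dict, List

# Hypothetical Dataset declaration mirroring the nested configuration structure.
# Category labels such as "example.category.street" are placeholders.
dataset: Dict[str, object] = {
    "fides_key": "mydatabase",
    "collections": [
        {
            "name": "address",
            "fields": [
                {"name": "id", "fidesops_meta": {"primary_key": True}},
                {"name": "street", "data_categories": ["example.category.street"]},
            ],
        },
        {
            "name": "customer",
            "fields": [
                {"name": "id", "fidesops_meta": {"primary_key": True}},
                {
                    "name": "address_id",
                    # Assumed shape of a reference entry; direction is omitted,
                    # so the relation may be traversed either way.
                    "fidesops_meta": {
                        "references": [{"field": "mydatabase.address.id"}]
                    },
                },
                {
                    "name": "email",
                    "data_categories": ["example.category.email"],
                    "fidesops_meta": {"identity": "email", "data_type": "string"},
                },
            ],
        },
    ],
}


def declared_fields(ds: Dict[str, object]) -> List[str]:
    """List every declared field as [dataset name].[collection name].[field name]."""
    return [
        f"{ds['fides_key']}.{collection['name']}.{field['name']}"
        for collection in ds["collections"]
        for field in collection["fields"]
    ]


addresses = declared_fields(dataset)
```

Because fidesops only addresses the fields you declare, the addresses list is also the complete set of fields a query could touch in this sketch.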
fides_key: A unique identifier name for the Dataset
collections: A list of addressable collections.
after: An optional list of Datasets that must be fully traversed before this Dataset is queried.
name: The name of the collection in your configuration must correspond to the name used for it in your datastore, since it will be used to generate query and update statements.
fields: A list of addressable fields in the collection. Specifying the fields in the collection tells fidesops what data to address in the collection.
after: An optional list of collections (in the form [dataset name].[collection name]) that must be fully traversed before this collection is queried.
name: The name of the field will be used to generate query and update statements. Please note that fidesops does not do automated schema discovery. It is only aware of the fields you declare. This means that the only fields that will be addressed and retrieved by fidesops queries are the fields you declare.
data_categories: Annotating data_categories connects fields to policy rules, and determines which actions apply to each field. For more information, see Policies.
fidesops_meta: The fidesops_meta section specifies some additional fields that control how fidesops manages your data:
references: A declaration of relationships between collections. Where the configuration declares a reference to mydatabase.address.id, it means fidesops will use the values from mydatabase.address.id to search for related values in customer. Unlike the SQL declaration, this is not an enforceable relationship, but simply a statement of which values are connected. In the example above, the reference to mydatabase.address.id is analogous to the SQL declaration customer.address_id REFERENCES address.id, with the exception that any Dataset and collection can be referenced. You must specify the Dataset as well as the collection, because you may declare a configuration with multiple Datasets, where values in a collection in one Dataset are searched using values found in another Dataset.
field: The specified linked field, using the syntax [dataset name].[collection name].[field name].
identity: Signifies that this field is an identity value that can be used as the root for a traversal. See graph traversal.
direction(Optional): Accepted values are from or to. This determines how fidesops uses the relationships to discover data. If the direction is to, fidesops will only use data in the source collection to discover data in the referenced collection. If the direction is from, fidesops will only use data in the referenced collection to discover data in the source collection. If the direction is omitted, fidesops will traverse the relation in whatever direction works to discover all related data.
primary_key(Optional): A boolean value that means that fidesops will treat this field as a unique row identifier for generating update statements. If no primary key is specified for any field on a collection, no updates will be generated against that collection. If multiple fields are marked as primary keys the combination of their values will be treated as a combined key. In SQL terms, we'd issue a query that looked like
SELECT ... FROM TABLE WHERE primary_key_name_1 = value1 AND primary_key_name_2 = value2.
data_type(Optional): An indication of the type of data held by this field. Data types are used to convert values to the appropriate type when those values are used in queries. This is especially necessary when using data of one type to help locate data of another type. Data types are also used to generate the appropriate masked value when running erasures, since fidesops needs to know the type of data expected by the field in order to generate an appropriate masked value. Available data types are
object types are also supported for MongoDB.
length(Optional): An indicator of field length.
return_all_elements(Optional): For array entrypoint fields, specify whether the query should return/mask all elements, or just matching elements. By default, we just return/mask matching elements. return_all_elements=true will return/mask the entire array.
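Two of the rules above are easier to follow as code: how multiple primary_key fields combine into a single row filter, and how direction limits which way a reference can be traversed. The sketch below is illustrative only. The function names are hypothetical and this is not how fidesops builds queries internally; it simply restates the documented behavior.

```python
from typing import Dict, List, Optional, Tuple


def primary_key_where(primary_keys: Dict[str, object]) -> Tuple[str, List[object]]:
    """Combine every primary-key field into one parameterized WHERE clause."""
    if not primary_keys:
        # Mirrors the documented rule: without a primary key, no update is generated.
        raise ValueError("no primary key declared for this collection")
    clause = " AND ".join(f"{name} = ?" for name in primary_keys)
    return f"WHERE {clause}", list(primary_keys.values())


def traversal_edges(
    source: str, referenced: str, direction: Optional[str] = None
) -> List[Tuple[str, str]]:
    """Return (collection with known data, collection it can discover) pairs."""
    if direction == "to":
        return [(source, referenced)]
    if direction == "from":
        return [(referenced, source)]
    if direction is None:
        # Direction omitted: the relation may be used either way.
        return [(source, referenced), (referenced, source)]
    raise ValueError(f"unsupported direction: {direction!r}")


where, params = primary_key_where({"primary_key_name_1": 1, "primary_key_name_2": "a"})
edges = traversal_edges("customer", "address", "from")
```

Using placeholders with a separate parameter list, rather than formatting values into the SQL string, is a general design choice for the sketch and not something the source specifies.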
Configure a manual Dataset
Not all data can be automatically retrieved. When services have no external API, or when user data is held in a physical location, you can define a Dataset to describe the types of manual fields you plan to upload, as well as any dependencies between these manual collections and other collections.
When a manual Dataset is defined, an in-progress access request will pause until the data is added manually, and then resume execution.
Describe a manual Dataset
In this example, the manual Dataset contains one collection, a storage unit, which has a box_id field.
Resume a paused access privacy request
A privacy request will pause execution when it reaches a manual collection in an access request. An administrator should manually retrieve the data and send it in a POST request. The fields should match the fields on the paused collection.
Erasure requests with manual collections also need data added manually.
If no manual data can be found, simply pass in an empty list to resume the privacy request:
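A minimal sketch of preparing such a request with Python's standard library follows. The URL is a placeholder, the function name is hypothetical, and authentication is left out; consult the API reference for the real endpoint and required headers. The request is only built here, not sent.

```python
import json
import urllib.request
from typing import Dict, List, Sequence


def build_manual_input_request(
    resume_url: str,
    rows: List[Dict[str, object]],
    collection_fields: Sequence[str],
) -> urllib.request.Request:
    """Build a POST request whose JSON body is the list of manually retrieved rows."""
    for row in rows:
        # Keys on each row should match fields declared on the paused collection.
        unknown = set(row) - set(collection_fields)
        if unknown:
            raise ValueError(f"fields not on the paused collection: {sorted(unknown)}")
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        resume_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Placeholder URL: replace with the endpoint given in the API reference.
RESUME_URL = "https://fidesops.example.com/placeholder-resume-access"

with_data = build_manual_input_request(RESUME_URL, [{"box_id": 5}], ["box_id"])
# No manual data found: an empty list still resumes the request.
without_data = build_manual_input_request(RESUME_URL, [], ["box_id"])
```

Passing either object to urllib.request.urlopen would send it; that step is omitted so the sketch can run without a server.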
Resume a paused erasure privacy request
A privacy request will pause execution when it reaches a manual collection in an erasure request. An administrator should manually mask the records in question and send confirmation of the rows affected in a POST request.
If no manual data was destroyed, pass in a count of 0 to resume the privacy request:
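The confirmation can be sketched the same way. Here the body key row_count, the URL, and the function name are assumptions made for illustration; the source only states that a count of affected rows is sent in a POST request. The request is built but not sent.

```python
import json
import urllib.request


def build_erasure_confirmation(resume_url: str, rows_masked: int) -> urllib.request.Request:
    """Build a POST request confirming how many rows were manually masked."""
    if rows_masked < 0:
        raise ValueError("rows_masked cannot be negative")
    # "row_count" is an assumed key name; check the API reference for the real one.
    body = json.dumps({"row_count": rows_masked}).encode("utf-8")
    return urllib.request.Request(
        resume_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Placeholder URL: replace with the endpoint given in the API reference.
CONFIRM_URL = "https://fidesops.example.com/placeholder-resume-erasure"

nothing_destroyed = build_erasure_confirmation(CONFIRM_URL, 0)
```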