Define Datasets
What is a Dataset?
A fidesops Dataset is the configuration you provide for a database or other queryable datastore. We use the term Dataset rather than database to emphasize that this will ultimately apply to a wide variety of datastores beyond traditional databases. Within a Dataset, a collection is the term used for a SQL table, a MongoDB collection, or any other single coherent set of values.
Configure a Dataset
Beyond collection and field names, fidesops needs some additional information to fully configure a Dataset. Let's look at a simple example database, and how it would be translated into a configuration in fidesops.
An example database
Here we have a database of customers and addresses (the example is a bit simplified from an actual SQL schema). We have a customer table that has a foreign key of address_id to an address table:
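A minimal sketch of such a schema, written with Python's built-in sqlite3 module so it can be run as-is. Only the customer and address tables and the address_id foreign key come from the description above; every other column name is an assumption made for illustration.

```python
from __future__ import annotations

import sqlite3

# Illustrative schema: only `customer`, `address`, and `address_id` come from
# the text; the remaining columns are assumptions.
SCHEMA = """
CREATE TABLE address (
    id INTEGER PRIMARY KEY,
    street TEXT,
    city TEXT,
    state TEXT,
    zip TEXT
);

CREATE TABLE customer (
    id INTEGER PRIMARY KEY,
    name TEXT,
    email TEXT,
    address_id INTEGER REFERENCES address (id)
);
"""


def create_example_db() -> sqlite3.Connection:
    """Create the two example tables in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def foreign_keys(conn: sqlite3.Connection, table: str) -> list[tuple[str, str, str]]:
    """Return (column, referenced_table, referenced_column) for each foreign key."""
    rows = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    # Row layout: (id, seq, table, from, to, on_update, on_delete, match)
    return [(row[3], row[2], row[4]) for row in rows]
```

Calling `foreign_keys(create_example_db(), "customer")` reports the single relationship from customer.address_id to address.id, which is the link a Dataset declaration has to describe explicitly.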
A fidesops Dataset consists of a declaration of fields, with metadata describing how those fields are related. We use the information about their relationship to navigate between different collections. The Dataset declaration for the above schema looks like:
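As a hedged illustration, a declaration for the customer and address collections might be structured as follows, shown here as a Python dictionary rather than a configuration file. The key names follow the member descriptions in this document; the data category values, the field names other than address_id, and the exact shape of a reference entry are assumptions and have not been checked against a fidesops release.

```python
from __future__ import annotations

# Illustrative only: key names follow the member descriptions in this
# document; data category values and most field names are assumptions.
DATASET = {
    "fides_key": "mydatabase",
    "collections": [
        {
            "name": "address",
            "fields": [
                {
                    "name": "id",
                    "data_categories": ["system.operations"],
                    "fidesops_meta": {"primary_key": True},
                },
                {
                    "name": "street",
                    "data_categories": ["user.provided.identifiable.contact.street"],
                    "fidesops_meta": {"data_type": "string"},
                },
            ],
        },
        {
            "name": "customer",
            "fields": [
                {
                    "name": "id",
                    "data_categories": ["system.operations"],
                    "fidesops_meta": {"primary_key": True},
                },
                {
                    "name": "email",
                    "data_categories": ["user.provided.identifiable.contact.email"],
                    "fidesops_meta": {"identity": "email", "data_type": "string"},
                },
                {
                    "name": "address_id",
                    "data_categories": ["system.operations"],
                    # Assumed reference shape, using the
                    # [dataset name].[collection name].[field name] syntax.
                    "fidesops_meta": {
                        "references": [{"field": "mydatabase.address.id"}]
                    },
                },
            ],
        },
    ],
}


def declared_references(dataset: dict) -> list[tuple[str, str]]:
    """Return (source field address, referenced field address) pairs."""
    pairs = []
    for collection in dataset["collections"]:
        for field in collection["fields"]:
            source = f'{dataset["fides_key"]}.{collection["name"]}.{field["name"]}'
            for ref in field.get("fidesops_meta", {}).get("references", []):
                pairs.append((source, ref["field"]))
    return pairs
```

The helper `declared_references` is hypothetical and only walks the dictionary; here it returns one pair linking mydatabase.customer.address_id to mydatabase.address.id.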
Dataset members
fides_key: A unique identifier name for the Dataset.
collections: A list of addressable collections.
after: An optional list of Datasets that must be fully traversed before this Dataset is queried.
Collection members
name: The name of the collection in your configuration must correspond to the name used for it in your datastore, since it will be used to generate query and update statements.
fields: A list of addressable fields in the collection. Specifying the fields in the collection tells fidesops what data to address in the collection.
after: An optional list of collections (in the form [dataset name].[collection name]) that must be fully traversed before this collection is queried.
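To make the ordering constraint expressed by after concrete, the sketch below orders collections so that each one appears only after every collection in its after list. It uses Python's standard graphlib module and a hypothetical mapping of collection addresses; it illustrates the constraint only and is not how fidesops itself schedules traversal.

```python
from __future__ import annotations

from graphlib import TopologicalSorter


def traversal_order(after: dict[str, list[str]]) -> list[str]:
    """Order collections so each one comes after everything in its `after` list."""
    # TopologicalSorter takes {node: predecessors}, which matches `after`.
    return list(TopologicalSorter(after).static_order())


# Hypothetical declaration: customer must wait until address is traversed.
AFTER = {
    "mydatabase.customer": ["mydatabase.address"],
    "mydatabase.address": [],
}
ORDER = traversal_order(AFTER)
```

With this input, `ORDER` lists mydatabase.address before mydatabase.customer. A circular set of after declarations would make `static_order` raise `graphlib.CycleError`.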
Field members
name: The name of the field, which is used to generate query and update statements. Note that fidesops does not perform automated schema discovery: it is only aware of the fields you declare, so those are the only fields its queries will address and retrieve.
data_categories: Annotating data_categories connects fields to policy rules and determines which actions apply to each field. For more information, see Policies.
fidesops_meta: The fidesops_meta section specifies some additional fields that control how fidesops manages your data:
  references: A declaration of relationships between collections. Where the configuration declares a reference to mydatabase:address:id, it means fidesops will use the values from mydatabase.address.id to search for related values in customer. Unlike the SQL declaration, this is not an enforceable relationship, but simply a statement of which values are connected. In the example above, the reference from the customer field to mydatabase.address.id is analogous to a SQL statement customer id REFERENCES address.id, with the exception that any Dataset and collection can be referenced. You must specify the Dataset as well as the collection, because you may declare a configuration with multiple Datasets, where values in one collection in the first Dataset are searched using values found in the second Dataset.
  field: The specified linked field, using the syntax [dataset name].[collection name].[field name].
  identity: Signifies that this field is an identity value that can be used as the root for a traversal. See graph traversal.
  direction (Optional): Accepted values are from or to. This determines how fidesops uses the relationships to discover data. If the direction is to, fidesops will only use data in the source collection to discover data in the referenced collection. If the direction is from, fidesops will only use data in the referenced collection to discover data in the source collection. If the direction is omitted, fidesops will traverse the relation in whatever direction works to discover all related data.
  primary_key (Optional): A boolean value indicating that fidesops will treat this field as a unique row identifier when generating update statements. If no primary key is specified for any field on a collection, no updates will be generated against that collection. If multiple fields are marked as primary keys, the combination of their values will be treated as a combined key. In SQL terms, we'd issue a query that looked like SELECT ... FROM TABLE WHERE primary_key_name_1 = value1 AND primary_key_name_2 = value2.
  data_type (Optional): An indication of the type of data held by this field. Data types are used to convert values to the appropriate type when those values are used in queries. This is especially necessary when using data of one type to help locate data of another type. Data types are also used when running erasures, since fidesops needs to know the type of data the field expects in order to generate an appropriate masked value. Available data types are string, integer, float, boolean, and object_id. object types are also supported for MongoDB.
  length (Optional): An indicator of field length.
  return_all_elements (Optional): For array entrypoint fields, specifies whether the query should return/mask all elements of the array or just the matching elements. By default, only matching elements are returned/masked. return_all_elements=true will return/mask the entire array.
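The combined-key behavior described for primary_key can be sketched as a small helper that builds a parameterized WHERE clause from the fields marked as primary keys. The function name, the `?` placeholder style, and the sample field names are assumptions for illustration; this is not fidesops's own query generator.

```python
from __future__ import annotations


def primary_key_where(fields: list[dict], row: dict) -> tuple[str, list]:
    """Build a parameterized WHERE clause from the fields marked primary_key."""
    keys = [
        field["name"]
        for field in fields
        if field.get("fidesops_meta", {}).get("primary_key")
    ]
    if not keys:
        # Mirrors the rule above: without a primary key, no update is generated.
        raise ValueError("no primary key declared; no update can be generated")
    clause = " AND ".join(f"{name} = ?" for name in keys)
    return clause, [row[name] for name in keys]


# Hypothetical collection with a two-field combined key.
FIELDS = [
    {"name": "order_id", "fidesops_meta": {"primary_key": True}},
    {"name": "line_no", "fidesops_meta": {"primary_key": True}},
    {"name": "note", "fidesops_meta": {"data_type": "string"}},
]
CLAUSE, PARAMS = primary_key_where(FIELDS, {"order_id": 7, "line_no": 2, "note": "x"})
```

Here `CLAUSE` is `order_id = ? AND line_no = ?` and `PARAMS` is `[7, 2]`. Passing values as parameters rather than formatting them into the string is a design choice of this sketch.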
Configure a manual Dataset
Not all data can be automatically retrieved. When services have no external API, or when user data is held in a physical location, you can define a Dataset to describe the types of manual fields you plan to upload, as well as any dependencies between these manual collections and other collections.
When a manual Dataset is defined, an in-progress access request will pause until the data is added manually, and then resume execution.
Describe a manual Dataset
In the following example, the Manual Dataset contains one storage_unit collection. email is defined as the unit's identity, which will then be used to retrieve the box_id in the storage unit.
To add a Manual Dataset, first create a Manual ConnectionConfig. The following Manual Dataset can then be added to the new ConnectionConfig:
PATCH {{host}}/connection/
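A minimal sketch of what such a Manual Dataset might contain, expressed as a Python dictionary and serialized to JSON. The storage_unit collection and the email and box_id fields come from the description above; the fides_key value, the data category values, and the envelope the endpoint expects around the Dataset are assumptions.

```python
from __future__ import annotations

import json

# Illustrative Manual Dataset; `manual_input` and the data categories are
# assumed values, not taken from the source.
MANUAL_DATASET = {
    "fides_key": "manual_input",
    "collections": [
        {
            "name": "storage_unit",
            "fields": [
                {
                    "name": "box_id",
                    "data_categories": ["user.provided"],
                },
                {
                    "name": "email",
                    "data_categories": ["user.provided.identifiable.contact.email"],
                    "fidesops_meta": {"identity": "email", "data_type": "string"},
                },
            ],
        }
    ],
}


def identity_fields(dataset: dict) -> dict[str, list[str]]:
    """Map each collection name to the fields declared as identities."""
    result = {}
    for collection in dataset["collections"]:
        result[collection["name"]] = [
            field["name"]
            for field in collection["fields"]
            if "identity" in field.get("fidesops_meta", {})
        ]
    return result


BODY = json.dumps(MANUAL_DATASET, indent=2)
```

`identity_fields(MANUAL_DATASET)` confirms that email is the only identity on storage_unit, which is the value a traversal would start from before pausing for box_id.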
Resume a paused access privacy request
A privacy request will pause execution when it reaches a manual collection in an access request. An administrator should manually retrieve the data and send it in a POST request. The fields should match the fields on the paused collection.
Erasure requests that include manual collections also need data to be added manually.
POST {{host}}/privacy-request/{{privacy_request_id}}/manual_input
If no manual data can be found, simply pass in an empty list to resume the privacy request:
[]
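The request body can be prepared with a small helper that checks each row against the fields of the paused collection before serializing. This is a sketch under assumptions: the helper name is hypothetical, and representing each row as a JSON object keyed by field name is inferred from the statement that the fields should match the paused collection.

```python
from __future__ import annotations

import json


def manual_input_body(rows: list[dict], collection_fields: list[str]) -> str:
    """Serialize manually retrieved rows, rejecting fields the collection lacks."""
    allowed = set(collection_fields)
    for row in rows:
        unknown = set(row) - allowed
        if unknown:
            raise ValueError(f"fields not on the paused collection: {sorted(unknown)}")
    return json.dumps(rows)


# Hypothetical rows for the storage_unit collection.
STORAGE_UNIT_FIELDS = ["box_id", "email"]
FOUND = manual_input_body(
    [{"box_id": 5, "email": "customer-1@example.com"}], STORAGE_UNIT_FIELDS
)
NOTHING_FOUND = manual_input_body([], STORAGE_UNIT_FIELDS)
```

`NOTHING_FOUND` is the string `[]`, the empty list that resumes the request when no manual data exists.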
Resume a paused erasure privacy request
A privacy request will pause execution when it reaches a manual collection in an erasure request. An administrator should manually mask the records in question and send confirmation of the rows affected in a POST request.
POST {{host}}/privacy-request/{{privacy_request_id}}/erasure_confirm
If no manual data was destroyed, pass in a count of 0 to resume the privacy request:
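A sketch of building the confirmation body follows. The source states only that a count of affected rows is sent, so the key name `row_count` used here is an assumption, as is the helper itself.

```python
import json


def erasure_confirm_body(rows_masked: int) -> str:
    """Serialize the number of manually masked rows for the confirmation request."""
    if rows_masked < 0:
        raise ValueError("row count cannot be negative")
    # `row_count` is an assumed key name; the source only says a count is sent.
    return json.dumps({"row_count": rows_masked})


NOTHING_MASKED = erasure_confirm_body(0)
```

Under that assumption, `NOTHING_MASKED` is `{"row_count": 0}`, the body that resumes the request when no manual data was destroyed.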