Annotate the Dataset
Making the fidesctl tools available within the app's virtual environment is just the beginning. Next, configure fidesctl for this app by annotating its resources using manifest files.
First, create a fides_resources directory at the project root. This is where the manifest files will be stored.
Note: In a production app this directory can have any name, but it's a best practice to create a specific directory to house the fidesctl manifest files.
Fundamentally, the data ecosystem is built on data that is stored somewhere. In fidesctl, Datasets are used for granular, field-level annotations of exactly what data your systems are storing and where that data is stored. For example, an app might declare one dataset for a Postgres application database, a second dataset for a Mongo orders collection, and a third dataset for some CSV files in cloud storage. The Dataset resource provides a database-agnostic way to annotate the fields stored in these systems with Data Categories, providing a metadata layer consumable by other tooling.
This app contains a single PostgreSQL dataset. Create a dataset resource to annotate it by adding a flaskr_postgres_dataset.yml file to the fides_resources directory. To annotate this dataset correctly, go through each column of each table and answer the question: "What data categories are stored here?"
For this project, the file should declare one dataset for the app's PostgreSQL database, listing each table as a collection and each column as a field annotated with its data categories.
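One way to see the shape of such a file is to generate its skeleton from a table-to-column mapping. The Python sketch below is illustrative only: the render_dataset_yaml helper, the users table, its columns, and the category values are assumptions rather than this project's actual configuration. The key layout (dataset, fides_key, collections, fields, data_categories) should be checked against the Dataset resource documentation.

```python
from typing import Dict, List

# Hypothetical schema: table name -> column name -> data categories.
# An empty list marks a column that still needs to be annotated.
TABLES: Dict[str, Dict[str, List[str]]] = {
    "users": {
        "id": ["system.operations"],
        "email": ["user.provided.identifiable.contact.email"],
        "created_at": [],
    },
}


def render_dataset_yaml(
    fides_key: str,
    name: str,
    description: str,
    tables: Dict[str, Dict[str, List[str]]],
) -> str:
    """Render a dataset manifest as YAML text.

    Assumes every value is a plain scalar (no colons, quotes, or
    leading special characters), so no YAML escaping is performed.
    """
    lines = [
        "dataset:",
        f"  - fides_key: {fides_key}",
        f"    name: {name}",
        f"    description: {description}",
        "    collections:",
    ]
    # Walk each column of each table, as the annotation step describes.
    for table, columns in tables.items():
        lines.append(f"      - name: {table}")
        lines.append("        fields:")
        for column, categories in columns.items():
            lines.append(f"          - name: {column}")
            lines.append(f"            data_categories: [{', '.join(categories)}]")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_dataset_yaml(
            "flaskr_postgres_dataset",
            "Flaskr example database",
            "Hypothetical description for illustration",
            TABLES,
        )
    )
```

Columns rendered with an empty data_categories list are the ones still waiting for an answer to "What data categories are stored here?".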
As an alternative to manually authoring the resource file, you can use the generate-dataset CLI command. The CLI connects to the database and, based on the database schema, generates a non-annotated resource YAML file in the specified location.
Understanding the Dataset Resource
This YAML serves as the foundation of fideslang, the Fides language; it answers "What data and kinds of data do we have?" and "How is our data organized?". The language is built on declaring the types of data found in storage for your organization.
For traditional SQL databases, fidesctl defines the following:
- "datasets" as database schemas
- "collections" as database tables
- "fields" as database columns
For NoSQL datasets, fidesctl defines the following:
- "collections" as logical groupings of data fields (for example, in MongoDB this is called a "Collection")
- "fields" as references to individual data elements (for example, in MongoDB this is called a "field")
fideslang describes the kind of data contained in a dataset through attributes on each field. The following attributes are used:
| Name | Type | Description |
| --- | --- | --- |
| name | String | The name of this field |
| description | String | A description of what this field contains |
| data_categories | List[FidesKey] | The types of sensitive data, as defined by the taxonomy, that can be found in this field |
| data_qualifier | FidesKey | The level of deidentification for the dataset |
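The dataset, collection, and field hierarchy, together with the field attributes above, can be modeled roughly as follows. This is a minimal sketch for illustration, not the fideslang library's own classes: the names DatasetField, Collection, Dataset, and unannotated_fields are hypothetical, and FidesKey values are represented as plain strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetField:
    """One database column (SQL) or data element (NoSQL)."""
    name: str
    description: Optional[str] = None
    data_categories: List[str] = field(default_factory=list)
    data_qualifier: Optional[str] = None


@dataclass
class Collection:
    """One database table (SQL) or logical grouping of fields (NoSQL)."""
    name: str
    fields: List[DatasetField] = field(default_factory=list)


@dataclass
class Dataset:
    """One database schema, identified by its fides_key."""
    fides_key: str
    collections: List[Collection] = field(default_factory=list)


def unannotated_fields(dataset: Dataset) -> List[str]:
    """Return 'collection.field' for every field with no data categories."""
    return [
        f"{collection.name}.{fld.name}"
        for collection in dataset.collections
        for fld in collection.fields
        if not fld.data_categories
    ]


# Hypothetical example: one table with one annotated and one bare column.
example = Dataset(
    fides_key="flaskr_postgres_dataset",
    collections=[
        Collection(
            name="users",
            fields=[
                DatasetField(
                    name="email",
                    data_categories=["user.provided.identifiable.contact.email"],
                ),
                DatasetField(name="created_at"),
            ],
        )
    ],
)
```

Calling unannotated_fields(example) lists the fields that still need data categories assigned.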
For more detail on Dataset resources, see the full Dataset resource documentation.
As you progress through the tutorial, we recommend installing our fidesctl VS Code extension, which validates the syntax in real time as you write your resource files.
Maintaining a Dataset Resource
As apps add more databases and other services that store potentially sensitive data, we recommend making updates to this resource file a standard part of the development process for every new feature.
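One way to build this habit into the development process is a small check that flags columns present in the database but missing from the manifest. The sketch below assumes the manifest has already been parsed into a dictionary by a YAML loader of your choice and that the schema mapping is obtained separately; the find_unmanifested_columns function and the sample data are hypothetical.

```python
from typing import Dict, List, Set


def find_unmanifested_columns(
    schema: Dict[str, Set[str]], manifest: dict
) -> List[str]:
    """Return 'table.column' names that the manifest does not declare.

    schema maps each table name to its set of column names.
    manifest is the parsed dataset resource file.
    """
    declared = {
        (collection["name"], fld["name"])
        for dataset in manifest.get("dataset", [])
        for collection in dataset.get("collections", [])
        for fld in collection.get("fields", [])
    }
    return sorted(
        f"{table}.{column}"
        for table, columns in schema.items()
        for column in columns
        if (table, column) not in declared
    )


# Hypothetical data: a new "phone" column was added without an annotation.
sample_schema = {"users": {"id", "email", "phone"}}
sample_manifest = {
    "dataset": [
        {
            "fides_key": "flaskr_postgres_dataset",
            "collections": [
                {
                    "name": "users",
                    "fields": [
                        {"name": "id", "data_categories": ["system.operations"]},
                        {
                            "name": "email",
                            "data_categories": [
                                "user.provided.identifiable.contact.email"
                            ],
                        },
                    ],
                }
            ],
        }
    ]
}
```

A check like this could run in continuous integration and fail the build whenever the returned list is non-empty.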
Next: Annotate the System Resource
With the underlying database resource declared, you must now include the database in an application-level System resource annotation.