Configure Data Docs
Data Docs translate Expectations, Validation Results, and other metadata into human-readable documentation that is saved as static web pages. Because Data Docs are compiled automatically from your data tests, your documentation stays current. This guide covers how to configure additional locations where Data Docs should be created.
Prerequisites:
- Python version 3.8 to 3.11.
- An installation of GX Core.
- A preconfigured File Data Context. This guide assumes the variable `context` contains your Data Context.
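If you do not already have a File Data Context, the following minimal sketch (mirroring the sample code at the end of this guide) creates or loads one:

import great_expectations as gx

# Creates a File Data Context in the current working directory,
# or loads the existing one if it has already been initialized.
context = gx.get_context(mode="file")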
To host Data Docs in an environment other than a local or networked filesystem, you will also need to install the appropriate dependencies and configure access credentials accordingly:
- Optional. An installation of GX Core with support for Amazon S3 dependencies and credentials configured.
- Optional. An installation of GX Core with support for Google Cloud Storage dependencies and credentials configured.
- Optional. An installation of GX Core with support for Azure Blob Storage dependencies and credentials configured.
Procedure
Step-by-step instructions appear first; complete sample code for each supported environment is provided at the end of this guide.
- Define a configuration dictionary for your new Data Docs site.
The main component that requires customization in a Data Docs site configuration is its `store_backend`. The `store_backend` is a dictionary that tells GX where the Data Docs site will be hosted and how to access that location when the site is updated.
The specifics of the `store_backend` depend on the environment in which the Data Docs will be created. GX Core supports generation of Data Docs in local or networked filesystems, Amazon S3, Google Cloud Storage, and Azure Blob Storage.
To create a Data Docs site configuration, select one of the following environments and follow the corresponding instructions.
- Filesystem
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
A local or networked filesystem Data Docs site requires the following `store_backend` information:
- `class_name`: The name of the class to implement for accessing the target environment. For a local or networked filesystem this will be `TupleFilesystemStoreBackend`.
- `base_directory`: A path to the folder where the static sites should be created. This can be an absolute path, or a path relative to the root folder of the Data Context.
To define a Data Docs site configuration for a local or networked filesystem environment, update the value of `base_directory` in the following code and execute it:
base_directory = "uncommitted/data_docs/local_site/" # this is the default path (relative to the root folder of the Data Context) but can be changed as required
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleFilesystemStoreBackend",
"base_directory": base_directory,
},
}
An Amazon S3 Data Docs site requires the following `store_backend` information:
- `class_name`: The name of the class to implement for accessing the target environment. For Amazon S3 this will be `TupleS3StoreBackend`.
- `bucket`: The name of the Amazon S3 bucket that will host the Data Docs site.
- `prefix`: The path of the folder that will contain the Data Docs pages relative to the root of the Amazon S3 bucket. The combination of `bucket` and `prefix` must be unique across all Stores used by a Data Context.
- `boto3_options`: The credentials for accessing your Amazon S3 account. Amazon S3 supports multiple methods of providing credentials, such as use of an endpoint URL, access key, or role assignment. For more information on how to configure your Amazon S3 credentials, see Amazon's documentation for how to Configure the AWS CLI.
The `boto3_options` dictionary can contain the following keys, depending on how you have configured your credentials in the AWS CLI:
- `endpoint_url`: An AWS endpoint for service requests. Using this also requires `region_name` to be included in the `boto3_options`.
- `region_name`: The AWS region to send requests to. This must be included in the `boto3_options` if `endpoint_url` or `assume_role_arn` are used.
- `aws_access_key_id`: An AWS access key associated with an IAM account. Using this also requires `aws_secret_access_key` to be provided.
- `aws_secret_access_key`: The secret key associated with the access key. This is required if your `boto3_options` use the `aws_access_key_id` key, and can be considered the "password" for the access key specified by `aws_access_key_id`.
- `aws_session_token`: The value of the session token you retrieve directly from AWS STS operations when using temporary credentials.
- `assume_role_arn`: The Amazon Resource Name (ARN) of an IAM role with your access credentials. Using this also requires `assume_role_duration` to be included in the `boto3_options`.
- `assume_role_duration`: The duration of your session, measured in seconds. This is required if your `boto3_options` use the `assume_role_arn` key.
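For example, a minimal sketch of a `boto3_options` dictionary that authenticates with an access key pair; the environment variable names referenced through string substitution are assumptions and should match whatever you have configured:

boto3_options = {
    "aws_access_key_id": "${AWS_ACCESS_KEY_ID}",  # string substitution from an assumed environment variable
    "aws_secret_access_key": "${AWS_SECRET_ACCESS_KEY}",  # the matching secret key, also from an assumed environment variable
}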
To define a Data Docs site configuration for S3, update `bucket`, `prefix`, and `boto3_options` in the following code and execute it:
bucket = "my_s3_bucket"
prefix = "data_docs/"
boto3_options = {
"endpoint_url": "${S3_ENDPOINT}", # Uses string substitution to get the endpoint url form the S3_ENDPOINT environment variable.
"region_name": "<your>", # Use the name of your AWS region.
}
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleS3StoreBackend",
"bucket": bucket,
"prefix": prefix,
"boto3_options": boto3_options,
},
}
An Azure Blob Storage Data Docs site requires the following `store_backend` information:
- `class_name`: The name of the class to implement for accessing the target environment. For Azure Blob Storage this will be `TupleAzureBlobStoreBackend`.
- `container`: The name of the Azure Blob Storage container that will host the Data Docs site.
- `prefix`: The path of the folder that will contain the Data Docs pages relative to the root of the Azure Blob Storage container. The combination of `container` and `prefix` must be unique across all Stores used by a Data Context.
- `connection_string`: The connection string for your Azure Blob Storage. For more information on how to securely store your connection string, see Configure credentials.
To define a Data Docs site configuration in Azure Blob Storage, update the values of `container`, `prefix`, and `connection_string` in the following code and execute it:
container = "my_abs_container"
prefix = "data_docs/"
connection_string = "${AZURE_STORAGE_CONNECTION_STRING}" # This uses string substitution to get the actual connection string from an environment variable or config file.
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleAzureBlobStoreBackend",
"container": container,
"prefix": prefix,
"connection_string": connection_string,
},
}
A Google Cloud Storage Data Docs site requires the following `store_backend` information:
- `class_name`: The name of the class to implement for accessing the target environment. For Google Cloud Storage this will be `TupleGCSStoreBackend`.
- `project`: The name of the Google Cloud project that will host the Data Docs site.
- `bucket`: The name of the bucket that will contain the Data Docs pages.
- `prefix`: The path of the folder that will contain the Data Docs pages relative to the root of the GCS bucket. The combination of `bucket` and `prefix` must be unique across all Stores used by a Data Context.
To define a Data Docs site configuration for Google Cloud Storage, update the values of `project`, `bucket`, and `prefix` in the following code and execute it:
project = "my_project"
bucket = "my_gcs_bucket"
prefix = "data_docs_site/"
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleGCSStoreBackend",
"project": project,
"bucket": bucket,
"prefix": prefix,
},
}
For GX to access your Google Cloud Storage environment, you will also need to configure the appropriate credentials. By default, GCS credentials are handled through the gcloud command line tool and the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. The gcloud command line tool is used to set up authentication credentials, and the `GOOGLE_APPLICATION_CREDENTIALS` environment variable provides the path to the JSON file with those credentials. For more information on using the gcloud command line tool, see Google Cloud's Cloud Storage client libraries documentation.
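As a minimal sketch, the environment variable can also be set from Python before the Data Context is loaded; the credentials path below is a hypothetical placeholder:

import os

# Point GOOGLE_APPLICATION_CREDENTIALS at your service account key file
# before creating the Data Context. The path shown is a placeholder.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service_account_key.json"

import great_expectations as gx

context = gx.get_context(mode="file")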
- Add your configuration to your Data Context.
All Data Docs sites have a unique name within a Data Context. Once your Data Docs site configuration has been defined, add it to the Data Context by updating the value of `site_name` in the following code to something more descriptive and then executing it:
site_name = "my_data_docs_site"
context.add_data_docs_site(site_name=site_name, site_config=site_config)
- Optional. Build your Data Docs sites manually.
You can manually build a Data Docs site by executing the following code:
context.build_data_docs(site_names=site_name)
- Optional. Automate Data Docs site updates with Checkpoint Actions.
You can automate the creation and update of Data Docs sites by including the `UpdateDataDocsAction` in your Checkpoints. This Action will automatically trigger a Data Docs site build whenever the Checkpoint it is included in completes its `run()` method.
checkpoint_name = "my_checkpoint"
validation_definition_name = "my_validation_definition"
validation_definition = context.validation_definitions.get(validation_definition_name)
actions = [
gx.checkpoint.actions.UpdateDataDocsAction(
name="update_my_site", site_names=[site_name]
)
]
checkpoint = context.checkpoints.add(
gx.Checkpoint(
name=checkpoint_name,
validation_definitions=[validation_definition],
actions=actions,
)
)
result = checkpoint.run()
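After the run completes, a minimal sketch of confirming the outcome of the Checkpoint result returned above:

# Confirm the Checkpoint run succeeded; the UpdateDataDocsAction will have
# rebuilt the configured Data Docs site as part of the run.
print(result.success)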
- Optional. View your Data Docs.
Once your Data Docs have been created, you can view them with:
context.open_data_docs()
Sample code
GX Core supports Data Docs configurations for the following environments, and a complete example script for each is provided below:
- Filesystem
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
Filesystem
import great_expectations as gx
context = gx.get_context(mode="file")
# Define a Data Docs site configuration dictionary
base_directory = "uncommitted/data_docs/local_site/" # this is the default path (relative to the root folder of the Data Context) but can be changed as required
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleFilesystemStoreBackend",
"base_directory": base_directory,
},
}
# Add the Data Docs configuration to the Data Context
site_name = "my_data_docs_site"
context.add_data_docs_site(site_name=site_name, site_config=site_config)
# Manually build the Data Docs
context.build_data_docs(site_names=site_name)
# Automate Data Docs updates with a Checkpoint Action
checkpoint_name = "my_checkpoint"
validation_definition_name = "my_validation_definition"
validation_definition = context.validation_definitions.get(validation_definition_name)
actions = [
gx.checkpoint.actions.UpdateDataDocsAction(
name="update_my_site", site_names=[site_name]
)
]
checkpoint = context.checkpoints.add(
gx.Checkpoint(
name=checkpoint_name,
validation_definitions=[validation_definition],
actions=actions,
)
)
result = checkpoint.run()
# View the Data Docs
context.open_data_docs()
Amazon S3
import great_expectations as gx
context = gx.get_context(mode="file")
# Build Data Docs configuration dictionary
bucket = "my_s3_bucket"
prefix = "data_docs/"
boto3_options = {
"endpoint_url": "${S3_ENDPOINT}", # Uses string substitution to get the endpoint url form the S3_ENDPOINT environment variable.
"region_name": "<your>", # Use the name of your AWS region.
}
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleS3StoreBackend",
"bucket": bucket,
"prefix": prefix,
"boto3_options": boto3_options,
},
}
# Add the Data Docs configuration to the Data Context
site_name = "my_data_docs_site"
context.add_data_docs_site(site_name=site_name, site_config=site_config)
# Manually build the Data Docs
context.build_data_docs(site_names=site_name)
# Automate Data Docs updates with a Checkpoint Action
checkpoint_name = "my_checkpoint"
validation_definition_name = "my_validation_definition"
validation_definition = context.validation_definitions.get(validation_definition_name)
actions = [
gx.checkpoint.actions.UpdateDataDocsAction(
name="update_my_site", site_names=[site_name]
)
]
checkpoint = context.checkpoints.add(
gx.Checkpoint(
name=checkpoint_name,
validation_definitions=[validation_definition],
actions=actions,
)
)
result = checkpoint.run()
# View the Data Docs
context.open_data_docs()
Azure Blob Storage
import great_expectations as gx
context = gx.get_context(mode="file")
# Define a Data Docs configuration dictionary
container = "my_abs_container"
prefix = "data_docs/"
connection_string = "${AZURE_STORAGE_CONNECTION_STRING}" # This uses string substitution to get the actual connection string from an environment variable or config file.
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleAzureBlobStoreBackend",
"container": container,
"prefix": prefix,
"connection_string": connection_string,
},
}
# Add the Data Docs configuration to the Data Context
site_name = "my_data_docs_site"
context.add_data_docs_site(site_name=site_name, site_config=site_config)
# Manually build the Data Docs
context.build_data_docs(site_names=site_name)
# Automate Data Docs updates with a Checkpoint Action
checkpoint_name = "my_checkpoint"
validation_definition_name = "my_validation_definition"
validation_definition = context.validation_definitions.get(validation_definition_name)
actions = [
gx.checkpoint.actions.UpdateDataDocsAction(
name="update_my_site", site_names=[site_name]
)
]
checkpoint = context.checkpoints.add(
gx.Checkpoint(
name=checkpoint_name,
validation_definitions=[validation_definition],
actions=actions,
)
)
result = checkpoint.run()
# View the Data Docs
context.open_data_docs()
Google Cloud Storage
import great_expectations as gx
context = gx.get_context(mode="file")
# Define a Data Docs configuration dictionary
project = "my_project"
bucket = "my_gcs_bucket"
prefix = "data_docs_site/"
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleGCSStoreBackend",
"project": project,
"bucket": bucket,
"prefix": prefix,
},
}
# Add the Data Docs configuration to the Data Context
site_name = "my_data_docs_site"
context.add_data_docs_site(site_name=site_name, site_config=site_config)
# Manually build the Data Docs
context.build_data_docs(site_names=site_name)
# Automate Data Docs updates with a Checkpoint Action
checkpoint_name = "my_checkpoint"
validation_definition_name = "my_validation_definition"
validation_definition = context.validation_definitions.get(validation_definition_name)
actions = [
gx.checkpoint.actions.UpdateDataDocsAction(
name="update_my_site", site_names=[site_name]
)
]
checkpoint = context.checkpoints.add(
gx.Checkpoint(
name=checkpoint_name,
validation_definitions=[validation_definition],
actions=actions,
)
)
result = checkpoint.run()
# View the Data Docs
context.open_data_docs()