Configure Data Docs
Data Docs translate Expectations, Validation Results, and other metadata into human-readable documentation that is saved as static web pages. Because Data Docs are compiled automatically from your data tests, your documentation stays current. This guide covers how to configure additional locations where Data Docs should be created.
Prerequisites:
- Python version 3.8 to 3.11.
- An installation of GX Core.
- A preconfigured File Data Context. This guide assumes the variable `context` contains your Data Context.
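If you do not already have a File Data Context, the following minimal sketch (mirroring the sample code at the end of this guide) creates or loads one:

import great_expectations as gx

# Creates a File Data Context in the current working directory,
# or loads the existing one if it has already been initialized.
context = gx.get_context(mode="file")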
To host Data Docs in an environment other than a local or networked filesystem, you will also need to install the appropriate dependencies and configure access credentials accordingly:
- Optional. An installation of GX Core with support for Amazon S3 dependencies and credentials configured.
- Optional. An installation of GX Core with support for Google Cloud Storage dependencies and credentials configured.
- Optional. An installation of GX Core with support for Azure Blob Storage dependencies and credentials configured.
Procedure
Step-by-step instructions appear first; complete sample code for each supported environment is provided at the end of this guide.
- Define a configuration dictionary for your new Data Docs site.
The main component that requires customization in a Data Docs site configuration is its `store_backend`. The `store_backend` is a dictionary that tells GX where the Data Docs site will be hosted and how to access that location when the site is updated.
The specifics of the `store_backend` depend on the environment in which the Data Docs will be created. GX Core supports generation of Data Docs in local or networked filesystems, Amazon S3, Google Cloud Storage, and Azure Blob Storage.
To create a Data Docs site configuration, select one of the following environments and follow the corresponding instructions.
- Filesystem
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
A local or networked filesystem Data Docs site requires the following `store_backend` information:
- `class_name`: The name of the class to implement for accessing the target environment. For a local or networked filesystem this will be `TupleFilesystemStoreBackend`.
- `base_directory`: A path to the folder where the static sites should be created. This can be an absolute path, or a path relative to the root folder of the Data Context.
To define a Data Docs site configuration for a local or networked filesystem environment, update the value of `base_directory` in the following code and execute it:
base_directory = "uncommitted/data_docs/local_site/" # this is the default path (relative to the root folder of the Data Context) but can be changed as required
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleFilesystemStoreBackend",
"base_directory": base_directory,
},
}
An Amazon S3 Data Docs site requires the following `store_backend` information:
- `class_name`: The name of the class to implement for accessing the target environment. For Amazon S3 this will be `TupleS3StoreBackend`.
- `bucket`: The name of the Amazon S3 bucket that will host the Data Docs site.
- `prefix`: The path of the folder that will contain the Data Docs pages relative to the root of the Amazon S3 bucket. The combination of `bucket` and `prefix` must be unique across all Stores used by a Data Context.
- `boto3_options`: The credentials for accessing your Amazon S3 account. Amazon S3 supports multiple methods of providing credentials, such as use of an endpoint URL, access key, or role assignment. For more information on how to configure your Amazon S3 credentials, see Amazon's documentation for how to Configure the AWS CLI.
The `boto3_options` dictionary can contain the following keys, depending on how you have configured your credentials in the AWS CLI:
- `endpoint_url`: An AWS endpoint for service requests. Using this also requires `region_name` to be included in the `boto3_options`.
- `region_name`: The AWS region to send requests to. This must be included in the `boto3_options` if `endpoint_url` or `assume_role_arn` are used.
- `aws_access_key_id`: An AWS access key associated with an IAM account. Using this also requires `aws_secret_access_key` to be provided.
- `aws_secret_access_key`: The secret key associated with the access key. This is required if your `boto3_options` use the `aws_access_key_id` key, and can be considered the "password" for the access key specified by `aws_access_key_id`.
- `aws_session_token`: The value of the session token you retrieve directly from AWS STS operations when using temporary credentials.
- `assume_role_arn`: The Amazon Resource Name (ARN) of an IAM role with your access credentials. Using this also requires `assume_role_duration` to be included in the `boto3_options`.
- `assume_role_duration`: The duration of your session, measured in seconds. This is required if your `boto3_options` use the `assume_role_arn` key.
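For example, a minimal sketch of a `boto3_options` dictionary that authenticates with an access key pair; the environment variable names referenced through string substitution are assumptions and should match whatever you have configured:

boto3_options = {
    "aws_access_key_id": "${AWS_ACCESS_KEY_ID}",  # string substitution from an assumed environment variable
    "aws_secret_access_key": "${AWS_SECRET_ACCESS_KEY}",  # the matching secret key, also from an assumed environment variable
}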
To define a Data Docs site configuration for S3, update `bucket`, `prefix`, and `boto3_options` in the following code and execute it:
bucket = "my_s3_bucket"
prefix = "data_docs/"
boto3_options = {
"endpoint_url": "${S3_ENDPOINT}", # Uses string substitution to get the endpoint url form the S3_ENDPOINT environment variable.
"region_name": "<your>", # Use the name of your AWS region.
}
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleS3StoreBackend",
"bucket": bucket,
"prefix": prefix,
"boto3_options": boto3_options,
},
}
An Azure Blob Storage Data Docs site requires the following `store_backend` information:
- `class_name`: The name of the class to implement for accessing the target environment. For Azure Blob Storage this will be `TupleAzureBlobStoreBackend`.
- `container`: The name of the Azure Blob Storage container that will host the Data Docs site.
- `prefix`: The path of the folder that will contain the Data Docs pages relative to the root of the Azure Blob Storage container. The combination of `container` and `prefix` must be unique across all Stores used by a Data Context.
- `connection_string`: The connection string for your Azure Blob Storage. For more information on how to securely store your connection string, see Configure credentials.
To define a Data Docs site configuration in Azure Blob Storage, update the values of `container`, `prefix`, and `connection_string` in the following code and execute it:
container = "my_abs_container"
prefix = "data_docs/"
connection_string = "${AZURE_STORAGE_CONNECTION_STRING}" # This uses string substitution to get the actual connection string from an environment variable or config file.
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleAzureBlobStoreBackend",
"container": container,
"prefix": prefix,
"connection_string": connection_string,
},
}
A Google Cloud Storage Data Docs site requires the following `store_backend` information:
- `class_name`: The name of the class to implement for accessing the target environment. For Google Cloud Storage this will be `TupleGCSStoreBackend`.
- `project`: The name of the Google Cloud project that will host the Data Docs site.
- `bucket`: The name of the bucket that will contain the Data Docs pages.
- `prefix`: The path of the folder that will contain the Data Docs pages relative to the root of the GCS bucket. The combination of `bucket` and `prefix` must be unique across all Stores used by a Data Context.
To define a Data Docs site configuration for Google Cloud Storage, update the values of `project`, `bucket`, and `prefix` in the following code and execute it:
project = "my_project"
bucket = "my_gcs_bucket"
prefix = "data_docs_site/"
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleGCSStoreBackend",
"project": project,
"bucket": bucket,
"prefix": prefix,
},
}
For GX to access your Google Cloud Storage environment, you will also need to configure the appropriate credentials. By default, GCS credentials are handled through the gcloud command line tool and the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. The gcloud command line tool is used to set up authentication credentials, and the `GOOGLE_APPLICATION_CREDENTIALS` environment variable provides the path to the JSON file with those credentials. For more information on using the gcloud command line tool, see Google Cloud's Cloud Storage client libraries documentation.
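As a minimal sketch, the environment variable can also be set from Python before the Data Context is loaded; the credentials path below is a hypothetical placeholder:

import os

# Point GOOGLE_APPLICATION_CREDENTIALS at your service account key file
# before creating the Data Context. The path shown is a placeholder.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service_account_key.json"

import great_expectations as gx

context = gx.get_context(mode="file")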
- Add your configuration to your Data Context.
All Data Docs sites have a unique name within a Data Context. Once your Data Docs site configuration has been defined, add it to the Data Context by updating the value of `site_name` in the following code to something more descriptive and then executing it:
site_name = "my_data_docs_site"
context.add_data_docs_site(site_name=site_name, site_config=site_config)
- Optional. Build your Data Docs sites manually.
You can manually build a Data Docs site by executing the following code:
context.build_data_docs(site_names=site_name)
- Optional. Automate Data Docs site updates with Checkpoint Actions.
You can automate the creation and update of Data Docs sites by including the `UpdateDataDocsAction` in your Checkpoints. This Action will automatically trigger a Data Docs site build whenever the Checkpoint it is included in completes its `run()` method.
checkpoint_name = "my_checkpoint"
validation_definition_name = "my_validation_definition"
validation_definition = context.validation_definitions.get(validation_definition_name)
actions = [
gx.checkpoint.actions.UpdateDataDocsAction(
name="update_my_site", site_names=[site_name]
)
]
checkpoint = context.checkpoints.add(
gx.Checkpoint(
name=checkpoint_name,
validation_definitions=[validation_definition],
actions=actions,
)
)
result = checkpoint.run()
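After the run completes, a minimal sketch of confirming the outcome of the Checkpoint result returned above:

# Confirm the Checkpoint run succeeded; the UpdateDataDocsAction will have
# rebuilt the configured Data Docs site as part of the run.
print(result.success)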
- Optional. View your Data Docs.
Once your Data Docs have been created, you can view them with:
context.open_data_docs()
Sample code
GX Core supports Data Docs configurations for the following environments, and a complete example script for each is provided below:
- Filesystem
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
Filesystem
import great_expectations as gx
context = gx.get_context(mode="file")
# Define a Data Docs site configuration dictionary
base_directory = "uncommitted/data_docs/local_site/" # this is the default path (relative to the root folder of the Data Context) but can be changed as required
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleFilesystemStoreBackend",
"base_directory": base_directory,
},
}
# Add the Data Docs configuration to the Data Context
site_name = "my_data_docs_site"
context.add_data_docs_site(site_name=site_name, site_config=site_config)
# Manually build the Data Docs
context.build_data_docs(site_names=site_name)
# Automate Data Docs updates with a Checkpoint Action
checkpoint_name = "my_checkpoint"
validation_definition_name = "my_validation_definition"
validation_definition = context.validation_definitions.get(validation_definition_name)
actions = [
gx.checkpoint.actions.UpdateDataDocsAction(
name="update_my_site", site_names=[site_name]
)
]
checkpoint = context.checkpoints.add(
gx.Checkpoint(
name=checkpoint_name,
validation_definitions=[validation_definition],
actions=actions,
)
)
result = checkpoint.run()
# View the Data Docs
context.open_data_docs()
Amazon S3
import great_expectations as gx
context = gx.get_context(mode="file")
# Build Data Docs configuration dictionary
bucket = "my_s3_bucket"
prefix = "data_docs/"
boto3_options = {
"endpoint_url": "${S3_ENDPOINT}", # Uses string substitution to get the endpoint url form the S3_ENDPOINT environment variable.
"region_name": "<your>", # Use the name of your AWS region.
}
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleS3StoreBackend",
"bucket": bucket,
"prefix": prefix,
"boto3_options": boto3_options,
},
}
# Add the Data Docs configuration to the Data Context
site_name = "my_data_docs_site"
context.add_data_docs_site(site_name=site_name, site_config=site_config)
# Manually build the Data Docs
context.build_data_docs(site_names=site_name)
# Automate Data Docs updates with a Checkpoint Action
checkpoint_name = "my_checkpoint"
validation_definition_name = "my_validation_definition"
validation_definition = context.validation_definitions.get(validation_definition_name)
actions = [
gx.checkpoint.actions.UpdateDataDocsAction(
name="update_my_site", site_names=[site_name]
)
]
checkpoint = context.checkpoints.add(
gx.Checkpoint(
name=checkpoint_name,
validation_definitions=[validation_definition],
actions=actions,
)
)
result = checkpoint.run()
# View the Data Docs
context.open_data_docs()
Azure Blob Storage
import great_expectations as gx
context = gx.get_context(mode="file")
# Define a Data Docs configuration dictionary
container = "my_abs_container"
prefix = "data_docs/"
connection_string = "${AZURE_STORAGE_CONNECTION_STRING}" # This uses string substitution to get the actual connection string from an environment variable or config file.
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleAzureBlobStoreBackend",
"container": container,
"prefix": prefix,
"connection_string": connection_string,
},
}
# Add the Data Docs configuration to the Data Context
site_name = "my_data_docs_site"
context.add_data_docs_site(site_name=site_name, site_config=site_config)
# Manually build the Data Docs
context.build_data_docs(site_names=site_name)
# Automate Data Docs updates with a Checkpoint Action
checkpoint_name = "my_checkpoint"
validation_definition_name = "my_validation_definition"
validation_definition = context.validation_definitions.get(validation_definition_name)
actions = [
gx.checkpoint.actions.UpdateDataDocsAction(
name="update_my_site", site_names=[site_name]
)
]
checkpoint = context.checkpoints.add(
gx.Checkpoint(
name=checkpoint_name,
validation_definitions=[validation_definition],
actions=actions,
)
)
result = checkpoint.run()
# View the Data Docs
context.open_data_docs()
Google Cloud Storage
import great_expectations as gx
context = gx.get_context(mode="file")
# Define a Data Docs configuration dictionary
project = "my_project"
bucket = "my_gcs_bucket"
prefix = "data_docs_site/"
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleGCSStoreBackend",
"project": project,
"bucket": bucket,
"prefix": prefix,
},
}
# Add the Data Docs configuration to the Data Context
site_name = "my_data_docs_site"
context.add_data_docs_site(site_name=site_name, site_config=site_config)
# Manually build the Data Docs
context.build_data_docs(site_names=site_name)
# Automate Data Docs updates with a Checkpoint Action
checkpoint_name = "my_checkpoint"
validation_definition_name = "my_validation_definition"
validation_definition = context.validation_definitions.get(validation_definition_name)
actions = [
gx.checkpoint.actions.UpdateDataDocsAction(
name="update_my_site", site_names=[site_name]
)
]
checkpoint = context.checkpoints.add(
gx.Checkpoint(
name=checkpoint_name,
validation_definitions=[validation_definition],
actions=actions,
)
)
result = checkpoint.run()
# View the Data Docs
context.open_data_docs()