Google publishes JSON specifications for all of its APIs, covering GCP services as well as APIs associated with other products such as Analytics or Workspace. These specifications can be accessed without authentication, starting with the root document (https://discovery.googleapis.com/discovery/v1/apis), which contains metadata and the URL of each service-specific document (for services such as compute or storage).
The program fetches the service document for each service that is included and not explicitly excluded (both configured through variables in the program). Non-preferred services (beta or alpha versions) can be included by setting the variable get_preferred_only to False.
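A minimal sketch of this first step, assuming Node 18+ for the built-in fetch (the names and the include/exclude handling are illustrative, not the project's actual code):

```typescript
// Fetch the discovery root document and pull down each selected service document
const ROOT_DISCOVERY_URL = 'https://discovery.googleapis.com/discovery/v1/apis';

interface DirectoryItem {
  name: string;
  version: string;
  preferred: boolean;
  discoveryRestUrl: string;
}

async function getServiceDocuments(getPreferredOnly = true): Promise<any[]> {
  const rootDoc = await (await fetch(ROOT_DISCOVERY_URL)).json();
  const items: DirectoryItem[] = rootDoc.items ?? [];
  // keep only preferred (GA) versions unless alpha/beta versions are wanted
  const selected = items.filter(item => !getPreferredOnly || item.preferred);
  return Promise.all(
    selected.map(async item => (await fetch(item.discoveryRestUrl)).json()),
  );
}
```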
An OpenAPI spec is then constructed for each service from the data in its service discovery document. In many cases this is a straightforward one-to-one mapping, such as the top-level info, title, and description values; it gets more complicated with parameters and schemas, where some extra logic is required to keep the JSON pointers ($ref) valid.
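For example, discovery documents reference schemas by name ("$ref": "SchemaName"), while OpenAPI expects a full JSON pointer. A sketch of that rewrite alongside the top-level mapping (illustrative only):

```typescript
// Recursively rewrite discovery-style $ref values ("SchemaName") into
// OpenAPI JSON pointers ("#/components/schemas/SchemaName")
function rewriteRefs(node: any): any {
  if (Array.isArray(node)) return node.map(rewriteRefs);
  if (node && typeof node === 'object') {
    const out: any = {};
    for (const [key, value] of Object.entries(node)) {
      out[key] =
        key === '$ref' && typeof value === 'string' && !value.startsWith('#/')
          ? `#/components/schemas/${value}`
          : rewriteRefs(value);
    }
    return out;
  }
  return node;
}

// Seed the OpenAPI document with the top-level fields from the discovery doc
function initOpenApiDoc(serviceDoc: any) {
  return {
    openapi: '3.0.0',
    info: {
      title: serviceDoc.title,
      description: serviceDoc.description,
      version: serviceDoc.version,
    },
    paths: {},
    components: { schemas: rewriteRefs(serviceDoc.schemas ?? {}) },
  };
}
```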
The real magic is in extracting paths and verbs in a compliant OpenAPI format, as Google nests this data (potentially multiple levels deep) under resources.
The first step is to identify methods nested under a resources object; these can be mapped to operations, each with the path and HTTP verb required to populate an OpenAPI spec. A function in the program walks the resources tree to find them.
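A sketch of that traversal (illustrative names, not the project's actual code):

```typescript
// Walk the (potentially deeply nested) resources tree of a discovery document
// and collect every method found, regardless of nesting depth
function collectMethods(resources: any, found: any[] = []): any[] {
  for (const resource of Object.values(resources ?? {}) as any[]) {
    if (resource.methods) {
      found.push(...Object.values(resource.methods)); // methods at this level
    }
    if (resource.resources) {
      collectMethods(resource.resources, found); // recurse into nested resources
    }
  }
  return found;
}
```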
Each method can then be processed to yield an operation (a combination of path and verb); another function in the program handles this step.
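A simplified sketch of how a single discovery method might be translated (illustrative only; real methods also carry parameters and request bodies):

```typescript
// Map one discovery method (id, httpMethod, path/flatPath, response) onto
// an OpenAPI path item and operation
function addOperation(openApiDoc: any, method: any): void {
  const path = '/' + (method.flatPath ?? method.path); // e.g. /projects/{project}/zones
  const verb = String(method.httpMethod).toLowerCase(); // e.g. get, post, delete
  openApiDoc.paths[path] = openApiDoc.paths[path] ?? {};
  openApiDoc.paths[path][verb] = {
    operationId: method.id,
    description: method.description,
    responses: {
      '200': method.response
        ? {
            description: 'Successful response',
            content: {
              'application/json': {
                schema: { $ref: `#/components/schemas/${method.response.$ref}` },
              },
            },
          }
        : { description: 'Successful response' },
    },
  };
}
```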
You can do this in a straightforward function implementing the "tail call recursion" pattern to search for a key (key) from the root of an object (obj), excluding any keys in excludeKeys.
This will return a list of values for the given key, searching all levels in all branches of the object.
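A minimal sketch of such a recursive search (a hypothetical helper, not the project's code):

```typescript
// Return every value found for `key` at any depth of `obj`,
// skipping any branches whose key appears in `excludeKeys`
function findAllValuesForKey(obj: any, key: string, excludeKeys: string[] = []): any[] {
  if (obj === null || typeof obj !== 'object') return [];
  const results: any[] = [];
  for (const [k, v] of Object.entries(obj)) {
    if (excludeKeys.includes(k)) continue; // prune excluded branches
    if (k === key) results.push(v); // record a match
    results.push(...findAllValuesForKey(v, key, excludeKeys)); // keep searching deeper
  }
  return results;
}
```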
Container images provide an ideal software packaging solution for DataOps and Python-based data pipeline workloads. Containers enable Data Scientists and Data Engineers to incorporate the latest packages and libraries without the issues associated with introducing breaking changes into shared environments. A Data Engineer or Data Scientist can quickly release new functionality with the best tools available.
Container images provide safer developer environments, but as the number of container images used for production workloads grows, a maintenance challenge can emerge. Whether using pip or poetry to manage Python packages and dependencies, updating a container definition requires edits to the explicit package versions as well as to the pinned or locked versions of the package dependencies. This process can be error prone without automation and a repeatable CI/CD workflow.
A workflow pattern based on Docker BuildKit / Moby BuildKit multi-stage builds provides an approach that maintains all the build specifications in a single Dockerfile, while build tools like make provide a simple and consistent interface into the container build stages. The data pipeline challenges addressed with a multi-stage build pattern include:
automating lifecycle management of the Python packages used by data pipelines
integrating smoke testing of container images to weed out compatibility issues early
simplifying the developer experience with tools like make that can be used both locally and in CI/CD pipelines
The Dockerfile contains the definitions of the different target build stages and the order of execution from one stage to the next. The Makefile wraps the Dockerfile build targets into a standard set of workflow activities, following a pattern similar to the familiar config && make && make install.
A typical DataOps/GitOps-style workflow for maintaining container images includes actions in the local environment to define the required packages and to produce the poetry.lock file or requirements.txt list containing the full set of pinned dependent packages.
Given an existing project in a remote git repository with a CI/CD pipeline defined, the following workflow would be used to update package versions and dependencies:
Workflow
PlantUML
```plantuml
@startuml Multi-stage build workflow
|Local Maintainer|
start
:Clone git repository and create a feature branch;
:Update declared dependencies;
:Run build with refresh option;
:Update new pinned packages file in the git repository;
:Commit changes and push to remote repository;
|Remote Git Service|
:Validate feature branch changes;
:Merge changes into main branch;
:build target image and push to package registry;
|Package Registry|
:publish new image;
stop
@enduml
```
The image maintainer selects the packages to update or refresh using a local development environment, working from a feature branch. This includes performing an image smoke-test to validate the changes within the container image.
Once the refreshed image has been validated, the lock file or full pinned package list is committed and pushed to the remote repository. The CI/CD pipeline performs a trial build and conducts smoke testing. On merge into the main branch, the target image is built, re-validated, and pushed to the container image registry.
The multi-stage build pattern can support defining both the declared packages for an environment and the dependent packages. Poetry splits the two into distinct files: a pyproject.toml file containing the declared packages and a poetry.lock file containing the full set of declared and dependent packages, including pinned versions. pip supports loading packages from different files, but requires a convention for which requirements file contains the declared packages and which contains the full set of pinned package versions produced by pip freeze. The example code repo contains examples using both pip and poetry.
The following example uses poetry in a python:3.8 base image to illustrate managing the dependencies and version pinning of python packages.
The Dockerfile makes use of the docker build arguments feature to pass in whether the build should refresh package versions or build the image from pinned packages.
First come any image setup and pre-package-installation steps. For poetry, this includes setting the config option to skip the creation of a virtual environment, as the container already provides the required isolation.
```dockerfile
ARG PYTHON_PKG_VERSIONS=pinned

FROM python:3.8 as base-pre-pkg

RUN install -d /src && \
    pip install --no-cache-dir poetry==1.1.13 && \
    poetry config virtualenvs.create false

WORKDIR /src
```
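The python-pkg-refresh and python-pkg-pinned stages selected by the consolidation target below could look something like this with poetry (an illustrative sketch, not the exact definitions from the example repo):

```dockerfile
# Hypothetical refresh stage: re-resolve dependencies to the latest versions
# allowed by pyproject.toml, writing a new poetry.lock inside the image
FROM base-pre-pkg as python-pkg-refresh
COPY pyproject.toml /src/
RUN poetry lock && \
    poetry install --no-root

# Hypothetical pinned stage: install exactly the versions recorded in the
# committed poetry.lock
FROM base-pre-pkg as python-pkg-pinned
COPY pyproject.toml poetry.lock /src/
RUN poetry install --no-root
```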
A consolidation build target refers to either the python-pkg-refresh or the python-pkg-pinned stage, depending on the docker build argument, and includes any post-package-installation steps.
```dockerfile
FROM python-pkg-${PYTHON_PKG_VERSIONS} as base-post-pkg
```
The final build target is the target container image itself. Listing target-image as the last stage in the Dockerfile has the effect of also making it the default build target.
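A representative sketch of the remaining stages (the smoke-test contents and the final command are illustrative placeholders):

```dockerfile
# Hypothetical smoke-test stage: import the key libraries to catch
# incompatible package combinations before the image is published
FROM base-post-pkg as smoke-test
COPY tests/smoke_test.py /src/tests/
RUN python /src/tests/smoke_test.py

# Final (and default) build target
FROM base-post-pkg as target-image
COPY . /src
CMD ["python", "-m", "pipeline"]
```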
The Makefile provides a workflow-oriented wrapper over the Dockerfile build stage targets. The Makefile targets can be executed both in a local development environment and via a CI/CD pipeline. The Makefile includes several variables that can either be left at their default values or be overridden by the CI/CD pipeline.
The python-pkg-refresh target builds a version of the target image with refreshed package versions. A temporary container instance is created from the target image and the poetry.lock file is copied into the local file system. The smoke-test docker build target is used to ensure image validation is also performed.
The temporary container as well as the package refresh image are removed after the build.
```makefile
python-pkg-refresh:
	@echo ">> Update python packages in container image"
	docker build ${DOCKER_BUILD_ARGS} \
		--target smoke-test \
		--build-arg PYTHON_PKG_VERSIONS=refresh \
		--tag ${TARGET_IMAGE_NAME}:$@ .
	@echo ">> Copy the new poetry.lock file with updated package versions"
	docker create --name ${TARGET_IMAGE_NAME}-$@ ${TARGET_IMAGE_NAME}:$@
	docker cp ${TARGET_IMAGE_NAME}-$@:/src/poetry.lock .
	@echo ">> Clean working container and refresh image"
	docker rm ${TARGET_IMAGE_NAME}-$@
	docker rmi ${TARGET_IMAGE_NAME}:$@
```
The toolchain combination of multi-stage container image builds with make provides a codified method for the lifecycle management of the containers used in data science and data engineering workloads.
The maintainer:
```bash
git checkout -b my-refresh-feature
make python-pkg-refresh
make smoke-test
git add pyproject.toml poetry.lock
git commit -m "python package versions updated"
git push
```
The CI/CD pipeline:
```bash
make build
make smoke-test
docker push <target-image>:<build-tag>
```
This article demonstrates how to use the Snowflake REST API to retrieve data for a web application using TypeScript; in this case, we are using key pair authentication with Snowflake.
Snowflake's SQL API allows you to access Snowflake objects using SQL via a REST API request, which allows for easy integration with your applications and deployment pipelines. You can use the API to execute most DDL and DML statements.
There are some limitations to be aware of, however; for example, interactions with stages (using PUT and GET) and stored procedure operations (using CALL) aren't supported via the SQL API. You can read more on this here.
There are two authentication methods for the API: OAuth and key pair. For the OAuth method, you can choose to use the X-Snowflake-Authorization-Token-Type header; if this header is not present, Snowflake assumes that the token in the Authorization header is an OAuth token. For the key pair method, the JWT is passed in the Authorization header as Bearer <your token>.
Let’s walk through how to generate and use the JWT.
We will need to include the region with the account identifier if your account is region-qualified (e.g. xy12345.us-east2.aws.snowflakecomputing.com).
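A sketch of generating the JWT in TypeScript, assuming the jsonwebtoken package and an unencrypted PEM private key whose public key is registered against the Snowflake user (the account, user, and key path are placeholders):

```typescript
import { createHash, createPublicKey } from 'crypto';
import { readFileSync } from 'fs';
import jwt from 'jsonwebtoken';

// Build a Snowflake key pair JWT: issuer is <ACCOUNT>.<USER>.SHA256:<fingerprint>,
// subject is <ACCOUNT>.<USER>, signed with the RSA private key using RS256
function generateSnowflakeJwt(account: string, user: string, privateKeyPath: string): string {
  const privateKey = readFileSync(privateKeyPath, 'utf8');

  // SHA-256 fingerprint of the DER-encoded public key
  const publicKeyDer = createPublicKey(privateKey).export({ type: 'spki', format: 'der' });
  const fingerprint = 'SHA256:' + createHash('sha256').update(publicKeyDer).digest('base64');

  const qualifiedUsername = `${account.toUpperCase()}.${user.toUpperCase()}`;
  const now = Math.floor(Date.now() / 1000);

  return jwt.sign(
    {
      iss: `${qualifiedUsername}.${fingerprint}`,
      sub: qualifiedUsername,
      iat: now,
      exp: now + 59 * 60, // Snowflake accepts a token lifetime of up to one hour
    },
    privateKey,
    { algorithm: 'RS256' },
  );
}
```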
Large result sets are paginated into partitions; each partition is a set of rows.
Note that the pages (referred to as partitions) are NOT based on row count, instead they are based on the compressed batch size, so they will not be uniform in terms of the number of rows.
To get a partition, send a GET request to https://<accountIdentifier>.snowflakecomputing.com/api/v2/statements/<statementHandle>?partition=<partitionId>.
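A sketch of fetching a partition with the key pair JWT from above (accountIdentifier, statementHandle, and partition number are placeholders):

```typescript
// Request a single partition of a statement's result set from the SQL API
async function getPartition(
  accountIdentifier: string,
  statementHandle: string,
  partitionId: number,
  token: string,
): Promise<any> {
  const url =
    `https://${accountIdentifier}.snowflakecomputing.com` +
    `/api/v2/statements/${statementHandle}?partition=${partitionId}`;

  const response = await fetch(url, {
    headers: {
      Authorization: `Bearer ${token}`,
      'X-Snowflake-Authorization-Token-Type': 'KEYPAIR_JWT',
      Accept: 'application/json',
    },
  });

  if (!response.ok) {
    throw new Error(`Snowflake SQL API request failed: ${response.status}`);
  }
  return response.json(); // the rows for the requested partition are in the data array
}
```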
Open API specifications can get quite large, especially for providers with upwards of 500 routes or operations.
The challenge is to create standalone documents scoped by a service or path within the parent API specification and include only the components (schemas, responses, etc.) that pertain to operations included in the child document.
When I went looking for a library or utility to do this, I couldn't find one... so I developed one myself.
It's a simple command (Node.js based, but it can be run from a bash terminal or the Windows command line) which requires a few options (a sample invocation follows the list), including:
the provider name (e.g. github)
a provider version, which is a version you set, allowing you to make minor modifications to the output documents (e.g. v0.1.0)
a service discriminator, which is a JSONPath expression used to identify a service name within each route in the parent file; this is used to assign operations to services in separate documents (e.g. '$["x-github"].category')
an output directory (e.g. ./dev)
and of course, the openapi spec document you are splitting up.
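A hypothetical invocation might look like the following (the command name, flag names, and file path are purely illustrative, not the utility's actual syntax):

```bash
# illustrative only - consult the utility's help output for the real options
openapi-split \
  --provider github \
  --provider-version v0.1.0 \
  --svc-discriminator '$["x-github"].category' \
  --output-dir ./dev \
  api.github.com.yaml
```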