
15 posts tagged with "googlecloudplatform"


· 4 min read
Tom Klimovski

So you're using BigQuery (BQ). It's all set up and humming perfectly. Maybe now, you want to run an ELT job whenever a new table partition is created, or maybe you want to retrain your ML model whenever new rows are inserted into the BQ table.

In my previous article on Eventarc, we went through how Logging can help us create eventing-type functionality in an application. Let's take it a step further and walk through how we can couple BigQuery and Cloud Run.

In this article you will learn how to

  • Tie together BigQuery and Cloud Run
  • Use BigQuery's audit log to trigger Cloud Run
  • With those triggers, run your required code

Let's go!

Let's create a temporary dataset within BigQuery named tmp_bq_to_cr.

In that same dataset, let's create a table in which we will insert some rows to test our BQ audit log. Let's grab some rows from a BQ public dataset to create this table:

CREATE OR REPLACE TABLE tmp_bq_to_cr.cloud_run_trigger AS
SELECT
date, country_name, new_persons_vaccinated, population
from `bigquery-public-data.covid19_open_data.covid19_open_data`
where country_name='Australia'
AND
date > '2021-05-31'
LIMIT 100

Following this, let's run an insert query that will help us build our mock database trigger:

INSERT INTO tmp_bq_to_cr.cloud_run_trigger
VALUES('2021-06-18', 'Australia', 3, 1000)

Now, in another browser tab let's navigate to BQ Audit Events and look for our INSERT INTO event:

BQ-insert-event

There will be several audit logs for any given BQ action. Only after a query is parsed does BQ know which table we want to interact with, so the initial log will not, for example, have the table name.

We don't want any old audit log, so we need to ensure we look for a unique set of attributes that clearly identify our action, such as in the diagram above.

In the case of inserting rows, the attributes are a combination of

  • The method is google.cloud.bigquery.v2.JobService.InsertJob
  • The name of the table being inserted to is the protoPayload.resourceName
  • The dataset id is available as resource.labels.dataset_id
  • The number of inserted rows is protoPayload.metadata.tableDataChange.insertedRowsCount
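
The attribute combination above can be sketched as a small standalone predicate, which is handy for testing the filter locally before wiring it into a web handler. This is a minimal sketch: the function name is_target_insert and the sample payload (including the project ID my-project) are illustrative assumptions, and the sample is hand-written rather than a captured log entry. The field name tableDataChange follows the handler code later in this post.

```python
INSERT_METHOD = "google.cloud.bigquery.v2.JobService.InsertJob"


def is_target_insert(event, dataset, table):
    """Return True if the audit log payload reports rows inserted into dataset.table."""
    try:
        payload = event["protoPayload"]
        rows = int(payload["metadata"]["tableDataChange"]["insertedRowsCount"])
        return (
            payload["methodName"] == INSERT_METHOD
            and event["resource"]["labels"]["dataset_id"] == dataset
            and payload["resourceName"].endswith("/tables/" + table)
            and rows > 0
        )
    except (KeyError, TypeError, ValueError):
        # Most audit logs for a job lack these fields; treat them as "not ours".
        return False


# Hand-written sample payload (hypothetical project ID), for local testing only.
sample = {
    "resource": {"labels": {"dataset_id": "tmp_bq_to_cr", "project_id": "my-project"}},
    "protoPayload": {
        "methodName": INSERT_METHOD,
        "resourceName": "projects/my-project/datasets/tmp_bq_to_cr/tables/cloud_run_trigger",
        "metadata": {"tableDataChange": {"insertedRowsCount": "1"}},
    },
}

print(is_target_insert(sample, "tmp_bq_to_cr", "cloud_run_trigger"))
print(is_target_insert({}, "tmp_bq_to_cr", "cloud_run_trigger"))
```

Returning False on missing fields, rather than raising, matches the fact that several audit logs are written per BQ action and only some carry the table details.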

Time for some code

Now that we've identified the payload that we're looking for, we can write the action for Cloud Run. We've picked Python and Flask to help us in this instance. (full code is on GitHub).

First, let's filter out the noise and find the event we want to process

from flask import Flask, request

app = Flask(__name__)

@app.route('/', methods=['POST'])
def index():
    # Get the payload data from the audit log entry
    content = request.json
    try:
        ds = content['resource']['labels']['dataset_id']
        proj = content['resource']['labels']['project_id']
        tbl = content['protoPayload']['resourceName']
        rows = int(content['protoPayload']['metadata']
                   ['tableDataChange']['insertedRowsCount'])
        # Only act on inserts into the table created earlier in tmp_bq_to_cr
        if ds == 'tmp_bq_to_cr' and \
           tbl.endswith('tables/cloud_run_trigger') and rows > 0:
            query = create_agg()
            return "table created", 200
    except (KeyError, TypeError, ValueError):
        # if these fields are not in the JSON, ignore the event
        pass
    return "ok", 200

Now that we've found the event we want, let's execute the action we need. In this example, we'll aggregate and write out to a new table created_by_trigger:

from google.cloud import bigquery

def create_agg():
    # Rebuild the aggregate table from the trigger table
    client = bigquery.Client()
    query = """
    CREATE OR REPLACE TABLE tmp_bq_to_cr.created_by_trigger AS
    SELECT
      country_name, SUM(new_persons_vaccinated) AS n
    FROM tmp_bq_to_cr.cloud_run_trigger
    GROUP BY country_name
    """
    client.query(query)
    return query

The Dockerfile for the container is simply a basic Python container into which we install Flask and the BigQuery client library:

FROM python:3.9-slim
RUN pip install Flask==1.1.2 gunicorn==20.0.4 google-cloud-bigquery
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY *.py ./
CMD exec gunicorn --bind :$PORT main:app

Now we Cloud Run

Build the container and deploy it using a couple of gcloud commands:

SERVICE=bq-cloud-run
PROJECT=$(gcloud config get-value project)
CONTAINER="gcr.io/${PROJECT}/${SERVICE}"
gcloud builds submit --tag ${CONTAINER}
gcloud run deploy ${SERVICE} --image $CONTAINER --platform managed

I always forget about the permissions

In order for the trigger to work, the Pub/Sub service agent needs permission to create tokens, and the service account used by the trigger needs Eventarc permissions. Here ${PROJECT_NO} is the project number and ${SVC_ACCOUNT} is the email of that service account:

gcloud projects add-iam-policy-binding $PROJECT \
  --member="serviceAccount:service-${PROJECT_NO}@gcp-sa-pubsub.iam.gserviceaccount.com" \
  --role='roles/iam.serviceAccountTokenCreator'

gcloud projects add-iam-policy-binding $PROJECT \
  --member=serviceAccount:${SVC_ACCOUNT} \
  --role='roles/eventarc.admin'

Finally, the event trigger

gcloud eventarc triggers create ${SERVICE}-trigger \
--location ${REGION} --service-account ${SVC_ACCOUNT} \
--destination-run-service ${SERVICE} \
--event-filters type=google.cloud.audit.log.v1.written \
--event-filters methodName=google.cloud.bigquery.v2.JobService.InsertJob \
--event-filters serviceName=bigquery.googleapis.com

Important to note here is that we're triggering on any Insert log created by BQ. That's why in this action we had to filter these events based on the payload.

Take it for a spin

Now, try out the BigQuery -> Cloud Run trigger and action. Go to the BigQuery console and insert a row or two:

INSERT INTO tmp_bq_to_cr.cloud_run_trigger
VALUES('2021-06-18', 'Australia', 5, 25000)

Watch as a new table called created_by_trigger gets created! You have successfully triggered a Cloud Run action on a database event in BigQuery.

Enjoy!

· One min read
Jeffrey Aven

Multicloud Diagramming

Following on from the recent post GCP Templates for C4 Diagrams using PlantUML, cloud architects are often challenged with producing diagrams for architectures spanning multiple cloud providers, particularly as you elevate to enterprise level diagrams.

In this post, with the magic of !includeurl we have brought PlantUML template libraries together for AWS, Azure and GCP icon sets, allowing us to produce multi cloud C4 diagrams using PlantUML like this one:

Multi Cloud Architecture Diagram using PlantUML

Creating a multi cloud diagram is simple. Start by adding the following include statements after the @startuml label in a new PlantUML C4 diagram:

Then add references to the required services from different providers…

Then include the predefined resources from your different cloud providers in your diagram as shown here (describing a client server application over a cloud to cloud VPN between Azure and GCP)...

Happy multi-cloud diagramming!

Full source code is available at:

https://github.com/gamma-data/plantuml-multi-cloud-diagrams

if you have enjoyed this post, please consider buying me a coffee ☕ to help me keep writing!

· 4 min read
Jeffrey Aven

Cloud BigTable

This is a follow up to the original Cloud Bigtable primer where we discussed the basics of Cloud Bigtable:

Cloud Bigtable Primer - Part I

In this article we will cover schema design and row key selection in Bigtable – arguably the most critical design decision to make when employing Bigtable in a cloud data architecture.

Quick Review

Recall from the previous post, where the Bigtable data model was introduced, that tables in Bigtable are comprised of rows and columns, much like a table in an RDBMS. Every row is uniquely identified by a rowkey, again akin to a primary key in a table in an RDBMS. But this is where the similarities end.

Unlike a table in an RDBMS, columns only ever exist when they are inserted, and NULLs are not stored. See the illustration below:

Row Key Selection

Data in Bigtable is distributed by row keys. Row keys are physically stored in tablets in lexicographic order. Recall that row keys are your ONLY indexes to data in Bigtable.

Selection Considerations

As row keys are your only indexes to retrieve or update rows in Bigtable, row key design must take into consideration the access patterns for the data to be stored and served via Bigtable. Specifically, the following must be considered when designing a Bigtable application:

  • Search patterns (returning data for a specific entity)
  • Scan patterns (returning batches of data)

Queries that use the row key, a row prefix, or a row range are the most efficient. Queries that do not include a row key will typically scan GB or TB of data and would not be suitable for operational use cases.
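
To see why a row prefix is cheap to serve when keys are kept in sorted order, here is a minimal in-memory sketch. It is a toy model using a plain Python list, not the Bigtable client library, and the key format and function name prefix_scan are assumptions made for illustration.

```python
from bisect import bisect_left


def prefix_scan(sorted_keys, prefix):
    """Return every key that starts with prefix from a lexicographically sorted list."""
    # Binary search jumps straight to the first candidate key.
    start = bisect_left(sorted_keys, prefix)
    end = start
    # Matching keys are contiguous, so stop at the first non-matching key.
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


keys = sorted(["cust001#a", "cust001#b", "cust002#a", "cust010#a"])
print(prefix_scan(keys, "cust001#"))
```

Only the matching keys are touched. A query on a column value, by contrast, would have to visit every row, which is the full scan the paragraph above warns about.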

Row Key Performance

Row key performance will be biased towards your specific access patterns and application functional requirements. For example, if you are performing sequential reads or scan operations then sequential keys will perform the best; however, their write performance will not be optimal. Conversely, random keys (such as a uuid) will perform best for writes but poorly for scan or sequential read operations.

Adding salts to keys (or additional data), similar to the use of salts in cryptography as well as promoting other field keys to be part of a composite row key can help achieve a “Goldilocks” scenario for both reads and writes, see the diagram below:
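
One way to picture a salted composite key is sketched below. This is an illustrative assumption rather than a prescribed Bigtable pattern: the bucket count, the # separator and the function name salted_key are all hypothetical. The salt is derived from the entity ID so that a reader can recompute the prefix for a given entity.

```python
import hashlib


def salted_key(entity_id, timestamp_ms, buckets=4):
    """Build '<salt>#<entity_id>#<timestamp>' with a deterministic salt per entity."""
    digest = hashlib.md5(entity_id.encode("utf-8")).digest()
    salt = digest[0] % buckets  # spreads entities across `buckets` key ranges
    # Zero-pad the timestamp so string order matches numeric order.
    return f"{salt}#{entity_id}#{timestamp_ms:013d}"


k1 = salted_key("sensor-42", 1624000000000)
k2 = salted_key("sensor-42", 1624000001000)
print(k1)
print(k2)
```

Writes for different entities land in different key ranges, while all rows for one entity still share a prefix and stay in time order, so they can be read back with a single prefix scan.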

Using Reverse Timestamps

Use reverse timestamps when your most common query is for the latest values. Typically you would append the reverse timestamp to the key, which keeps related records grouped together. For instance, if you are storing events for a customer, a key made of the customer id with an appended reverse timestamp (for example <customer_id>#<reverse_ts>) lets you quickly serve the latest events for that customer: within each group (customer_id), rows are sorted so that the most recent insert is located at the top.
A reverse timestamp can be generalised as:

Long.MAX_VALUE - System.currentTimeMillis()
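
The same idea in Python, as a minimal sketch. LONG_MAX mirrors Java's Long.MAX_VALUE; the function names and the zero-padding to 19 digits are assumptions added here so that string ordering of the keys matches numeric ordering.

```python
import time

LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def reverse_ts(millis=None):
    """Return LONG_MAX minus the given (or current) epoch time in milliseconds."""
    if millis is None:
        millis = int(time.time() * 1000)
    return LONG_MAX - millis


def event_key(customer_id, millis):
    """Build '<customer_id>#<reverse_ts>' with a fixed-width reverse timestamp."""
    return f"{customer_id}#{reverse_ts(millis):019d}"


older = event_key("cust001", 1624000000000)
newer = event_key("cust001", 1624000005000)
# Sorting the keys as strings puts the most recent event first.
print(sorted([older, newer])[0] == newer)
```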

Schema Design Tips

Some general tips for good schema design using Bigtable are summarised below:

  • Group related data for more efficient reads using column families
  • Distribute data evenly for more efficient writes
  • Use row keys to place identical values in adjoining rows for more efficient compression

Following these tips will give you the best possible performance using Bigtable.

Use the Key Visualizer to profile performance

Google provides a neat tool to visualize your row key distribution in Cloud Bigtable. You need to have at least 30 GB of data in your table to enable this feature.

The Key Visualizer is shown here:

Bigtable Key Visualizer

The Key Visualizer will help you find and prevent hotspots, find rows with too much data and show if your key schema is balanced.

Summary

Bigtable is one of the original and best (massively) distributed NoSQL platforms available. Schema and moreover row key design play a massive part in ensuring low latency and query performance. Go forth and conquer with Cloud Bigtable!


· 2 min read
Jeffrey Aven

GCP C4 Diagramming

I am a believer in the mantra of “Everything-as-Code”, and this includes diagrams and other architectural artefacts. Enter PlantUML…

PlantUML

PlantUML is an open-source tool which allows users to create UML diagrams from an intuitive DSL (Domain Specific Language). PlantUML is built on top of Graphviz and enables software architects and designers to use code to create Sequence Diagrams, Use Case Diagrams, Class Diagrams, State and Activity Diagrams and much more.

C4 Diagrams

PlantUML can be extended to support the C4 model for visualising software architecture, which describes software in different layers including Context, Container, Component and Code diagrams.

GCP Architecture Diagramming using C4

PlantUML and C4 can be used to produce cloud architectures. There are official libraries available through PlantUML for Azure and AWS service icons; however, these do not exist for GCP yet. There are several open source libraries available, but I have made an attempt to simplify the implementation.

The code below can be used to generate a C4 diagram describing a GCP architecture including official GCP service icons:

@startuml
!define GCPPuml https://raw.githubusercontent.com/gamma-data/GCP-C4-PlantUML/master/templates

!includeurl GCPPuml/C4_Context.puml
!includeurl GCPPuml/C4_Component.puml
!includeurl GCPPuml/C4_Container.puml
!includeurl GCPPuml/GCPC4Integration.puml
!includeurl GCPPuml/GCPCommon.puml

!includeurl GCPPuml/Networking/CloudDNS.puml
!includeurl GCPPuml/Networking/CloudLoadBalancing.puml
!includeurl GCPPuml/Compute/ComputeEngine.puml
!includeurl GCPPuml/Storage/CloudStorage.puml
!includeurl GCPPuml/Databases/CloudSQL.puml

title Sample C4 Diagram with GCP Icons

Person(publisher, "Publisher")
System_Ext(device, "User")

Boundary(gcp,"gcp-project") {
CloudDNS(dns, "Managed Zone", "Cloud DNS")
CloudLoadBalancing(lb, "L7 Load Balancer", "Cloud Load Balancing")
CloudStorage(bucket, "Static Content Bucket", "Cloud Storage")
Boundary(region, "gcp-region") {
Boundary(zonea, "zone a") {
ComputeEngine(gcea, "Content Server", "Compute Engine")
CloudSQL(csqla, "Dynamic Content", "Cloud SQL")
}
Boundary(zoneb, "zone b") {
ComputeEngine(gceb, "Content Server", "Compute Engine")
CloudSQL(csqlb, "Dynamic Content\n(Read Replica)", "Cloud SQL")
}
}
}

Rel(device, dns, "resolves name")
Rel(device, lb, "makes request")
Rel(lb, gcea, "routes request")
Rel(lb, gceb, "routes request")
Rel(gcea, bucket, "get static content")
Rel(gceb, bucket, "get static content")
Rel(gcea, csqla, "get dynamic content")
Rel(gceb, csqla, "get dynamic content")
Rel(csqla, csqlb, "replication")
Rel(publisher,bucket,"publish static content")

@enduml

The preceding code generates the diagram below:

Additional services can be added and used in your diagrams by adding them to your includes, such as:

!includeurl GCPPuml/DataAnalytics/BigQuery.puml
!includeurl GCPPuml/DataAnalytics/CloudDataflow.puml
!includeurl GCPPuml/AIandMachineLearning/AIHub.puml
!includeurl GCPPuml/AIandMachineLearning/CloudAutoML.puml
!includeurl GCPPuml/DeveloperTools/CloudBuild.puml
!includeurl GCPPuml/HybridandMultiCloud/Stackdriver.puml
!includeurl GCPPuml/InternetofThings/CloudIoTCore.puml
!includeurl GCPPuml/Migration/TransferAppliance.puml
!includeurl GCPPuml/Security/CloudIAM.puml
' and more…

The complete template library is available at:

https://github.com/gamma-data/GCP-C4-PlantUML


· 7 min read
Jeffrey Aven

Cloud BigTable

Bigtable is one of the foundational services in the Google Cloud Platform and to this day one of the greatest contributions to the big data ecosystem at large. It is also one of the least known services available, with all the headlines and attention going to more widely used services such as BigQuery.

Background

In 2006 (pre Google Cloud Platform), Google released a white paper called “Bigtable: A Distributed Storage System for Structured Data”. This paper set out the reference architecture for what was to become Cloud Bigtable. It followed several other whitepapers, including the GoogleFS and MapReduce whitepapers released in 2003 and 2004, which provided abstract reference architectures for the Google File System (now known as Colossus) and the MapReduce algorithm. These whitepapers inspired a generation of open source distributed processing systems including Hadoop. Google has long had a pattern of publicising a generalized overview of their approach to solving different storage and processing challenges at scale through white papers.

Bigtable Whitepaper 2006

The Bigtable white paper inspired a wave of open source distributed key/value oriented NoSQL data stores including Apache HBase and Apache Cassandra.

What is Bigtable?

Bigtable is a distributed, petabyte scale NoSQL database. More specifically, Bigtable is…

a map

At its core Bigtable is a distributed map or an associative array indexed by a row key, with values in columns which are created only when they are referenced. Each value is an uninterpreted byte array.

sorted

Row keys are stored in lexicographic order, akin to a clustered index in a relational database.

sparse

A given row can have any number of columns, not all columns must have values and NULLs are not stored. There may also be gaps between keys.

multi-dimensional

All values are versioned with a timestamp (or configurable integer). Data is not updated in place; it is instead superseded with another version.
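
The four properties above can be illustrated with a toy in-memory model. This is only a sketch of the data model, not how Bigtable is implemented or accessed; the class name ToyTable and its methods are hypothetical.

```python
from collections import defaultdict


class ToyTable:
    """A map of row key -> column -> versions, mimicking the properties above."""

    def __init__(self):
        # Sparse: a column exists for a row only once a value is written.
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, ts):
        # Multi-dimensional: a write adds a new version, nothing is overwritten.
        versions = self._rows[row_key][column]
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # newest first

    def get(self, row_key, column):
        # Return the latest value as uninterpreted bytes, or None if absent.
        versions = self._rows.get(row_key, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self):
        # Sorted: rows come back in lexicographic row key order.
        for key in sorted(self._rows):
            yield key, {c: v[0][1] for c, v in self._rows[key].items()}


t = ToyTable()
t.put("row-b", "cf1:status", b"PENDING", ts=1)
t.put("row-b", "cf1:status", b"ACTIVE", ts=2)
t.put("row-a", "cf1:name", b"alice", ts=1)
print(t.get("row-b", "cf1:status"))
print([key for key, _ in t.scan()])
```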

When (and when not) to use Bigtable

Use Bigtable if…

  • You need to do many thousands of operations per second on TB+ scale data
  • Your access patterns are well known and simple
  • You need to support random write or random read operations (or sequential reads) - each using a row key as the primary identifier

Don’t use Bigtable if…

  • You need explicit JOIN capability, that is joining two or more tables
  • You need to do ad-hoc analytics
  • Your access patterns are unknown or not well defined

Bigtable vs Relational Database Systems

The following table compares and contrasts Bigtable against relational databases (both transaction oriented and analytic oriented databases):

|  | Bigtable | RDBMS (OLTP) | RDBMS (DSS/MPP) |
| --- | --- | --- | --- |
| Data Layout | Column Family Oriented | Row Oriented | Column Oriented |
| Transaction Support | Single Row Only | Yes | Depends (but usually no) |
| Query DSL | get/put/scan/delete | SQL | SQL |
| Indexes | Row Key Only | Yes | Yes (typically PI based) |
| Max Data Size | PB+ | '00s GB to TB | TB+ |
| Read/Write Throughput | '000000s queries/s | '000s queries/s |  |

Bigtable Data Model

Tables in Bigtable are comprised of rows and columns (sounds familiar so far..). Every row is uniquely identified by a rowkey (like a primary key..again sounds familiar so far).

Columns belong to Column Families and only exist when inserted, NULLs are not stored - this is where it starts to differ from a traditional RDBMS. The following image demonstrates the data model for a fictitious table in Bigtable.

Bigtable Data Model

In the previous example, we created two Column Families (cf1 and cf2). These are created during table definition or update operations (akin to DDL operations in the relational world). In this case, we have chosen to store primary attributes, like name, etc in cf1 and features (or derived attributes) in cf2 like indicators.

Cell versioning

Each cell has a timestamp/version associated with it, so multiple versions of a cell can exist. Versions are naturally stored in descending timestamp order (newest first).

Properties such as the max age for a cell or the maximum number of versions to be stored for any given cell are set on the Column Family. Versions are compacted through a process called Garbage Collection - not to be confused with Java Garbage Collection (albeit same idea).

| Row Key | Column | Value | Timestamp |
| --- | --- | --- | --- |
| 123 | cf1:status | ACTIVE | 2020-06-30T08.58.27.560 |
| 123 | cf1:status | PENDING | 2020-06-28T06.20.18.330 |
| 123 | cf1:status | INACTIVE | 2020-06-27T07.59.20.460 |
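
The effect of the max versions and max age settings can be sketched with a small function over a list of cell versions. This is a toy model of the idea only: the function name, parameters and the choice to apply both limits together are assumptions, not Bigtable's actual garbage collection implementation.

```python
def garbage_collect(versions, max_versions=None, max_age_ms=None, now_ms=0):
    """Return the (timestamp_ms, value) versions that survive, newest first."""
    kept = sorted(versions, key=lambda v: v[0], reverse=True)
    if max_age_ms is not None:
        # Drop versions older than the allowed age.
        kept = [v for v in kept if now_ms - v[0] <= max_age_ms]
    if max_versions is not None:
        # Keep only the N most recent versions.
        kept = kept[:max_versions]
    return kept


cell = [(1, "INACTIVE"), (3, "ACTIVE"), (2, "PENDING")]
print(garbage_collect(cell, max_versions=2))
print(garbage_collect(cell, max_age_ms=1, now_ms=3))
```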

Bigtable Instances, Clusters, Nodes and Tables

Bigtable is a "no-ops" service, meaning you do not need to configure machine types or details about the underlying infrastructure, save a few sizing or performance options, such as the number of nodes in a cluster or whether to use solid state drives (SSD) or the magnetic alternative (HDD). The following diagram shows the relationships and cardinality for Cloud Bigtable.

Bigtable Instances, Clusters and Nodes

Clusters and nodes are the physical compute layer for Bigtable, and these are zonal assets. Zonal and regional availability can be achieved through replication, which we will discuss later in this article.

Instances are a virtual abstraction for clusters, and tables belong to instances (not clusters). This is due to Bigtable's underlying architecture, which is based upon a separation of storage and compute as shown below.

Bigtable Separation of Storage and Compute

Bigtable's separation of storage and compute allows it to scale horizontally: as nodes are stateless, more can be added to increase query performance. The underlying storage system is inherently scalable.

Physical Storage & Column Families

Data (Columns) for Bigtable is stored in Tablets (as shown in the previous diagram), which store "regions" of row keys for a particular Column Family. Columns consist of a column family prefix and qualifier, for instance:

cf1:col1

A table can have one or more Column Families. Column families must be declared at schema definition time (could be a create or alter operation). A cell is an intersection of a row key and a version of a column within a column family.

Storage settings (such as the compaction/garbage collection properties mentioned before) can be specified for each Column Family - which can differ from other column families in the same table.

Bigtable Availability and Replication

Replication is used to increase availability and durability for Cloud Bigtable – this can also be used to segregate read and write operations for the same table.

Data and changes to tables are replicated across multiple regions or multiple zones within the same region, this replication can be blocking (single row transactions) or non blocking (eventually consistent). However all clusters within a Bigtable instance are considered primary (writable).

Requests are routed using Application Profiles. A single-cluster routing policy can be used for manual failover, whereas a multi-cluster routing policy is used for automatic failover.

Backup and Export Options for Bigtable

Managed backups can be taken at a table level, new tables can be created from backups. The backups cannot be exported, however table level export and import operations are available via pre-baked Dataflow templates for data stored in GCS in the following formats:

  • SequenceFiles
  • Avro Files
  • Parquet Files
  • CSV Files

Accessing Bigtable

Bigtable data and admin functions are available via:

  • cbt (optional component of the Google Cloud SDK)
  • hbase shell (REPL shell)
  • HappyBase API (Python API for HBase)
  • SDK libraries for:
    • Golang
    • Python
    • Java
    • Node.js
    • Ruby
    • C#, C++, PHP, and more

As Bigtable is not a cheap service, there is a local emulator available which is great for application development. This is part of the Cloud SDK, and can be started using the following command:

gcloud beta emulators bigtable start

In the next article in this series we will demonstrate admin and data functions as well as the local emulator.

Next Up : Part II - Row Key Selection and Schema Design in Bigtable
