Let's try: Apache Beam part 6 - instant IO
In this series:
- Let's try: Apache Beam part 1 – simple batch
- Let's try: Apache Beam part 2 – draw the graph
- Let's try: Apache Beam part 3 - my own functions
- Let's try: Apache Beam part 4 - live on Google Dataflow
- Let's try: Apache Beam part 5 - transform it with Beam functions
- Let's try: Apache Beam part 6 - instant IO
- Let's try: Apache Beam part 7 - custom IO
- Let's try: Apache Beam part 8 - Tags & Side inputs
Apache Beam provides inputs and outputs for PCollections in many packages. We just import them, call them properly, and get the job done.
In this post, we will look at 3 IO (input/output) modules that I usually work with.
1. Text (Google Cloud Storage)
A very basic one.
Beam has the `beam.io` library, which provides `ReadFromText()` and `WriteToText()` to read and write a text file respectively. We can also use them to work with files on Google Cloud Storage, as long as they are text files.
- line 14: `ReadFromText()` reads the file at `input_file`, which comes from `argparse`.
- line 16: `WriteToText()` creates a file at `output_file`.
This pipeline can be drawn as a diagram like this.
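A minimal sketch of such a pipeline, assuming hypothetical GCS paths and a placeholder uppercase transform, could look like this:

```python
import argparse

import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions


def main():
    parser = argparse.ArgumentParser()
    # Hypothetical defaults; gs:// paths work the same way as local paths.
    parser.add_argument("--input_file", default="gs://my-bucket/input.txt")
    parser.add_argument("--output_file", default="gs://my-bucket/output")
    args, beam_args = parser.parse_known_args()

    with beam.Pipeline(options=PipelineOptions(beam_args)) as p:
        (
            p
            | "Read" >> ReadFromText(args.input_file)
            | "Transform" >> beam.Map(str.upper)  # placeholder transform
            | "Write" >> WriteToText(args.output_file)
        )


if __name__ == "__main__":
    main()
```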
2. Database (Google BigQuery)
Another Google Cloud service that I use very often.
- line 14: `ReadFromBigQuery()` with `query=` supplied to run the query.
- line 18: supply `temp_dataset=` so that Beam can use the given dataset to store the temporary data it generates. Without `temp_dataset=`, Beam will automatically create a new dataset every time it runs.

This pipeline is shown in the diagram below:
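In code, a minimal sketch of this read, assuming a hypothetical project, query, and temporary dataset, could look like this:

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery import ReadFromBigQuery
from apache_beam.io.gcp.internal.clients import bigquery

with beam.Pipeline() as p:
    (
        p
        | "Read" >> ReadFromBigQuery(
            # Hypothetical query and project/dataset names.
            query="SELECT name, total FROM `my-project.my_dataset.my_table`",
            use_standard_sql=True,
            # Reuse a fixed dataset for Beam's temporary tables instead of
            # letting Beam create a new temp dataset on every run.
            temp_dataset=bigquery.DatasetReference(
                projectId="my-project",
                datasetId="beam_temp",
            ),
        )
        | "Print" >> beam.Map(print)
    )
```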
3. Messaging (Google Cloud Pub/Sub)
Now we come to real-time things.
Google Cloud Pub/Sub is one of the sources and sinks integrated with Beam. We can set up Beam to listen to either a topic or a subscription by design.
This time, I set up Beam to read data from a subscription, then transform it before sending the result to another topic.
- line 10: set the pipeline option `streaming=True` to allow Beam to run as a streaming pipeline.
- line 15: `ReadFromPubSub()` reads from a specific subscription.
- line 26: after transforming, publish the result to a topic through `WriteToPubSub()`.
Remember to `.encode()` the messages before throwing them to `WriteToPubSub()`, as it expects bytes rather than strings. We can test by publishing something on the topic and pulling from the subscription, like this.
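For example, with the google-cloud-pubsub client library (the project, topic, and subscription names here are hypothetical):

```python
from google.cloud import pubsub_v1

project_id = "my-project"  # hypothetical project

# Publish a test message to the input topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "my-input-topic")
future = publisher.publish(topic_path, b"hello beam")
print("published:", future.result())

# Pull the transformed result from a subscription on the output topic.
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project_id, "my-output-subscription")
response = subscriber.pull(request={"subscription": sub_path, "max_messages": 1})
for received in response.received_messages:
    print("pulled:", received.message.data)
subscriber.acknowledge(
    request={
        "subscription": sub_path,
        "ack_ids": [m.ack_id for m in response.received_messages],
    }
)
```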
`originMessageId` is parsed from the incoming message in the transformation step.
This diagram describes the flow of this Beam pipeline.
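Putting it together, a minimal sketch of such a streaming pipeline, assuming hypothetical subscription and topic names and a placeholder transform, could look like this:

```python
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub, WriteToPubSub
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical resource names.
SUBSCRIPTION = "projects/my-project/subscriptions/my-subscription"
OUTPUT_TOPIC = "projects/my-project/topics/my-output-topic"

# streaming=True lets Beam run this as a streaming pipeline.
options = PipelineOptions(streaming=True)


def transform(message: bytes) -> bytes:
    # Placeholder transform: decode, uppercase, and re-encode,
    # since WriteToPubSub() expects bytes.
    return message.decode("utf-8").upper().encode("utf-8")


with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> ReadFromPubSub(subscription=SUBSCRIPTION)
        | "Transform" >> beam.Map(transform)
        | "Write" >> WriteToPubSub(topic=OUTPUT_TOPIC)
    )
```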
Repository
References
- Beam IO: Text https://beam.apache.org/releases/pydoc/current/apache_beam.io.textio.html
- Beam IO: Google BigQuery https://beam.apache.org/releases/pydoc/2.36.0/apache_beam.io.gcp.bigquery.html
- Beam IO: Google BigQuery - autodetect schema https://stackoverflow.com/a/67643669
- Beam IO: Google Cloud Pub/Sub https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.io.gcp.pubsub.html