Let's try: Apache Beam part 6 - instant IO
In this series:
- Let's try: Apache Beam part 1 – simple batch
- Let's try: Apache Beam part 2 – draw the graph
- Let's try: Apache Beam part 3 - my own functions
- Let's try: Apache Beam part 4 - live on Google Dataflow
- Let's try: Apache Beam part 5 - transform it with Beam functions
- Let's try: Apache Beam part 6 - instant IO
- Let's try: Apache Beam part 7 - custom IO
- Let's try: Apache Beam part 8 - Tags & Side inputs
Apache Beam provides inputs and outputs for PCollections in many packages. We just import them, call them properly, and get the job done.
In this post, we will look at 3 IO (input/output) modules that I usually work with.
1. Text (Google Cloud Storage)
A very basic one.
Beam has the `beam.io` library, which provides `ReadFromText()` and `WriteToText()` to read and write a text file respectively. We can also use them to work with files on Google Cloud Storage, as long as they are text files.
- line 14: `ReadFromText()` reads the file at `input_file`, which comes from `argparse`.
- line 16: `WriteToText()` creates a file at `output_file`.
This pipeline can be drawn as a diagram like this.
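A minimal sketch of such a pipeline, assuming hypothetical GCS paths and a placeholder uppercase transform, could look like this:

```python
import argparse

import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions


def main():
    parser = argparse.ArgumentParser()
    # Hypothetical defaults; gs:// paths work the same way as local paths.
    parser.add_argument("--input_file", default="gs://my-bucket/input.txt")
    parser.add_argument("--output_file", default="gs://my-bucket/output")
    args, beam_args = parser.parse_known_args()

    with beam.Pipeline(options=PipelineOptions(beam_args)) as p:
        (
            p
            | "Read" >> ReadFromText(args.input_file)
            | "Transform" >> beam.Map(str.upper)  # placeholder transform
            | "Write" >> WriteToText(args.output_file)
        )


if __name__ == "__main__":
    main()
```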
2. Database (Google BigQuery)
Another Google Cloud service that I use very often.
- line 14: `ReadFromBigQuery()` with `query=` supplied to run the query.
- line 18: supply `temp_dataset=` so that Beam can use the given dataset to store the temporary data it generates. Without `temp_dataset=`, Beam will automatically create a new dataset every time it runs.

This pipeline is shown in the diagram below:
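In code, a minimal sketch of this read, assuming a hypothetical project, query, and temporary dataset, could look like this:

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery import ReadFromBigQuery
from apache_beam.io.gcp.internal.clients import bigquery

with beam.Pipeline() as p:
    (
        p
        | "Read" >> ReadFromBigQuery(
            # Hypothetical query and project/dataset names.
            query="SELECT name, total FROM `my-project.my_dataset.my_table`",
            use_standard_sql=True,
            # Reuse a fixed dataset for Beam's temporary tables instead of
            # letting Beam create a new temp dataset on every run.
            temp_dataset=bigquery.DatasetReference(
                projectId="my-project",
                datasetId="beam_temp",
            ),
        )
        | "Print" >> beam.Map(print)
    )
```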
3. Messaging (Google Cloud Pub/Sub)
Now we come to real-time things.
Google Cloud Pub/Sub is one of the sources and sinks integrated with Beam. We can set up Beam to listen to either a topic or a subscription by design.
This time, I set up Beam to read data from a subscription, then transform it before sending the result to another topic.
- line 10: set the pipeline option `streaming=True` to allow Beam to run as a streaming pipeline.
- line 15: `ReadFromPubSub()` reads from a specific subscription.
- line 26: after transforming, publish the result to a topic through `WriteToPubSub()`.
Remember to `.encode()` the messages before throwing them to `WriteToPubSub()`, as it expects bytes rather than strings. We can test by publishing something on the topic and pulling from the subscription, like this.
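For example, with the google-cloud-pubsub client library (the project, topic, and subscription names here are hypothetical):

```python
from google.cloud import pubsub_v1

project_id = "my-project"  # hypothetical project

# Publish a test message to the input topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "my-input-topic")
future = publisher.publish(topic_path, b"hello beam")
print("published:", future.result())

# Pull the transformed result from a subscription on the output topic.
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project_id, "my-output-subscription")
response = subscriber.pull(request={"subscription": sub_path, "max_messages": 1})
for received in response.received_messages:
    print("pulled:", received.message.data)
subscriber.acknowledge(
    request={
        "subscription": sub_path,
        "ack_ids": [m.ack_id for m in response.received_messages],
    }
)
```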
`originMessageId` is parsed from the incoming message in the transformation step.
This diagram describes the flow of this Beam pipeline.
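Putting it together, a minimal sketch of such a streaming pipeline, assuming hypothetical subscription and topic names and a placeholder transform, could look like this:

```python
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub, WriteToPubSub
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical resource names.
SUBSCRIPTION = "projects/my-project/subscriptions/my-subscription"
OUTPUT_TOPIC = "projects/my-project/topics/my-output-topic"

# streaming=True lets Beam run this as a streaming pipeline.
options = PipelineOptions(streaming=True)


def transform(message: bytes) -> bytes:
    # Placeholder transform: decode, uppercase, and re-encode,
    # since WriteToPubSub() expects bytes.
    return message.decode("utf-8").upper().encode("utf-8")


with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> ReadFromPubSub(subscription=SUBSCRIPTION)
        | "Transform" >> beam.Map(transform)
        | "Write" >> WriteToPubSub(topic=OUTPUT_TOPIC)
    )
```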
Repository
References
- Beam IO: Text https://beam.apache.org/releases/pydoc/current/apache_beam.io.textio.html
- Beam IO: Google BigQuery https://beam.apache.org/releases/pydoc/2.36.0/apache_beam.io.gcp.bigquery.html
- Beam IO: Google BigQuery - autodetect schema https://stackoverflow.com/a/67643669
- Beam IO: Google Cloud Pub/Sub https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.io.gcp.pubsub.html