Hi~

A new episode of the "Let's try" series is here. This time we are talking about one of the popular transformation tools – Apache Beam.


What's Apache Beam

Apache Beam is a tool from Apache for processing data, on both a batch and a real-time basis. It is popular for its open-source license and its capability to handle data processing tasks. It is all about the "T" – Transform – in the "ETL" term, as if we were cooking fried eggs for our nice consumers.

Apache Beam supports mainstream languages such as Java, Python, and Go.

Apache Beam®: https://beam.apache.org/

Of course, we will use it in Python.


Concepts for this blog

We need some background before going to the hands-on.

Terms

The Apache Beam model consists of 3 simple terms:

  1. Pipeline: This is the overall task, the whole job from start to end.
  2. PCollection: This is a set of data. It can be bounded, a finite dataset a.k.a. batch, or unbounded, an infinite dataset a.k.a. streaming.
  3. PTransform: This is an operation that transforms data.

A Pipeline starts with a PCollection read from a bounded or unbounded data source, transforms it using a PTransform, and so on until the end.
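To make those three terms concrete, here is a minimal sketch of my own (not code from the Beam docs):

import apache_beam as beam

# Pipeline: the overall job, from source to sink.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create(['a', 'b', 'c'])  # PCollection: a bounded dataset
        | beam.Map(str.upper)           # PTransform: transform each element
        | beam.Map(print)               # PTransform: print each element
    )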

Runner

Apache Beam requires a "runner" to execute the pipeline. There are several runners for different platforms/engines. For more info, please visit the link below.

Apache Beam Capability Matrix: https://beam.apache.org/documentation/runners/capability-matrix/

In this blog, we are using the Python Direct Runner, which is one of the simplest runners.
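For reference, here are two equivalent ways to pick the Direct Runner (it is also the default when no runner is given):

import apache_beam as beam
from apache_beam.runners.direct.direct_runner import DirectRunner

# Pass the runner object...
pipeline = beam.Pipeline(runner=DirectRunner())
# ...or just its name as a string.
pipeline = beam.Pipeline(runner='DirectRunner')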


Begin the hands-on

In this hands-on, we read a CSV file and do some transformations on it.

1. Install Apache Beam

Prepare your environment and install Apache Beam with this command.

pip install apache-beam

2. Prepare a CSV file

Now we make a sample CSV file.
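The original file isn't shown here, so let's assume a small hypothetical people.csv like this (the file name and the columns name, gender, and age are my assumptions):

name,gender,age
Alice,F,25
Bob,M,31
Carol,F,29
David,M,40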

3. Read the file

We can write Python code to read the file as follows.
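Here is a minimal sketch of that code, assuming the hypothetical people.csv from step 2:

import apache_beam as beam

with beam.Pipeline(runner='DirectRunner') as pipeline:
    (
        pipeline
        # Read each line of the CSV file as a string (skip the header line).
        | beam.io.ReadFromText('people.csv', skip_header_lines=1)
        # Debug: print every element of the PCollection.
        | beam.Map(print)
    )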

We want to run it locally, so we import DirectRunner() or pass the string 'DirectRunner' as the runner parameter. After that, we add a pipe ( | ) to append a new step, beam.io.ReadFromText(). This function reads text lines from a given file.

I usually pipe beam.Map(print) to debug. This statement tells Beam to execute a PTransform – print, to print the values – via beam.Map(), passing in the PCollection from the previous step, which is what we have read from the file.

And here is the result: every line of the file printed to the console.

4. Map to dict

We want one more step. I've prepared a function, mapToDict(), that takes a string and parses it into a dict. This function is called via beam.Map(mapToDict), and then we print again.
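A sketch of that function, under the same assumed columns (the original implementation isn't shown, so the plain comma split is my guess):

def mapToDict(line):
    # Assumed columns: name, gender, age.
    name, gender, age = line.split(',')
    return {'name': name, 'gender': gender, 'age': age}

And the new steps in the pipeline:

        | beam.Map(mapToDict)
        | beam.Map(print)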

5. Filter

We want only F, the female persons, so we add a beam.Filter() step. This function keeps only the records in the PCollection that meet the given condition.
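A sketch of the filter step, assuming the gender column from the hypothetical file:

        | beam.Filter(lambda person: person['gender'] == 'F')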

6. Map to CSV rows

Say we have finished processing and want to save the output into a file; to do that, we need to transform each dict back into a string in CSV format.

The function mapToCSVRow will produce a string in CSV format.
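A possible implementation (the original isn't shown; this assumes the dict values are already in column order):

def mapToCSVRow(person):
    # Join the values back into one CSV line, e.g. "Alice,F,25".
    return ','.join(str(value) for value in person.values())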

7. Write to new CSV file

Right here, we use beam.io.WriteToText() to write the PCollection from the last PTransform into a file. Add shard_name_template="" to produce a file without a shard name; otherwise the filename will end with something like "-00000-of-00001".
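Putting it all together, here is a sketch of the whole pipeline (the file names, columns, and header are my assumptions):

import apache_beam as beam

def mapToDict(line):
    # Assumed columns: name, gender, age.
    name, gender, age = line.split(',')
    return {'name': name, 'gender': gender, 'age': age}

def mapToCSVRow(person):
    # Join the values back into one CSV line.
    return ','.join(str(value) for value in person.values())

with beam.Pipeline(runner='DirectRunner') as pipeline:
    (
        pipeline
        | beam.io.ReadFromText('people.csv', skip_header_lines=1)
        | beam.Map(mapToDict)
        | beam.Filter(lambda person: person['gender'] == 'F')
        | beam.Map(mapToCSVRow)
        | beam.io.WriteToText(
            'output',
            file_name_suffix='.csv',
            header='name,gender,age',
            shard_name_template='',  # no "-00000-of-00001" in the filename
        )
    )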

8. The new CSV file

Now the output is ready like this.
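With the hypothetical people.csv above, the resulting output.csv would contain only the F rows:

name,gender,age
Alice,F,25
Carol,F,29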