A new series of "Let's try" is here. This time we are talking about one of the popular transformation tools: Apache Beam.
What's Apache Beam?
Apache Beam is a tool from Apache for processing data, both batch and real-time. It is popular for its open-source license and its capability in data processing tasks. It is all about the "T" (Transform) in "ETL", as if we were cooking fried eggs for our nice consumers.
Apache Beam supports mainstream languages such as Java, Python, and Go.
Of course, we will use Python.
Concepts in this blog
We need some background before going to the hands-on.
Apache Beam consists of 3 simple terms:
Pipeline: the overall set of tasks.
PCollection: a set of data. It can be bounded (a finite dataset, a.k.a. batch) or unbounded (an infinite dataset, a.k.a. streaming).
PTransform: an operation that transforms data.
A Pipeline starts with a PCollection from a bounded or unbounded data source, transforms it with a PTransform, and so on until the end.
Apache Beam requires a "runner" to execute the task. There are several runners for different platforms/engines. For more info, please visit the link below.
In this blog, we are using the Python Direct Runner, which is one of the simplest runners.
Begin the hands-on
This hands-on reads a CSV file and does some transformations on it.
1. Install Apache Beam
Prepare your environment and install Apache Beam with this command.
pip install apache-beam
2. Prepare a CSV file
Now we make a sample CSV file.
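The original file contents were shown as an image; here is a hypothetical sample with name, gender, and age columns that the later steps will assume:

```python
# A hypothetical sample CSV (the original post showed its own file);
# the later steps assume these columns: name, gender, age
csv_text = """name,gender,age
Alice,F,30
Bob,M,25
Carol,F,41
"""

with open("persons.csv", "w") as f:
    f.write(csv_text)
```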
3. Read the file
We can write Python code to read the file as follows.
We want to run it locally, so we import DirectRunner() or pass 'DirectRunner' as the runner parameter. After that, we add a pipe ( | ) in order to add a new step for beam.io.ReadFromText(). This function reads text lines from a given file.
I usually pipe beam.Map(print) to debug. This statement tells Beam to execute beam.Map() and pass in the PCollection from the previous step, which is what we have read from the file.
And here is the result.
4. Map to dict
We want one more step. I've prepared a function mapToDict() that takes a string and parses it into a dict of CSV fields. This function is called by beam.Map(mapToDict).
5. Filter the records
We want only F, the female persons, therefore we add the step beam.Filter(). This function keeps only the records in the PCollection that meet the condition.
6. Map to CSV rows
Say we have finished processing and want to save the output into a file, so we need to transform each record back into a str in CSV format.
mapToCSVRow will produce a string in CSV format.
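mapToCSVRow was also shown as an image; a hypothetical version for the assumed columns could look like this:

```python
def mapToCSVRow(person):
    # join the dict values back into one CSV-formatted string
    return ",".join([person["name"], person["gender"], str(person["age"])])
```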
7. Write to a new CSV file
Right here, we use beam.io.WriteToText() to write the PCollection from the last PTransform into a file. Add shard_name_template="" in order to produce a file without a shard name; otherwise the filename will end with something like "-00000-of-00001".
8. The new CSV file
Now the output file is ready.