Let's try: Apache Beam part 1 – simple batch
In this series
- Let's try: Apache Beam part 1 – simple batch
- Let's try: Apache Beam part 2 – draw the graph
- Let's try: Apache Beam part 3 - my own functions
- Let's try: Apache Beam part 4 - live on Google Dataflow
- Let's try: Apache Beam part 5 - transform it with Beam functions
- Let's try: Apache Beam part 6 - instant IO
- Let's try: Apache Beam part 7 - custom IO
- Let's try: Apache Beam part 8 - Tags & Side inputs
Hi~
A new series of "Let's try" is here. This time we are talking about one of the popular transformation tools: Apache Beam.
What's Apache Beam
Apache Beam is a tool from Apache for processing data, on both a batch and a real-time basis. It is popular for its open-source license and its capability to handle data processing tasks. It is about the "T", Transform, in the "ETL" term, as if we were cooking fried eggs for our nice consumers.
Apache Beam can be used with mainstream languages such as Java, Python, and Go.
Of course, we will take it in Python.
Concepts in this blog
We need some background before going to the hands-on part.
Terms
Apache Beam is built around 3 simple terms:
Pipeline
: This is the overall set of tasks.

PCollection
: This is a set of data. It can be bounded (a finite dataset, a.k.a. batch) or unbounded (an infinite dataset, a.k.a. streaming).

PTransform
: This is an operation that transforms data.

A Pipeline starts from a PCollection, coming from a bounded or unbounded data source, transforms it using a PTransform, and so on until the end.
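For example, a tiny pipeline could look like this (a minimal sketch, not part of the hands-on itself):

import apache_beam as beam

# Pipeline: the overall set of tasks.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([1, 2, 3])     # PCollection: a bounded set of data
        | beam.Map(lambda x: x * 2)  # PTransform: an operation on the data
        | beam.Map(print)            # another PTransform, printing each element
    )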
Runner
Apache Beam requires a "runner" to execute the task. There are several runners for different platforms/engines. For more info, please visit the link below.
In this blog, we are using the Python Direct Runner, which is one of the simplest runners.
Begin the hands-on
In this hands-on, we read a CSV file and do some transformations.
1. Install Apache Beam
Prepare your environment and install Apache Beam with this command.
pip install apache-beam
2. Prepare a CSV file
Now we make a sample CSV file.
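Let's say it is people.csv with rows like these; the gender column is an assumption, chosen because we will filter on it later.

id,name,gender,age
1,Somchai,M,30
2,Malee,F,25
3,Anong,F,28
4,Somsak,M,35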
3. Read the file
We can write Python code to read the file as follows. We want to run it locally, so we import DirectRunner() or pass 'DirectRunner' as the runner parameter. After that, we add a pipe (|) to add a new step, beam.io.ReadFromText(). This function reads text from a given file, line by line.
I usually pipe beam.Map(print) to debug. This statement tells Beam to execute a PTransform, print, to print values via beam.Map(), taking the PCollection from the previous step, which is what we have read from the file.
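A sketch of that code, assuming the sample people.csv from above:

import apache_beam as beam

# Passing the string 'DirectRunner' works the same as importing and passing
# a DirectRunner() instance; both run the pipeline locally.
with beam.Pipeline(runner='DirectRunner') as pipeline:
    (
        pipeline
        | beam.io.ReadFromText('people.csv')  # read the file as lines of text
        | beam.Map(print)                     # debug: print each element
    )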
And here is the result.
4. Map to dict
We want one more step. I've prepared a function mapToDict() that takes a string and parses it into a dict of CSV fields. This function is called via beam.Map(mapToDict), and then we print again.
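Here is what mapToDict() could look like, assuming our sample columns id, name, gender, and age:

import apache_beam as beam

def mapToDict(line):
    # Split one CSV line into a dict; assumes the columns id,name,gender,age.
    id_, name, gender, age = line.split(',')
    return {'id': id_, 'name': name, 'gender': gender, 'age': int(age)}

with beam.Pipeline(runner='DirectRunner') as pipeline:
    (
        pipeline
        | beam.io.ReadFromText('people.csv', skip_header_lines=1)  # skip the header row
        | beam.Map(mapToDict)  # str -> dict
        | beam.Map(print)
    )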
5. Filter
We want only F, the female records, so we add the step beam.Filter(). This function keeps only the records in the PCollection that meet the condition.
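The new step slots in right after beam.Map(mapToDict); the gender key is still our assumed column name:

import apache_beam as beam

def mapToDict(line):
    id_, name, gender, age = line.split(',')
    return {'id': id_, 'name': name, 'gender': gender, 'age': int(age)}

with beam.Pipeline(runner='DirectRunner') as pipeline:
    (
        pipeline
        | beam.io.ReadFromText('people.csv', skip_header_lines=1)
        | beam.Map(mapToDict)
        | beam.Filter(lambda person: person['gender'] == 'F')  # keep only matching records
        | beam.Map(print)
    )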
6. Map to CSV rows
Say we have finished processing and want to save the output into a file, so we need to transform from dict back to str in CSV format. The method mapToCSVRow will produce a string in CSV format.
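A sketch of mapToCSVRow, with the field order following our assumed columns:

def mapToCSVRow(person):
    # Join the dict values back into one CSV line, e.g. '2,Malee,F,25'.
    return ','.join(str(person[key]) for key in ('id', 'name', 'gender', 'age'))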
7. Write to new CSV file
Right here, we use beam.io.WriteToText() to write the PCollection from the last PTransform into a file. Add shard_name_template="" in order to produce a file without a shard name, otherwise the filename will end with something like "-00000-of-00001".
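Putting every step together, under the same assumptions as before:

import apache_beam as beam

def mapToDict(line):
    id_, name, gender, age = line.split(',')
    return {'id': id_, 'name': name, 'gender': gender, 'age': int(age)}

def mapToCSVRow(person):
    return ','.join(str(person[key]) for key in ('id', 'name', 'gender', 'age'))

with beam.Pipeline(runner='DirectRunner') as pipeline:
    (
        pipeline
        | beam.io.ReadFromText('people.csv', skip_header_lines=1)
        | beam.Map(mapToDict)
        | beam.Filter(lambda person: person['gender'] == 'F')
        | beam.Map(mapToCSVRow)
        | beam.io.WriteToText(
            'output.csv',
            header='id,name,gender,age',  # write a header row to the output
            shard_name_template='',       # no "-00000-of-00001" suffix
        )
    )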
8. The new CSV file
Now the output is ready like this.
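Assuming the sample people.csv from earlier, it would contain only the F rows:

id,name,gender,age
2,Malee,F,25
3,Anong,F,28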