When it comes to complex transformations, we want our flows to stay organized and clean. Yes, I’m talking about functions and classes.
How should we create and use them in Apache Beam?
ParDo is an Apache Beam transform that runs a given class on every element to do some transformation. To implement ParDo, we can start from this syntax.
We start by defining a class that inherits from beam.DoFn. The one required function here is process(). We need to implement it, and Beam will call it automatically for each element.
There are some other functions we will see later.
We can just import it and execute the custom class by passing an instance of it to beam.ParDo().
Let’s say we are running Beam to transform a CSV file, the same as in the last part, but this time we want to use our own function, wrapped in a class.
Here it is. We have a class CSVToDictFn that inherits from beam.DoFn. This class transforms a row of the CSV into a dict.
Then we go back to main and call the class, like this. Beam will call the function process() of the class for each element.
Here is the output.
beam.ParDo treats whatever process() produces as an iterable and emits each item in it as one output element, so we should use yield here. yield emits the dictionary as a single element, while return hands the dictionary itself to Beam, which then iterates over it.
This is what we will get if we use return. It is only the keys of the dictionary, because iterating over a dictionary produces its keys.
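The difference can be reproduced without Beam. The sketch below assumes only that the runner iterates over whatever process() gives back; the two function names and the sample row are made up for illustration.

```python
def process_with_yield(row):
    # Generator: iterating over it produces the dict itself, once.
    yield row


def process_with_return(row):
    # Plain return: the dict itself is what gets iterated over.
    return row


row = {"id": "1", "name": "Alice"}

# One element: the whole dictionary.
print(list(process_with_yield(row)))   # [{'id': '1', 'name': 'Alice'}]

# Iterating over a dict walks its keys.
print(list(process_with_return(row)))  # ['id', 'name']
```

If return is preferred, wrapping the dictionary in a list (return [row]) also emits it as a single element.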
Example with parameters
Let's move to another level.
If we want to pass parameters to the class, we can implement __init__() to set them.
Here we have the CSVToDictFn class again. This version has __init__() to receive a variable, schema. The variable is used to cast the CSV fields to different types.
And we can call the class like this.
We define schema first and then call the class. When calling the class, we also pass schema in the parentheses.
The logic of this version of CSVToDictFn is to take the field names, the field types, and the values, line them up with zip, and construct a dictionary. This can be illustrated as the following figure.
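The same mapping can be written out step by step in plain Python. The field names, types, and the sample row here are assumed values, used only to make the steps concrete.

```python
field_names = ["id", "name", "age"]   # assumed field names
field_types = [int, str, int]         # assumed field types
values = "1,Alice,30".split(",")      # raw strings from one CSV row

# zip lines the three sequences up position by position,
# giving one (name, type, raw value) triple per field.
triples = list(zip(field_names, field_types, values))

# Cast each raw value with its type and key it by its name.
record = {name: cast(raw) for name, cast, raw in triples}
print(record)  # {'id': 1, 'name': 'Alice', 'age': 30}
```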