Data Integration (EP 2) - Take it out
Hello to myself and everyone around the world!!
As discussed in the last episode, we will now build a simple workflow to integrate data with Talend.
Let’s say we need to combine the data from 10 files into one. How can we do that?
The left panel is called the Repository. It stores all our jobs and their parts, e.g. custom source code and metadata.
The right panel is the Palette, which displays a list of available components.
The bottom panel is Property:
- Property – Job for the job description: creation date and version
- Property – Context for job variables
- Property – Component displays the settings and configuration of the selected component
- Property – Run for running the program manually
Let’s do the exercise!
There are 2 boxes of operations. This means we need at least 2 components: one for reading the 10 files and one for writing 1 file.
tFileInputDelimited for reading one CSV file and
tFileOutputDelimited for writing one CSV file.
Next, we connect the two by right-clicking and selecting Row > Main. As a result, a row from tFileInputDelimited will be a row in tFileOutputDelimited.
Right now we have a design for a single file. Then we check whether the data needs to be transformed and, luckily, it does not.
Next, we have to define the schema. For example, the file contains 2 columns: first name and last name.
Click the ellipsis button (…) and a schema box will appear.
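The single-file design can be pictured outside Talend as a plain read-and-write loop. The following is only a rough Python sketch of the idea, not what Talend generates: the function name `copy_rows` and the sample names are made up for illustration, and in-memory text stands in for the real CSV files.

```python
import csv
import io


def copy_rows(source, destination, delimiter=","):
    """Copy every delimited row from `source` to `destination`, unchanged.

    Hypothetical helper: it plays the role of the Row > Main link, where
    each row read on the input side becomes one row on the output side.
    """
    reader = csv.reader(source, delimiter=delimiter)
    writer = csv.writer(destination, delimiter=delimiter, lineterminator="\n")
    count = 0
    for row in reader:
        writer.writerow(row)  # no transformation is needed in this exercise
        count += 1
    return count


# Made-up sample data standing in for one source file.
source = io.StringIO("Ada,Lovelace\nAlan,Turing\n")
destination = io.StringIO()
copied = copy_rows(source, destination)
print(copied, "rows copied")
```

Since no transformation is needed, the loop body does nothing but pass the row along, which is all the Main link between the two components has to do here.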
- first_name as String (text)
- last_name also as String
Don’t forget to set exactly the same schema on both tFileInputDelimited and tFileOutputDelimited (we can click “Sync Columns” to copy the schema immediately).
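To see why both sides should agree on the schema, here is a small Python sketch. It assumes the two String columns from the example; the names `SCHEMA` and `apply_schema` are hypothetical and only illustrate the idea of one shared column definition.

```python
# One shared schema, used for both reading and writing: (column name, type).
SCHEMA = [("first_name", str), ("last_name", str)]


def apply_schema(row, schema=SCHEMA):
    """Map a raw row (list of values) onto the schema's column names.

    Raises ValueError when the row does not have one value per column,
    which is the kind of mismatch a shared schema helps to avoid.
    """
    if len(row) != len(schema):
        raise ValueError(f"expected {len(schema)} columns, got {len(row)}")
    return {name: cast(value) for (name, cast), value in zip(schema, row)}


record = apply_schema(["Ada", "Lovelace"])
print(record)
```

A row with a missing or extra column fails straight away instead of producing a misaligned output file.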
tFileInputDelimited supports only a single file, so we need another component: tFileList, which retrieves a list of files in a specific folder.
In tFileList, we fill in the path of the folder that contains those 10 files in the “Directory” box. The figure below shows my 10 files prepared in that folder.
After that, we connect tFileList to tFileInputDelimited with Row > Iterate to read each file in the folder.
tFileList will list all files that meet our conditions and expose each one as the variable CURRENT_FILEPATH, so we will read them one by one in tFileInputDelimited. One more thing: CSV files normally use a comma as the separator, so don’t forget to check that the field separator setting matches.
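The iterate step can be sketched in Python as a loop over the folder. This is only an approximation of the idea: the function names, the `*.csv` filter, and the temporary folder with three made-up files are assumptions for the demo, and `current_filepath` simply mirrors the role of CURRENT_FILEPATH.

```python
import csv
import tempfile
from pathlib import Path


def iterate_files(directory, pattern="*.csv"):
    """Yield each file in `directory` that matches `pattern`, in name order."""
    for current_filepath in sorted(Path(directory).glob(pattern)):
        yield current_filepath


def read_rows(filepath, delimiter=","):
    """Read one delimited file and return its rows as lists of strings."""
    with open(filepath, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter=delimiter))


# Demo with a temporary folder and made-up files (3 instead of 10).
with tempfile.TemporaryDirectory() as folder:
    for i in range(1, 4):
        Path(folder, f"people_{i}.csv").write_text(
            f"First{i},Last{i}\n", encoding="utf-8"
        )
    Path(folder, "notes.txt").write_text("ignored\n", encoding="utf-8")

    names_seen = []
    rows_seen = []
    for current_filepath in iterate_files(folder):
        names_seen.append(current_filepath.name)
        rows_seen.extend(read_rows(current_filepath))

print(names_seen)
print(rows_seen)
```

The file `notes.txt` is skipped because it does not match the filter, in the same way that tFileList only lists files that meet our conditions.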
In tFileOutputDelimited, we want one destination file to store all the data from the 10 source files, so check “Append” to let the program add data at the end of the file. Check “Include Header” when we need headers.
Set the destination filename as desired, and finally run the program as below.
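The effect of appending can be sketched in Python as follows. This sketch assumes the header should appear only once, at the top of the destination file. The helper `append_rows`, the file name `all_people.csv`, and the sample rows are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

HEADER = ["first_name", "last_name"]


def append_rows(destination, rows, include_header=True, delimiter=","):
    """Append `rows` to `destination`, writing the header only into an empty file."""
    destination = Path(destination)
    is_empty = not destination.exists() or destination.stat().st_size == 0
    # Mode "a" adds to the end of the file instead of overwriting it.
    with open(destination, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
        if include_header and is_empty:
            writer.writerow(HEADER)
        writer.writerows(rows)


# Demo: two separate writes end up in the same destination file.
with tempfile.TemporaryDirectory() as folder:
    target = Path(folder, "all_people.csv")
    append_rows(target, [["Ada", "Lovelace"]])
    append_rows(target, [["Alan", "Turing"]])
    merged_text = target.read_text(encoding="utf-8")

print(merged_text)
```

Without append mode, each source file would overwrite the previous output and only the rows of the last file would remain.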
And here is the result.
See, the program works properly. Next time we will find out how to schedule the program to run at the time we want.
See you again