Data Integration (EP 2) - Take it out

Hello to myself and everyone around the world!!

As we discussed in the last episode, we will now build a simple workflow to integrate data with Talend.

Let’s say we need to combine the data from 10 files into one. How can we do that?

Talend’s interface

The left panel is called the Repository. It stores all our jobs and their building blocks, e.g. custom source code and metadata.

The one on the right is the Palette, which displays a list of available components.

The one at the bottom is Property:

  • Property – Job for the job description: creation date and version
  • Property – Context for job variables
  • Property – Component for the settings and configuration of the selected component
  • Property – Run for running the program manually

Let’s do the exercise!

There are 2 boxes of operations. That means we need at least 2 components: one for reading the 10 files and one for writing 1 file.

Those are tFileInputDelimited for reading one CSV file and tFileOutputDelimited for writing one CSV file.

Next, we connect both by right-clicking tFileInputDelimited and selecting Row > Main. As a result, each row from tFileInputDelimited becomes a row in tFileOutputDelimited.

Right now we have a design for a single file. Then we check whether the data needs to be transformed and, luckily, it does not.

Next, we have to define the schema. For example, the file contains 2 columns: first name and last name.

Click the ellipsis button (…) and a schema box will appear.
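To make the idea of a Row > Main link concrete, here is a minimal sketch in plain Python of what the two connected components do for a single file. It is not the code Talend generates. The function name copy_rows and the sample file names are my own, chosen only for illustration.

```python
import csv
import os
import tempfile


def copy_rows(src_path, dst_path):
    """Read one CSV file row by row and write each row to another file."""
    count = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)   # plays the role of the input component
        writer = csv.writer(dst)   # plays the role of the output component
        for row in reader:         # each input row becomes one output row
            writer.writerow(row)
            count += 1
    return count


# Small demo with a hypothetical two-row source file.
workdir = tempfile.mkdtemp()
src_file = os.path.join(workdir, "people_01.csv")
dst_file = os.path.join(workdir, "copy.csv")
with open(src_file, "w", newline="", encoding="utf-8") as f:
    f.write("Ada,Lovelace\nAlan,Turing\n")

rows_copied = copy_rows(src_file, dst_file)
print(rows_copied, "rows copied")
```

Nothing is transformed on the way, which matches the job at this stage: rows go in one side and come out the other unchanged.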

  • first_name as String (text)
  • last_name also as String

Don’t forget to set exactly the same schema on both tFileInputDelimited and tFileOutputDelimited (we can click “Sync Columns” to copy the schema immediately).

tFileInputDelimited supports only a single file, so we need another component, tFileList, to retrieve a list of files in a specific folder.

On tFileList, we fill in the path of the folder that contains those 10 files in the “Directory” box. The figure below shows my 10 files prepared in that folder.

After that, we connect tFileList and tFileInputDelimited with Row > Iterate to read each file in the folder.
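As a rough illustration of what a schema does, the sketch below describes the two columns as name and type pairs and applies them to one raw row. This is only an analogy written in Python, not how Talend stores schemas. The names INPUT_SCHEMA, OUTPUT_SCHEMA and apply_schema are assumptions made up for this example.

```python
# Each column is a (name, type) pair, like one line in the schema box.
INPUT_SCHEMA = [("first_name", str), ("last_name", str)]

# Copying the list stands in for clicking "Sync Columns":
# the output side gets exactly the same columns as the input side.
OUTPUT_SCHEMA = list(INPUT_SCHEMA)


def apply_schema(raw_row, schema):
    """Turn a list of raw text values into a dict keyed by column name."""
    if len(raw_row) != len(schema):
        raise ValueError(
            f"expected {len(schema)} columns, got {len(raw_row)}"
        )
    return {name: cast(value) for (name, cast), value in zip(schema, raw_row)}


record = apply_schema(["Ada", "Lovelace"], INPUT_SCHEMA)
print(record)
```

If a row has a different number of columns than the schema, the sketch raises an error instead of guessing, which is the kind of mismatch that keeping both schemas identical helps to avoid.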

tFileList will list all files that meet our conditions and expose the current one as the variable CURRENT_FILEPATH, so tFileInputDelimited reads them one by one. One more thing: CSV files use a comma as the separator by default. Don’t forget to check that the field separator set on tFileInputDelimited matches.
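The Iterate link can be pictured as a simple loop over the matching files, with one variable holding the current path on each pass. Below is a small Python sketch of that idea. The list_files helper, the "*.csv" mask and the generated file names are assumptions for the demo, not Talend settings.

```python
import glob
import os
import tempfile


def list_files(directory, mask="*.csv"):
    """Return the paths in a folder that match the mask, in name order."""
    return sorted(glob.glob(os.path.join(directory, mask)))


# Prepare a hypothetical folder with 10 CSV files and one unrelated file.
folder = tempfile.mkdtemp()
for i in range(1, 11):
    path = os.path.join(folder, f"people_{i:02d}.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        f.write(f"First{i},Last{i}\n")
with open(os.path.join(folder, "notes.txt"), "w", encoding="utf-8") as f:
    f.write("not a data file\n")

seen = []
for current_filepath in list_files(folder):
    # On each pass, current_filepath is the one file to read next.
    seen.append(os.path.basename(current_filepath))

print(len(seen), "files found, first is", seen[0])
```

The unrelated notes.txt is skipped because it does not match the mask, which is the same reason to set the file conditions carefully on tFileList.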

On tFileOutputDelimited, we want the destination file to store all the data from the 10 source files, so check “Append” to let the program add data at the end of the file. Check “Include Header” when we need headers.

Set the destination filename as we like, and finally run the program as below.
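For readers who want to see the whole job as ordinary code, here is a hedged Python sketch of the same logic: loop over the source files and append their rows to one destination file. It assumes the source files have no header row of their own, and it writes the header only while the destination is still empty. That header rule is a design choice of this sketch, not a statement about Talend's behavior. All names and paths are hypothetical.

```python
import csv
import glob
import os
import tempfile

HEADER = ["first_name", "last_name"]


def append_file(src_path, dst_path, include_header=True):
    """Append all rows of one CSV file to the end of the destination file."""
    dst_is_empty = (not os.path.exists(dst_path)
                    or os.path.getsize(dst_path) == 0)
    rows = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "a", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        if include_header and dst_is_empty:
            writer.writerow(HEADER)   # header goes in once, at the top
        for row in csv.reader(src):
            writer.writerow(row)      # mode "a" keeps earlier rows in place
            rows += 1
    return rows


# Demo: 10 hypothetical source files with 2 rows each.
base = tempfile.mkdtemp()
source_dir = os.path.join(base, "in")
os.makedirs(source_dir)
for i in range(1, 11):
    path = os.path.join(source_dir, f"people_{i:02d}.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        f.write(f"First{i},Last{i}\nOther{i},Last{i}\n")

# The destination sits outside the source folder so it is never re-read.
destination = os.path.join(base, "merged.csv")
total = 0
for current_filepath in sorted(glob.glob(os.path.join(source_dir, "*.csv"))):
    total += append_file(current_filepath, destination)

with open(destination, newline="", encoding="utf-8") as f:
    merged = list(csv.reader(f))
print(total, "data rows,", len(merged), "lines including the header")
```

Opening the destination in append mode is what keeps each new file from overwriting the rows written by the previous one.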

And here is the result.

See, the program works properly. Next time we will find out how to schedule the program to run at the time we want.

See you again
