Data pipelines sometimes have to handle very large files, so we need a way to get the job done with fewer frustrations, such as running out of resources or hitting glitches in the files.
Typically we write files
A simple Python script for this opens a target file in w (write) mode, then puts some content in it.
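As a minimal sketch of that step, the snippet below writes two lines to a file. The file name sample_output.txt and its location in the system temp directory are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path

# Hypothetical target path, used only for illustration.
target_path = Path(tempfile.gettempdir()) / "sample_output.txt"

# "w" mode creates the file, or empties it first if it already exists.
with open(target_path, "w", encoding="utf-8") as target:
    target.write("first line\n")
    target.write("second line\n")
```

The with block closes the file automatically, even if a write fails partway through.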
Then we meet chunks
Let's say we want to read a large file and write it to a destination, but we can't read it all at once. This is where chunk processing helps. A chunk is a small piece of something big, so we split that big thing into pieces and transfer them one by one until we are finished. In practice, we read (r) a piece of a specific size from a file, then write it into another file.
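One way this could look is sketched below. The function name copy_in_chunks, its parameters, and the default chunk size of 1024 characters are assumptions chosen for illustration, not values from the original script.

```python
def copy_in_chunks(source_path, target_path, chunk_size=1024):
    """Copy a text file piece by piece and return the number of chunks written."""
    chunks = 0
    with open(source_path, "r", encoding="utf-8") as source, \
         open(target_path, "w", encoding="utf-8") as target:
        while True:
            # Read at most chunk_size characters, never the whole file.
            chunk = source.read(chunk_size)
            if not chunk:  # an empty string means end of file
                break
            target.write(chunk)
            chunks += 1
    return chunks
```

Only one chunk is held in memory at a time, so memory use stays roughly constant however large the source file is.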
Chunk and decompression
If we need an extra operation, we can apply it to each chunk. For example, we can decompress a gzip file with chunk processing. This time we use rb to read the gzip source file as binary and wb to write as binary into the target text file. Decompression can be completed thanks to
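The library name is missing from the text above, so the sketch below assumes Python's standard zlib module, which can decode gzip data incrementally. The function name gunzip_in_chunks and the 64 KiB default chunk size are also assumptions for illustration.

```python
import zlib


def gunzip_in_chunks(source_path, target_path, chunk_size=64 * 1024):
    """Decompress a single-member gzip file chunk by chunk."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    with open(source_path, "rb") as source, open(target_path, "wb") as target:
        while True:
            chunk = source.read(chunk_size)  # compressed bytes
            if not chunk:
                break
            # Each compressed chunk yields whatever output is ready so far.
            target.write(decompressor.decompress(chunk))
        # Write any output still buffered inside the decompressor.
        target.write(decompressor.flush())
```

This sketch handles a gzip file with a single member. A file made by concatenating several gzip streams would need extra handling of the decompressor's unused_data attribute.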
Chunk from database connectors
The same idea works for simply dumping database data. First, we connect to the database and execute a query, then call fetchmany() to get each chunk, so we can process the chunk however we like. This time we append each chunk to the file with a (append) mode and use the csv library for CSV formatting.
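The steps above could be sketched as follows. The original connector is not named, so this sketch assumes any DB-API style connection (the standard sqlite3 module is one example). The function name dump_query_to_csv and the default batch size of 1000 rows are hypothetical.

```python
import csv


def dump_query_to_csv(connection, query, target_path, batch_size=1000):
    """Append query results to a CSV file in batches; return the row count."""
    cursor = connection.cursor()
    cursor.execute(query)
    total = 0
    while True:
        # fetchmany() returns up to batch_size rows, or an empty list when done.
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        # "a" appends to the file; newline="" lets csv control line endings.
        with open(target_path, "a", newline="", encoding="utf-8") as target:
            csv.writer(target).writerows(rows)
        total += len(rows)
    cursor.close()
    return total
```

Because the file is opened in append mode, running the function twice against the same path adds the rows twice, so remove or rename an old dump before starting a new one.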
These are code samples you can try and adapt to your own work for performance optimization.