Manipulating Large Files via the Command Line
Always look for a command-line solution first
It is pretty standard in software development to work with large files at one point or another. Often they are in a format you don't want, like CSV instead of JSON. Often there is extra data that needs to be filtered out. Sometimes you have to split the CSV file into multiple, smaller files. So you start to look for a way to do this.
There have been times when I have thought, "Oh, I will have to write a program to filter and transform these files for me." This was because of my ignorance and stupidity.
Although there is more data now than in the past, people have already created simple command-line tools to deal with big data.
The three tools I use the most are:
- sed is focused on general text processing
- awk is focused on column data (CSV, TSV)
- jq is focused on JSON data
More often than not, there is already a solution that you can find with a quick Google search. Specifically, include the name of a tool like sed, awk, or jq in the query, or even just "command-line".
Here are some simple recent queries I have done.
Google Search: "sed Remove all " from file"
Result: "sed 's/"//g'"
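To see what that sed result does, here is a minimal sketch run against a tiny sample file. The file names and contents are made up for illustration; on a real job you would point sed at your own large file.

```shell
# Hypothetical sample data, created in a throwaway directory.
workdir=$(mktemp -d)
printf '%s\n' '"id","name"' '"1","alice"' '"2","bob"' > "$workdir/quoted.csv"

# s/"//g deletes every double quote on each line.
# sed works through the file line by line, so it does not need to
# hold the whole file in memory.
sed 's/"//g' "$workdir/quoted.csv" > "$workdir/unquoted.csv"

cat "$workdir/unquoted.csv"
# id,name
# 1,alice
# 2,bob
```

One caveat: stripping every quote is only safe when no quoted field contains a comma, since the quotes are what keep such a field in one piece.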
Google Search: "awk sum second column"
Result: "awk -F',' '{sum+=$2;}END{print sum;}'"
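Here is a small sketch of summing the second column with awk. The sample file, its column names, and the header-skipping condition are my own assumptions for illustration; drop the NR > 1 part if your file has no header row.

```shell
# Hypothetical sample data, created in a throwaway directory.
workdir=$(mktemp -d)
printf '%s\n' 'item,quantity,price' 'apple,3,0.50' 'pear,5,0.75' 'fig,12,2.00' > "$workdir/orders.csv"

# -F',' splits each line on commas, NR > 1 skips the header row,
# and $2 is the second column. The END block runs once after the last line.
total=$(awk -F',' 'NR > 1 { sum += $2 } END { print sum }' "$workdir/orders.csv")

echo "$total"
# 20
```

Splitting on a bare comma does not understand quoted fields with commas inside them, so this works best on simple, well-behaved CSV.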
Google Search: "Remove json field jq"
Result: "jq 'del(.a)' data.json"
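The jq result can be tried the same way. This is a sketch against a made-up data.json; the check for jq is there because, unlike sed and awk, jq is often not installed by default.

```shell
# Hypothetical sample data, created in a throwaway directory.
workdir=$(mktemp -d)
printf '%s\n' '{"a": 1, "b": 2}' > "$workdir/data.json"

if command -v jq >/dev/null 2>&1; then
  # del(.a) removes the key "a" from the object;
  # -c prints the result compactly on a single line.
  result=$(jq -c 'del(.a)' "$workdir/data.json")
  echo "$result"
  # {"b":2}
else
  echo "jq is not installed" >&2
fi
```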
I am NOT an expert, but luckily there are a lot of experts out there. They have created documentation, answered questions, and maintained these tools. Lean on their work, and remember these tools the next time you have to do something with a 50 GB CSV.