Split files in Linux by context


The csplit command is unusual in that it lets you split text files into pieces based on their content. You specify a context string (a pattern), and csplit uses it as a delimiter to identify the pieces to save as separate files.

For example, if you wanted to split a diary into a series of files, each holding a single entry, you could do something like this:

$ csplit -z diary '/^Dear/' '{*}'
153
123
136

In this example, “diary” is the name of the file to split. The command looks for lines that begin with the word “Dear”, as in “Dear Diary”, to determine where each chunk begins. The -z option tells csplit not to bother saving files that would be empty. The '{*}' argument repeats the pattern as many times as possible.

You can list the newly created files with a command like this one, which limits the output of the ls command to the most recent files. The three numbers that csplit displayed are the sizes in bytes of the three separate files that were created.

$ ls -ltr | tail -3
-rw-r--r--.  1 shs  shs        136 Jan  1 15:02 xx02
-rw-r--r--.  1 shs  shs        123 Jan  1 15:02 xx01
-rw-r--r--.  1 shs  shs        153 Jan  1 15:02 xx00
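
To see what the -z option changes, here is a minimal sketch run in a scratch directory. The three-entry sample file is made up for illustration (the real diary file is not shown in full here), and the -s option is added only to silence the byte counts. It assumes GNU csplit.

```shell
# Work in a scratch directory so no existing xxNN files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical three-entry sample file, used only for this illustration.
printf 'Dear Diary,\nFirst entry.\n\nDear Diary,\nSecond entry.\n\nDear Diary,\nThird entry.\n' > diary

# Without -z: line 1 already matches, so the piece "before the first match"
# is empty and is still written out as a zero-byte xx00.
csplit -s diary '/^Dear/' '{*}'
ls xx*        # lists xx00 through xx03
rm xx*

# With -z: the empty piece is dropped and numbering starts at the first entry.
csplit -s -z diary '/^Dear/' '{*}'
ls xx*        # lists xx00 through xx02
```

Because the pattern matches the very first line of this sample, the run without -z produces one extra, empty file.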

You can also use the full text of the separator line:

$ csplit -z diary '/^Dear Diary,/' '{*}'

In both cases, the xx00 file will look like this:

$ cat xx00
Dear Diary,

Today was a difficult day. I dragged a dozen bags of trash to the transfer
station and came home to find a dozen more waiting on my porch.

The xx00, xx01, xx02 naming is the default. If you split another file, those output files would be overwritten by the newer ones unless you use the -f or --prefix option to replace “xx” with something more meaningful, as in the example below, in which the word “diary” is used to name the files.

$ csplit -zf diary diary '/^Dear/' '{*}'
153
123
136
$ ls -ltr | tail -3
-rw-r--r--.  1 shs  shs        123 Jan  1 15:11 diary01
-rw-r--r--.  1 shs  shs        153 Jan  1 15:11 diary00
-rw-r--r--.  1 shs  shs        136 Jan  1 15:11 diary02
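
As a rough sketch of why a prefix matters, the commands below split two files in the same directory without either run overwriting the other. The file names, their contents and the prefixes "diary_" and "journal_" are hypothetical, chosen only for this illustration. The long-form --prefix option is assumed to be available, as it is in GNU csplit.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two hypothetical input files.
printf 'Dear Diary,\nFirst entry.\n\nDear Diary,\nSecond entry.\n' > diary
printf 'Dear Diary,\nAnother notebook.\n' > journal

# Give each run its own prefix so the pieces cannot collide.
csplit -s -z --prefix=diary_ diary '/^Dear/' '{*}'
csplit -s -z --prefix=journal_ journal '/^Dear/' '{*}'

ls            # diary, diary_00, diary_01, journal, journal_00
```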

If the entries in the file you want to split are separated by dates, you can try a command like this one, which searches for part of the date field:

$ csplit -zf diary diary '/, 202/' '{*}'
166
136
149
$ cat diary00
Dec 11, 2021
Dear Diary,

Today was a difficult day. I dragged a dozen bags of trash to the transfer
station and came home to find a dozen more waiting on my porch.
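
A short pattern such as ', 202' will also match a date that happens to appear inside an entry's text. The sketch below compares it with a pattern anchored to the whole line. The sample file and the "loose" and "strict" prefixes are invented for this illustration, and the anchored pattern assumes date lines of the form "Dec 11, 2021".

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample: the second entry mentions a date inside its body text.
printf 'Dec 11, 2021\nDear Diary,\nQuiet day.\n\nDec 12, 2021\nDear Diary,\nI last saw her on Nov 3, 2020, at the station.\n' > diary

# Loose pattern: also splits at the date inside the second entry's text.
csplit -s -z -f loose diary '/, 202/' '{*}'

# Anchored pattern: month abbreviation, day, comma and a four-digit year,
# alone on the line.
csplit -s -z -f strict diary '/^[A-Z][a-z][a-z] [0-9][0-9]*, 20[0-9][0-9]$/' '{*}'

ls loose* | wc -l     # 3 pieces
ls strict* | wc -l    # 2 pieces
```

The anchored form uses only basic regular expression features (bracket expressions, *, ^ and $), which csplit is expected to accept.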

If you want to add a file extension to your output files, you can specify it as in the command below, which uses “.txt” as the file extension. The %02d specifies that two digits should be used. This is the default, but if you want four digits, just change the 2 to a 4.

$ csplit -z -b "%02d.txt" diary '/, 20/' '{*}'
10
166
136
149
$ ls -ltr | tail -4
-rw-r--r--.  1 shs  shs        149 Jan  1 15:53 xx03.txt
-rw-r--r--.  1 shs  shs        136 Jan  1 15:53 xx02.txt
-rw-r--r--.  1 shs  shs        166 Jan  1 15:53 xx01.txt
-rw-r--r--.  1 shs  shs         10 Jan  1 15:53 xx00.txt
$ cat xx01.txt
Dec 11, 2021
Dear Diary,

Today was a difficult day. I dragged a dozen bags of trash to the transfer
station and came home to find a dozen more waiting on my porch.
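
The prefix and suffix options can be combined. The sketch below, again using a made-up sample file and a hypothetical "entry-" prefix, produces four-digit names with a .txt extension, and then shows the -n (--digits) option as an alternative when only the number of digits needs to change. It assumes GNU csplit.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical two-entry sample file.
printf 'Dear Diary,\nFirst entry.\n\nDear Diary,\nSecond entry.\n' > diary

# Prefix plus suffix format: four-digit counter and a .txt extension.
csplit -s -z -f entry- -b '%04d.txt' diary '/^Dear/' '{*}'
ls entry-*            # entry-0000.txt and entry-0001.txt

# If only the digit count needs to change, -n is enough.
csplit -s -z -n 3 diary '/^Dear/' '{*}'
ls xx*                # xx000 and xx001
```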

Wrap-up

The csplit command makes it fairly easy to split files into chunks based on meaningful breaks, and it includes enough options to help you get exactly the result you want.


Copyright © 2022 IDG Communications, Inc.
