sandra_henrystocker
Unix Dweeb

Splitting files on Linux by context

How-To
Jan 03, 2022, 3 mins
Linux, Unix

The csplit command is unusual in that it allows you to split text files into pieces based on their content. You specify a contextual string (a regular expression), and csplit uses it as a delimiter to identify the chunks to be saved as separate files.

As an example, if you wanted to separate diary entries into a series of files each with a single entry, you might do something like this.

$ csplit -z diary '/^Dear/' '{*}'
153
123
136

In this example, “diary” is the name of the file to be split. The command is looking for lines that begin with the word “Dear” as in “Dear Diary” to determine where each chunk begins. The '{*}' argument tells csplit to repeat that pattern as many times as it can, so every matching line starts a new file. The -z option tells csplit to not bother saving files that would be empty.

The three numbers that csplit prints are the sizes, in bytes, of the three separate files that were created. You can list those files by using a command like the following that limits the output of the ls command to the most recent files.

$ ls -ltr | tail -3
-rw-r--r--.  1 shs  shs        136 Jan  1 15:02 xx02
-rw-r--r--.  1 shs  shs        123 Jan  1 15:02 xx01
-rw-r--r--.  1 shs  shs        153 Jan  1 15:02 xx00
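
To check the relationship between csplit's output and the file sizes yourself, here is a minimal sketch. It assumes GNU csplit, and the two-entry sample file it creates in a scratch directory is hypothetical, not the diary used in this post.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Hypothetical two-entry sample file (not the diary from the article)
printf 'Dear Diary,\nfirst entry\nDear Diary,\nsecond entry\n' > diary

# csplit writes one byte count per output file to standard output
reported=$(csplit -z diary '/^Dear/' '{*}')

# Measure each piece independently with wc -c (tr strips any padding)
measured=$(for f in xx*; do wc -c < "$f" | tr -d ' '; done)

echo "csplit reported: $(echo $reported)"
echo "wc -c measured:  $(echo $measured)"
```

If the two lines show the same numbers, the values csplit printed are the byte counts of the pieces.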

You could also use the full phrase for the separator line:

$ csplit -z diary '/^Dear Diary,/' '{*}'

In either case, the xx00 file will look like this:

$ cat xx00
Dear Diary,

Today was a difficult day. I dragged a dozen bags of trash to the transfer
station and came home to find a dozen more waiting on my porch.
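
The two patterns give the same result only when no other line starts with “Dear”. The sketch below is one way to see the difference. It uses a hypothetical sample file, assumes GNU csplit, and relies on two options not used so far: -s to suppress the size output and -f (described below) to keep the two sets of pieces apart.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Hypothetical sample: the second entry has a body line that also starts with "Dear"
printf '%s\n' \
  'Dear Diary,' \
  'Today was a difficult day.' \
  'Dear Diary,' \
  'Dear me, what a week.' \
  'Dear Diary,' \
  'Nothing to report.' > diary

# Short pattern: also splits at the "Dear me" line
csplit -s -z -f short diary '/^Dear/' '{*}'
# Full phrase: splits only at the real entry headers
csplit -s -z -f full diary '/^Dear Diary,/' '{*}'

short_count=$(ls short* | wc -l)
full_count=$(ls full* | wc -l)
echo "short pattern pieces: $short_count"
echo "full phrase pieces:   $full_count"
```

With this sample, the short pattern should produce one more piece than the full phrase, because “Dear me” is treated as the start of an entry.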

The xx00, xx01, xx02, etc. file naming is the default. If you split an additional file, these output files will be overwritten by the newer files unless you use the -f or --prefix option to replace “xx” with something more meaningful, as in the example below in which the word “diary” is used to name the files.

$ csplit -zf diary diary '/^Dear/' '{*}'
153
123
136
$ ls -ltr | tail -3
-rw-r--r--.  1 shs  shs        123 Jan  1 15:11 diary01
-rw-r--r--.  1 shs  shs        153 Jan  1 15:11 diary00
-rw-r--r--.  1 shs  shs        136 Jan  1 15:11 diary02
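
As a rough illustration of why the prefix matters, the following sketch splits two files in the same directory without one overwriting the other. The file names and contents are made up for the example, and GNU csplit is assumed (-s only suppresses the size output).

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Hypothetical inputs: two small files whose entries both start with "Dear"
printf 'Dear Diary,\nentry one\nDear Diary,\nentry two\n' > diary
printf 'Dear Journal,\nnote one\nDear Journal,\nnote two\n' > journal

# A distinct prefix per input keeps the second split from overwriting the first
csplit -s -z -f diary diary '/^Dear/' '{*}'
csplit -s -z --prefix=journal journal '/^Dear/' '{*}'

ls
```

The short (-f) and long (--prefix) forms are interchangeable; both are shown here only for comparison.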

If the file you want to split is separated by dates, you might try a command like this that looks for a portion of the date field:

$ csplit -zf diary diary '/, 202/' '{*}'
166
136
149
$ cat diary00
Dec 11, 2021
Dear Diary,

Today was a difficult day. I dragged a dozen bags of trash to the transfer
station and came home to find a dozen more waiting on my porch.
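
A loose pattern like ', 202' will also match a date mentioned inside an entry. One possible refinement, sketched below with a hypothetical sample file and assuming GNU csplit, is to anchor the pattern so it matches only lines that consist of nothing but a date in the “Mon D, YYYY” form.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Hypothetical sample: the first entry mentions a date inside its body text
printf '%s\n' \
  'Dec 11, 2021' 'Dear Diary,' 'The permit expires Jan 5, 2022, so I hurried.' \
  'Dec 12, 2021' 'Dear Diary,' 'A quieter day.' > diary

# Loose pattern: also splits at the date in the body text
csplit -s -z -f loose diary '/, 202/' '{*}'
# Anchored pattern: only a line made up entirely of "Mon D, YYYY"
csplit -s -z -f entry diary '/^[A-Z][a-z][a-z] [0-9][0-9]*, 20[0-9][0-9]$/' '{*}'

loose_count=$(ls loose* | wc -l)
entry_count=$(ls entry* | wc -l)
echo "loose pattern pieces:    $loose_count"
echo "anchored pattern pieces: $entry_count"
```

The anchored expression sticks to basic regular expression features (brackets, *, ^ and $). Whether you need it depends on how your own file is laid out.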

If you want to add a file extension to your output files, you can specify it with the -b (--suffix-format) option, as in the command shown below that uses “.txt” as the file extension. The %02d specifies that two digits are to be used. This is the default, but if you want 4 digits, just change the 2 to a 4.

$ csplit -z -b "%02d.txt" diary '/, 20/' '{*}'
10
166
136
149
$ ls -ltr | tail -4
-rw-r--r--.  1 shs  shs        149 Jan  1 15:53 xx03.txt
-rw-r--r--.  1 shs  shs        136 Jan  1 15:53 xx02.txt
-rw-r--r--.  1 shs  shs        166 Jan  1 15:53 xx01.txt
-rw-r--r--.  1 shs  shs         10 Jan  1 15:53 xx00.txt
$ cat xx01.txt
Dec 11, 2021
Dear Diary,

Today was a difficult day. I dragged a dozen bags of trash to the transfer
station and came home to find a dozen more waiting on my porch.
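
The prefix and suffix options can be combined. The sketch below assumes GNU csplit and a hypothetical two-entry sample file, and uses four digits plus a “.txt” extension.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Hypothetical sample with a date line at the start of each entry
printf 'Dec 11, 2021\nfirst entry\nDec 12, 2021\nsecond entry\n' > diary

# -f sets the prefix, -b sets a printf-style suffix: four digits plus ".txt"
csplit -s -z -f diary -b '%04d.txt' diary '/, 20/' '{*}'

ls diary*.txt
```

With this input, the listing should show diary0000.txt and diary0001.txt.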

Wrap-Up

The csplit command can make splitting files into pieces based on meaningful breaks fairly easy and includes enough options to help you get exactly the result you want.

Sandra Henry-Stocker has been administering Unix systems for more than 30 years. She describes herself as "USL" (Unix as a second language) but remembers enough English to write books and buy groceries. She lives in the mountains in Virginia where, when not working with or writing about Unix, she's chasing the bears away from her bird feeders.
