Split and Join Large Files

This post explains some useful combinations of commands that you can use on Linux (or sometimes also in other operating systems) to split large files into smaller pieces, and then how to rebuild the original file from those pieces.

Generic Approach

Let’s say you want to split a large file called input into chunks no bigger than 500 MiB each (split reads the M suffix as mebibytes; write 500MB if you want decimal megabytes). Here’s how to do it:

$ split --bytes 500M --numeric-suffixes \
    --suffix-length=3 input pattern.

You’ll end up with a few files called pattern.000, pattern.001, etc.

The command below reconstructs the file from the pieces:

$ cat pattern.* > output

Et voilà.
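If you want to be sure the rebuilt file really matches the original, a quick round trip can be sketched as follows. This is only an illustration under a few assumptions: GNU split is installed, the scratch directory and the 3 MiB random sample file are made up for the demo, and the chunk size is shrunk to 1 MiB so that it runs quickly.

```shell
#!/bin/sh
# Sketch: split a file, rejoin it, and confirm the copy is byte-identical.
set -eu

# Hypothetical scratch directory, left in place so you can inspect the pieces.
workdir=$(mktemp -d)
cd "$workdir"

# A 3 MiB sample file standing in for the large "input" file.
head -c 3145728 /dev/urandom > input

# Split into 1 MiB chunks named pattern.000, pattern.001, ...
split --bytes 1M --numeric-suffixes --suffix-length=3 input pattern.

# The fixed-width numeric suffixes make the glob expand in chunk order.
cat pattern.* > output

# cmp -s exits with status 0 only when both files are identical.
if cmp -s input output; then
    result="identical"
else
    result="different"
fi
echo "$result"
```

For files that travel over a network, comparing the output of sha256sum on both ends serves the same purpose.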

Compressed Archive Files

Using ZIP

Supposedly, the zip command should be able to directly split archives into smaller chunks. This is how the theory goes:

First, create a set of ZIP files of a certain size, instead of one big file:

$ zip -r -s 5m input.zip example-dir/

Second, to decompress, “join” all the pieces into one big file, and then unzip it:

$ zip -FF input.zip --out output.zip
$ unzip output.zip

The truth is that this might or might not work depending on your version of zip and your operating system (the behavior is different in zip version 2 or 3!). There are lots of questions about this on Stack Overflow and its sibling websites, and there is a wide variety of answers.

In my case (Fedora 40) the commands above simply did not work: the final output.zip file was always corrupt and could not be used to retrieve the original files.

Using 7z

So, instead, I recommend you use p7zip, which generates archives in the very efficient 7-zip format, with software available for Windows, Linux, and macOS:

$ sudo dnf install p7zip p7zip-doc p7zip-plugins
$ 7z -v10m a input.7z example-dir
$ 7z x input.7z.001

As the last command above shows, just ask 7z to extract the first chunk, and 7z will automatically figure out the rest for you.

Having said that, here’s an excerpt from the 7z man page (man 7z), with caveats you should be aware of when using p7zip:

Backup and limitations
DO NOT USE the 7-zip format for backup purpose on Linux/Unix because :
- 7-zip does not store the owner/group of the file.
On Linux/Unix, in order to backup directories you must use tar :
- to backup a directory : tar cf - directory | 7z a -si directory.tar.7z
- to restore your backup : 7z x -so directory.tar.7z | tar xf -
If you want to send files and directories (not the owner of file) to others Unix/MacOS/Windows users, you can use the 7-zip format.
example : 7z a directory.7z directory
Do not use -r because this flag does not do what you think.
Do not use directory/* because of .* files (example : directory/* does not match directory/.profile)

You’ve been warned.
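The tar advice from the man page also combines with split from the Generic Approach section. The sketch below leaves 7z out entirely and pipes an uncompressed tar stream straight into split, so it is a substitute technique, not what the man page itself prescribes. It assumes GNU tar and GNU split; the scratch directory, the sample files, the backup.tar. prefix, and the 512 KiB chunk size are all invented for the demo.

```shell
#!/bin/sh
# Sketch: chunked tar backup of a directory, then restore and compare.
set -eu

# Hypothetical scratch directory with a small sample tree.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p example-dir
printf 'visible\n' > example-dir/file.txt
printf 'hidden\n' > example-dir/.profile
head -c 2097152 /dev/urandom > example-dir/blob.bin

# Archive the directory itself (so dotfiles are included) and cut the
# tar stream into 512 KiB pieces; "-" tells split to read from stdin.
tar cf - example-dir \
    | split --bytes 512K --numeric-suffixes --suffix-length=3 - backup.tar.

# Restore: concatenate the pieces in order and unpack into another directory.
mkdir restored
cat backup.tar.* | tar xf - -C restored

# diff -r exits with status 0 only if both trees have the same content.
if diff -r example-dir restored/example-dir > /dev/null; then
    status="restored"
else
    status="mismatch"
fi
echo "$status"
```

tar records the owner and group of each file in the archive; whether they are applied on extraction depends on who runs tar (typically only root can restore foreign ownership).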

PDF Files

Using PDFtk

Install the PDFtk package (available in various distros, as well as Snap and Flatpak) to manipulate PDF files on the command line.

Use the pdftk command below to extract everything from page 4 to the end of a hypothetically large PDF file:

$ pdftk input.pdf cat 4-end output output.pdf

There is also a pdftk burst command that splits the input into individual pages whose filenames follow a pattern:

$ pdftk document.pdf burst output page_%02d.pdf

The same pdftk cat command can also be used to join several PDF files into one.

$ pdftk page_01.pdf page_02.pdf cat output output.pdf

Using Poppler

Another option is to install the Poppler utilities on your system (sudo dnf install poppler-utils on Fedora 40) and use the pdfseparate and pdfunite commands to (you guessed it) split a PDF into single pages and merge several PDF files into one. Their usage is straightforward:

$ pdfseparate -f 23 -l 25 original.pdf page-%d.pdf
$ pdfunite page-23.pdf page-24.pdf page-25.pdf output.pdf
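One thing to watch when merging many pages: the %d pattern does not pad page numbers with zeros, so a plain shell glob would put page-10.pdf before page-2.pdf. The sketch below shows one way to get the numeric order back, assuming GNU sort (for its -V version sort) and file names without spaces. The empty placeholder files and the scratch directory are hypothetical and only stand in for real pdfseparate output.

```shell
#!/bin/sh
# Sketch: list page files in numeric order before handing them to pdfunite.
set -eu
LC_ALL=C
export LC_ALL

# Hypothetical scratch directory with empty stand-ins for extracted pages.
workdir=$(mktemp -d)
cd "$workdir"
for n in 1 2 3 9 10 11; do
    : > "page-$n.pdf"
done

# Plain glob expansion sorts names as text, not by page number.
glob_order=$(echo page-*.pdf)

# sort -V compares the embedded numbers, giving the real page order.
numeric_order=$(printf '%s\n' page-*.pdf | sort -V | tr '\n' ' ')
numeric_order=${numeric_order% }   # drop the trailing space

echo "glob:    $glob_order"
echo "numeric: $numeric_order"

# With real pages, the merge could then be run as (unquoted on purpose,
# so that each file name becomes a separate argument):
#   pdfunite $numeric_order output.pdf
```

The pdftk burst example above avoids the problem altogether, because its %02d pattern zero-pads the page numbers.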

Bonus: Splitting PDF Pages into Images

Here are two commands you can use to split PDF files into individual images, one per page. The first one uses Ghostscript and generates PNG files (create the images directory first, since Ghostscript will not create it for you):

$ gs -dBATCH -dNOPAUSE -sDEVICE=png16m -dUseCropBox \
    -sOutputICCProfile=default_rgb.icc -r300 \
    -sOutputFile=images/slides-%03d.png slides.pdf

The second one uses pdftoppm, another tool from the aforementioned Poppler package, and generates JPG files of relatively decent quality:

$ pdftoppm -jpeg -r 300 input.pdf output-prefix

The result is a series of files output-prefix-N.jpg, one for each page of the PDF.

Bonus 2: OCR your PDFs

Nothing to do with splitting and merging, but useful nevertheless: the OCRmyPDF project, available as a container, adds a text layer to scanned PDF files, allowing them to be searched.

$ podman run --rm --volume "$(pwd):/data" \
    docker.io/jbarlow83/ocrmypdf:latest \
    -l deu+eng --deskew --pdf-renderer hocr /data/input.pdf \
    /data/output_OCR.pdf

Et voilà. Your PDF is ready for Zotero now. You’re welcome.