Shell script for inspecting and cleaning up archives that were restored with Disk Recover.
Reindl David (IT-PTR-CEN2-SL10) 94bfc77c11 intial state
Co-authored-by: Copilot <copilot@github.com>
2026-05-02 17:06:52 +02:00

classify-recovered-archives

classify-recovered-archives.zsh scans a directory for recovered .zip and .gz files, tries to infer the original document type from archive internals, renames the files with a more useful suffix, and moves them into labeled subdirectories.

The current iWork heuristics are intentionally conservative: modern iWork bundles are promoted to .pages when the package exposes a Pages-specific multipage marker and to .numbers when the ZIP layout exposes Index/Tables/. Other modern iWork bundles stay in Apple-iWork/ as .zip, and the script also creates .pages, .numbers, and .key probe copies so they can be opened directly for manual confirmation.
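
The probe-copy step described above can be sketched roughly as follows. This is a minimal illustration, not the code from classify-recovered-archives.zsh: the function name make_probe_copies is hypothetical, and the choice to leave an existing probe copy untouched is an assumption.

```sh
# Hypothetical helper, not taken from classify-recovered-archives.zsh.
# Creates sibling copies of an ambiguous iWork ZIP, one per candidate suffix.
make_probe_copies() {
  src=$1
  stem=${src%.zip}                 # strip the .zip suffix, keep the directory
  for ext in pages numbers key; do
    probe=$stem.$ext
    # Assumption: never clobber a probe copy that already exists.
    [ -e "$probe" ] || cp -p -- "$src" "$probe"
  done
}
```

Calling make_probe_copies Apple-iWork/f100.zip would leave f100.zip in place and add f100.pages, f100.numbers, and f100.key next to it, so each one can be opened in Finder or the matching application.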

What it detects

  • Apple Pages packages from Metadata/DocumentProperties.plist plus Pages/ markers.
  • Modern Apple iWork packages from Index/Document.iwa and Metadata/Properties.plist, with .pages assigned when the package carries a multipage marker and .numbers assigned when the ZIP contains Index/Tables/.
  • Damaged Apple iWork packages from embedded path markers, including likely Numbers files when Index/Tables/ markers survive.
  • Microsoft Word, Excel, and PowerPoint OOXML packages from [Content_Types].xml plus word/, xl/, or ppt/ markers.
  • OpenDocument text, spreadsheet, and presentation packages from the mimetype entry.
  • EPUB packages from the mimetype entry.
  • APK and JAR archives from clear Java and Android markers.
  • Gzip-wrapped TAR archives.
  • Gzip-wrapped single files such as PDF, text, HTML, XML, JSON, RTF, common image formats, and some legacy Microsoft Office binaries.
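
As a rough illustration of marker-based detection, the sketch below maps a list of ZIP entry names to a folder label. It covers only the OOXML and modern iWork rules that this README states explicitly. The function name classify_entries is hypothetical, the Pages multipage marker is left out because its exact form is not documented here, and formats that require reading the mimetype entry (OpenDocument, EPUB) are omitted.

```sh
# Hypothetical helper: reads ZIP entry names on stdin (one per line)
# and prints a folder label. Only a subset of the documented rules is covered.
classify_entries() {
  entries=$(cat)
  # True when at least one entry name matches the given basic regex.
  has() { printf '%s\n' "$entries" | grep -q -- "$1"; }

  if has '^\[Content_Types]\.xml$'; then
    if   has '^word/'; then echo Word
    elif has '^xl/';   then echo Excel
    elif has '^ppt/';  then echo PowerPoint
    else                    echo Unknown
    fi
  elif has '^Index/Document\.iwa$' && has '^Metadata/Properties\.plist$'; then
    # Index/Tables/ promotes a modern iWork bundle to Numbers.
    if has '^Index/Tables/'; then echo Numbers
    else                          echo Apple-iWork
    fi
  else
    echo Unknown
  fi
}
```

One way to feed it is a names-only listing, for example unzip -Z1 recovered.zip | classify_entries, which lists entries without extracting them. Whether the real script uses unzip for this is an assumption.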

Unknown or weakly identified files are moved into Unknown/ and keep their original basename. Damaged ZIP files with unreadable central directories are moved into Damaged-Zip/, Damaged-Apple-iWork/, Damaged-Pages/, or Damaged-Numbers/ based on any surviving embedded path markers, and they keep the .zip suffix.

Naming rules

  • For ZIP-based document formats such as .pages or .docx, the script renames the file to the detected document suffix.
  • For modern iWork packages where the subtype is still ambiguous, the script keeps the .zip suffix and files them under Apple-iWork.
  • For each ambiguous Apple-iWork file, the script also creates sibling probe copies with .pages, .numbers, and .key suffixes so Finder or the target application can be used as the final discriminator.
  • A dedicated Damaged-Numbers bucket is used only when damaged ZIPs still contain Index/Tables/ markers. Damaged iWork archives without that evidence stay under Damaged-Apple-iWork.
  • For gzip-wrapped payloads, the script keeps the compression visible in the suffix, for example .pdf.gz, .doc.gz, or .tar.gz.
  • When metadata exposes a likely title or original name, the script uses that basename for classified files.
  • When classification is not confident enough, the script keeps the original basename and moves the file to Unknown/.
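
The gzip naming rule above can be sketched as a small helper. The name gz_target_name is hypothetical, and the sketch assumes the recovered file carries a plain .gz suffix and that the inner type has already been detected.

```sh
# Hypothetical helper: build the target name for a gzip-wrapped payload
# so the compression stays visible, e.g. f0042.gz + pdf -> f0042.pdf.gz.
gz_target_name() {
  base=${1##*/}      # drop any leading directory
  base=${base%.gz}   # drop the recovered .gz suffix
  printf '%s.%s.gz\n' "$base" "$2"
}
```

For example, gz_target_name /recovered/f0042.gz tar yields f0042.tar.gz, matching the .tar.gz form mentioned in the list.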

Usage

./classify-recovered-archives.zsh --dry-run /path/to/recovered-files
./classify-recovered-archives.zsh --verbose /path/to/recovered-files
./classify-recovered-archives.zsh --overwrite /path/to/recovered-files
./classify-recovered-archives.zsh --salvage-damaged /path/to/recovered-files

Options:

  • --dry-run prints planned actions without modifying files.
  • --verbose prints detection reasons and confidence levels.
  • --overwrite allows replacing an existing destination file instead of appending -2, -3, and so on.
  • --salvage-damaged runs salvage-damaged-zips.zsh after classification on every damaged-ZIP output folder under the scan root and writes the results into Salvaged/.
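
The default collision handling (appending -2, -3, and so on when --overwrite is not given) could look roughly like the sketch below. The function name unique_dest is hypothetical, and placing the counter before the suffix (including compound suffixes such as .pdf.gz) is an assumption; this README does not say where the script inserts it.

```sh
# Hypothetical helper: print a destination path that does not exist yet.
# Assumption: the counter goes before the suffix (report-2.pdf.gz).
unique_dest() {
  dir=$1
  name=$2
  case $name in
    *.*.gz) rest=${name%.gz}; ext=.${rest##*.}.gz ;;  # keep .pdf.gz together
    *.*)    ext=.${name##*.} ;;
    *)      ext= ;;
  esac
  stem=${name%"$ext"}
  candidate=$dir/$name
  n=2
  while [ -e "$candidate" ]; do
    candidate=$dir/$stem-$n$ext
    n=$((n + 1))
  done
  printf '%s\n' "$candidate"
}
```

With PDF/report.pdf.gz already present, unique_dest PDF report.pdf.gz prints PDF/report-2.pdf.gz. An --overwrite mode would simply skip this lookup and use the first candidate.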

Output folders

The script creates subdirectories under the scanned directory as needed. Current labels include:

  • Pages
  • Numbers
  • Apple-iWork
  • Damaged-Apple-iWork
  • Damaged-Numbers
  • Damaged-Pages
  • Damaged-Zip
  • Word
  • Excel
  • PowerPoint
  • OpenDocument-Text
  • OpenDocument-Sheet
  • OpenDocument-Presentation
  • EPUB
  • PDF
  • Text
  • HTML
  • XML
  • RichText
  • Image
  • JSON
  • Jar
  • APK
  • Tar
  • Unknown

Notes

  • The script uses only list or stream inspection for supported archive types. It does not fully extract ZIP files.
  • xmllint is optional. When unavailable, the script falls back to lightweight XML parsing.
  • Because recovered files may be damaged, some archives will remain in Unknown/ even if they look close to a supported format.
  • For damaged ZIP files, the classifier falls back to embedded path names found by strings, which is useful for iWork packages but should still be treated as heuristic evidence.
  • Current limitation: healthy modern iWork bundles that lack a Pages or Numbers marker cannot yet be reliably identified as Keynote or any other iWork variant. In practice, those files stay in Apple-iWork/, with probe copies created for manual opening.
  • Re-running the script skips files already placed in one of its managed output folders.
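
The strings-based fallback for damaged ZIPs can be illustrated with the sketch below. It is a simplified guess at the decision order, not the script's actual logic: the function name damaged_bucket is hypothetical, and the Damaged-Pages bucket is omitted because the markers that select it are not spelled out in this README.

```sh
# Hypothetical helper: reads printable strings recovered from a damaged ZIP
# on stdin and prints the bucket suggested by surviving path markers.
damaged_bucket() {
  markers=$(cat)
  case $markers in
    *Index/Tables/*)
      echo Damaged-Numbers ;;
    *Index/Document.iwa*|*Metadata/Properties.plist*)
      echo Damaged-Apple-iWork ;;
    *)
      echo Damaged-Zip ;;
  esac
}
```

A typical invocation would be strings damaged.zip | damaged_bucket. Because the markers are plain substrings found anywhere in the file, a match is evidence, not proof.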

Salvage damaged ZIPs

Run the salvage workflow against a directory of damaged ZIPs to rebuild partial archives with zip -FF, extract whatever payload is still readable, and write a report:

./salvage-damaged-zips.zsh subset/Damaged-Zip

By default this writes into a sibling directory such as subset/Damaged-Zip.salvaged/ with:

  • repaired partial ZIPs under repaired/
  • extracted payloads grouped by likely family under extracted/
  • salvage-report.md and salvage-report.tsv
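
The core of this workflow can be sketched as follows. Both function names are hypothetical, and the sketch covers only the output location and the zip -FF repair step; extraction, family grouping, and report writing are left out.

```sh
# Hypothetical helpers; names and structure are assumptions.

# Print the default output root for an input directory of damaged ZIPs,
# e.g. subset/Damaged-Zip -> subset/Damaged-Zip.salvaged
salvage_root() {
  printf '%s.salvaged\n' "${1%/}"
}

# Try to rebuild one damaged ZIP into <root>/repaired/ with zip -FF.
repair_one() {
  src=$1
  root=$2
  mkdir -p "$root/repaired" "$root/extracted"
  # zip -FF may ask whether the archive is single-disk; answer "y".
  printf 'y\n' | zip -FF "$src" --out "$root/repaired/${src##*/}"
}
```

For instance, repair_one subset/Damaged-Zip/f0007.zip "$(salvage_root subset/Damaged-Zip)" would write the rebuilt archive to subset/Damaged-Zip.salvaged/repaired/f0007.zip if zip can recover anything.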

If you want the main classifier to trigger this automatically, run:

./classify-recovered-archives.zsh --salvage-damaged /path/to/recovered-files

That writes salvage outputs into family-specific subdirectories under Salvaged/ inside the scan root, for example Salvaged/Damaged-Zip/ or Salvaged/Damaged-Numbers/.

Smoke test

Run the bundled smoke test to generate a temporary fixture set and verify the current detection pipeline:

./tests/smoke-test.zsh

The smoke test builds representative Pages, Numbers, ambiguous iWork, Word, unknown ZIP, PDF-in-gzip, and tar-in-gzip samples, then runs the classifier in dry-run mode.
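
For the gzip samples, a stream-based check in the same spirit might look like the sketch below. It is not the smoke test's own code: the function name sniff_gz_payload is hypothetical, and only three payload types are covered (PDF and RTF by their leading bytes, TAR by the ustar magic at offset 257).

```sh
# Hypothetical helper: print a coarse label for a gzip file by peeking at
# the decompressed stream, without writing the payload to disk.
sniff_gz_payload() {
  magic=$(gzip -dc < "$1" 2>/dev/null | head -c 5 | tr -d '\000')
  case $magic in
    %PDF-)   echo PDF; return ;;
    '{\rtf') echo RichText; return ;;
  esac
  # TAR headers carry the string "ustar" at byte offset 257.
  ustar=$(gzip -dc < "$1" 2>/dev/null | dd bs=1 skip=257 count=5 2>/dev/null | tr -d '\000')
  if [ "$ustar" = ustar ]; then
    echo Tar
  else
    echo Unknown
  fi
}
```

A fixture can be produced by hand with printf '%s\n' '%PDF-1.4' | gzip -c > sample.gz, after which sniff_gz_payload sample.gz should report PDF.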

Recent sample runs

  • subset/ has been processed and includes salvage output for damaged ZIP samples.
  • subset2/ currently resolves to 4 files in Apple-iWork/, 1 file in Pages/, 2 files in Damaged-Apple-iWork/, 1 file in Unknown/, and a salvage report under Salvaged/Damaged-Apple-iWork/.
  • In subset2/, the damaged files were repairable enough to recover visible assets. Neither the healthy nor the damaged samples exposed Index/Tables/, so none were promoted to Numbers or Damaged-Numbers. The ambiguous healthy iWork files now also get .pages, .numbers, and .key probe copies.