classify-recovered-archives
classify-recovered-archives.zsh scans a directory for recovered .zip and .gz files, tries to infer the original document type from archive internals, renames the files with a more useful suffix, and moves them into labeled subdirectories.
The current iWork heuristics are intentionally conservative: modern iWork bundles are promoted to .pages when the package exposes a Pages-specific multipage marker and to .numbers when the ZIP layout exposes Index/Tables/. Other modern iWork bundles stay in Apple-iWork/ as .zip, and the script also creates .pages, .numbers, and .key probe copies so they can be opened directly for manual confirmation.
What it detects
- Apple Pages packages from Metadata/DocumentProperties.plist plus Pages/ markers.
- Modern Apple iWork packages from Index/Document.iwa and Metadata/Properties.plist, with .pages assigned when the package carries a multipage marker and .numbers assigned when the ZIP contains Index/Tables/.
- Damaged Apple iWork packages from embedded path markers, including likely Numbers files when Index/Tables/ markers survive.
- Microsoft Word, Excel, and PowerPoint OOXML packages from [Content_Types].xml plus word/, xl/, or ppt/ markers.
- OpenDocument text, spreadsheet, and presentation packages from the mimetype entry.
- EPUB packages from the mimetype entry.
- APK and JAR archives from clear Java and Android markers.
- Gzip-wrapped TAR archives.
- Gzip-wrapped single files such as PDF, text, HTML, XML, JSON, RTF, common image formats, and some legacy Microsoft Office binaries.
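The listing-based checks above can be sketched as a small filter over ZIP entry names. This is a minimal illustration, not the script's actual code: the function name classify_zip_listing is hypothetical, only the iWork and OOXML rules are covered, and the Pages promotion is left out because the specific multipage marker is not named here.

```shell
# Hypothetical helper: read ZIP entry names on stdin (one per line) and print
# a folder label. The rules are a simplified reading of the list above.
classify_zip_listing() {
  entries=$(cat)
  # has PATTERN: true if any entry name matches the basic regular expression.
  has() { printf '%s\n' "$entries" | grep -q -e "$1"; }

  if has '^Index/Document\.iwa$' && has '^Metadata/Properties\.plist$'; then
    # Modern iWork bundle: promote to Numbers only when Index/Tables/ exists.
    if has '^Index/Tables/'; then echo Numbers; else echo Apple-iWork; fi
  elif has '^\[Content_Types]\.xml$'; then
    # OOXML package: the top-level directory tells the family apart.
    if has '^word/'; then echo Word
    elif has '^xl/'; then echo Excel
    elif has '^ppt/'; then echo PowerPoint
    else echo Unknown
    fi
  else
    echo Unknown
  fi
}
```

One way to feed it is `unzip -Z1 recovered.zip | classify_zip_listing`, which lists entry names without extracting anything.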
Unknown or weakly identified files are moved into Unknown/ and keep their original basename.
Damaged ZIP files with unreadable central directories are moved into Damaged-Zip/, Damaged-Apple-iWork/, Damaged-Pages/, or Damaged-Numbers/ based on any surviving embedded path markers, and they keep the .zip suffix.
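As a rough sketch of how surviving path markers could select a bucket, the filter below reads candidate strings and prints a folder name. The function name damaged_bucket_from_markers is hypothetical, the marker set is an assumption based on the paths mentioned in this README, and Damaged-Pages is omitted because its markers are not spelled out here.

```shell
# Hypothetical helper: read candidate strings on stdin (for example the output
# of `strings damaged.zip`) and print the damaged-ZIP bucket to file it under.
damaged_bucket_from_markers() {
  markers=$(cat)
  # seen TEXT: true if any input line contains TEXT as a fixed string.
  seen() { printf '%s\n' "$markers" | grep -q -F -e "$1"; }

  if seen 'Index/Tables/'; then
    echo Damaged-Numbers
  elif seen 'Index/Document.iwa' || seen 'Metadata/Properties.plist'; then
    echo Damaged-Apple-iWork
  else
    echo Damaged-Zip
  fi
}
```

A typical call would be `strings damaged.zip | damaged_bucket_from_markers`. Because the input is whatever text survives in the file, the result is heuristic evidence only.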
Naming rules
- For ZIP-based document formats such as .pages or .docx, the script renames the file to the detected document suffix.
- For modern iWork packages where the subtype is still ambiguous, the script keeps the .zip suffix and files them under Apple-iWork.
- For each ambiguous Apple-iWork file, the script also creates sibling probe copies with .pages, .numbers, and .key suffixes so Finder or the target application can be used as the final discriminator.
- A dedicated Damaged-Numbers bucket is used only when damaged ZIPs still contain Index/Tables/ markers. Damaged iWork archives without that evidence stay under Damaged-Apple-iWork.
- For gzip-wrapped payloads, the script keeps the compression visible in the suffix, for example .pdf.gz, .doc.gz, or .tar.gz.
- When metadata exposes a likely title or original name, the script uses that basename for classified files.
- When classification is not confident enough, the script keeps the original basename and moves the file to Unknown/.
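The gzip suffix rule can be illustrated with a stream check that inspects only the first bytes of the decompressed payload. This is a sketch under assumptions: gz_payload_suffix is a hypothetical name, and it recognizes just two signatures (the %PDF- header and the ustar magic at byte offset 257 of a TAR header), which is far fewer than the formats listed earlier.

```shell
# Hypothetical helper: print the suffix to use for a gzip-wrapped file, based
# on the first 262 bytes of its decompressed stream. Prints "unknown" when no
# signature matches or the file cannot be decompressed.
gz_payload_suffix() {
  # Hex-encode the leading bytes; od -v keeps repeated lines so offsets hold.
  hex=$(gzip -dc -- "$1" 2>/dev/null | head -c 262 | od -An -v -t x1 | tr -dc '0-9a-f')

  case $hex in
    255044462d*) echo pdf.gz; return 0 ;;   # "%PDF-" at offset 0
  esac

  # "ustar" sits at byte offset 257, i.e. hex characters 515 to 524.
  if [ "$(printf '%s\n' "$hex" | cut -c515-524)" = 7573746172 ]; then
    echo tar.gz
  else
    echo unknown
  fi
}
```

For example, `gz_payload_suffix recovered-0042.gz` would print pdf.gz for a gzip-wrapped PDF, so the renamed file keeps the compression visible.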
Usage
./classify-recovered-archives.zsh --dry-run /path/to/recovered-files
./classify-recovered-archives.zsh --verbose /path/to/recovered-files
./classify-recovered-archives.zsh --overwrite /path/to/recovered-files
./classify-recovered-archives.zsh --salvage-damaged /path/to/recovered-files
Options:
- --dry-run prints planned actions without modifying files.
- --verbose prints detection reasons and confidence levels.
- --overwrite allows replacing an existing destination file instead of appending -2, -3, and so on.
- --salvage-damaged runs salvage-damaged-zips.zsh after classification for any damaged ZIP output folders under the scan root and writes the salvage output into Salvaged/.
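The default collision behaviour (appending -2, -3, and so on when --overwrite is not given) might look roughly like the helper below. The name unique_dest is hypothetical, and placing the counter before the first dot, so that compound suffixes such as .pdf.gz stay intact, is an assumption rather than documented behaviour.

```shell
# Hypothetical helper: print a destination path under DIR for NAME that does
# not exist yet, inserting -2, -3, ... before the first dot of the name.
unique_dest() {
  dir=$1
  name=$2
  stem=${name%%.*}        # text before the first dot
  ext=${name#"$stem"}     # everything from the first dot on (may be empty)
  candidate="$dir/$name"
  n=2
  while [ -e "$candidate" ]; do
    candidate="$dir/${stem}-${n}${ext}"
    n=$((n + 1))
  done
  printf '%s\n' "$candidate"
}
```

With an existing PDF/report.pdf.gz, `unique_dest PDF report.pdf.gz` prints PDF/report-2.pdf.gz. The helper only checks for existence and does not create or move anything.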
Output folders
The script creates subdirectories under the scanned directory as needed. Current labels include:
- Pages
- Numbers
- Apple-iWork
- Damaged-Apple-iWork
- Damaged-Numbers
- Damaged-Pages
- Damaged-Zip
- Word
- Excel
- PowerPoint
- OpenDocument-Text
- OpenDocument-Sheet
- OpenDocument-Presentation
- EPUB
- PDF
- Text
- HTML
- XML
- RichText
- Image
- JSON
- Jar
- APK
- Tar
- Unknown
Notes
- The script uses only list or stream inspection for supported archive types. It does not fully extract ZIP files.
- xmllint is optional. When unavailable, the script falls back to lightweight XML parsing.
- Because recovered files may be damaged, some archives will remain in Unknown/ even if they look close to a supported format.
- For damaged ZIP files, the classifier falls back to embedded path names found by strings, which is useful for iWork packages but should still be treated as heuristic evidence.
- Current limitation: healthy modern iWork bundles are not yet reliably split between Keynote and every remaining iWork variant when they lack a Pages or Numbers marker. In practice, those files stay in Apple-iWork/, with probe copies created for manual opening.
- Re-running the script skips files already placed in one of its managed output folders.
Salvage damaged ZIPs
Run the salvage workflow against a directory of damaged ZIPs to rebuild partial archives with zip -FF, extract whatever payload is still readable, and write a report:
./salvage-damaged-zips.zsh subset/Damaged-Zip
By default this writes into a sibling directory such as subset/Damaged-Zip.salvaged/ with:
- repaired partial ZIPs under repaired/
- extracted payloads grouped by likely family under extracted/
- salvage-report.md and salvage-report.tsv
If you want the main classifier to trigger this automatically, run:
./classify-recovered-archives.zsh --salvage-damaged /path/to/recovered-files
That writes salvage outputs into family-specific subdirectories under Salvaged/ inside the scan root, for example Salvaged/Damaged-Zip/ or Salvaged/Damaged-Numbers/.
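To make the repair-then-extract sequence concrete, the sketch below prints the commands a standalone salvage run could issue for each ZIP in a folder, without executing them. The name salvage_plan is hypothetical, and two simplifications are assumptions: payloads are extracted per archive rather than grouped by family, and report generation is not shown.

```shell
# Hypothetical helper: print (but do not run) the repair and extract commands
# for every .zip directly inside the given folder. Output goes to a sibling
# directory named <folder>.salvaged, matching the default described above.
salvage_plan() {
  src=${1%/}
  out="${src}.salvaged"
  find "$src" -maxdepth 1 -type f -name '*.zip' | sort |
  while IFS= read -r zip_path; do
    base=$(basename "$zip_path" .zip)
    # zip -FF rebuilds what it can into a new archive; the input is untouched.
    printf 'zip -FF %s --out %s\n' "$zip_path" "$out/repaired/$base.zip"
    # unzip then extracts whatever the repaired archive still holds.
    printf 'unzip -q %s -d %s\n' "$out/repaired/$base.zip" "$out/extracted/$base"
  done
}
```

Running `salvage_plan subset/Damaged-Zip` gives a quick preview of the work involved. The printed lines are not shell-quoted, so paths containing spaces would need quoting before being executed.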
Smoke test
Run the bundled smoke test to generate a temporary fixture set and verify the current detection pipeline:
./tests/smoke-test.zsh
The smoke test builds representative Pages, Numbers, ambiguous iWork, Word, unknown ZIP, PDF-in-gzip, and tar-in-gzip samples, then runs the classifier in dry-run mode.
Recent sample runs
- subset/ has been processed and includes salvage output for damaged ZIP samples.
- subset2/ currently resolves to 4 files in Apple-iWork/, 1 file in Pages/, 2 files in Damaged-Apple-iWork/, 1 file in Unknown/, and a salvage report under Salvaged/Damaged-Apple-iWork/.
- In subset2/, the damaged files were repairable enough to recover visible assets, but neither healthy nor damaged samples exposed Index/Tables/, so none were promoted to Numbers or Damaged-Numbers; the ambiguous healthy iWork files now also get .pages, .numbers, and .key probe copies.