# classify-recovered-archives

`classify-recovered-archives.zsh` scans a directory for recovered `.zip` and `.gz` files, tries to infer the original document type from archive internals, renames the files with a more useful suffix, and moves them into labeled subdirectories.

The current iWork heuristics are intentionally conservative: modern iWork bundles are promoted to `.pages` when the package exposes a Pages-specific multipage marker and to `.numbers` when the ZIP layout exposes `Index/Tables/`. Other modern iWork bundles stay in `Apple-iWork/` as `.zip`, and the script also creates `.pages`, `.numbers`, and `.key` probe copies so they can be opened directly for manual confirmation.

## What it detects

- Apple Pages packages from `Metadata/DocumentProperties.plist` plus `Pages/` markers.
- Modern Apple iWork packages from `Index/Document.iwa` and `Metadata/Properties.plist`, with `.pages` assigned when the package carries a multipage marker and `.numbers` assigned when the ZIP contains `Index/Tables/`.
- Damaged Apple iWork packages from embedded path markers, including likely Numbers files when `Index/Tables/` markers survive.
- Microsoft Word, Excel, and PowerPoint OOXML packages from `[Content_Types].xml` plus `word/`, `xl/`, or `ppt/` markers.
- OpenDocument text, spreadsheet, and presentation packages from the `mimetype` entry.
- EPUB packages from the `mimetype` entry.
- APK and JAR archives from clear Java and Android markers.
- Gzip-wrapped TAR archives.
- Gzip-wrapped single files such as PDF, text, HTML, XML, JSON, RTF, common image formats, and some legacy Microsoft Office binaries.

Unknown or weakly identified files are moved into `Unknown/` and keep their original basename. Damaged ZIP files with unreadable central directories are moved into `Damaged-Zip/`, `Damaged-Apple-iWork/`, `Damaged-Pages/`, or `Damaged-Numbers/` based on any surviving embedded path markers, and they keep the `.zip` suffix.

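The marker checks listed under "What it detects" can be illustrated with a minimal sketch. It is not the script's actual code: `classify_listing` is a hypothetical helper that reads ZIP entry names on stdin and prints an output-folder label. It covers only the OOXML and modern iWork markers named above; the Pages multipage marker is left out because this README does not name it, and formats identified by the content of `mimetype` are out of scope for a listing-only check.

```sh
# Hypothetical helper, not part of the script: reads ZIP entry names on stdin
# (one per line, for example from `unzip -Z1 archive.zip`) and prints a label.
classify_listing() {
  local entry ooxml="" word="" xl="" ppt="" iwa="" props="" tables=""
  while IFS= read -r entry; do
    case $entry in
      '[Content_Types].xml')     ooxml=1 ;;   # OOXML package marker
      word/*)                    word=1 ;;
      xl/*)                      xl=1 ;;
      ppt/*)                     ppt=1 ;;
      Index/Document.iwa)        iwa=1 ;;     # modern iWork marker
      Metadata/Properties.plist) props=1 ;;   # modern iWork marker
      Index/Tables/*)            tables=1 ;;  # Numbers evidence
    esac
  done

  if [ -n "$ooxml" ]; then
    if [ -n "$word" ]; then echo Word; return; fi
    if [ -n "$xl" ]; then echo Excel; return; fi
    if [ -n "$ppt" ]; then echo PowerPoint; return; fi
  fi
  if [ -n "$iwa" ] && [ -n "$props" ]; then
    # Without Index/Tables/ the subtype stays ambiguous in this sketch.
    if [ -n "$tables" ]; then echo Numbers; else echo Apple-iWork; fi
    return
  fi
  echo Unknown
}
```

Because the helper only needs entry names, it fits a list-only inspection: piping a listing such as the output of `unzip -Z1 archive.zip` into `classify_listing` never extracts the archive.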
## Naming rules

- For ZIP-based document formats such as `.pages` or `.docx`, the script renames the file to the detected document suffix.
- For modern iWork packages where the subtype is still ambiguous, the script keeps the `.zip` suffix and files them under `Apple-iWork`.
- For each ambiguous `Apple-iWork` file, the script also creates sibling probe copies with `.pages`, `.numbers`, and `.key` suffixes so Finder or the target application can be used as the final discriminator.
- A dedicated `Damaged-Numbers` bucket is used only when damaged ZIPs still contain `Index/Tables/` markers. Damaged iWork archives without that evidence stay under `Damaged-Apple-iWork`.
- For gzip-wrapped payloads, the script keeps the compression visible in the suffix, for example `.pdf.gz`, `.doc.gz`, or `.tar.gz`.
- When metadata exposes a likely title or original name, the script uses that basename for classified files.
- When classification is not confident enough, the script keeps the original basename and moves the file to `Unknown/`.

## Usage

```sh
./classify-recovered-archives.zsh --dry-run /path/to/recovered-files
./classify-recovered-archives.zsh --verbose /path/to/recovered-files
./classify-recovered-archives.zsh --overwrite /path/to/recovered-files
./classify-recovered-archives.zsh --salvage-damaged /path/to/recovered-files
```

Options:

- `--dry-run` prints planned actions without modifying files.
- `--verbose` prints detection reasons and confidence levels.
- `--overwrite` allows replacing an existing destination file instead of appending `-2`, `-3`, and so on.
- `--salvage-damaged` runs `salvage-damaged-zips.zsh` after classification for any damaged ZIP output folders under the scan root and writes the salvage output into `Salvaged/`.

## Output folders

The script creates subdirectories under the scanned directory as needed.
Current labels include:

- `Pages`
- `Numbers`
- `Apple-iWork`
- `Damaged-Apple-iWork`
- `Damaged-Numbers`
- `Damaged-Pages`
- `Damaged-Zip`
- `Word`
- `Excel`
- `PowerPoint`
- `OpenDocument-Text`
- `OpenDocument-Sheet`
- `OpenDocument-Presentation`
- `EPUB`
- `PDF`
- `Text`
- `HTML`
- `XML`
- `RichText`
- `Image`
- `JSON`
- `Jar`
- `APK`
- `Tar`
- `Unknown`

## Notes

- The script uses only list or stream inspection for supported archive types. It does not fully extract ZIP files.
- `xmllint` is optional. When unavailable, the script falls back to lightweight XML parsing.
- Because recovered files may be damaged, some archives will remain in `Unknown/` even if they look close to a supported format.
- For damaged ZIP files, the classifier falls back to embedded path names found by `strings`, which is useful for iWork packages but should still be treated as heuristic evidence.
- Current limitation: healthy modern iWork bundles are not yet reliably split between Keynote and every remaining iWork variant when they lack a Pages or Numbers marker. In practice, those files stay in `Apple-iWork/`, with probe copies created for manual opening.
- Re-running the script skips files already placed in one of its managed output folders.

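The `strings` fallback for damaged ZIPs can be pictured with a small sketch. `damaged_bucket` is a hypothetical helper, not the script's real implementation. It assumes `Index/Document.iwa` is the surviving marker that identifies a generic iWork package, which this README does not state explicitly, and it omits `Damaged-Pages` because the markers for that bucket are not spelled out here.

```sh
# Hypothetical helper, not part of the script: reads text on stdin (for
# example the output of `strings damaged.zip`) and prints a damaged bucket.
damaged_bucket() {
  local text
  text=$(cat)   # fine for a sketch; very large inputs would be better streamed
  case $text in
    *Index/Tables/*)      echo Damaged-Numbers ;;      # Numbers evidence survives
    *Index/Document.iwa*) echo Damaged-Apple-iWork ;;  # assumed generic iWork marker
    *)                    echo Damaged-Zip ;;          # no recognizable path markers
  esac
}
```

The order of the patterns matters: the more specific `Index/Tables/` evidence is checked first, so a file is filed under `Damaged-Numbers` only when that marker is present and otherwise falls through to the broader buckets.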
## Salvage damaged ZIPs

Run the salvage workflow against a directory of damaged ZIPs to rebuild partial archives with `zip -FF`, extract whatever payload is still readable, and write a report:

```sh
./salvage-damaged-zips.zsh subset/Damaged-Zip
```

By default this writes into a sibling directory such as `subset/Damaged-Zip.salvaged/` with:

- repaired partial ZIPs under `repaired/`
- extracted payloads grouped by likely family under `extracted/`
- `salvage-report.md` and `salvage-report.tsv`

If you want the main classifier to trigger this automatically, run:

```sh
./classify-recovered-archives.zsh --salvage-damaged /path/to/recovered-files
```

That writes salvage outputs into family-specific subdirectories under `Salvaged/` inside the scan root, for example `Salvaged/Damaged-Zip/` or `Salvaged/Damaged-Numbers/`.

## Smoke test

Run the bundled smoke test to generate a temporary fixture set and verify the current detection pipeline:

```sh
./tests/smoke-test.zsh
```

The smoke test builds representative Pages, Numbers, ambiguous iWork, Word, unknown ZIP, PDF-in-gzip, and tar-in-gzip samples, then runs the classifier in dry-run mode.

## Recent sample runs

- `subset/` has been processed and includes salvage output for damaged ZIP samples.
- `subset2/` currently resolves to 4 files in `Apple-iWork/`, 1 file in `Pages/`, 2 files in `Damaged-Apple-iWork/`, 1 file in `Unknown/`, and a salvage report under `Salvaged/Damaged-Apple-iWork/`.
- In `subset2/`, the damaged files were repairable enough to recover visible assets, but neither healthy nor damaged samples exposed `Index/Tables/`, so none were promoted to `Numbers` or `Damaged-Numbers`; the ambiguous healthy iWork files now also get `.pages`, `.numbers`, and `.key` probe copies.