classify-recovered-archives/README.md
Reindl David (IT-PTR-CEN2-SL10) 94bfc77c11 intial state
Co-authored-by: Copilot <copilot@github.com>
2026-05-02 17:06:52 +02:00

# classify-recovered-archives
`classify-recovered-archives.zsh` scans a directory for recovered `.zip` and `.gz` files, tries to infer the original document type from archive internals, renames the files with a more useful suffix, and moves them into labeled subdirectories.

The current iWork heuristics are intentionally conservative: modern iWork bundles are promoted to `.pages` when the package exposes a Pages-specific multipage marker and to `.numbers` when the ZIP layout exposes `Index/Tables/`. Other modern iWork bundles stay in `Apple-iWork/` as `.zip`. For these, the script also creates `.pages`, `.numbers`, and `.key` probe copies so that each file can be opened directly for manual confirmation.
## What it detects
- Apple Pages packages from `Metadata/DocumentProperties.plist` plus `Pages/` markers.
- Modern Apple iWork packages from `Index/Document.iwa` and `Metadata/Properties.plist`, with `.pages` assigned when the package carries a multipage marker and `.numbers` assigned when the ZIP contains `Index/Tables/`.
- Damaged Apple iWork packages from embedded path markers, including likely Numbers files when `Index/Tables/` markers survive.
- Microsoft Word, Excel, and PowerPoint OOXML packages from `[Content_Types].xml` plus `word/`, `xl/`, or `ppt/` markers.
- OpenDocument text, spreadsheet, and presentation packages from the `mimetype` entry.
- EPUB packages from the `mimetype` entry.
- APK and JAR archives from clear Java and Android markers.
- Gzip-wrapped TAR archives.
- Gzip-wrapped single files such as PDF, text, HTML, XML, JSON, RTF, common image formats, and some legacy Microsoft Office binaries.

Unknown or weakly identified files are moved into `Unknown/` and keep their original basename.

Damaged ZIP files with unreadable central directories are moved into `Damaged-Zip/`, `Damaged-Apple-iWork/`, `Damaged-Pages/`, or `Damaged-Numbers/` based on any surviving embedded path markers, and they keep the `.zip` suffix.
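
The marker checks above can be sketched as a small filter over a ZIP listing. This is a minimal illustration, not the script's actual code: `classify_listing` is a hypothetical helper name, it covers only the OOXML and modern iWork rules, and it leaves out the Pages multipage marker because this README does not name it.

```sh
# Hypothetical helper: reads ZIP entry names on stdin (one per line)
# and prints one of the output-folder labels used in this README.
classify_listing() {
  listing=$(cat)
  # has PATTERN: true when any entry name matches the basic regex.
  has() { printf '%s\n' "$listing" | grep -q -- "$1"; }

  if has '^\[Content_Types]\.xml$'; then
    # OOXML: the top-level folder tells the applications apart.
    if   has '^word/'; then echo Word
    elif has '^xl/';   then echo Excel
    elif has '^ppt/';  then echo PowerPoint
    else echo Unknown
    fi
  elif has '^Index/Document\.iwa$' && has '^Metadata/Properties\.plist$'; then
    # Modern iWork: promote to Numbers only when Index/Tables/ is present.
    if has '^Index/Tables/'; then echo Numbers
    else echo Apple-iWork
    fi
  else
    echo Unknown
  fi
}
```

Assuming Info-ZIP `unzip` is installed, the listing can be produced without extracting anything, for example `unzip -Z1 recovered.zip | classify_listing`.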
## Naming rules
- For ZIP-based document formats such as `.pages` or `.docx`, the script renames the file to the detected document suffix.
- For modern iWork packages where the subtype is still ambiguous, the script keeps the `.zip` suffix and files them under `Apple-iWork`.
- For each ambiguous `Apple-iWork` file, the script also creates sibling probe copies with `.pages`, `.numbers`, and `.key` suffixes so Finder or the target application can be used as the final discriminator.
- A dedicated `Damaged-Numbers` bucket is used only when damaged ZIPs still contain `Index/Tables/` markers. Damaged iWork archives without that evidence stay under `Damaged-Apple-iWork`.
- For gzip-wrapped payloads, the script keeps the compression visible in the suffix, for example `.pdf.gz`, `.doc.gz`, or `.tar.gz`.
- When metadata exposes a likely title or original name, the script uses that basename for classified files.
- When classification is not confident enough, the script keeps the original basename and moves the file to `Unknown/`.
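
For the gzip naming rule, one way to pick a suffix is to sniff the first bytes of the decompressed stream. The sketch below is an assumption about how such a check could look, not the script's implementation: `payload_suffix` is a hypothetical name, it recognizes only PDF and POSIX (`ustar`) tar payloads, and it prints `unknown` for everything else.

```sh
# Hypothetical helper: reads a decompressed payload on stdin and prints
# the suffix that keeps the compression visible, or "unknown".
payload_suffix() {
  head_file=$(mktemp) || return 1
  # Keep only the first 512 bytes; that is enough for both checks.
  head -c 512 > "$head_file"
  # PDF files start with "%PDF-".
  magic=$(head -c 5 "$head_file" | tr -d '\0')
  # POSIX tar headers carry "ustar" at byte offset 257.
  tar_magic=$(dd if="$head_file" bs=1 skip=257 count=5 2>/dev/null | tr -d '\0')
  rm -f "$head_file"

  if [ "$tar_magic" = ustar ]; then echo .tar.gz
  elif [ "$magic" = '%PDF-' ]; then echo .pdf.gz
  else echo unknown
  fi
}
```

A typical call would be `gzip -dc recovered.gz 2>/dev/null | payload_suffix`. Old tar archives without the `ustar` magic are not recognized by this check.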
## Usage
```sh
./classify-recovered-archives.zsh --dry-run /path/to/recovered-files
./classify-recovered-archives.zsh --verbose /path/to/recovered-files
./classify-recovered-archives.zsh --overwrite /path/to/recovered-files
./classify-recovered-archives.zsh --salvage-damaged /path/to/recovered-files
```
Options:
- `--dry-run` prints planned actions without modifying files.
- `--verbose` prints detection reasons and confidence levels.
- `--overwrite` allows replacing an existing destination file instead of appending `-2`, `-3`, and so on.
- `--salvage-damaged` runs `salvage-damaged-zips.zsh` after classification on every damaged-ZIP output folder under the scan root and writes the salvage output into `Salvaged/`.
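
The default collision behavior (appending `-2`, `-3`, and so on when `--overwrite` is not given) can be illustrated as follows. This is a sketch under assumptions: `unique_dest` is a hypothetical helper, and placing the counter before the first dot of the file name is a guess at the naming, not something this README specifies.

```sh
# Hypothetical helper: print a destination path in DIR for NAME that does
# not exist yet, inserting -2, -3, ... before the suffix when needed.
unique_dest() {
  dir=$1
  name=$2
  stem=${name%%.*}      # text before the first dot, e.g. "report"
  ext=${name#"$stem"}   # the rest, e.g. ".pdf.gz" (empty when no dot)
  candidate=$dir/$name
  n=2
  while [ -e "$candidate" ]; do
    candidate=$dir/$stem-$n$ext
    n=$((n + 1))
  done
  printf '%s\n' "$candidate"
}
```

With `report.pdf.gz` already present in `PDF/`, `unique_dest PDF report.pdf.gz` prints `PDF/report-2.pdf.gz`. Splitting at the first dot keeps compound suffixes such as `.tar.gz` intact, at the cost of mishandling titles that themselves contain dots.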
## Output folders
The script creates subdirectories under the scanned directory as needed. Current labels include:
- `Pages`
- `Numbers`
- `Apple-iWork`
- `Damaged-Apple-iWork`
- `Damaged-Numbers`
- `Damaged-Pages`
- `Damaged-Zip`
- `Word`
- `Excel`
- `PowerPoint`
- `OpenDocument-Text`
- `OpenDocument-Sheet`
- `OpenDocument-Presentation`
- `EPUB`
- `PDF`
- `Text`
- `HTML`
- `XML`
- `RichText`
- `Image`
- `JSON`
- `Jar`
- `APK`
- `Tar`
- `Unknown`
## Notes
- The script uses only list or stream inspection for supported archive types. It does not fully extract ZIP files.
- `xmllint` is optional. When unavailable, the script falls back to lightweight XML parsing.
- Because recovered files may be damaged, some archives will remain in `Unknown/` even if they look close to a supported format.
- For damaged ZIP files, the classifier falls back to embedded path names found by `strings`, which is useful for iWork packages but should still be treated as heuristic evidence.
- Current limitation: healthy modern iWork bundles that lack a Pages or Numbers marker cannot yet be reliably identified as Keynote or any other iWork variant. In practice, those files stay in `Apple-iWork/`, with probe copies created for manual opening.
- Re-running the script skips files already placed in one of its managed output folders.
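
The `strings` fallback for damaged ZIPs can be pictured as a filter over the recovered path names. The sketch below is illustrative only: `damaged_bucket` is a hypothetical name, treating `Index/Document.iwa` as generic iWork evidence is an assumption, and `Damaged-Pages` is omitted because this README does not say which surviving marker selects it.

```sh
# Hypothetical helper: reads printable strings recovered from a damaged ZIP
# on stdin and prints the damaged-output folder suggested by the markers.
damaged_bucket() {
  markers=$(cat)
  case $markers in
    # Index/Tables/ is the only Numbers evidence this README names.
    *Index/Tables/*)      echo Damaged-Numbers ;;
    # Assumption: the main iWork index file marks a generic iWork package.
    *Index/Document.iwa*) echo Damaged-Apple-iWork ;;
    *)                    echo Damaged-Zip ;;
  esac
}
```

It would be fed with something like `strings -a recovered.zip | damaged_bucket`. Because any byte sequence that happens to look like one of these paths will match, the result is a hint rather than proof.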
## Salvage damaged ZIPs
Run the salvage workflow against a directory of damaged ZIPs to rebuild partial archives with `zip -FF`, extract whatever payload is still readable, and write a report:
```sh
./salvage-damaged-zips.zsh subset/Damaged-Zip
```
By default this writes into a sibling directory such as `subset/Damaged-Zip.salvaged/` with:
- repaired partial ZIPs under `repaired/`
- extracted payloads grouped by likely family under `extracted/`
- `salvage-report.md` and `salvage-report.tsv`

If you want the main classifier to trigger this automatically, run:
```sh
./classify-recovered-archives.zsh --salvage-damaged /path/to/recovered-files
```
That writes salvage outputs into family-specific subdirectories under `Salvaged/` inside the scan root, for example `Salvaged/Damaged-Zip/` or `Salvaged/Damaged-Numbers/`.
## Smoke test
Run the bundled smoke test to generate a temporary fixture set and verify the current detection pipeline:
```sh
./tests/smoke-test.zsh
```
The smoke test builds representative Pages, Numbers, ambiguous iWork, Word, unknown ZIP, PDF-in-gzip, and tar-in-gzip samples, then runs the classifier in dry-run mode.
## Recent sample runs
- `subset/` has been processed and includes salvage output for damaged ZIP samples.
- `subset2/` currently resolves to 4 files in `Apple-iWork/`, 1 file in `Pages/`, 2 files in `Damaged-Apple-iWork/`, 1 file in `Unknown/`, and a salvage report under `Salvaged/Damaged-Apple-iWork/`.
- In `subset2/`, the damaged files were repairable enough to recover visible assets, but neither the healthy nor the damaged samples exposed `Index/Tables/`, so none were promoted to `Numbers` or `Damaged-Numbers`. The ambiguous healthy iWork files now also get `.pages`, `.numbers`, and `.key` probe copies.