# classify-recovered-archives

`classify-recovered-archives.zsh` scans a directory for recovered `.zip` and `.gz` files, tries to infer the original document type from archive internals, renames the files with a more useful suffix, and moves them into labeled subdirectories.

The current iWork heuristics are intentionally conservative: modern iWork bundles are promoted to `.pages` when the package exposes a Pages-specific multipage marker and to `.numbers` when the ZIP layout exposes `Index/Tables/`. Other modern iWork bundles stay in `Apple-iWork/` as `.zip`, and the script also creates `.pages`, `.numbers`, and `.key` probe copies so they can be opened directly for manual confirmation.

## What it detects

- Apple Pages packages from `Metadata/DocumentProperties.plist` plus `Pages/` markers.
- Modern Apple iWork packages from `Index/Document.iwa` and `Metadata/Properties.plist`, with `.pages` assigned when the package carries a multipage marker and `.numbers` assigned when the ZIP contains `Index/Tables/`.
- Damaged Apple iWork packages from embedded path markers, including likely Numbers files when `Index/Tables/` markers survive.
- Microsoft Word, Excel, and PowerPoint OOXML packages from `[Content_Types].xml` plus `word/`, `xl/`, or `ppt/` markers.
- OpenDocument text, spreadsheet, and presentation packages from the `mimetype` entry.
- EPUB packages from the `mimetype` entry.
- APK and JAR archives from clear Java and Android markers.
- Gzip-wrapped TAR archives.
- Gzip-wrapped single files such as PDF, text, HTML, XML, JSON, RTF, common image formats, and some legacy Microsoft Office binaries.

Unknown or weakly identified files are moved into `Unknown/` and keep their original basename.
Damaged ZIP files with unreadable central directories are moved into `Damaged-Zip/`, `Damaged-Apple-iWork/`, `Damaged-Pages/`, or `Damaged-Numbers/` based on any surviving embedded path markers, and they keep the `.zip` suffix.

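
The marker checks listed above can be sketched as a small filter over a ZIP entry listing. This is an illustrative sketch, not the script's actual code: the function name `classify_zip_listing` is hypothetical, only the OOXML and modern iWork rules are covered, and the Pages multipage marker is left out because this README does not name it.

```sh
# Hypothetical helper: reads ZIP entry names on stdin (one per line)
# and prints a folder label. Covers only a subset of the rules above.
classify_zip_listing() {
  ooxml=0 word=0 xl=0 ppt=0 iwa=0 props=0 tables=0
  while IFS= read -r entry; do
    case $entry in
      '[Content_Types].xml')      ooxml=1 ;;
      word/*)                     word=1 ;;
      xl/*)                       xl=1 ;;
      ppt/*)                      ppt=1 ;;
      Index/Document.iwa)         iwa=1 ;;
      Index/Tables/*)             tables=1 ;;
      Metadata/Properties.plist)  props=1 ;;
    esac
  done

  if [ "$ooxml" -eq 1 ] && [ "$word" -eq 1 ]; then echo Word
  elif [ "$ooxml" -eq 1 ] && [ "$xl" -eq 1 ]; then echo Excel
  elif [ "$ooxml" -eq 1 ] && [ "$ppt" -eq 1 ]; then echo PowerPoint
  elif [ "$iwa" -eq 1 ] && [ "$props" -eq 1 ]; then
    # Index/Tables/ promotes the bundle to Numbers; otherwise it stays ambiguous.
    if [ "$tables" -eq 1 ]; then echo Numbers; else echo Apple-iWork; fi
  else
    echo Unknown
  fi
}
```

Assuming Info-ZIP's `zipinfo` is installed, a listing can be piped in with `zipinfo -1 recovered.zip | classify_zip_listing`, which inspects entry names without extracting anything.
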
## Naming rules

- For ZIP-based document formats such as `.pages` or `.docx`, the script renames the file to the detected document suffix.
- For modern iWork packages where the subtype is still ambiguous, the script keeps the `.zip` suffix and files them under `Apple-iWork`.
- For each ambiguous `Apple-iWork` file, the script also creates sibling probe copies with `.pages`, `.numbers`, and `.key` suffixes so Finder or the target application can be used as the final discriminator.
- A dedicated `Damaged-Numbers` bucket is used only when damaged ZIPs still contain `Index/Tables/` markers. Damaged iWork archives without that evidence stay under `Damaged-Apple-iWork`.
- For gzip-wrapped payloads, the script keeps the compression visible in the suffix, for example `.pdf.gz`, `.doc.gz`, or `.tar.gz`.
- When metadata exposes a likely title or original name, the script uses that basename for classified files.
- When classification is not confident enough, the script keeps the original basename and moves the file to `Unknown/`.

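
The gzip naming rule can be illustrated by sniffing the first bytes of the decompressed stream. The sketch below is an assumption about how such a check might look, not the script's implementation: `gz_payload_suffix` is a hypothetical name, only three payload types are recognized, and the bare `gz` fallback simply stands in for "not confident".

```sh
# Hypothetical helper: prints a suffix for a gzip file based on magic bytes
# in its decompressed stream. Only PDF, RTF, and tar are recognized here.
gz_payload_suffix() {
  # First five bytes of the payload (enough for "%PDF-" and "{\rtf").
  head5=$(gzip -dc -- "$1" 2>/dev/null | head -c 5)
  # A ustar-style tar header carries "ustar" at byte offset 257.
  tar5=$(gzip -dc -- "$1" 2>/dev/null | dd bs=1 skip=257 count=5 2>/dev/null) || tar5=
  if [ "$tar5" = ustar ]; then
    echo tar.gz
    return
  fi
  case $head5 in
    %PDF-)   echo pdf.gz ;;
    '{\rtf') echo rtf.gz ;;
    *)       echo gz ;;   # unrecognized payload: keep the plain suffix
  esac
}
```

Because the payload is only streamed through `gzip -dc` and cut off after a few hundred bytes, nothing is written to disk during the check.
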
## Usage

```sh
./classify-recovered-archives.zsh --dry-run /path/to/recovered-files
./classify-recovered-archives.zsh --verbose /path/to/recovered-files
./classify-recovered-archives.zsh --overwrite /path/to/recovered-files
./classify-recovered-archives.zsh --salvage-damaged /path/to/recovered-files
```

Options:

- `--dry-run` prints planned actions without modifying files.
- `--verbose` prints detection reasons and confidence levels.
- `--overwrite` allows replacing an existing destination file instead of appending `-2`, `-3`, and so on.
- `--salvage-damaged` runs `salvage-damaged-zips.zsh` after classification for any damaged ZIP output folders under the scan root and writes the salvage output into `Salvaged/`.

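
The default collision behavior (without `--overwrite`) can be pictured with the following sketch. It is a guess at the general shape only: `unique_dest` is a hypothetical name, and placing the counter before the first dot of the filename is an assumption, since this README only says the counter is appended.

```sh
# Hypothetical helper: prints the first free destination path, trying
# name, name-2, name-3, ... (counter placement is an assumption).
unique_dest() {
  dir=$(dirname -- "$1")
  name=$(basename -- "$1")
  case $name in
    *.*) stem=${name%%.*}; ext=.${name#*.} ;;  # split at the first dot
    *)   stem=$name; ext= ;;
  esac
  candidate=$dir/$name
  n=2
  while [ -e "$candidate" ]; do
    candidate=$dir/$stem-$n$ext
    n=$((n + 1))
  done
  printf '%s\n' "$candidate"
}
```

With this placement, a second `report.pdf.gz` would land as `report-2.pdf.gz`, which keeps compound suffixes such as `.pdf.gz` intact.
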
## Output folders

The script creates subdirectories under the scanned directory as needed. Current labels include:

- `Pages`
- `Numbers`
- `Apple-iWork`
- `Damaged-Apple-iWork`
- `Damaged-Numbers`
- `Damaged-Pages`
- `Damaged-Zip`
- `Word`
- `Excel`
- `PowerPoint`
- `OpenDocument-Text`
- `OpenDocument-Sheet`
- `OpenDocument-Presentation`
- `EPUB`
- `PDF`
- `Text`
- `HTML`
- `XML`
- `RichText`
- `Image`
- `JSON`
- `Jar`
- `APK`
- `Tar`
- `Unknown`

## Notes

- The script uses only list or stream inspection for supported archive types. It does not fully extract ZIP files.
- `xmllint` is optional. When it is unavailable, the script falls back to lightweight XML parsing.
- Because recovered files may be damaged, some archives will remain in `Unknown/` even if they look close to a supported format.
- For damaged ZIP files, the classifier falls back to embedded path names found by `strings`. This is useful for iWork packages but should still be treated as heuristic evidence.
- Current limitation: healthy modern iWork bundles that lack a Pages or Numbers marker cannot yet be reliably identified as Keynote or any other iWork variant. In practice, those files stay in `Apple-iWork/`, with probe copies created for manual opening.
- Re-running the script skips files already placed in one of its managed output folders.

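
The `strings` fallback for damaged ZIPs can be sketched as a filter over the printable strings found in the file. This is a simplified illustration under stated assumptions: `damaged_bucket` is a hypothetical name, the exact marker set the script checks may differ, and the `Damaged-Pages` case is omitted because this README does not say which surviving markers trigger it.

```sh
# Hypothetical helper: reads printable strings on stdin and prints the
# damaged-ZIP bucket suggested by any surviving embedded path markers.
damaged_bucket() {
  iwork=0 tables=0
  while IFS= read -r line; do
    case $line in
      *Index/Tables/*) tables=1 ;;
      *Index/Document.iwa*|*Metadata/Properties.plist*) iwork=1 ;;
    esac
  done
  if [ "$tables" -eq 1 ]; then echo Damaged-Numbers
  elif [ "$iwork" -eq 1 ]; then echo Damaged-Apple-iWork
  else echo Damaged-Zip
  fi
}
```

A typical invocation would be `strings -a damaged.zip | damaged_bucket`. Matching on substrings rather than whole lines is deliberate here, since a corrupted archive may leave path names embedded in longer runs of bytes.
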
## Salvage damaged ZIPs

Run the salvage workflow against a directory of damaged ZIPs to rebuild partial archives with `zip -FF`, extract whatever payload is still readable, and write a report:

```sh
./salvage-damaged-zips.zsh subset/Damaged-Zip
```

By default this writes into a sibling directory such as `subset/Damaged-Zip.salvaged/` with:

- repaired partial ZIPs under `repaired/`
- extracted payloads grouped by likely family under `extracted/`
- `salvage-report.md` and `salvage-report.tsv`

If you want the main classifier to trigger this automatically, run:

```sh
./classify-recovered-archives.zsh --salvage-damaged /path/to/recovered-files
```

That writes salvage outputs into family-specific subdirectories under `Salvaged/` inside the scan root, for example `Salvaged/Damaged-Zip/` or `Salvaged/Damaged-Numbers/`.

## Smoke test

Run the bundled smoke test to generate a temporary fixture set and verify the current detection pipeline:

```sh
./tests/smoke-test.zsh
```

The smoke test builds representative Pages, Numbers, ambiguous iWork, Word, unknown ZIP, PDF-in-gzip, and tar-in-gzip samples, then runs the classifier in dry-run mode.

## Recent sample runs

- `subset/` has been processed and includes salvage output for damaged ZIP samples.
- `subset2/` currently resolves to 4 files in `Apple-iWork/`, 1 file in `Pages/`, 2 files in `Damaged-Apple-iWork/`, 1 file in `Unknown/`, and a salvage report under `Salvaged/Damaged-Apple-iWork/`.
- In `subset2/`, the damaged files were repairable enough to recover visible assets, but neither healthy nor damaged samples exposed `Index/Tables/`, so none were promoted to `Numbers` or `Damaged-Numbers`; the ambiguous healthy iWork files now also get `.pages`, `.numbers`, and `.key` probe copies.