commit 94bfc77c113471577eb3f7782cc55a332a82e4f8 Author: Reindl David (IT-PTR-CEN2-SL10) Date: Sat May 2 17:06:52 2026 +0200 intial state Co-authored-by: Copilot diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..860f418 --- /dev/null +++ b/.gitignore @@ -0,0 +1,2 @@ +.DS_Store +.vscode/ \ No newline at end of file diff --git a/README.md b/README.md new file mode 100644 index 0000000..2c21da2 --- /dev/null +++ b/README.md @@ -0,0 +1,123 @@ +# classify-recovered-archives + +`classify-recovered-archives.zsh` scans a directory for recovered `.zip` and `.gz` files, tries to infer the original document type from archive internals, renames the files with a more useful suffix, and moves them into labeled subdirectories. + +The current iWork heuristics are intentionally conservative: modern iWork bundles are promoted to `.pages` when the package exposes a Pages-specific multipage marker and to `.numbers` when the ZIP layout exposes `Index/Tables/`. Other modern iWork bundles stay in `Apple-iWork/` as `.zip`, and the script also creates `.pages`, `.numbers`, and `.key` probe copies so they can be opened directly for manual confirmation. + +## What it detects + +- Apple Pages packages from `Metadata/DocumentProperties.plist` plus `Pages/` markers. +- Modern Apple iWork packages from `Index/Document.iwa` and `Metadata/Properties.plist`, with `.pages` assigned when the package carries a multipage marker and `.numbers` assigned when the ZIP contains `Index/Tables/`. +- Damaged Apple iWork packages from embedded path markers, including likely Numbers files when `Index/Tables/` markers survive. +- Microsoft Word, Excel, and PowerPoint OOXML packages from `[Content_Types].xml` plus `word/`, `xl/`, or `ppt/` markers. +- OpenDocument text, spreadsheet, and presentation packages from the `mimetype` entry. +- EPUB packages from the `mimetype` entry. +- APK and JAR archives from clear Java and Android markers. +- Gzip-wrapped TAR archives. +- Gzip-wrapped single files such as PDF, text, HTML, XML, JSON, RTF, common image formats, and some legacy Microsoft Office binaries. + +Unknown or weakly identified files are moved into `Unknown/` and keep their original basename. +Damaged ZIP files with unreadable central directories are moved into `Damaged-Zip/`, `Damaged-Apple-iWork/`, `Damaged-Pages/`, or `Damaged-Numbers/` based on any surviving embedded path markers, and they keep the `.zip` suffix. + +## Naming rules + +- For ZIP-based document formats such as `.pages` or `.docx`, the script renames the file to the detected document suffix. +- For modern iWork packages where the subtype is still ambiguous, the script keeps the `.zip` suffix and files them under `Apple-iWork`. +- For each ambiguous `Apple-iWork` file, the script also creates sibling probe copies with `.pages`, `.numbers`, and `.key` suffixes so Finder or the target application can be used as the final discriminator. +- A dedicated `Damaged-Numbers` bucket is used only when damaged ZIPs still contain `Index/Tables/` markers. Damaged iWork archives without that evidence stay under `Damaged-Apple-iWork`. +- For gzip-wrapped payloads, the script keeps the compression visible in the suffix, for example `.pdf.gz`, `.doc.gz`, or `.tar.gz`. +- When metadata exposes a likely title or original name, the script uses that basename for classified files. +- When classification is not confident enough, the script keeps the original basename and moves the file to `Unknown/`. + +## Usage + +```sh +./classify-recovered-archives.zsh --dry-run /path/to/recovered-files +./classify-recovered-archives.zsh --verbose /path/to/recovered-files +./classify-recovered-archives.zsh --overwrite /path/to/recovered-files +./classify-recovered-archives.zsh --salvage-damaged /path/to/recovered-files +``` + +Options: + +- `--dry-run` prints planned actions without modifying files. +- `--verbose` prints detection reasons and confidence levels. +- `--overwrite` allows replacing an existing destination file instead of appending `-2`, `-3`, and so on. +- `--salvage-damaged` runs `salvage-damaged-zips.zsh` after classification for any damaged ZIP output folders under the scan root and writes the salvage output into `Salvaged/`. + +## Output folders + +The script creates subdirectories under the scanned directory as needed. Current labels include: + +- `Pages` +- `Numbers` +- `Apple-iWork` +- `Damaged-Apple-iWork` +- `Damaged-Numbers` +- `Damaged-Pages` +- `Damaged-Zip` +- `Word` +- `Excel` +- `PowerPoint` +- `OpenDocument-Text` +- `OpenDocument-Sheet` +- `OpenDocument-Presentation` +- `EPUB` +- `PDF` +- `Text` +- `HTML` +- `XML` +- `RichText` +- `Image` +- `JSON` +- `Jar` +- `APK` +- `Tar` +- `Unknown` + +## Notes + +- The script uses only list or stream inspection for supported archive types. It does not fully extract ZIP files. +- `xmllint` is optional. When unavailable, the script falls back to lightweight XML parsing. +- Because recovered files may be damaged, some archives will remain in `Unknown/` even if they look close to a supported format. +- For damaged ZIP files, the classifier falls back to embedded path names found by `strings`, which is useful for iWork packages but should still be treated as heuristic evidence. +- Current limitation: healthy modern iWork bundles are not yet reliably split between Keynote and every remaining iWork variant when they lack a Pages or Numbers marker. In practice, those files stay in `Apple-iWork/`, with probe copies created for manual opening. +- Re-running the script skips files already placed in one of its managed output folders. + +## Salvage damaged ZIPs + +Run the salvage workflow against a directory of damaged ZIPs to rebuild partial archives with `zip -FF`, extract whatever payload is still readable, and write a report: + +```sh +./salvage-damaged-zips.zsh subset/Damaged-Zip +``` + +By default this writes into a sibling directory such as `subset/Damaged-Zip.salvaged/` with: + +- repaired partial ZIPs under `repaired/` +- extracted payloads grouped by likely family under `extracted/` +- `salvage-report.md` and `salvage-report.tsv` + +If you want the main classifier to trigger this automatically, run: + +```sh +./classify-recovered-archives.zsh --salvage-damaged /path/to/recovered-files +``` + +That writes salvage outputs into family-specific subdirectories under `Salvaged/` inside the scan root, for example `Salvaged/Damaged-Zip/` or `Salvaged/Damaged-Numbers/`. + +## Smoke test + +Run the bundled smoke test to generate a temporary fixture set and verify the current detection pipeline: + +```sh +./tests/smoke-test.zsh +``` + +The smoke test builds representative Pages, Numbers, ambiguous iWork, Word, unknown ZIP, PDF-in-gzip, and tar-in-gzip samples, then runs the classifier in dry-run mode. + +## Recent sample runs + +- `subset/` has been processed and includes salvage output for damaged ZIP samples. +- `subset2/` currently resolves to 4 files in `Apple-iWork/`, 1 file in `Pages/`, 2 files in `Damaged-Apple-iWork/`, 1 file in `Unknown/`, and a salvage report under `Salvaged/Damaged-Apple-iWork/`. +- In `subset2/`, the damaged files were repairable enough to recover visible assets, but neither healthy nor damaged samples exposed `Index/Tables/`, so none were promoted to `Numbers` or `Damaged-Numbers`; the ambiguous healthy iWork files now also get `.pages`, `.numbers`, and `.key` probe copies. diff --git a/classify-recovered-archives.zsh b/classify-recovered-archives.zsh new file mode 100755 index 0000000..c026bcb --- /dev/null +++ b/classify-recovered-archives.zsh @@ -0,0 +1,885 @@ +#!/bin/zsh + +emulate -L zsh +setopt extended_glob no_nomatch no_unset pipefail + +typeset -gr SCRIPT_NAME=${0:t} +typeset -g SCAN_ROOT="" +typeset -g DRY_RUN=0 +typeset -g VERBOSE=0 +typeset -g OVERWRITE=0 +typeset -g SALVAGE_DAMAGED=0 +typeset -g SALVAGE_SCRIPT_PATH="${0:A:h}/salvage-damaged-zips.zsh" +typeset -g TMP_ROOT="" + +typeset -ga MANAGED_DIRS=( + Apple-iWork + Damaged-Apple-iWork + Damaged-Numbers + Damaged-Pages + Damaged-Zip + Numbers + Pages + Word + Excel + PowerPoint + OpenDocument-Text + OpenDocument-Sheet + OpenDocument-Presentation + EPUB + PDF + Text + HTML + XML + RichText + Image + JSON + Jar + APK + Tar + Unknown +) + +typeset -gi PROCESSED_COUNT=0 +typeset -gi CLASSIFIED_COUNT=0 +typeset -gi UNKNOWN_COUNT=0 +typeset -gi RENAMED_COUNT=0 +typeset -gi SKIPPED_COUNT=0 +typeset -gi FAILED_COUNT=0 +typeset -g ACTION_LABEL="Renamed" + +typeset -g DETECTED_GROUP="Unknown" +typeset -g DETECTED_SUFFIX="" +typeset -g DETECTED_BASENAME="" +typeset -g DETECTED_CONFIDENCE="low" +typeset -g DETECTED_REASON="" +typeset -ga IWORK_AMBIGUOUS_SUFFIXES=(pages numbers key) + +usage() { + cat <<'EOF' +Usage: classify-recovered-archives.zsh [options] DIRECTORY + +Scan DIRECTORY for recovered .zip and .gz files, infer the original document type, +rename them with the proper suffix, and move them into labeled subdirectories. + +Options: + -n, --dry-run Print planned actions without modifying files. + -v, --verbose Print extra diagnostics while scanning. + --overwrite Allow overwriting an existing destination file. + --salvage-damaged + After classification, run salvage-damaged-zips.zsh on any + damaged ZIP output folders under the scan root. + -h, --help Show this help text. +EOF +} + +log() { + print -r -- "$*" +} + +verbose() { + if (( VERBOSE )); then + print -r -- "$*" + fi +} + +warn() { + print -u2 -r -- "warning: $*" +} + +die() { + print -u2 -r -- "error: $*" + exit 1 +} + +cleanup() { + if [[ -n ${TMP_ROOT:-} && -d ${TMP_ROOT:-} ]]; then + rm -rf -- "$TMP_ROOT" + fi +} + +trap cleanup EXIT INT TERM + +ensure_tools() { + local tool + for tool in file unzip gzip tar plutil perl mktemp find strings; do + command -v "$tool" >/dev/null 2>&1 || die "required tool not found: $tool" + done + + if ! command -v xmllint >/dev/null 2>&1; then + verbose "xmllint not found; falling back to lightweight XML parsing" + fi +} + +make_temp_root() { + TMP_ROOT=$(mktemp -d "${TMPDIR:-/tmp}/classify-recovered-archives.XXXXXX") || die "failed to create temp directory" +} + +reset_detection() { + DETECTED_GROUP="Unknown" + DETECTED_SUFFIX="" + DETECTED_BASENAME="" + DETECTED_CONFIDENCE="low" + DETECTED_REASON="no strong signature found" +} + +set_detection() { + DETECTED_GROUP=$1 + DETECTED_SUFFIX=$2 + DETECTED_CONFIDENCE=$3 + DETECTED_REASON=$4 + DETECTED_BASENAME=${5:-} +} + +trim_value() { + local value="$1" + value="${value//$'\r'/ }" + value="${value//$'\n'/ }" + value="${value//$'\t'/ }" + value=${value##[[:space:]]##} + value=${value%%[[:space:]]##} + print -r -- "$value" +} + +sanitize_name() { + local value="$1" + value=$(trim_value "$value") + value=${value//$'\0'/} + value=${value//\//-} + value=${value//:/-} + value=${value//\\/-} + value=$(print -r -- "$value" | tr -s ' ') + value=${value##.##} + value=${value%%[[:space:]]##} + value=${value##[[:space:]]##} + if [[ -z "$value" ]]; then + value="Untitled" + fi + print -r -- "$value" +} + +xml_extract_title() { + local xml_file="$1" + + perl -0ne ' + my $text = $_; + if ($text =~ m{<(?:[[:alnum:]_]+:)?title\b[^>]*>(.*?)}is) { + my $value = $1; + $value =~ s/<[^>]+>//g; + $value =~ s/&/&/g; + $value =~ s/<//g; + $value =~ s/"/"/g; + $value =~ s/'/'"'"'/g; + $value =~ s/[\r\n\t]+/ /g; + $value =~ s/^\s+|\s+$//g; + print $value if length $value; + } + ' -- "$xml_file" +} + +plist_extract_title() { + local plist_file="$1" + local xml_file="$TMP_ROOT/${RANDOM}-plist.xml" + + if ! plutil -convert xml1 -o "$xml_file" "$plist_file" >/dev/null 2>&1; then + rm -f -- "$xml_file" + return 0 + fi + + perl -0ne ' + my $xml = $_; + my %pairs; + while ($xml =~ m{([^<]+)\s*(?:(.*?)|(.*?))}sg) { + my ($key, $string, $date) = ($1, $2, $3); + my $value = defined $string ? $string : $date; + next unless defined $value; + $value =~ s/&/&/g; + $value =~ s/<//g; + $value =~ s/"/"/g; + $value =~ s/'/'"'"'/g; + $value =~ s/[\r\n\t]+/ /g; + $value =~ s/^\s+|\s+$//g; + push @{ $pairs{$key} }, $value if length $value; + } + + for my $preferred (qw(kMDItemTitle DocumentTitle documentTitle Title title kMDItemDisplayName displayName Name name)) { + if (exists $pairs{$preferred} && @{ $pairs{$preferred} }) { + print $pairs{$preferred}[0]; + exit 0; + } + } + + for my $key (sort keys %pairs) { + next unless $key =~ /(title|name)/i; + if (@{ $pairs{$key} }) { + print $pairs{$key}[0]; + exit 0; + } + } + ' -- "$xml_file" + + rm -f -- "$xml_file" +} + +plist_extract_value() { + local plist_file="$1" + local key_name="$2" + local xml_file="$TMP_ROOT/${RANDOM}-plist-value.xml" + + if ! plutil -convert xml1 -o "$xml_file" "$plist_file" >/dev/null 2>&1; then + rm -f -- "$xml_file" + return 0 + fi + + TARGET_PLIST_KEY="$key_name" perl -0ne ' + my $target_key = $ENV{TARGET_PLIST_KEY}; + my $xml = $_; + if ($xml =~ m{\Q$target_key\E\s*<(string|date|true|false)>(.*?)|\Q$target_key\E\s*<(true|false)\s*/>}sg) { + my $tag = defined $1 ? $1 : $3; + my $value = defined $2 ? $2 : $tag; + $value = $tag if $tag eq q{true} || $tag eq q{false}; + $value =~ s/&/&/g; + $value =~ s/<//g; + $value =~ s/"/"/g; + $value =~ s/'/'"'"'/g; + $value =~ s/[\r\n\t]+/ /g; + $value =~ s/^\s+|\s+$//g; + print $value; + } + ' -- "$xml_file" + + rm -f -- "$xml_file" +} + +extract_zip_entry_to_temp() { + local archive="$1" + local entry_name="$2" + local destination="$TMP_ROOT/${RANDOM}-${entry_name:t}" + + if unzip -p "$archive" "$entry_name" > "$destination" 2>/dev/null; then + print -r -- "$destination" + return 0 + fi + + rm -f -- "$destination" + return 1 +} + +zip_listing() { + unzip -Z1 "$1" 2>/dev/null +} + +zip_has_entry() { + local listing="$1" + local pattern="$2" + print -r -- "$listing" | grep -E -q -- "$pattern" +} + +zip_string_markers() { + strings -a "$1" 2>/dev/null +} + +text_has_marker() { + local text="$1" + local pattern="$2" + print -r -- "$text" | grep -E -q -- "$pattern" +} + +archive_has_binary_marker() { + local archive="$1" + local pattern="$2" + LC_ALL=C grep -aE -q -- "$pattern" "$archive" +} + +extract_zip_title() { + local archive="$1" + local entry_name="$2" + local extracted="" + local title="" + + extracted=$(extract_zip_entry_to_temp "$archive" "$entry_name") || return 0 + + case "$entry_name" in + *.plist) + title=$(plist_extract_title "$extracted") + ;; + *.xml) + title=$(xml_extract_title "$extracted") + ;; + esac + + rm -f -- "$extracted" + print -r -- "$title" +} + +map_odf_mimetype() { + case "$1" in + application/vnd.oasis.opendocument.text) + print -r -- "OpenDocument-Text|odt" + ;; + application/vnd.oasis.opendocument.spreadsheet) + print -r -- "OpenDocument-Sheet|ods" + ;; + application/vnd.oasis.opendocument.presentation) + print -r -- "OpenDocument-Presentation|odp" + ;; + application/epub+zip) + print -r -- "EPUB|epub" + ;; + *) + return 1 + ;; + esac +} + +classify_zip() { + local archive="$1" + local listing="" + local title="" + local mime_file="" + local mime_value="" + local mapped="" + local iwork_properties="" + local is_multi_page="" + + listing=$(zip_listing "$archive") || { + if archive_has_binary_marker "$archive" 'Metadata/DocumentProperties\.plist|Pages/'; then + set_detection "Damaged-Pages" "zip" "medium" "damaged ZIP contains Apple Pages package markers" + elif archive_has_binary_marker "$archive" 'Index/Tables/'; then + set_detection "Damaged-Numbers" "zip" "medium" "damaged ZIP contains Apple Numbers table markers" + elif archive_has_binary_marker "$archive" 'Index/Document\.iwa' && archive_has_binary_marker "$archive" 'Index/CalculationEngine'; then + set_detection "Damaged-Apple-iWork" "zip" "medium" "damaged ZIP contains Apple iWork internal markers" + else + set_detection "Damaged-Zip" "zip" "low" "failed to read ZIP central directory" + fi + return 0 + } + + if zip_has_entry "$listing" '^Metadata/DocumentProperties\.plist$' && zip_has_entry "$listing" '^Pages/'; then + title=$(extract_zip_title "$archive" 'Metadata/DocumentProperties.plist') + set_detection "Pages" "pages" "high" "Apple Pages package markers found" "$title" + return 0 + fi + + if zip_has_entry "$listing" '^Index/Document\.iwa$' && zip_has_entry "$listing" '^Metadata/Properties\.plist$'; then + iwork_properties=$(extract_zip_entry_to_temp "$archive" 'Metadata/Properties.plist') || iwork_properties="" + if [[ -n "$iwork_properties" ]]; then + is_multi_page=$(plist_extract_value "$iwork_properties" 'isMultiPage') + title=$(plist_extract_title "$iwork_properties") + rm -f -- "$iwork_properties" + fi + + if zip_has_entry "$listing" '^Index/Tables/'; then + set_detection "Numbers" "numbers" "high" "modern iWork package contains Numbers table markers" "$title" + elif [[ "$is_multi_page" == true ]]; then + set_detection "Pages" "pages" "medium" "modern iWork package with multipage marker" "$title" + else + set_detection "Apple-iWork" "zip" "medium" "modern iWork package detected but subtype is ambiguous" "$title" + fi + return 0 + fi + + if zip_has_entry "$listing" '^\[Content_Types\]\.xml$' && zip_has_entry "$listing" '^word/'; then + title=$(extract_zip_title "$archive" 'docProps/core.xml') + set_detection "Word" "docx" "high" "WordprocessingML markers found" "$title" + return 0 + fi + + if zip_has_entry "$listing" '^\[Content_Types\]\.xml$' && zip_has_entry "$listing" '^xl/'; then + title=$(extract_zip_title "$archive" 'docProps/core.xml') + set_detection "Excel" "xlsx" "high" "SpreadsheetML markers found" "$title" + return 0 + fi + + if zip_has_entry "$listing" '^\[Content_Types\]\.xml$' && zip_has_entry "$listing" '^ppt/'; then + title=$(extract_zip_title "$archive" 'docProps/core.xml') + set_detection "PowerPoint" "pptx" "high" "PresentationML markers found" "$title" + return 0 + fi + + if zip_has_entry "$listing" '^mimetype$'; then + mime_file=$(extract_zip_entry_to_temp "$archive" 'mimetype') || mime_file="" + if [[ -n "$mime_file" ]]; then + mime_value=$(trim_value "$(head -c 255 -- "$mime_file" 2>/dev/null)") + rm -f -- "$mime_file" + + mapped=$(map_odf_mimetype "$mime_value") || mapped="" + if [[ -n "$mapped" ]]; then + local detected_group=${mapped%%|*} + local detected_suffix=${mapped##*|} + if [[ "$detected_suffix" == odt || "$detected_suffix" == ods || "$detected_suffix" == odp ]]; then + title=$(extract_zip_title "$archive" 'meta.xml') + fi + set_detection "$detected_group" "$detected_suffix" "high" "mimetype entry identified package type" "$title" + return 0 + fi + fi + fi + + if zip_has_entry "$listing" '^AndroidManifest\.xml$' && zip_has_entry "$listing" '^classes\.dex$'; then + set_detection "APK" "apk" "high" "Android APK markers found" + return 0 + fi + + if zip_has_entry "$listing" '^META-INF/MANIFEST\.MF$'; then + set_detection "Jar" "jar" "medium" "Java archive manifest found" + return 0 + fi + + set_detection "Unknown" "zip" "low" "ZIP archive lacks a strong application signature" +} + +gzip_original_name() { + perl -e ' + use strict; + use warnings; + + my $file = shift @ARGV; + open my $fh, q{<:raw}, $file or exit 0; + read($fh, my $header, 10) == 10 or exit 0; + my @bytes = unpack(q{C10}, $header); + exit 0 unless $bytes[0] == 0x1f && $bytes[1] == 0x8b; + + my $flags = $bytes[3]; + + if ($flags & 0x04) { + read($fh, my $xlen_raw, 2) == 2 or exit 0; + my $xlen = unpack(q{v}, $xlen_raw); + read($fh, my $discard, $xlen) == $xlen or exit 0; + } + + if ($flags & 0x08) { + my $name = q{}; + while (read($fh, my $char, 1) == 1) { + last if $char eq "\0"; + $name .= $char; + } + print $name if length $name; + } + ' -- "$1" +} + +derive_basename_from_hint() { + local hint="$1" + local suffix="$2" + local base="${hint:t}" + local inner_suffix="$suffix" + + base=${base%.gz} + if [[ "$inner_suffix" == *.gz ]]; then + inner_suffix=${inner_suffix%.gz} + fi + if [[ -n "$inner_suffix" ]]; then + base=${base%.${inner_suffix}} + else + base=${base%.*} + fi + + print -r -- "$base" +} + +classify_payload_by_file_info() { + local payload_file="$1" + local description="$2" + local mime_type="$3" + + case "$mime_type" in + application/pdf) + set_detection "PDF" "pdf.gz" "high" "gzip payload detected as PDF" + return 0 + ;; + text/plain) + set_detection "Text" "txt.gz" "medium" "gzip payload detected as plain text" + return 0 + ;; + text/html) + set_detection "HTML" "html.gz" "medium" "gzip payload detected as HTML" + return 0 + ;; + application/xml|text/xml) + set_detection "XML" "xml.gz" "medium" "gzip payload detected as XML" + return 0 + ;; + application/json|text/json) + set_detection "JSON" "json.gz" "medium" "gzip payload detected as JSON" + return 0 + ;; + application/rtf) + set_detection "RichText" "rtf.gz" "medium" "gzip payload detected as RTF" + return 0 + ;; + image/png) + set_detection "Image" "png.gz" "high" "gzip payload detected as PNG" + return 0 + ;; + image/jpeg) + set_detection "Image" "jpg.gz" "high" "gzip payload detected as JPEG" + return 0 + ;; + image/tiff) + set_detection "Image" "tiff.gz" "high" "gzip payload detected as TIFF" + return 0 + ;; + image/gif) + set_detection "Image" "gif.gz" "high" "gzip payload detected as GIF" + return 0 + ;; + application/zip) + classify_zip "$payload_file" + if [[ "$DETECTED_GROUP" != "Unknown" ]]; then + DETECTED_SUFFIX="${DETECTED_SUFFIX}.gz" + DETECTED_REASON="gzip payload wraps a recognized ${DETECTED_GROUP} package" + else + set_detection "Unknown" "gz" "low" "gzip payload is ZIP data without a strong application signature" + fi + return 0 + ;; + esac + + if [[ "$description" == *"Microsoft Word"* ]]; then + set_detection "Word" "doc.gz" "medium" "gzip payload looks like a legacy Word document" + return 0 + fi + + if [[ "$description" == *"Microsoft Excel"* ]]; then + set_detection "Excel" "xls.gz" "medium" "gzip payload looks like a legacy Excel document" + return 0 + fi + + if [[ "$description" == *"Microsoft PowerPoint"* ]]; then + set_detection "PowerPoint" "ppt.gz" "medium" "gzip payload looks like a legacy PowerPoint document" + return 0 + fi + + set_detection "Unknown" "gz" "low" "gzip payload type is not recognized" +} + +classify_gz() { + local archive="$1" + local header_name="" + local payload_file="$TMP_ROOT/${RANDOM}-payload" + local mime_type="" + local description="" + + header_name=$(gzip_original_name "$archive") + + if tar -tzf "$archive" >/dev/null 2>&1; then + set_detection "Tar" "tar.gz" "high" "gzip payload is a TAR archive" + if [[ -n "$header_name" ]]; then + DETECTED_BASENAME=$(derive_basename_from_hint "$header_name" "tar.gz") + fi + return 0 + fi + + if ! gzip -cd -- "$archive" > "$payload_file" 2>/dev/null; then + set_detection "Unknown" "gz" "low" "failed to decompress gzip payload" + return 0 + fi + + mime_type=$(file -b --mime-type "$payload_file" 2>/dev/null) + description=$(file -b "$payload_file" 2>/dev/null) + classify_payload_by_file_info "$payload_file" "$description" "$mime_type" + + if [[ -n "$header_name" && "$DETECTED_GROUP" != "Unknown" ]]; then + DETECTED_BASENAME=$(derive_basename_from_hint "$header_name" "$DETECTED_SUFFIX") + fi + + rm -f -- "$payload_file" +} + +is_managed_output_path() { + local path="$1" + local relative="${path#$SCAN_ROOT/}" + local managed_dir + + if [[ "$relative" == Salvaged/* || "$relative" == *.salvaged/* ]]; then + return 0 + fi + + for managed_dir in $MANAGED_DIRS; do + if [[ "$relative" == ${managed_dir}/* ]]; then + return 0 + fi + done + + return 1 +} + +resolve_destination() { + local destination_dir="$1" + local basename="$2" + local suffix="$3" + local candidate="$destination_dir/$basename.$suffix" + local counter=2 + + if (( OVERWRITE )); then + print -r -- "$candidate" + return 0 + fi + + while [[ -e "$candidate" ]]; do + candidate="$destination_dir/$basename-$counter.$suffix" + (( counter++ )) + done + + print -r -- "$candidate" +} + +perform_move() { + local source_path="$1" + local destination_path="$2" + + if (( DRY_RUN )); then + log "DRY-RUN $source_path -> $destination_path" + return 0 + fi + + mkdir -p -- "${destination_path:h}" || return 1 + if (( OVERWRITE )); then + mv -f -- "$source_path" "$destination_path" + else + mv -- "$source_path" "$destination_path" + fi +} + +perform_copy() { + local source_path="$1" + local destination_path="$2" + + if (( DRY_RUN )); then + log "DRY-RUN copy $source_path -> $destination_path" + return 0 + fi + + mkdir -p -- "${destination_path:h}" || return 1 + if (( OVERWRITE )); then + cp -f "$source_path" "$destination_path" + else + cp "$source_path" "$destination_path" + fi +} + +create_ambiguous_iwork_copies() { + local source_path="$1" + local destination_dir="$2" + local final_basename="$3" + local suffix="" + local copy_path="" + + for suffix in $IWORK_AMBIGUOUS_SUFFIXES; do + copy_path=$(resolve_destination "$destination_dir" "$final_basename" "$suffix") + if perform_copy "$source_path" "$copy_path"; then + verbose "prepared iWork probe copy: $copy_path" + log "$source_path -> $copy_path [Apple-iWork-probe, low]" + else + (( FAILED_COUNT++ )) + warn "failed to copy $source_path to $copy_path" + fi + done +} + +process_archive() { + local archive="$1" + local source_name="$archive:t" + local source_extension="${source_name:e:l}" + local source_basename="${source_name:r}" + local final_basename="" + local destination_dir="" + local destination_path="" + + if is_managed_output_path "$archive"; then + (( SKIPPED_COUNT++ )) + verbose "skipping managed output path: $archive" + return 0 + fi + + (( PROCESSED_COUNT++ )) + reset_detection + + case "$source_extension" in + zip) + classify_zip "$archive" + ;; + gz) + classify_gz "$archive" + ;; + *) + set_detection "Unknown" "$source_extension" "low" "unsupported file extension" + ;; + esac + + if [[ "$DETECTED_GROUP" == "Unknown" ]]; then + (( UNKNOWN_COUNT++ )) + final_basename="$source_basename" + DETECTED_SUFFIX=${DETECTED_SUFFIX:-$source_extension} + else + (( CLASSIFIED_COUNT++ )) + if [[ -n "$DETECTED_BASENAME" ]]; then + final_basename="$DETECTED_BASENAME" + else + final_basename="$source_basename" + fi + fi + + final_basename=$(sanitize_name "$final_basename") + destination_dir="$SCAN_ROOT/$DETECTED_GROUP" + destination_path=$(resolve_destination "$destination_dir" "$final_basename" "$DETECTED_SUFFIX") + + verbose "[$DETECTED_CONFIDENCE] $archive => $DETECTED_GROUP ($DETECTED_REASON)" + if perform_move "$archive" "$destination_path"; then + (( RENAMED_COUNT++ )) + log "$archive -> $destination_path [$DETECTED_GROUP, $DETECTED_CONFIDENCE]" + if [[ "$DETECTED_GROUP" == "Apple-iWork" ]]; then + create_ambiguous_iwork_copies "$destination_path" "$destination_dir" "$final_basename" + fi + else + (( FAILED_COUNT++ )) + warn "failed to move $archive" + fi +} + +collect_archives() { + find "$SCAN_ROOT" \ + \( -path "$SCAN_ROOT/Salvaged" -o -path "$SCAN_ROOT/Salvaged/*" -o -path "$SCAN_ROOT/*.salvaged" -o -path "$SCAN_ROOT/*.salvaged/*" \) -prune \ + -o -type f \( -iname '*.zip' -o -iname '*.gz' \) -print +} + +collect_salvage_targets() { + local damaged_dir + + for damaged_dir in Damaged-Zip Damaged-Apple-iWork Damaged-Pages Damaged-Numbers; do + if [[ -d "$SCAN_ROOT/$damaged_dir" ]] && find "$SCAN_ROOT/$damaged_dir" -type f -iname '*.zip' -print -quit | grep -q .; then + print -r -- "$SCAN_ROOT/$damaged_dir" + fi + done +} + +run_salvage_workflow() { + local salvage_target="$1" + local salvage_output_root="$2" + local -a salvage_cmd + + [[ -x "$SALVAGE_SCRIPT_PATH" ]] || die "salvage script not found or not executable: $SALVAGE_SCRIPT_PATH" + + salvage_cmd=("$SALVAGE_SCRIPT_PATH") + if (( DRY_RUN )); then + salvage_cmd+=(--dry-run) + fi + if (( VERBOSE )); then + salvage_cmd+=(--verbose) + fi + salvage_cmd+=(--output "$salvage_output_root" "$salvage_target") + + log "Salvage $salvage_target -> $salvage_output_root" + "${salvage_cmd[@]}" || warn "salvage workflow failed for $salvage_target" +} + +parse_args() { + local arg + + while (( $# )); do + arg=$1 + case "$arg" in + -n|--dry-run) + DRY_RUN=1 + ;; + -v|--verbose) + VERBOSE=1 + ;; + --overwrite) + OVERWRITE=1 + ;; + --salvage-damaged) + SALVAGE_DAMAGED=1 + ;; + -h|--help) + usage + exit 0 + ;; + --) + shift + break + ;; + -*) + die "unknown option: $arg" + ;; + *) + if [[ -n "$SCAN_ROOT" ]]; then + die "only one directory may be provided" + fi + SCAN_ROOT=$arg + ;; + esac + shift + done + + if [[ -z "$SCAN_ROOT" && $# -gt 0 ]]; then + SCAN_ROOT=$1 + shift + fi + + [[ -n "$SCAN_ROOT" ]] || { + usage + exit 1 + } + + [[ -d "$SCAN_ROOT" ]] || die "directory does not exist: $SCAN_ROOT" + SCAN_ROOT=${SCAN_ROOT:A} +} + +main() { + local archive + local salvage_target + local salvage_output_root + local -a archives + local -a salvage_targets + + parse_args "$@" + ensure_tools + make_temp_root + if (( DRY_RUN )); then + ACTION_LABEL="Planned" + fi + + archives=(${(f)"$(collect_archives)"}) + salvage_targets=(${(f)"$(collect_salvage_targets)"}) + + if (( ${#archives} == 0 && (! SALVAGE_DAMAGED || ${#salvage_targets} == 0) )); then + log "No .zip or .gz files found under $SCAN_ROOT" + return 0 + fi + + if (( ${#archives} > 0 )); then + verbose "found ${#archives} candidate archives under $SCAN_ROOT" + + for archive in $archives; do + process_archive "$archive" + done + fi + + if (( SALVAGE_DAMAGED )); then + salvage_targets=(${(f)"$(collect_salvage_targets)"}) + if (( ${#salvage_targets} == 0 )); then + verbose "no damaged ZIP output folders found for salvage" + else + for salvage_target in $salvage_targets; do + salvage_output_root="$SCAN_ROOT/Salvaged/${salvage_target:t}" + run_salvage_workflow "$salvage_target" "$salvage_output_root" + done + fi + fi + + log "" + log "Summary" + log " Processed: $PROCESSED_COUNT" + log " Classified: $CLASSIFIED_COUNT" + log " Unknown: $UNKNOWN_COUNT" + log " ${ACTION_LABEL}: $RENAMED_COUNT" + log " Skipped: $SKIPPED_COUNT" + log " Failed: $FAILED_COUNT" +} + +main "$@" \ No newline at end of file diff --git a/salvage-damaged-zips.zsh b/salvage-damaged-zips.zsh new file mode 100755 index 0000000..0e02780 --- /dev/null +++ b/salvage-damaged-zips.zsh @@ -0,0 +1,299 @@ +#!/bin/zsh + +emulate -L zsh +setopt extended_glob no_nomatch no_unset pipefail + +typeset -gr SCRIPT_NAME=${0:t} +typeset -g INPUT_ROOT="" +typeset -g OUTPUT_ROOT="" +typeset -g DRY_RUN=0 +typeset -g VERBOSE=0 + +usage() { + cat <<'EOF' +Usage: salvage-damaged-zips.zsh [options] DIRECTORY + +Attempt repair and partial extraction for damaged ZIP files under DIRECTORY. + +Options: + -n, --dry-run Print planned actions without writing repaired files. + -v, --verbose Print extra diagnostics while processing. + -o, --output DIR Write results into DIR. Defaults to DIRECTORY.salvaged. + -h, --help Show this help text. +EOF +} + +log() { + print -r -- "$*" +} + +verbose() { + if (( VERBOSE )); then + print -r -- "$*" + fi +} + +die() { + print -u2 -r -- "error: $*" + exit 1 +} + +ensure_tools() { + local tool + for tool in zip unzip bsdtar file strings perl mktemp find; do + command -v "$tool" >/dev/null 2>&1 || die "required tool not found: $tool" + done +} + +trim_value() { + local value="$1" + value="${value//$'\r'/ }" + value="${value//$'\n'/ }" + value="${value//$'\t'/ }" + value=${value##[[:space:]]##} + value=${value%%[[:space:]]##} + print -r -- "$value" +} + +sanitize_name() { + local value="$1" + value=$(trim_value "$value") + value=${value//$'\0'/} + value=${value//\//-} + value=${value//:/-} + value=${value//\\/-} + value=$(print -r -- "$value" | tr -s ' ') + value=${value##.##} + value=${value%%[[:space:]]##} + value=${value##[[:space:]]##} + if [[ -z "$value" ]]; then + value="Untitled" + fi + print -r -- "$value" +} + +parse_args() { + local arg + + while (( $# )); do + arg=$1 + case "$arg" in + -n|--dry-run) + DRY_RUN=1 + ;; + -v|--verbose) + VERBOSE=1 + ;; + -o|--output) + shift + (( $# )) || die "missing argument for --output" + OUTPUT_ROOT=$1 + ;; + -h|--help) + usage + exit 0 + ;; + --) + shift + break + ;; + -*) + die "unknown option: $arg" + ;; + *) + if [[ -n "$INPUT_ROOT" ]]; then + die "only one directory may be provided" + fi + INPUT_ROOT=$arg + ;; + esac + shift + done + + [[ -n "$INPUT_ROOT" ]] || { + usage + exit 1 + } + + [[ -d "$INPUT_ROOT" ]] || die "directory does not exist: $INPUT_ROOT" + INPUT_ROOT=${INPUT_ROOT:A} + + if [[ -z "$OUTPUT_ROOT" ]]; then + OUTPUT_ROOT="${INPUT_ROOT}.salvaged" + fi + OUTPUT_ROOT=${OUTPUT_ROOT:A} +} + +collect_archives() { + find "$INPUT_ROOT" -type f -iname '*.zip' -print | sort +} + +archive_markers() { + strings -a "$1" 2>/dev/null +} + +text_has_marker() { + local text="$1" + local pattern="$2" + print -r -- "$text" | grep -E -q -- "$pattern" +} + +archive_has_binary_marker() { + local archive="$1" + local pattern="$2" + LC_ALL=C grep -aE -q -- "$pattern" "$archive" +} + +classify_marker_family() { + local archive="$1" + if archive_has_binary_marker "$archive" 'Index/Tables/'; then + print -r -- "Damaged-Numbers" + return 0 + fi + + if archive_has_binary_marker "$archive" 'Metadata/DocumentProperties\.plist|Pages/'; then + print -r -- "Damaged-Pages" + return 0 + fi + + if archive_has_binary_marker "$archive" 'Index/Document\.iwa' && archive_has_binary_marker "$archive" 'Index/CalculationEngine'; then + print -r -- "Damaged-Apple-iWork" + return 0 + fi + + print -r -- "Damaged-Zip" +} + +escape_md_cell() { + local value="$1" + value=${value//|/\\|} + print -r -- "$value" +} + +repair_archive() { + local source_archive="$1" + local repaired_archive="$2" + + zip -FF "$source_archive" --out "$repaired_archive" <<'EOF' >/dev/null 2>"${repaired_archive}.repair.log" +y +EOF +} + +extract_repaired_archive() { + local repaired_archive="$1" + local extract_dir="$2" + + mkdir -p -- "$extract_dir" || return 1 + bsdtar -xf "$repaired_archive" -C "$extract_dir" 2>"${extract_dir}.extract.log" +} + +write_report_header() { + local markdown_report="$1" + local tsv_report="$2" + + cat > "$markdown_report" < "$tsv_report" +} + +append_report_row() { + local markdown_report="$1" + local tsv_report="$2" + local archive_label="$3" + local family="$4" + local repaired_entries="$5" + local visible_assets="$6" + local notes="$7" + + print -r -- "| $(escape_md_cell "$archive_label") | $(escape_md_cell "$family") | $repaired_entries | $visible_assets | $(escape_md_cell "$notes") |" >> "$markdown_report" + print -r -- "$archive_label\t$family\t$repaired_entries\t$visible_assets\t$notes" >> "$tsv_report" +} + +main() { + local -a archives + local archive="" + local source_name="" + local base_name="" + local family="" + local family_dir="" + local repaired_archive="" + local extract_dir="" + local repaired_listing="" + local repaired_entries=0 + local visible_assets=0 + local notes="" + local markdown_report="" + local tsv_report="" + + parse_args "$@" + ensure_tools + + archives=(${(f)"$(collect_archives)"}) + if (( ${#archives} == 0 )); then + log "No .zip files found under $INPUT_ROOT" + return 0 + fi + + if (( DRY_RUN )); then + for archive in $archives; do + family=$(classify_marker_family "$archive") + log "DRY-RUN $archive => $family" + done + return 0 + fi + + mkdir -p -- "$OUTPUT_ROOT/repaired" "$OUTPUT_ROOT/extracted" "$OUTPUT_ROOT/logs" || die "failed to create output directories" + markdown_report="$OUTPUT_ROOT/salvage-report.md" + tsv_report="$OUTPUT_ROOT/salvage-report.tsv" + write_report_header "$markdown_report" "$tsv_report" + + for archive in $archives; do + source_name=${archive:t} + base_name=$(sanitize_name "${source_name:r}") + family=$(classify_marker_family "$archive") + family_dir="$OUTPUT_ROOT/extracted/$family/$base_name" + repaired_archive="$OUTPUT_ROOT/repaired/${base_name}.repaired.zip" + extract_dir="$family_dir" + notes="" + repaired_entries=0 + visible_assets=0 + + verbose "repairing $archive => $family" + if ! repair_archive "$archive" "$repaired_archive"; then + notes="zip -FF could not rebuild a readable archive" + append_report_row "$markdown_report" "$tsv_report" "$source_name" "$family" "$repaired_entries" "$visible_assets" "$notes" + continue + fi + + repaired_listing=$(unzip -Z1 "$repaired_archive" 2>/dev/null) + if [[ -n "$repaired_listing" ]]; then + repaired_entries=$(print -r -- "$repaired_listing" | sed '/^$/d' | wc -l | tr -d ' ') + fi + + extract_repaired_archive "$repaired_archive" "$extract_dir" || true + visible_assets=$(find "$extract_dir" -type f \( -iname '*.jpg' -o -iname '*.jpeg' -o -iname '*.png' -o -iname '*.tiff' -o -iname '*.tif' -o -iname '*.pdf' -o -iname '*.heic' \) 2>/dev/null | wc -l | tr -d ' ') + + if (( visible_assets > 0 )); then + notes="visible embedded assets recovered" + elif (( repaired_entries > 0 )); then + notes="internal iWork entries recovered" + else + notes="repair succeeded but no entries were listed" + fi + + append_report_row "$markdown_report" "$tsv_report" "$source_name" "$family" "$repaired_entries" "$visible_assets" "$notes" + done + + log "Wrote salvage output to $OUTPUT_ROOT" + log "Report: $markdown_report" +} + +main "$@" \ No newline at end of file diff --git a/tests/smoke-test.zsh b/tests/smoke-test.zsh new file mode 100755 index 0000000..4218a7c --- /dev/null +++ b/tests/smoke-test.zsh @@ -0,0 +1,55 @@ +#!/bin/zsh + +emulate -L zsh +setopt no_unset pipefail + +typeset -gr TEST_ROOT=${0:A:h:h} +typeset -gr FIXTURE_DIR="$TEST_ROOT/tmp-fixture" + +cleanup() { + rm -rf -- "$FIXTURE_DIR" +} + +trap cleanup EXIT INT TERM + +mkdir -p "$FIXTURE_DIR/docx-src/word" \ + "$FIXTURE_DIR/docx-src/docProps" \ + "$FIXTURE_DIR/iwork-src/Index" \ + "$FIXTURE_DIR/iwork-src/Metadata" \ + "$FIXTURE_DIR/numbers-src/Index/Tables" \ + "$FIXTURE_DIR/numbers-src/Metadata" \ + "$FIXTURE_DIR/pages-src/Metadata" \ + "$FIXTURE_DIR/pages-src/Pages" \ + "$FIXTURE_DIR/unknown-src" \ + "$FIXTURE_DIR/tar-src" + +printf '%s' '' > "$FIXTURE_DIR/docx-src/[Content_Types].xml" +printf '%s' 'Recovered Word Title' > "$FIXTURE_DIR/docx-src/docProps/core.xml" +printf '%s' 'stub' > "$FIXTURE_DIR/docx-src/word/document.xml" + +printf '%s' 'kMDItemTitleRecovered Pages Title' > "$FIXTURE_DIR/pages-src/Metadata/DocumentProperties.plist" +printf '%s' 'stub' > "$FIXTURE_DIR/pages-src/Pages/Document.iwa" + +printf '%s' 'isMultiPage' > "$FIXTURE_DIR/iwork-src/Metadata/Properties.plist" +printf '%s' 'stub' > "$FIXTURE_DIR/iwork-src/Index/Document.iwa" + +printf '%s' 'isMultiPage' > "$FIXTURE_DIR/numbers-src/Metadata/Properties.plist" +printf '%s' 'stub' > "$FIXTURE_DIR/numbers-src/Index/Document.iwa" +printf '%s' 'stub' > "$FIXTURE_DIR/numbers-src/Index/Tables/DataList.iwa" + +printf '%s' 'mystery' > "$FIXTURE_DIR/unknown-src/file.bin" +printf '%s' 'hello tar' > "$FIXTURE_DIR/tar-src/readme.txt" +printf '%s' '%PDF-1.4 +%%EOF +' > "$FIXTURE_DIR/sample.pdf" + +(cd "$FIXTURE_DIR/docx-src" && zip -qr ../lost-doc.zip .) +(cd "$FIXTURE_DIR/iwork-src" && zip -qr ../lost-iwork.zip .) +(cd "$FIXTURE_DIR/numbers-src" && zip -qr ../lost-numbers.zip .) +(cd "$FIXTURE_DIR/pages-src" && zip -qr ../lost-pages.zip .) +(cd "$FIXTURE_DIR/unknown-src" && zip -qr ../mystery.zip .) +tar -czf "$FIXTURE_DIR/archive.gz" -C "$FIXTURE_DIR/tar-src" . +gzip -c "$FIXTURE_DIR/sample.pdf" > "$FIXTURE_DIR/sample.gz" + +cd "$TEST_ROOT" +./classify-recovered-archives.zsh --dry-run --verbose "$FIXTURE_DIR" \ No newline at end of file