Firefly Import Preprocessor

Version: 1.0.0
Date: 03 May 2026
Status: Production Ready

🌐 Deutsch


Table of Contents

  1. Overview
  2. Installation & Setup
  3. Quick Start
  4. Configuration
  5. Transformation Types
  6. CLI Reference
  7. Debug Mode
  8. Firefly III Integration
  9. Architecture
  10. Error Handling

Overview

The Firefly Import Preprocessor is a production-ready PHP preprocessor for bank CSV export files. It transforms bank data into a standardised format and can optionally import it into Firefly III.

Core Features

Full CSV transformation with complex pipelines
Metadata extraction via regex (IBAN, currency, account name)
14 transformation types for flexible data processing
Firefly III integration — CLI, Docker, and HTTP upload
Debug mode for full processing transparency
Production ready with complete error handling
Zero dependencies for core functionality

Workflow

Input CSV
    ↓
Extract metadata (regex)
    ↓
Transform data rows (pipeline)
    ↓
Write output CSV
    ↓
[Optional] Import into Firefly III

Installation & Setup

Requirements

  • PHP 8.1+
  • Composer (recommended)
  • [Optional] Docker for Firefly III integration

Installation

# 1. Clone / copy the repository
cd ff-imp-preprocessor

# 2. Install dependencies (optional, dev tools only)
composer install

# 3. Create configuration
cp config/config.example.json config/config.json
# Edit config/config.json with your settings

# 4. Create directories
mkdir -p config/import/{source,output,archive,error}
chmod 755 config/import/{source,output,archive,error}

# 5. Run a test
php bin/transformer.php validate config/config.json input.csv

Quick Start

1. Adjust configuration

Edit config/config.json and make sure the extraction rules match your CSV format:

{
  "metadata": {
    "extractionRules": [
      {
        "name": "account_iban",
        "lineNumber": 2,
        "regex": "IBAN:\\s*([A-Z0-9 ]+)",
        "captureGroup": 1
      }
    ]
  },
  "csvStructure": {
    "headerLine": 5,
    "delimiter": ";",
    "encoding": "UTF-8"
  }
}

2. Validate CSV

php bin/transformer.php validate config/config.json input.csv

3. Run transformation

php bin/transformer.php transform input.csv config/config.json

# With debug mode for troubleshooting
php bin/transformer.php transform input.csv config/config.json --debug

4. Inspect output

php bin/transformer.php test input.csv config/config.json --debug
# Shows up to 10 transformed rows and debug logs

Configuration

config.json structure

metadata — Metadata extraction

{
  "metadata": {
    "extractionRules": [
      {
        "name": "account_iban",
        "lineNumber": 2,
        "regex": "IBAN:\\s*([A-Z0-9 ]+)",
        "captureGroup": 1
      },
      {
        "name": "currency_code",
        "lineNumber": 3,
        "regex": "Currency:\\s*([A-Z]{3})",
        "captureGroup": 1
      }
    ]
  }
}

| Field | Type | Description |
|---|---|---|
| name | string | Name of the metadata variable (used in constantvalue) |
| lineNumber | int | Line number in CSV (1-based, human-readable) |
| regex | string | Regex pattern for extraction (without delimiters) |
| captureGroup | int | Capture group index (0 = full match, 1 = first group, etc.) |

Regex example:

  • Pattern: IBAN:\s*([A-Z0-9 ]+)
  • Input: IBAN: CH93 0077 2020 6262 5252 7
  • Capture group 1: CH93 0077 2020 6262 5252 7
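
The rule above can be sketched in a few lines of PHP. This is only an illustration of how lineNumber, regex and captureGroup interact, not the project's MetadataExtractor; the function name extractMetadataValue and the "~" delimiter are assumptions made for the example.

```php
<?php
declare(strict_types=1);

/**
 * Apply one extraction rule to the raw lines of a file.
 * Hypothetical helper, not the project's MetadataExtractor.
 */
function extractMetadataValue(array $lines, array $rule): ?string
{
    // lineNumber is 1-based in the config, PHP arrays are 0-based.
    $line = $lines[$rule['lineNumber'] - 1] ?? null;
    if ($line === null) {
        return null;
    }
    // The config stores the pattern without delimiters; "~" is chosen here
    // because it does not occur in the sample patterns.
    $pattern = '~' . $rule['regex'] . '~u';
    if (preg_match($pattern, $line, $matches) !== 1) {
        return null;
    }
    return $matches[$rule['captureGroup']] ?? null;
}

$lines = ['Account statement', 'IBAN: CH93 0077 2020 6262 5252 7', 'Currency: CHF'];
$rule  = [
    'name'         => 'account_iban',
    'lineNumber'   => 2,
    'regex'        => 'IBAN:\s*([A-Z0-9 ]+)',
    'captureGroup' => 1,
];

echo extractMetadataValue($lines, $rule), PHP_EOL; // CH93 0077 2020 6262 5252 7
```

Returning null on a missing line or a failed match keeps "no value" distinguishable from an empty capture.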

csvStructure — CSV format

{
  "csvStructure": {
    "headerLine": 5,
    "delimiter": ";",
    "encoding": "UTF-8",
    "hasBom": false
  }
}

| Field | Type | Default | Description |
|---|---|---|---|
| headerLine | int | 5 | Line number of the header row (1-based) |
| delimiter | string | ; | CSV delimiter |
| encoding | string | UTF-8 | Character encoding (UTF-8, ISO-8859-1, CP1252) |
| hasBom | bool | false | Whether the file has a BOM (Byte Order Mark) |
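
To make the four fields concrete, here is a minimal sketch of a reader that honours them. It is not the project's CsvReader: parseCsvText is a hypothetical name, and splitting on line breaks is a simplification that would not handle quoted fields containing newlines.

```php
<?php
declare(strict_types=1);

/**
 * Parse raw CSV text according to a csvStructure block.
 * Hypothetical helper; returns rows as associative arrays keyed by header.
 */
function parseCsvText(string $raw, array $structure): array
{
    // Strip a UTF-8 BOM when the config says one is present.
    if (($structure['hasBom'] ?? false) && str_starts_with($raw, "\xEF\xBB\xBF")) {
        $raw = substr($raw, 3);
    }
    // Normalise to UTF-8 so later mb_* calls behave consistently.
    $encoding = $structure['encoding'] ?? 'UTF-8';
    if ($encoding !== 'UTF-8') {
        $raw = mb_convert_encoding($raw, 'UTF-8', $encoding);
    }
    $lines     = preg_split('/\r\n|\r|\n/', rtrim($raw));
    $delimiter = $structure['delimiter'] ?? ';';
    $headerIdx = ($structure['headerLine'] ?? 5) - 1; // 1-based in the config

    $header = str_getcsv($lines[$headerIdx], $delimiter, '"', '');
    $width  = count($header);
    $rows   = [];
    foreach (array_slice($lines, $headerIdx + 1) as $line) {
        if ($line === '') {
            continue; // ignore blank lines
        }
        // Pad or cut each record to the header width before combining.
        $fields = array_slice(array_pad(str_getcsv($line, $delimiter, '"', ''), $width, ''), 0, $width);
        $rows[] = array_combine($header, $fields);
    }
    return $rows;
}

$raw  = "Bank X\nIBAN: CH93\nCurrency: CHF\n\nBookingDate;BookingText;Amount\n10.12.2025;Coop Pronto;-12.50\n";
$rows = parseCsvText($raw, ['headerLine' => 5, 'delimiter' => ';', 'encoding' => 'UTF-8', 'hasBom' => false]);

echo $rows[0]['BookingText'], PHP_EOL; // Coop Pronto
```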

columnTransformations — Column transformations

{
  "columnTransformations": [
    {
      "sourceColumn": "BookingDate",
      "transformations": [
        {
          "type": "dateformat",
          "fromFormat": "d.m.Y",
          "toFormat": "Y-m-d"
        }
      ],
      "outputColumn": "date",
      "outputAction": "overwrite"
    }
  ]
}

outputAction:

| Value | Behaviour |
|---|---|
| overwrite | Replace the target column with the transformation result (default) |
| create | Write the result into a new output column |
| append | Concatenate the result to the end of the existing column value. Add "appendDelimiter": " " (any string) to insert a separator between the existing and new value; the delimiter is omitted when the target column is still empty |
| append-if-not-empty | Same as append (including optional appendDelimiter) but skips entirely when the transformation result is empty; safe for optional values such as tags or notes lines |
| append-line | Same as append but the separator is always a newline \n; no leading newline when the target is empty |
| overwrite-if-empty | Only write the result if the target column is currently empty |
| overwrite-if-not-empty | Only write the result if the transformation result is not empty |
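
The differences between the actions are easiest to see side by side. The following sketch restates the rules listed above as one PHP function; applyOutputAction is a hypothetical name, and the code is an interpretation of the documented behaviour rather than the project's implementation.

```php
<?php
declare(strict_types=1);

/**
 * Merge a transformation result into a row according to outputAction.
 * Hypothetical helper that restates the documented rules.
 */
function applyOutputAction(
    array $row,
    string $column,
    string $result,
    string $action = 'overwrite',
    string $appendDelimiter = ''
): array {
    $current = $row[$column] ?? '';
    // Join two values, leaving the separator out while the target is still empty.
    $join = fn (string $sep): string => $current === '' ? $result : $current . $sep . $result;

    $row[$column] = match ($action) {
        'overwrite', 'create'    => $result,
        'append'                 => $join($appendDelimiter),
        'append-if-not-empty'    => $result === '' ? $current : $join($appendDelimiter),
        'append-line'            => $join("\n"),
        'overwrite-if-empty'     => $current === '' ? $result : $current,
        'overwrite-if-not-empty' => $result === '' ? $current : $result,
        default                  => throw new InvalidArgumentException("Unknown outputAction: $action"),
    };
    return $row;
}

$row = ['notes' => 'Card payment'];
echo applyOutputAction($row, 'notes', 'Chur', 'append', ', ')['notes'], PHP_EOL; // Card payment, Chur
echo applyOutputAction($row, 'notes', '', 'append-if-not-empty', ', ')['notes'], PHP_EOL; // Card payment
```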

directories — File system

{
  "directories": {
    "source": "/opt/ff-imp-preprocessor/import/source",
    "output": "/opt/ff-imp-preprocessor/import/output",
    "archive": "/opt/ff-imp-preprocessor/import/archive",
    "error": "/opt/ff-imp-preprocessor/import/error"
  }
}

| Field | Description |
|---|---|
| source | Input directory |
| output | Output directory |
| archive | Archive for processed files |
| error | Error directory for invalid files |

fireflyImport — Firefly III integration

Optional. When present, passing --do-import to the transform command (or using auto-import) will call the Firefly III Data Importer after the output CSV is written.

See Firefly III Integration for the full field reference and mode-specific examples.


Transformation Types

There are 14 supported transformation types that can be combined as a pipeline:

1. trim — Remove whitespace

Removes leading and trailing whitespace.

{ "type": "trim" }
  • Input: "  Coop Pronto  " → Output: "Coop Pronto"

2. lowercase — Convert to lowercase

Converts to lowercase (UTF-8 safe).

{ "type": "lowercase" }
  • Input: COOP PRONTO CHUR → Output: coop pronto chur

3. uppercase — Convert to uppercase

Converts to uppercase (UTF-8 safe).

{ "type": "uppercase" }
  • Input: Coop Pronto Chur → Output: COOP PRONTO CHUR

4. ucwordsfirst — Capitalise after word separators

Capitalises the first letter after each word separator.

{ "type": "ucwordsfirst" }
  • COOP PRONTO CHUR → Coop Pronto Chur
  • migros-rail city → Migros-Rail City
  • O'NEILL STORE → O'Neill Store
  • SAINT-JEAN-DE-MAURIENNE → Saint-Jean-De-Maurienne

Separators: space, hyphen, apostrophe, slash, period, comma, semicolon, colon, parentheses.

Guard: If the input already contains both uppercase and lowercase letters (mixed-case), it is returned unchanged. This prevents accidentally re-casing intentionally formatted strings such as "Coop pronto chur". Fully uppercase or fully lowercase inputs are always processed.
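
A minimal sketch of the separator rule and the mixed-case guard, assuming the behaviour described above. The function name ucwordsFirst and the regular expressions are illustrative choices, not the project's code; whitespace in general is treated as a separator here, which is slightly broader than "space".

```php
<?php
declare(strict_types=1);

/**
 * Capitalise the first letter after each separator, with the mixed-case guard.
 * Hypothetical re-implementation for illustration only.
 */
function ucwordsFirst(string $value): string
{
    $hasUpper = preg_match('/\p{Lu}/u', $value) === 1;
    $hasLower = preg_match('/\p{Ll}/u', $value) === 1;
    if ($hasUpper && $hasLower) {
        return $value; // guard: mixed-case input is returned unchanged
    }
    $lower = mb_strtolower($value, 'UTF-8');
    // A letter starts a word at the beginning of the string or directly
    // after one of the listed separators.
    return preg_replace_callback(
        '~(^|[\s\-\'/.,;:()])(\p{L})~u',
        fn (array $m): string => $m[1] . mb_strtoupper($m[2], 'UTF-8'),
        $lower
    );
}

echo ucwordsFirst('COOP PRONTO CHUR'), PHP_EOL; // Coop Pronto Chur
echo ucwordsFirst("O'NEILL STORE"), PHP_EOL;    // O'Neill Store
echo ucwordsFirst('Coop pronto chur'), PHP_EOL; // Coop pronto chur (guard)
```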


5. replace — String replacement

Replaces a substring with another string (case-sensitive).

{ "type": "replace", "search": "  ", "replace": " " }
  • Input: Coop Pronto (two spaces) → Output: Coop Pronto (one space)

6. split — Split column

Splits a value at a delimiter and keeps a defined part.

{ "type": "split", "delimiter": ";", "part": 0 }
  • Input: Coop Pronto Chur;7007 Chur → Output: Coop Pronto Chur

7. regex — Regex replacement

Replaces parts of a string using a regular expression. Uses PHP preg_replace.

{ "type": "regex", "pattern": "^(.*?);.*$", "replace": "$1" }

No match → original value is passed through unchanged (pipeline-safe).

Use capture groups as $1, $2, … in the replace field. A pattern without ^/$ anchors replaces only the matched portion, not the whole value.


8. regexextract — Regex extraction

Extracts a capture group and returns only that. Uses PHP preg_match.

{ "type": "regexextract", "pattern": "(\\d{4,} [^;]+)" }
  • Input: Coop Pronto Chur, 7007 Chur → Output: 7007 Chur
  • No match → empty string

⚠ Not pipeline-safe: A no-match discards all previous pipeline results. Use regex instead if you want to preserve the current value on no-match.
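
The difference in no-match behaviour between regex and regexextract follows directly from preg_replace and preg_match. The two helper functions below are hypothetical and use "~" as delimiter; they only demonstrate the contrast. Falling back to the full match when the pattern has no capture group is an assumption.

```php
<?php
declare(strict_types=1);

// regex step: preg_replace returns the subject unchanged when nothing matches.
function regexReplaceStep(string $value, string $pattern, string $replace): string
{
    return preg_replace('~' . $pattern . '~u', $replace, $value) ?? $value;
}

// regexextract step: first capture group (or full match), '' when nothing matches.
function regexExtractStep(string $value, string $pattern): string
{
    return preg_match('~' . $pattern . '~u', $value, $m) === 1 ? ($m[1] ?? $m[0]) : '';
}

echo regexReplaceStep('Coop Pronto Chur;7007 Chur', '^(.*?);.*$', '$1'), PHP_EOL; // Coop Pronto Chur
echo regexReplaceStep('Coop Pronto Chur', '^(.*?);.*$', '$1'), PHP_EOL;           // Coop Pronto Chur (no match, passed through)
echo regexExtractStep('Coop Pronto Chur, 7007 Chur', '(\d{4,} [^;]+)'), PHP_EOL;  // 7007 Chur
var_dump(regexExtractStep('Coop Pronto Chur', '(\d{4,} [^;]+)'));                // string(0) ""
```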


9. dateformat — Date reformatting

Converts between date formats.

{ "type": "dateformat", "fromFormat": "d.m.Y", "toFormat": "Y-m-d" }
  • Input: 10.12.2025 → Output: 2025-12-10

Supports all PHP DateTime format characters.
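
For reference, a conversion of this kind can be sketched with DateTimeImmutable::createFromFormat. The helper name reformatDate and the empty-string result for unparseable or invalid dates are assumptions; the README does not state what the transformer does in that case.

```php
<?php
declare(strict_types=1);

/**
 * Reformat a date string; returns '' when the input does not parse cleanly.
 * The empty-string fallback is an assumption made for this sketch.
 */
function reformatDate(string $value, string $fromFormat, string $toFormat): string
{
    // The leading "!" resets fields missing from the format to zero instead of "now".
    $date   = DateTimeImmutable::createFromFormat('!' . $fromFormat, $value);
    $errors = DateTimeImmutable::getLastErrors(); // array, or false on PHP 8.2+ when clean
    $dirty  = $errors !== false && ($errors['warning_count'] > 0 || $errors['error_count'] > 0);
    if ($date === false || $dirty) {
        return '';
    }
    return $date->format($toFormat);
}

echo reformatDate('10.12.2025', 'd.m.Y', 'Y-m-d'), PHP_EOL; // 2025-12-10
var_dump(reformatDate('31.02.2025', 'd.m.Y', 'Y-m-d'));     // string(0) "" (invalid calendar date)
```

Checking getLastErrors matters because PHP otherwise rolls an impossible date such as 31 February over into March without failing.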


10. truncate — Truncate string

Truncates a string to a maximum length.

{ "type": "truncate", "maxLength": 100 }

11. constantvalue — Constant value from metadata

Injects an extracted metadata value as a constant for every row.

{
  "sourceColumn": "_constant_",
  "transformations": [
    { "type": "constantvalue", "metadataKey": "account_iban" }
  ],
  "outputColumn": "account_iban",
  "outputAction": "create"
}
  • Every row receives the extracted account_iban value (e.g. CH9300777222666888999) in a new column.

12. map — Copy / rename column

Copies a column value as-is (optionally to a new name).

{ "type": "map" }

13. pipeline — Nested pipeline

Runs a sub-pipeline as a single transformation step.

{
  "type": "pipeline",
  "steps": [
    { "type": "trim" },
    { "type": "lowercase" },
    { "type": "ucwordsfirst" }
  ]
}

Useful for grouping steps as a logical unit within a transformations array.


14. timeperiod — Map time to a period label

Parses a time string and returns the label of the matching period range. Supports midnight-spanning ranges (e.g. 22:00–03:59). Returns default (empty string by default) when no range matches or the input is invalid.

{
  "type": "timeperiod",
  "timeFormat": "H:i:s",
  "periods": [
    { "from": "04:00:00", "to": "08:59:59", "label": "Morgen" },
    { "from": "09:00:00", "to": "10:59:59", "label": "Vormittag" },
    { "from": "11:00:00", "to": "13:59:59", "label": "Mittag" },
    { "from": "14:00:00", "to": "17:59:59", "label": "Nachmittag" },
    { "from": "18:00:00", "to": "21:59:59", "label": "Abend" },
    { "from": "22:00:00", "to": "03:59:59", "label": "Nacht" }
  ],
  "default": ""
}
  • "09:30:00" → "Vormittag"
  • "23:00:00" → "Nacht" (midnight-spanning range)
  • "02:00:00" → "Nacht" (midnight-spanning range)
  • "" or unparseable input → ""

timeFormat follows PHP's DateTime::createFromFormat syntax (default H:i:s).
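
The midnight-spanning case is the only subtle part: when from is later than to, a time matches if it lies after from or before to. A minimal sketch under that assumption follows; timePeriodLabel is a hypothetical name and the comparison on seconds since midnight is one possible approach, not necessarily the one the transformer uses.

```php
<?php
declare(strict_types=1);

/**
 * Return the label of the first period containing $time.
 * Hypothetical sketch; compares seconds since midnight.
 */
function timePeriodLabel(string $time, array $periods, string $timeFormat = 'H:i:s', string $default = ''): string
{
    $toSeconds = static function (string $value) use ($timeFormat): ?int {
        $parsed = DateTimeImmutable::createFromFormat('!' . $timeFormat, $value);
        if ($parsed === false) {
            return null;
        }
        return (int) $parsed->format('G') * 3600
            + (int) $parsed->format('i') * 60
            + (int) $parsed->format('s');
    };

    $t = $toSeconds($time);
    if ($t === null) {
        return $default; // empty or unparseable input
    }
    foreach ($periods as $period) {
        $from = $toSeconds($period['from']);
        $to   = $toSeconds($period['to']);
        if ($from === null || $to === null) {
            continue;
        }
        $inRange = $from <= $to
            ? ($t >= $from && $t <= $to)
            : ($t >= $from || $t <= $to); // range wraps past midnight
        if ($inRange) {
            return $period['label'];
        }
    }
    return $default;
}

$periods = [
    ['from' => '09:00:00', 'to' => '10:59:59', 'label' => 'Vormittag'],
    ['from' => '22:00:00', 'to' => '03:59:59', 'label' => 'Nacht'],
];
echo timePeriodLabel('09:30:00', $periods), PHP_EOL; // Vormittag
echo timePeriodLabel('02:00:00', $periods), PHP_EOL; // Nacht
```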


Row filtering — skipIf

Rows can be excluded from the output by adding a top-level skipIf key to the configuration. The value is a filter node — either a bare condition or a nested and/or group.

Bare condition:

"skipIf": { "column": "Buchungstext", "operator": "equals", "value": "Saldovortrag" }

AND group:

"skipIf": {
  "and": [
    { "column": "Beschreibung1", "operator": "empty" },
    { "column": "Beschreibung2", "operator": "empty" }
  ]
}

Nested AND / OR:

"skipIf": {
  "or": [
    { "column": "Amount", "operator": "gt", "value": "10000" },
    {
      "and": [
        { "column": "Type", "operator": "equals", "value": "Saldo" },
        { "column": "Notes", "operator": "empty" }
      ]
    }
  ]
}

Supported operators:

| Operator | Matches when… |
|---|---|
| empty | column value is empty string |
| not-empty | column value is not empty |
| equals | column value equals "value" |
| not-equals | column value does not equal "value" |
| contains | column value contains "value" |
| not-contains | column value does not contain "value" |
| matches | column value matches regex "pattern" |
| not-matches | column value does not match regex "pattern" |
| gt | (float) column > (float) value |
| gte | (float) column >= (float) value |
| lt | (float) column < (float) value |
| lte | (float) column <= (float) value |
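
Because a filter node is either a condition or an and/or group of further nodes, it can be evaluated recursively. The sketch below assumes exactly the node shapes and operators shown above; shouldSkip is a hypothetical name and the "~" regex delimiter is an illustrative choice.

```php
<?php
declare(strict_types=1);

/**
 * Evaluate a skipIf filter node against one row.
 * Hypothetical evaluator covering the operators listed above.
 */
function shouldSkip(array $node, array $row): bool
{
    if (isset($node['and'])) {
        foreach ($node['and'] as $child) {
            if (!shouldSkip($child, $row)) {
                return false;
            }
        }
        return true;
    }
    if (isset($node['or'])) {
        foreach ($node['or'] as $child) {
            if (shouldSkip($child, $row)) {
                return true;
            }
        }
        return false;
    }

    $actual = (string) ($row[$node['column']] ?? '');
    $value  = (string) ($node['value'] ?? '');
    $regex  = '~' . ($node['pattern'] ?? '') . '~u';

    return match ($node['operator']) {
        'empty'        => $actual === '',
        'not-empty'    => $actual !== '',
        'equals'       => $actual === $value,
        'not-equals'   => $actual !== $value,
        'contains'     => str_contains($actual, $value),
        'not-contains' => !str_contains($actual, $value),
        'matches'      => preg_match($regex, $actual) === 1,
        'not-matches'  => preg_match($regex, $actual) !== 1,
        'gt'           => (float) $actual > (float) $value,
        'gte'          => (float) $actual >= (float) $value,
        'lt'           => (float) $actual < (float) $value,
        'lte'          => (float) $actual <= (float) $value,
        default        => throw new InvalidArgumentException('Unknown operator: ' . $node['operator']),
    };
}

$filter = ['or' => [
    ['column' => 'Amount', 'operator' => 'gt', 'value' => '10000'],
    ['and' => [
        ['column' => 'Type', 'operator' => 'equals', 'value' => 'Saldo'],
        ['column' => 'Notes', 'operator' => 'empty'],
    ]],
]];

var_dump(shouldSkip($filter, ['Amount' => '50', 'Type' => 'Saldo', 'Notes' => '']));    // bool(true)
var_dump(shouldSkip($filter, ['Amount' => '50', 'Type' => 'Payment', 'Notes' => '']));  // bool(false)
```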

Pipeline example

Multiple transformations chained:

{
  "sourceColumn": "BookingText",
  "transformations": [
    { "type": "trim" },
    { "type": "replace", "search": "  ", "replace": " " },
    { "type": "lowercase" },
    { "type": "ucwordsfirst" }
  ],
  "outputColumn": "description",
  "outputAction": "overwrite"
}

Processing:

  1. "  COOP  PRONTO  " → trim → "COOP  PRONTO"
  2. "COOP  PRONTO" → replace → "COOP PRONTO"
  3. "COOP PRONTO" → lowercase → "coop pronto"
  4. "coop pronto" → ucwordsfirst → "Coop Pronto"
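
Conceptually, a pipeline is a left-to-right fold over the step list, with each step receiving the previous step's output. The sketch below implements only a handful of step types to show that idea, including the nested pipeline type; runPipeline is a hypothetical name and this is not the project's ColumnTransformer.

```php
<?php
declare(strict_types=1);

/**
 * Run a list of step definitions over one value, left to right.
 * Hypothetical sketch; only a few step types are implemented.
 */
function runPipeline(string $value, array $steps): string
{
    foreach ($steps as $step) {
        $value = match ($step['type']) {
            'trim'      => trim($value),
            'lowercase' => mb_strtolower($value, 'UTF-8'),
            'uppercase' => mb_strtoupper($value, 'UTF-8'),
            'replace'   => str_replace($step['search'], $step['replace'], $value),
            'pipeline'  => runPipeline($value, $step['steps']), // nested sub-pipeline
            default     => throw new InvalidArgumentException('Unsupported step: ' . $step['type']),
        };
    }
    return $value;
}

$steps = [
    ['type' => 'trim'],
    ['type' => 'replace', 'search' => '  ', 'replace' => ' '],
    ['type' => 'lowercase'],
];

echo runPipeline('  COOP  PRONTO  ', $steps), PHP_EOL; // coop pronto
```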

CLI Reference

php bin/transformer.php <command> [input] [config] [options]

Commands

| Command | Description |
|---|---|
| test | Test run (up to 10 rows) |
| transform | Full transformation |
| validate | Validate configuration |
| auto-import | Directory monitoring |
| help | Show help |

Options

| Option | Description |
|---|---|
| --debug, -d | Enable debug mode |
| --rows=N | Max. N rows (test command) |
| --output=FILE, -o | Output path |
| --do-import | Import into Firefly III after transformation (transform only) |
| --strict | Strict validation |
| --watch | Continuous monitoring |
| --interval=SEC | Check interval in seconds (default: 60) |
| --dry-run | Simulation mode, no real operations |

Debug Mode

php bin/transformer.php test input.csv config/config.json --debug

Log categories

| Category | When |
|---|---|
| transformer | Start/end of transformation |
| csv_reader | While reading CSV |
| metadata | During metadata extraction |
| metadata_warning | On extraction problems |
| transformation | For each transformation step |
| csv_writer | While writing output CSV |

Debug log output (JSON)

{
  "success": true,
  "debug_logs": [
    {
      "timestamp": 1702200120.5432,
      "category": "transformer",
      "message": "Transformation started",
      "data": { "inputFile": "input.csv", "maxRows": 0 }
    },
    {
      "timestamp": 1702200120.5445,
      "category": "metadata",
      "message": "Extraction rule applied",
      "data": { "rule_name": "account_iban", "value": "CH93..." }
    }
  ]
}

Firefly III Integration

The transformer can automatically import transformed files into Firefly III. Three operating modes cover all typical deployment scenarios.

Prerequisites (all modes)

1. Create a Firefly III Data Importer JSON configuration file

This file maps transformed CSV columns to Firefly III transaction fields (format v3).

Recommended approach: upload a sample CSV once in the Firefly III Data Importer Web UI, configure the column mapping there, then download the finished configuration. Alternatively, use config/firefly-import-config.example.json as a template and adjust default_account to your asset account ID.

2. Choose an operating mode — see sections below.


fireflyImport field reference

| Field | Type | Description |
|---|---|---|
| mode | string | Operating mode: cli, docker, or http (default: cli) |
| jsonConfig | string | Path to the Firefly III Data Importer JSON config file (format v3). For cli and http modes the file must exist locally; relative paths are resolved from the working directory where php bin/transformer.php is invoked (typically the project root). For docker mode the path is inside the container; local existence is not checked. |
| importerCommand | string | Full CLI command (modes: cli, docker) |
| importerUrl | string | URL of the Data Importer (mode: http) |
| personalSecret | string | The AUTO_IMPORT_SECRET set on the importer server (min. 16 chars). Sent as ?secret= URL query parameter. (mode: http) |
| accessToken | string | Firefly III Personal Access Token. Sent as Authorization: Bearer header. Required if not already set as FIREFLY_III_ACCESS_TOKEN in the importer environment. (mode: http) |
| deleteAfterImport | boolean | Delete transformed CSV after successful import |
| chunkSize | integer | Split the CSV into chunks of at most N data rows and import each chunk as a separate request. Prevents server-side timeouts on large files (rule of thumb: roughly 3 to 4 s per transaction for HTTP mode). 0 or absent = no chunking (default). Applies to all modes. |
| chunkRetries | integer | Number of additional import attempts per chunk after the first. On failure the importer retries up to this many times before aborting. 0 or absent = no retry (default). Only effective when chunkSize > 0. |
| chunkRetryDelay | integer | Pause in seconds before each chunk request after the first, and between retry attempts for the same failed chunk. Addresses both inter-chunk cooldown and retry back-off. 0 or absent = no pause (default). Only effective when chunkSize > 0. |
| connectionTimeout | integer | Maximum seconds to wait for the TCP connection to the importer to be established. Distinct from timeout (full transfer duration). Default: 10. (mode: http only) |
| timeout | integer | Timeout in seconds per request (default: 300). For chunked imports this applies per chunk, not for the total run. |
| environment | object | Additional environment variables (modes: cli, docker) |
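
How chunkSize, chunkRetries and chunkRetryDelay interact can be sketched as a small loop. This is an interpretation of the field descriptions above, not the project's FireflyImporter; importInChunks and the $importChunk callable (returning true on success) are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Import rows in chunks, retrying each failed chunk.
 * $importChunk is a hypothetical callable: fn(array $chunk, int $index): bool.
 */
function importInChunks(
    array $rows,
    callable $importChunk,
    int $chunkSize = 0,
    int $chunkRetries = 0,
    int $chunkRetryDelay = 0
): bool {
    // chunkSize 0 means "no chunking": everything goes out as one request.
    $chunks      = $chunkSize > 0 ? array_chunk($rows, $chunkSize) : [$rows];
    $maxAttempts = 1 + ($chunkSize > 0 ? $chunkRetries : 0);

    foreach ($chunks as $index => $chunk) {
        // Pause before every chunk request after the first one.
        if ($index > 0 && $chunkRetryDelay > 0) {
            sleep($chunkRetryDelay);
        }
        for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
            if ($importChunk($chunk, $index)) {
                continue 2; // chunk succeeded, move on to the next one
            }
            // Back off before retrying the same chunk.
            if ($attempt < $maxAttempts && $chunkRetryDelay > 0) {
                sleep($chunkRetryDelay);
            }
        }
        return false; // every attempt for this chunk failed: abort the run
    }
    return true;
}

$calls = 0;
$flaky = function (array $chunk, int $index) use (&$calls): bool {
    $calls++;
    return !($index === 1 && $calls === 2); // fail the first attempt of the second chunk
};

var_dump(importInChunks(range(1, 120), $flaky, 50, 1, 0)); // bool(true)
echo $calls, PHP_EOL;                                      // 4
```

With 120 rows and chunkSize 50 there are three chunks; the simulated failure costs one extra attempt, hence four calls.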

Mode cli — Transformer and Firefly on the same server

Both the transformer and the Firefly III Data Importer run on the same server. The transformer calls the importer directly as a local command.

"fireflyImport": {
  "mode": "cli",
  "jsonConfig": "/opt/firefly-data-importer/storage/configurations/ubs-import.json",
  "importerCommand": "php /opt/firefly-data-importer/artisan importer:import",
  "deleteAfterImport": false,
  "chunkSize": 50,
  "chunkRetries": 3,
  "chunkRetryDelay": 10,
  "timeout": 300,
  "environment": {
    "FIREFLY_III_URL": "https://localhost",
    "FIREFLY_III_ACCESS_TOKEN": "your-token-here"
  }
}

Mode docker — Transformer local, Firefly in Docker

The transformer runs locally or in its own container; the Firefly III Data Importer runs in a Docker container. The transformer calls the importer via docker exec.

Important: The transformer's output directory must be mounted as a volume in the importer container. jsonConfig is the path inside the container (not a local path). Do not use the -it flag (no TTY).

Example docker-compose.yml for the importer:

services:
  firefly-importer:
    image: fireflyiii/data-importer:latest
    volumes:
      - /opt/ff-imp-preprocessor/import:/import
    environment:
      - FIREFLY_III_URL=https://your-firefly.com
      - FIREFLY_III_ACCESS_TOKEN=your-token-here
      - CAN_POST_FILES=false

Matching fireflyImport block in config.json:

"fireflyImport": {
  "mode": "docker",
  "jsonConfig": "/import/configs/ubs-import.json",
  "importerCommand": "docker exec firefly-importer php artisan importer:import",
  "deleteAfterImport": false,
  "chunkSize": 50,
  "chunkRetries": 3,
  "chunkRetryDelay": 10,
  "timeout": 300
}

The JSON config file must be available inside the container — either via a volume mount or docker cp:

docker cp ubs-import.json firefly-importer:/import/configs/ubs-import.json

Mode http — Transformer local, Firefly importer on a remote server

The transformer runs locally; the Firefly III Data Importer is reachable over HTTP(S). The CSV and JSON configuration are sent as a multipart HTTP upload to the importer.

Requirements on the importer server:

CAN_POST_FILES=true
AUTO_IMPORT_SECRET=<secret>  # at least 16 characters — set this as personalSecret in your config

Local requirement: PHP extension ext-curl

"fireflyImport": {
  "mode": "http",
  "importerUrl": "https://importer.your-server.com",
  "personalSecret": "your-auto-import-secret-min-16-chars",
  "accessToken": "your-firefly-iii-personal-access-token",
  "jsonConfig": "config/ubs-import.json",
  "deleteAfterImport": false,
  "chunkSize": 50,
  "chunkRetries": 3,
  "chunkRetryDelay": 10,
  "connectionTimeout": 10,
  "timeout": 300
}

The transformer sends a POST request to {importerUrl}/autoupload?secret={personalSecret} with the CSV and JSON config as multipart form fields. The accessToken is sent as Authorization: Bearer. If FIREFLY_III_ACCESS_TOKEN is already set in the importer's environment, accessToken can be omitted.
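
The request described above could look roughly like this with ext-curl. Treat it as a sketch: the function names are hypothetical, and the multipart field names "importable" and "json" are an assumption that should be checked against the Data Importer documentation for your version.

```php
<?php
declare(strict_types=1);

// Build the endpoint URL; http_build_query URL-encodes the secret.
function autoUploadUrl(string $importerUrl, string $personalSecret): string
{
    return rtrim($importerUrl, '/') . '/autoupload?' . http_build_query(['secret' => $personalSecret]);
}

/**
 * POST the CSV and the importer JSON config as multipart form fields.
 * Returns the HTTP status code. Field names are assumptions (see lead-in).
 */
function uploadToImporter(array $cfg, string $csvPath): int
{
    $headers = [];
    if (!empty($cfg['accessToken'])) {
        $headers[] = 'Authorization: Bearer ' . $cfg['accessToken'];
    }

    $ch = curl_init(autoUploadUrl($cfg['importerUrl'], $cfg['personalSecret']));
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => [
            'importable' => new CURLFile($csvPath, 'text/csv'),
            'json'       => new CURLFile($cfg['jsonConfig'], 'application/json'),
        ],
        CURLOPT_HTTPHEADER     => $headers,
        CURLOPT_CONNECTTIMEOUT => $cfg['connectionTimeout'] ?? 10,
        CURLOPT_TIMEOUT        => $cfg['timeout'] ?? 300,
        CURLOPT_RETURNTRANSFER => true,
    ]);

    if (curl_exec($ch) === false) {
        throw new RuntimeException('Upload failed: ' . curl_error($ch));
    }
    return (int) curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
}

echo autoUploadUrl('https://importer.your-server.com/', 'your-auto-import-secret'), PHP_EOL;
// https://importer.your-server.com/autoupload?secret=your-auto-import-secret
```

Passing an array to CURLOPT_POSTFIELDS makes curl send multipart/form-data, which is what file uploads via CURLFile require.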


Server-side tuning

For large imports the bottleneck is usually the Firefly III Data Importer server, not the transformer. The settings below belong in the importer's environment (.env or docker-compose.yml):

| Setting | Recommended value | Notes |
|---|---|---|
| PHP_MEMORY_LIMIT | 512M to 2048M | Docker env var. Raise when PHP crashes with "Allowed memory size exhausted". |
| CONNECTION_TIMEOUT | 60 | Seconds to wait for TCP connect to Firefly III API. Default is ~31 s (π × 10). |
| IGNORE_DUPLICATE_ERRORS | true | Suppress duplicate-transaction warnings on repeated imports. |

nginx reverse proxy (if applicable):

proxy_read_timeout  600s;   # must exceed the longest single-chunk import time
client_max_body_size 64M;   # must accommodate your largest chunk CSV

Docker Compose example:

services:
  firefly-importer:
    environment:
      - PHP_MEMORY_LIMIT=1024M
      - CONNECTION_TIMEOUT=60
      - IGNORE_DUPLICATE_ERRORS=true

Usage

# Transform only (no import)
php bin/transformer.php transform input.csv config/config.json

# Transform and import into Firefly III
php bin/transformer.php transform input.csv config/config.json --do-import

# Watch mode: transform and import automatically for each new CSV in source directory
php bin/transformer.php auto-import config/config.json --watch

Architecture

Components

bin/transformer.php (CLI entry point)
  ↓
TransformerEngine (orchestration)
  ├─ ConfigurationLoader (load / validate config)
  ├─ CsvReader (read CSV)
  ├─ MetadataExtractor (metadata via regex)
  ├─ ColumnTransformer (apply transformations)
  ├─ CsvWriter (write CSV)
  ├─ FireflyImporter (Firefly III integration)
  └─ DebugLogger (debug logs)

Data flow

Input CSV
  ↓
CsvReader::readMetadataLines() → array of lines
  ↓
MetadataExtractor::extract() → {iban: "...", currency: "..."}
  ↓
CsvReader::readCsvData() → array of rows
  ↓
ColumnTransformer::transformRow() → transformed row (pipeline)
  ↓
CsvWriter::write() → output CSV

Classes

| Class | Responsibility |
|---|---|
| TransformerEngine | Orchestrates the entire workflow |
| ConfigurationLoader | Loads and validates JSON configuration |
| CsvReader | Reads CSV with metadata support |
| MetadataExtractor | Extracts metadata via regex |
| ColumnTransformer | Transforms columns (pipeline) |
| CsvWriter | Writes output CSV |
| FireflyImporter | Imports into Firefly III |
| DebugLogger | Static logger for debug output |

Error Handling

Common errors

"Input file not found"

# Check the file path
ls -la input.csv

# Use an absolute path if relative paths do not work
php bin/transformer.php transform /absolute/path/input.csv config.json

"Missing metadata: account_iban"

The IBAN could not be extracted — wrong regex or wrong line number.

# Inspect the first lines of the CSV
head -5 input.csv

# Validate with debug output
php bin/transformer.php validate config.json input.csv --debug

"Invalid JSON: …"

Syntax error in config.json.

# Prints "No error" for valid JSON, otherwise the parser's error message
php -r "json_decode(file_get_contents('config/config.json'), true); echo json_last_error_msg(), PHP_EOL;"

"Configuration: 'csvStructure.headerLine' required"

A required configuration field is missing.

diff config/config.json config/config.example.json

Exception handling

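try {
    $result = $engine->transform($inputFile);
    if (!$result['success']) {
        // Failure reported by the engine in its result array
        fwrite(STDERR, 'Error: ' . $result['error'] . PHP_EOL);
    }
} catch (\Throwable $e) {
    // \Throwable also covers PHP Error subclasses that \Exception would miss
    fwrite(STDERR, 'Fatal error: ' . $e->getMessage() . PHP_EOL);
}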

Tips

UTF-8 handling

The transformer uses UTF-8 safe functions throughout:

  • mb_strtolower() instead of strtolower()
  • mb_strtoupper() instead of strtoupper()
  • mb_strlen() for correct character counting

Supported encodings: UTF-8, ISO-8859-1, CP1252.
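
The reason for preferring the mb_* functions shows up as soon as a value contains non-ASCII characters. A short demonstration, assuming the source file itself is saved as UTF-8; truncateChars is a hypothetical helper illustrating a character-safe truncate.

```php
<?php
declare(strict_types=1);

// Character-safe truncation: counts characters, not bytes.
function truncateChars(string $value, int $maxLength): string
{
    return mb_substr($value, 0, $maxLength, 'UTF-8');
}

$text = 'ZÜRICH CAFÉ';

echo strlen($text), PHP_EOL;                 // 13 (bytes: Ü and É take two bytes each)
echo mb_strlen($text, 'UTF-8'), PHP_EOL;     // 11 (characters)
echo mb_strtolower($text, 'UTF-8'), PHP_EOL; // zürich café
echo truncateChars($text, 2), PHP_EOL;       // ZÜ

// Converting a single ISO-8859-1 byte (0xFC) yields the two-byte UTF-8 "ü".
echo mb_convert_encoding("\xFC", 'UTF-8', 'ISO-8859-1'), PHP_EOL; // ü
```

A byte-based substr($text, 0, 2) would cut the "Ü" in half and produce invalid UTF-8.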

Regex tips

Pattern without delimiters (auto-wrapped):

"pattern": "IBAN:\\s*([A-Z0-9 ]+)"
// becomes: /IBAN:\s*([A-Z0-9 ]+)/u

With explicit flags:

"pattern": "/IBAN:\\s*([A-Z0-9 ]+)/iu"
// case-insensitive
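
Auto-wrapping of this kind can be sketched as follows. The detection heuristic (a pattern that starts with "/" and ends with "/" plus optional flag letters is treated as already delimited) and the name normalizePattern are assumptions; the transformer's actual rule may differ.

```php
<?php
declare(strict_types=1);

/**
 * Wrap a bare pattern in delimiters; leave an already delimited pattern alone.
 * Hypothetical helper; the detection heuristic is an assumption.
 */
function normalizePattern(string $pattern): string
{
    // Already of the form /.../flags? Then keep it as written.
    if (preg_match('~^/.*/[a-zA-Z]*$~s', $pattern) === 1) {
        return $pattern;
    }
    // Escape every forward slash exactly once so it cannot end the pattern early.
    $escaped = str_replace(['\/', '/'], ['/', '\/'], $pattern);
    return '/' . $escaped . '/u';
}

echo normalizePattern('IBAN:\s*([A-Z0-9 ]+)'), PHP_EOL;     // /IBAN:\s*([A-Z0-9 ]+)/u
echo normalizePattern('/IBAN:\s*([A-Z0-9 ]+)/iu'), PHP_EOL; // /IBAN:\s*([A-Z0-9 ]+)/iu

preg_match(normalizePattern('/iban:\s*([A-Z0-9 ]+)/iu'), 'IBAN: CH93', $m);
echo $m[1], PHP_EOL; // CH93
```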

Performance

  • Optimised for: up to 1 million rows
  • Typical file size: 10–100 k rows

Batch processing

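#!/bin/bash
# Transform every CSV in the source directory; archive successes, move failures aside.
shopt -s nullglob   # skip the loop entirely when no CSV files are present
for file in import/source/*.csv; do
    # Branch directly on the exit status of the transformer
    if php bin/transformer.php transform "$file" config/config.json; then
        mv -- "$file" import/archive/
    else
        mv -- "$file" import/error/
    fi
done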

Version History

v1.0.0 (03 May 2026)

  • Initial release
  • 14 transformation types
  • Metadata extraction via regex
  • Debug mode
  • Firefly III integration (cli / docker / http)
  • Full documentation

License: GPL-3.0
Author: PHP CSV Transformer Project
Repository: git.andare.ch/david.reindl/ff-imp-preprocessor