# Firefly Import Preprocessor
**Version:** 1.0.0
**Date:** 03 May 2026
**Status:** Production Ready
🌐 [Deutsch](README.de.md)
---
## Table of Contents
1. [Overview](#overview)
2. [Installation & Setup](#installation--setup)
3. [Quick Start](#quick-start)
4. [Configuration](#configuration)
5. [Transformation Types](#transformation-types)
6. [CLI Reference](#cli-reference)
7. [Debug Mode](#debug-mode)
8. [Firefly III Integration](#firefly-iii-integration)
9. [Architecture](#architecture)
10. [Error Handling](#error-handling)
---
## Overview
The **Firefly Import Preprocessor** is a production-ready PHP preprocessor for bank CSV export files. It transforms bank data into a standardised format and can optionally import it into Firefly III.
### Core Features
- **Full CSV transformation** with complex pipelines
- **Metadata extraction** via regex (IBAN, currency, account name)
- **13 transformation types** for flexible data processing
- **Firefly III integration** via CLI, Docker, and HTTP upload
- **Debug mode** for full processing transparency
- **Production ready** with complete error handling
- **Zero dependencies** for core functionality
### Workflow
```text
Input CSV
    ↓
Extract metadata (regex)
    ↓
Transform data rows (pipeline)
    ↓
Write output CSV
    ↓
[Optional] Import into Firefly III
```
---
## Installation & Setup
### Requirements
- PHP 8.1+
- Composer (recommended)
- [Optional] Docker for Firefly III integration
### Installation
```bash
# 1. Clone the repository (or copy it) and change into it
git clone https://git.andare.ch/david.reindl/ff-imp-preprocessor
cd ff-imp-preprocessor
# 2. Install dependencies (optional, dev tools only)
composer install
# 3. Create configuration
cp config/config.example.json config/config.json
# Edit config/config.json with your settings
# 4. Create directories
mkdir -p config/import/{source,output,archive,error}
chmod 755 config/import/{source,output,archive,error}
# 5. Run a test
php bin/transformer.php validate config/config.json input.csv
```
---
## Quick Start
### 1. Adjust configuration
Edit `config/config.json` and make sure the extraction rules match your CSV format:
```json
{
  "metadata": {
    "extractionRules": [
      {
        "name": "account_iban",
        "lineNumber": 2,
        "regex": "IBAN:\\s*([A-Z0-9 ]+)",
        "captureGroup": 1
      }
    ]
  },
  "csvStructure": {
    "headerLine": 5,
    "delimiter": ";",
    "encoding": "UTF-8"
  }
}
```
### 2. Validate CSV
```bash
php bin/transformer.php validate config/config.json input.csv
```
### 3. Run transformation
```bash
php bin/transformer.php transform input.csv config/config.json
# With debug mode for troubleshooting
php bin/transformer.php transform input.csv config/config.json --debug
```
### 4. Inspect output
```bash
php bin/transformer.php test input.csv config/config.json --debug
# Shows up to 10 transformed rows and debug logs
```
---
## Configuration
### config.json structure
#### `metadata` — Metadata extraction
```json
{
  "metadata": {
    "extractionRules": [
      {
        "name": "account_iban",
        "lineNumber": 2,
        "regex": "IBAN:\\s*([A-Z0-9 ]+)",
        "captureGroup": 1
      },
      {
        "name": "currency_code",
        "lineNumber": 3,
        "regex": "Currency:\\s*([A-Z]{3})",
        "captureGroup": 1
      }
    ]
  }
}
```
| Field | Type | Description |
| ------ | ----- | ----------- |
| `name` | string | Name of the metadata variable (used in `constantvalue`) |
| `lineNumber` | int | Line number in CSV (1-based, human-readable) |
| `regex` | string | Regex pattern for extraction (without delimiters) |
| `captureGroup` | int | Capture group index (0 = full match, 1 = first group, etc.) |
**Regex example:**
- Pattern: `IBAN:\s*([A-Z0-9 ]+)`
- Input: `IBAN: CH93 0077 2020 6262 5252 7`
- Capture group 1: `CH93 0077 2020 6262 5252 7`
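
The rule fields above map fairly directly onto PHP's `preg_match`. The following is a minimal sketch, assuming the pattern is wrapped in `/.../u` and that a failed match simply leaves the key unset. The function name `extractMetadataSketch` and the sample lines are hypothetical and do not reflect the project's actual `MetadataExtractor` API.

```php
<?php
// Hypothetical helper: applies extraction rules to the raw lines of a CSV file.
function extractMetadataSketch(array $lines, array $rules): array
{
    $metadata = [];
    foreach ($rules as $rule) {
        // lineNumber is 1-based in the config, PHP arrays are 0-based.
        $line = $lines[$rule['lineNumber'] - 1] ?? '';
        // Assumption: patterns are stored without delimiters and wrapped here.
        $pattern = '/' . str_replace('/', '\/', $rule['regex']) . '/u';
        if (preg_match($pattern, $line, $matches) === 1
            && isset($matches[$rule['captureGroup']])) {
            $metadata[$rule['name']] = $matches[$rule['captureGroup']];
        }
    }
    return $metadata;
}

$lines = ['Account statement', 'IBAN: CH93 0077 2020 6262 5252 7', 'Currency: CHF'];
$rules = [
    ['name' => 'account_iban', 'lineNumber' => 2, 'regex' => 'IBAN:\s*([A-Z0-9 ]+)', 'captureGroup' => 1],
    ['name' => 'currency_code', 'lineNumber' => 3, 'regex' => 'Currency:\s*([A-Z]{3})', 'captureGroup' => 1],
];
$meta = extractMetadataSketch($lines, $rules);
echo $meta['account_iban'], ' / ', $meta['currency_code'], PHP_EOL; // CH93 0077 2020 6262 5252 7 / CHF
```

Note that inside PHP source a single backslash is enough (`\s`), whereas the JSON config needs it doubled (`\\s`).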
#### `csvStructure` — CSV format
```json
{
  "csvStructure": {
    "headerLine": 5,
    "delimiter": ";",
    "encoding": "UTF-8",
    "hasBom": false
  }
}
```
| Field | Type | Default | Description |
| ------ | ----- | ------- | ----------- |
| `headerLine` | int | 5 | Line number of the header row (1-based) |
| `delimiter` | string | `;` | CSV delimiter |
| `encoding` | string | `UTF-8` | Character encoding (UTF-8, ISO-8859-1, CP1252) |
| `hasBom` | bool | false | Whether the file has a BOM (Byte Order Mark) |
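
To illustrate how these four settings interact, here is a rough sketch of a reader that strips the BOM, converts the encoding, and then treats `headerLine` as the first CSV row. It is an assumption-laden illustration, not the project's `CsvReader`: the function name is hypothetical, `iconv` is used for conversion, and quoted fields spanning several lines are not handled.

```php
<?php
// Hypothetical reader: turns raw file content into rows keyed by header name.
function readCsvSketch(string $content, array $csv): array
{
    // Strip a UTF-8 BOM if the config says the file carries one.
    if (($csv['hasBom'] ?? false) && str_starts_with($content, "\xEF\xBB\xBF")) {
        $content = substr($content, 3);
    }
    // Normalise to UTF-8 so later string functions see one encoding.
    $encoding = $csv['encoding'] ?? 'UTF-8';
    if (strcasecmp($encoding, 'UTF-8') !== 0) {
        $converted = iconv($encoding, 'UTF-8', $content);
        if ($converted === false) {
            throw new RuntimeException("Cannot convert from $encoding");
        }
        $content = $converted;
    }
    $lines = preg_split('/\r\n|\r|\n/', $content);
    $delimiter = $csv['delimiter'] ?? ';';
    // headerLine is 1-based; everything above it is metadata.
    $headerIndex = ($csv['headerLine'] ?? 5) - 1;
    $header = str_getcsv($lines[$headerIndex], $delimiter, '"', '\\');

    $rows = [];
    foreach (array_slice($lines, $headerIndex + 1) as $line) {
        if ($line === '') {
            continue; // skip blank lines, e.g. after a trailing newline
        }
        $fields = str_getcsv($line, $delimiter, '"', '\\');
        if (count($fields) === count($header)) {
            $rows[] = array_combine($header, $fields);
        }
    }
    return $rows;
}

$content = "Bank export\nIBAN: CH93 0077\nCurrency: CHF\n\nBookingDate;BookingText\n10.12.2025;Coop Pronto\n";
$rows = readCsvSketch($content, ['headerLine' => 5, 'delimiter' => ';', 'encoding' => 'UTF-8', 'hasBom' => false]);
echo $rows[0]['BookingText'], PHP_EOL; // Coop Pronto
```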
#### `columnTransformations` — Column transformations
```json
{
  "columnTransformations": [
    {
      "sourceColumn": "BookingDate",
      "transformations": [
        {
          "type": "dateformat",
          "fromFormat": "d.m.Y",
          "toFormat": "Y-m-d"
        }
      ],
      "outputColumn": "date",
      "outputAction": "overwrite"
    }
  ]
}
```
**outputAction:**
- `overwrite` — overwrite the source column
- `create` — create a new column (for regex extract, split, etc.)
#### `directories` — File system
```json
{
  "directories": {
    "source": "/opt/ff-imp-preprocessor/import/source",
    "output": "/opt/ff-imp-preprocessor/import/output",
    "archive": "/opt/ff-imp-preprocessor/import/archive",
    "error": "/opt/ff-imp-preprocessor/import/error"
  }
}
```
| Field | Description |
| ------ | ----------- |
| `source` | Input directory |
| `output` | Output directory |
| `archive` | Archive for processed files |
| `error` | Error directory for invalid files |
#### `fireflyImport` — Firefly III integration
The operating mode is controlled by the `mode` field. Possible values: `cli`, `docker`, `http`.
Full details and examples: [Firefly III Integration](#firefly-iii-integration).
```json
{
  "fireflyImport": {
    "mode": "docker",
    "jsonConfig": "/import/configs/ubs-import.json",
    "importerCommand": "docker exec firefly-importer php artisan importer:import",
    "autoImport": false,
    "deleteAfterImport": false,
    "timeout": 300
  }
}
```
| Field | Type | Description |
| --- | --- | --- |
| `mode` | string | Operating mode: `cli` \| `docker` \| `http` (default: `cli`) |
| `jsonConfig` | string | Path to the Firefly III Data Importer JSON config file (format v3) |
| `importerCommand` | string | Full CLI command *(modes: cli, docker)* |
| `importerUrl` | string | URL of the Data Importer *(mode: http)* |
| `importerSecret` | string | `AUTO_IMPORT_SECRET` of the importer (min. 16 chars) *(mode: http)* |
| `autoImport` | boolean | Run import immediately after transformation |
| `deleteAfterImport` | boolean | Delete transformed CSV after successful import |
| `timeout` | integer | Timeout in seconds (default: 300) |
| `environment` | object | Additional environment variables *(modes: cli, docker)* |
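
The mode-specific fields in this table suggest a simple consistency check before an import is attempted. The sketch below assumes that `jsonConfig` and the fields marked for a given mode are mandatory, which the table implies but does not state. The function name is hypothetical and is not part of `ConfigurationLoader`.

```php
<?php
// Hypothetical validator for the "fireflyImport" block; returns a list of problems.
function validateFireflyImportSketch(array $cfg): array
{
    $errors = [];
    $mode = $cfg['mode'] ?? 'cli'; // documented default

    if (!in_array($mode, ['cli', 'docker', 'http'], true)) {
        return ["Unknown mode '$mode' (expected cli, docker or http)"];
    }
    if (empty($cfg['jsonConfig'])) {
        $errors[] = 'jsonConfig is required';
    }
    if ($mode === 'http') {
        if (empty($cfg['importerUrl'])) {
            $errors[] = 'importerUrl is required in http mode';
        }
        if (strlen($cfg['importerSecret'] ?? '') < 16) {
            $errors[] = 'importerSecret must be at least 16 characters in http mode';
        }
    } elseif (empty($cfg['importerCommand'])) {
        $errors[] = "importerCommand is required in $mode mode";
    }
    return $errors;
}

$problems = validateFireflyImportSketch([
    'mode' => 'http',
    'jsonConfig' => '/local/path/to/ubs-import.json',
    'importerUrl' => 'https://importer.your-server.com',
    'importerSecret' => 'too-short',
]);
print_r($problems); // one entry, complaining about the secret length
```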
---
## Transformation Types
There are **13 supported transformation types** that can be combined as a pipeline:
### 1. **trim** — Remove whitespace
Removes leading and trailing whitespace.
```json
{ "type": "trim" }
```
- Input: ` Coop Pronto ` → Output: `Coop Pronto`
---
### 2. **lowercase** — Convert to lowercase
Converts to lowercase (UTF-8 safe).
```json
{ "type": "lowercase" }
```
- Input: `COOP PRONTO CHUR` → Output: `coop pronto chur`
---
### 3. **uppercase** — Convert to uppercase
Converts to uppercase (UTF-8 safe).
```json
{ "type": "uppercase" }
```
- Input: `Coop Pronto Chur` → Output: `COOP PRONTO CHUR`
---
### 4. **ucwordsfirst** — Capitalise after word separators
Capitalises the first letter after each word separator.
```json
{ "type": "ucwordsfirst" }
```
- `COOP PRONTO CHUR` → `Coop Pronto Chur`
- `migros-rail city` → `Migros-Rail City`
- `O'NEILL STORE` → `O'Neill Store`
- `SAINT-JEAN-DE-MAURIENNE` → `Saint-Jean-De-Maurienne`
Separators: space, hyphen, apostrophe, slash, period, comma, semicolon, colon, parentheses.
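
The documented examples can be reproduced with a short regex-based sketch. It assumes the value is lowercased first (as the examples imply) and that `ext-mbstring` is available. `ucwordsFirstSketch` is a hypothetical name, not the project's implementation.

```php
<?php
// Hypothetical re-implementation of the documented behaviour, not the project's code.
function ucwordsFirstSketch(string $value): string
{
    // Lowercase everything first, as the documented examples imply.
    $lower = mb_strtolower($value, 'UTF-8');
    // Uppercase any letter at the start or directly after a listed separator.
    return preg_replace_callback(
        '~(^|[ \-\'/.,;:()])(\p{L})~u',
        static fn(array $m): string => $m[1] . mb_strtoupper($m[2], 'UTF-8'),
        $lower
    ) ?? $lower;
}

echo ucwordsFirstSketch('COOP PRONTO CHUR'), PHP_EOL;        // Coop Pronto Chur
echo ucwordsFirstSketch("O'NEILL STORE"), PHP_EOL;           // O'Neill Store
echo ucwordsFirstSketch('SAINT-JEAN-DE-MAURIENNE'), PHP_EOL; // Saint-Jean-De-Maurienne
```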
---
### 5. **replace** — String replacement
Replaces a substring with another string (case-sensitive).
```json
{ "type": "replace", "search": "  ", "replace": " " }
```
- Input: `Coop  Pronto` (two spaces) → Output: `Coop Pronto` (one space)
---
### 6. **split** — Split column
Splits a value at a delimiter and keeps a defined part.
```json
{ "type": "split", "delimiter": ";", "part": 0 }
```
- Input: `Coop Pronto Chur;7007 Chur` → Output: `Coop Pronto Chur`
---
### 7. **regex** — Regex replacement
Replaces parts of a string using a regular expression. Uses PHP `preg_replace`.
```json
{ "type": "regex", "pattern": "^(.*?);.*$", "replace": "$1" }
```
**No match → original value is passed through unchanged** (pipeline-safe).
Use capture groups as `$1`, `$2`, … in the `replace` field.
A pattern without `^`/`$` anchors replaces only the matched portion, not the whole value.
---
### 8. **regexextract** — Regex extraction
Extracts a capture group and returns **only that**. Uses PHP `preg_match`.
```json
{ "type": "regexextract", "pattern": "(\\d{4,} [^;]+)" }
```
- Input: `Coop Pronto Chur, 7007 Chur` → Output: `7007 Chur`
- No match → empty string
> **⚠ Not pipeline-safe:** A no-match discards all previous pipeline results. Use `regex` instead if you want to preserve the current value on no-match.
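
The difference between `regex` and `regexextract` on a no-match follows from the two PHP functions they are documented to use. A minimal sketch, assuming patterns arrive already delimited and that `regexextract` returns capture group 1 (the helper names are hypothetical):

```php
<?php
// Sketch of the two documented behaviours using the underlying PHP functions.
function regexReplaceSketch(string $value, string $pattern, string $replace): string
{
    // preg_replace() returns the subject unchanged when nothing matches.
    return preg_replace($pattern, $replace, $value) ?? $value;
}

function regexExtractSketch(string $value, string $pattern, int $group = 1): string
{
    // preg_match() returns 0 on no-match; the result is then an empty string.
    return preg_match($pattern, $value, $m) === 1 ? ($m[$group] ?? '') : '';
}

echo regexReplaceSketch('Coop Pronto Chur;7007 Chur', '/^(.*?);.*$/u', '$1'), PHP_EOL; // Coop Pronto Chur
echo regexReplaceSketch('Coop Pronto Chur', '/^(.*?);.*$/u', '$1'), PHP_EOL;           // Coop Pronto Chur (no match, unchanged)
echo regexExtractSketch('Coop Pronto Chur, 7007 Chur', '/(\d{4,} [^;]+)/u'), PHP_EOL;  // 7007 Chur
var_dump(regexExtractSketch('Coop Pronto Chur', '/(\d{4,} [^;]+)/u'));                 // string(0) ""
```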
---
### 9. **dateformat** — Date reformatting
Converts between date formats.
```json
{ "type": "dateformat", "fromFormat": "d.m.Y", "toFormat": "Y-m-d" }
```
- Input: `10.12.2025` → Output: `2025-12-10`
Supports all PHP `DateTime` format characters.
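A conversion like this can be sketched with `DateTimeImmutable::createFromFormat`. The helper name and the choice to return `null` on a parse failure are assumptions for illustration only.

```php
<?php
// Hypothetical helper; returns null when the input does not match $fromFormat.
function reformatDateSketch(string $value, string $fromFormat, string $toFormat): ?string
{
    // The leading "!" resets unparsed fields (such as the time) instead of using "now".
    $date = DateTimeImmutable::createFromFormat('!' . $fromFormat, $value);
    return $date === false ? null : $date->format($toFormat);
}

echo reformatDateSketch('10.12.2025', 'd.m.Y', 'Y-m-d'), PHP_EOL; // 2025-12-10
var_dump(reformatDateSketch('2025-12-10', 'd.m.Y', 'Y-m-d'));     // NULL
```

Note that `createFromFormat` accepts out-of-range days such as `31.02.2025` and rolls them over with a warning. Stricter validation would need to inspect `DateTimeImmutable::getLastErrors()`.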
---
### 10. **truncate** — Truncate string
Truncates a string to a maximum length.
```json
{ "type": "truncate", "maxLength": 100 }
```
---
### 11. **constantvalue** — Constant value from metadata
Injects an extracted metadata value as a constant for every row.
```json
{
  "sourceColumn": "_constant_",
  "transformations": [
    { "type": "constantvalue", "metadataKey": "account_iban" }
  ],
  "outputColumn": "account_iban",
  "outputAction": "create"
}
```
- Every row receives the extracted `account_iban` value (e.g. `CH9300777222666888999`) in a new column.
---
### 12. **map** — Copy / rename column
Copies a column value as-is (optionally to a new name).
```json
{ "type": "map" }
```
---
### 13. **pipeline** — Nested pipeline
Runs a sub-pipeline as a single transformation step.
```json
{
  "type": "pipeline",
  "steps": [
    { "type": "trim" },
    { "type": "lowercase" },
    { "type": "ucwordsfirst" }
  ]
}
```
Useful for grouping steps as a logical unit within a `transformations` array.
---
### Pipeline example
Multiple transformations chained:
```json
{
  "sourceColumn": "BookingText",
  "transformations": [
    { "type": "trim" },
    { "type": "replace", "search": "  ", "replace": " " },
    { "type": "lowercase" },
    { "type": "ucwordsfirst" }
  ],
  "outputColumn": "description",
  "outputAction": "overwrite"
}
```
**Processing:**
1. `"  COOP  PRONTO  "` → trim → `"COOP  PRONTO"`
2. `"COOP  PRONTO"` → replace → `"COOP PRONTO"`
3. `"COOP PRONTO"` → lowercase → `"coop pronto"`
4. `"coop pronto"` → ucwordsfirst → `"Coop Pronto"`
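
Chaining steps like this amounts to folding a value through a list of functions. The sketch below covers only a subset of the step types and is an illustration under that assumption. `applyStepsSketch` is a hypothetical name and not the project's `ColumnTransformer`.

```php
<?php
// Hypothetical pipeline runner covering a few of the documented step types.
function applyStepsSketch(string $value, array $steps): string
{
    foreach ($steps as $step) {
        // Each step receives the output of the previous one.
        $value = match ($step['type']) {
            'trim'      => trim($value),
            'lowercase' => mb_strtolower($value, 'UTF-8'),
            'uppercase' => mb_strtoupper($value, 'UTF-8'),
            'replace'   => str_replace($step['search'], $step['replace'], $value),
            'split'     => explode($step['delimiter'], $value)[$step['part']] ?? '',
            'truncate'  => mb_substr($value, 0, $step['maxLength'], 'UTF-8'),
            // A nested pipeline is just a recursive call on its "steps".
            'pipeline'  => applyStepsSketch($value, $step['steps']),
            default     => throw new InvalidArgumentException("Unknown type: {$step['type']}"),
        };
    }
    return $value;
}

$steps = [
    ['type' => 'trim'],
    ['type' => 'replace', 'search' => '  ', 'replace' => ' '],
    ['type' => 'pipeline', 'steps' => [['type' => 'lowercase']]],
];
echo applyStepsSketch('  COOP  PRONTO  ', $steps), PHP_EOL; // coop pronto
```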
---
## CLI Reference
```bash
php bin/transformer.php <command> [input] [config] [options]
```
### Commands
| Command | Description |
| ------- | ----------- |
| `test` | Test run (up to 10 rows) |
| `transform` | Full transformation |
| `validate` | Validate configuration |
| `auto-import` | Directory monitoring |
| `help` | Show help |
### Options
| Option | Description |
| ------ | ----------- |
| `--debug`, `-d` | Enable debug mode |
| `--rows=N` | Max. N rows (`test` command) |
| `--output=FILE`, `-o` | Output path |
| `--strict` | Strict validation |
| `--watch` | Continuous monitoring |
| `--interval=SEC` | Check interval in seconds (default: 60) |
| `--dry-run` | Simulation mode, no real operations |
---
## Debug Mode
```bash
php bin/transformer.php test input.csv config/config.json --debug
```
### Log categories
| Category | When |
| -------- | ---- |
| `transformer` | Start/end of transformation |
| `csv_reader` | While reading CSV |
| `metadata` | During metadata extraction |
| `metadata_warning` | On extraction problems |
| `transformation` | For each transformation step |
| `csv_writer` | While writing output CSV |
### Debug log output (JSON)
```json
{
  "success": true,
  "debug_logs": [
    {
      "timestamp": 1702200120.5432,
      "category": "transformer",
      "message": "Transformation started",
      "data": { "inputFile": "input.csv", "maxRows": 0 }
    },
    {
      "timestamp": 1702200120.5445,
      "category": "metadata",
      "message": "Extraction rule applied",
      "data": { "rule_name": "account_iban", "value": "CH93..." }
    }
  ]
}
```
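Because the log is plain JSON, it can be filtered by category with a few lines of PHP. This sketch assumes the JSON has been captured into a string (how the CLI emits it is not specified here), and the helper name is hypothetical.

```php
<?php
// Hypothetical helper: pick the entries of one category out of the debug JSON.
function filterDebugLogsSketch(string $json, string $category): array
{
    $result = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
    return array_values(array_filter(
        $result['debug_logs'] ?? [],
        static fn(array $entry): bool => ($entry['category'] ?? '') === $category
    ));
}

$json = '{"success":true,"debug_logs":['
      . '{"timestamp":1702200120.5432,"category":"transformer","message":"Transformation started","data":{}},'
      . '{"timestamp":1702200120.5445,"category":"metadata","message":"Extraction rule applied","data":{}}]}';

foreach (filterDebugLogsSketch($json, 'metadata') as $entry) {
    // Timestamps are Unix seconds with a fractional part.
    printf("[%s] %s: %s\n", date('H:i:s', (int) $entry['timestamp']), $entry['category'], $entry['message']);
}
```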
---
## Firefly III Integration
The transformer can automatically import transformed files into Firefly III.
Three operating modes cover all typical deployment scenarios.
### Prerequisites (all modes)
**1. Create a Firefly III Data Importer JSON configuration file**
This file maps transformed CSV columns to Firefly III transaction fields (format v3).
Recommended approach: upload a sample CSV once in the Firefly III Data Importer Web UI, configure the column mapping there, then download the finished configuration. Alternatively, use `config/firefly-import-config.example.json` as a template and adjust `default_account` to your asset account ID.
**2. Choose an operating mode** — see sections below.
---
### Mode `cli` — Transformer and Firefly on the same server
Both the transformer and the Firefly III Data Importer run on the same server. The transformer calls the importer directly as a local command.
```json
{
  "fireflyImport": {
    "mode": "cli",
    "jsonConfig": "/opt/firefly-data-importer/storage/configurations/ubs-import.json",
    "importerCommand": "php /opt/firefly-data-importer/artisan importer:import",
    "autoImport": true,
    "deleteAfterImport": false,
    "timeout": 300,
    "environment": {
      "FIREFLY_III_URL": "https://localhost",
      "FIREFLY_III_ACCESS_TOKEN": "your-token-here"
    }
  }
}
```
---
### Mode `docker` — Transformer local, Firefly in Docker
The transformer runs locally or in its own container; the Firefly III Data Importer runs in a Docker container. The transformer calls the importer via `docker exec`.
**Important:** The transformer's output directory must be mounted as a volume in the importer container. `jsonConfig` is the path **inside the container** (not a local path). Do not use the `-it` flag (no TTY).
Example `docker-compose.yml` for the importer:
```yaml
services:
  firefly-importer:
    image: fireflyiii/data-importer:latest
    container_name: firefly-importer  # must match the name used by `docker exec`
    volumes:
      - /opt/ff-imp-preprocessor/import:/import
    environment:
      - FIREFLY_III_URL=https://your-firefly.com
      - FIREFLY_III_ACCESS_TOKEN=your-token-here
      - CAN_POST_FILES=false
```
```json
{
  "fireflyImport": {
    "mode": "docker",
    "jsonConfig": "/import/configs/ubs-import.json",
    "importerCommand": "docker exec firefly-importer php artisan importer:import",
    "autoImport": true,
    "deleteAfterImport": false,
    "timeout": 300
  }
}
```
The JSON config file must be available inside the container — either via a volume mount or `docker cp`:
```bash
docker cp ubs-import.json firefly-importer:/import/configs/ubs-import.json
```
---
### Mode `http` — Transformer local, Firefly importer on a remote server
The transformer runs locally; the Firefly III Data Importer is reachable over HTTP(S). The CSV and JSON configuration are sent as a multipart HTTP upload to the importer.
**Requirements on the importer server:**
```text
CAN_POST_FILES=true
AUTO_IMPORT_SECRET=<secret> # at least 16 characters
```
**Local requirement:** PHP extension `ext-curl`
```json
{
  "fireflyImport": {
    "mode": "http",
    "importerUrl": "https://importer.your-server.com",
    "importerSecret": "your-auto-import-secret-min-16-chars",
    "jsonConfig": "/local/path/to/ubs-import.json",
    "autoImport": true,
    "deleteAfterImport": false,
    "timeout": 300
  }
}
```
The transformer sends files to `POST {importerUrl}/autoupload`. The JSON config lives locally — the transformer uploads it together with the CSV. No volume mount or SSH access to the remote server is required.
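The request described above could look roughly like this with `ext-curl`. The `secret` query parameter and the form field names `importable` and `json` are assumptions about the Data Importer's auto-upload interface and should be checked against its documentation. Both helper names are hypothetical and do not describe the project's `FireflyImporter`.

```php
<?php
// Build the endpoint URL; passing the secret as a query parameter is an assumption.
function autouploadUrlSketch(string $importerUrl, string $secret): string
{
    return rtrim($importerUrl, '/') . '/autoupload?' . http_build_query(['secret' => $secret]);
}

// Hypothetical upload helper; requires ext-curl. Field names are assumptions.
function uploadSketch(string $url, string $csvPath, string $jsonPath, int $timeout = 300): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => $timeout,
        CURLOPT_HTTPHEADER     => ['Accept: application/json'],
        // Passing an array makes cURL send multipart/form-data.
        CURLOPT_POSTFIELDS     => [
            'importable' => new CURLFile($csvPath, 'text/csv'),
            'json'       => new CURLFile($jsonPath, 'application/json'),
        ],
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Upload failed: ' . curl_error($ch));
    }
    return $body;
}

echo autouploadUrlSketch('https://importer.your-server.com/', 'your-auto-import-secret-min-16-chars'), PHP_EOL;
```

Calling `uploadSketch()` with that URL and the two local file paths would then perform the actual transfer. It is not invoked here because it needs a reachable importer.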
---
### Usage
```bash
# Transformation + automatic import (when autoImport=true)
php bin/transformer.php transform input.csv config/config.json
# Watch mode: trigger on new CSV in source directory
php bin/transformer.php auto-import config/config.json --watch
```
---
## Architecture
### Components
```text
bin/transformer.php (CLI entry point)
TransformerEngine (orchestration)
├─ ConfigurationLoader (load / validate config)
├─ CsvReader (read CSV)
├─ MetadataExtractor (metadata via regex)
├─ ColumnTransformer (apply transformations)
├─ CsvWriter (write CSV)
├─ FireflyImporter (Firefly III integration)
└─ DebugLogger (debug logs)
```
### Data flow
```text
Input CSV
CsvReader::readMetadataLines() → array of lines
MetadataExtractor::extract() → {iban: "...", currency: "..."}
CsvReader::readCsvData() → array of rows
ColumnTransformer::transformRow() → transformed row (pipeline)
CsvWriter::write() → output CSV
```
### Classes
| Class | Responsibility |
| ----- | -------------- |
| `TransformerEngine` | Orchestrates the entire workflow |
| `ConfigurationLoader` | Loads and validates JSON configuration |
| `CsvReader` | Reads CSV with metadata support |
| `MetadataExtractor` | Extracts metadata via regex |
| `ColumnTransformer` | Transforms columns (pipeline) |
| `CsvWriter` | Writes output CSV |
| `FireflyImporter` | Imports into Firefly III |
| `DebugLogger` | Static logger for debug output |
---
## Error Handling
### Common errors
#### "Input file not found"
```bash
# Check the file path
ls -la input.csv
# Use an absolute path if relative paths do not work
php bin/transformer.php transform /absolute/path/input.csv config.json
```
---
#### "Missing metadata: account_iban"
The IBAN could not be extracted — wrong regex or wrong line number.
```bash
# Inspect the first lines of the CSV
head -5 input.csv
# Validate with debug output
php bin/transformer.php validate config.json input.csv --debug
```
---
#### "Invalid JSON: …"
Syntax error in `config.json`.
```bash
# Prints "No error" if the file parses, otherwise the parser's error message
php -r "json_decode(file_get_contents('config/config.json')); echo json_last_error_msg(), PHP_EOL;"
```
---
#### "Configuration: 'csvStructure.headerLine' required"
A required configuration field is missing.
```bash
diff config/config.json config/config.example.json
```
---
### Exception handling
```php
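try {
    // Recoverable problems are reported through the result array
    $result = $engine->transform($inputFile);
    if (!$result['success']) {
        echo "Error: " . $result['error'] . PHP_EOL;
    }
} catch (\Throwable $e) {
    // Anything thrown, including PHP errors, is treated as fatal
    echo "Fatal error: " . $e->getMessage() . PHP_EOL;
}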
```
---
## Tips
### UTF-8 handling
The transformer uses UTF-8 safe functions throughout:
- `mb_strtolower()` instead of `strtolower()`
- `mb_strtoupper()` instead of `strtoupper()`
- `mb_strlen()` for correct character counting
Supported encodings: UTF-8, ISO-8859-1, CP1252.
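The difference between byte-based and multibyte functions shows up as soon as a value contains an umlaut. A small demonstration, assuming `ext-mbstring` is installed:

```php
<?php
// "ZÜRICH", written with an escape so the source file encoding does not matter.
$city = "Z\u{00DC}RICH";

var_dump(strlen($city));                     // int(7): bytes, "Ü" takes two bytes in UTF-8
var_dump(mb_strlen($city, 'UTF-8'));         // int(6): characters
echo mb_strtolower($city, 'UTF-8'), PHP_EOL; // zürich
```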
### Regex tips
**Pattern without delimiters:** the pattern is wrapped automatically, so the example below becomes `/IBAN:\s*([A-Z0-9 ]+)/u`.
```json
"pattern": "IBAN:\\s*([A-Z0-9 ]+)"
```
**With explicit delimiters and flags** (here `i` for case-insensitive matching):
```json
"pattern": "/IBAN:\\s*([A-Z0-9 ]+)/iu"
```
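One way such auto-wrapping could work is sketched below. This is a guess at the behaviour described above, not the project's code: the detection heuristic, the slash escaping, and the name `wrapPatternSketch` are all assumptions.

```php
<?php
// Hypothetical normaliser: leaves "/body/flags" patterns alone, wraps everything else.
function wrapPatternSketch(string $pattern): string
{
    // Heuristic: already delimited if it looks like /body/flags.
    // A bare pattern such as "/usr/bin" would be misdetected by this check.
    if (preg_match('~^/.+/[a-zA-Z]*$~s', $pattern) === 1) {
        return $pattern;
    }
    // Naive escaping: assumes slashes in the pattern are not already escaped.
    return '/' . str_replace('/', '\/', $pattern) . '/u';
}

echo wrapPatternSketch('IBAN:\s*([A-Z0-9 ]+)'), PHP_EOL;     // /IBAN:\s*([A-Z0-9 ]+)/u
echo wrapPatternSketch('/IBAN:\s*([A-Z0-9 ]+)/iu'), PHP_EOL; // /IBAN:\s*([A-Z0-9 ]+)/iu
```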
### Performance
- **Optimised for:** up to 1 million rows
- **Typical file size:** 10–100 k rows
### Batch processing
```bash
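#!/bin/bash
for file in import/source/*.csv; do
    # Skip the literal pattern when no CSV files are present
    [ -e "$file" ] || continue
    # Archive on success, move to the error directory otherwise
    if php bin/transformer.php transform "$file" config/config.json; then
        mv "$file" import/archive/
    else
        mv "$file" import/error/
    fi
done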
---
## Version History
**v1.0.0 (03 May 2026)**
- ✅ Initial release
- ✅ 13 transformation types
- ✅ Metadata extraction via regex
- ✅ Debug mode
- ✅ Firefly III integration (cli / docker / http)
- ✅ Full documentation
---
**License:** GPL-3.0
**Author:** PHP CSV Transformer Project
**Repository:** [git.andare.ch/david.reindl/ff-imp-preprocessor](https://git.andare.ch/david.reindl/ff-imp-preprocessor)