✈️ 50,000-foot FlightPath Overview
The Architecture For Data File Feed Ingestion and Management
- Preboarding Within The Data Lifecycle
- Purpose-built components, state of the art integrations
- Goals
- Principles
Preboarding Within The Data Lifecycle
Preboarding is an integral part of the flow of data files from untrusted producers to data product end users. FlightPath Data and FlightPath Server are the purpose-built, drop-in implementation of effective data preboarding.

Purpose-built components, state of the art integrations
FlightPath Data, FlightPath Server, and the CsvPath Framework together form a complete data preboarding architecture that takes your data file feed ingestion operations to the next level. The solution is developer-friendly, opinionated, and flexible, keeping your system design effort low. Not only do the three core components build on one another, they also integrate with leading cloud platforms and observability, data management, and notification tools.
Architecture Components
FlightPath augments CsvPath Framework to create a complete low-code, high-function preboarding system for ingesting data file feeds. The architecture has two layers: core components and enabling integrations.
CsvPath Framework
- Data and metadata management for file feeds
- CsvPath Validation Language for validation and upgrading
- CsvPath Reference Language for querying staged data or accessing specific files
- Event-driven integration
FlightPath Data
- A project-based IDE for CsvPath Validation Language
- A CsvPath Framework production management console
- A project configuration and syncing tool
FlightPath Server
- A multi-project/multi-user runtime for FlightPath Data projects
- An arms-length integration target for managed file transfer systems
- A trusted publisher serving known-good data to downstream consumers
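To make the component lists concrete, here is a rough sketch of a CsvPath Validation Language rule. The file name, headers, and rules are hypothetical, and function names and exact syntax may differ from the current language; see the CsvPath documentation for the authoritative grammar.

```
~ hypothetical rule set: every order line needs a sku and a shipped status ~
$orders.csv[*][
    not( empty( #sku ) )
    #status == "shipped"
]
```

The general shape is a file reference, a line-scanning part, and a match part that declares what a valid line looks like.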

Enabling Integrations
Preboarding is just one stop on data’s journey from source to consumer. FlightPath and CsvPath Framework integrate with data file sources and destinations, observability platforms, and metadata-layer collaborators.
- Five supported MFT and data lake storage backends: AWS, Azure, GCP, SFTP, and file system
- Metadata capture to any mainstream relational database
- Arms-length integrations using webhooks and APIs
- Support for sending data processing events to OpenTelemetry and OpenLineage platforms
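As a sketch of what an arms-length webhook integration might carry, the snippet below builds a file-arrival event as a JSON body. The field names and event shape are assumptions for illustration, not FlightPath's actual event schema.

```python
import json
from datetime import datetime, timezone

def build_arrival_event(named_file: str, fingerprint: str) -> str:
    """Build a webhook payload announcing a file arrival.

    The field names here are illustrative, not FlightPath's actual schema.
    """
    event = {
        "event_type": "file_arrived",
        "named_file": named_file,
        "fingerprint": fingerprint,
        "occurred_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event)

# An arms-length integration would POST this body to a subscriber's webhook
# URL (e.g. with urllib.request); no shared database or library is needed.
payload = build_arrival_event("acme/orders", "sha256:abc123")
```

Because the payload is plain JSON over HTTP, the consumer needs no knowledge of FlightPath internals, only the agreed event shape.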

Goals
FlightPath and CsvPath are on a mission to increase quality and lower the cost and risk of managing data file feeds. Every aspect of their design is based on meeting specific goals.
Lower Manual Effort
- Move manual data checking into automated CsvPath Validation Language rules and schemas
- Move manual arrival and process monitoring into OpenTelemetry and OpenLineage events or webhook notifications
- Remove the architectural design effort required to create a data ingestion preboarding solution
- Reduce project setup effort to enable small, agile projects with lower risks and faster turnaround
Higher Data Quality
- Turn raw untrustworthy data into “ideal-form” raw data in a known-good or known-bad state
- Assess “ideal” and “known” using formal checks for well-formedness, validity, canonical form, and business-rule acceptance criteria
- Bring semi-structured CSV and Excel data to the same level of conformance testing as relational, XML, and EDI data
- Shift quality control left to catch data-quality issues as close to the data producer as possible
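A minimal sketch of the quality steps above, in plain Python: canonicalize raw rows (trim fields, normalize dates), then sort each row into a known-good or known-bad state against acceptance criteria. The column names and rules are hypothetical; in FlightPath these checks would be expressed in CsvPath Validation Language rather than hand-written code.

```python
import csv
import io
from datetime import datetime

def canonicalize_row(row: dict) -> dict:
    """Illustrative canonicalization: trim fields and normalize dates to ISO 8601."""
    out = {k: v.strip() for k, v in row.items()}
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):  # accept two common date spellings
        try:
            out["order_date"] = datetime.strptime(out["order_date"], fmt).date().isoformat()
            break
        except ValueError:
            continue
    return out

def is_known_good(row: dict) -> bool:
    """Hypothetical acceptance criteria: sku present, quantity a positive integer."""
    return bool(row["sku"]) and row["quantity"].isdigit() and int(row["quantity"]) > 0

raw = io.StringIO("sku,quantity,order_date\n  A-1 ,3,03/05/2024\n,0,2024-03-05\n")
rows = [canonicalize_row(r) for r in csv.DictReader(raw)]
good = [r for r in rows if is_known_good(r)]   # ideal-form, known-good lines
bad = [r for r in rows if not is_known_good(r)]  # known-bad lines, kept for forensics
```

Note that the bad rows are classified, not discarded: both states are "known", which is the point.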
Improved Data Access
- Put all inbound data file feeds into a consistent storage structure for easy access
- Provide a permanent record of data at intake and data published internally
- Provide a simple query language to find data based on file location and data lifecycle criteria
- Provide a process-locally-monitor-centrally capability allowing many small disconnected data partner projects to feed data, metadata, and monitoring events into coherent centralized systems
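The consistent storage structure above can be sketched as a simple path convention: every inbound file lands at a location derivable from who sent it, which feed it belongs to, and which version it is. The layout below is an assumed example, not FlightPath's actual directory scheme.

```python
from pathlib import PurePosixPath

def staged_path(partner: str, feed: str, version: int, filename: str) -> str:
    """Hypothetical staging-path convention; FlightPath's actual layout may differ."""
    return str(PurePosixPath("staging") / partner / feed / f"v{version:04d}" / filename)

# Downstream consumers can locate any file from (partner, feed, version) alone,
# with no per-partner special cases.
p = staged_path("acme", "orders", 7, "orders.csv")
```

Because the convention is uniform, a query language only needs lifecycle criteria (partner, feed, version, state) to resolve a file, never ad hoc knowledge of where a particular partner's data happens to live.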
Data Explainability
- Identify data immediately with a durable, traceable identity: in effect, each data set’s social security number, birth certificate, street address, and family tree
- Structure data changes consistently and provide indicators and documentation as to lifecycle stages
- Capture data evaluation outputs multi-dimensionally with discrete easily inspectable outputs, including matched and unmatched lines, rule-driven printouts, run logs, lineage and quality events, error state events, variables captured, and runtime metrics
- Make tracing changes through process steps simple and repeatable
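The durable identity in the first bullet can be illustrated with a content-derived fingerprint: identical bytes always produce the same id, so a file can be recognized across copies and lifecycle stages, and any change produces a visibly different id. SHA-256 here is an assumption for illustration, not necessarily the scheme FlightPath uses.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Content-derived identity: the same bytes always map to the same id,
    so a file is traceable wherever it travels. (SHA-256 is illustrative.)"""
    return "sha256:" + hashlib.sha256(data).hexdigest()

a = fingerprint(b"sku,quantity\nA-1,3\n")
b = fingerprint(b"sku,quantity\nA-1,3\n")  # same content, same identity
c = fingerprint(b"sku,quantity\nA-1,4\n")  # one changed byte, new identity
```

A content hash also makes tracing changes repeatable: two process steps that claim to have handled "the same file" can be checked mechanically.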
Principles
FlightPath is built according to principles derived from observing what has worked in practice for companies with robust data file feed handling. While every data file feed has unique aspects, knowing what works lets you judge where to be flexible and where to hold the line for the more correct approach.
Immutability
Immutable data means never changing a data set in place, but rather creating a copy of the data when it is modified. The benefits of immutable data are:
- The state of the data set is never in doubt
- Replaying a change or rewinding a process to try something different is easy
- Processing results are deterministic and explainable
There are, of course, costs: primarily the cost of data storage and the overhead of managing more data sets. CsvPath Framework substantially limits the latter by automating a consistent, pre-built process of data staging and processing. Storage cost is still a consideration, but storage prices are low and falling. Moreover, cloud data stores can age data from hot to warm to cool tiers automatically and relatively transparently. In short, handled well, the payoff in quality and person-hour efficiency far outweighs the cost of immutability.
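The copy-on-write idea behind immutability can be sketched in a few lines: every modification appends a new snapshot rather than editing the current one. This is a toy model for illustration, not how CsvPath Framework stores data internally.

```python
def new_version(versions: list[list[str]], transform) -> list[str]:
    """Copy-on-write sketch: a change appends a new snapshot; earlier
    versions are never mutated, so any state can be replayed or diffed."""
    latest = versions[-1]
    versions.append([transform(line) for line in latest])
    return versions[-1]

versions = [["a-1", "b-2"]]                  # v0: data as received
upgraded = new_version(versions, str.upper)  # v1: an upgraded copy
```

Because v0 survives untouched, "rewinding a process to try something different" is just re-running the transform against an earlier snapshot.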
Idempotence
Idempotent data processing means that processing runs produce the exact same output every time you run them on any given input. Fast forensics requires that processing steps can be understood easily. When processing the same data doesn’t lead to the same result, understanding why bad data happened becomes dramatically harder. Idempotent processing plus immutable data yield deterministic outcomes that are easy to assess, explain, and evolve.
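A small sketch of what idempotence means in practice: the output below depends only on the input content, never on a clock, a counter, or iteration order, so re-running on the same rows is byte-identical. The run-id scheme is a hypothetical example.

```python
import hashlib

def run_output(rows: list[str]) -> dict:
    """Idempotent processing sketch: output is a pure function of input.
    No timestamps or random ids, so a re-run reproduces the result exactly,
    which is what makes forensics on a bad run tractable."""
    kept = sorted(r for r in rows if r)  # deterministic order, drop blanks
    digest = hashlib.sha256("\n".join(kept).encode()).hexdigest()
    return {"rows": kept, "run_id": digest[:12]}

first = run_output(["b", "", "a"])
second = run_output(["b", "", "a"])  # identical to the first run
```

Contrast this with stamping each run with `datetime.now()`: the data might match, but the runs would no longer be directly comparable.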
Keep It Simple
Declarative processing
In general, there are two types of data processing tools: procedural and declarative. A procedural process changes data through a series of logical commands, such as “for every line in dataset increase line_count by 1”. A declarative process instead simply states an outcome and lets the processing environment act accordingly behind the scenes. SQL, a declarative language, would write the same logic as “select count(*) from dataset”. The benefit of declarative processing is simplicity and a lower chance of error. FlightPath encourages declarative processing through CsvPath Validation Language and CsvPath Reference Language.
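The two counting examples above can be run side by side. The procedural version spells out how to count; the declarative SQL version, here executed through Python's built-in sqlite3 module, only states what is wanted. The table and rows are made up for the demonstration.

```python
import sqlite3

rows = [("A-1",), ("A-2",), ("A-3",)]

# Procedural: spell out *how* to count, step by step.
line_count = 0
for _ in rows:
    line_count += 1

# Declarative: state *what* you want; the engine decides how.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dataset (sku TEXT)")
con.executemany("INSERT INTO dataset VALUES (?)", rows)
(sql_count,) = con.execute("SELECT count(*) FROM dataset").fetchone()
```

Both arrive at the same answer, but the declarative form leaves no loop variable to get wrong, which is the lower-chance-of-error point above.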
Linear focus
Data preboarding is the leading edge of data ingestion. Its purpose is limited to the critical goals of capture, identification, validation and upgrading, and metadata publishing. Anything else is out of scope. Unlike ETL/ELT or big data batch or stream processing, preboarding moves data in a linear way through a simple lifecycle that turns untrusted raw data into known-good raw data. Other processes may require joins, splits, calculations, mastering, summation, augmentation, restructuring, etc. But preboarding is exclusively about predictably and cheaply building data trust before any of that happens.
Small contexts
Small projects with limited scope and infrequent changes are lower risk. The more frequent the changes, the smaller the scope should be.
CsvPath Framework provides the consistency and tooling that let you run small projects in a distributed fashion, efficiently and with centralized awareness and control. FlightPath gives you the ability to spin up a CsvPath Framework project and deploy it to production in minutes. FlightPath and the Framework namespace and version assets and data at every step, so projects can live completely disconnected or mix, match, and overlay one another without adding confusion and risk.
That separation plus consistency lets you scope projects down to a single data partner, or even individual data feeds per partner. You can support any number of backend storage systems, heterogeneous file types, or event-based integrations, while at the same time, capturing data to a single data lake, a single observability system, and/or using a single naming, metadata, and file location convention. Truly the best of both worlds.