Batch vs. Incremental Pipelines
Your deduplication pipeline can be implemented in either batch or incremental versions. These are not mutually exclusive options, and many organizations will end up building both.
The Medplum team recommends starting with a batch pipeline. If your source systems are short-lived or change infrequently, a batch pipeline may be sufficient. Even if you end up building incremental pipelines, batch pipelines are typically easier to get started with as you iterate on your matching and merge rules.
Batch pipelines run as offline jobs that consider all records at once to produce sets of patient matches. Most implementations schedule these pipelines to run at a regular interval. Because all-pairs matching is an O(N²) problem, these pipelines are primarily constrained by memory rather than latency.
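To see why the O(N²) growth matters, consider the number of pairwise comparisons a naive all-pairs match performs. The helper below is a hypothetical illustration, not part of any Medplum API:

```typescript
// Number of distinct record pairs for N records: N * (N - 1) / 2.
// At 1 million patients this is roughly 5e11 comparisons, which is why
// batch pipelines reduce the candidate space before fuzzy matching.
function pairCount(n: number): number {
  return (n * (n - 1)) / 2;
}
```

For example, `pairCount(1_000)` is 499,500 comparisons, while `pairCount(1_000_000)` is nearly 500 billion.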
Typically, these pipelines compute matches in a data warehouse or a compute engine, but they can also run in a Medplum Bot with sufficient memory. A typical workflow is:
- Export patient data from Medplum into the appropriate data warehouse or compute engine (e.g., Spark). Note that even large patient datasets should fit into local memory (1M patients < 10 GB), so distributed computation is not strictly required. See our analytics guide for more info.
- Use matching rules to detect matched pairs of records. Because this is an O(N²) operation, we recommend using some form of exact matching rules to reduce the cardinality of the problem before applying "fuzzy matching."
- Use merging rules to combine matched pairs into sets and create the master record.
- Use the Medplum API to update the `Patient.link` elements for all records.
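The matching steps above can be sketched in TypeScript. This is a minimal illustration under stated assumptions, not Medplum's implementation: the `PatientRow` shape, the birth-date blocking key, and the `isFuzzyMatch` rule are all placeholders chosen for the example.

```typescript
// Hypothetical minimal patient shape; a real pipeline would export FHIR Patient resources.
interface PatientRow {
  id: string;
  givenName: string;
  familyName: string;
  birthDate: string; // YYYY-MM-DD
}

// Blocking step: group records by an exact key (here, birth date) so the
// expensive pairwise comparison only runs within each block, reducing the
// cardinality of the O(N²) problem.
function blockByBirthDate(rows: PatientRow[]): Map<string, PatientRow[]> {
  const blocks = new Map<string, PatientRow[]>();
  for (const row of rows) {
    const block = blocks.get(row.birthDate) ?? [];
    block.push(row);
    blocks.set(row.birthDate, block);
  }
  return blocks;
}

// A toy "fuzzy" rule: case-insensitive family name match plus a matching
// first initial. Real rules would use phonetic or edit-distance comparisons.
function isFuzzyMatch(a: PatientRow, b: PatientRow): boolean {
  return (
    a.familyName.toLowerCase() === b.familyName.toLowerCase() &&
    a.givenName[0]?.toLowerCase() === b.givenName[0]?.toLowerCase()
  );
}

// Pairwise comparison within each block, emitting matched ID pairs.
function findMatchedPairs(rows: PatientRow[]): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  for (const block of blockByBirthDate(rows).values()) {
    for (let i = 0; i < block.length; i++) {
      for (let j = i + 1; j < block.length; j++) {
        if (isFuzzyMatch(block[i], block[j])) {
          pairs.push([block[i].id, block[j].id]);
        }
      }
    }
  }
  return pairs;
}
```

The resulting pairs would then be combined into match sets by your merging rules before the master record is created.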
Each invocation of an incremental pipeline considers a single record and finds all matches. Typically, these pipelines are run per-source-record at the time of creation or update. As these pipelines are typically used to manage high-frequency, event-driven updates, latency is more of a concern than memory.
Incremental pipelines can often be implemented using Medplum Bots, and a typical workflow is:
- Set up a Bot to listen for updates to patient records.
- Use matching rules to detect matching records. Because incremental pipelines only consider matches for a single record, they are less memory-constrained and can apply more "fuzzy matching" rules.
- Use merging rules to update the master patient record.
- Use the Medplum API to update the `Patient.link` elements for all relevant records.
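The incremental matching step can be sketched as follows. This is a simplified, hypothetical example: a real Bot would fetch candidates via the Medplum search API rather than an in-memory array, and the `PatientRow` shape and matching rules are assumptions chosen for illustration.

```typescript
// Hypothetical minimal patient shape for the example.
interface PatientRow {
  id: string;
  givenName: string;
  familyName: string;
  birthDate: string; // YYYY-MM-DD
}

// Incremental matching: compare one incoming record against existing candidates.
// Because only a single record is considered per invocation, the cost is O(N)
// per update rather than O(N²), so latency matters more than memory.
function findMatchesForRecord(incoming: PatientRow, existing: PatientRow[]): string[] {
  return existing
    .filter((candidate) => candidate.id !== incoming.id)
    .filter(
      (candidate) =>
        candidate.birthDate === incoming.birthDate &&
        candidate.familyName.toLowerCase() === incoming.familyName.toLowerCase()
    )
    .map((candidate) => candidate.id);
}
```

A Bot listening for `Patient` updates would run this logic on each create or update event, then apply the merging rules to the returned matches.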
The next section will discuss matching potential duplicate records.