# Initial Load Mode
Starting with release 2.2.0, the consumer can operate in a special mode called initial_load. This mode is optimized for high-volume data ingestion.
## Consumer Modes

The consumer supports two modes:

| Mode | Description |
|---|---|
| `streaming` | Default mode - standard behavior consistent with previous versions |
| `initial_load` | Optimized for high-volume data ingestion |
## When to Use Initial Load Mode
The initial_load mode is recommended for:
- Customers handling large datasets from Solifi upstream systems (typically more than 5 million messages)
- Faster initial data ingestion
- Bootstrapping a new database
The consumer cannot run permanently in initial_load mode. Once data has been loaded, you must switch back to streaming mode.
## How Initial Load Mode Works

### Streaming Mode (Default)
In streaming mode, the consumer follows an update-if-present principle:
- Inserts a new record if it doesn't exist
- Updates it if it does exist
While this ensures data consistency, it can significantly slow down message processing with very large datasets.
### Initial Load Mode
In initial_load mode, the consumer operates on an insert-only principle:
- Processes one topic partition at a time
- Bypasses update locks
- Significantly faster ingestion
- Consumer stops automatically when complete
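The difference between the two modes can be sketched in SQL terms. This is an illustrative comparison, not the consumer's actual statements, and the table name `contracts` is hypothetical:

```sql
-- Streaming mode: update-if-present (upsert), illustrated with a T-SQL MERGE.
-- Every message requires a key lookup and takes an update lock.
MERGE contracts AS target
USING (SELECT @id AS id, @name AS name) AS source
ON target.id = source.id
WHEN MATCHED THEN
    UPDATE SET name = source.name
WHEN NOT MATCHED THEN
    INSERT (id, name) VALUES (source.id, source.name);

-- Initial load mode: insert-only. No lookup and no update locks,
-- which is what makes bulk ingestion substantially faster.
INSERT INTO contracts (id, name) VALUES (@id, @name);
```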
### Performance Comparison
| Scenario | Streaming Mode | Initial Load Mode |
|---|---|---|
| 140 topics, 100M messages | 10-12 days (single consumer) | ~4 hours (8 consumers) |
| Database locking | Update locks | No locking |
| Memory usage | Moderate | Higher during load |
## Running Initial Load Mode

The process involves four steps:

1. Perform a dry run
2. Execute the initial load
3. Monitor load progress
4. Switch back to streaming mode
### Step 1: Perform a Dry Run
Before performing the actual data load, start with a clean backend database.
The dry run phase identifies all topics and partitions and records offset information without consuming any messages.
```yaml
solifi:
  initial-load:
    enabled: true           # Enables initial_load mode
    dryrun: true            # Performs discovery only
    batch-size: 10000       # Optional. Messages per batch (default: 100000)
    save-full-audit: false  # Optional. Save all audit records (default: false)
    clientId: load-app-1    # Unique identifier for this consumer instance
```
The process completes within a few seconds and creates a database table named `lp_initial_load`.
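After the dry run, you can inspect the discovered topics and offset ranges directly, for example:

```sql
-- Inspect the partitions and offset ranges discovered by the dry run
SELECT topic_name, partition_id, start_offset, end_offset, status
FROM lp_initial_load
ORDER BY topic_name, partition_id;
```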
#### `lp_initial_load` Table Structure

| Column | Description |
|---|---|
| `topic_name` | Name of the Kafka topic |
| `partition_id` | Partition number of the topic |
| `start_offset` | Offset from which the consumer starts reading |
| `end_offset` | Offset up to (but not including) which the consumer reads |
| `status` | Current processing status |
| `total_entries` | Total number of unique messages identified |
| `load_duration_ms` | Time taken to process the partition |
#### Status Lifecycle

| State | Description | Next States |
|---|---|---|
| `INITIAL` | Default state | `ALLOCATED` |
| `ALLOCATED` | Reserved for processing | `LOADED`, `FAILED` |
| `LOADED` | Data read up to `end_offset - 1` | `SAVED`, `FAILED` |
| `SAVED` | Records written to database | `COMPLETED`, `FAILED` |
| `COMPLETED` | Partition fully processed | (Terminal) |
| `FAILED` | Processing error occurred | (Manual reset required) |
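The lifecycle can be checked directly with SQL, for example to spot partitions that ended in an error state and need a manual reset:

```sql
-- Partitions that require manual intervention
SELECT topic_name, partition_id, status
FROM lp_initial_load
WHERE status = 'FAILED'
ORDER BY topic_name, partition_id;
```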
### Step 2: Execute the Initial Load
After the dry run, modify your configuration:
```yaml
solifi:
  initial-load:
    enabled: true
    dryrun: false       # Enable actual loading
    batch-size: 20000
    clientId: load-app-1
```
Each clientId instance processes one topic at a time and handles all of its partitions sequentially.
#### Running Multiple Instances
To speed up processing, run multiple consumer instances in parallel:
- Assign a different `clientId` value to each instance
- Use the same Kafka consumer group ID
```yaml
spring:
  kafka:
    consumer:
      group-id: mycompany-initial-load  # Same for all instances
```
When multiple instances start with unique `clientId` values, each acquires a lock on one topic. After processing all of that topic's partitions, the instance looks for another topic with `status = INITIAL`. If none remain, the instance shuts down.
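As a sketch, a second parallel instance would reuse the same configuration and consumer group and differ only in its `clientId` (the value `load-app-2` is illustrative):

```yaml
# Instance 2 - identical to instance 1 except for clientId
solifi:
  initial-load:
    enabled: true
    dryrun: false
    batch-size: 20000
    clientId: load-app-2  # unique per instance
spring:
  kafka:
    consumer:
      group-id: mycompany-initial-load  # same consumer group for all instances
```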
### Step 3: Monitor Load Progress
Monitor progress using SQL queries against the lp_initial_load table.
#### View Current Progress

```sql
SELECT *
FROM lp_initial_load
WHERE status NOT IN ('INITIAL')
ORDER BY last_updated_ts DESC;
```
#### View Load Duration (Melbourne Time)

```sql
SELECT
    SWITCHOFFSET(MIN(load_started_ts), '+10:00') AS Start,
    SWITCHOFFSET(MAX(last_updated_ts), '+10:00') AS Last,
    DATEDIFF(MINUTE, MIN(load_started_ts), MAX(last_updated_ts)) AS Minutes,
    CONVERT(VARCHAR(5), DATEADD(MINUTE, DATEDIFF(MINUTE, MIN(load_started_ts), MAX(last_updated_ts)), 0), 114) AS Duration
FROM lp_initial_load;
```
#### View Progress by Consumer Instance

```sql
SELECT status, status_info, COUNT(*) AS Count
FROM lp_initial_load
GROUP BY status, status_info
ORDER BY status, status_info;
```
### Step 4: Switch Back to Streaming Mode

Once all partitions are processed, verify that every record in `lp_initial_load` has `COMPLETED` status.
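A quick sanity check is to count the records that have not reached the terminal state; the result should be 0 before you switch modes:

```sql
-- Should return 0 once the load has fully completed
SELECT COUNT(*) AS incomplete
FROM lp_initial_load
WHERE status <> 'COMPLETED';
```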
Remove or disable the initial-load section:
```yaml
solifi:
  # initial-load:
  #   enabled: false
  #   dryrun: false
  #   batch-size: 20000
  #   clientId: load-app-1
```
The consumer will resume from the `end_offset` values recorded in the `lp_initial_load` table.
If upstream systems continue producing messages during the initial load, a small backlog may accumulate. The consumer automatically resumes from the latest offsets upon switching back to streaming mode.
## The `save-full-audit` Property
When save-full-audit is set to true, the consumer stores all records for a given key in the audit table. This differs from the default behavior.
### Default Behavior (`save-full-audit: false`)
Only the latest record for each key is saved in the audit table.
### With `save-full-audit: true`
Every record associated with a key is persisted.
### Example

Given a topic `names` with `id` as the message key:
| id | name |
|---|---|
| 1 | Sam |
| 2 | Fred |
| 3 | Brett |
| 2 | Rom |
| 2 | Rex |
Default behavior (`save-full-audit: false`), `names_audit` table:
| id | name | lp_db_action |
|---|---|---|
| 1 | Sam | INITIAL |
| 2 | Rex | INITIAL |
| 3 | Brett | INITIAL |
With `save-full-audit: true`, `names_audit` table:
| id | name | lp_db_action |
|---|---|---|
| 1 | Sam | INITIAL |
| 2 | Fred | INITIAL |
| 3 | Brett | INITIAL |
| 2 | Rom | INITIAL |
| 2 | Rex | INITIAL |
Enabling save-full-audit increases overall execution time.
## Recovering from Failed Loads
Do not attempt recovery while the consumer is running in streaming mode. This may result in inconsistent data.
Failures typically occur due to:
- Insufficient resources (especially memory)
- Database downtime
- Network interruptions
### Option 1: Full Restart (Recommended for Many Failures)
- Start with a fresh database (no existing data or audit tables)
- Repeat from Step 1: Perform a Dry Run
### Option 2: Partial Recovery (Recommended for Few Failures)

- Identify failed topics and partitions using `lp_initial_load`
- Drop their corresponding data and audit tables
- Reset their statuses to `INITIAL`
Example with two topics requiring recovery:

| topic_name | partition_id | status |
|---|---|---|
| topic_a | 0 | FAILED |
| topic_a | 1 | FAILED |
| topic_b | 0 | ALLOCATED |
| topic_b | 1 | SAVED |
Drop the tables and reset:
```sql
-- Drop tables first, then:
UPDATE lp_initial_load
SET status = 'INITIAL'
WHERE topic_name IN ('topic_a', 'topic_b');
```
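The drop statements depend on your table naming. Assuming each topic maps to a table of the same name plus a `_audit` companion (as in the `names`/`names_audit` example above), the cleanup might look like:

```sql
-- Hypothetical table names derived from the affected topics
DROP TABLE IF EXISTS topic_a;
DROP TABLE IF EXISTS topic_a_audit;
DROP TABLE IF EXISTS topic_b;
DROP TABLE IF EXISTS topic_b_audit;
```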
Restart the consumer in initial_load mode.
## Infrastructure Recommendations
Performance depends on:
- Volume of data per topic and partition
- Number of parallel consumer instances
- Database capacity (CPU, memory, storage)
### Sample Workload Results
| Parameter | Value |
|---|---|
| Total topics | 144 |
| Average partitions per topic | 6 |
| Largest partition | 1 million messages |
| Total messages | 46 million |

| Workload | Duration |
|---|---|
| Total messages (46M) | 2 hours |
| With Audit (92M) | 3 hours 47 mins |
| With Full Audit (95M) | 4 hours 17 mins |
### Configuration Used
- Consumer instances: 8 (each with 4 GB memory and 4 CPU)
- Database instance: 8 CPU, 16 GB memory
- Network: Local, no firewalls or packet inspection
### Post-Load Resources
After switching to streaming mode, resource requirements drop substantially:
- Typical configuration: 2 CPU / 2 GB memory per consumer instance
## Next Steps
- Learn about auditing
- Configure health monitoring
- Understand data refresh