# Initial Load Mode
Starting with release 2.2.0, the consumer can operate in a special mode called initial_load. This mode is optimized for high-volume data ingestion.
## Consumer Modes

The consumer supports two modes:

| Mode | Description |
|---|---|
| `streaming` | Default mode - standard behavior consistent with previous versions |
| `initial_load` | Optimized for high-volume data ingestion |
## When to Use Initial Load Mode
The initial_load mode is recommended for:
- Customers handling large datasets from Solifi upstream systems (typically more than 5 million messages)
- Faster initial data ingestion
- Bootstrapping a new database
The consumer cannot run permanently in initial_load mode. Once data has been loaded, you must switch back to streaming mode.
## How Initial Load Mode Works

### Streaming Mode (Default)
In streaming mode, the consumer follows an update-if-present principle:
- Inserts a new record if it doesn't exist
- Updates it if it does exist
While this ensures data consistency, it can significantly slow down message processing with very large datasets.
### Initial Load Mode
In initial_load mode, the consumer operates on an insert-only principle:
- Processes one topic partition at a time
- Bypasses update locks
- Significantly faster ingestion
- Consumer stops automatically when complete
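The difference between the two modes can be sketched in SQL terms. This is an illustrative comparison, not the consumer's actual statements, and the table name `contracts` is hypothetical:

```sql
-- Streaming mode: update-if-present (upsert), illustrated with a T-SQL MERGE.
-- Every message requires a key lookup and takes an update lock.
MERGE contracts AS target
USING (SELECT @id AS id, @name AS name) AS source
ON target.id = source.id
WHEN MATCHED THEN
    UPDATE SET name = source.name
WHEN NOT MATCHED THEN
    INSERT (id, name) VALUES (source.id, source.name);

-- Initial load mode: insert-only. No lookup and no update locks,
-- which is what makes bulk ingestion substantially faster.
INSERT INTO contracts (id, name) VALUES (@id, @name);
```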
### Performance Comparison
| Scenario | Streaming Mode | Initial Load Mode |
|---|---|---|
| 140 topics, 100M messages | 10-12 days (single consumer) | ~4 hours (8 consumers) |
| Database locking | Update locks | No locking |
| Memory usage | Moderate | Higher during load |
## Running Initial Load Mode

The process involves four steps:

1. Perform a dry run
2. Execute the initial load
3. Monitor load progress
4. Switch back to streaming mode
### Step 1: Perform a Dry Run
Before performing the actual data load, start with a clean backend database.
The dry run phase identifies all topics and partitions and records offset information without consuming any messages.
```yaml
solifi:
  initial-load:
    enabled: true           # Enables initial_load mode
    dryrun: true            # Performs discovery only
    batch-size: 10000       # Optional. Messages per batch (default: 100000)
    save-full-audit: false  # Optional. Save all audit records (default: false)
    clientId: load-app-1    # Unique identifier for this consumer instance
```
The process completes within a few seconds and creates a database table named `lp_initial_load`.
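After the dry run, you can inspect the discovered topics and offset ranges directly, for example:

```sql
-- Inspect the partitions and offset ranges discovered by the dry run
SELECT topic_name, partition_id, start_offset, end_offset, status
FROM lp_initial_load
ORDER BY topic_name, partition_id;
```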
#### `lp_initial_load` Table Structure

| Column | Description |
|---|---|
| `topic_name` | Name of the Kafka topic |
| `partition_id` | Partition number of the topic |
| `start_offset` | Offset from which the consumer starts reading |
| `end_offset` | Offset up to (but not including) which the consumer reads |
| `status` | Current processing status |
| `total_entries` | Total number of unique messages identified |
| `load_duration_ms` | Time taken to process the partition |
#### Status Lifecycle

| State | Description | Next States |
|---|---|---|
| `INITIAL` | Default state | `ALLOCATED` |
| `ALLOCATED` | Reserved for processing | `LOADED`, `FAILED` |
| `LOADED` | Data read up to `end_offset - 1` | `SAVED`, `FAILED` |
| `SAVED` | Records written to database | `COMPLETED`, `FAILED` |
| `COMPLETED` | Partition fully processed | (Terminal) |
| `FAILED` | Processing error occurred | (Manual reset required) |
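The lifecycle can be checked directly with SQL, for example to spot partitions that ended in an error state and need a manual reset:

```sql
-- Partitions that require manual intervention
SELECT topic_name, partition_id, status
FROM lp_initial_load
WHERE status = 'FAILED'
ORDER BY topic_name, partition_id;
```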
### Step 2: Execute the Initial Load
After the dry run, modify your configuration:
```yaml
solifi:
  initial-load:
    enabled: true
    dryrun: false       # Enable actual loading
    batch-size: 20000
    clientId: load-app-1
```
Each clientId instance processes one topic at a time and handles all of its partitions sequentially.
#### Running Multiple Instances
To speed up processing, run multiple consumer instances in parallel:
- Assign a different `clientId` value to each instance
- Use the same Kafka consumer group ID
```yaml
spring:
  kafka:
    consumer:
      group-id: mycompany-initial-load  # Same for all instances
```
When multiple instances start with unique `clientId` values, each acquires a lock on one topic. After processing all of that topic's partitions, the instance looks for another topic with `status = INITIAL`. If none remain, the instance shuts down.
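As a sketch, a second parallel instance would reuse the same configuration and consumer group and differ only in its `clientId` (the value `load-app-2` is illustrative):

```yaml
# Instance 2 - identical to instance 1 except for clientId
solifi:
  initial-load:
    enabled: true
    dryrun: false
    batch-size: 20000
    clientId: load-app-2  # unique per instance
spring:
  kafka:
    consumer:
      group-id: mycompany-initial-load  # same consumer group for all instances
```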
### Step 3: Monitor Load Progress
Monitor progress using SQL queries against the lp_initial_load table.
#### View Current Progress

```sql
SELECT *
FROM lp_initial_load
WHERE status NOT IN ('INITIAL')
ORDER BY last_updated_ts DESC;
```
#### View Load Duration (Melbourne Time)

```sql
SELECT
    SWITCHOFFSET(MIN(load_started_ts), '+10:00') AS Start,
    SWITCHOFFSET(MAX(last_updated_ts), '+10:00') AS Last,
    DATEDIFF(MINUTE, MIN(load_started_ts), MAX(last_updated_ts)) AS Minutes,
    CONVERT(VARCHAR(5), DATEADD(MINUTE, DATEDIFF(MINUTE, MIN(load_started_ts), MAX(last_updated_ts)), 0), 114) AS Duration
FROM lp_initial_load;
```
#### View Progress by Consumer Instance

```sql
SELECT status, status_info, COUNT(*) AS Count
FROM lp_initial_load
GROUP BY status, status_info
ORDER BY status, status_info;
```
### Step 4: Switch Back to Streaming Mode

Once all partitions are processed, verify that every record in `lp_initial_load` has `COMPLETED` status.
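A quick sanity check is to count the records that have not reached the terminal state; the result should be 0 before you switch modes:

```sql
-- Should return 0 once the load has fully completed
SELECT COUNT(*) AS incomplete
FROM lp_initial_load
WHERE status <> 'COMPLETED';
```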
Remove or disable the initial-load section:
```yaml
solifi:
  # initial-load:
  #   enabled: false
  #   dryrun: false
  #   batch-size: 20000
  #   clientId: load-app-1
```
The consumer will resume from the `end_offset` values recorded in the `lp_initial_load` table.
If upstream systems continue producing messages during the initial load, a small backlog may accumulate. The consumer automatically resumes from the latest offsets upon switching back to streaming mode.
## The `save-full-audit` Property
When save-full-audit is set to true, the consumer stores all records for a given key in the audit table. This differs from the default behavior.
### Default Behavior (`save-full-audit: false`)
Only the latest record for each key is saved in the audit table.
### With `save-full-audit: true`
Every record associated with a key is persisted.
### Example

Given a topic `names` with `id` as the message key:
| id | name |
|---|---|
| 1 | Sam |
| 2 | Fred |
| 3 | Brett |
| 2 | Rom |
| 2 | Rex |
Default behavior (`save-full-audit: false`), `names_audit` table:
| id | name | lp_db_action |
|---|---|---|
| 1 | Sam | INITIAL |
| 2 | Rex | INITIAL |
| 3 | Brett | INITIAL |
With `save-full-audit: true`, `names_audit` table:
| id | name | lp_db_action |
|---|---|---|
| 1 | Sam | INITIAL |
| 2 | Fred | INITIAL |
| 3 | Brett | INITIAL |
| 2 | Rom | INITIAL |
| 2 | Rex | INITIAL |
Enabling save-full-audit increases overall execution time.
## Recovering from Failed Loads
Do not attempt recovery while the consumer is running in streaming mode. This may result in inconsistent data.
Failures typically occur due to:
- Insufficient resources (especially memory)
- Database downtime
- Network interruptions
### Option 1: Full Restart (Recommended for Many Failures)
- Start with a fresh database (no existing data or audit tables)
- Repeat from Step 1: Perform a Dry Run
### Option 2: Partial Recovery (Recommended for Few Failures)

- Identify failed topics and partitions using `lp_initial_load`
- Drop their corresponding data and audit tables
- Reset their statuses to `INITIAL`
Example with two topics requiring recovery:

| topic_name | partition_id | status |
|---|---|---|
| topic_a | 0 | FAILED |
| topic_a | 1 | FAILED |
| topic_b | 0 | ALLOCATED |
| topic_b | 1 | SAVED |
Drop the tables and reset:
```sql
-- Drop tables first, then:
UPDATE lp_initial_load
SET status = 'INITIAL'
WHERE topic_name IN ('topic_a', 'topic_b');
```
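The drop statements depend on your table naming. Assuming each topic maps to a table of the same name plus a `_audit` companion (as in the `names`/`names_audit` example above), the cleanup might look like:

```sql
-- Hypothetical table names derived from the affected topics
DROP TABLE IF EXISTS topic_a;
DROP TABLE IF EXISTS topic_a_audit;
DROP TABLE IF EXISTS topic_b;
DROP TABLE IF EXISTS topic_b_audit;
```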
Restart the consumer in initial_load mode.
## Infrastructure Recommendations
Performance depends on:
- Volume of data per topic and partition
- Number of parallel consumer instances
- Database capacity (CPU, memory, storage)
### Sample Workload Results
| Parameter | Value |
|---|---|
| Total topics | 144 |
| Average partitions per topic | 6 |
| Largest partition | 1 million messages |
| Total messages | 46 million |

| Workload | Duration |
|---|---|
| Total messages (46M) | 2 hours |
| With Audit (92M) | 3 hours 47 mins |
| With Full Audit (95M) | 4 hours 17 mins |
### Configuration Used
- Consumer instances: 8 (each with 4 GB memory and 4 CPU)
- Database instance: 8 CPU, 16 GB memory
- Network: Local, no firewalls or packet inspection
### Post-Load Resources
After switching to streaming mode, resource requirements drop substantially:
- Typical configuration: 2 CPU / 2 GB memory per consumer instance
## Next Steps
- Learn about auditing
- Configure health monitoring
- Understand data refresh