Data Refresh
There are cases where the data in the consumer database needs to be refreshed, i.e. rebuilt by re-reading the Kafka topics. This guide covers the process and the key considerations for performing a data refresh.
When to Perform a Data Refresh
Common scenarios requiring a data refresh:
- Data corruption - Issues with the consumer database
- Schema changes - Significant upstream schema modifications
- Testing - Setting up a fresh test environment
- Compliance - Data retention policy requirements
- Upgrades - Major consumer version upgrades
Data Refresh Process
Step 1: Stop the Consumer
Ensure the consumer is stopped before making changes.
# Docker
docker compose down
# Kubernetes
kubectl scale deployment solifi-consumer --replicas=0
# Standalone
# Stop the Java process
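Before touching the database, confirm that no instance is still attached to the consumer group. One way to check, assuming the Kafka CLI tools and a broker address are at hand (both values below are illustrative):
# An inactive group is reported as having no active members
kafka-consumer-groups.sh --bootstrap-server broker:9092 \
  --describe --group mycompany-consumer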
Step 2: Clean the Database
You have two options:
Option A: Delete the entire database (Full refresh)
-- Drop and recreate the database
DROP DATABASE solifi_data;
CREATE DATABASE solifi_data;
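If you might need the old data again, take a backup before dropping (the checklist below calls this out). The DATEADD/GETDATE syntax used later in this guide suggests SQL Server; a minimal sketch for that engine, with an illustrative file path:
-- Full backup before the drop (path is illustrative; use your engine's equivalent elsewhere)
BACKUP DATABASE solifi_data
TO DISK = N'/var/backups/solifi_data_refresh.bak'
WITH INIT;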
Option B: Delete specific tables (Partial refresh)
-- Delete data tables
DROP TABLE IF EXISTS table_name;
DROP TABLE IF EXISTS table_name_audit;
-- Or truncate (faster, and keeps the table structure)
TRUNCATE TABLE table_name;
TRUNCATE TABLE table_name_audit;
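With many topic tables, writing each DROP by hand is tedious. A sketch that generates the statements from the catalog so you can review them before running (SQL Server syntax, and assuming the audit tables follow the _audit suffix convention used in this guide):
-- Generate DROP statements for main tables only (excludes *_audit); review, then execute
SELECT 'DROP TABLE IF EXISTS ' + QUOTENAME(TABLE_NAME) + ';'
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
  AND TABLE_NAME NOT LIKE '%\_audit' ESCAPE '\';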
Step 3: Change the Consumer Group ID
Update the group-id in application.yml to force the consumer to read from the beginning:
spring:
  kafka:
    consumer:
      group-id: mycompany-consumer-refresh-20241221  # New unique ID
Kafka tracks which messages have been read per consumer group. Using a new group ID forces the consumer to start reading from the earliest available offset.
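This relies on the consumer falling back to the earliest offset when the new group has no committed offsets. In a Spring Boot application.yml that is the auto-offset-reset property; worth confirming it is set, since the snippet below is an assumption about this consumer's configuration rather than a documented default:
spring:
  kafka:
    consumer:
      group-id: mycompany-consumer-refresh-20241221
      auto-offset-reset: earliest  # start from the beginning when the group has no offsets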
Step 4: Restart the Consumer
# Docker
docker compose up -d
# Kubernetes
kubectl scale deployment solifi-consumer --replicas=1
# Standalone
java -jar solifi-consumer-<version>.jar
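Once it is up, tail the logs to confirm the consumer joined the new group and started reading from the earliest offset (commands match the deployment styles above):
# Docker
docker compose logs -f
# Kubernetes
kubectl logs deployment/solifi-consumer -f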
Preserving Audit History
If you want to preserve audit history during a data refresh, do not delete the _audit tables.
Solifi compresses data after 7 days, so a refresh can lose the audit history of older records. To keep historical audit records:
- Keep the existing _audit tables intact
- Only drop/truncate the main data tables
- Change the group-id
- Restart the consumer
The consumer will append new audit events to the existing audit tables. You might see some duplicate records, but the value in the lp_db_insert_user column will differ, reflecting the new group-id.
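To see the split between the old and new writers after the refresh, count audit rows per inserting user; a quick sketch against one of the audit tables named below, assuming the lp_db_insert_user column is present on every audit table:
-- Audit rows per inserting user (old vs. new group-id)
SELECT lp_db_insert_user, COUNT(*) as row_count
FROM cs_master_nf_audit
GROUP BY lp_db_insert_user;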
Example: Preserving Audit Tables
-- Only drop main tables, keep audit tables
DROP TABLE IF EXISTS cs_master_nf;
DROP TABLE IF EXISTS ls_master_nf;
-- DO NOT drop: cs_master_nf_audit, ls_master_nf_audit
Using Initial Load Mode for Large Refreshes
For large data volumes (>5 million messages), consider using Initial Load Mode instead of standard streaming:
- Clean the database completely
- Configure initial-load mode
- Run the dry run
- Execute the initial load
- Switch back to streaming mode
This is significantly faster than a standard data refresh.
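The exact switches live in the initial load guide; purely to illustrate the shape such a configuration might take, a hypothetical sketch (the property names below are placeholders, not the consumer's real keys):
# Hypothetical sketch only -- see the initial load guide for the real property names
solifi:
  initial-load:
    enabled: true
    dry-run: true  # validate counts first, then rerun with dry-run: false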
Post-Refresh Verification
After the refresh completes:
1. Check Table Counts
SELECT
    'cs_master_nf' as table_name,
    COUNT(*) as row_count
FROM cs_master_nf
UNION ALL
SELECT
    'ls_master_nf' as table_name,
    COUNT(*) as row_count
FROM ls_master_nf;
2. Verify Consumer Health
curl http://localhost:8080/actuator/health
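A healthy instance returns Spring Boot's standard actuator payload:
{"status":"UP"}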
3. Check Consumer Lag
Monitor that consumer lag is decreasing and eventually reaches zero.
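One way to watch it, again assuming the Kafka CLI tools are available (broker address is illustrative); the LAG column, the gap between the log end offset and the committed offset per partition, should trend toward zero:
kafka-consumer-groups.sh --bootstrap-server broker:9092 \
  --describe --group mycompany-consumer-refresh-20241221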
4. Review Error Log
SELECT COUNT(*) FROM error_log
WHERE created_at > DATEADD(hour, -1, GETDATE());
Partial Topic Refresh
To refresh only specific topics without affecting others:
- Stop the consumer
- Drop only the tables for the target topics
- Update application.yml to consume only those topics temporarily (see the sketch after this list)
- Use a new group-id specific to this refresh
- Run the consumer until caught up
- Restore original configuration
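How topics are selected depends on this consumer's configuration reference; the snippet below is a hypothetical shape only, with the topics key as a placeholder:
spring:
  kafka:
    consumer:
      group-id: mycompany-consumer-topicrefresh-20241221  # new ID just for this refresh
solifi:
  topics:  # hypothetical key -- check the consumer's configuration reference
    - cs_master_nf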
Refresh Checklist
- Stop all consumer instances
- Backup database (if needed)
- Document current group-id
- Drop/truncate required tables
- Decide on audit table handling
- Update group-id in configuration
- Start consumer
- Monitor progress
- Verify data completeness
- Check error log
Next Steps
- Learn about initial load mode for faster refreshes
- Configure health monitoring
- Review the sizing guide