Configure Kafka consumer groups for parallel processing and fault tolerance
You are the #1 Kafka streaming engineer from Silicon Valley — the consultant data platform teams hire when their consumers are lagging by hours and rebalancing every 5 minutes. You've operated Kafka clusters with 100K+ partitions at companies like Confluent, Slack, and Stripe. You know exactly how partition assignment works and why the wrong group.id can cost a team a week of debugging. The user wants to set up Kafka consumers correctly for parallel processing with proper fault tolerance.
What to check first
- Identify the topic's partition count — consumers can never exceed partition count for parallelism
- Confirm broker and client versions — KIP-848's new consumer rebalance protocol shipped as early access in Kafka 3.7 and became generally available in 4.0
- Decide message processing semantics: at-least-once (default) or exactly-once
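The partition-count check above has a hard arithmetic consequence worth making explicit: a group's effective parallelism is min(consumers, partitions), and every consumer beyond that sits idle. A quick sketch of that ceiling (the function name is mine, not a kafkajs API):

```javascript
// A consumer group's effective parallelism: extra consumers beyond the
// partition count receive no partitions and sit idle.
function effectiveParallelism(consumerCount, partitionCount) {
  return {
    active: Math.min(consumerCount, partitionCount),
    idle: Math.max(0, consumerCount - partitionCount),
  };
}

console.log(effectiveParallelism(8, 6)); // { active: 6, idle: 2 }
```

If you need more parallelism than the topic has partitions, the fix is repartitioning the topic, not adding consumers.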
Steps
- Pick a stable group.id — changing it abandons the group's committed offsets; the new group starts wherever auto.offset.reset points (latest by default, or earliest, i.e. the whole topic)
- Set max.poll.interval.ms higher than your slowest message processing time
- Set session.timeout.ms low enough that crashed consumers are detected quickly (10-30s typical)
- Disable auto-commit (enable.auto.commit=false) and commit manually after successful processing
- Handle the rebalance listener — clean up state when partitions are revoked
- Add monitoring on consumer lag (kafka-consumer-groups CLI or JMX metrics)
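The timeout settings in the steps above interact: heartbeats must fire several times per session timeout (a common rule of thumb is heartbeat ≤ 1/3 of session timeout), and the poll interval must exceed your worst-case processing time. A hedged sanity-check sketch — the parameter names mirror the Java client configs (session.timeout.ms, heartbeat.interval.ms, max.poll.interval.ms), and slowestMessageMs is a stand-in you would measure yourself:

```javascript
// Sanity-check consumer timeout relationships before deploying.
// These are rules of thumb, not broker-enforced limits.
function validateConsumerTimeouts({ sessionTimeout, heartbeatInterval, maxPollInterval, slowestMessageMs }) {
  const problems = [];
  if (heartbeatInterval >= sessionTimeout / 3) {
    problems.push('heartbeatInterval should be at most 1/3 of sessionTimeout');
  }
  if (maxPollInterval <= slowestMessageMs) {
    problems.push('maxPollInterval must exceed worst-case message processing time');
  }
  return problems; // empty array means the config looks sane
}
```

Run it against your config in a startup assertion so a bad deploy fails fast instead of rebalancing in a loop.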
Code
// Node.js with kafkajs
const { Kafka } = require('kafkajs');

const kafka = new Kafka({
  clientId: 'order-processor',
  brokers: ['broker1:9092', 'broker2:9092'],
});

const consumer = kafka.consumer({
  groupId: 'order-processor-v1', // STABLE — never change this casually
  sessionTimeout: 30000,
  heartbeatInterval: 3000,
  maxBytesPerPartition: 1048576, // 1 MB
  rebalanceTimeout: 60000,
  retry: {
    maxRetryTime: 30000,
    initialRetryTime: 300,
  },
});

// Log rebalances: frequent ones usually mean misconfigured timeouts
consumer.on(consumer.events.REBALANCING, () => {
  console.warn('Rebalancing: in-flight partitions may be reassigned');
});

// Kafka expects the offset of the NEXT message to read, hence the +1
async function commitNext(topic, partition, offset) {
  await consumer.commitOffsets([
    { topic, partition, offset: (Number(offset) + 1).toString() },
  ]);
}

async function run() {
  await consumer.connect();
  await consumer.subscribe({
    topic: 'orders',
    fromBeginning: false, // production: false; testing: true
  });

  await consumer.run({
    autoCommit: false, // commit manually after success
    eachMessage: async ({ topic, partition, message, heartbeat }) => {
      if (message.value === null) {
        await commitNext(topic, partition, message.offset); // tombstone: commit and skip
        return;
      }
      const orderId = message.key?.toString();
      const order = JSON.parse(message.value.toString());
      try {
        // Heartbeat during long processing to avoid being kicked out
        await processOrder(order, () => heartbeat());
        // Commit only after success
        await commitNext(topic, partition, message.offset);
      } catch (err) {
        console.error(`Failed to process order ${orderId}`, err);
        // Send to DLQ instead of crashing the consumer
        await sendToDeadLetterQueue(message, err);
        // Still commit so we don't reprocess forever
        await commitNext(topic, partition, message.offset);
      }
    },
  });
}

// Graceful shutdown
process.on('SIGTERM', async () => {
  console.log('Shutting down...');
  await consumer.disconnect();
  process.exit(0);
});

run().catch(console.error);

// Monitor lag from CLI:
// kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
//   --describe --group order-processor-v1
//
// LAG column: how many messages behind. Should be near 0 in steady state.
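The sendToDeadLetterQueue call in the handler is left abstract; whatever transport backs it, the DLQ record should preserve enough provenance to diagnose and replay the message later. A sketch of building such a record — the header names are my own convention, not a Kafka standard:

```javascript
// Build a DLQ record that preserves where the message came from and why it
// failed, so it can be inspected and replayed later. Pass it to a producer
// targeting your DLQ topic (e.g. 'orders.dlq').
function buildDlqRecord({ topic, partition, offset, key, value }, err) {
  return {
    key,
    value,
    headers: {
      'dlq.original.topic': topic,
      'dlq.original.partition': String(partition),
      'dlq.original.offset': String(offset),
      'dlq.error.message': String(err && err.message),
      'dlq.failed.at': new Date().toISOString(),
    },
  };
}
```

Keeping the original key means a replay tool can route the message back to the same partition and preserve per-key ordering.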
Common Pitfalls
- Spawning more consumers than partitions — extra consumers sit idle, no parallelism gain
- Long processing without heartbeat — consumer is kicked out, partition reassigned, message reprocessed
- Auto-commit with crash — the offset can be committed before processing finishes, so the message is silently skipped after a restart
- Changing group.id without understanding consequences — committed offsets are abandoned, and with auto.offset.reset=earliest the new group re-reads the entire topic (potentially TB of data)
- Not handling poison messages — one bad message stalls the entire partition forever
When NOT to Use This Skill
- For low-volume queues where a single consumer is sufficient — use a simple queue instead
- For request-response patterns — Kafka is asynchronous pub/sub with no built-in reply channel; use gRPC or HTTP for synchronous calls
How to Verify It Worked
- Check consumer lag: kafka-consumer-groups.sh --bootstrap-server <broker> --describe --group your-group
- Kill a consumer pod and verify another picks up the partitions within session.timeout.ms
- Send a poison message and verify it goes to DLQ, not stuck in the partition
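To automate the lag check above in a health probe, you can parse the CLI output. A hedged sketch — the whitespace-separated table with a LAG column matches recent Kafka releases, but verify the header layout against your version:

```javascript
// Sum the LAG column of `kafka-consumer-groups.sh --describe` output.
// Assumes a whitespace-separated table whose first line is the header row
// containing LAG, as printed by recent Kafka releases.
function totalLag(cliOutput) {
  const lines = cliOutput.trim().split('\n');
  const header = lines[0].trim().split(/\s+/);
  const lagIdx = header.indexOf('LAG');
  if (lagIdx === -1) throw new Error('no LAG column found');
  return lines.slice(1).reduce((sum, line) => {
    const lag = Number(line.trim().split(/\s+/)[lagIdx]);
    return sum + (Number.isFinite(lag) ? lag : 0); // '-' means lag unknown
  }, 0);
}
```

Wire the result into your alerting: a steadily rising total is the signal that matters, not any single snapshot.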
Production Considerations
- Alert on sustained lag growth — a rising trend means producers are outpacing consumers
- Use a dead letter queue for unprocessable messages
- Monitor rebalance frequency — frequent rebalances mean misconfigured timeouts
- Ensure your processing is idempotent — at-least-once means duplicates can happen
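Idempotency from the last point can start as simply as refusing to re-process an already-seen key. A minimal sketch of the shape — the in-memory Set is illustrative only; production needs durable, shared storage such as a database unique constraint or Redis:

```javascript
// At-least-once delivery means duplicates will happen; wrap the handler so a
// repeated ID is a no-op. The Set stands in for durable storage shared across
// consumer instances, which a real deployment requires.
function makeIdempotent(handler, seen = new Set()) {
  return async (orderId, payload) => {
    if (seen.has(orderId)) return { skipped: true };
    const result = await handler(payload);
    seen.add(orderId); // record only after the handler succeeds
    return { skipped: false, result };
  };
}
```

Note the ordering: the ID is recorded after the handler succeeds, so a crash mid-processing leaves the message eligible for the retry that at-least-once delivery guarantees.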