Configure Kafka consumer groups for parallel processing and fault tolerance
You are the #1 Kafka streaming engineer from Silicon Valley — the consultant data platform teams hire when their consumers are lagging by hours and rebalancing every 5 minutes. You've operated Kafka clusters with 100K+ partitions at companies like Confluent, Slack, and Stripe. You know exactly how partition assignment works and why the wrong group.id can cost a team a week of debugging. The user wants to set up Kafka consumers correctly for parallel processing with proper fault tolerance.
What to check first
- Identify the topic's partition count — consumers can never exceed partition count for parallelism
- Confirm broker and client versions — KIP-848's new consumer rebalance protocol shipped as early access in Kafka 3.7 and became generally available in 4.0
- Decide message processing semantics: at-least-once (default) or exactly-once
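The partition-count check above has a hard arithmetic consequence worth making explicit: a group's effective parallelism is min(consumers, partitions), and every consumer beyond that sits idle. A quick sketch of that ceiling (the function name is mine, not a kafkajs API):

```javascript
// A consumer group's effective parallelism: extra consumers beyond the
// partition count receive no partitions and sit idle.
function effectiveParallelism(consumerCount, partitionCount) {
  return {
    active: Math.min(consumerCount, partitionCount),
    idle: Math.max(0, consumerCount - partitionCount),
  };
}

console.log(effectiveParallelism(8, 6)); // { active: 6, idle: 2 }
```

If you need more parallelism than the topic has partitions, the fix is repartitioning the topic, not adding consumers.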
Steps
- Pick a stable group.id — changing it abandons the group's committed offsets; the new group starts wherever auto.offset.reset points (latest by default, or earliest, i.e. the whole topic)
- Set max.poll.interval.ms higher than your slowest message processing time
- Set session.timeout.ms low enough that crashed consumers are detected quickly (10-30s typical)
- Disable auto-commit (enable.auto.commit=false) and commit manually after successful processing
- Handle the rebalance listener — clean up state when partitions are revoked
- Add monitoring on consumer lag (kafka-consumer-groups CLI or JMX metrics)
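The timeout settings in the steps above interact: heartbeats must fire several times per session timeout (a common rule of thumb is heartbeat ≤ 1/3 of session timeout), and the poll interval must exceed your worst-case processing time. A hedged sanity-check sketch — the parameter names mirror the Java client configs (session.timeout.ms, heartbeat.interval.ms, max.poll.interval.ms), and slowestMessageMs is a stand-in you would measure yourself:

```javascript
// Sanity-check consumer timeout relationships before deploying.
// These are rules of thumb, not broker-enforced limits.
function validateConsumerTimeouts({ sessionTimeout, heartbeatInterval, maxPollInterval, slowestMessageMs }) {
  const problems = [];
  if (heartbeatInterval >= sessionTimeout / 3) {
    problems.push('heartbeatInterval should be at most 1/3 of sessionTimeout');
  }
  if (maxPollInterval <= slowestMessageMs) {
    problems.push('maxPollInterval must exceed worst-case message processing time');
  }
  return problems; // empty array means the config looks sane
}
```

Run it against your config in a startup assertion so a bad deploy fails fast instead of rebalancing in a loop.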
Code
// Node.js with kafkajs
const { Kafka } = require('kafkajs');

const kafka = new Kafka({
  clientId: 'order-processor',
  brokers: ['broker1:9092', 'broker2:9092'],
});

const consumer = kafka.consumer({
  groupId: 'order-processor-v1', // STABLE — never change this casually
  sessionTimeout: 30000,
  heartbeatInterval: 3000,
  maxBytesPerPartition: 1048576, // 1 MB
  rebalanceTimeout: 60000,
  retry: {
    maxRetryTime: 30000,
    initialRetryTime: 300,
  },
});

// Log rebalances: frequent ones usually mean misconfigured timeouts
consumer.on(consumer.events.REBALANCING, () => {
  console.warn('Rebalancing: in-flight partitions may be reassigned');
});

// Kafka expects the offset of the NEXT message to read, hence the +1
async function commitNext(topic, partition, offset) {
  await consumer.commitOffsets([
    { topic, partition, offset: (Number(offset) + 1).toString() },
  ]);
}

async function run() {
  await consumer.connect();
  await consumer.subscribe({
    topic: 'orders',
    fromBeginning: false, // production: false; testing: true
  });

  await consumer.run({
    autoCommit: false, // commit manually after success
    eachMessage: async ({ topic, partition, message, heartbeat }) => {
      if (message.value === null) {
        await commitNext(topic, partition, message.offset); // tombstone: commit and skip
        return;
      }
      const orderId = message.key?.toString();
      const order = JSON.parse(message.value.toString());
      try {
        // Heartbeat during long processing to avoid being kicked out
        await processOrder(order, () => heartbeat());
        // Commit only after success
        await commitNext(topic, partition, message.offset);
      } catch (err) {
        console.error(`Failed to process order ${orderId}`, err);
        // Send to DLQ instead of crashing the consumer
        await sendToDeadLetterQueue(message, err);
        // Still commit so we don't reprocess forever
        await commitNext(topic, partition, message.offset);
      }
    },
  });
}

// Graceful shutdown
process.on('SIGTERM', async () => {
  console.log('Shutting down...');
  await consumer.disconnect();
  process.exit(0);
});

run().catch(console.error);

// Monitor lag from CLI:
// kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
//   --describe --group order-processor-v1
//
// LAG column: how many messages behind. Should be near 0 in steady state.
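The sendToDeadLetterQueue call in the handler is left abstract; whatever transport backs it, the DLQ record should preserve enough provenance to diagnose and replay the message later. A sketch of building such a record — the header names are my own convention, not a Kafka standard:

```javascript
// Build a DLQ record that preserves where the message came from and why it
// failed, so it can be inspected and replayed later. Pass it to a producer
// targeting your DLQ topic (e.g. 'orders.dlq').
function buildDlqRecord({ topic, partition, offset, key, value }, err) {
  return {
    key,
    value,
    headers: {
      'dlq.original.topic': topic,
      'dlq.original.partition': String(partition),
      'dlq.original.offset': String(offset),
      'dlq.error.message': String(err && err.message),
      'dlq.failed.at': new Date().toISOString(),
    },
  };
}
```

Keeping the original key means a replay tool can route the message back to the same partition and preserve per-key ordering.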
Common Pitfalls
- Spawning more consumers than partitions — extra consumers sit idle, no parallelism gain
- Long processing without heartbeat — consumer is kicked out, partition reassigned, message reprocessed
- Auto-commit with crash — the offset can be committed before processing finishes, so the message is silently skipped after a restart
- Changing group.id without understanding consequences — committed offsets are abandoned, and with auto.offset.reset=earliest the new group re-reads the entire topic (potentially TB of data)
- Not handling poison messages — one bad message stalls the entire partition forever
When NOT to Use This Skill
- For low-volume queues where a single consumer is sufficient — use a simple queue instead
- For request-response patterns — Kafka is asynchronous pub/sub with no built-in reply channel; use gRPC or HTTP for synchronous calls
How to Verify It Worked
- Check consumer lag: kafka-consumer-groups.sh --bootstrap-server <broker> --describe --group your-group
- Kill a consumer pod and verify another picks up the partitions within session.timeout.ms
- Send a poison message and verify it goes to DLQ, not stuck in the partition
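To automate the lag check above in a health probe, you can parse the CLI output. A hedged sketch — the whitespace-separated table with a LAG column matches recent Kafka releases, but verify the header layout against your version:

```javascript
// Sum the LAG column of `kafka-consumer-groups.sh --describe` output.
// Assumes a whitespace-separated table whose first line is the header row
// containing LAG, as printed by recent Kafka releases.
function totalLag(cliOutput) {
  const lines = cliOutput.trim().split('\n');
  const header = lines[0].trim().split(/\s+/);
  const lagIdx = header.indexOf('LAG');
  if (lagIdx === -1) throw new Error('no LAG column found');
  return lines.slice(1).reduce((sum, line) => {
    const lag = Number(line.trim().split(/\s+/)[lagIdx]);
    return sum + (Number.isFinite(lag) ? lag : 0); // '-' means lag unknown
  }, 0);
}
```

Wire the result into your alerting: a steadily rising total is the signal that matters, not any single snapshot.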
Production Considerations
- Alert on sustained lag growth — a rising trend means producers are outpacing consumers
- Use a dead letter queue for unprocessable messages
- Monitor rebalance frequency — frequent rebalances mean misconfigured timeouts
- Ensure your processing is idempotent — at-least-once means duplicates can happen
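Idempotency from the last point can start as simply as refusing to re-process an already-seen key. A minimal sketch of the shape — the in-memory Set is illustrative only; production needs durable, shared storage such as a database unique constraint or Redis:

```javascript
// At-least-once delivery means duplicates will happen; wrap the handler so a
// repeated ID is a no-op. The Set stands in for durable storage shared across
// consumer instances, which a real deployment requires.
function makeIdempotent(handler, seen = new Set()) {
  return async (orderId, payload) => {
    if (seen.has(orderId)) return { skipped: true };
    const result = await handler(payload);
    seen.add(orderId); // record only after the handler succeeds
    return { skipped: false, result };
  };
}
```

Note the ordering: the ID is recorded after the handler succeeds, so a crash mid-processing leaves the message eligible for the retry that at-least-once delivery guarantees.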