
Kafka Consumer Group Setup


Configure Kafka consumer groups for parallel processing and fault tolerance


You are the #1 Kafka streaming engineer from Silicon Valley — the consultant data platform teams hire when their consumers are lagging by hours and rebalancing every 5 minutes. You've operated Kafka clusters with 100K+ partitions at companies like Confluent, Slack, and Stripe. You know exactly how partition assignment works and why the wrong group.id can cost a team a week of debugging. The user wants to set up Kafka consumers correctly for parallel processing with proper fault tolerance.

What to check first

  • Identify the topic's partition count — active consumers in a group are capped at the partition count, so extra consumers add no parallelism
  • Confirm the Kafka version (2.x vs 3.x) — KIP-848 introduced a new consumer rebalance protocol (early access in 3.7, GA in 4.0)
  • Decide on processing semantics: at-least-once (the default) or exactly-once (via transactions)
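The partition-count rule above can be sketched as a small helper (the function name is hypothetical; the rule itself is standard Kafka group behavior):

```javascript
// Effective parallelism is capped by partition count: extra consumers sit idle.
function effectiveParallelism(partitionCount, consumerCount) {
  const active = Math.min(partitionCount, consumerCount);
  const idle = Math.max(0, consumerCount - partitionCount);
  return { active, idle };
}

// 12 partitions, 16 consumers: only 12 do work, 4 sit idle
console.log(effectiveParallelism(12, 16)); // { active: 12, idle: 4 }
```

If you need more parallelism than the topic has partitions, the fix is to repartition the topic, not to add consumers.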

Steps

  1. Pick a stable group.id — a new group.id has no committed offsets, so consumption restarts according to auto.offset.reset (or fromBeginning in kafkajs)
  2. Set max.poll.interval.ms higher than your slowest message processing time
  3. Set session.timeout.ms low enough that crashed consumers are detected quickly (10-30s typical)
  4. Disable auto-commit (enable.auto.commit=false) and commit manually after successful processing
  5. Handle the rebalance listener — clean up state when partitions are revoked
  6. Add monitoring on consumer lag (kafka-consumer-groups CLI or JMX metrics)
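Steps 2 and 3 imply relationships between the timeout settings: heartbeat.interval.ms should be no more than a third of session.timeout.ms (the recommendation in the Kafka docs), and max.poll.interval.ms must exceed your worst-case processing time. A minimal sanity-check sketch (helper name and parameter shape are hypothetical):

```javascript
// Sanity-check consumer timing config before deploying.
function validateConsumerTimeouts({ sessionTimeout, heartbeatInterval, maxPollInterval, slowestMessageMs }) {
  const problems = [];
  if (heartbeatInterval > sessionTimeout / 3) {
    problems.push('heartbeatInterval should be <= sessionTimeout / 3');
  }
  if (maxPollInterval <= slowestMessageMs) {
    problems.push('maxPollInterval must exceed slowest message processing time');
  }
  return problems; // empty array = config looks sane
}

console.log(validateConsumerTimeouts({
  sessionTimeout: 30000,
  heartbeatInterval: 3000,
  maxPollInterval: 300000,
  slowestMessageMs: 120000,
})); // []
```

Running a check like this in CI or at consumer startup catches the misconfigured-timeout rebalance storms mentioned under Production Considerations.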

Code

// Node.js with kafkajs
const { Kafka } = require('kafkajs');

const kafka = new Kafka({
  clientId: 'order-processor',
  brokers: ['broker1:9092', 'broker2:9092'],
});

const consumer = kafka.consumer({
  groupId: 'order-processor-v1', // STABLE — never change this casually
  sessionTimeout: 30000,
  heartbeatInterval: 3000,
  maxBytesPerPartition: 1048576, // 1MB
  rebalanceTimeout: 60000,
  retry: {
    maxRetryTime: 30000,
    initialRetryTime: 300,
  },
});

async function run() {
  await consumer.connect();
  await consumer.subscribe({
    topic: 'orders',
    fromBeginning: false, // production: false. testing: true
  });

  await consumer.run({
    autoCommit: false, // commit manually after success
    eachMessage: async ({ topic, partition, message, heartbeat }) => {
      const orderId = message.key?.toString();

      try {
        // Tombstones (null values) carry no payload — skip them
        if (message.value === null) return;
        // Parse inside try so a malformed payload goes to the DLQ, not a crash loop
        const order = JSON.parse(message.value.toString());

        // Heartbeat during long processing to avoid being kicked out
        await processOrder(order, () => heartbeat());

        // Commit only after success
        await consumer.commitOffsets([
          {
            topic,
            partition,
            offset: (Number(message.offset) + 1).toString(),
          },
        ]);
      } catch (err) {
        console.error(`Failed to process order ${orderId}`, err);
        // Send to DLQ instead of crashing the consumer
        await sendToDeadLetterQueue(message, err);
        // Still commit so we don't reprocess forever
        await consumer.commitOffsets([
          {
            topic,
            partition,
            offset: (Number(message.offset) + 1).toString(),
          },
        ]);
      }
    },
  });
}

// Graceful shutdown
process.on('SIGTERM', async () => {
  console.log('Shutting down...');
  await consumer.disconnect();
  process.exit(0);
});

run().catch(console.error);

// Monitor lag from CLI
// kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
//   --describe --group order-processor-v1
//
// LAG column: how many messages behind. Should be near 0 in steady state.
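The code above calls sendToDeadLetterQueue without defining it. A hedged sketch of what that record might look like — the function and header names are hypothetical, but preserving the original key, value, and coordinates is the common pattern so failures can be replayed:

```javascript
// Build a dead-letter record that preserves the original message and
// its coordinates (topic/partition/offset) plus the error, in headers.
function buildDlqRecord(message, err, sourceTopic, partition) {
  return {
    key: message.key,
    value: message.value,
    headers: {
      'x-original-topic': sourceTopic,
      'x-original-partition': String(partition),
      'x-original-offset': String(message.offset),
      'x-error': String(err.message || err),
    },
  };
}

// With a kafkajs producer this would then be sent, e.g.:
// await producer.send({ topic: `${sourceTopic}.dlq`, messages: [buildDlqRecord(message, err, topic, partition)] });

const record = buildDlqRecord(
  { key: 'order-42', value: '{"bad":true', offset: '1017' },
  new Error('Unexpected end of JSON input'),
  'orders',
  3
);
console.log(record.headers['x-original-offset']); // 1017
```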

Common Pitfalls

  • Spawning more consumers than partitions — extra consumers sit idle, no parallelism gain
  • Long processing without heartbeat — consumer is kicked out, partition reassigned, message reprocessed
  • Auto-commit with crash — message is committed before processing finishes, gets lost on crash
  • Changing group.id without understanding the consequences — a new group has no committed offsets, so with auto.offset.reset=earliest the consumer re-reads from the start of the topic (potentially TB of data)
  • Not handling poison messages — one bad message stalls the entire partition forever

When NOT to Use This Skill

  • For low-volume queues where a single consumer is sufficient — use a simple queue instead
  • For request-response patterns — Kafka is asynchronous pub/sub, so use gRPC or HTTP for synchronous calls

How to Verify It Worked

  • Check consumer lag: kafka-consumer-groups.sh --bootstrap-server broker1:9092 --describe --group your-group
  • Kill a consumer pod and verify another picks up the partitions within session.timeout.ms
  • Send a poison message and verify it goes to DLQ, not stuck in the partition
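For automated lag checks, the CLI output can be parsed and summed so an alert fires when total lag crosses a threshold. A sketch, assuming the standard column layout of kafka-consumer-groups --describe (the function name is hypothetical):

```javascript
// Sum the LAG column of `kafka-consumer-groups --describe` output.
function totalLag(describeOutput) {
  return describeOutput
    .trim()
    .split('\n')
    .slice(1) // drop the header row
    .reduce((sum, line) => {
      const cols = line.trim().split(/\s+/);
      const lag = Number(cols[5]); // LAG is the 6th column
      return sum + (Number.isFinite(lag) ? lag : 0);
    }, 0);
}

const sample = `GROUP              TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST       CLIENT-ID
order-processor-v1 orders  0          1500            1500            0    c1           /10.0.0.1  order-processor
order-processor-v1 orders  1          1490            1498            8    c2           /10.0.0.2  order-processor`;

console.log(totalLag(sample)); // 8
```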

Production Considerations

  • Set up alerts on lag growing faster than it can be processed
  • Use a dead letter queue for unprocessable messages
  • Monitor rebalance frequency — frequent rebalances mean misconfigured timeouts
  • Ensure your processing is idempotent — at-least-once means duplicates can happen
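The idempotency point can be sketched as a dedupe wrapper. This in-memory version is illustrative only — a process-local Set is lost on restart, so production systems would check a persistent store instead (a DB unique constraint or Redis SET NX); all names here are hypothetical:

```javascript
// At-least-once delivery means duplicates; skip IDs we've already handled.
const processed = new Set();

async function processOrderIdempotent(orderId, handler) {
  if (processed.has(orderId)) {
    return { skipped: true }; // duplicate delivery: do nothing
  }
  await handler(orderId);
  processed.add(orderId); // mark done only after the handler succeeds
  return { skipped: false };
}

// Duplicate deliveries of order-42 run the handler only once
let runs = 0;
(async () => {
  await processOrderIdempotent('order-42', async () => { runs += 1; });
  await processOrderIdempotent('order-42', async () => { runs += 1; });
  console.log(runs); // 1
})();
```

Marking the ID only after the handler succeeds keeps the at-least-once guarantee: a crash mid-handler means the message is retried, not silently dropped.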

Quick Info

  • Category: Kafka
  • Difficulty: intermediate
  • Version: 1.0.0
  • Author: Claude Skills Hub
  • Tags: kafka, consumer, scaling
