Building an AIS Kafka Consumer in Python

Problem Framing

Operationalising a Kafka consumer for Automatic Identification System telemetry exposes a range of failure modes that do not appear in generic streaming tutorials: NMEA sentence fragmentation, MMSI collisions across multi-receiver feeds, sub-second duplicate position reports from overlapping coastal VHF stations, and GIL contention during bulk deserialization. This page is part of the Real-Time AIS Stream Ingestion Pipelines workflow; it addresses the consumer tier specifically — from broker connection and partition assignment through to validated PyArrow batch writes ready for downstream vessel route segmentation.

The diagram below shows where the Kafka consumer sits within the wider ingestion pipeline, from raw VHF/satellite feed through to object-storage Parquet.

Root Cause: Why Naïve Consumers Fail on AIS Feeds

Standard Kafka consumer examples assume clean JSON blobs and let the library auto-commit offsets. AIS feeds break both assumptions. Raw payloads are multi-sentence NMEA fragments — a single Type 5 voyage message spans two !AIVDM sentences that may arrive in different poll cycles. Auto-commit can advance the offset past fragment 1 before fragment 2 arrives, permanently losing the message. A minimal reproduction:

from confluent_kafka import Consumer

# WRONG — auto-commit drops mid-fragmented NMEA messages
conf = {
    "bootstrap.servers": "broker:9092",
    "group.id": "ais-dev",
    "enable.auto.commit": "true",   # offset advances regardless of parse success
    "auto.offset.reset": "latest",
}
c = Consumer(conf)
c.subscribe(["ais-raw"])

msg = c.poll(1.0)
# If msg is fragment 1 of a two-sentence Type 5 message and auto-commit fires
# here, the stored offset moves past fragment 1 before fragment 2 has arrived.
# After a crash and restart the group resumes at fragment 2, so the message
# can never be reassembled and is dropped silently.

The diagram below contrasts the two offset-commit orderings on a fragmented Type 5 message. Auto-commit advances the offset the instant a fragment is delivered, so a crash before reassembly silently skips the message on restart; the production loop defers the commit until the reassembled record is durably written to Parquet, so the same crash replays cleanly.
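To make the fragmentation concrete, here is a minimal reassembly sketch. It assumes the standard comma-separated AIVDM layout (fragment count, fragment number, sequential message id, channel, payload). `FragmentBuffer` is a hypothetical helper, not part of any library, and it skips checksum validation and stale-fragment expiry for brevity.

```python
from typing import Optional


class FragmentBuffer:
    """Hold partial multi-sentence AIVDM messages until every fragment arrives."""

    def __init__(self) -> None:
        # (sequential message id, channel) -> {fragment number: payload}
        self._pending: dict[tuple[str, str], dict[int, str]] = {}

    def add(self, sentence: str) -> Optional[str]:
        """Return the joined payload once complete, otherwise None."""
        body = sentence.split("*", 1)[0]  # drop the checksum suffix (not verified here)
        fields = body.split(",")
        total, number = int(fields[1]), int(fields[2])
        seq_id, channel, payload = fields[3], fields[4], fields[5]

        if total == 1:
            return payload  # single-sentence message, nothing to buffer

        parts = self._pending.setdefault((seq_id, channel), {})
        parts[number] = payload
        if len(parts) < total:
            return None  # still waiting for at least one fragment

        del self._pending[(seq_id, channel)]
        return "".join(parts[i] for i in range(1, total + 1))
```

Under this scheme, the offset of fragment 1 is only safe to commit after `add()` has returned the joined payload and the decoded record has been written.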

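The second common failure is coordinate validation. Decoded AIS position fields default to sentinel values when a vessel has not yet acquired a GPS fix: latitude 91° (raw 0x3412140) and longitude 181° (raw 0x6791AC0) in the 1/10000-minute encoding. A loose range check, for example one that bounds both fields at ±180°, lets the 91° latitude through, and it then corrupts downstream spatial joins unless caught explicitly.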
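The sentinel arithmetic can be sketched as follows, assuming the 1/10000-minute integer encoding used by position report Types 1, 2, and 3. The constant and function names are illustrative only, and the raw inputs are assumed to be already sign-extended integers.

```python
# Raw position integers are in units of 1/10000 minute (Types 1, 2, 3).
UNITS_PER_DEGREE = 600_000        # 60 minutes * 10_000
RAW_LAT_UNAVAILABLE = 0x3412140   # 54_600_000 -> 91 degrees
RAW_LON_UNAVAILABLE = 0x6791AC0   # 108_600_000 -> 181 degrees


def raw_to_degrees(raw: int) -> float:
    """Convert a sign-extended raw AIS coordinate to decimal degrees."""
    return raw / UNITS_PER_DEGREE


def position_available(raw_lat: int, raw_lon: int) -> bool:
    """Return False when either field holds its 'not available' sentinel."""
    return raw_lat != RAW_LAT_UNAVAILABLE and raw_lon != RAW_LON_UNAVAILABLE
```

Checking the raw integers before conversion avoids relying on floating-point equality later in the pipeline.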

Step-by-Step Fix

Step 1 — Install dependencies and pin versions

pip install "confluent-kafka==2.3.0" "pyarrow==15.0.2" "numpy==1.26.4"

confluent-kafka wraps librdkafka, so fetching, protocol handling, and LZ4/Snappy decompression run in native C rather than in the Python interpreter. Pure-Python alternatives (kafka-python, aiokafka) add 2–4× deserialization latency at 50 k msg/s throughput.

Step 2 — Consumer configuration

import logging
from confluent_kafka import Consumer, KafkaError, KafkaException

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s — %(message)s",
)
logger = logging.getLogger("ais.consumer")

def build_consumer(bootstrap_servers: str, group_id: str) -> Consumer:
    """
    Construct a confluent-kafka Consumer configured for AIS ingestion.

    Key decisions:
    - enable.auto.commit = false: offsets advance only after Parquet write succeeds
    - auto.offset.reset = latest: real-time mode; use 'earliest' for replay
    - fetch.min.bytes / fetch.wait.max.ms: batch network I/O without GIL saturation
    - session.timeout.ms / heartbeat.interval.ms: tolerate heavy serialization bursts
    """
    conf: dict[str, str] = {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        "enable.auto.commit": "false",
        "auto.offset.reset": "latest",
        "fetch.min.bytes": "65536",       # 64 KB minimum fetch
        "fetch.wait.max.ms": "500",       # librdkafka name (Java clients: fetch.max.wait.ms)
        "session.timeout.ms": "30000",
        "heartbeat.interval.ms": "10000",
        "max.poll.interval.ms": "300000", # allow 5 min for batch writes
    }
    return Consumer(conf)

Partition the Kafka topic by mmsi_hash (e.g. mmsi % num_partitions) rather than by message type. This guarantees ordered delivery per vessel, which is mandatory for the NMEA sentence reassembly that reconstructs multi-fragment messages before the next stage.
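A minimal sketch of that partitioning rule is shown below. Both function names are hypothetical. A producer could either pass the computed partition explicitly or send the MMSI as the message key and let the client's partitioner hash it; either way, reports for one vessel land on one partition.

```python
def partition_for_mmsi(mmsi: int, num_partitions: int) -> int:
    """Map a vessel's MMSI to a fixed partition so its reports stay ordered."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return mmsi % num_partitions


def message_key(mmsi: int) -> bytes:
    """Key bytes for a producer; identical MMSIs always yield identical keys."""
    return str(mmsi).encode("ascii")
```

Note that changing the partition count later remaps vessels to different partitions, so per-vessel ordering holds only within a fixed topic layout.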

Step 3 — Schema validation and typed PyArrow record batches

AIS bitfields decoded per ITU-R M.1371 must be cast to fixed-width types before any spatial operation. Using pandas.DataFrame as an intermediate incurs 6–12× memory overhead versus direct PyArrow allocation; pre-allocate NumPy arrays and pass them to pa.RecordBatch.from_arrays:

import pyarrow as pa
import numpy as np
from typing import Any

AIS_SCHEMA = pa.schema([
    pa.field("mmsi",        pa.uint32()),
    pa.field("timestamp",   pa.timestamp("us", tz="UTC")),
    pa.field("latitude",    pa.float64()),
    pa.field("longitude",   pa.float64()),
    pa.field("speed_knots", pa.float32()),
    pa.field("course_deg",  pa.float32()),
    pa.field("nav_status",  pa.uint8()),
    pa.field("rot",         pa.float32()),
])

# Sentinel: AIS lat/lon = 91° or 181° means "position not available"
_LAT_SENTINEL = 91.0
_LON_SENTINEL = 181.0

def _is_valid_position(lat: float, lon: float) -> bool:
    """Return False for sentinel values and out-of-range WGS84 coordinates."""
    if lat == _LAT_SENTINEL or lon == _LON_SENTINEL:
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

def build_record_batch(messages: list[dict[str, Any]]) -> pa.RecordBatch:
    """
    Convert a list of validated AIS message dicts to a typed PyArrow RecordBatch.

    Raises ValueError for any record with out-of-bounds or sentinel coordinates
    so the caller can route the offending batch to a Dead Letter Queue.
    """
    n = len(messages)
    mmsi_arr  = np.empty(n, dtype=np.uint32)
    ts_arr    = np.empty(n, dtype="datetime64[us]")
    lat_arr   = np.empty(n, dtype=np.float64)
    lon_arr   = np.empty(n, dtype=np.float64)
    speed_arr = np.empty(n, dtype=np.float32)
    cog_arr   = np.empty(n, dtype=np.float32)
    nav_arr   = np.empty(n, dtype=np.uint8)
    rot_arr   = np.empty(n, dtype=np.float32)

    for i, msg in enumerate(messages):
        lat, lon = float(msg["lat"]), float(msg["lon"])
        if not _is_valid_position(lat, lon):
            raise ValueError(
                f"MMSI {msg['mmsi']}: invalid position ({lat}, {lon}) — "
                "sentinel or out-of-WGS84-range; routing to DLQ"
            )
        mmsi_arr[i]  = msg["mmsi"]
        ts_arr[i]    = np.datetime64(msg["timestamp"], "us")
        lat_arr[i]   = lat
        lon_arr[i]   = lon
        speed_arr[i] = float(msg.get("sog", 0.0))
        cog_arr[i]   = float(msg.get("cog", 0.0))
        nav_arr[i]   = int(msg.get("nav_status", 15))
        rot_arr[i]   = float(msg.get("rot", 128.0))

    return pa.RecordBatch.from_arrays(
        [mmsi_arr, ts_arr, lat_arr, lon_arr, speed_arr, cog_arr, nav_arr, rot_arr],
        schema=AIS_SCHEMA,
    )

All coordinates are implicitly in EPSG:4326 (WGS84) after decoding — no reprojection is needed at this stage, but the consuming layer (e.g. a vessel trajectory clustering job) must confirm this datum before any distance calculation.

Step 4 — Deduplication and the poll loop

Terrestrial VHF receiver networks generate duplicate position reports when the same vessel is visible from multiple shore stations. Deduplicate on the key (mmsi, timestamp_us) before appending to the batch buffer. The listing below uses a plain in-process set with count-based eviction as a stand-in for a rolling LRU or TTL cache. The cache does not persist across consumer restarts, and in production its window should be time-based and longer than any plausible duplicate delay (see the satellite AIS note under Edge Cases and Gotchas):

import gc
import pyarrow.parquet as pq
from concurrent.futures import ThreadPoolExecutor
from confluent_kafka import Consumer, KafkaError, Message

# Simple in-process deduplication; replace with Redis for multi-process deployments
_seen: set[tuple[int, int]] = set()
_DEDUP_WINDOW = 3_600  # max entries before the whole set is cleared

def _is_duplicate(mmsi: int, timestamp_us: int) -> bool:
    key = (mmsi, timestamp_us)
    if key in _seen:
        return True
    if len(_seen) > _DEDUP_WINDOW:
        _seen.clear()  # cheap eviction; swap for an LRU if ordering matters
    _seen.add(key)
    return False


def route_to_dlq(consumer: Consumer, msg: Message, reason: str) -> None:
    """Forward a failed message to the DLQ topic with original offset metadata."""
    logger.warning(
        "DLQ route: topic=%s partition=%d offset=%d reason=%s",
        msg.topic(), msg.partition(), msg.offset(), reason,
    )
    # Production: produce to '<topic>-dlq' with headers carrying the original
    # topic, partition, and offset. Omitted here for brevity; this needs a separate
    # confluent_kafka.Producer instance, since Consumer has no produce() method.


def write_parquet_batch(batch: pa.RecordBatch, output_path: str) -> None:
    """Write a RecordBatch to Parquet; called from a thread pool to avoid blocking poll."""
    pq.write_table(
        pa.Table.from_batches([batch]),
        output_path,
        compression="snappy",
        write_statistics=True,
    )
    logger.info("Wrote %d rows to %s", batch.num_rows, output_path)


def run_consumer(
    bootstrap_servers: str,
    group_id: str,
    topic: str,
    output_prefix: str,
    batch_size: int = 5_000,
) -> None:
    """
    Main poll loop: accumulate AIS messages into typed batches, write Parquet,
    then commit offsets. DLQ-routes any message that fails validation.
    """
    consumer = build_consumer(bootstrap_servers, group_id)
    consumer.subscribe([topic])
    logger.info("Subscribed to %s as group %s", topic, group_id)

    batch: list[dict] = []
    batch_index = 0

    with ThreadPoolExecutor(max_workers=2) as pool:
        try:
            while True:
                # poll() returns one Message or None — loop to fill a batch
                msg: Message | None = consumer.poll(timeout=1.0)
                if msg is None:
                    continue
                if msg.error():
                    if msg.error().code() != KafkaError._PARTITION_EOF:
                        logger.error("Kafka error: %s", msg.error())
                    continue

                try:
                    record: dict = parse_and_validate(msg.value())  # type: ignore[name-defined]
                except Exception as exc:
                    route_to_dlq(consumer, msg, str(exc))
                    continue

                if record is None:
                    continue  # filter-only message (e.g. Type 4 base-station report)

                ts_us = int(np.datetime64(record["timestamp"], "us").view(np.int64))
                if _is_duplicate(record["mmsi"], ts_us):
                    continue

                batch.append(record)

                if len(batch) >= batch_size:
                    try:
                        rb = build_record_batch(batch)
                    except ValueError as exc:
                        logger.error("Batch validation failed: %s — routing batch to DLQ", exc)
                        for m in batch:
                            logger.debug("DLQ record: %s", m)
                        batch.clear()
                        continue

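                    output_path = f"{output_prefix}/ais_{batch_index:08d}.parquet"
                    # Wait for the write before committing: result() re-raises any
                    # write error, so offsets advance only once the batch is on disk.
                    # This blocks the poll loop for the duration of the write;
                    # librdkafka keeps fetching and heartbeating in its background
                    # threads, but the wait counts against max.poll.interval.ms.
                    pool.submit(write_parquet_batch, rb, output_path).result()
                    consumer.commit(asynchronous=False)
                    batch.clear()
                    batch_index += 1

                    # Trigger GC once per completed batch, not on every polled message
                    gc.collect()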

        except KeyboardInterrupt:
            logger.info("Consumer interrupted — flushing remaining batch.")
        finally:
            if batch:
                rb = build_record_batch(batch)
                write_parquet_batch(rb, f"{output_prefix}/ais_{batch_index:08d}.parquet")
                consumer.commit(asynchronous=False)
            consumer.close()
            logger.info("Consumer closed.")

The parse_and_validate function is not shown in full here — see the step-by-step AIS message decoding guide for the complete bitfield decoder, which builds on the AIS NMEA sentence parser that handles multi-sentence fragment reassembly upstream of this consumer.
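The deduplication step in the listing above evicts by entry count. A time-based variant could look like the sketch below. `TtlDeduplicator` is a hypothetical in-process helper, the 120-second default is an assumption chosen to exceed the duplicate delays discussed on this page, and the caller is assumed to pass a non-decreasing clock value such as `time.monotonic()`.

```python
from collections import OrderedDict


class TtlDeduplicator:
    """Remember (mmsi, timestamp_us) keys for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float = 120.0) -> None:
        self._ttl = ttl_seconds
        # Insertion order equals age order because `now` never decreases.
        self._seen: OrderedDict[tuple[int, int], float] = OrderedDict()

    def is_duplicate(self, mmsi: int, timestamp_us: int, now: float) -> bool:
        # Drop expired entries from the oldest end first.
        while self._seen:
            oldest_key = next(iter(self._seen))
            if now - self._seen[oldest_key] < self._ttl:
                break
            self._seen.popitem(last=False)

        key = (mmsi, timestamp_us)
        if key in self._seen:
            return True
        self._seen[key] = now
        return False
```

Passing `now` explicitly keeps the class deterministic under test. Like the set-based version, this cache is local to one process and is lost on restart.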

Step 5 — Verify the fix: acceptance test

After starting the consumer against a staging broker, confirm correct behaviour with the following checks:

import pyarrow.parquet as pq

# AIS_SCHEMA is the schema object defined in Step 3; import it from the module
# where that code lives.

def acceptance_test(parquet_path: str) -> None:
    """
    Validate a Parquet output file from the AIS consumer.
    Raises AssertionError if any gate fails.
    """
    table = pq.read_table(parquet_path)

    # Gate 1: schema integrity
    assert table.schema.equals(AIS_SCHEMA), f"Schema mismatch: {table.schema}"

    # Gate 2: no sentinel coordinates
    lats = table["latitude"].to_pylist()
    lons = table["longitude"].to_pylist()
    assert all(-90.0 <= lat <= 90.0 for lat in lats), "Latitude out of WGS84 range"
    assert all(-180.0 <= lon <= 180.0 for lon in lons), "Longitude out of WGS84 range"
    assert 91.0 not in lats and 181.0 not in lons, "Sentinel position value leaked into output"

    # Gate 3: temporal ordering per MMSI (required for route-segmentation downstream)
    # to_pandas() needs pandas installed; it is not among the Step 1 pins.
    df = table.to_pandas()
    for mmsi, grp in df.groupby("mmsi"):
        assert grp["timestamp"].is_monotonic_increasing, \
            f"MMSI {mmsi}: timestamps not monotonically increasing"

    # Gate 4: plausible speed (> 50 knots flags a data error for non-military vessels)
    speeds = table["speed_knots"].to_pylist()
    outliers = [s for s in speeds if s > 50.0]
    if outliers:
        raise AssertionError(f"Implausible speed values: {outliers}")

    print(f"PASS — {table.num_rows} rows validated in {parquet_path}")

Additionally, monitor consumer lag from the command line to confirm the pipeline keeps pace with the live feed:

kafka-consumer-groups.sh \
  --bootstrap-server broker:9092 \
  --describe \
  --group ais-ingestion-prod

A healthy consumer holds lag below 10 000 messages on each partition. Sustained lag above 50 000 on a single partition signals GIL contention or slow Parquet writes; switch those writes to a separate multiprocessing.Process instead of a thread.
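The thresholds above can be applied to the CURRENT-OFFSET and LOG-END-OFFSET columns that the command prints. The sketch below is illustrative: the function names and the intermediate "elevated" band are assumptions, and a single sample cannot show whether lag is sustained, so sample repeatedly before acting.

```python
HEALTHY_MAX_LAG = 10_000    # below this, the consumer is keeping pace
CRITICAL_MIN_LAG = 50_000   # above this, investigate GIL contention or slow writes


def partition_lag(log_end_offset: int, committed_offset: int) -> int:
    """Messages written to the partition but not yet committed by the group."""
    return max(log_end_offset - committed_offset, 0)


def classify_lag(lag: int) -> str:
    """Bucket one lag sample; 'elevated' is an assumed middle band."""
    if lag < HEALTHY_MAX_LAG:
        return "healthy"
    if lag > CRITICAL_MIN_LAG:
        return "critical"
    return "elevated"
```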

Edge Cases and Gotchas

Type 14 safety messages carry text, not position. If your parse_and_validate implementation falls through to position extraction for message types other than 1, 2, 3, 9, 18, and 21, it will attempt to decode arbitrary ASCII text as latitude bits, producing garbage coordinates that pass the sentinel check but fail the WGS84 range gate only intermittently. Add an explicit message-type allow-list before coordinate extraction.
Satellite AIS (S-AIS) feeds duplicate terrestrial reports with a timestamp skew of 30–90 seconds. The in-process _seen set is cleared by count (once it exceeds 3 600 entries), not by time. It holds roughly 90 seconds of traffic only when the rate is near 40 msg/s (3 600 entries / 90 s). At lower rates a duplicate arriving 90 seconds later will usually still be in the set; at high rates (> 10 k msg/s) the set is cleared roughly every 0.36 seconds and late duplicates slip through. For mixed-rate deployments, key the deduplication cache on (mmsi, timestamp_us) in Redis with a 120-second TTL rather than a count-based eviction.
Consumer group rebalances during heavy deserialization bursts can exceed max.poll.interval.ms if the Parquet write blocks the poll thread. The listing above hands writes to a ThreadPoolExecutor. The executor's work queue is unbounded, so submit() itself never blocks; if commits are not gated on the returned Future, a saturated pool lets pending batches pile up in memory and write failures pass unnoticed. Set max_workers=4, keep a reference to each Future, and check its result after submission; re-raise if the future threw, triggering a clean shutdown rather than a silent stall.
The confluent-kafka debug flag (debug=cgrp,topic,fetch) emits verbose broker negotiation logs to stderr by default. Redirect these to a rotating file handler during staging to preserve rebalance event history without flooding stdout.
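The Future check described in the rebalance note above might be sketched like this. `reap_finished` is a hypothetical helper built only on the standard library's concurrent.futures; the caller is assumed to keep a list of the Futures returned by pool.submit().

```python
from concurrent.futures import Future, ThreadPoolExecutor


def reap_finished(pending: list[Future]) -> list[Future]:
    """Re-raise the first failed write and return the futures still running."""
    still_running: list[Future] = []
    for fut in pending:
        if fut.done():
            fut.result()  # re-raises the worker's exception, if any
        else:
            still_running.append(fut)
    return still_running
```

Calling it once per loop iteration, for example `pending = reap_finished(pending)`, surfaces write errors on the poll thread. The length of the returned list also gives a simple back-pressure signal: stop submitting new batches while it exceeds the worker count.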

Related

Step-by-Step AIS Message Decoding in Python — the parse_and_validate bitfield decoder this consumer depends on
Clustering Vessel Tracks with DBSCAN — downstream consumer of the Parquet batches this pipeline writes
Anomaly Detection in AIS Trajectories — next-stage processing for flagging spoofed or erratic position reports

Up: Real-Time AIS Stream Ingestion Pipelines — parent workflow covering broker topology, topic design, and partition strategy.
