Clustering Vessel Tracks with DBSCAN

High-frequency AIS positional streams routinely fragment under naive Euclidean distance metrics: variable sampling intervals, GPS multipath drift, and intermittent transmission dropouts split continuous maritime routes into disjoint micro-segments that generate false behavioral primitives downstream. This page resolves that exact failure mode by walking through a production DBSCAN implementation that enforces Haversine geodesic distance, temporal continuity filtering, and memory-bounded chunk I/O. It is one stage of the broader segmenting vessel routes by behavior workflow — the parent page to return to — which converts raw AIS telemetry into kinematically homogeneous track segments before any operational-state classification runs.

Reference parameters

The defaults below hold for the cluster_vessel_tracks routine built in Step 4 and match the chunk conventions used across the AIS vessel tracking and route automation pipelines. Tune eps_km and min_samples to the local transmission density before adjusting anything else.

Parameter	Default	Unit	Role
`eps_km`	0.5	km	Neighbourhood radius; 0.3–0.5 for coastal/Arctic, up to 2.0 for open-ocean transits
`min_samples`	5	pings	Core-point threshold; raise to 8 for low-density Arctic feeds
`max_temporal_gap_sec`	1800	s	Segment break threshold within a `(mmsi, cluster_id)` group
`batch_size`	500 000	rows	Parquet streaming chunk; fits within 2 GB RAM
`metric`	`haversine`	—	Geodesic distance over radian coordinates
`algorithm`	`ball_tree`	—	O(N log N) spatial index; avoids O(N²) distance matrix
`EARTH_RADIUS_KM`	6371.0	km	Mean Earth radius for the km→radian conversion
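The ranges in the table can be captured as named presets. The sketch below is illustrative only: the ClusterParams dataclass, the PRESETS mapping, and its region keys are hypothetical names, and the values are copied from the table rather than derived from any tuning run.

```python
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, as in the table


@dataclass(frozen=True)
class ClusterParams:
    """Hypothetical container for the tunable clustering parameters."""
    eps_km: float = 0.5
    min_samples: int = 5
    max_temporal_gap_sec: float = 1800.0

    @property
    def eps_rad(self) -> float:
        # The Haversine metric expects the radius as an angle in radians.
        return self.eps_km / EARTH_RADIUS_KM


# Illustrative presets built from the table's ranges; adjust to local ping density.
PRESETS = {
    "coastal": ClusterParams(),
    "arctic": ClusterParams(eps_km=0.3, min_samples=8),
    "open_ocean": ClusterParams(eps_km=2.0),
}

for region, params in PRESETS.items():
    print(f"{region}: eps={params.eps_km} km ({params.eps_rad:.2e} rad), "
          f"min_samples={params.min_samples}")
```

Keeping the kilometre value as the stored field and deriving radians on demand means the human-readable unit stays the one that gets tuned.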

Why Euclidean Distance Breaks Maritime Clustering

The failure is straightforward to reproduce. Loading a raw Parquet shard and running DBSCAN with the default metric='euclidean' over decimal-degree coordinates produces spurious clusters:

import numpy as np
import polars as pl
from sklearn.cluster import DBSCAN

df = pl.read_parquet("ais_shard.parquet")
coords = np.column_stack([df["lat"].to_numpy(), df["lon"].to_numpy()])

# WRONG: Euclidean over decimal degrees — scale distorts at high latitudes
db = DBSCAN(eps=0.005, min_samples=5, metric="euclidean")
labels = db.fit_predict(coords)
# Result: >40 % of pings in North Sea data assigned label -1 (noise)
# At 60 °N, 1 ° longitude ≈ 55.6 km but 1 ° latitude ≈ 111 km, so distances are anisotropic

The asymmetry between a degree of latitude and a degree of longitude grows with latitude. At 60 °N a 0.005 ° longitude step is roughly 278 m, while the same step in latitude is about 555 m: a 2× distortion (1/cos 60°). The eps neighbourhood therefore reaches only half as far east-west as north-south on the ground, and DBSCAN treats that artefact as genuine spatial separation. Vessels transiting the North Sea, Norwegian coast, or Alaskan shelf see fragmentation rates above 40 % under this metric. The fix is to operate in geodesic space from the outset.
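The latitude/longitude asymmetry can be checked with a few lines of standard-library Python. This is a minimal sketch: haversine_km is a helper written for this illustration (not part of the pipeline), and the reference point at 60 °N, 5 °E is an arbitrary choice.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


step = 0.005  # degrees, the eps used in the Euclidean example
east_west_m = haversine_km(60.0, 5.0, 60.0, 5.0 + step) * 1000
north_south_m = haversine_km(60.0, 5.0, 60.0 + step, 5.0) * 1000
ratio = north_south_m / east_west_m

print(f"east-west: {east_west_m:.0f} m, north-south: {north_south_m:.0f} m, "
      f"ratio: {ratio:.2f}")
```

The ratio approaches 1/cos(latitude), which is why the same degree-based eps behaves differently in the tropics and at 60 °N.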

The second failure mode is purely temporal. A vessel anchored at position A for six hours, then transiting past position A twelve hours later, will be merged into the same spatial density region by any distance-only metric. Temporal continuity filtering applied after spatial density estimation prevents this collapse.

Step-by-Step Fix

Step 1 — Validate input schema and express `eps` in radians

The input Parquet file must expose four typed columns. Validate at ingestion rather than discovering schema mismatches mid-batch.

import logging
import pyarrow.parquet as pq

logger = logging.getLogger(__name__)

REQUIRED_COLUMNS = {"mmsi", "timestamp", "lat", "lon"}
EARTH_RADIUS_KM = 6371.0

def validate_schema(input_path: str) -> None:
    """Raise ValueError if required AIS columns are absent from the Parquet schema."""
    schema = pq.read_schema(input_path)
    missing = REQUIRED_COLUMNS - set(schema.names)
    if missing:
        raise ValueError(
            f"Input Parquet missing required columns: {missing}. "
            f"Found: {schema.names}"
        )
    logger.info("Schema validated: %s", input_path)

def km_to_radians(eps_km: float) -> float:
    """Convert a kilometre radius to radians for the Haversine metric."""
    if eps_km <= 0:
        raise ValueError(f"eps_km must be positive; got {eps_km}")
    return eps_km / EARTH_RADIUS_KM

Call validate_schema before opening the batch iterator. Expressing eps in radians at this stage prevents the silent scale error that appears when the raw kilometre value is passed to DBSCAN directly: the Haversine metric would read the default 0.5 as 0.5 rad, an arc of roughly 3 185 km, and merge most of an ocean basin into one region.
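The size of that scale error follows from the arc-length relation (distance = angle × radius). A small worked calculation, assuming the default eps_km of 0.5 and the mean Earth radius used throughout this page:

```python
EARTH_RADIUS_KM = 6371.0

eps_km = 0.5

# Correct conversion: the angle (in radians) subtended by a 0.5 km arc.
eps_rad = eps_km / EARTH_RADIUS_KM

# The mistake: passing 0.5 unconverted, so it is read as 0.5 radians.
mistaken_radius_km = eps_km * EARTH_RADIUS_KM

print(f"eps_rad = {eps_rad:.3e} rad")
print(f"0.5 misread as radians spans about {mistaken_radius_km:.1f} km")
```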

Step 2 — Run DBSCAN with BallTree and Haversine metric

scikit-learn’s BallTree reduces spatial indexing to O(N log N) and avoids materialising the O(N²) pairwise distance matrix that would exhaust memory on multi-year archives.

import numpy as np
import polars as pl
from sklearn.cluster import DBSCAN

def apply_spatial_dbscan(
    df: pl.DataFrame,
    eps_rad: float,
    min_samples: int,
) -> pl.DataFrame:
    """
    Apply DBSCAN over WGS84 coordinates using Haversine metric.

    Args:
        df: Polars DataFrame with lat, lon columns in decimal degrees.
        eps_rad: Neighbourhood radius in radians (convert km via km_to_radians).
        min_samples: Minimum points to constitute a core point.

    Returns:
        DataFrame with an added cluster_id column (Int32); noise = -1.
    """
    coords_rad = np.deg2rad(
        np.column_stack([df["lat"].to_numpy(), df["lon"].to_numpy()])
    )

    db = DBSCAN(
        eps=eps_rad,
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",
        n_jobs=-1,
    )
    labels = db.fit_predict(coords_rad)

    # labels >= 0 counts every clustered ping (core and border points alike).
    logger.debug(
        "DBSCAN complete: %d clustered pings, %d noise pings (%.1f %%)",
        int((labels >= 0).sum()),
        int((labels == -1).sum()),
        100.0 * (labels == -1).mean(),
    )

    return df.with_columns(
        pl.Series("cluster_id", labels).cast(pl.Int32)
    )

Setting n_jobs=-1 parallelises the BallTree query across all available cores. For vessels in high-traffic coastal zones — port approach corridors, traffic separation schemes — min_samples=5 with eps_km=0.5 reliably captures stationary and slow-manoeuvring behaviours. Open-ocean transits may warrant eps_km=2.0 to tolerate wider positional scatter from reduced transmission frequency.
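For a sense of scale, the dense distance matrix that the ball tree avoids can be sized with simple arithmetic. The sketch assumes float64 distances (8 bytes each) and uses the 500 000-row batch size from the reference parameters; dense_matrix_bytes is a helper invented for this illustration, and the figures say nothing about the ball tree's own footprint.

```python
BYTES_PER_FLOAT64 = 8


def dense_matrix_bytes(n_points: int) -> int:
    """Bytes needed to hold a full n x n float64 pairwise distance matrix."""
    return n_points * n_points * BYTES_PER_FLOAT64


for n in (10_000, 100_000, 500_000):
    gib = dense_matrix_bytes(n) / 2**30
    print(f"{n:>7} points -> {gib:,.1f} GiB")
```

At the default batch size the full matrix would run to terabytes, which is why a tree-based neighbour search is the only practical option at this scale.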

Step 3 — Apply temporal continuity filtering per `(mmsi, cluster_id)` group

Two pings at the same anchorage separated by a multi-hour gap belong to distinct operational events. After spatial labelling, enforce a maximum inter-point time delta within each vessel–cluster group.

def apply_temporal_segmentation(
    df: pl.DataFrame,
    max_gap_sec: float,
) -> pl.DataFrame:
    """
    Split spatially co-located pings into discrete segments wherever
    the inter-point time gap exceeds max_gap_sec.

    Args:
        df: DataFrame with mmsi, cluster_id, timestamp columns.
        max_gap_sec: Maximum allowed gap in seconds before a segment break.

    Returns:
        DataFrame with segment_id column (Int32), unique per (mmsi, cluster_id, break).
    """
    df = df.sort(["mmsi", "cluster_id", "timestamp"])

    df = df.with_columns(
        pl.col("timestamp")
          .diff()
          .over(["mmsi", "cluster_id"])
          .dt.total_seconds()
          .alias("dt_sec")
    )

    df = df.with_columns(
        (pl.col("dt_sec") > max_gap_sec)
          .fill_null(False)           # first ping in each group has null diff
          .alias("is_break")
    )

    df = df.with_columns(
        pl.col("is_break")
          .cum_sum()
          .over(["mmsi", "cluster_id"])
          .cast(pl.Int32)
          .alias("segment_id")
    )

    return df.drop(["dt_sec", "is_break"])

The cumulative sum of boolean break flags produces a segment_id that starts at 0 for each (mmsi, cluster_id) group and increments by one at every temporal discontinuity, without requiring a join or a materialised group-by.
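The same break-flag logic can be traced by hand on a single group. The following is a standard-library sketch of the idea, not the Polars implementation; the timestamps are made-up values in seconds.

```python
from itertools import accumulate

MAX_GAP_SEC = 1800  # default max_temporal_gap_sec

# Ping times in seconds for one (mmsi, cluster_id) group, already sorted.
timestamps = [0, 600, 1200, 9000, 9600, 30000]

# The first ping has no predecessor, so its gap is None and never a break.
gaps = [None] + [b - a for a, b in zip(timestamps, timestamps[1:])]
is_break = [gap is not None and gap > MAX_GAP_SEC for gap in gaps]

# Running total of break flags: constant inside a segment, +1 at each break.
segment_ids = list(accumulate(int(flag) for flag in is_break))
print(segment_ids)  # [0, 0, 0, 1, 1, 2]
```

The two gaps larger than 1800 s (7800 s and 20 400 s) each open a new segment, giving three segments in total.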

Step 4 — Stream Parquet batches and write typed output

The full pipeline wires the three steps above into a streaming loop that maintains a memory ceiling regardless of archive size. The standard chunk size of 500 000 rows used across the AIS vessel tracking and route automation workflows fits comfortably within 2 GB RAM on commodity CI runners.

import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq

def cluster_vessel_tracks(
    input_path: str,
    output_path: str,
    eps_km: float = 0.5,
    min_samples: int = 5,
    max_temporal_gap_sec: float = 1800.0,
    batch_size: int = 500_000,
) -> None:
    """
    Production routine for spatiotemporal DBSCAN segmentation of AIS trajectories.

    Args:
        input_path: Parquet file with columns mmsi, timestamp, lat, lon.
        output_path: Destination Parquet path for clustered segments.
        eps_km: Neighbourhood radius in kilometres.
        min_samples: Minimum points to form a DBSCAN core point.
        max_temporal_gap_sec: Max allowed gap within a segment (seconds).
        batch_size: Points per Parquet batch.
    """
    validate_schema(input_path)
    eps_rad = km_to_radians(eps_km)
    logger.info(
        "Starting DBSCAN segmentation: eps=%.3f km (%.6f rad), min_samples=%d",
        eps_km, eps_rad, min_samples,
    )

    parquet_file = pq.ParquetFile(input_path)
    writer: pq.ParquetWriter | None = None

    output_schema = pa.schema([
        pa.field("mmsi", pa.int64()),
        pa.field("timestamp", pa.timestamp("us")),
        pa.field("lat", pa.float64()),
        pa.field("lon", pa.float64()),
        pa.field("cluster_id", pa.int32()),
        pa.field("segment_id", pa.int32()),
    ])

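    # DBSCAN labels restart at 0 in every batch, so shift them by a running
    # offset to keep (mmsi, cluster_id, segment_id) unique across the output.
    cluster_offset = 0

    for batch in parquet_file.iter_batches(batch_size=batch_size):
        df = pl.from_arrow(batch)
        if df.is_empty():
            continue

        df = apply_spatial_dbscan(df, eps_rad, min_samples)
        df = df.filter(pl.col("cluster_id") != -1)
        if df.is_empty():
            logger.debug("Batch fully noise; skipping.")
            continue

        # Count this batch's clusters before shifting the labels.
        n_clusters = int(df["cluster_id"].max()) + 1
        df = df.with_columns(
            (pl.col("cluster_id") + cluster_offset).cast(pl.Int32).alias("cluster_id")
        )
        cluster_offset += n_clusters

        df = apply_temporal_segmentation(df, max_temporal_gap_sec)
        out = df.select(["mmsi", "timestamp", "lat", "lon", "cluster_id", "segment_id"])

        # Cast to the fixed schema so every batch is written with identical types.
        arrow_batch = out.to_arrow().cast(output_schema)
        if writer is None:
            writer = pq.ParquetWriter(output_path, output_schema)
        writer.write_table(arrow_batch)
        logger.info("Batch written: %d rows → %s", len(out), output_path)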

    if writer is not None:
        writer.close()
        logger.info("Segmentation complete: %s", output_path)
    else:
        raise RuntimeError(
            f"No valid output rows produced from {input_path}. "
            "Check eps_km, min_samples, and input data quality."
        )

Using pq.ParquetWriter with a fixed output schema avoids the read-back-and-concat pattern that appeared in earlier versions of this routine, which doubled I/O on every batch and introduced schema drift risks when Polars inferred inconsistent types across chunks.

Verification and Acceptance Test

After running cluster_vessel_tracks, confirm segment statistics against known operational-state distributions:

import logging

import polars as pl

logger = logging.getLogger(__name__)

def verify_segmentation(output_path: str) -> None:
    """Assert segment statistics fall within expected operational ranges."""
    df = pl.read_parquet(output_path)

    assert "cluster_id" in df.columns, "cluster_id column missing from output"
    assert "segment_id" in df.columns, "segment_id column missing from output"
    assert (df["cluster_id"] == -1).sum() == 0, "Noise pings (label -1) leaked into output"

    seg_stats = (
        df.group_by(["mmsi", "cluster_id", "segment_id"])
          .agg(pl.count().alias("ping_count"))
    )

    median_pings = seg_stats["ping_count"].median()
    assert median_pings >= 3, (
        f"Median segment length {median_pings:.1f} pings — suspiciously short. "
        "Check min_samples or input data density."
    )

    n_segments = len(seg_stats)
    logger.info(
        "Verification passed: %d segments, median %.1f pings/segment",
        n_segments, median_pings,
    )

A healthy run on coastal AIS data (port approaches, traffic separation zones) yields median segment lengths of 8–25 pings, with segments aligning recognisably with port-approach corridors, anchorage zones, and transit lanes. Segment lengths consistently below 4 pings indicate eps_km is too small or min_samples too high for the data’s transmission frequency. Segments exceeding 500 pings typically signal that max_temporal_gap_sec is too permissive and is merging distinct operational events.
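These rules of thumb can be turned into a quick triage helper. The sketch below is an assumption-laden illustration: diagnose_segments is a hypothetical function, it takes a plain list of per-segment ping counts, and its thresholds (median below 4, any segment above 500) are simply the figures quoted in the paragraph above.

```python
from statistics import median


def diagnose_segments(ping_counts: list[int]) -> list[str]:
    """Return warnings for segment-length patterns that suggest mis-tuned parameters."""
    if not ping_counts:
        return ["no segments produced"]

    warnings = []
    med = median(ping_counts)
    if med < 4:
        warnings.append(
            f"median {med} pings: eps_km may be too small or min_samples too high"
        )
    if max(ping_counts) > 500:
        warnings.append(
            "segment over 500 pings: max_temporal_gap_sec may be too permissive"
        )
    return warnings


print(diagnose_segments([10, 12, 20]))   # healthy coastal-style lengths
print(diagnose_segments([2, 3, 3]))      # suspiciously short segments
print(diagnose_segments([10, 600, 12]))  # one very long segment
```

The ping counts could come from the ping_count column that verify_segmentation already computes.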

The output feeds directly into speed and heading profiling for maritime analytics, which consumes the typed (mmsi, cluster_id, segment_id) partition key to compute per-segment kinematic profiles.

Edge Cases and Gotchas

High-latitude distortion with large eps_km. At latitudes above 70 °N, even Haversine-correct DBSCAN can conflate geographically distinct anchorages when eps_km exceeds 1.0 km, because Norwegian and Alaskan fjord geometry places multiple anchorage points within a single dense region. Reduce eps_km to 0.3–0.5 km for Arctic AIS datasets and increase min_samples to 8 to compensate for lower ping density.
scikit-learn version and BallTree leaf size. Versions below 1.2 contain a BallTree memory regression that can cause OOM on batches above 200 000 rows with n_jobs=-1. Pin scikit-learn>=1.3.0 in your environment. If memory pressure persists, add leaf_size=40 to the DBSCAN constructor; the default of 30 generates a deeper tree that consumes more RAM during construction.
Polars diff().over() behaviour on single-row groups. A vessel with exactly one ping in a given (mmsi, cluster_id) group within a batch produces a null from .diff(), which .fill_null(False) handles correctly. However, if the Polars version pre-dates 0.19, the .dt.total_seconds() call on a null duration raises a ComputeError rather than propagating null. Pin polars>=0.19.0 and add a .cast(pl.Float64).fill_null(0.0) guard after .dt.total_seconds() when operating on mixed-version CI environments.
Timestamp timezone awareness. AIS feeds from NMEA aggregators, including the output of the Parsing AIS NMEA Sentences with Python stage, emit UTC-naive timestamps. If any system converts them to local wall-clock time before ingestion, .diff() across a DST boundary returns deltas that are off by the size of the clock shift. Store and process timestamps as UTC throughout; apply timezone localisation only at the reporting layer.
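The timezone gotcha is easy to reproduce with the standard library alone. In this sketch the two fixed offsets are assumptions standing in for a zone's standard and daylight-saving time (UTC-5 before the switch, UTC-4 after it), and the ping times are invented; the point is only that wall-clock differences and UTC differences disagree across the switch.

```python
from datetime import datetime, timedelta, timezone

MAX_GAP_SEC = 1800  # default max_temporal_gap_sec

# Assumed fixed offsets modelling standard time and daylight-saving time.
STANDARD = timezone(timedelta(hours=-5))
DAYLIGHT = timezone(timedelta(hours=-4))

# Two consecutive pings, 20 minutes apart in real time, straddling the switch.
ping_a = datetime(2024, 3, 10, 1, 50, tzinfo=STANDARD)
ping_b = datetime(2024, 3, 10, 3, 10, tzinfo=DAYLIGHT)

# Wall-clock difference, as seen once the offset information has been dropped.
local_gap = (ping_b.replace(tzinfo=None) - ping_a.replace(tzinfo=None)).total_seconds()

# Difference after normalising both pings to UTC.
utc_gap = (ping_b.astimezone(timezone.utc) - ping_a.astimezone(timezone.utc)).total_seconds()

print(local_gap > MAX_GAP_SEC)  # True: a spurious segment break
print(utc_gap > MAX_GAP_SEC)    # False: the pings stay in one segment
```

With local wall-clock values the 20-minute gap looks like 80 minutes and would split the segment; in UTC it stays under the threshold.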

Related

Segmenting Vessel Routes by Behavior — parent workflow: finite state machine pipeline that consumes DBSCAN segment output
Speed and Heading Profiling for Maritime Analytics — downstream stage consuming (mmsi, cluster_id, segment_id) partitions
Anomaly Detection in AIS Trajectories — applies rolling MAD scoring over the same behavioral segment schema
Parsing AIS NMEA Sentences with Python — upstream stage producing the mmsi, timestamp, lat, lon Parquet inputs this routine consumes
