How to Convert LAS to XYZ for Bathymetry

Converting airborne or multibeam-derived LAS/LAZ point clouds to XYZ format for bathymetric use is a controlled data-reduction operation that sits within the Point Cloud Filtering for Multibeam Sonar workflow — not a trivial file translation. Two failure modes dominate in practice: out-of-memory crashes from naive bulk reads on 10–50 GB survey files, and systematic depth biases introduced by unvalidated CRS transformations that corrupt every downstream interpolation step. This guide builds a production-grade, chunked conversion pipeline using laspy and pyproj that enforces classification boundaries, respects memory ceilings, and preserves vertical datum integrity from source VLR through to space-delimited XYZ output.

Why Bulk LAS Reads Fail for Marine Data

The failure mode is straightforward to reproduce:

import laspy

# Do NOT do this on a 20 GB survey file
las = laspy.read("survey_line_01.las")   # loads every point record into RAM; MemoryError is raised here on standard nodes
z = las.z                                # only reached if the full read fits in memory

A 20 GB LAS file at Point Data Record Format 0 (20 bytes/point) holds roughly one billion points. Loading the full record requires an allocation about as large as the file itself before numpy has processed a single coordinate, so standard compute nodes with 16–32 GB RAM fail once OS overhead is factored in.
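
The point-count arithmetic above can be checked in a few lines. This is a rough sketch: the record lengths are the base sizes of LAS point formats 0–3 and 6–8 without extra bytes, the helper name estimate_point_count is illustrative and not part of laspy, and header and VLR overhead is ignored.

```python
# Base point record lengths in bytes for common LAS point formats (no extra bytes).
RECORD_LENGTH_BYTES = {0: 20, 1: 28, 2: 26, 3: 34, 6: 30, 7: 36, 8: 38}


def estimate_point_count(file_size_bytes: int, point_format: int) -> int:
    """Approximate number of points in an uncompressed LAS file.

    Ignores the header and VLRs, which are negligible at multi-gigabyte sizes.
    """
    return file_size_bytes // RECORD_LENGTH_BYTES[point_format]


if __name__ == "__main__":
    size = 20 * 10**9  # 20 GB survey file (decimal gigabytes)
    for fmt in (0, 1, 6):
        n = estimate_point_count(size, fmt)
        print(f"format {fmt}: ~{n / 1e6:,.0f} million points")
```

Because a full read has to hold every record at once, the in-memory footprint is at least the file size, whichever format the survey was written in.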

Beyond memory, unclassified LAS exports carry water-column backscatter, vessel-wake noise, and surface multipath returns alongside valid seabed hits. Passing all returns to the XYZ writer introduces interpolation artifacts that propagate into DEM generation and force a separate pass for removing bathymetric artifacts and noise downstream. Filtering at the source is far cheaper than that cleanup.

Step 1 — Inspect the LAS Header and Extract CRS

Before streaming any point data, read the header to confirm point count, format, and the embedded spatial reference stored in the Variable Length Records (VLRs). A missing or ambiguous CRS at this stage must abort the pipeline — silent coordinate mismatches cannot be corrected after export.

import laspy
import pyproj
from typing import Optional

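def get_source_epsg(las_file: laspy.LasReader) -> Optional[int]:
    """
    Extract EPSG code from the LASF_Projection VLR.
    Returns None if no parseable CRS is found in the header.
    """
    # Ask laspy first: header.parse_crs() reads both WKT and GeoTIFF-key
    # projection VLRs, which laspy may expose as typed objects without record_data.
    try:
        crs = las_file.header.parse_crs()
        if crs is not None:
            return crs.to_epsg()
    except Exception:
        pass  # fall back to scanning raw VLR payloads below
    for vlr in las_file.header.vlrs:
        if getattr(vlr, "user_id", "") == "LASF_Projection":
            try:
                record_data = getattr(vlr, "record_data", None)
                if isinstance(record_data, (bytes, bytearray)):
                    wkt = record_data.decode("utf-8", errors="ignore").strip("\x00")
                    if wkt:
                        crs = pyproj.CRS.from_wkt(wkt)
                        return crs.to_epsg()
            except Exception:
                continue  # payload was not parseable WKT; try the next VLR
    return None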

# Inspect without loading point data
with laspy.open("survey_line_01.las", "r") as las_file:
    hdr = las_file.header
    src_epsg = get_source_epsg(las_file)
    print(f"Points: {hdr.point_count:,}")
    print(f"Format: {hdr.point_format.id}")
    print(f"CRS: {'EPSG:' + str(src_epsg) if src_epsg else 'UNDEFINED — abort'}")

For airborne LiDAR datasets (EPSG:4979 or a UTM zone), you must plan a vertical transformation to a tidal datum before the output can be used in DEM interpolation for seafloor mapping. If get_source_epsg returns None and a target EPSG is specified, raise immediately rather than proceeding with undefined coordinates.
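
The fail-fast rule in the paragraph above can be isolated into a small guard. This is a minimal sketch; the function name require_transformable and its return convention are assumptions made for illustration, not part of laspy or pyproj.

```python
from typing import Optional


def require_transformable(src_epsg: Optional[int], target_epsg: Optional[int]) -> bool:
    """Return True when a transform is needed, False when coordinates pass through.

    Raises ValueError when a target CRS is requested but the source CRS is unknown.
    """
    if target_epsg is None:
        return False  # no reprojection requested; source coordinates are written as-is
    if src_epsg is None:
        raise ValueError(
            f"Target EPSG:{target_epsg} requested but the source CRS is undefined"
        )
    return src_epsg != target_epsg
```

Calling the guard before any point data is streamed keeps the abort cheap: nothing has been written yet, so there is no partial output to clean up.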

Step 2 — Apply Classification Mask

The ASPRS LAS specification assigns a numeric classification code to every point. For marine terrain modeling, only three classes produce valid seabed geometry:

Code	Label	Marine use
2	Ground	Airborne LiDAR bottom returns
9	Water	Bathymetric LiDAR sub-surface returns
12	Overlap/Bottom	Context-dependent; include when sensor metadata confirms seabed assignment

All other classes, including 1 (Unclassified), 7 (Low Point/Noise), and 18 (High Noise), must be discarded. Applying the mask inside each chunk, rather than after a full load, means the unfiltered point cloud never occupies RAM all at once:

import numpy as np

BATHYMETRIC_CLASSES = frozenset({2, 9, 12})

def filter_bathymetric_chunk(chunk: laspy.PackedPointRecord) -> np.ndarray:
    """
    Boolean mask selecting only valid bathymetric classification codes.
    Uses numpy.isin for vectorised comparison across the chunk array.
    """
    return np.isin(chunk.classification, sorted(BATHYMETRIC_CLASSES))
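
To see what the mask does, here is a small sketch on synthetic data. The codes and z arrays are made-up stand-ins for chunk.classification and chunk.z, so no LAS file is needed to run it.

```python
import numpy as np

BATHYMETRIC_CLASSES = frozenset({2, 9, 12})

# Synthetic classification codes standing in for chunk.classification
codes = np.array([1, 2, 7, 9, 12, 18, 2], dtype=np.uint8)
# Synthetic elevations standing in for chunk.z
z = np.array([0.4, -12.1, -3.0, -15.8, -14.2, 25.0, -11.9])

# True where the point belongs to one of the retained classes
mask = np.isin(codes, sorted(BATHYMETRIC_CLASSES))
kept_z = z[mask]

print(mask.tolist())
print(kept_z.tolist())
```

Indexing with the boolean mask copies only the retained points, so the arrays passed to later stages shrink in proportion to how much of the chunk is noise or unclassified.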

Step 3 — Stream in Chunks and Transform Coordinates

laspy.open() exposes chunk_iterator(chunk_size), which reads the file in fixed-size slices and releases each slice after processing. Set chunk_size to 2 000 000 points (roughly 60–120 MB depending on point record format) for a predictable memory ceiling regardless of total file size.
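
The chunk size can also be derived from a memory budget rather than fixed up front. The sketch below is a back-of-envelope estimate under stated assumptions: it counts the raw point records plus three float64 coordinate columns (24 bytes per retained point) and ignores temporaries such as the array built by np.column_stack and the text formatting inside np.savetxt. Both function names are hypothetical.

```python
def chunk_working_set_mb(
    chunk_size: int, record_length: int, retained_fraction: float = 1.0
) -> float:
    """Rough per-chunk footprint in MB: raw records plus float64 X, Y, Z copies."""
    raw_bytes = chunk_size * record_length
    xyz_bytes = int(chunk_size * retained_fraction) * 3 * 8  # three float64 columns
    return (raw_bytes + xyz_bytes) / 1e6


def chunk_size_for_budget(budget_mb: float, record_length: int) -> int:
    """Largest chunk size whose worst-case footprint (all points retained) fits the budget."""
    bytes_per_point = record_length + 3 * 8
    return int(budget_mb * 1e6) // bytes_per_point


if __name__ == "__main__":
    # 28 bytes is the base record length of LAS point format 1
    print(chunk_working_set_mb(2_000_000, 28))
    print(chunk_size_for_budget(104, 28))
```

Treat the result as a lower bound on real usage and leave headroom for the temporaries the estimate leaves out.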

Vertical datum transformation is applied per-chunk using pyproj.Transformer. Constructing the transformer once per file (not once per chunk) avoids repeated grid-file loading:

import pyproj

def build_transformer(
    src_epsg: int,
    dst_epsg: int
) -> pyproj.Transformer:
    """
    Build a deterministic CRS transformer with always_xy=True to prevent
    axis-order inversion — a common silent failure with geographic CRS pairs.
    """
    return pyproj.Transformer.from_crs(
        f"EPSG:{src_epsg}",
        f"EPSG:{dst_epsg}",
        always_xy=True
    )

The pipeline diagram below shows how each chunk moves from disk through the filter and transform stages to the buffered XYZ writer:

Step 4 — Complete Production Implementation

The following script wires header inspection, classification filtering, CRS transformation, and buffered writing into a single runnable pipeline. Requirements: laspy>=2.4.0, numpy>=1.24.0, pyproj>=3.4.0.

#!/usr/bin/env python3
"""
Production-grade LAS to XYZ converter for bathymetric point clouds.
Enforces classification filtering, chunked streaming, and deterministic CRS handling.
"""

import argparse
import logging
import sys
from pathlib import Path
from typing import Optional

import laspy
import numpy as np
import pyproj

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
)
logger = logging.getLogger(__name__)

BATHYMETRIC_CLASSES = frozenset({2, 9, 12})
CHUNK_SIZE = 2_000_000       # ~60–120 MB per chunk depending on point record format
OUTPUT_DELIMITER = " "
OUTPUT_FMT = "%.4f %.4f %.4f"


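def get_source_epsg(las_file: laspy.LasReader) -> Optional[int]:
    # Prefer laspy's parser: it reads both WKT and GeoTIFF-key projection VLRs,
    # which laspy may expose as typed objects without a raw record_data attribute.
    try:
        crs = las_file.header.parse_crs()
        if crs is not None:
            return crs.to_epsg()
    except Exception:
        pass  # fall back to scanning raw VLR payloads
    for vlr in las_file.header.vlrs:
        if getattr(vlr, "user_id", "") == "LASF_Projection":
            try:
                raw = getattr(vlr, "record_data", None)
                if isinstance(raw, (bytes, bytearray)):
                    wkt = raw.decode("utf-8", errors="ignore").strip("\x00")
                    if wkt:
                        return pyproj.CRS.from_wkt(wkt).to_epsg()
            except Exception:
                continue  # payload was not parseable WKT; try the next VLR
    return None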


def filter_bathymetric_chunk(chunk: laspy.PackedPointRecord) -> np.ndarray:
    return np.isin(chunk.classification, sorted(BATHYMETRIC_CLASSES))


def convert_las_to_xyz(
    input_path: Path,
    output_path: Path,
    target_epsg: Optional[int] = None,
    chunk_size: int = CHUNK_SIZE,
) -> None:
    """Stream-convert a LAS/LAZ file to space-delimited XYZ."""
    if not input_path.is_file():
        raise FileNotFoundError(f"Input LAS not found: {input_path}")

    with laspy.open(str(input_path), "r") as las_file:
        hdr = las_file.header
        src_epsg = get_source_epsg(las_file)

        if src_epsg is None and target_epsg is not None:
            raise ValueError(
                "Source LAS has no parseable CRS in its VLRs. "
                "Cannot transform coordinates safely — aborting."
            )

        logger.info(
            "Reading %s | points: %s | CRS: %s",
            input_path.name,
            f"{hdr.point_count:,}",
            f"EPSG:{src_epsg}" if src_epsg else "undefined",
        )

        # Build transformer once; avoids repeated PROJ grid-file loading per chunk
        transformer: Optional[pyproj.Transformer] = None
        if target_epsg and src_epsg and target_epsg != src_epsg:
            transformer = pyproj.Transformer.from_crs(
                f"EPSG:{src_epsg}", f"EPSG:{target_epsg}", always_xy=True
            )
            logger.info("Coordinate transformation: EPSG:%s → EPSG:%s", src_epsg, target_epsg)

        total_written = 0

        # buffering=1_048_576 reduces syscall overhead during high-throughput writes
        with open(str(output_path), "w", buffering=1_048_576) as out_f:
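            effective_epsg = target_epsg if target_epsg else src_epsg
            # Avoid writing "EPSG:None" when neither source nor target CRS is known
            crs_label = f"EPSG:{effective_epsg}" if effective_epsg else "undefined"
            out_f.write(f"# LAS to XYZ | source: {input_path.name}\n")
            out_f.write(f"# CRS: {crs_label}\n")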
            out_f.write("# X Y Z\n")

            for chunk in las_file.chunk_iterator(chunk_size):
                mask = filter_bathymetric_chunk(chunk)
                if not np.any(mask):
                    continue

                x = chunk.x[mask].astype(np.float64)
                y = chunk.y[mask].astype(np.float64)
                z = chunk.z[mask].astype(np.float64)

                if transformer is not None:
                    xt, yt, zt = transformer.transform(x, y, z)
                    x = np.asarray(xt)
                    y = np.asarray(yt)
                    z = np.asarray(zt)

                np.savetxt(out_f, np.column_stack((x, y, z)),
                           delimiter=OUTPUT_DELIMITER, fmt=OUTPUT_FMT)
                total_written += len(x)

    logger.info("Done. %s points written to %s", f"{total_written:,}", output_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Convert LAS/LAZ to XYZ for bathymetric processing"
    )
    parser.add_argument("input_las", type=Path, help="Input LAS or LAZ file")
    parser.add_argument("output_xyz", type=Path, help="Output XYZ file path")
    parser.add_argument("--target-epsg", type=int, default=None,
                        help="Target EPSG code for CRS transformation")
    parser.add_argument("--chunk-size", type=int, default=CHUNK_SIZE,
                        help="Points per processing chunk (default: 2 000 000)")
    args = parser.parse_args()

    try:
        convert_las_to_xyz(
            args.input_las, args.output_xyz, args.target_epsg, args.chunk_size
        )
    except Exception as exc:
        logger.error("Pipeline failed: %s", exc)
        sys.exit(1)

Step 5 — Verification and Acceptance Test

After conversion, confirm three things: the filtered point count is plausible for the survey density, no XYZ row contains nan or inf, and the Z range matches the expected depth window for the survey area. When the vertical transformation is in play, the decisive acceptance test is the residual at known control points — co-located benchmarks, GNSS tide-gauge ties, or overlapping survey lines. The vertical drift is the root-mean-square of the difference between transformed depths and control depths:

$$\mathrm{RMSE}_z = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(z_i^{\text{out}} - z_i^{\text{ref}}\right)^2}$$

A pass requires $\mathrm{RMSE}_z$ to sit inside the survey’s vertical uncertainty budget (commonly the IHO S-44 total vertical uncertainty for the order); a non-zero systematic mean offset rather than scatter is the signature of a wrong datum or a missing geoid/separation grid, not random noise. The diagram below contrasts the two residual signatures against the same control line:

import numpy as np
from pathlib import Path

def validate_xyz_output(xyz_path: Path, expected_depth_min: float, expected_depth_max: float) -> None:
    """
    Load the output XYZ and assert basic integrity.
    Raises AssertionError with a diagnostic message on any failure.
    """
    data = np.loadtxt(
        xyz_path, comments="#", delimiter=" ", dtype=np.float64
    )
    assert data.ndim == 2 and data.shape[1] == 3, \
        f"Expected Nx3 array, got shape {data.shape}"
    assert np.isfinite(data).all(), \
        "Output contains NaN or Inf — check transformer pipeline"
    z_min, z_max = data[:, 2].min(), data[:, 2].max()
    assert expected_depth_min <= z_min and z_max <= expected_depth_max, (
        f"Z range [{z_min:.2f}, {z_max:.2f}] outside expected "
        f"[{expected_depth_min}, {expected_depth_max}] — datum mismatch?"
    )
    print(f"Validation passed: {len(data):,} points, Z range [{z_min:.2f}, {z_max:.2f}]")

# Example: nearshore survey, depths expected -80 m to +2 m (MLLW)
validate_xyz_output(Path("survey_line_01_bathy.xyz"), -80.0, 2.0)

To validate the vertical datum transformation itself, compute the residual against known control depths and assert it stays within the survey’s vertical uncertainty budget:

import numpy as np

def vertical_drift_rmse(
    transformed_z: np.ndarray,
    reference_z: np.ndarray,
    max_rmse: float,
) -> float:
    """
    Root-mean-square vertical residual at co-located control points.
    Raises AssertionError if drift exceeds the allowed budget, or if the
    mean offset dominates the scatter (datum/separation-grid error signature).
    """
    if transformed_z.shape != reference_z.shape:
        raise ValueError("Control arrays must align element-wise")
    residual = transformed_z - reference_z
    rmse = float(np.sqrt(np.mean(residual ** 2)))
    mean_offset = float(np.mean(residual))
    assert rmse <= max_rmse, f"Vertical RMSE {rmse:.3f} m exceeds budget {max_rmse} m"
    if abs(mean_offset) > 0.5 * rmse:
        raise AssertionError(
            f"Systematic offset {mean_offset:.3f} m dominates scatter — "
            "likely wrong vertical datum or missing separation grid"
        )
    return rmse
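
The mean-offset check in the function above rests on a standard decomposition of the squared error into a bias term and a scatter term. Writing the residual as $r_i = z_i^{\text{out}} - z_i^{\text{ref}}$, with mean $\bar{r}$ and population standard deviation $\sigma_r$ (notation introduced for this sketch):

```latex
\mathrm{RMSE}_z^2 = \frac{1}{n}\sum_{i=1}^{n} r_i^2 = \bar{r}^{\,2} + \sigma_r^2,
\qquad
\bar{r} = \frac{1}{n}\sum_{i=1}^{n} r_i,
\quad
\sigma_r^2 = \frac{1}{n}\sum_{i=1}^{n}\left(r_i - \bar{r}\right)^2
```

A wrong datum or missing separation grid inflates $\bar{r}^{\,2}$ while $\sigma_r^2$ stays small; random noise does the opposite. Under this identity, the condition $|\bar{r}| > 0.5\,\mathrm{RMSE}_z$ used in the code is equivalent to the bias term exceeding one quarter of $\mathrm{RMSE}_z^2$, that is, $|\bar{r}| > \sigma_r/\sqrt{3}$. The 0.5 factor itself is a tunable threshold, not a value taken from a standard.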

For larger outputs where np.loadtxt would itself exhaust memory, stream-validate the output file line by line instead, keeping only running statistics in memory, in the same spirit as the chunk_iterator pattern used for the input.
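
A minimal sketch of such a streaming validator follows, using only the standard library. The function name stream_validate_xyz and its return tuple are illustrative choices; it assumes the whitespace-delimited, '#'-commented layout produced by the converter in Step 4.

```python
import math
from typing import Iterable, Tuple


def stream_validate_xyz(lines: Iterable[str]) -> Tuple[int, float, float]:
    """Validate XYZ rows one at a time, keeping only running statistics.

    Returns (row_count, z_min, z_max). Raises ValueError on a malformed
    or non-finite row, reporting the 1-based line number.
    """
    count = 0
    z_min, z_max = math.inf, -math.inf
    for lineno, line in enumerate(lines, start=1):
        text = line.strip()
        if not text or text.startswith("#"):
            continue  # skip blank lines and header comments
        fields = text.split()
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 columns, got {len(fields)}")
        x, y, z = (float(f) for f in fields)
        if not (math.isfinite(x) and math.isfinite(y) and math.isfinite(z)):
            raise ValueError(f"line {lineno}: non-finite coordinate")
        count += 1
        z_min = min(z_min, z)
        z_max = max(z_max, z)
    if count == 0:
        raise ValueError("no data rows found")
    return count, z_min, z_max
```

Passing an open file handle (for example, with open(path) as f: stream_validate_xyz(f)) reads one line at a time, so memory use stays flat regardless of output size. The returned Z range can then be compared against the expected depth window exactly as in validate_xyz_output.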

Edge Cases and Gotchas

LAZ (compressed) input. laspy>=2.0 reads LAZ natively when lazrs or laszip is installed. If neither backend is present, laspy.open() raises LaspyException without a clear message. Install a backend (for example, pip install "laspy[lazrs]") and confirm with python -c "import lazrs" before running in production.
always_xy=True is not optional. Omitting it in pyproj.Transformer.from_crs() makes pyproj honor each CRS's declared axis order, so geographic CRS pairs (e.g., EPSG:4326 → EPSG:4979) expect latitude first. Passing longitude as X then silently swaps latitude and longitude, producing coordinates in the wrong hemisphere or far outside the survey area. The error is not caught by the validator because the values are numerically finite.
Class 12 ambiguity in mixed surveys. In combined LiDAR-bathymetry acquisitions (e.g., EAARL-B or Chiroptera), class 12 may be flagged as “Overlap” at the water surface rather than confirmed bottom — include it only when the sensor processing log explicitly assigns it to seabed returns. When in doubt, exclude and note the decision in the output header comment.
Chunk boundary classification drift. If the source LAS was written with imperfect point ordering, a single scan line can straddle two chunks. The boolean mask operates independently on each chunk, so partial scan lines are handled correctly — but verify total retained count against a reference tool such as pdal info --stats to confirm no points are lost or duplicated at boundaries. The PDAL-based cleaning workflow provides an independent point count for cross-checking.
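
Because an axis swap yields finite numbers, a finiteness check alone will not catch it. One inexpensive guard is to compare output coordinates against the expected survey extent. The sketch below is an illustration only: the helper name count_outside_bbox and the bounding-box values are hypothetical, and the extent must be expressed in the output CRS.

```python
from typing import Iterable, Tuple

BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in the output CRS


def count_outside_bbox(points: Iterable[Tuple[float, float]], bbox: BBox) -> int:
    """Count (x, y) pairs that fall outside the expected survey extent."""
    x_min, y_min, x_max, y_max = bbox
    return sum(
        1
        for x, y in points
        if not (x_min <= x <= x_max and y_min <= y <= y_max)
    )


if __name__ == "__main__":
    # Hypothetical survey extent: longitude -71..-70, latitude 41..42
    extent = (-71.0, 41.0, -70.0, 42.0)
    correct = [(-70.5, 41.5), (-70.2, 41.9)]   # (lon, lat) as intended
    swapped = [(y, x) for x, y in correct]     # what an axis-order inversion yields
    print(count_outside_bbox(correct, extent))
    print(count_outside_bbox(swapped, extent))
```

A non-zero count on a sample of output rows is a cheap signal to stop and inspect the transformer configuration before the XYZ file reaches gridding.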

Related

Point Cloud Filtering for Multibeam Sonar: parent workflow covering the full MBES filtering pipeline
Using PDAL for Bathymetric Point Cloud Cleaning: PDAL-native pipeline for noise removal and scan-line repair
Removing Bathymetric Artifacts and Noise: raster-stage artifact suppression after gridding
DEM Interpolation Techniques for Seafloor Mapping: downstream consumer of XYZ exports
