geomermaids · GeoParquet

Free daily snapshots of OpenStreetMap data for North America, packaged so you can query them straight from a URL — no downloads, no account, no setup beyond DuckDB.

What is this?

This site publishes OpenStreetMap data in GeoParquet 2.0 — a modern, open, columnar file format — split by country, by state/province/region, and by theme (buildings, roads, water, etc.). Every file is small enough to use on a laptop, and the format is cloud-native: your query engine can read just the columns and rows it needs over HTTP, without downloading the whole file.

Coverage

Currently spans three countries — 98 admin regions × 16 themes = 1,568 files per daily snapshot, roughly 15 GB altogether:

United States — 50 states, the District of Columbia, Puerto Rico, and US Virgin Islands (53 regions)
Canada — 10 provinces and 3 territories (13 regions)
Mexico — 31 states and Ciudad de México (32 regions)

New snapshots publish every 24 hours; each one is dated and immutable, so analysis you run today is reproducible next year against the same bytes, for as long as that snapshot is retained (see Retention). Europe, South America, and the rest of the world are on the roadmap. Need a specific region sooner? Get in touch at contact@geomermaids.com.

Themes

Each admin region is split into 16 thematic files, each with typed columns promoted from OSM tags:

theme	geometry	typed columns (excerpt)
`buildings`	polygon	building, name, levels, height, addr_*
`roads`	linestring	highway, ref, oneway, surface, maxspeed, lanes
`railways`	point, linestring	railway, name, operator, gauge, electrified
`waterways`	linestring	waterway, name, width, intermittent, tunnel
`water`	polygon	water, natural, name, intermittent, salt
`landuse`	polygon	landuse, name, operator
`natural_areas`	polygon	natural, name, wetland
`natural_features`	point	natural, name, ele, prominence
`places`	point	place, name, population, admin_level, capital
`boundaries`	polygon	boundary, admin_level, name, iso3166_*
`pois`	point	amenity, shop, tourism, leisure, office, healthcare
`amenities_polygons`	polygon	amenity, shop, tourism, leisure, brand
`power`	point, linestring, polygon	power, name, voltage, frequency, operator
`aeroways`	point, linestring, polygon	aeroway, name, iata, icao, ref, surface
`barriers`	point, linestring	barrier, name, access, height, material
`public_transport`	point	public_transport, highway, railway, name, operator

Every file also ships the full OSM tags as a MAP<VARCHAR, VARCHAR> — anything not promoted to a typed column is still there, just a tags['key'] lookup away.

Browse it like a directory

Open our File Explorer in a browser to walk the bucket as a tree — every snapshot, country, and state listed with sizes and modified times. Click any .parquet to download it, or hit the adjacent view link to preview it on a map. CLI tools (DuckDB httpfs, curl, rclone) get the raw bytes exactly as before; the HTML listing only renders for clients that send Accept: text/html.

Preview a file on a map

Don't want to write SQL just to see what a file contains? The map viewer opens any GeoParquet from the bucket directly in your browser — DuckDB-WASM streams just the row groups your viewport overlaps, deck.gl renders them on a basemap. Paste a URL or click view from the File Explorer; pan and zoom to load other areas. Useful for spot-checking shape, density, and admin-region coverage before committing to a download.

URL pattern

https://parquetry.geomermaids.com/<YYYY-MM-DD | latest>/country=<CC>/state=<ISO>/<theme>.parquet

A few concrete examples:

# Every building in New York State (latest snapshot)
https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet

# Ontario's entire road network
https://parquetry.geomermaids.com/latest/country=CA/state=CA-ON/roads.parquet

# Points of interest in Jalisco, Mexico
https://parquetry.geomermaids.com/latest/country=MX/state=MX-JAL/pois.parquet

# Pinned to a specific date — immutable, reproducible
https://parquetry.geomermaids.com/2026-04-19/country=US/state=US-CA/waterways.parquet

# Machine-readable index of every snapshot ever published
https://parquetry.geomermaids.com/snapshots.json

Dated snapshots are immutable — safe to pin in reproducible pipelines. The latest/ alias always resolves to the most recent.
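
The URL pattern is regular enough to assemble in code. Below is a minimal Python sketch; the helper name build_url and its arguments are illustrative and not part of any published client library. It assumes the country code is always the part of the ISO 3166-2 code before the hyphen, which holds for every example above.

```python
from datetime import date
from typing import Optional

BASE = "https://parquetry.geomermaids.com"


def build_url(state_iso: str, theme: str, snapshot: Optional[date] = None) -> str:
    """Assemble a file URL following the pattern shown above.

    state_iso is an ISO 3166-2 code such as "US-NY"; the country code is
    taken from the part before the hyphen (an assumption that matches the
    examples). snapshot=None selects the latest/ alias.
    """
    country, sep, region = state_iso.partition("-")
    if not (country and sep and region):
        raise ValueError(f"expected a code like 'US-NY', got {state_iso!r}")
    prefix = snapshot.isoformat() if snapshot else "latest"
    return f"{BASE}/{prefix}/country={country}/state={state_iso}/{theme}.parquet"


print(build_url("US-NY", "buildings"))
print(build_url("US-CA", "waterways", date(2026, 4, 19)))
```

Passing a date pins the immutable snapshot; leaving it out follows latest/.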

Retention

Snapshots age out on a tiered schedule so the archive stays bounded:

Last 14 days — every daily snapshot is kept.
15 days to 12 months — only the first day of each month is kept.
Older than 12 months — only December 31 of each year is kept, forever.

Yearly anchors survive the monthly-pruning window, so Dec 31 snapshots are always reachable. The current set is listed in snapshots.json. Need a specific date kept indefinitely? Get in touch.
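
To check whether a date you plan to pin will still exist later, the tiers above can be sketched as a small Python predicate. This is an illustration of the stated policy, not the pruning code the service runs. Two boundary choices are assumptions: "last 14 days" is read as an age of at most 14 days, and "12 months" as 365 days. snapshots.json remains the authoritative list.

```python
from datetime import date


def is_retained(snapshot: date, today: date) -> bool:
    """Return True if the tiered schedule described above keeps this snapshot."""
    age_days = (today - snapshot).days
    if age_days < 0:
        return False  # a snapshot dated in the future cannot exist yet
    if snapshot.month == 12 and snapshot.day == 31:
        return True   # yearly anchor: kept through every tier
    if age_days <= 14:
        return True   # daily tier (boundary is an assumption)
    if age_days <= 365:
        return snapshot.day == 1  # monthly tier ("12 months" taken as 365 days)
    return False      # older than a year and not a Dec 31 anchor


today = date(2026, 4, 19)
for d in (date(2026, 4, 10), date(2026, 3, 20), date(2026, 3, 1),
          date(2025, 12, 31), date(2024, 6, 1)):
    print(d.isoformat(), is_retained(d, today))
```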

Inside each file

Every parquet is validator-clean GeoParquet 2.0 — spatial-query-optimized, metadata-complete, bloom-filtered. Pop the hood on country=US/state=US-NY/buildings.parquet for example, and gpio's inspect / check commands report:

📄 country=US/state=US-NY/buildings.parquet          394 MB

Rows                   4,469,880
Row groups             88   (avg 50,794 rows each, 4.5 MB compressed)
Compression            ZSTD
Parquet type           GEOMETRY   (native v2.0 logical type, not WKB)
GeoParquet version     2.0.0
CRS                    OGC:CRS84  (WGS 84, lng/lat)
Geometry types         MultiPolygon
Bbox                   [-79.76, 40.49, -71.86, 45.01]
Bloom filters          16 columns, 1.0 MB total
Spatial order          Hilbert   (consecutive/random ratio 0.00089 —
                       "strongly spatially clustered")
Spec validation        29 of 29 checks pass  (gpio check, zero warnings)

Wait, what's a bloom filter? A tiny probabilistic index (under 1 KB per column per row group in the file above: 1.0 MB spread over 16 columns × 88 row groups) that answers one question very fast: "could this value be in this row group?" If the filter says no, the query engine skips the whole group without opening it. If it says maybe, the engine reads the group and checks for real. There are no false negatives, so a match is never missed.

The win: a query like WHERE name = 'Central Park' across 4.5M New York buildings touches one row group out of 88 instead of the whole 394 MB file. Same story for every typed column and for tag-key lookups.
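
The skip-or-read decision can be illustrated with a toy bloom filter in Python. This is a teaching sketch only: Parquet's real filters use a split-block layout with xxHash, while the class below uses SHA-256 and a plain bit array. The building names and row-group contents are made up for the example.

```python
import hashlib
from typing import Iterator, List


class ToyBloomFilter:
    """A deliberately simple bloom filter (not Parquet's on-disk format)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int doubles as the bit array

    def _positions(self, value: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting the value with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "maybe present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Hypothetical row groups, each holding a few values of a name column.
row_groups = [
    ["Chrysler Building", "Flatiron Building"],
    ["One Vanderbilt", "Woolworth Building"],
]
filters: List[ToyBloomFilter] = []
for names in row_groups:
    bloom = ToyBloomFilter()
    for name in names:
        bloom.add(name)
    filters.append(bloom)


def candidate_groups(value: str) -> List[int]:
    """Row groups that must be read; every other group is skipped."""
    return [i for i, bloom in enumerate(filters) if bloom.might_contain(value)]


print(candidate_groups("One Vanderbilt"))
```

A group that really holds the value is always in the candidate list; a group that does not hold it is usually, but not always, left out.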

Every file in every theme in every region carries the same guarantees:

Native Parquet GEOMETRY logical type, emitted directly by DuckDB's parquet writer (GEOPARQUET_VERSION 'V2') — no WKB-in-binary + sidecar metadata roundtrip. Carries per-geometry row-group bbox statistics that DuckDB-class engines prune on without deserializing anything.
Explicit bbox column (xmin/ymin/xmax/ymax). The native geo statistics above only help readers that understand the GEOMETRY type; the bbox column exposes standard Parquet float stats so any engine — Spark, Trino, polars, pyarrow — prunes row groups on WHERE bbox.xmin <= … AND bbox.xmax >= ….
Hilbert-ordered rows. Bbox queries read a contiguous slice of row groups instead of the whole file. A typical spatial query touches under 5% of file bytes.
ZSTD compression on every column, including geometry.
50,000-row row groups. Small enough for aggressive spatial pruning, large enough to keep metadata overhead negligible.
Bloom filters on every promoted column plus country, state_iso, osm_type, and the tags map (see above). Point lookups skip whole row groups before any I/O.
Hive-partitioned by country= and state=. Query engines prune entire countries or regions before reading a single parquet.
Dated, immutable snapshots plus a latest/ alias. Pin a date for reproducibility; use latest/ when you just want today's data.
29 of 29 gpio check validations pass on every file — zero warnings, zero cosmetic metadata drift.
Open-source pipeline. Built with osmium-tool + DuckDB. Full source at github.com/gsueur/osm-geoparquet (MIT) — every file here is reproducible from source.
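
Why Hilbert ordering keeps nearby features in nearby row groups can be seen with the textbook grid-to-curve conversion, sketched below in Python. This is the standard algorithm on an integer grid, not the pipeline's own implementation (which is not shown on this page); the function name hilbert_index is illustrative.

```python
def hilbert_index(order: int, x: int, y: int) -> int:
    """Position of grid cell (x, y) along a Hilbert curve.

    The grid is 2**order cells on each side; x and y must lie in
    range(2**order).
    """
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


# Sort every cell of an 8 x 8 grid by its curve position, then measure how far
# apart consecutive cells are (Manhattan distance).
order = 3
side = 1 << order
cells = sorted(
    ((x, y) for x in range(side) for y in range(side)),
    key=lambda cell: hilbert_index(order, cell[0], cell[1]),
)
steps = [abs(ax - bx) + abs(ay - by) for (ax, ay), (bx, by) in zip(cells, cells[1:])]
print(max(steps))
```

Consecutive positions on the curve are always neighbouring cells, so rows sorted this way and cut into fixed-size row groups end up with compact bounding boxes.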

Try it now

New to DuckDB? Our cookbooks offer a gentle introduction to the engine and state-of-the-art queries for GeoParquet 2.0.

Direct access — you know the region and theme, you want that file. Plain HTTPS, zero setup.
Catalog-wide queries — you want "every airport in North America" or "all rail stations in Canada". S3-style wildcards across the whole catalog.

Direct access

One region, one theme, one URL. No credentials, no session state — DuckDB streams the bytes over HTTPS and pulls only the row groups your filter touches.

1 · Setup

INSTALL downloads each extension once; LOAD enables it for the current session. The url() macro keeps later snippets readable. It is optional, just a thin wrapper so the examples below don't repeat the full URL:

INSTALL httpfs; LOAD httpfs;
INSTALL spatial; LOAD spatial;

-- Our coordinates are (longitude, latitude). Tell the spatial extension so
-- ST_Distance_Sphere & friends read the axes in that order (example 5).
SET geometry_always_xy = true;

CREATE OR REPLACE MACRO url(state, theme) AS
  'https://parquetry.geomermaids.com/latest/country=' ||
  split_part(state, '-', 1) || '/state=' || state || '/' || theme || '.parquet';

Plain string URLs work equally well:

-- Without the macro, just pass the full URL directly:
SELECT COUNT(*) FROM read_parquet(
  'https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet'
);

2 · How many buildings in a bounding box?

A rectangle around Midtown Manhattan. See the spatial pushdown notes below for when DuckDB prunes row groups automatically and when the predicate still forces a full scan — the answer currently depends on the file's geometry type and the function you use.

SELECT COUNT(*) AS buildings
FROM read_parquet(url('US-NY', 'buildings'))
WHERE bbox.xmin <= -73.98 AND bbox.xmax >= -73.99   -- prunes row groups (standard float stats)
  AND bbox.ymin <= 40.76  AND bbox.ymax >= 40.75
  AND st_intersects_extent(geometry, ST_MakeEnvelope(-73.99, 40.75, -73.98, 40.76));

The bbox.* clause is what makes this fast: it filters the explicit bbox covering column, whose standard Parquet float statistics let any engine skip row groups before reading them. st_intersects_extent then does the exact bbox-overlap test. Each bbox is a conservative bound (rounded outward to float32), so the bbox.* clause alone returns a slight superset — drop the st_intersects_extent line on engines without DuckDB's spatial functions (Spark, Trino, polars) and you still get correct pruning.
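
The comparison directions in the bbox.* clause are easy to get backwards: each minimum of the stored box is compared against the maximum of the query box, and vice versa. The Python sketch below spells out the overlap test and generates the matching WHERE fragment. The names BBox, overlaps, and bbox_where are illustrative helpers, not part of DuckDB or of this dataset's tooling.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def overlaps(a: BBox, b: BBox) -> bool:
    """True when the two rectangles share at least one point."""
    return (a.xmin <= b.xmax and a.xmax >= b.xmin
            and a.ymin <= b.ymax and a.ymax >= b.ymin)


def bbox_where(query: BBox, column: str = "bbox") -> str:
    """SQL fragment that keeps rows whose bbox column overlaps the query box."""
    return (
        f"{column}.xmin <= {query.xmax} AND {column}.xmax >= {query.xmin} "
        f"AND {column}.ymin <= {query.ymax} AND {column}.ymax >= {query.ymin}"
    )


midtown = BBox(xmin=-73.99, ymin=40.75, xmax=-73.98, ymax=40.76)
print(bbox_where(midtown))
print(overlaps(BBox(-73.985, 40.755, -73.984, 40.756), midtown))  # inside the box
print(overlaps(BBox(-71.07, 42.34, -71.05, 42.36), midtown))      # Boston, far away
```

The generated fragment for the Midtown rectangle is the same four comparisons used in the query above.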

3 · Tallest residential towers in a neighborhood

Returns four columns out of the file's 20 (the filter additionally reads bbox and geometry) and applies a spatial filter in the same query:

SELECT name, building, levels, height
FROM read_parquet(url('US-NY', 'buildings'))
WHERE building = 'residential'
  AND levels > 30
  AND bbox.xmin <= -73.95 AND bbox.xmax >= -73.99   -- prune row groups to the neighborhood
  AND bbox.ymin <= 40.79  AND bbox.ymax >= 40.76
  AND st_intersects_extent(geometry, ST_MakeEnvelope(-73.99, 40.76, -73.95, 40.79))
ORDER BY levels DESC
LIMIT 10;

4 · Match restaurants to the buildings they sit inside

A spatial join between two theme files, scoped to downtown Boston with a bbox filter on each side so DuckDB reads only the relevant row groups from both files before the ST_Contains join:

SELECT b.name AS building, p.name AS restaurant, p.cuisine
FROM read_parquet(url('US-MA', 'buildings')) b
JOIN read_parquet(url('US-MA', 'pois')) p
  ON ST_Contains(b.geometry, p.geometry)
WHERE p.amenity = 'restaurant'
  AND b.name IS NOT NULL
  AND b.bbox.xmin <= -71.05 AND b.bbox.xmax >= -71.07   -- downtown Boston, both files
  AND b.bbox.ymin <= 42.36  AND b.bbox.ymax >= 42.34
  AND p.bbox.xmin <= -71.05 AND p.bbox.xmax >= -71.07
  AND p.bbox.ymin <= 42.36  AND p.bbox.ymax >= 42.34
LIMIT 10;

5 · Nearest features to a point

Amenities within 500 m of the Empire State Building (40.7484° N, 73.9857° W), sorted by distance. Two predicates: a coarse degree-based bbox prefilter and an exact great-circle cutoff in metres:

WITH origin AS (SELECT ST_Point(-73.9857, 40.7484) AS pt)
SELECT p.amenity, p.name,
       ST_Distance_Sphere(p.geometry, o.pt) AS dist_m
FROM read_parquet(url('US-NY', 'pois')) p, origin o
WHERE p.bbox.xmin <= -73.9757 AND p.bbox.xmax >= -73.9957   -- ±0.01° prefilter (~0.85 km E-W, ~1.1 km N-S; prunes row groups)
  AND p.bbox.ymin <= 40.7584 AND p.bbox.ymax >= 40.7384
  AND ST_Distance_Sphere(p.geometry, o.pt) < 500            -- exact cutoff, metres
ORDER BY dist_m
LIMIT 15;

Why both? The bbox.* prefilter compares plain float columns whose Parquet statistics let DuckDB skip row groups that don't overlap the search box before reading any data — and it works on any engine, no spatial extension needed. ST_Distance_Sphere returns the exact great-circle distance in metres but isn't a bbox predicate; used alone it forces a full scan. Pairing them gives you readable metre-based output and the cloud-native pruning that makes querying over HTTPS viable. The bbox prefilter is deliberately loose (~1 km box > 500 m radius) so no in-range point is dropped; the sphere check tightens it to the exact 500 m.
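
How loose the degree-based box needs to be for a given radius can be estimated ahead of time. The Python sketch below converts a radius in metres to longitude and latitude padding at a given latitude, and includes a haversine distance for comparison. The Earth radius constant is an assumption, and whether ST_Distance_Sphere uses exactly this formula and radius is not stated here, so treat the numbers as approximate.

```python
import math
from typing import Tuple

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres (assumed value)


def degree_padding(lat_deg: float, radius_m: float) -> Tuple[float, float]:
    """Return (dlon, dlat) in degrees that cover radius_m around a point.

    A small-distance approximation: fine for hundreds of metres away from
    the poles, not for continental distances.
    """
    dlat = math.degrees(radius_m / EARTH_RADIUS_M)
    dlon = dlat / math.cos(math.radians(lat_deg))
    return dlon, dlat


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in metres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


lon, lat = -73.9857, 40.7484
dlon, dlat = degree_padding(lat, 500)
print(f"pad longitude by {dlon:.5f} deg, latitude by {dlat:.5f} deg")
print(f"check: {haversine_m(lon, lat, lon, lat + dlat):.1f} m north-south")
```

At this latitude both paddings come out well under 0.01°, so the ±0.01° box in the query leaves comfortable margin around the 500 m radius.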

6 · Compare across states

Pass a list of URLs, group by state:

SELECT state_iso, COUNT(*) AS pois
FROM read_parquet([
  url('US-NY', 'pois'),
  url('US-MA', 'pois'),
  url('US-CT', 'pois'),
  url('US-RI', 'pois')
])
GROUP BY state_iso
ORDER BY pois DESC;

7 · Same data from Python

If DuckDB isn't your thing, GeoPandas works just as well:

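import geopandas as gpd

# Reading from an https:// URL goes through fsspec, so fsspec and aiohttp
# need to be installed alongside geopandas and pyarrow.
gdf = gpd.read_parquet(
    "https://parquetry.geomermaids.com/latest/country=US/state=US-RI/buildings.parquet",
    columns=["building", "name", "height", "geometry"],  # fetch only what is used
)
print(f"{len(gdf):,} Rhode Island buildings")
print(gdf[["building", "name", "height"]].head())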

8 · Load into QGIS

Open a file visually in QGIS (3.28 LTR or newer — needs GDAL ≥ 3.5 with the Parquet driver). In Data Source Manager → Vector, paste the URL prefixed with /vsicurl/ so GDAL streams it via HTTP range requests instead of downloading the whole file:

/vsicurl/https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet

Only the row groups your viewport overlaps get fetched. For catalog-wide queries ("every airport in North America"), use DuckDB — QGIS loads one file at a time.

Catalog-wide queries

The direct-access examples above all name a specific file. That's perfect when you know the region and theme — but analysts often don't. They want "every airport in North America", "all rail stations in Canada", or "which state has the most wind turbines?". Answering those needs a wildcard across files, and a wildcard needs a directory listing — something plain HTTPS can't do.

So we run a small read-only S3-compatible endpoint at s3.geomermaids.com, backed by the same bucket. It speaks just enough of the S3 API (ListObjectsV2 + ranged GetObject) for DuckDB's httpfs to expand globs and stream byte ranges. No credentials, no signing — anonymous reads only.

1 · Setup

Point DuckDB at the S3 endpoint. The keys stay empty so httpfs issues unsigned requests. Run this block once per session:

INSTALL httpfs; LOAD httpfs;
INSTALL spatial; LOAD spatial;

SET s3_endpoint='s3.geomermaids.com';
SET s3_url_style='path';
SET s3_use_ssl=true;
SET s3_access_key_id=''; SET s3_secret_access_key='';
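
With path-style addressing the bucket name becomes the first path segment after the endpoint. The Python sketch below shows that mapping, plus the kind of Range header a client sends to fetch part of an object. It is an illustration of the general S3 path-style convention under the settings above, not a description of what httpfs sends byte for byte; the helper names are hypothetical.

```python
from typing import Dict, Tuple
from urllib.parse import quote

ENDPOINT = "s3.geomermaids.com"


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    bucket, _, key = uri[len(scheme):].partition("/")
    if not bucket or not key:
        raise ValueError(f"URI needs both a bucket and a key: {uri!r}")
    return bucket, key


def path_style_url(uri: str) -> str:
    """Path-style HTTPS URL: https://<endpoint>/<bucket>/<key>."""
    bucket, key = split_s3_uri(uri)
    # Keep '/' and '=' literal so hive-style segments stay readable.
    return f"https://{ENDPOINT}/{bucket}/{quote(key, safe='/=')}"


def range_header(start: int, length: int) -> Dict[str, str]:
    """HTTP header asking for `length` bytes starting at offset `start`."""
    return {"Range": f"bytes={start}-{start + length - 1}"}


print(path_style_url("s3://parquetry/latest/country=US/state=US-NY/buildings.parquet"))
print(range_header(0, 1024))
```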

2 · Every aeroway feature in North America

One query, 98 files (one per admin region). Parquet stores row counts in metadata, so DuckDB answers this without reading a single data page:

SELECT count(*) AS features
FROM read_parquet('s3://parquetry/latest/country=*/state=*/aeroways.parquet');

3 · Top 10 states & provinces for full-scale airports

Wildcard the path, group by state_iso. The column is inside every file as a literal — no path-parsing needed:

SELECT state_iso, count(*) AS aerodromes
FROM read_parquet('s3://parquetry/latest/country=*/state=*/aeroways.parquet')
WHERE aeroway = 'aerodrome'
GROUP BY state_iso
ORDER BY aerodromes DESC
LIMIT 10;

4 · Partial wildcards — one country only

Pin any path segment to a literal and the glob prunes at list time. This one only fetches Canadian files:

SELECT state_iso, count(*) AS stations
FROM read_parquet('s3://parquetry/latest/country=CA/state=*/public_transport.parquet')
WHERE railway = 'station'
GROUP BY state_iso
ORDER BY stations DESC;
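
The effect of pinning a segment can be sketched in Python: the literal part of the pattern before the first wildcard is what a client can hand to the listing call as a prefix, and the rest is matched against the returned keys. How DuckDB's httpfs chooses its prefix internally is an assumption here; the sample keys are made up to follow the URL pattern.

```python
import re
from typing import List


def list_prefix(pattern: str) -> str:
    """Longest literal prefix of a glob pattern (text before the first wildcard)."""
    match = re.search(r"[*?\[]", pattern)
    return pattern if match is None else pattern[: match.start()]


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate '*' and '?' so that they never cross a '/' boundary."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[^/]*")
        elif ch == "?":
            parts.append("[^/]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def expand(pattern: str, keys: List[str]) -> List[str]:
    """Keys that start with the literal prefix and fully match the pattern."""
    prefix = list_prefix(pattern)
    regex = glob_to_regex(pattern)
    return [k for k in keys if k.startswith(prefix) and regex.fullmatch(k)]


# Hypothetical object keys following the documented layout.
keys = [
    "latest/country=CA/state=CA-ON/public_transport.parquet",
    "latest/country=CA/state=CA-QC/public_transport.parquet",
    "latest/country=CA/state=CA-ON/roads.parquet",
    "latest/country=US/state=US-NY/public_transport.parquet",
]
pattern = "latest/country=CA/state=*/public_transport.parquet"
print(list_prefix(pattern))
print(expand(pattern, keys))
```

With country=CA pinned, the prefix already excludes every US and Mexican key before any pattern matching happens.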

5 · Wind turbines across the continent

The power theme is small, so a continent-wide scan is cheap. Group by country to see where wind is concentrated:

SELECT country, count(*) AS turbines
FROM read_parquet('s3://parquetry/latest/country=*/state=*/power.parquet')
WHERE power = 'generator'
  AND tags['generator:source'] = 'wind'
GROUP BY country
ORDER BY turbines DESC;

Where this fits in the OSM data ecosystem

A healthy ecosystem of projects already publishes OSM-derived data in various forms — each with a slightly different focus. We stand on that work rather than replace it; our pitch is to fill a specific gap: daily, cloud-native, OSM-native GeoParquet 2.0 that you query straight off a URL.

Project	Format	Cadence	What it offers
Geofabrik	PBF, Shapefile, GeoPackage, GeoJSON	Daily	Our upstream. Pre-clipped regional OSM in every standard GIS format — the canonical place to get regional extracts. Every snapshot we publish starts from the regional PBF Geofabrik produces.
Layercake	GeoParquet + FlatGeobuf	Weekly	Thematic OSM extracts from OpenStreetMap US. Currently buildings, highways, and settlements, with a growing theme catalog.
Overture Maps	GeoParquet	Monthly	Global. Merges OSM with Meta, Microsoft, and TomTom data under a single unified schema — a one-stop integrated worldview across multiple sources.
Daylight Map	PBF + GeoJSON	Monthly	Meta's "cleaned" OSM distribution — OSM only, with quality-validation filters applied on top.
BigQuery public OSM	BigQuery tables	Weekly	Full OSM served as BigQuery tables for anyone already living in Google Cloud.
planet.osm.org	PBF / XML	Minute diffs	The raw source. Everything, no pre-processing — you bring the extraction and format conversion.
Geomermaids parquetry	GeoParquet 2.0	Daily	Raw OSM schema, country + admin-region partitioned, 16 themes, typed columns promoted from tags, directly queryable over HTTPS — built on Geofabrik's daily regional PBFs.

These are complementary, not substitutes. Geofabrik does the hard, essential work of maintaining and publishing clean regional OSM in every standard GIS format — PBF for native OSM tooling, Shapefile and GeoPackage for classic desktop GIS, GeoJSON for the web. We pick up where they leave off: turning the daily PBF into cloud-native GeoParquet 2.0 with a consistent schema, typed columns, and Hilbert-sorted row groups so you can run SQL over HTTPS without a download step. Layercake does something similar with a slightly different focus (themes across contributors); Overture goes further by merging non-OSM corporate data; Daylight applies validation cleanup. Pick whichever combination fits the job.

Custom regions, schemas, or SLA-backed hosting

The hosted snapshots here are an opinionated default: North America, 16 fixed themes, one size fits all. Get in touch at contact@geomermaids.com if you need any of the following:

Other regions (Europe, a specific country, a custom polygon)
Different themes or extra columns promoted from OSM tags
Per-customer cadence (hourly, live replication diffs)
SLA-backed freshness & availability guarantees
Multi-region S3 mirrors or cross-account delivery