geomermaids · GeoParquet
Free daily snapshots of OpenStreetMap data for North America, packaged so you can query them straight from a URL — no downloads, no account, no setup beyond DuckDB.
What is this?
This site publishes OpenStreetMap data in GeoParquet 2.0 — a modern, open, columnar file format — split by country, by state/province/region, and by theme (buildings, roads, water, etc.). Every file is small enough to use on a laptop, and the format is cloud-native: your query engine can read just the columns and rows it needs over HTTP, without downloading the whole file.
Coverage
Currently spans three countries — 98 admin regions × 16 themes = 1,568 files per daily snapshot, roughly 15 GB altogether:
- United States — 50 states, the District of Columbia, Puerto Rico, and US Virgin Islands (53 regions)
- Canada — 10 provinces and 3 territories (13 regions)
- Mexico — 31 states and Ciudad de México (32 regions)
New snapshots publish every 24 hours; each one is dated and immutable, so analysis you run today is reproducible next year against the same bytes. Europe, South America, and the rest of the world are on the roadmap. Need a specific region sooner? Get in touch at contact@geomermaids.com.
Themes
Each admin region is split into 16 thematic files, each with typed columns promoted from OSM tags:
| theme | geometry | typed columns (excerpt) |
|---|---|---|
| buildings | polygon | building, name, levels, height, addr_* |
| roads | linestring | highway, ref, oneway, surface, maxspeed, lanes |
| railways | point, linestring | railway, name, operator, gauge, electrified |
| waterways | linestring | waterway, name, width, intermittent, tunnel |
| water | polygon | water, natural, name, intermittent, salt |
| landuse | polygon | landuse, name, operator |
| natural_areas | polygon | natural, name, wetland |
| natural_features | point | natural, name, ele, prominence |
| places | point | place, name, population, admin_level, capital |
| boundaries | polygon | boundary, admin_level, name, iso3166_* |
| pois | point | amenity, shop, tourism, leisure, office, healthcare |
| amenities_polygons | polygon | amenity, shop, tourism, leisure, brand |
| power | point, linestring, polygon | power, name, voltage, frequency, operator |
| aeroways | point, linestring, polygon | aeroway, name, iata, icao, ref, surface |
| barriers | point, linestring | barrier, name, access, height, material |
| public_transport | point | public_transport, highway, railway, name, operator |
Every file also ships the full OSM tags as a MAP<VARCHAR, VARCHAR> — anything not promoted to a typed column is still there, just a tags['key'] lookup away.
Browse it like a directory
Open our File Explorer in a browser to walk the bucket as a tree — every snapshot, country, and state listed with sizes and modified times. Click any .parquet to download it, or hit the adjacent view link to preview it on a map. CLI tools (DuckDB httpfs, curl, rclone) get the raw bytes exactly as before; the HTML listing only renders for clients that send Accept: text/html.
Preview a file on a map
Don't want to write SQL just to see what a file contains? The map viewer opens any GeoParquet from the bucket directly in your browser — DuckDB-WASM streams just the row groups your viewport overlaps, deck.gl renders them on a basemap. Paste a URL or click view from the File Explorer; pan and zoom to load other areas. Useful for spot-checking shape, density, and admin-region coverage before committing to a download.
URL pattern
https://parquetry.geomermaids.com/<YYYY-MM-DD | latest>/country=<CC>/state=<ISO>/<theme>.parquet
A few concrete examples:
# Every building in New York State (latest snapshot)
https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet
# Ontario's entire road network
https://parquetry.geomermaids.com/latest/country=CA/state=CA-ON/roads.parquet
# Points of interest in Jalisco, Mexico
https://parquetry.geomermaids.com/latest/country=MX/state=MX-JAL/pois.parquet
# Pinned to a specific date — immutable, reproducible
https://parquetry.geomermaids.com/2026-04-19/country=US/state=US-CA/waterways.parquet
# Machine-readable index of every snapshot ever published
https://parquetry.geomermaids.com/snapshots.json
Dated snapshots are immutable — safe to pin in reproducible pipelines. The latest/ alias always resolves to the most recent.
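The pattern is regular enough to assemble in code. The following is a minimal Python sketch; the helper name build_url and its validation rules are illustrative assumptions, not part of any published client library.

```python
import re

BASE_URL = "https://parquetry.geomermaids.com"


def build_url(state: str, theme: str, snapshot: str = "latest") -> str:
    """Assemble a file URL from a region code, a theme, and a snapshot.

    Hypothetical helper: it only mirrors the documented URL pattern.
    """
    # The country code is the part of the region code before the dash (US-NY -> US).
    country, sep, _ = state.partition("-")
    if not sep or not country:
        raise ValueError(f"expected a region code like 'US-NY', got {state!r}")
    # A snapshot is either the 'latest' alias or a YYYY-MM-DD date.
    if snapshot != "latest" and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", snapshot):
        raise ValueError(f"expected 'latest' or YYYY-MM-DD, got {snapshot!r}")
    return f"{BASE_URL}/{snapshot}/country={country}/state={state}/{theme}.parquet"


if __name__ == "__main__":
    print(build_url("US-NY", "buildings"))
    print(build_url("US-CA", "waterways", snapshot="2026-04-19"))
```

Pinning the snapshot argument to a date gives a reproducible URL; leaving the default follows the latest/ alias.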
Retention
Snapshots age out on a tiered schedule so the archive stays bounded:
- Last 14 days — every daily snapshot is kept.
- 15 days to 12 months — only the first day of each month is kept.
- Older than 12 months — only December 31 of each year is kept, forever.
Yearly anchors survive the monthly-pruning window, so Dec 31 snapshots are always reachable. The current set is listed in snapshots.json. Need a specific date kept indefinitely? Get in touch.
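The schedule above can be sketched as a small predicate. This is an illustration under stated assumptions, not the pruning code the site runs: the function name is_retained is hypothetical, "12 months" is approximated as 365 days, and the 14-day boundary is treated as inclusive. For the authoritative list, read snapshots.json.

```python
from datetime import date

DAILY_WINDOW_DAYS = 14      # "last 14 days" (boundary assumed inclusive)
MONTHLY_WINDOW_DAYS = 365   # assumption: "12 months" approximated as 365 days


def is_retained(snapshot: date, today: date) -> bool:
    """Return True if a snapshot dated `snapshot` would still exist on `today`."""
    age = (today - snapshot).days
    if age < 0:
        raise ValueError("snapshot date is in the future")
    if age <= DAILY_WINDOW_DAYS:
        return True                      # every daily snapshot is kept
    if snapshot.month == 12 and snapshot.day == 31:
        return True                      # yearly anchors are kept forever
    if age <= MONTHLY_WINDOW_DAYS:
        return snapshot.day == 1         # monthly tier: first day of the month
    return False


if __name__ == "__main__":
    today = date(2026, 4, 19)
    for d in (date(2026, 4, 10), date(2026, 3, 15), date(2026, 3, 1), date(2024, 12, 31)):
        print(d.isoformat(), is_retained(d, today))
```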
Inside each file
Every parquet is validator-clean GeoParquet 2.0 — spatial-query-optimized, metadata-complete, bloom-filtered. Pop the hood on country=US/state=US-NY/buildings.parquet for example, and gpio's inspect / check commands report:
📄 country=US/state=US-NY/buildings.parquet 394 MB
Rows 4,469,880
Row groups 88 (avg 50,794 rows each, 4.5 MB compressed)
Compression ZSTD
Parquet type GEOMETRY (native v2.0 logical type, not WKB)
GeoParquet version 2.0.0
CRS OGC:CRS84 (WGS 84, lng/lat)
Geometry types MultiPolygon
Bbox [-79.76, 40.49, -71.86, 45.01]
Bloom filters 16 columns, 1.0 MB total
Spatial order Hilbert (consecutive/random ratio 0.00089 —
"strongly spatially clustered")
Spec validation 29 of 29 checks pass (gpio check, zero warnings)
Wait, what's a bloom filter? A tiny probabilistic index (a few KB per column, per row group) that answers one question very fast: "could this value be in this row group?" If the filter says no, the query engine skips the whole group without opening it. If it says maybe, the engine reads the group and checks for real. No false negatives — a match is never missed.
The win: a query like WHERE name = 'Central Park' across 4.5M New York buildings can touch as few as one row group out of 88 instead of the whole 394 MB file. The same applies to every typed column and to tag-key lookups.
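To make the "no" versus "maybe" behaviour concrete, here is a toy bloom filter in Python. It is a teaching sketch only: Parquet's real split-block bloom filters use a different layout and hash function, and the class, sizes, and sample names below are made up for illustration.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: a bit set plus a few hash positions per value."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value: str):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "maybe present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Three hypothetical row groups, each with its own filter on the name column.
ROW_GROUPS = [
    ["Empire State Building", "Chrysler Building"],
    ["Central Park", "Belvedere Castle"],
    ["Albany City Hall"],
]
FILTERS = []
for names in ROW_GROUPS:
    bf = BloomFilter()
    for name in names:
        bf.add(name)
    FILTERS.append(bf)


def groups_to_read(value: str) -> list:
    """Indices of row groups the engine must still open for an equality lookup."""
    return [i for i, bf in enumerate(FILTERS) if bf.might_contain(value)]


if __name__ == "__main__":
    print(groups_to_read("Central Park"))
```

The group that really holds the value is always in the result. Other groups appear only on a false positive, which costs an extra read but never a wrong answer.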
Every file in every theme in every region carries the same guarantees:
- Native Parquet GEOMETRY logical type, emitted directly by DuckDB's parquet writer (GEOPARQUET_VERSION 'V2'), with no WKB-in-binary + sidecar metadata roundtrip. Carries per-geometry row-group bbox statistics that DuckDB-class engines prune on without deserializing anything.
- Explicit bbox column (xmin/ymin/xmax/ymax). The native geo statistics above only help readers that understand the GEOMETRY type; the bbox column exposes standard Parquet float stats so any engine (Spark, Trino, polars, pyarrow) prunes row groups on WHERE bbox.xmin <= … AND bbox.xmax >= ….
- Hilbert-ordered rows. Bbox queries read a contiguous slice of row groups instead of the whole file. Typical spatial query touches under 5% of file bytes.
- ZSTD compression on every column, including geometry.
- 50,000-row row groups. Small enough for aggressive spatial pruning, large enough to keep metadata overhead negligible.
- Bloom filters on every promoted column plus country, state_iso, osm_type, and the tags map (see above). Point lookups skip whole row groups before any I/O.
- Hive-partitioned by country= and state=. Query engines prune entire countries or regions before reading a single parquet.
- Dated, immutable snapshots plus a latest/ alias. Pin a date for reproducibility; use latest/ when you just want today's data.
- 29 of 29 gpio check validations pass on every file: zero warnings, zero cosmetic metadata drift.
- Open-source pipeline. Built with osmium-tool + DuckDB. Full source at github.com/gsueur/osm-geoparquet (MIT), so every file here is reproducible from source.
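Why Hilbert ordering keeps nearby features in nearby row groups is easier to see in code. The sketch below uses the textbook grid-to-curve conversion; it is not taken from the pipeline, whose Hilbert encoding may differ in resolution and orientation. The function names are illustrative, and the bounds reuse the bbox reported for the New York buildings file above.

```python
def hilbert_index(x: int, y: int, order: int) -> int:
    """Distance along a Hilbert curve for cell (x, y) on a 2^order x 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def lonlat_to_cell(lon: float, lat: float, bounds, order: int):
    """Quantize a lon/lat into a grid cell inside bounds = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = bounds
    n = 1 << order
    x = min(n - 1, int((lon - xmin) / (xmax - xmin) * n))
    y = min(n - 1, int((lat - ymin) / (ymax - ymin) * n))
    return x, y


if __name__ == "__main__":
    ny_bounds = (-79.76, 40.49, -71.86, 45.01)
    points = {
        "Manhattan": (-73.98, 40.75),
        "Brooklyn": (-73.95, 40.65),
        "Buffalo": (-78.88, 42.89),
    }
    # Sorting by this key is the essence of Hilbert ordering: rows that are
    # close on the map tend to end up close together in the file.
    for name, (lon, lat) in points.items():
        cell = lonlat_to_cell(lon, lat, ny_bounds, order=8)
        print(name, cell, hilbert_index(*cell, order=8))
```

Consecutive curve positions are always neighbouring cells, which is why a bbox query tends to map to a short run of row groups rather than scattered ones.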
Try it now
New to DuckDB? Our cookbooks offer a gentle introduction to the engine and state-of-the-art queries for GeoParquet 2.0.
There are two ways to query the data:
- Direct access — you know the region and theme, you want that file. Plain HTTPS, zero setup.
- Catalog-wide queries — you want "every airport in North America" or "all rail stations in Canada". S3-style wildcards across the whole catalog.
Direct access
One region, one theme, one URL. No credentials, no session state — DuckDB streams the bytes over HTTPS and pulls only the row groups your filter touches.
1 · Setup
Install the extensions once (re-running INSTALL is harmless) and load them in every session. The url() macro is optional: it is a thin wrapper that keeps the later snippets readable by not repeating the full URL. Plain string URLs work equally well, as the last statement shows:
INSTALL httpfs; LOAD httpfs;
INSTALL spatial; LOAD spatial;
-- Our coordinates are (longitude, latitude). Tell the spatial extension so
-- ST_Distance_Sphere & friends read the axes in that order (example 5).
SET geometry_always_xy = true;
CREATE OR REPLACE MACRO url(state, theme) AS
'https://parquetry.geomermaids.com/latest/country=' ||
split_part(state, '-', 1) || '/state=' || state || '/' || theme || '.parquet';
-- Without the macro, just pass the full URL directly:
SELECT COUNT(*) FROM read_parquet(
'https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet'
);
2 · How many buildings in a bounding box?
A rectangle around Midtown Manhattan. See the spatial pushdown notes below for when DuckDB prunes row groups automatically and when the predicate still forces a full scan — the answer currently depends on the file's geometry type and the function you use.
SELECT COUNT(*) AS buildings
FROM read_parquet(url('US-NY', 'buildings'))
WHERE bbox.xmin <= -73.98 AND bbox.xmax >= -73.99 -- prunes row groups (standard float stats)
AND bbox.ymin <= 40.76 AND bbox.ymax >= 40.75
AND st_intersects_extent(geometry, ST_MakeEnvelope(-73.99, 40.75, -73.98, 40.76));
The bbox.* clause is what makes this fast: it filters the explicit bbox covering column, whose standard Parquet float statistics let any engine skip row groups before reading them. st_intersects_extent then does the exact bbox-overlap test. Each bbox is a conservative bound (rounded outward to float32), so the bbox.* clause alone returns a slight superset — drop the st_intersects_extent line on engines without DuckDB's spatial functions (Spark, Trino, polars) and you still get correct pruning.
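The four comparisons are the standard rectangle-overlap test, and the same inequalities explain when a whole row group can be skipped from its column statistics alone. A minimal Python sketch of that reasoning follows; the class and function names are illustrative, and real engines evaluate this inside their Parquet readers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def overlaps(feature: Box, query: Box) -> bool:
    """Row-level test: the same four comparisons as the SQL WHERE clause."""
    return (feature.xmin <= query.xmax and feature.xmax >= query.xmin
            and feature.ymin <= query.ymax and feature.ymax >= query.ymin)


@dataclass(frozen=True)
class RowGroupStats:
    """The min/max column statistics a reader needs for this predicate."""
    min_xmin: float  # smallest bbox.xmin in the row group
    max_xmax: float  # largest bbox.xmax in the row group
    min_ymin: float
    max_ymax: float


def can_skip(stats: RowGroupStats, query: Box) -> bool:
    """True if no row in the group can possibly satisfy the predicate."""
    return (stats.min_xmin > query.xmax or stats.max_xmax < query.xmin
            or stats.min_ymin > query.ymax or stats.max_ymax < query.ymin)


if __name__ == "__main__":
    midtown = Box(-73.99, 40.75, -73.98, 40.76)
    # Hypothetical row group covering part of western New York.
    western_ny = RowGroupStats(min_xmin=-79.0, max_xmax=-78.5,
                               min_ymin=42.7, max_ymax=43.1)
    print(can_skip(western_ny, midtown))
```

Because the file is Hilbert-ordered, most row groups cover a compact area, so can_skip is true for the large majority of them on a small query box.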
3 · Tallest residential towers in a neighborhood
Pulls four columns (not the whole 20-column file) and uses a spatial filter at the same time:
SELECT name, building, levels, height
FROM read_parquet(url('US-NY', 'buildings'))
WHERE building = 'residential'
AND levels > 30
AND bbox.xmin <= -73.95 AND bbox.xmax >= -73.99 -- prune row groups to the neighborhood
AND bbox.ymin <= 40.79 AND bbox.ymax >= 40.76
AND st_intersects_extent(geometry, ST_MakeEnvelope(-73.99, 40.76, -73.95, 40.79))
ORDER BY levels DESC
LIMIT 10;
4 · Match restaurants to the buildings they sit inside
A spatial join between two theme files, scoped to downtown Boston with a bbox filter on each side so DuckDB reads only the relevant row groups from both files before the ST_Contains join:
SELECT b.name AS building, p.name AS restaurant, p.cuisine
FROM read_parquet(url('US-MA', 'buildings')) b
JOIN read_parquet(url('US-MA', 'pois')) p
ON ST_Contains(b.geometry, p.geometry)
WHERE p.amenity = 'restaurant'
AND b.name IS NOT NULL
AND b.bbox.xmin <= -71.05 AND b.bbox.xmax >= -71.07 -- downtown Boston, both files
AND b.bbox.ymin <= 42.36 AND b.bbox.ymax >= 42.34
AND p.bbox.xmin <= -71.05 AND p.bbox.xmax >= -71.07
AND p.bbox.ymin <= 42.36 AND p.bbox.ymax >= 42.34
LIMIT 10;
5 · Nearest features to a point
Amenities within 500 m of the Empire State Building (the point -73.9857, 40.7484), sorted by distance. Two predicates do the work: a coarse degree-based bbox prefilter and an exact great-circle cutoff in metres:
WITH origin AS (SELECT ST_Point(-73.9857, 40.7484) AS pt)
SELECT p.amenity, p.name,
ST_Distance_Sphere(p.geometry, o.pt) AS dist_m
FROM read_parquet(url('US-NY', 'pois')) p, origin o
WHERE p.bbox.xmin <= -73.9757 AND p.bbox.xmax >= -73.9957 -- ~1 km bbox prefilter (prunes row groups)
AND p.bbox.ymin <= 40.7584 AND p.bbox.ymax >= 40.7384
AND ST_Distance_Sphere(p.geometry, o.pt) < 500 -- exact cutoff, metres
ORDER BY dist_m
LIMIT 15;
Why both? The bbox.* prefilter compares plain float columns whose Parquet statistics let DuckDB skip row groups that don't overlap the search box before reading any data — and it works on any engine, no spatial extension needed. ST_Distance_Sphere returns the exact great-circle distance in metres but isn't a bbox predicate; used alone it forces a full scan. Pairing them gives you readable metre-based output and the cloud-native pruning that makes querying over HTTPS viable. The bbox prefilter is deliberately loose (~1 km box > 500 m radius) so no in-range point is dropped; the sphere check tightens it to the exact 500 m.
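The relationship between a radius in metres and a prefilter box in degrees can be worked out directly. The Python sketch below is an illustration, not the engine's implementation: the Earth radius constant and the function names are assumptions, and the box formula is only meant for small radii away from the poles and the antimeridian.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # assumed mean Earth radius; the engine's constant may differ


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in metres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def prefilter_box(lon: float, lat: float, radius_m: float):
    """Smallest lon/lat box (xmin, ymin, xmax, ymax) that contains the search circle."""
    delta = radius_m / EARTH_RADIUS_M                      # angular radius in radians
    dlat = math.degrees(delta)
    # A degree of longitude shrinks with latitude, so the box is wider in degrees.
    dlon = math.degrees(math.asin(math.sin(delta) / math.cos(math.radians(lat))))
    return (lon - dlon, lat - dlat, lon + dlon, lat + dlat)


if __name__ == "__main__":
    lon, lat = -73.9857, 40.7484
    print(prefilter_box(lon, lat, 500))
    print(round(haversine_m(lon, lat, lon, lat + 0.004)), "m")
```

The tight box for 500 m is roughly ±0.0045° in latitude and ±0.0059° in longitude at this latitude, so the ±0.01° box used in the query leaves a comfortable margin.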
6 · Compare across states
Pass a list of URLs, group by state:
SELECT state_iso, COUNT(*) AS pois
FROM read_parquet([
url('US-NY', 'pois'),
url('US-MA', 'pois'),
url('US-CT', 'pois'),
url('US-RI', 'pois')
])
GROUP BY state_iso
ORDER BY pois DESC;
7 · Same data from Python
If DuckDB isn't your thing, GeoPandas works just as well:
import geopandas as gpd
gdf = gpd.read_parquet(
"https://parquetry.geomermaids.com/latest/country=US/state=US-RI/buildings.parquet"
)
print(f"{len(gdf):,} Rhode Island buildings")
print(gdf[["building", "name", "height"]].head())
8 · Load into QGIS
Open a file visually in QGIS (3.28 LTR or newer — needs GDAL ≥ 3.5 with the Parquet driver). In Data Source Manager → Vector, paste the URL prefixed with /vsicurl/ so GDAL streams it via HTTP range requests instead of downloading the whole file:
/vsicurl/https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet
Only the row groups your viewport overlaps get fetched. For catalog-wide queries ("every airport in North America"), use DuckDB — QGIS loads one file at a time.
Catalog-wide queries
The direct-access examples above all name a specific file. That's perfect when you know the region and theme — but analysts often don't. They want "every airport in North America", "all rail stations in Canada", or "which state has the most wind turbines?". Answering those needs a wildcard across files, and a wildcard needs a directory listing — something plain HTTPS can't do.
So we run a small read-only S3-compatible endpoint at s3.geomermaids.com, backed by the same bucket. It speaks just enough of the S3 API (ListObjectsV2 + ranged GetObject) for DuckDB's httpfs to expand globs and stream byte ranges. No credentials, no signing — anonymous reads only.
1 · Setup
Point DuckDB at the S3 endpoint. The keys stay empty so httpfs issues unsigned requests. Run this block once per session:
INSTALL httpfs; LOAD httpfs;
INSTALL spatial; LOAD spatial;
SET s3_endpoint='s3.geomermaids.com';
SET s3_url_style='path';
SET s3_use_ssl=true;
SET s3_access_key_id=''; SET s3_secret_access_key='';
2 · Every aeroway feature in North America
One query, ~100 files. Parquet stores row counts in metadata, so DuckDB answers this without reading a single data page:
SELECT count(*) AS features
FROM read_parquet('s3://parquetry/latest/country=*/state=*/aeroways.parquet');
3 · Top 10 states & provinces for full-scale airports
Wildcard the path, group by state_iso. The column is inside every file as a literal — no path-parsing needed:
SELECT state_iso, count(*) AS aerodromes
FROM read_parquet('s3://parquetry/latest/country=*/state=*/aeroways.parquet')
WHERE aeroway = 'aerodrome'
GROUP BY state_iso
ORDER BY aerodromes DESC
LIMIT 10;
4 · Partial wildcards — one country only
Pin any path segment to a literal and the glob prunes at list time. This one only fetches Canadian files:
SELECT state_iso, count(*) AS stations
FROM read_parquet('s3://parquetry/latest/country=CA/state=*/public_transport.parquet')
WHERE railway = 'station'
GROUP BY state_iso
ORDER BY stations DESC;
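The effect of pinning a path segment can be previewed offline. The Python sketch below builds the glob strings used in these examples and filters a made-up list of object keys; the helper names are hypothetical, and fnmatch only approximates DuckDB's globbing because its * also matches slashes.

```python
from fnmatch import fnmatchcase

BUCKET = "parquetry"  # bucket name as it appears in the s3:// URLs above


def catalog_glob(theme: str, country: str = "*", state: str = "*",
                 snapshot: str = "latest") -> str:
    """Build an s3:// glob; leave a segment as '*' to wildcard it."""
    return f"s3://{BUCKET}/{snapshot}/country={country}/state={state}/{theme}.parquet"


def matching_keys(pattern: str, keys: list) -> list:
    """Return the object keys a glob would select (approximation via fnmatch)."""
    key_pattern = pattern[len(f"s3://{BUCKET}/"):]
    return [k for k in keys if fnmatchcase(k, key_pattern)]


if __name__ == "__main__":
    # A tiny, made-up listing standing in for the real bucket contents.
    listing = [
        "latest/country=CA/state=CA-ON/public_transport.parquet",
        "latest/country=CA/state=CA-QC/public_transport.parquet",
        "latest/country=US/state=US-NY/public_transport.parquet",
        "latest/country=CA/state=CA-ON/roads.parquet",
    ]
    pattern = catalog_glob("public_transport", country="CA")
    print(pattern)
    print(matching_keys(pattern, listing))
```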
5 · Wind turbines across the continent
The power theme is small, so a continent-wide scan is cheap. Group by country to see where wind is concentrated:
SELECT country, count(*) AS turbines
FROM read_parquet('s3://parquetry/latest/country=*/state=*/power.parquet')
WHERE power = 'generator'
AND tags['generator:source'] = 'wind'
GROUP BY country
ORDER BY turbines DESC;
Where this fits in the OSM data ecosystem
A healthy ecosystem of projects already publishes OSM-derived data in various forms — each with a slightly different focus. We stand on that work rather than replace it; our pitch is to fill a specific gap: daily, cloud-native, OSM-native GeoParquet 2.0 that you query straight off a URL.
| Project | Format | Cadence | What it offers |
|---|---|---|---|
| Geofabrik | PBF, Shapefile, GeoPackage, GeoJSON | Daily | Our upstream. Pre-clipped regional OSM in every standard GIS format — the canonical place to get regional extracts. Every snapshot we publish starts from the regional PBF Geofabrik produces. |
| Layercake | GeoParquet + FlatGeobuf | Weekly | Thematic OSM extracts from OpenStreetMap US. Currently buildings, highways, and settlements, with a growing theme catalog. |
| Overture Maps | GeoParquet | Monthly | Global. Merges OSM with Meta, Microsoft, and TomTom data under a single unified schema — a one-stop integrated worldview across multiple sources. |
| Daylight Map | PBF + GeoJSON | Monthly | Meta's "cleaned" OSM distribution — OSM only, with quality-validation filters applied on top. |
| BigQuery public OSM | BigQuery tables | Weekly | Full OSM served as BigQuery tables for anyone already living in Google Cloud. |
| planet.osm.org | PBF / XML | Minute diffs | The raw source. Everything, no pre-processing — you bring the extraction and format conversion. |
| Geomermaids parquetry | GeoParquet 2.0 | Daily | Raw OSM schema, country + admin-region partitioned, 16 themes, typed columns promoted from tags, directly queryable over HTTPS — built on Geofabrik's daily regional PBFs. |
These are complementary, not substitutes. Geofabrik does the hard, essential work of maintaining and publishing clean regional OSM in every standard GIS format — PBF for native OSM tooling, Shapefile and GeoPackage for classic desktop GIS, GeoJSON for the web. We pick up where they leave off: turning the daily PBF into cloud-native GeoParquet 2.0 with a consistent schema, typed columns, and Hilbert-sorted row groups so you can run SQL over HTTPS without a download step. Layercake does something similar with a slightly different focus (themes across contributors); Overture goes further by merging non-OSM corporate data; Daylight applies validation cleanup. Pick whichever combination fits the job.
Custom regions, schemas, or SLA-backed hosting
The hosted snapshots here are an opinionated default: North America, 16 fixed themes, one size fits all. For:
- Other regions (Europe, a specific country, a custom polygon)
- Different themes or extra columns promoted from OSM tags
- Per-customer cadence (hourly, live replication diffs)
- SLA-backed freshness & availability guarantees
- Multi-region S3 mirrors or cross-account delivery
Get in touch: contact@geomermaids.com