5 minute read

Today I am working on using satellite imagery to determine the level of tree canopy coverage in four major east coast downtowns. In particular, I am comparing Atlanta, Charlotte, Washington DC, and Philadelphia. This sits at the intersection of two of my favorite domains; urban planning and data science!

Methods

I chose a superpixel segmentation approach paired with a Gradient Boosting Classifier because pixel-level classifications often suffer from extreme “salt-and-pepper” noise in high-resolution satellite imagery. By grouping contextually similar pixels into cohesive geographic segments, the model evaluates structural shape and localized color variances simultaneously. This allows me to label a handful of segmente per city, build a model, and estimate tree coverage on the remaining segments.

Setup

First, let’s begin with imports, script parameters, and helper functions:

import random

import numpy as np
import polars as pl
import plotnine as p9
import contextily as cx
import matplotlib.pyplot as plt

from sklearn.model_selection import RandomizedSearchCV
from skimage.segmentation import slic, mark_boundaries
from sklearn.ensemble import GradientBoostingClassifier

ZOOM = 16
N_SEGMENTS = 5_000
COMPACTNESS = 0.001
LABELS_PER_CITY = 50
GREEN_THRESHOLD = 0.25
TREE_SAMPLE = 15_000

random.seed(42)
np.random.seed(42)

CITY_HALLS_MERCATOR = {
    "Atlanta": [-9394119, 3993616, -9391119, 3996616],
    "Charlotte": [-9000835, 4195865, -8997835, 4198865],
    "Washington DC": [-8575940, 4704635, -8572940, 4707635],
    "Philadelphia": [-8368832, 4857540, -8365832, 4860540]
}


def show_segment_boundaries(city, path, images, segments):
    boundaries = mark_boundaries(
        images[city],
        segments[city],
        color=(1, 0, 0, 1),
        outline_color=(1, 0, 0, 1),
        mode="thick"
    )
    
    plt.imshow(boundaries)
    plt.axis("off")
    plt.savefig(path)
    plt.close()


def show_segment(city, images, segments, idx):
    segment = segments[city]
    image = images[city]

    coords = np.argwhere(segment == idx)
    mask = image * (segment == idx)[..., np.newaxis]
    
    y_min, x_min = coords.min(axis=0)
    y_max, x_max = coords.max(axis=0)
    
    plt.imshow(mask[y_min:y_max+1, x_min:x_max+1])
    plt.axis("off")
    plt.show()
    plt.close()


def label_trees():
    while True:
        response = input("Is this tree coverage (y/n)? ")
        if response in ["y", "n"]:
            return 1.0 if response == "y" else 0.0
        print("Invalid input. Please enter 'y' or 'n'.")


def get_features(city, images, segments, idx):
    segment = segments[city]
    image = images[city]
    pixels = image[segment == idx]

    return {
        "city": city,
        "idx": idx,
        "mean_r": np.mean(pixels[:, 0]),
        "mean_g": np.mean(pixels[:, 1]),
        "mean_b": np.mean(pixels[:, 2]),
        "std_r": np.std(pixels[:, 0]),
        "std_g": np.std(pixels[:, 1]),
        "std_b": np.std(pixels[:, 2]),
        "segment_size": len(pixels)
    }


def rgb_to_hex(color):
    return f"#{color[0]:02x}{color[1]:02x}{color[2]:02x}"

Satellite Imagery

Next, we can collect satellite images using contextily imported as cx. This utility pulls Esri images using Mercator coordinates at a specific zoom level. The Mercator coordinates I am using are the approximate locations of each city hall for the four cities in my study.

provider = cx.providers.Esri.WorldImagery

images = {
    city: cx.bounds2img(*coords, zoom=ZOOM, source=provider)[0]
    for city, coords in CITY_HALLS_MERCATOR.items()
}

Segmentation

Now we begin the process of segmentation. We will be using slic() from skimage. SLIC (simple linear image clustering) is a common technique to subdivide images into superpixels. Once segmented, we will hand label a few dozen of segments as trees or not trees and then build a model to predict the remainder of the thousands of segments. In the image below, you can see the bounds of the Atlanta segments.

segments = {
    city: slic(image, n_segments=N_SEGMENTS, compactness=COMPACTNESS)
    for city, image in images.items()
}

show_segment_boundaries("Atlanta", "slic-bounds-atlanta.png", images, segments)

Labelled Data

Next, we set up a pipeline to hand-label a number of segments from each city. In particular, we are labeling 50 images from each city for a total of 200 labeled images. Then we will predict whether the remaining thousands of segments are trees.

While this dataset is fundamentally a small supervised subset paired with a larger unlabeled pool, treating it with a standard classifier over manual annotations operates on a semi-supervised philosophy. This strategy yields far better localized results than a completely unsupervised clustering method.

labels = []
    
for city in CITY_HALLS_MERCATOR.keys():
    unique_segments = np.unique(segments[city])
    sampled_segments = random.sample(list(unique_segments), k=LABELS_PER_CITY) 
    
    for i, idx in enumerate(sampled_segments):
        print(f"{city} ({i + 1}/{LABELS_PER_CITY})")
        show_segment(city, images, segments, idx)
        labels.append({"city": city, "idx": idx, "label": label_trees()})

Here is an example segment from Atlanta:

Model Estimation

Once labelled, we build a dataset of the predicted remaining segments to estimate tree coverage. We are using a mixture of polars and plotnine for modern plotting and dataframe operations.

features = [
    get_features(city, images, segments, idx)
    for city, idxs in segments.items()
    for idx in np.unique(idxs)
]

model_data = (
    pl.DataFrame(features)
    .join(
        other=pl.DataFrame(labels),
        on=["city", "idx"],
        how="left"
    )
)

feature_cols = [
    "mean_r",
    "mean_g",
    "mean_b",
    "std_r",
    "std_g",
    "std_b",
    "segment_size"
]

train = model_data.drop_nulls("label")
design_matrix = model_data.select(feature_cols)

grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5, 10, 15]
}

search = (
    RandomizedSearchCV(
        estimator=GradientBoostingClassifier(),
        param_distributions=grid,
        random_state=42,
        n_iter=3,
        cv=3
    )
    .fit(
        train.select(feature_cols).to_numpy(), 
        train.get_column("label").to_numpy()
    )
)

model = search.best_estimator_

predictions = (
    model_data
    .with_columns(
        trees = model.predict_proba(design_matrix)[:, 1]
    )
    .select("city", "idx", "trees")
)

segment_data = []

for city, segment in segments.items():
    for row_id, contents in enumerate(segment):
        for col_id, idx in enumerate(contents):
            color = images[city][row_id, col_id]
            segment_data.append({
                "city": city,
                "row_id": row_id,
                "col_id": col_id,
                "idx": idx,
                "color": rgb_to_hex(color)
            })
    
plotting_data = (
    pl.DataFrame(segment_data)
    .join(
        other=predictions,
        on=["city", "idx"],
        how="left"
    )
    .with_columns(
        (0 - pl.col("row_id")).alias("row_id")
    )
)

print(
    plotting_data
    .group_by("city")
    .agg(
        pl.col("trees").mean()
    )
)

Results

And here is the final plot! You can see the plotnine code below the image shown here. Using a simple percent of pixels predicted to be trees we estimate the following canopy levels for each downtown core:

City Percent Tree Canopy
Atlanta 19%
Charlotte 12%
Philadelphia 10%
Washington DC 16%

plot = (
    p9.ggplot(
        data=(
            plotting_data
            .filter(pl.col("trees") > GREEN_THRESHOLD)
            .sample(n=TREE_SAMPLE, seed=42)
        ),
        mapping=p9.aes(x="col_id", y="row_id")
    ) +
    p9.geom_point(
        data=(
            plotting_data
            .with_row_index("row")
            .filter(
                (pl.col("row") % 30) == 0
            )
        ),
        mapping=p9.aes(color="color"),
        alpha=0.25,
        size=1e-4
    ) +
    p9.geom_point(
        color="green",
        size=0.5
    ) +
    p9.facet_wrap("city") +
    p9.theme_minimal() +
    p9.coord_fixed() +
    p9.scale_color_identity() +
    p9.theme(
        axis_text=p9.element_blank(),
        panel_grid=p9.element_blank(),
        axis_title=p9.element_blank(),
        text=p9.element_text(size=12, face="bold")
    )
)

plot.save(
    filename="canopy-plot.png", 
    width=8, 
    height=8, 
    dpi=300
)

Comparative Context

Our downtown canopy estimates (10% for Philadelphia to 19% for Atlanta) align well with external urban studies (Maco, n.d.), which typically place dense commercial cores between 8% and 15% coverage. However, these numbers are vastly lower than city-wide municipal assessments. For example, Atlanta boasts a 46% overall canopy rate, highlighting the severe green infrastructure deficit when isolating impervious downtown centers.

In the future we could also work to distinguish contiguous tree canopy from single-family home (SFH) neighborhoods is a major hurdle. These zones often feature lawns, ornamental shrubs, and individual trees that trick models into overestimating canopy density without providing the actual ecological benefits of a dense, unified forest structure.

References

Maco, S. E. (n.d.). Assessing canopy cover over streets and sidewalks in street tree populations. Journal of Arboriculture, 28(6), 270–276.

Updated: