1  Geographic data in Python

1.1 Introduction

This chapter introduces key Python packages and data structures for working with the two major types of spatial data, namely:

  • shapely and geopandas—for working with vector layers
  • rasterio—for working with rasters

As we will see in the example code presented later in this chapter, shapely and geopandas are related:

  • shapely is a “low-level” package for working with individual vector geometry objects
  • geopandas is a “high-level” package for working with geometry columns (GeoSeries objects), which internally contain shapely geometries, and vector layers (GeoDataFrame objects)

The geopandas ecosystem provides a comprehensive approach for working with vector layers in Python, with many packages building on it. This is not the case for raster data, however: there are several partially overlapping packages for working with raster data, each with its own advantages and disadvantages. In this book we focus on the most prominent one:

  • rasterio—a spatial-oriented package, focused on “simple” raster formats (such as GeoTIFF), representing a raster using a combination of a numpy array, and a metadata object (dict) specifying the spatial referencing of the array

Another raster-related package worth mentioning is xarray. It is a general-purpose package for working with labeled arrays, thus advantageous for processing “complex” raster format (such as NetCDF), representing a raster using its own native classes, namely xarray.Dataset and xarray.DataArray.

This chapter will briefly explain the fundamental geographic data models: vector and raster. Before demonstrating their implementation in Python, we will introduce the theory behind each data model and the disciplines in which they predominate.

The vector data model represents the world using points, lines, and polygons. These have discrete, well-defined borders, meaning that vector datasets usually have a high level of precision (but not necessarily accuracy). The raster data model divides the surface up into cells of constant size. Raster datasets are the basis of background images used in web-mapping and have been a vital source of geographic data since the origins of aerial photography and satellite-based remote sensing devices. Rasters aggregate spatially specific features to a given resolution, meaning that they are consistent over space and scalable (many worldwide raster datasets are available).

Which to use? The answer likely depends on your domain of application, and the datasets you have access to:

  • Vector datasets and methods dominate the social sciences because human settlements and and processes (e.g. transport infrastructure) tend to have discrete borders
  • Raster datasets and methods dominate many environmental sciences because of the reliance on remote sensing data

There is much overlap in some fields and raster and vector datasets can be used together: ecologists and demographers, for example, commonly use both vector and raster data. Furthermore, it is possible to convert between the two forms Whether your work involves more use of vector or raster datasets, it is worth understanding the underlying data models before using them, as discussed in subsequent chapters. This book focusses on approaches that build on geopandas and rasterio packages to work with vector data and raster datasets, respectively.

1.2 Vector data

The geographic vector data model is based on points located within a coordinate reference system (CRS). Points can represent self-standing features (e.g., the location of a bus stop), or they can be linked together to form more complex geometries such as lines and polygons. Most point geometries contain only two dimensions (3-dimensional CRSs contain an additional \(z\) value, typically representing height above sea level).

In this system, London, for example, can be represented by the coordinates (-0.1,51.5). This means that its location is -0.1 degrees east and 51.5 degrees north of the origin. The origin, in this case, is at 0 degrees longitude (the Prime Meridian) and 0 degree latitude (the Equator) in a geographic (‘lon/lat’) CRS. The same point could also be approximated in a projected CRS with ‘Easting/Northing’ values of (530000, 180000) in the British National Grid, meaning that London is located 530 \(km\) East and 180 \(km\) North of the origin of the CRS. The location of National Grid’s origin, in the sea beyond South West Peninsular, ensures that most locations in the UK have positive Easting and Northing values.

geopandas provides classes for geographic vector data and a consistent command-line interface for reproducible geographic data analysis in Python. geopandas provides an interface to three mature libraries for geocomputation which, in combination, represent a strong foundation on which many geographic applications (including QGIS and R’s spatial ecosystem) builds:

  • GDAL, for reading, writing, and manipulating a wide range of geographic data formats, covered in Chapter 8
  • PROJ, a powerful library for coordinate system transformations, which underlies the content covered in Chapter 7
  • GEOS, a planar geometry engine for operations such as calculating buffers and centroids on data with a projected CRS, covered in Chapter 5

Tight integration with these geographic libraries makes reproducible geocomputation possible: an advantage of using a higher level language such as Python to access these libraries is that you do not need to know the intricacies of the low level components, enabling focus on the methods rather than the implementation. This section introduces geopandas classes in preparation for subsequent chapters (Chapters 5 and 8 cover the GEOS and GDAL interface, respectively).

1.2.1 Vector data classes

The main classes for working with geographic vector data in Python are hierarchical, meaning the highest level ‘vector layer’ class is composed of simpler ‘geometry column’ and individual ‘geometry’ components. This section introduces them in order, starting with the highest level class. For many applications, the high level vector layer class, which are essentially a data frame with geometry columns, are all that’s needed. However, it’s important to understand the structure of vector geographic objects and their component pieces for more advanced applications. The three main vector geographic data classes in Python are:

  • GeoDataFrame, a class representing vector layers, with a geometry column (class GeoSeries) as one of the columns
  • GeoSeries, a class that is used to represent the geometry column in GeoDataFrame objects
  • shapely geometry objects which represent individual geometries, such as a point or a polygon

The first two classes (GeoDataFrame and GeoSeries) are defined in geopandas. The third class is defined in the shapely package, which deals with individual geometries, and is a main dependency of the geopandas package.

1.2.2 Vector layers

The most commonly used geographic vector data structure is the vector layer. There are several approaches for working with vector layers in Python, ranging from low-level packages (e.g., osgeo, fiona) to the relatively high-level geopandas package that is the focus of this section. Before writing and running code for creating and working with geographic vector objects, we therefore import geopandas (by convention as gpd for more concise code) and shapely:

import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
import shapely

We also limit the maximum number of printed rows to four, to save space, using the 'display.max_rows' option of pandas:

pd.set_option('display.max_rows', 4)

Projects often start by importing an existing vector layer saved as an ESRI Shapefile (.shp), a GeoPackage (.gpkg) file, or other geographic file format. The function read_file() in the following line of code imports a GeoPackage file named world.gpkg located in the data directory of Python’s working directory into a GeoDataFrame named gdf:

gdf = gpd.read_file('data/world.gpkg')

As result is an object of type (class) GeoDataFrame with 177 rows (features) and 11 columns, as shown in the output of the following code:

type(gdf)
gdf.shape
(177, 11)

The GeoDataFrame class is an extension of the DataFrame class from the popular pandas package. This means we can treat a vector layer as a table, and process it using the ordinary, i.e., non-spatial, established function methods. For example, standard data frame subsetting methods can be used. The code below creates a subset of the gdf dataset containing only the country name and the geometry:

gdf = gdf[['name_long', 'geometry']]
gdf
name_long geometry
0 Fiji MULTIPOLYGON (((-180.00000 -16.55522, -179.917...
1 Tanzania MULTIPOLYGON (((33.90371 -0.95000, 31.86617 -1...
... ... ...
175 Trinidad and Tobago MULTIPOLYGON (((-61.68000 10.76000, -61.66000 ...
176 South Sudan MULTIPOLYGON (((30.83385 3.50917, 31.24556 3.7...

177 rows × 2 columns

The following expression creates a subset based on a condition, such as equality of the value in the 'name_long' column to the string 'Egypt':

gdf[gdf['name_long'] == 'Egypt']
name_long geometry
163 Egypt MULTIPOLYGON (((36.86623 22.00000, 36.69069 22...

Finally, to get a sense of the spatial component of the vector layer, it can be plotted using the .plot method (Figure 1.1):

gdf.plot();

Figure 1.1: Basic plot of a GeoDataFrame

or using .explore to get an interactive plot (Figure 1.2):

gdf.explore()
Make this Notebook Trusted to load map: File -> Trust Notebook
Figure 1.2: Basic interactive map with .explore

And consequently, a subset can be plotted using:

gdf[gdf['name_long'] == 'Egypt'].explore()
Make this Notebook Trusted to load map: File -> Trust Notebook
Figure 1.3: Interactive map of a GeoDataFrame subset

1.2.3 Geometry columns

A vital column in a GeoDataFrame is the geometry column, of class GeoSeries. The geometry column contains the geometric part of the vector layer. The geometry column can be accessed by name, which typically (e.g., when reading from a file) is 'geometry', as in gdf['geometry']. However, the recommendation is to use the fixed .geometry property, which refers to the geometry column regardless whether its name is 'geometry' or any other name. In the case of the gdf object, the geometry column contains 'MultiPolygon's associated with each country:

gdf.geometry
0      MULTIPOLYGON (((-180.00000 -16.55522, -179.917...
1      MULTIPOLYGON (((33.90371 -0.95000, 31.86617 -1...
                             ...                        
175    MULTIPOLYGON (((-61.68000 10.76000, -61.66000 ...
176    MULTIPOLYGON (((30.83385 3.50917, 31.24556 3.7...
Name: geometry, Length: 177, dtype: geometry

The geometry column also contains the spatial reference information, if any (also accessible with the shortcut gdf.crs):

gdf.geometry.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

Many geometry operations, such as calculating the centroid, buffer, or bounding box of each feature involve just the geometry. Applying this type of operation on a GeoDataFrame is therefore basically a shortcut to applying it on the GeoSeries object in the geometry column. For example, the two following commands return exactly the same result, a GeoSeries with country centroids:

gdf.centroid
/tmp/ipykernel_177/2017122361.py:1: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  gdf.centroid
0      POINT (163.85312 -17.31631)
1        POINT (34.75299 -6.25773)
                  ...             
175     POINT (-61.33037 10.42824)
176       POINT (30.19862 7.29289)
Length: 177, dtype: geometry
gdf.geometry.centroid
/tmp/ipykernel_177/3951700624.py:1: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  gdf.geometry.centroid
0      POINT (163.85312 -17.31631)
1        POINT (34.75299 -6.25773)
                  ...             
175     POINT (-61.33037 10.42824)
176       POINT (30.19862 7.29289)
Length: 177, dtype: geometry

Note that .centroid, and other similar operators in geopandas such as .buffer (Section 4.3.3) or .convex_hull, return only the geometry (i.e., a GeoSeries), not a GeoDataFrame with the original attribute data. In case we want the latter, we can create a copy of the GeoDataFrame and then “overwrite” its geometry (or, we can overwrite the geometries directly in case we don’t need the original ones, as in gdf.geometry=gdf.centroid):

gdf2 = gdf.copy()
gdf2.geometry = gdf.centroid
gdf2
/tmp/ipykernel_177/497331910.py:2: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  gdf2.geometry = gdf.centroid
name_long geometry
0 Fiji POINT (163.85312 -17.31631)
1 Tanzania POINT (34.75299 -6.25773)
... ... ...
175 Trinidad and Tobago POINT (-61.33037 10.42824)
176 South Sudan POINT (30.19862 7.29289)

177 rows × 2 columns

Another useful property of the geometry column is the geometry type, as shown in the following code. Note that the types of geometries contained in a geometry column (and, thus, a vector layer) are not necessarily the same for every row. Accordingly, the .type property returns a Series (of type string), rather than a single value (the same can be done with the shortcut gdf.geom_type):

gdf.geometry.type
0      MultiPolygon
1      MultiPolygon
           ...     
175    MultiPolygon
176    MultiPolygon
Length: 177, dtype: object

To summarize the occurrence of different geometry types in a geometry column, we can use the pandas method called value_counts:

gdf.geometry.type.value_counts()
MultiPolygon    177
dtype: int64

In this case, we see that the gdf layer contains only 'MultiPolygon' geometries. It is possible to have multiple geometry types in a single GeoSeries. A GeoDataFrame can also have multiple GeoSeries:

gdf['centroids'] = gdf.centroid
gdf['polygons'] = gdf.geometry
gdf
/tmp/ipykernel_177/104973583.py:1: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  gdf['centroids'] = gdf.centroid
name_long geometry centroids polygons
0 Fiji MULTIPOLYGON (((-180.00000 -16.55522, -179.917... POINT (163.85312 -17.31631) MULTIPOLYGON (((-180.00000 -16.55522, -179.917...
1 Tanzania MULTIPOLYGON (((33.90371 -0.95000, 31.86617 -1... POINT (34.75299 -6.25773) MULTIPOLYGON (((33.90371 -0.95000, 31.86617 -1...
... ... ... ... ...
175 Trinidad and Tobago MULTIPOLYGON (((-61.68000 10.76000, -61.66000 ... POINT (-61.33037 10.42824) MULTIPOLYGON (((-61.68000 10.76000, -61.66000 ...
176 South Sudan MULTIPOLYGON (((30.83385 3.50917, 31.24556 3.7... POINT (30.19862 7.29289) MULTIPOLYGON (((30.83385 3.50917, 31.24556 3.7...

177 rows × 4 columns

Only one geometry column at a time is “active”, in the sense that it is being accessed in operations involving the geometries (such as .centroid, .crs, etc.). To switch the active geometry column from one GeoSeries column to another, we use set_geometry. For example (Figure 1.4, Figure 1.5):

gdf = gdf.set_geometry('centroids')
gdf.explore()
Make this Notebook Trusted to load map: File -> Trust Notebook
Figure 1.4: Switching to the 'centroids' geometry column in the world layer, and plotting it
gdf = gdf.set_geometry('polygons')
gdf.explore()
Make this Notebook Trusted to load map: File -> Trust Notebook