import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
import rasterio
2 Attribute data operations
2.1 Prerequisites
Let’s import the required packages:
and load the sample data for this chapter:
Attempting to get the data
= gpd.read_file('data/world.gpkg')
world = rasterio.open('data/elev.tif')
src_elev = rasterio.open('data/landsat.tif') src_multi_rast
2.2 Introduction
Attribute data is non-spatial information associated with geographic (geometry) data. A bus stop provides a simple example: its position would typically be represented by latitude and longitude coordinates (geometry data), in addition to its name. The Elephant & Castle / New Kent Road stop in London, for example has coordinates of -0.098
degrees longitude and 51.495
degrees latitude which can be represented as POINT (-0.098 51.495)
in the Simple Feature representation described in Chapter 1. Attributes such as the name attribute of the POINT
feature (to use Simple Features terminology) are the topic of this chapter.
Another example is the elevation value (attribute) for a specific grid cell in raster data. Unlike the vector data model, the raster data model stores the coordinate of the grid cell indirectly, meaning the distinction between attribute and spatial information is less clear. To illustrate the point, think of a pixel in the 3rd row and the 4th column of a raster matrix. Its spatial location is defined by its index in the matrix: move from the origin four cells in the x direction (typically east and right on maps) and three cells in the y direction (typically south and down). The raster’s resolution defines the distance for each x- and y-step which is specified in a header. The header is a vital component of raster datasets which specifies how pixels relate to geographic coordinates (see also Chapter 3).
This chapter teaches how to manipulate geographic objects based on attributes such as the names of bus stops in a vector dataset and elevations of pixels in a raster dataset. For vector data, this means techniques such as subsetting and aggregation (see Section 2.3.1 and Section 2.3.2). Section 2.3.3 and Section 2.3.4 demonstrate how to join data onto simple feature objects using a shared ID and how to create new variables, respectively. Each of these operations has a spatial equivalent: [
operator for subsetting a (Geo)DataFrame
using a boolean Series
, for example, is applicable both for subsetting objects based on their attribute and spatial relations derived using methods such as .intersects
; you can also join attributes in two geographic datasets using spatial joins. This is good news: skills developed in this chapter are cross-transferable. Chapter 3 extends the methods presented here to the spatial world.
After a deep dive into various types of vector attribute operations in the next section, raster attribute data operations are covered in Section 2.4, which demonstrates extracting cell values from one or more layer (raster subsetting). Section 2.4.2 provides an overview of ‘global’ raster operations which can be used to summarize entire raster datasets.
2.3 Vector attribute manipulation
As mentioned in Section 1.2.2, vector layers (GeoDataFrame
, from package geopandas) are basically extended tables (DataFrame
from package pandas), the difference being that a vector layer has a geometry column. Since GeoDataFrame
extends DataFrame
, all ordinary table-related operations from package pandas are supported for vector layers as well, as shown below.
2.3.1 Vector attribute subsetting
pandas supports several subsetting interfaces, though the most recommended ones are:
.loc
, which uses pandas indices, and.iloc
, which uses (implicit) numpy-style numeric indices.
In both cases, the method is followed by square brackets, and two indices, separated by a comma. Each index can comprise:
- A specific value, as in
1
- A
list
, as in[0,2,4]
- A slice, as in
0:3
:
—indicating “all” indices
An exception to this rule is selecting columns using a list, which we do using shorter notation, as in df[['a','b']]
, instead of df.loc[:, ['a','b']]
, to select columns 'a'
and 'b'
from df
.
Here are few examples of subsetting the GeoDataFrame
of world countries (Figure 1.1).
Subsetting rows by position, e.g., the first three rows:
0:3, :] world.iloc[
iso_a2 | name_long | continent | ... | lifeExp | gdpPercap | geometry | |
---|---|---|---|---|---|---|---|
0 | FJ | Fiji | Oceania | ... | 69.960 | 8222.253784 | MULTIPOLYGON (((-180.00000 -16.... |
1 | TZ | Tanzania | Africa | ... | 64.163 | 2402.099404 | MULTIPOLYGON (((33.90371 -0.950... |
2 | EH | Western Sahara | Africa | ... | NaN | NaN | MULTIPOLYGON (((-8.66559 27.656... |
3 rows × 11 columns
which is equivalent to:
3] world.iloc[:
iso_a2 | name_long | continent | ... | lifeExp | gdpPercap | geometry | |
---|---|---|---|---|---|---|---|
0 | FJ | Fiji | Oceania | ... | 69.960 | 8222.253784 | MULTIPOLYGON (((-180.00000 -16.... |
1 | TZ | Tanzania | Africa | ... | 64.163 | 2402.099404 | MULTIPOLYGON (((33.90371 -0.950... |
2 | EH | Western Sahara | Africa | ... | NaN | NaN | MULTIPOLYGON (((-8.66559 27.656... |
3 rows × 11 columns
as well as:
3) world.head(
iso_a2 | name_long | continent | ... | lifeExp | gdpPercap | geometry | |
---|---|---|---|---|---|---|---|
0 | FJ | Fiji | Oceania | ... | 69.960 | 8222.253784 | MULTIPOLYGON (((-180.00000 -16.... |
1 | TZ | Tanzania | Africa | ... | 64.163 | 2402.099404 | MULTIPOLYGON (((33.90371 -0.950... |
2 | EH | Western Sahara | Africa | ... | NaN | NaN | MULTIPOLYGON (((-8.66559 27.656... |
3 rows × 11 columns
Subsetting columns by position, e.g., the first three columns:
0:3] world.iloc[:,
iso_a2 | name_long | continent | |
---|---|---|---|
0 | FJ | Fiji | Oceania |
1 | TZ | Tanzania | Africa |
2 | EH | Western Sahara | Africa |
... | ... | ... | ... |
174 | XK | Kosovo | Europe |
175 | TT | Trinidad and Tobago | North America |
176 | SS | South Sudan | Africa |
177 rows × 3 columns
Subsetting rows and columns by position:
0:3, 0:3] world.iloc[
iso_a2 | name_long | continent | |
---|---|---|---|
0 | FJ | Fiji | Oceania |
1 | TZ | Tanzania | Africa |
2 | EH | Western Sahara | Africa |
Subsetting columns by name:
'name_long', 'geometry']] world[[
name_long | geometry | |
---|---|---|
0 | Fiji | MULTIPOLYGON (((-180.00000 -16.... |
1 | Tanzania | MULTIPOLYGON (((33.90371 -0.950... |
2 | Western Sahara | MULTIPOLYGON (((-8.66559 27.656... |
... | ... | ... |
174 | Kosovo | MULTIPOLYGON (((20.59025 41.855... |
175 | Trinidad and Tobago | MULTIPOLYGON (((-61.68000 10.76... |
176 | South Sudan | MULTIPOLYGON (((30.83385 3.5091... |
177 rows × 2 columns
“Slice” of columns between given ones:
'name_long':'pop'] world.loc[:,
name_long | continent | region_un | ... | type | area_km2 | pop | |
---|---|---|---|---|---|---|---|
0 | Fiji | Oceania | Oceania | ... | Sovereign country | 19289.970733 | 885806.0 |
1 | Tanzania | Africa | Africa | ... | Sovereign country | 932745.792357 | 52234869.0 |
2 | Western Sahara | Africa | Africa | ... | Indeterminate | 96270.601041 | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
174 | Kosovo | Europe | Europe | ... | Sovereign country | 11230.261672 | 1821800.0 |
175 | Trinidad and Tobago | North America | Americas | ... | Sovereign country | 7737.809855 | 1354493.0 |
176 | South Sudan | Africa | Africa | ... | Sovereign country | 624909.099086 | 11530971.0 |
177 rows × 7 columns
Subsetting by a list of boolean values (0
and 1
, or True
and False
):
= [1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0]
x world.iloc[:, x]
name_long | name_long | iso_a2 | ... | name_long | iso_a2 | iso_a2 | |
---|---|---|---|---|---|---|---|
0 | Fiji | Fiji | FJ | ... | Fiji | FJ | FJ |
1 | Tanzania | Tanzania | TZ | ... | Tanzania | TZ | TZ |
2 | Western Sahara | Western Sahara | EH | ... | Western Sahara | EH | EH |
... | ... | ... | ... | ... | ... | ... | ... |
174 | Kosovo | Kosovo | XK | ... | Kosovo | XK | XK |
175 | Trinidad and Tobago | Trinidad and Tobago | TT | ... | Trinidad and Tobago | TT | TT |
176 | South Sudan | South Sudan | SS | ... | South Sudan | SS | SS |
177 rows × 11 columns
We can remove specific rows by id using the .drop
method, e.g., dropping rows 2, 3, and 5:
2, 3, 5]) world.drop([
iso_a2 | name_long | continent | ... | lifeExp | gdpPercap | geometry | |
---|---|---|---|---|---|---|---|
0 | FJ | Fiji | Oceania | ... | 69.960000 | 8222.253784 | MULTIPOLYGON (((-180.00000 -16.... |
1 | TZ | Tanzania | Africa | ... | 64.163000 | 2402.099404 | MULTIPOLYGON (((33.90371 -0.950... |
4 | US | United States | North America | ... | 78.841463 | 51921.984639 | MULTIPOLYGON (((-171.73166 63.7... |
... | ... | ... | ... | ... | ... | ... | ... |
174 | XK | Kosovo | Europe | ... | 71.097561 | 8698.291559 | MULTIPOLYGON (((20.59025 41.855... |
175 | TT | Trinidad and Tobago | North America | ... | 70.426000 | 31181.821196 | MULTIPOLYGON (((-61.68000 10.76... |
176 | SS | South Sudan | Africa | ... | 55.817000 | 1935.879400 | MULTIPOLYGON (((30.83385 3.5091... |
174 rows × 11 columns
Or remove specific columns using the .drop
method and axis=1
(i.e., columns):
'name_long', 'continent'], axis=1) world.drop([
iso_a2 | region_un | subregion | ... | lifeExp | gdpPercap | geometry | |
---|---|---|---|---|---|---|---|
0 | FJ | Oceania | Melanesia | ... | 69.960000 | 8222.253784 | MULTIPOLYGON (((-180.00000 -16.... |
1 | TZ | Africa | Eastern Africa | ... | 64.163000 | 2402.099404 | MULTIPOLYGON (((33.90371 -0.950... |
2 | EH | Africa | Northern Africa | ... | NaN | NaN | MULTIPOLYGON (((-8.66559 27.656... |
... | ... | ... | ... | ... | ... | ... | ... |
174 | XK | Europe | Southern Europe | ... | 71.097561 | 8698.291559 | MULTIPOLYGON (((20.59025 41.855... |
175 | TT | Americas | Caribbean | ... | 70.426000 | 31181.821196 | MULTIPOLYGON (((-61.68000 10.76... |
176 | SS | Africa | Eastern Africa | ... | 55.817000 | 1935.879400 | MULTIPOLYGON (((30.83385 3.5091... |
177 rows × 9 columns
We can rename columns using the .rename
method:
'name_long', 'pop']].rename(columns={'pop': 'population'}) world[[
name_long | population | |
---|---|---|
0 | Fiji | 885806.0 |
1 | Tanzania | 52234869.0 |
2 | Western Sahara | NaN |
... | ... | ... |
174 | Kosovo | 1821800.0 |
175 | Trinidad and Tobago | 1354493.0 |
176 | South Sudan | 11530971.0 |
177 rows × 2 columns
The standard numpy comparison operators can be used in boolean subsetting, as illustrated in Table Table 2.1.
Symbol |
Name |
---|---|
== |
Equal to |
!= |
Not equal to |
> , < |
Greater/Less than |
>= , <= |
Greater/Less than or equal |
& , | , ~ |
Logical operators: And, Or, Not |
The following example demonstrates logical vectors for subsetting by creating a new GeoDataFrame
object called small_countries
that contains only those countries whose surface area is smaller than 10,000 \(km^2\):
= world['area_km2'] < 10000 ## a logical 'Series'
idx_small = world[idx_small]
small_countries small_countries
iso_a2 | name_long | continent | ... | lifeExp | gdpPercap | geometry | |
---|---|---|---|---|---|---|---|
45 | PR | Puerto Rico | North America | ... | 79.390122 | 35066.046376 | MULTIPOLYGON (((-66.28243 18.51... |
79 | PS | Palestine | Asia | ... | 73.126000 | 4319.528283 | MULTIPOLYGON (((35.39756 31.489... |
89 | VU | Vanuatu | Oceania | ... | 71.709000 | 2892.341604 | MULTIPOLYGON (((166.79316 -15.6... |
... | ... | ... | ... | ... | ... | ... | ... |
160 | NaN | Northern Cyprus | Asia | ... | NaN | NaN | MULTIPOLYGON (((32.73178 35.140... |
161 | CY | Cyprus | Asia | ... | 80.173000 | 29786.365653 | MULTIPOLYGON (((32.73178 35.140... |
175 | TT | Trinidad and Tobago | North America | ... | 70.426000 | 31181.821196 | MULTIPOLYGON (((-61.68000 10.76... |
7 rows × 11 columns
The intermediary idx_small
(short for index representing small countries) is a boolean Series
that can be used to subset the seven smallest countries in the world by surface area. A more concise command, which omits the intermediary object, generates the same result:
= world[world['area_km2'] < 10000]
small_countries small_countries
iso_a2 | name_long | continent | ... | lifeExp | gdpPercap | geometry | |
---|---|---|---|---|---|---|---|
45 | PR | Puerto Rico | North America | ... | 79.390122 | 35066.046376 | MULTIPOLYGON (((-66.28243 18.51... |
79 | PS | Palestine | Asia | ... | 73.126000 | 4319.528283 | MULTIPOLYGON (((35.39756 31.489... |
89 | VU | Vanuatu | Oceania | ... | 71.709000 | 2892.341604 | MULTIPOLYGON (((166.79316 -15.6... |
... | ... | ... | ... | ... | ... | ... | ... |
160 | NaN | Northern Cyprus | Asia | ... | NaN | NaN | MULTIPOLYGON (((32.73178 35.140... |
161 | CY | Cyprus | Asia | ... | 80.173000 | 29786.365653 | MULTIPOLYGON (((32.73178 35.140... |
175 | TT | Trinidad and Tobago | North America | ... | 70.426000 | 31181.821196 | MULTIPOLYGON (((-61.68000 10.76... |
7 rows × 11 columns
The various methods shown above can be chained for any combination with several subsetting steps, e.g.:
'continent'] == 'Asia'] \
world[world['name_long', 'continent']] \
.loc[:, [0:5, :] .iloc[
name_long | continent | |
---|---|---|
5 | Kazakhstan | Asia |
6 | Uzbekistan | Asia |
8 | Indonesia | Asia |
24 | Timor-Leste | Asia |
76 | Israel | Asia |
We can also combine indexes:
= world['area_km2'] < 10000
idx_small = world['continent'] == 'Asia'
idx_asia & idx_asia, ['name_long', 'continent', 'area_km2']] world.loc[idx_small
name_long | continent | area_km2 | |
---|---|---|---|
79 | Palestine | Asia | 5037.103826 |
160 | Northern Cyprus | Asia | 3786.364506 |
161 | Cyprus | Asia | 6207.006191 |
2.3.2 Vector attribute aggregation
Aggregation involves summarizing data based on one or more grouping variables (typically values in a column; geographic aggregation is covered in Section 3.3.5). A classic example of this attribute-based aggregation is calculating the number of people per continent based on country-level data (one row per country). The world
dataset contains the necessary ingredients: the columns pop
and continent
, the population and the grouping variable, respectively. The aim is to find the sum()
of country populations for each continent, resulting in a smaller table or vector layer (of continents). (Since aggregation is a form of data reduction, it can be a useful early step when working with large datasets).
Non-spatial aggregation can be achieved using a combination of .groupby
and .sum
:
= world[['continent', 'pop']].groupby('continent').sum()
world_agg1 world_agg1
pop | |
---|---|
continent | |
Africa | 1.154947e+09 |
Antarctica | 0.000000e+00 |
Asia | 4.311408e+09 |
... | ... |
Oceania | 3.775783e+07 |
Seven seas (open ocean) | 0.000000e+00 |
South America | 4.120608e+08 |
8 rows × 1 columns
The result is a (non-spatial) table with eight rows, one per continent, and two columns reporting the name and population of each continent.
If we want to include the geometry in the aggregation result, we can use the .dissolve
method. That way, in addition to the summed population, we also get the associated geometry per continent, i.e., the union of all countries. Note that we use the by
parameter to choose which column(s) are used for grouping, and the aggfunc
parameter to choose the aggregation function for non-geometry columns:
= world[['continent', 'pop', 'geometry']] \
world_agg2 ='continent', aggfunc='sum') \
.dissolve(by
.reset_index() world_agg2
continent | geometry | pop | |
---|---|---|---|
0 | Africa | MULTIPOLYGON (((-11.43878 6.785... | 1.154947e+09 |
1 | Antarctica | MULTIPOLYGON (((-61.13898 -79.9... | 0.000000e+00 |
2 | Asia | MULTIPOLYGON (((48.67923 14.003... | 4.311408e+09 |
... | ... | ... | ... |
5 | Oceania | MULTIPOLYGON (((147.91405 -43.2... | 3.775783e+07 |
6 | Seven seas (open ocean) | POLYGON ((68.93500 -48.62500, 6... | 0.000000e+00 |
7 | South America | MULTIPOLYGON (((-68.63999 -55.5... | 4.120608e+08 |
8 rows × 3 columns
Figure 2.1 shows the result:
= plt.subplots(figsize=(6, 3))
fig, ax ='pop', edgecolor='black', legend=True, ax=ax); world_agg2.plot(column
The resulting world_agg2
object is a GeoDataFrame
containing 8 features representing the continents of the world (and the open ocean).
Other options for the aggfunc
parameter in .dissolve
include:
'first'
'last'
'min'
'max'
'sum'
'mean'
'median'
Additionally, we can pass custom functions.
As a more complex example, here is how we can calculate the total population, area, and count of countries, per continent:
= world.dissolve(
world_agg3 ='continent',
by={
aggfunc'name_long': 'count',
'pop': 'sum',
'area_km2': 'sum'
={'name_long': 'n'})
}).rename(columns world_agg3
geometry | n | pop | area_km2 | |
---|---|---|---|---|
continent | ||||
Africa | MULTIPOLYGON (((-11.43878 6.785... | 51 | 1.154947e+09 | 2.994620e+07 |
Antarctica | MULTIPOLYGON (((-61.13898 -79.9... | 1 | 0.000000e+00 | 1.233596e+07 |
Asia | MULTIPOLYGON (((48.67923 14.003... | 47 | 4.311408e+09 | 3.125246e+07 |
... | ... | ... | ... | ... |
Oceania | MULTIPOLYGON (((147.91405 -43.2... | 7 | 3.775783e+07 | 8.504489e+06 |
Seven seas (open ocean) | POLYGON ((68.93500 -48.62500, 6... | 1 | 0.000000e+00 | 1.160257e+04 |
South America | MULTIPOLYGON (((-68.63999 -55.5... | 13 | 4.120608e+08 | 1.776259e+07 |
8 rows × 4 columns
Figure Figure 2.2 visualizes the resulting layer (world_agg3
) of continents with the three aggregated attributes.
# Summed population
= plt.subplots(figsize=(5, 2.5))
fig, ax ='pop', edgecolor='black', legend=True, ax=ax);
world_agg3.plot(column
# Summed area
= plt.subplots(figsize=(5, 2.5))
fig, ax ='area_km2', edgecolor='black', legend=True, ax=ax);
world_agg3.plot(column
# Count of countries
= plt.subplots(figsize=(5, 2.5))
fig, ax ='n', edgecolor='black', legend=True, ax=ax); world_agg3.plot(column
Let’s proceed with the last result to demonstrate other table-related operations. Given the world_agg3
continent summary (Figure 2.2), we:
- drop the geometry columns,
- calculate population density of each continent,
- arrange continents by the number countries they contain, and
- keep only the 3 most populous continents.
= world_agg3.drop(columns=['geometry'])
world_agg4 'density'] = world_agg4['pop'] / world_agg4['area_km2']
world_agg4[= world_agg4.sort_values(by='n', ascending=False)
world_agg4 = world_agg4.head(3)
world_agg4 world_agg4
n | pop | area_km2 | density | |
---|---|---|---|---|
continent | ||||
Africa | 51 | 1.154947e+09 | 2.994620e+07 | 38.567388 |
Asia | 47 | 4.311408e+09 | 3.125246e+07 | 137.954201 |
Europe | 39 | 6.690363e+08 | 2.306522e+07 | 29.006283 |
2.3.3 Vector attribute joining
Combining data from different sources is a common task in data preparation. Joins do this by combining tables based on a shared “key” variable. pandas has a function named pd.merge
for joining (Geo)DataFrames
based on common column(s). The pd.merge
function follows conventions used in the database language SQL (Grolemund and Wickham 2016, add citation…). The pd.merge
function works the same on DataFrame
and GeoDataFrame
objects. The result of pd.merge
can be either a DataFrame
or a GeoDataFrame
object, depending on the inputs.
A common type of attribute join on spatial data is to join DataFrames
to GeoDataFrames
. To achieve this, we use pd.merge
with a GeoDataFrame
as the first argument and add columns to it from a DataFrame
specified as the second argument. In the following example, we combine data on coffee production with the world
dataset. The coffee data is in a DataFrame
called coffee_data
imported from a CSV file of major coffee-producing nations:
= pd.read_csv('data/coffee_data.csv')
coffee_data coffee_data
name_long | coffee_production_2016 | coffee_production_2017 | |
---|---|---|---|
0 | Angola | NaN | NaN |
1 | Bolivia | 3.0 | 4.0 |
2 | Brazil | 3277.0 | 2786.0 |
... | ... | ... | ... |
44 | Zambia | 3.0 | NaN |
45 | Zimbabwe | 1.0 | 1.0 |
46 | Others | 23.0 | 26.0 |
47 rows × 3 columns
Its columns are:
name_long
—country namecoffee_production_2016
andcoffee_production_2017
—estimated values for coffee production in units of 60-kg bags per year, for 2016 and 2017, respectively
A left join, which preserves the first dataset, merges world
with coffee_data
, based on the common 'name_long'
column:
= pd.merge(world, coffee_data, on='name_long', how='left')
world_coffee world_coffee
iso_a2 | name_long | continent | ... | geometry | coffee_production_2016 | coffee_production_2017 | |
---|---|---|---|---|---|---|---|
0 | FJ | Fiji | Oceania | ... | MULTIPOLYGON (((-180.00000 -16.... | NaN | NaN |
1 | TZ | Tanzania | Africa | ... | MULTIPOLYGON (((33.90371 -0.950... | 81.0 | 66.0 |
2 | EH | Western Sahara | Africa | ... | MULTIPOLYGON (((-8.66559 27.656... | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
174 | XK | Kosovo | Europe | ... | MULTIPOLYGON (((20.59025 41.855... | NaN | NaN |
175 | TT | Trinidad and Tobago | North America | ... | MULTIPOLYGON (((-61.68000 10.76... | NaN | NaN |
176 | SS | South Sudan | Africa | ... | MULTIPOLYGON (((30.83385 3.5091... | NaN | NaN |
177 rows × 13 columns
The result is a GeoDataFrame
object identical to the original world
object, but with two new variables (coffee_production_2016
and coffee_production_2017
) on coffee production. This can be plotted as a map, as illustrated (for coffee_production_2017
) in Figure 2.3:
= world_coffee.plot(color='white', edgecolor='lightgrey')
base = world_coffee.plot(ax=base, column='coffee_production_2017')
coffee_map 'Coffee production'); coffee_map.set_title(
To work, attribute-based joins need a “key variable” in both datasets (on
parameter of pd.merge
). In the above example, both world_coffee
and world
DataFrames contained a column called name_long
. (By default pd.merge
uses all columns with matching names. However, it is recommended to explicitly specify the names of the columns to be used for matching, like we did in the last example.)
In case where column names are not the same, you can use left_on
and right_on
to specify the respective columns.
Note that the result world_coffee
has the same number of rows as the original dataset world
. Although there are only 47 rows in coffee_data
, all 177 country records are kept intact in world_coffee
. Rows in the original dataset with no match are assigned np.nan
values for the new coffee production variables. This is a characteristic of a left join (specified with how='left'
) and is what we typically want to do.
What if we only want to keep countries that have a match in the key variable? In that case an inner join can be used:
='name_long', how='inner') pd.merge(world, coffee_data, on
iso_a2 | name_long | continent | ... | geometry | coffee_production_2016 | coffee_production_2017 | |
---|---|---|---|---|---|---|---|
0 | TZ | Tanzania | Africa | ... | MULTIPOLYGON (((33.90371 -0.950... | 81.0 | 66.0 |
1 | PG | Papua New Guinea | Oceania | ... | MULTIPOLYGON (((141.00021 -2.60... | 114.0 | 74.0 |
2 | ID | Indonesia | Asia | ... | MULTIPOLYGON (((104.36999 -1.08... | 742.0 | 360.0 |
... | ... | ... | ... | ... | ... | ... | ... |
42 | ET | Ethiopia | Africa | ... | MULTIPOLYGON (((47.78942 8.0030... | 215.0 | 283.0 |
43 | UG | Uganda | Africa | ... | MULTIPOLYGON (((33.90371 -0.950... | 408.0 | 443.0 |
44 | RW | Rwanda | Africa | ... | MULTIPOLYGON (((30.41910 -1.134... | 36.0 | 42.0 |
45 rows × 13 columns
An alternative way to join two (Geo)DataFrame
s is the aptly called join
function:
'name_long'), on='name_long', how='inner') world.join(coffee_data.set_index(
iso_a2 | name_long | continent | ... | geometry | coffee_production_2016 | coffee_production_2017 | |
---|---|---|---|---|---|---|---|
1 | TZ | Tanzania | Africa | ... | MULTIPOLYGON (((33.90371 -0.950... | 81.0 | 66.0 |
7 | PG | Papua New Guinea | Oceania | ... | MULTIPOLYGON (((141.00021 -2.60... | 114.0 | 74.0 |
8 | ID | Indonesia | Asia | ... | MULTIPOLYGON (((104.36999 -1.08... | 742.0 | 360.0 |
... | ... | ... | ... | ... | ... | ... | ... |
165 | ET | Ethiopia | Africa | ... | MULTIPOLYGON (((47.78942 8.0030... | 215.0 | 283.0 |
168 | UG | Uganda | Africa | ... | MULTIPOLYGON (((33.90371 -0.950... | 408.0 | 443.0 |
169 | RW | Rwanda | Africa | ... | MULTIPOLYGON (((30.41910 -1.134... | 36.0 | 42.0 |
45 rows × 13 columns
Note that in this case, we need to set the index of coffee_data
to the name_long
values.
2.3.4 Creating attributes and removing spatial information
Often, we would like to create a new column based on already existing columns. For example, we want to calculate population density for each country. For this we need to divide a population column, here pop
, by an area column, here area_km2
. Note that we are working on a copy of world
named world2
so that we do not modify the original layer:
= world.copy()
world2 'pop_dens'] = world2['pop'] / world2['area_km2']
world2[ world2
iso_a2 | name_long | continent | ... | gdpPercap | geometry | pop_dens | |
---|---|---|---|---|---|---|---|
0 | FJ | Fiji | Oceania | ... | 8222.253784 | MULTIPOLYGON (((-180.00000 -16.... | 45.920547 |
1 | TZ | Tanzania | Africa | ... | 2402.099404 | MULTIPOLYGON (((33.90371 -0.950... | 56.001184 |
2 | EH | Western Sahara | Africa | ... | NaN | MULTIPOLYGON (((-8.66559 27.656... | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
174 | XK | Kosovo | Europe | ... | 8698.291559 | MULTIPOLYGON (((20.59025 41.855... | 162.222400 |
175 | TT | Trinidad and Tobago | North America | ... | 31181.821196 | MULTIPOLYGON (((-61.68000 10.76... | 175.048628 |
176 | SS | South Sudan | Africa | ... | 1935.879400 | MULTIPOLYGON (((30.83385 3.5091... | 18.452237 |
177 rows × 12 columns
To paste (i.e., concatenate) together existing columns, we can use the ordinary Python string operator +
, as if we are working with individual strings rather than Series
. For example, we want to combine the continent
and region_un
columns into a new column named con_reg
, using ':'
as a separator. Subsequesntly, we remove the original columns using .drop
:
'con_reg'] = world['continent'] + ':' + world2['region_un']
world2[= world2.drop(['continent', 'region_un'], axis=1)
world2 world2
iso_a2 | name_long | subregion | ... | geometry | pop_dens | con_reg | |
---|---|---|---|---|---|---|---|
0 | FJ | Fiji | Melanesia | ... | MULTIPOLYGON (((-180.00000 -16.... | 45.920547 | Oceania:Oceania |
1 | TZ | Tanzania | Eastern Africa | ... | MULTIPOLYGON (((33.90371 -0.950... | 56.001184 | Africa:Africa |
2 | EH | Western Sahara | Northern Africa | ... | MULTIPOLYGON (((-8.66559 27.656... | NaN | Africa:Africa |
... | ... | ... | ... | ... | ... | ... | ... |
174 | XK | Kosovo | Southern Europe | ... | MULTIPOLYGON (((20.59025 41.855... | 162.222400 | Europe:Europe |
175 | TT | Trinidad and Tobago | Caribbean | ... | MULTIPOLYGON (((-61.68000 10.76... | 175.048628 | North America:Americas |
176 | SS | South Sudan | Eastern Africa | ... | MULTIPOLYGON (((30.83385 3.5091... | 18.452237 | Africa:Africa |
177 rows × 11 columns
The resulting GeoDataFrame
object has a new column called con_reg
representing the continent and region of each country, e.g., 'South America:Americas'
for Argentina and other South America countries. The opposite operation, splitting one column into multiple columns based on a separator string, is done using the .str.split
method. As a result we go back to the previous state of two separate continent
and region_un
columns (only that their position is now last, since they are newly created):
'continent', 'region_un']] = world2['con_reg'] \
world2[[str.split(':', expand=True)
. world2
iso_a2 | name_long | subregion | ... | con_reg | continent | region_un | |
---|---|---|---|---|---|---|---|
0 | FJ | Fiji | Melanesia | ... | Oceania:Oceania | Oceania | Oceania |
1 | TZ | Tanzania | Eastern Africa | ... | Africa:Africa | Africa | Africa |
2 | EH | Western Sahara | Northern Africa | ... | Africa:Africa | Africa | Africa |
... | ... | ... | ... | ... | ... | ... | ... |
174 | XK | Kosovo | Southern Europe | ... | Europe:Europe | Europe | Europe |
175 | TT | Trinidad and Tobago | Caribbean | ... | North America:Americas | North America | Americas |
176 | SS | South Sudan | Eastern Africa | ... | Africa:Africa | Africa | Africa |
177 rows × 13 columns
Renaming one or more columns can be done using the .rename
method combined with the columns
argument, which should be a dictionary of the form old_name:new_name
. The following command, for example, renames the lengthy name_long
column to simply name
:
={'name_long': 'name'}) world2.rename(columns
iso_a2 | name | subregion | ... | con_reg | continent | region_un | |
---|---|---|---|---|---|---|---|
0 | FJ | Fiji | Melanesia | ... | Oceania:Oceania | Oceania | Oceania |
1 | TZ | Tanzania | Eastern Africa | ... | Africa:Africa | Africa | Africa |
2 | EH | Western Sahara | Northern Africa | ... | Africa:Africa | Africa | Africa |
... | ... | ... | ... | ... | ... | ... | ... |
174 | XK | Kosovo | Southern Europe | ... | Europe:Europe | Europe | Europe |
175 | TT | Trinidad and Tobago | Caribbean | ... | North America:Americas | North America | Americas |
176 | SS | South Sudan | Eastern Africa | ... | Africa:Africa | Africa | Africa |
177 rows × 13 columns
To change all column names at once, we assign a list
of the “new” column names into the .columns
property. The list
must be of the same length as the number of columns (i.e., world.shape[1]
). This is illustrated below, which outputs the same world2
object, but with very short names:
= ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'geom', 'i', 'j', 'k', 'l']
new_names = new_names
world2.columns world2
a | b | c | ... | j | k | l | |
---|---|---|---|---|---|---|---|
0 | FJ | Fiji | Melanesia | ... | Oceania:Oceania | Oceania | Oceania |
1 | TZ | Tanzania | Eastern Africa | ... | Africa:Africa | Africa | Africa |
2 | EH | Western Sahara | Northern Africa | ... | Africa:Africa | Africa | Africa |
... | ... | ... | ... | ... | ... | ... | ... |
174 | XK | Kosovo | Southern Europe | ... | Europe:Europe | Europe | Europe |
175 | TT | Trinidad and Tobago | Caribbean | ... | North America:Americas | North America | Americas |
176 | SS | South Sudan | Eastern Africa | ... | Africa:Africa | Africa | Africa |
177 rows × 13 columns
To reorder columns, we can pass a modified columns list to the subsetting operator [
. For example, the following expressions reorder world2
columns in reverse alphabetical order:
= sorted(world2.columns, reverse=True)
names = world2[names]
world2 world2
l | k | j | ... | c | b | a | |
---|---|---|---|---|---|---|---|
0 | Oceania | Oceania | Oceania:Oceania | ... | Melanesia | Fiji | FJ |
1 | Africa | Africa | Africa:Africa | ... | Eastern Africa | Tanzania | TZ |
2 | Africa | Africa | Africa:Africa | ... | Northern Africa | Western Sahara | EH |
... | ... | ... | ... | ... | ... | ... | ... |
174 | Europe | Europe | Europe:Europe | ... | Southern Europe | Kosovo | XK |
175 | Americas | North America | North America:Americas | ... | Caribbean | Trinidad and Tobago | TT |
176 | Africa | Africa | Africa:Africa | ... | Eastern Africa | South Sudan | SS |
177 rows × 13 columns
Each of these attribute data operations, even though they are defined in the pandas
package and applicable to any DataFrame
, preserve the geometry column and the GeoDataFrame
class. Sometimes, however, it makes sense to remove the geometry, for example to speed-up aggregation or to export just the attribute data for statistical analysis. To go from GeoDataFrame
to DataFrame
we need to:
- Drop the geometry column
- Convert from
GeoDataFrame
into aDataFrame
For example, by the end of the following code section world2
becomes a DataFrame
:
= world2.drop('geom', axis=1)
world2 = pd.DataFrame(world2)
world2 world2
l | k | j | ... | c | b | a | |
---|---|---|---|---|---|---|---|
0 | Oceania | Oceania | Oceania:Oceania | ... | Melanesia | Fiji | FJ |
1 | Africa | Africa | Africa:Africa | ... | Eastern Africa | Tanzania | TZ |
2 | Africa | Africa | Africa:Africa | ... | Northern Africa | Western Sahara | EH |
... | ... | ... | ... | ... | ... | ... | ... |
174 | Europe | Europe | Europe:Europe | ... | Southern Europe | Kosovo | XK |
175 | Americas | North America | North America:Americas | ... | Caribbean | Trinidad and Tobago | TT |
176 | Africa | Africa | Africa:Africa | ... | Eastern Africa | South Sudan | SS |
177 rows × 12 columns
2.4 Manipulating raster objects
2.4.1 Raster subsetting
When using rasterio
, raster values are accessible through a numpy
array, which can be imported with the .read
method:
= src_elev.read(1)
elev elev
array([[ 1, 2, 3, 4, 5, 6],
[ 7, 8, 9, 10, 11, 12],
[13, 14, 15, 16, 17, 18],
[19, 20, 21, 22, 23, 24],
[25, 26, 27, 28, 29, 30],
[31, 32, 33, 34, 35, 36]], dtype=uint8)
Then, we can access any subset of cell values using numpy
methods, e.g.:
1, 2] ## Value at row 2, column 3 elev[
9
Cell values can be modified by overwriting existing values in conjunction with a subsetting operation, e.g. to set cell at row 2, column 3 of elev
to 0
:
1, 2] = 0
elev[ elev
array([[ 1, 2, 3, 4, 5, 6],
[ 7, 8, 0, 10, 11, 12],
[13, 14, 15, 16, 17, 18],
[19, 20, 21, 22, 23, 24],
[25, 26, 27, 28, 29, 30],
[31, 32, 33, 34, 35, 36]], dtype=uint8)
Multiple cells can also be modified in this way:
0, 0:3] = 0
elev[ elev
array([[ 0, 0, 0, 4, 5, 6],
[ 7, 8, 0, 10, 11, 12],
[13, 14, 15, 16, 17, 18],
[19, 20, 21, 22, 23, 24],
[25, 26, 27, 28, 29, 30],
[31, 32, 33, 34, 35, 36]], dtype=uint8)
2.4.2 Summarizing raster objects
Global summaries of raster values can be calculated by applying numpy
summary functions on the array with raster values, e.g. np.mean
:
np.mean(elev)
18.083333333333332
Note that “No Data”-safe functions–such as np.nanmean
—should be used in case the raster contains “No Data” values which need to be ignored. Before we can demontrate that, we must convert the array from int
to float
, as int
arrays cannot contain np.nan
(due to computer memory limitations):
= elev.copy()
elev1 = elev1.astype('float64')
elev1 elev1
array([[ 0., 0., 0., 4., 5., 6.],
[ 7., 8., 0., 10., 11., 12.],
[13., 14., 15., 16., 17., 18.],
[19., 20., 21., 22., 23., 24.],
[25., 26., 27., 28., 29., 30.],
[31., 32., 33., 34., 35., 36.]])
Now we can insert an np.nan
value into the array. (Trying to do so in the original elev
array raises an error, try it to see for yourself.)
0, 2] = np.nan
elev1[ elev1
array([[ 0., 0., nan, 4., 5., 6.],
[ 7., 8., 0., 10., 11., 12.],
[13., 14., 15., 16., 17., 18.],
[19., 20., 21., 22., 23., 24.],
[25., 26., 27., 28., 29., 30.],
[31., 32., 33., 34., 35., 36.]])
With the np.nan
value inplace, the summary value becomes unknown:
np.mean(elev1)
nan
To get a summary of all non-missing values, we need to use the specialized numpy functions that ignore “No Data” values:
np.nanmean(elev1)
18.6
Raster value statistics can be visualized in a variety of ways. One approach is to “flatten” the raster values into a one-dimensional array, then use a graphical function such as plt.hist
or plt.boxplot
(from matplotlib.pyplot
). For example, the following code section shows the distribution of values in elev
using a histogram (Figure 2.4):
= elev.flatten()
x ; plt.hist(x)
elev