This walkthrough covers some of the same concepts from Jamie Montgomery’s guide on spatial data in R, which is how I first learned to use the sf
package. It will cover some of the same concepts and also some different ones.
As her guide does, I will start with background from the sf vignette:
"Simple features or simple feature access refers to a formal standard (ISO 19125-1:2004) that describes how objects in the real world can be represented in computers, with emphasis on the spatial geometry of these objects. It also describes how such objects can be stored in and retrieved from databases, and which geometrical operations should be defined for them.
"The standard is widely implemented in spatial databases (such as PostGIS), commercial GIS (e.g., ESRI ArcGIS) and forms the vector data basis for libraries such as GDAL. A subset of simple features forms the GeoJSON standard.
"R has well-supported classes for storing spatial data (sp) and interfacing to the above mentioned environments (rgdal, rgeos), but has so far lacked a complete implementation of simple features, making conversions at times convoluted, inefficient or incomplete. The package sf tries to fill this gap, and aims at succeeding sp in the long term."
We will read in a shapefile of wildfire perimeters from the US Geological Survey (USGS) Monitoring Trends in Burn Severity (MTBS) dataset. The data contained in this repo are for California wildfires from 2010 through 2016.
Taking a look at the data:
## Simple feature collection with 6 features and 7 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -117.1918 ymin: 32.54717 xmax: -116.2191 ymax: 33.01739
## epsg (SRID): 4269
## proj4string: +proj=longlat +datum=NAD83 +no_defs
## # A tibble: 6 x 8
## Fire_ID Fire_Name Year StartMonth StartDay Fire_Type Acres
## <chr> <chr> <int> <int> <int> <chr> <dbl>
## 1 CA3260~ BORDER 6 2012 5 17 WF 5302
## 2 CA3260~ BORDER 3 2016 6 19 WF 7958
## 3 CA3264~ SHOCKEY 2012 9 23 WF 2667
## 4 CA3271~ OLD 2 2012 6 17 WF 1058
## 5 CA3289~ MONTE 2010 8 21 WF 1013
## 6 CA3300~ BERNARDO 2014 5 13 WF 1525
## # ... with 1 more variable: geometry <MULTIPOLYGON [°]>
An sf
object is similar to a dataframe except that it has an additional column, the geometry column, which contains a list of coordinates used to plot each polygon.
## [1] "sf" "tbl_df" "tbl" "data.frame"
Visualizing the data spatially using mapview
.
Occasionally I do not care about the spatial parts of a shapefile and I only want to recover the attributes. In the sf
environment one can easily convert to a dataframe and de-select the spatial components.
## Fire_ID Fire_Name Year StartMonth StartDay Fire_Type Acres
## 1 CA3260811624320120517 BORDER 6 2012 5 17 WF 5302
## 2 CA3260811664120160619 BORDER 3 2016 6 19 WF 7958
## 3 CA3264811636520120923 SHOCKEY 2012 9 23 WF 2667
## 4 CA3271011638520120617 OLD 2 2012 6 17 WF 1058
## 5 CA3289411681720100821 MONTE 2010 8 21 WF 1013
## 6 CA3300311713320140513 BERNARDO 2014 5 13 WF 1525
I could potentially take the above dataframe elsewhere, write it to a csv, and use it for something else. Another quick way to drop the geometry:
Suppose we would like to know what kinds of fires there are in the data field Fire_Type
.
## # A tibble: 4 x 1
## Fire_Type
## <chr>
## 1 WF
## 2 UNK
## 3 RX
## 4 WFU
Perhaps we are unconcerned with “RX” fires, which are prescribed burns. We can remove them from our sf
object using dplyr::filter
.
This dropped a small number of observations.
Another important feature of any spatial file is the coordinate reference system (CRS). I will once again reference Jamie Montgomery’s concise explanation of coordinate reference systems:
Every sf object needs a coordinate reference system (or crs) defined in order to work with it correctly. A coordinate reference system contains both a datum and a projection. The datum is how you georeference your points (in 3 dimensions!) onto a spheroid. The projection is how these points are mathematically transformed to represent the georeferenced point on a flat piece of paper. All coordinate reference systems require a datum. However, some coordinate reference systems are “unprojected” (also called geographic coordinate systems). Coordinates in latitude/longitude use a geographic (unprojected) coordinate system.
We can retrieve the coordinate reference system of our sf
object using sf::st_crs
.
## Coordinate Reference System:
## EPSG: 4269
## proj4string: "+proj=longlat +datum=NAD83 +no_defs"
By the output we see that the coordinates are unprojected and are instead in lat-long format. We can confirm this with a simple function.
## [1] TRUE
To make full use of sf
’s distance and area calculations you need a flat projection. sf::st_is_longlat
is a good sanity check if you are unsure why your distance and area calculations are not working.
What projection is best? A flat projection affects the aesthetics of a map, but it also will affect our distance calculations.
California generally recommends to use the “California (Teale) Albers” projection. Every map projection is associated with an “EPSG” code, and the EPSG code for California Albers is 3310. We can project the data using sf::st_transform
.
fire_proj <-
fire_perims %>%
st_transform(crs=3310) # california albers
# checking:
st_crs(fire_proj)
## Coordinate Reference System:
## EPSG: 3310
## proj4string: "+proj=aea +lat_1=34 +lat_2=40.5 +lat_0=0 +lon_0=-120 +x_0=0 +y_0=-4000000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
We now have more information about the sf
object, including the units of distance (which are in meters).
One simple calculation is an area measurement. Our dataset already gave us an “Acres” field, so we will see how closely our computation matches. The relevant function is sf::st_area
.
fire_proj <-
fire_proj %>%
mutate(area=st_area(fire_proj))
fire_proj %>%
select(Acres,area) %>%
head()
## Simple feature collection with 6 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 262641.8 ymin: -600372 xmax: 355406.3 ymax: -551166.1
## epsg (SRID): 3310
## proj4string: +proj=aea +lat_1=34 +lat_2=40.5 +lat_0=0 +lon_0=-120 +x_0=0 +y_0=-4000000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## # A tibble: 6 x 3
## Acres area geometry
## <dbl> [m^2] <MULTIPOLYGON [m]>
## 1 5302 21454598 (((353899.2 -595411.8, 353870.5 -595459.9, 353828.8 -5955~
## 2 7958 32205060 (((327468.5 -590381.1, 327493.7 -590426.7, 327529.6 -5904~
## 3 2667 10793043 (((348727.6 -591846.7, 348737.2 -591859, 348815.2 -591892~
## 4 1058 4281667 (((341736 -585676.8, 341686.2 -585663.4, 341633.3 -585637~
## 5 1013 4100897 (((297817.5 -562444.4, 297829.2 -562441.4, 297934.3 -5625~
## 6 1525 6170264 (((264767.9 -551584.8, 264774.3 -551588.6, 264791.1 -5515~
Whoops. We left the units in m^2. To easily convert to acres I like the function measurements::conv_unit
.
fire_proj <-
fire_proj %>%
mutate(area=as.numeric(st_area(fire_proj))) %>%
mutate(area=conv_unit(area,from="m2",to="acre"))
fire_proj %>%
select(Acres,area) %>%
head()
## Simple feature collection with 6 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 262641.8 ymin: -600372 xmax: 355406.3 ymax: -551166.1
## epsg (SRID): 3310
## proj4string: +proj=aea +lat_1=34 +lat_2=40.5 +lat_0=0 +lon_0=-120 +x_0=0 +y_0=-4000000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## # A tibble: 6 x 3
## Acres area geometry
## <dbl> <dbl> <MULTIPOLYGON [m]>
## 1 5302 5302. (((353899.2 -595411.8, 353870.5 -595459.9, 353828.8 -595511.~
## 2 7958 7958. (((327468.5 -590381.1, 327493.7 -590426.7, 327529.6 -590463.~
## 3 2667 2667. (((348727.6 -591846.7, 348737.2 -591859, 348815.2 -591892.2,~
## 4 1058 1058. (((341736 -585676.8, 341686.2 -585663.4, 341633.3 -585637.9,~
## 5 1013 1013. (((297817.5 -562444.4, 297829.2 -562441.4, 297934.3 -562536.~
## 6 1525 1525. (((264767.9 -551584.8, 264774.3 -551588.6, 264791.1 -551572.~
This is a very close match.
Sometimes we have spatial information in a non-spatial data format. We will import some data on federally managed campgrounds from Recreation.gov.
## # A tibble: 6 x 14
## facility_id facility_name parent_location~ parent_location agency_name
## <dbl> <chr> <dbl> <chr> <chr>
## 1 70003 OHARA 72719 MOOSE CREEK RD~ USFS
## 2 70005 MIDDLE FORK 74685 POWDER RIVER R~ USFS
## 3 70006 SOUTH FORK (~ 74685 POWDER RIVER R~ USFS
## 4 70008 RANGER CREEK 74687 PAINTROCK RD -~ USFS
## 5 70009 COOK LAKE RE~ 74689 BEARLODGE RD -~ USFS
## 6 70010 REUTER CAMPG~ 74689 BEARLODGE RD -~ USFS
## # ... with 9 more variables: facility_state <chr>, state_code <chr>,
## # facility_latitude <dbl>, facility_longitude <dbl>,
## # brochure_info_description <chr>, brochure_info_geography <chr>,
## # brochure_info_recreation <chr>, brochure_info_facilities <chr>,
## # brochure_info_nearby_attractions <chr>
I suggest exploring that data a bit. The brochure descriptions are fun. Of course, the data is not yet in a spatial form.
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
In fact, the lat/long coordinates are a numeric class:
## [1] "numeric"
To coerce to a spatial format, we can use the function sf::st_as_sf
.
campground_sf <-
st_as_sf(
campgrounds,
coords = c("facility_longitude","facility_latitude"),
crs=4269, # to maintain consistency with our fire data (NAD83)
remove=F # else it removes the original lat-long columns
)
We can now view the campgrounds on the map: