Overview
Department for Environment and Water
Nigel Willoughby
Wednesday, 04 September, 2024
Source: vignettes/010_overview.Rmd
pacs <- c("knitr"
          , "envClean"
          , "envReport"
          , "envFunc", "fs", "purrr"
          , "dplyr", "sf", "tibble"
          , "tmap", "raster", "rstanarm"
          )
purrr::walk(pacs
            , ~suppressPackageStartupMessages(library(.
                                                      , character.only = TRUE
                                                      , quietly = TRUE)
                                              )
            )
# Load data
flor_all <- tibble::as_tibble(envClean::flor_all)
# What crs to use for maps?
use_crs <- 3577 # actually an epsg code. see epsg.io
# set area of interest coordinate reference system
aoi <- envClean::aoi %>%
  sf::st_transform(crs = use_crs)
Installation
envClean is not on CRAN.
Install the development version from GitHub:
remotes::install_github("acanthiza/envClean")
Load envClean:
library(envClean)
Suggested workflow
After many, many iterations, the following workflow has been found to be ok. Only ok. There is no awesome when cleaning large, unstructured data.
clean | desc | order |
---|---|---|
all | starting point | 0 |
tem_range | within a date range | 1 |
tem_bin | assigning dates to temporal bins | 2 |
tem_rel | temporal reliability | 3 |
geo_range | within a geographic area | 4 |
geo_bin | assigning locations to spatial bins | 5 |
geo_rel | spatial reliability | 6 |
context | define context | 7 |
att | select attributes | 8 |
geo | add geographic context (e.g. IBRA) | 9 |
ann | non-persistent taxa | 10 |
taxa | align taxonomy and resolve any taxonomic duplication within bins | 11 |
single | singletons | 12 |
out | outliers | 13 |
NA | NA values in important columns | 14 |
effort | context effort | 15 |
prop | proportion of sites | 16 |
life | as a byproduct of assigning all records of a taxa a lifeform | 17 |
cov | as a byproduct of assigning all records of a taxa a cover value | 18 |
recent | the most recent visit to a cell | 19 |
lists | add list length | 20 |
filt_list_df | filter occurrence data to a set of criteria | 21 |
fst | fix spatial taxonomy | 22 |
fbd | filter by distribution | 23 |
pres | presences only | 24 |
coord | centroids of state, capital and institutions | 25 |
ind | indigenous species | 26 |
rm | geographic reliability | 27 |
include | taxa with presences, reliable distributions, and/or mcp around presences | 28 |
bin | reduce to distinct rows | 29 |
Key concepts
Filter/clean/tidy
envClean helps with implementing:
- filtering: remove rows of a data frame. These may be entirely legitimate observations but it is desirable to remove them for the purposes of a downstream analysis. For example, a context with only one (legitimate) record may not meet the expectations of an analysis that within each context there is a list of taxa recorded.
- cleaning: remove observations to reduce the risk that spurious observations are included in downstream analysis. For example, two different data sources may contain the same observation. Most analyses will perform better when records duplicated within a context are removed.
- tidying: as per tidy data (Wickham 2014) where each variable is a column and each observation is a unique row.
In practice these tasks are often blurred within each of the functions.
In general the process will be referred to as cleaning.
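As a rough illustration of the difference between cleaning and filtering, the sketch below uses plain dplyr on a toy data frame. The column names (`cell`, `year`, `taxa`, `source`) and the data are hypothetical, and this is not how envClean implements these steps internally.

```r
library(dplyr)

# Toy occurrence data (hypothetical columns, not the envClean schema)
obs <- tibble::tibble(
  cell = c(1, 1, 1, 2),
  year = c(2020, 2020, 2020, 2020),
  taxa = c("Acacia pycnantha", "Acacia pycnantha",
           "Eucalyptus obliqua", "Acacia pycnantha"),
  source = c("survey_a", "survey_b", "survey_a", "survey_a")
)

context <- c("cell", "year")

# Cleaning: the same taxa reported by two sources within one context
# is kept only once
cleaned <- obs %>%
  dplyr::distinct(dplyr::across(tidyselect::all_of(c(context, "taxa"))))

# Filtering: a context with a single (legitimate) record is removed
# because the downstream analysis expects a list of taxa per context
filtered <- cleaned %>%
  dplyr::group_by(dplyr::across(tidyselect::all_of(context))) %>%
  dplyr::filter(dplyr::n() > 1) %>%
  dplyr::ungroup()
```

Here `cleaned` drops the duplicated record from the second source, while `filtered` additionally drops the context in cell 2, even though that record is perfectly valid.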
Bins (for sites, visits, records, taxa)
Due to the loose definition of bins (see below), the definitions of site, visit, record and taxa can change through the cleaning process.
- sites are spatial locations. They may be defined by latitude, longitude, easting, northing and/or cell. These may be duplicated before exclusive application of context. They are not necessarily defined by all spatial concepts within context at all stages of the cleaning process. In the env packages, spatial bins are usually set by add_raster_cell.
- visits are sites plus a time, such as year, month, day (or even hour). Again, until context is applied exclusively, these may be duplicated. In the env packages, temporal bins are usually year, month, or occasionally, day.
- records are visits plus an observation to some level of the taxonomic hierarchy (referred to simply as ‘taxa’).
- taxa refers to some form of taxonomic entity. An entity may be duplicated within a visit before taxonomy is resolved and context is applied exclusively. In the env packages, taxonomic bins are usually set by make_taxonomy(target_rank = "desired rank") where ‘desired rank’ could be, say, ‘species’, or, say, ‘subspecies’.
Throughout the series of env packages the concept of context is used extensively, and at least currently, somewhat loosely. Context supplies the bins: spatial, temporal and taxonomic bins.
With respect to ‘loosely’: context may be defined by, say, c("lat", "long", "cell", "year", "month"). At various stages through the cleaning process not every one of those variables may be applicable. After running add_raster_cell (to assign a spatial bin) the variables lat and long may be removed (depending on the add_xy argument). However, context can still be used in full in cleaning steps (via the consistent use of tidyselect::any_of in envClean functions).
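A minimal sketch of why tidyselect::any_of lets the full context be reused after some of its columns have gone. The data frame `binned` is hypothetical and stands in for the result of a spatial binning step that dropped lat and long; it is not produced by add_raster_cell here.

```r
library(dplyr)

context <- c("lat", "long", "cell", "year", "month")

# Hypothetical data after spatial binning: lat and long no longer exist
binned <- tibble::tibble(
  cell = c(10, 10, 11),
  year = c(2019, 2019, 2020),
  month = c(5, 5, 7),
  taxa = c("a", "b", "a")
)

# any_of() silently skips the context columns that are absent
context_cols <- binned %>%
  dplyr::select(tidyselect::any_of(context))

# all_of() is strict and errors on the missing lat and long columns
strict <- try(
  dplyr::select(binned, tidyselect::all_of(context)),
  silent = TRUE
)
```

`context_cols` keeps only cell, year and month, whereas `strict` holds an error because all_of() requires every named column to be present.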
Note that context must be applied exclusively at some point in the cleaning process (by, say, dplyr::distinct(across(any_of(context)))). Until that point extraneous fields/columns beyond context are maintained; and no claim is made regarding the uniqueness of ‘records’ until this step in the process.
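The following sketch shows one way to apply context exclusively with dplyr. The data frame `recs` and its `observer` column are hypothetical, chosen only to show an extraneous column being dropped at this step.

```r
library(dplyr)

context <- c("lat", "long", "cell", "year", "month")

# Hypothetical records carrying an extraneous column (observer)
recs <- tibble::tibble(
  cell = c(10, 10, 10, 11),
  year = c(2019, 2019, 2019, 2020),
  month = c(5, 5, 5, 7),
  observer = c("x", "y", "x", "z"),
  taxa = c("a", "a", "b", "a")
)

# Visits: one row per unique combination of the available context columns
visits <- recs %>%
  dplyr::distinct(dplyr::across(tidyselect::any_of(context)))

# Records: one row per context plus taxa; observer is no longer kept
records <- recs %>%
  dplyr::distinct(dplyr::across(tidyselect::any_of(c(context, "taxa"))))
```

Only after this step is each row of `records` a unique combination of context and taxa; before it, the two rows for taxa "a" in cell 10 differed only in the extraneous observer column.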
Summarising the cleaning process
There are some cleaning process summary functions. Taking advantage of these requires:
- consistent naming with a prefix, default bio_
  - the suffix is a short name for that step in the cleaning process. e.g. bio_taxa would be the object created when applying the taxonomic bins, and bio_geo_bin is the object created when applying geographic bins
  - see envClean::luclean for the suffixes/short names (in the clean column)
- addition of a ctime (creation time) attribute, probably using envFunc::add_time_stamp()
The function clean_summary() then prepares information, based on the objects created through the cleaning process, that can be used in summary reports. clean_summary() also optionally (default TRUE) saves the start and end objects from the cleaning process.
cleaning_text() prepares text, based on a cleaning summary, that can be used directly in .Rmd.
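To make the naming and time stamp conventions concrete, here is a base R sketch. The helper `add_ctime()` is a hypothetical stand-in written for this example: it assumes that the time stamp is stored as an attribute called "ctime", and it is not the implementation of envFunc::add_time_stamp(). The summary table is likewise only an illustration of what such conventions make possible, not the output of clean_summary().

```r
# Hypothetical stand-in for a time stamp helper: stores the creation
# time in a "ctime" attribute (assumption, not the envFunc code)
add_ctime <- function(x) {
  attr(x, "ctime") <- Sys.time()
  x
}

# Objects from successive cleaning steps, named prefix + suffix
clean_env <- new.env()
clean_env$bio_all <- add_ctime(
  data.frame(taxa = c("a", "a", "b"), cell = c(1, 1, 2))
)
clean_env$bio_geo_bin <- add_ctime(unique(clean_env$bio_all))
clean_env$bio_taxa <- add_ctime(
  clean_env$bio_geo_bin[clean_env$bio_geo_bin$taxa == "a", ]
)

# Find every step by prefix, then summarise rows and creation time
steps <- ls(clean_env, pattern = "^bio_")

summary_tbl <- data.frame(
  clean = sub("^bio_", "", steps),
  rows = vapply(
    steps,
    function(nm) nrow(get(nm, envir = clean_env)),
    integer(1)
  ),
  ctime = vapply(
    steps,
    function(nm) as.numeric(attr(get(nm, envir = clean_env), "ctime")),
    numeric(1)
  )
)

# Order the steps by when each object was created
summary_tbl <- summary_tbl[order(summary_tbl$ctime), ]
```

Because the suffix matches the short names of the cleaning steps, a table like `summary_tbl` could be joined to a lookup of step descriptions to report how many rows each step removed.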
There are also small .Rmd files in /inst that match the suffix for each step. Looping through these child files from a main .Rmd provides the structure for the output report.
Coordinate reference systems
There are two (possibly three) main coordinate reference systems (crs) to worry about:
- the crs for the original records. If these are in decimal degrees, using epsg = 4283 is likely to return the correct crs.
- the crs you’d like to use for most spatial data. Set here (in the setup chunk) to use_crs = 3577. It is likely that a projected crs will work best, particularly for buffering, filtering etc.
- the crs for any other spatial data imported to help with cleaning. Try using sf::st_read("random_shape_file.shp") %>% sf::st_transform(crs = use_crs) to deal with this.
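A short sketch of moving records from a geographic crs to a projected one with sf. The coordinates are made up for illustration, and the 1000 m buffer is an arbitrary choice to show why a projected crs (with units in metres) is convenient.

```r
library(sf)

use_crs <- 3577 # projected crs used for most spatial work

# Made-up records in decimal degrees
recs <- data.frame(
  long = c(138.6, 139.1),
  lat = c(-34.9, -35.2)
)

# Declare the crs of the original records (epsg 4283)
pts <- sf::st_as_sf(recs, coords = c("long", "lat"), crs = 4283)

# Transform to the working crs
pts_proj <- sf::st_transform(pts, crs = use_crs)

# Buffering by a distance in metres is straightforward once projected
buf <- sf::st_buffer(pts_proj, dist = 1000)
```

Note the two distinct operations: st_as_sf(..., crs = 4283) states what crs the coordinates are already in, while st_transform() actually reprojects them.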
Import
Querying and uniting disparate data sources into a single data set is a challenge in its own right. See envImport for tools to assist there. Once you’ve imported and combined all your data, read on.