vignettes/010_overview.Rmd
```r
pacs <- c("knitr"
          , "envClean"
          , "envReport"
          , "envFunc", "fs", "purrr"
          , "dplyr", "sf", "tibble"
          , "tmap", "raster", "rstanarm"
          )

purrr::walk(pacs
            , ~ suppressPackageStartupMessages(library(.x
                                                       , character.only = TRUE
                                                       , quietly = TRUE
                                                       )
                                               )
            )
```
```r
# Load data
flor_all <- tibble::as_tibble(envClean::flor_all)

# What crs to use for maps? (an epsg code - see epsg.io)
use_crs <- 3577

# set area of interest to the chosen coordinate reference system
aoi <- envClean::aoi %>%
  sf::st_transform(crs = use_crs)
```
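As a quick sanity check of the setup (this map is not part of the original workflow, just an illustration), the transformed area of interest can be plotted:

```r
# quick look at the area of interest after transforming to use_crs
tmap::tm_shape(aoi) +
  tmap::tm_polygons()
```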
`envClean` is not on CRAN. Install the development version from GitHub:

```r
remotes::install_github("acanthiza/envClean")
```

Then load `envClean` (it is included in the packages loaded in the setup chunk above).
After many, many iterations, the following workflow has been found to be ok. Only ok. There is no awesome when cleaning large, unstructured data.
clean | desc | order |
---|---|---|
all | starting point | 0 |
tem_range | within a date range | 1 |
tem_bin | assigning dates to temporal bins | 2 |
tem_rel | temporal reliability | 3 |
geo_range | within a geographic area | 4 |
geo_bin | assigning locations to spatial bins | 5 |
geo_rel | spatial reliability | 6 |
context | define context | 7 |
att | select attributes | 8 |
geo | add geographic context (e.g. IBRA) | 9 |
ann | non-persistent taxa | 10 |
taxa | align taxonomy and resolve any taxonomic duplication within bins | 11 |
single | singletons | 12 |
out | outliers | 13 |
NA | NA values in important columns | 14 |
effort | context effort | 15 |
prop | proportion of sites | 16 |
life | as a byproduct of assigning all records of a taxon a lifeform | 17 |
cov | as a byproduct of assigning all records of a taxon a cover value | 18 |
recent | the most recent visit to a cell | 19 |
lists | add list length | 20 |
filt_list_df | filter occurrence data to a set of criteria | 21 |
fst | fix spatial taxonomy | 22 |
fbd | filter by distribution | 23 |
pres | presences only | 24 |
coord | centroids of state, capital and institutions | 25 |
ind | indigenous species | 26 |
rm | geographic reliability | 27 |
include | taxa with presences, reliable distributions, and/or mcp around presences | 28 |
bin | reduce to distinct rows | 29 |
`envClean` helps with implementing these steps. In practice the tasks are often blurred within each of the functions. In general, the whole process is referred to as cleaning.
Due to the loose definition of bins (see below), the definitions of site, visit, record and taxa can change through the cleaning process.
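As a rough illustration of one of these steps, here is a minimal sketch of 'geo_range' (within a geographic area) using plain `sf` rather than the `envClean` helpers. It assumes `flor_all` has `lat` and `long` columns recorded in EPSG:4326; the `bio_` prefix anticipates the naming convention described under the summary functions below.

```r
# minimal sketch of the 'geo_range' step using sf directly
# (assumes flor_all has lat/long columns in EPSG:4326)
bio_geo_range <- flor_all %>%
  sf::st_as_sf(coords = c("long", "lat"), crs = 4326, remove = FALSE) %>%
  sf::st_transform(crs = use_crs) %>%
  sf::st_filter(aoi) %>%        # keep records that fall within the area of interest
  sf::st_drop_geometry() %>%
  tibble::as_tibble()
```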
Throughout the series of `env` packages the concept of context is used extensively and, at least currently, somewhat loosely. Context supplies the bins: spatial, temporal and taxonomic.

* spatial bins are usually set by `add_raster_cell()`
* temporal bins are usually year, month or, occasionally, day
* taxonomic bins are usually set by `make_taxonomy(target_rank = "desired rank")`, where "desired rank" could be, say, "species" or, say, "subspecies"
With respect to ‘loosely’: context may be defined by, say, `c("lat", "long", "cell", "year", "month")`. At various stages through the cleaning process not every one of those variables may be applicable. After running `add_raster_cell()` (to assign a spatial bin) the variables `lat` and `long` may be removed (depending on the `add_xy` argument). However, `context` can still be used in full in cleaning steps (via the consistent use of `tidyselect::any_of()` in `envClean` functions).
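A minimal sketch of why `any_of()` helps here (the `records` tibble below is hypothetical; only `context` and `tidyselect::any_of()` come from the text above):

```r
context <- c("lat", "long", "cell", "year", "month")

# hypothetical records in which lat/long have already been replaced by a cell
records <- tibble::tibble(cell = c(1L, 1L, 2L)
                          , year = c(2020L, 2020L, 2021L)
                          , month = c(1L, 1L, 3L)
                          , taxa = c("a", "a", "b")
                          )

# any_of() silently skips the context columns that are no longer present,
# so the same context vector keeps working after columns are dropped
records %>%
  dplyr::select(tidyselect::any_of(context))
```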
Note that `context` must be applied exclusively at some point in the cleaning process (by, say, `dplyr::distinct(across(any_of(context)))`). Until that point extraneous fields/columns beyond `context` are maintained, and no claim is made regarding the uniqueness of ‘records’ until this step in the process.
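Continuing the hypothetical `records` from above, applying the context exclusively might look like:

```r
# reduce to one row per combination of the available context columns;
# columns outside the context (here, taxa) are dropped and rows become unique
records %>%
  dplyr::distinct(dplyr::across(tidyselect::any_of(context)))
```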
There are some cleaning process summary functions. Taking advantage of these requires:

* naming the object created at each step with the prefix `bio_` plus the relevant step suffix (e.g. `bio_taxa` would be the object created when applying the taxonomic bins, and `bio_geo_bin` is the object created when applying geographic bins; the suffixes are the `clean` column in the table above)

The function `clean_summary()` then prepares information, based on the objects created through the cleaning process, that can be used in summary reports. `clean_summary()` also, optionally (default `TRUE`), saves the start and end objects from the cleaning process.
`cleaning_text()` prepares text, based on a cleaning summary, that can be used directly in .Rmd.
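A hypothetical sketch of the naming convention those summary functions rely on. The filtering calls are placeholders, not the real `envClean` steps, and the `lat` and `year` columns are assumed for illustration only:

```r
# placeholders only: each object is named 'bio_' + the step suffix
# from the 'clean' column of the table above
bio_all <- flor_all                     # 'all': starting point

bio_tem_range <- bio_all %>%            # 'tem_range': within a date range
  dplyr::filter(year >= 2000)           # placeholder for the real date filter

bio_geo_range <- bio_tem_range %>%      # 'geo_range': within a geographic area
  dplyr::filter(!is.na(lat))            # placeholder for the real spatial filter
```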
There are also small .Rmd files in `inst/` that match the suffix for each step. Looping through these child files from a main .Rmd provides the structure for the output report.
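A minimal sketch of that looping pattern (the step names and file paths here are hypothetical):

```r
# inside a chunk of the main .Rmd with the chunk option results = "asis"
steps <- c("tem_range", "geo_bin", "taxa")   # hypothetical subset of step suffixes

child_text <- purrr::map_chr(steps
                             , ~ knitr::knit_child(fs::path("inst", paste0(.x, ".Rmd"))
                                                   , quiet = TRUE
                                                   )
                             )

cat(child_text, sep = "\n\n")
```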
There are two (possibly three) main coordinate reference systems (crs) to worry about:

* the crs used for analysis, set above as `use_crs = 3577`. It is likely that a projected crs will work best, particularly for buffering, filtering etc.
* the crs of any other spatial data brought in, which may differ; use, say, `sf::st_read("random_shape_file.shp") %>% sf::st_transform(crs = use_crs)` to deal with this.

Querying and uniting disparate data sources into a single data set is a challenge in its own right. See `envImport` for tools to assist there. Once you’ve imported and combined all your data, read on.