pacs <- c("knitr"
            , "envClean"
            , "envReport"
            , "envFunc", "fs", "purrr"
            , "dplyr", "sf", "tibble"
            , "tmap", "raster", "rstanarm"
            )

purrr::walk(pacs
            , ~suppressPackageStartupMessages(library(.
                                                      , character.only = TRUE
                                                      , quietly = TRUE)
                                              )
            )

# Load data
flor_all <- tibble::as_tibble(envClean::flor_all)

# What crs to use for maps?
use_crs <- 3577 # actually an epsg code. see epsg.io

# set area of interest coordinate reference system
aoi <- envClean::aoi %>%
  sf::st_transform(crs = use_crs)
  

Installation

envClean is not on CRAN.

Install the development version from GitHub:

remotes::install_github("acanthiza/envClean")

Load envClean

library("envClean")

Suggested workflow

After many, many iterations, the following workflow has been found to be ok. Only ok. There is no awesome when cleaning large, unstructured data.

Suggested steps in the cleaning process

| clean | desc | order |
| --- | --- | --- |
| all | starting point | 0 |
| tem_range | within a date range | 1 |
| tem_bin | assigning dates to temporal bins | 2 |
| tem_rel | temporal reliability | 3 |
| geo_range | within a geographic area | 4 |
| geo_bin | assigning locations to spatial bins | 5 |
| geo_rel | spatial reliability | 6 |
| context | define context | 7 |
| att | select attributes | 8 |
| geo | add geographic context (e.g. IBRA) | 9 |
| ann | non-persistent taxa | 10 |
| taxa | align taxonomy and resolve any taxonomic duplication within bins | 11 |
| single | singletons | 12 |
| out | outliers | 13 |
| NA | NA values in important columns | 14 |
| effort | context effort | 15 |
| prop | proportion of sites | 16 |
| life | as a byproduct of assigning all records of a taxa a lifeform | 17 |
| cov | as a byproduct of assigning all records of a taxa a cover value | 18 |
| recent | the most recent visit to a cell | 19 |
| lists | add list length | 20 |
| filt_list_df | filter occurrence data to a set of criteria | 21 |
| fst | fix spatial taxonomy | 22 |
| fbd | filter by distribution | 23 |
| pres | presences only | 24 |
| coord | centroids of state, capital and institutions | 25 |
| ind | indigenous species | 26 |
| rm | geographic reliability | 27 |
| include | taxa with presences, reliable distributions, and/or mcp around presences | 28 |
| bin | reduce to distinct rows | 29 |

Key concepts

Filter/clean/tidy

envClean helps with implementing:

  • filtering: remove rows of a data frame. These may be entirely legitimate observations, but it is desirable to remove them for the purposes of a downstream analysis. For example, a context with only one (legitimate) record may not meet the expectation of an analysis that each context contains a list of recorded taxa.
  • cleaning: remove observations to reduce the risk that spurious observations are included in downstream analysis. For example, two different data sources may contain the same observation. Most analyses will perform better when records duplicated within a context are removed.
  • tidying: as per tidy data (Wickham 2014) where each variable is a column and each observation is a unique row.

In practice these tasks are often blurred within each of the functions.

In general the process will be referred to as cleaning.
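
The difference between cleaning and filtering described above can be illustrated with a small sketch. This uses dplyr directly rather than any envClean function, and the data, column names and the `context` vector are hypothetical.

```r
library(dplyr)

# Hypothetical records from two data sources ("a" and "b").
# The first two rows are the same observation reported by both sources.
records <- tibble::tibble(
  cell   = c(1, 1, 1, 2),
  year   = c(2020, 2020, 2020, 2021),
  taxa   = c("sp_a", "sp_a", "sp_b", "sp_b"),
  source = c("a", "b", "a", "a")
)

# Assumed context for this sketch
context <- c("cell", "year")

# Cleaning: remove the observation duplicated across data sources
cleaned <- records %>%
  dplyr::distinct(dplyr::across(dplyr::all_of(c(context, "taxa"))))

# Filtering: remove contexts with a single (legitimate) record, because a
# downstream analysis might expect a list of taxa within each context
filtered <- cleaned %>%
  dplyr::group_by(dplyr::across(dplyr::all_of(context))) %>%
  dplyr::filter(dplyr::n() > 1) %>%
  dplyr::ungroup()
```

Here `cleaned` drops a spurious duplicate, while `filtered` drops a valid record that simply does not suit the analysis.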

Bins (for sites, visits, records, taxa)

Due to the loose definition of bins (see below), the definitions of site, visit, record and taxa can change through the cleaning process.

  • sites are spatial locations. They may be defined by latitude, longitude, easting, northing and/or cell. These may be duplicated before exclusive application of context. They are not necessarily defined by all spatial concepts within context at all stages of the cleaning process. In the env packages, spatial bins are usually set by add_raster_cell.
  • visits are sites plus a time, such as year, month, day (or even hour). Again, until context is applied exclusively, these may be duplicated. In the env packages, temporal bins are usually year, month or, occasionally, day.
  • records are visits plus an observation to some level of the taxonomic hierarchy (referred to simply as 'taxa').
  • taxa refers to some form of taxonomic entity. An entity may be duplicated within a visit before taxonomy is resolved and context is applied exclusively. In the env packages, taxonomic bins are usually set by make_taxonomy(target_rank = "desired_rank") where "desired_rank" could be, say, "species" or "subspecies".
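
As a rough illustration of how sites, visits and records nest, the sketch below counts distinct combinations of progressively larger sets of columns. The data, the column names and the helper `n_distinct_rows()` are hypothetical and are not part of envClean.

```r
library(dplyr)

# Hypothetical observations, already assigned to spatial (cell) and
# temporal (year) bins. The last two rows are duplicates.
obs <- tibble::tibble(
  cell = c(10, 10, 10, 11, 11),
  year = c(2019, 2019, 2020, 2020, 2020),
  taxa = c("sp_a", "sp_b", "sp_a", "sp_a", "sp_a")
)

site_cols   <- "cell"                 # sites: spatial bin only
visit_cols  <- c(site_cols, "year")   # visits: site plus time
record_cols <- c(visit_cols, "taxa")  # records: visit plus taxa

# Helper (hypothetical): number of distinct combinations of `cols`
n_distinct_rows <- function(df, cols) {
  nrow(dplyr::distinct(df, dplyr::across(dplyr::all_of(cols))))
}

n_sites   <- n_distinct_rows(obs, site_cols)
n_visits  <- n_distinct_rows(obs, visit_cols)
n_records <- n_distinct_rows(obs, record_cols)
```

In this example `obs` has five rows but only four distinct records, which is the kind of duplication that persists until context is applied exclusively.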

Throughout the series of env packages the concept of context is used extensively and, at least currently, somewhat loosely. Context supplies the spatial, temporal and taxonomic bins.

With respect to 'loosely': context may be defined by, say, c("lat", "long", "cell", "year", "month"). At various stages of the cleaning process, not all of those variables will be applicable. For example, after running add_raster_cell (to assign a spatial bin), the variables lat and long may be removed (depending on the add_xy argument). However, context can still be used in full in later cleaning steps because envClean functions consistently use tidyselect::any_of, which ignores columns that are absent.
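
A minimal sketch of why any_of makes a 'loose' context workable, assuming a hypothetical data frame from which lat and long have already been dropped. It uses dplyr::select only for demonstration and does not reproduce the internals of any envClean function.

```r
library(dplyr)

# Full context, as defined at the start of a (hypothetical) workflow
context <- c("lat", "long", "cell", "year", "month")

# Hypothetical data after spatial binning: lat and long are gone
binned <- tibble::tibble(
  cell  = c(1, 1, 2),
  year  = c(2020, 2020, 2021),
  month = c(1, 1, 6),
  taxa  = c("sp_a", "sp_b", "sp_a")
)

# any_of() keeps whichever context columns are present and ignores the rest
loose <- dplyr::select(binned, dplyr::any_of(context))

# all_of() is strict: it errors because lat and long are missing
strict <- try(dplyr::select(binned, dplyr::all_of(context)), silent = TRUE)
strict_failed <- inherits(strict, "try-error")
```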

Note that context must be applied exclusively at some point in the cleaning process (by, say, dplyr::distinct(across(any_of(context)))). Until that point, extraneous columns beyond context are retained, and no claim is made regarding the uniqueness of 'records'.
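
The effect of applying context exclusively can be sketched as follows. The object names, columns and data are hypothetical. The only call taken from the text above is dplyr::distinct with across(any_of(context)).

```r
library(dplyr)

# Assumed context for this sketch, including the taxonomic bin
context <- c("cell", "year", "taxa")

# Hypothetical records that still carry a column beyond context
bio_taxa <- tibble::tibble(
  cell     = c(1, 1, 2),
  year     = c(2020, 2020, 2021),
  taxa     = c("sp_a", "sp_a", "sp_b"),
  observer = c("x", "y", "x")
)

# Before: context combinations are not yet unique
dups_before <- sum(duplicated(bio_taxa[, context]))

# Apply context exclusively: only context columns remain, one row each
bio_context <- bio_taxa %>%
  dplyr::distinct(dplyr::across(dplyr::any_of(context)))

# After: every row is a unique record
dups_after <- sum(duplicated(bio_context))
```

Note that the observer column is dropped at this step, so any attributes needed later must be selected or summarised before context is applied exclusively.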

Summarising the cleaning process

There are some cleaning process summary functions. Taking advantage of these requires:

  • consistent naming with a prefix, default bio_
    • the suffix is a short name for that step in the cleaning process, e.g. bio_taxa would be the object created when applying the taxonomic bins, and bio_geo_bin the object created when applying geographic bins
    • see envClean::luclean for the suffixes/short names (in the clean column)
  • addition of a ctime (creation time) attribute, probably using envFunc::add_time_stamp()
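
The naming and time-stamp conventions above can be sketched in base R. This is only an illustration of the idea: `stamp()` is a hypothetical stand-in, not envFunc::add_time_stamp(), and the summary built here is not the output of clean_summary().

```r
# Hypothetical stand-in for a time-stamping helper: adds a ctime attribute
stamp <- function(x) {
  attr(x, "ctime") <- Sys.time()
  x
}

# Keep the step objects in their own environment for this sketch
clean_env <- new.env()

# Objects named with the default prefix (bio_) plus a step suffix
clean_env$bio_all  <- stamp(data.frame(cell = c(1, 1, 2),
                                       taxa = c("a", "a", "b")))
clean_env$bio_taxa <- stamp(unique(clean_env$bio_all))

# Find the step objects by prefix and recover the step short names
step_names <- ls(clean_env, pattern = "^bio_")

step_summary <- data.frame(
  clean   = sub("^bio_", "", step_names),
  records = vapply(step_names,
                   function(nm) nrow(get(nm, envir = clean_env)),
                   integer(1)),
  ctime   = do.call(c, lapply(step_names,
                              function(nm) attr(get(nm, envir = clean_env),
                                                "ctime"))),
  row.names = NULL
)

# Order the steps by creation time
step_summary <- step_summary[order(step_summary$ctime), ]
```

With a consistent prefix and a creation time on each object, the order of the cleaning steps and the number of records remaining after each can be recovered without any extra bookkeeping.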

The function clean_summary() then prepares information, based on the objects created through the cleaning process, that can be used in summary reports. clean_summary() also optionally (default TRUE) saves the start and end objects from the cleaning process.

cleaning_text() prepares text, based on a cleaning summary, that can be used directly in .Rmd.

There are also small .Rmd files in /inst that match the suffix for each step. Looping through these child files from a main .Rmd provides the structure for the output report.

Coordinate reference systems

There are two (possibly three) main coordinate reference systems (crs) to worry about:

  1. the crs for the original records. If these are in decimal degrees, using epsg = 4283 is likely to return the correct crs.
  2. the crs you’d like to use for most spatial data. Set here (in setup chunk) to use_crs = 3577. It is likely that a projected crs will work best, particularly for buffering, filtering etc.
  3. the crs for any other spatial data imported to help with cleaning. Try using sf::st_read("random_shape_file.shp") %>% sf::st_transform(crs = use_crs) to deal with this.
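
A minimal sketch of moving records from the first crs to the second, assuming the original coordinates are decimal degrees in columns named long and lat (hypothetical names and values). The 1000 m buffer is arbitrary and only shows why a projected crs is convenient.

```r
library(sf)

use_crs <- 3577 # projected crs used for most spatial work

# Hypothetical records in decimal degrees
records <- data.frame(
  long = c(138.60, 140.78),
  lat  = c(-34.93, -37.83)
)

# 1. Declare the crs of the original records (epsg 4283)
records_sf <- sf::st_as_sf(records, coords = c("long", "lat"), crs = 4283)

# 2. Transform to the crs used for most spatial data
records_proj <- sf::st_transform(records_sf, crs = use_crs)

# In a projected crs with metre units, buffer distances are in metres
records_buf <- sf::st_buffer(records_proj, dist = 1000)
```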

Import

Querying and uniting disparate data sources into a single data set is a challenge in its own right. See envImport for tools to assist there. Once you’ve imported and combined all your data, read on.

References

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.