ropensci/scrubr

Name: scrubr

Owner: rOpenSci

Description: Clean species occurrence records

Created: 2015-09-16 01:25:34.0

Updated: 2018-01-03 10:06:47.0

Pushed: 2018-01-05 19:52:49.0

Homepage:

Size: 257

Language: R

README

scrubr


Clean Biological Occurrence Records

Clean using the following use cases (checkmarks indicate that functions exist, not necessarily that they are complete):

A note about examples: We think that a piping workflow with %>% makes code easier to build up and easier to understand. However, some examples are also shown without the pipe to demonstrate traditional usage.

Install

Stable CRAN version

install.packages("scrubr")

Development version

devtools::install_github("ropensci/scrubr")

library("scrubr")

Coordinate based cleaning

data("sampledata1")

Remove impossible coordinates (using sample data included in the pkg)

# coord_impossible(dframe(sample_data_1)) # w/o pipe
dframe(sample_data_1) %>% coord_impossible()
#> <scrubr dframe>
#> Size: 1500 X 5
#> Lat/Lon vars: latitude/longitude
#>
#>                name  longitude latitude                date        key
#>               (chr)      (dbl)    (dbl)              (time)      (int)
#> 1  Ursus americanus  -79.68283 38.36662 2015-01-14 16:36:45 1065590124
#> 2  Ursus americanus  -82.42028 35.73304 2015-01-13 00:25:39 1065588899
#> 3  Ursus americanus  -99.09625 23.66893 2015-02-20 23:00:00 1098894889
#> 4  Ursus americanus  -72.77432 43.94883 2015-02-13 16:16:41 1065611122
#> 5  Ursus americanus  -72.34617 43.86464 2015-03-01 20:20:45 1088908315
#> 6  Ursus americanus -108.53674 32.65219 2015-03-29 17:06:54 1088932238
#> 7  Ursus americanus -108.53691 32.65237 2015-03-29 17:12:50 1088932273
#> 8  Ursus americanus -123.82900 40.13240 2015-03-28 23:00:00 1132403409
#> 9  Ursus americanus  -78.25027 36.93018 2015-03-20 21:11:24 1088923534
#> 10 Ursus americanus  -76.78671 35.53079 2015-04-05 23:00:00 1088954559
#> ..              ...        ...      ...                 ...        ...
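
To make the idea of "impossible" coordinates concrete, here is a minimal base R sketch. It assumes that "impossible" means a latitude outside [-90, 90] or a longitude outside [-180, 180]; the toy data frame `occ` is made up for illustration, and this is not necessarily how coord_impossible() is implemented.

```r
# Hypothetical toy data (not the package's sample_data_1), with no missing values
occ <- data.frame(
  name = c("a", "b", "c", "d"),
  longitude = c(-79.7, 200.5, -82.4, 10.0),
  latitude = c(38.4, 35.7, -95.0, 45.0),
  stringsAsFactors = FALSE
)

# Assumed definition: a coordinate is impossible if it falls outside the valid ranges
is_impossible <- abs(occ$latitude) > 90 | abs(occ$longitude) > 180

# Keep only rows with possible coordinates (rows "a" and "d" here)
kept <- occ[!is_impossible, ]
nrow(kept)
```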

Remove incomplete coordinates

# coord_incomplete(dframe(sample_data_1)) # w/o pipe
dframe(sample_data_1) %>% coord_incomplete()
#> <scrubr dframe>
#> Size: 1306 X 5
#> Lat/Lon vars: latitude/longitude
#>
#>                name  longitude latitude                date        key
#>               (chr)      (dbl)    (dbl)              (time)      (int)
#> 1  Ursus americanus  -79.68283 38.36662 2015-01-14 16:36:45 1065590124
#> 2  Ursus americanus  -82.42028 35.73304 2015-01-13 00:25:39 1065588899
#> 3  Ursus americanus  -99.09625 23.66893 2015-02-20 23:00:00 1098894889
#> 4  Ursus americanus  -72.77432 43.94883 2015-02-13 16:16:41 1065611122
#> 5  Ursus americanus  -72.34617 43.86464 2015-03-01 20:20:45 1088908315
#> 6  Ursus americanus -108.53674 32.65219 2015-03-29 17:06:54 1088932238
#> 7  Ursus americanus -108.53691 32.65237 2015-03-29 17:12:50 1088932273
#> 8  Ursus americanus -123.82900 40.13240 2015-03-28 23:00:00 1132403409
#> 9  Ursus americanus  -78.25027 36.93018 2015-03-20 21:11:24 1088923534
#> 10 Ursus americanus  -76.78671 35.53079 2015-04-05 23:00:00 1088954559
#> ..              ...        ...      ...                 ...        ...
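
As a rough illustration of the same idea in base R, the sketch below treats a record as incomplete when either its latitude or its longitude is NA. That definition and the toy data frame `occ` are assumptions for the example, not a description of the internals of coord_incomplete().

```r
# Hypothetical toy data with some missing coordinates
occ <- data.frame(
  name = c("a", "b", "c", "d"),
  longitude = c(-79.7, NA, -82.4, 10.0),
  latitude = c(38.4, 35.7, NA, 45.0),
  stringsAsFactors = FALSE
)

# complete.cases() is TRUE only for rows where both coordinate columns are present
has_both <- complete.cases(occ[, c("longitude", "latitude")])

# Keep only rows with both a longitude and a latitude
complete <- occ[has_both, ]
nrow(complete)
```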

Remove unlikely coordinates (e.g., those at 0,0)

# coord_unlikely(dframe(sample_data_1)) # w/o pipe
dframe(sample_data_1) %>% coord_unlikely()
#> <scrubr dframe>
#> Size: 1488 X 5
#> Lat/Lon vars: latitude/longitude
#>
#>                name  longitude latitude                date        key
#>               (chr)      (dbl)    (dbl)              (time)      (int)
#> 1  Ursus americanus  -79.68283 38.36662 2015-01-14 16:36:45 1065590124
#> 2  Ursus americanus  -82.42028 35.73304 2015-01-13 00:25:39 1065588899
#> 3  Ursus americanus  -99.09625 23.66893 2015-02-20 23:00:00 1098894889
#> 4  Ursus americanus  -72.77432 43.94883 2015-02-13 16:16:41 1065611122
#> 5  Ursus americanus  -72.34617 43.86464 2015-03-01 20:20:45 1088908315
#> 6  Ursus americanus -108.53674 32.65219 2015-03-29 17:06:54 1088932238
#> 7  Ursus americanus -108.53691 32.65237 2015-03-29 17:12:50 1088932273
#> 8  Ursus americanus -123.82900 40.13240 2015-03-28 23:00:00 1132403409
#> 9  Ursus americanus  -78.25027 36.93018 2015-03-20 21:11:24 1088923534
#> 10 Ursus americanus  -76.78671 35.53079 2015-04-05 23:00:00 1088954559
#> ..              ...        ...      ...                 ...        ...
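
The 0,0 case mentioned above can be sketched in base R as follows. The sketch only covers that single case and assumes a record is "unlikely" when both coordinates are exactly zero; coord_unlikely() may check more than this. The toy data frame `occ` is hypothetical.

```r
# Hypothetical toy data: one record sits at 0,0, another has only longitude 0
occ <- data.frame(
  name = c("a", "b", "c", "d"),
  longitude = c(-79.7, 0, 0, 10.0),
  latitude = c(38.4, 0, 51.5, 45.0),
  stringsAsFactors = FALSE
)

# Flag a record only when BOTH coordinates are zero;
# a longitude of 0 on its own (row "c") is a perfectly valid place
is_zero_zero <- occ$latitude == 0 & occ$longitude == 0

likely <- occ[!is_zero_zero, ]
nrow(likely)
```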

Do all three

dframe(sample_data_1) %>%
  coord_impossible() %>%
  coord_incomplete() %>%
  coord_unlikely()
#> <scrubr dframe>
#> Size: 1294 X 5
#> Lat/Lon vars: latitude/longitude
#>
#>                name  longitude latitude                date        key
#>               (chr)      (dbl)    (dbl)              (time)      (int)
#> 1  Ursus americanus  -79.68283 38.36662 2015-01-14 16:36:45 1065590124
#> 2  Ursus americanus  -82.42028 35.73304 2015-01-13 00:25:39 1065588899
#> 3  Ursus americanus  -99.09625 23.66893 2015-02-20 23:00:00 1098894889
#> 4  Ursus americanus  -72.77432 43.94883 2015-02-13 16:16:41 1065611122
#> 5  Ursus americanus  -72.34617 43.86464 2015-03-01 20:20:45 1088908315
#> 6  Ursus americanus -108.53674 32.65219 2015-03-29 17:06:54 1088932238
#> 7  Ursus americanus -108.53691 32.65237 2015-03-29 17:12:50 1088932273
#> 8  Ursus americanus -123.82900 40.13240 2015-03-28 23:00:00 1132403409
#> 9  Ursus americanus  -78.25027 36.93018 2015-03-20 21:11:24 1088923534
#> 10 Ursus americanus  -76.78671 35.53079 2015-04-05 23:00:00 1088954559
#> ..              ...        ...      ...                 ...        ...

Don't drop bad data

dframe(sample_data_1) %>% coord_incomplete(drop = TRUE) %>% NROW
#> [1] 1306
dframe(sample_data_1) %>% coord_incomplete(drop = FALSE) %>% NROW
#> [1] 1500
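
One way to think about a drop switch like this is a filter that always records which rows it considers bad, and only removes them when asked. The base R sketch below follows that design; the helper name `filter_incomplete` and the attribute name "flagged" are hypothetical and are not part of scrubr.

```r
# Hypothetical helper: flags rows with missing coordinates and optionally drops them
filter_incomplete <- function(df, drop = TRUE) {
  bad <- !complete.cases(df[, c("longitude", "latitude")])
  out <- if (drop) df[!bad, , drop = FALSE] else df
  # Store the flagged rows as an attribute so they stay inspectable either way
  attr(out, "flagged") <- df[bad, , drop = FALSE]
  out
}

# Hypothetical toy data with two incomplete records
occ <- data.frame(
  name = c("a", "b", "c", "d"),
  longitude = c(-79.7, NA, -82.4, 10.0),
  latitude = c(38.4, 35.7, NA, 45.0),
  stringsAsFactors = FALSE
)

NROW(filter_incomplete(occ, drop = TRUE))   # bad rows removed
NROW(filter_incomplete(occ, drop = FALSE))  # all rows kept
```

In both cases `attr(result, "flagged")` holds the two incomplete rows, so nothing is lost silently.
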

Deduplicate

smalldf <- sample_data_1[1:20, ]
# create a duplicate record
smalldf <- rbind(smalldf, smalldf[10,])
row.names(smalldf) <- NULL
# make it slightly different
smalldf[21, "key"] <- 1088954555
NROW(smalldf)
#> [1] 21
dp <- dframe(smalldf) %>% dedup()
NROW(dp)
#> [1] 20
attr(dp, "dups")
#> <scrubr dframe>
#> Size: 1 X 5
#>
#>
#>               name longitude latitude                date        key
#>              (chr)     (dbl)    (dbl)              (time)      (dbl)
#> 1 Ursus americanus -76.78671 35.53079 2015-04-05 23:00:00 1088954555
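
For comparison, a much simpler form of deduplication can be sketched in base R with duplicated(). This sketch compares rows exactly on a chosen set of columns and ignores `key`, which is an assumption made for illustration; dedup() caught a record whose key differed slightly, so it may well use a different, more tolerant comparison.

```r
# Toy data reusing values from the example above; the third row repeats the
# first one except for a slightly different key
occ <- data.frame(
  name = c("Ursus americanus", "Ursus americanus", "Ursus americanus"),
  longitude = c(-76.78671, -78.25027, -76.78671),
  latitude = c(35.53079, 36.93018, 35.53079),
  key = c(1088954559, 1088923534, 1088954555),
  stringsAsFactors = FALSE
)

# Assumed choice: compare on everything except the key
compare_cols <- c("name", "longitude", "latitude")
is_dup <- duplicated(occ[, compare_cols])

unique_rows <- occ[!is_dup, ]  # first occurrence of each record
dups <- occ[is_dup, ]          # later repeats, kept aside for inspection
NROW(unique_rows)
```
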
Dates

Standardize/convert dates

df <- sample_data_1
# date_standardize(dframe(df), "%d%b%Y") # w/o pipe
dframe(df) %>% date_standardize("%d%b%Y")
#> <scrubr dframe>
#> Size: 1500 X 5
#>
#>
#>                name  longitude latitude      date        key
#>               (chr)      (dbl)    (dbl)     (chr)      (int)
#> 1  Ursus americanus  -79.68283 38.36662 14Jan2015 1065590124
#> 2  Ursus americanus  -82.42028 35.73304 13Jan2015 1065588899
#> 3  Ursus americanus  -99.09625 23.66893 20Feb2015 1098894889
#> 4  Ursus americanus  -72.77432 43.94883 13Feb2015 1065611122
#> 5  Ursus americanus  -72.34617 43.86464 01Mar2015 1088908315
#> 6  Ursus americanus -108.53674 32.65219 29Mar2015 1088932238
#> 7  Ursus americanus -108.53691 32.65237 29Mar2015 1088932273
#> 8  Ursus americanus -123.82900 40.13240 28Mar2015 1132403409
#> 9  Ursus americanus  -78.25027 36.93018 20Mar2015 1088923534
#> 10 Ursus americanus  -76.78671 35.53079 05Apr2015 1088954559
#> ..              ...        ...      ...       ...        ...
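
The format string uses the usual strftime codes (%d day, %b abbreviated month name, %Y four-digit year). A minimal base R sketch of the same conversion is shown here; the helper name `standardize_dates` and the fixed UTC time zone are assumptions for the example, not scrubr's implementation.

```r
# Hypothetical helper: parse date-times and re-emit them in a given format
standardize_dates <- function(x, fmt, tz = "UTC") {
  parsed <- as.POSIXct(x, tz = tz)       # parses "YYYY-MM-DD HH:MM:SS" strings
  format(parsed, format = fmt, tz = tz)  # returns a character vector
}

dates <- c("2015-01-14 16:36:45", "2015-02-20 23:00:00")

# Same format as in the example above; note that %b is locale-dependent
standardize_dates(dates, "%d%b%Y")

# A purely numeric format avoids the locale dependence
standardize_dates(dates, "%Y/%m/%d")
```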

Drop records without dates

NROW(df)
#> [1] 1500
NROW(dframe(df) %>% date_missing())
#> [1] 1498
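
In base R terms, dropping records without dates amounts to a row filter on the date column. The sketch below treats both NA and empty strings as missing, which is an assumption for the example; the toy data frame `occ` is hypothetical and date_missing() may define "missing" differently.

```r
# Hypothetical toy data: one NA date and one empty-string date
occ <- data.frame(
  name = c("a", "b", "c", "d"),
  date = c("2015-01-14", NA, "", "2015-03-01"),
  stringsAsFactors = FALSE
)

# Assumed definition of a usable date: not NA and not an empty string
has_date <- !is.na(occ$date) & nzchar(occ$date)

dated <- occ[has_date, ]
NROW(dated)
```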

Create date field from other fields

dframe(sample_data_2) %>% date_create(year, month, day)
#> <scrubr dframe>
#> Size: 1500 X 8
#>
#>
#>                name  longitude latitude        key  year month   day
#>               (chr)      (dbl)    (dbl)      (int) (chr) (chr) (chr)
#> 1  Ursus americanus  -79.68283 38.36662 1065590124  2015    01    14
#> 2  Ursus americanus  -82.42028 35.73304 1065588899  2015    01    13
#> 3  Ursus americanus  -99.09625 23.66893 1098894889  2015    02    20
#> 4  Ursus americanus  -72.77432 43.94883 1065611122  2015    02    13
#> 5  Ursus americanus  -72.34617 43.86464 1088908315  2015    03    01
#> 6  Ursus americanus -108.53674 32.65219 1088932238  2015    03    29
#> 7  Ursus americanus -108.53691 32.65237 1088932273  2015    03    29
#> 8  Ursus americanus -123.82900 40.13240 1132403409  2015    03    28
#> 9  Ursus americanus  -78.25027 36.93018 1088923534  2015    03    20
#> 10 Ursus americanus  -76.78671 35.53079 1088954559  2015    04    05
#> ..              ...        ...      ...        ...   ...   ...   ...
#> Variables not shown: date (chr).
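
Building a date field from separate year, month and day columns can be sketched in base R by pasting the parts together and validating the result with as.Date(). The toy data frame `parts` is hypothetical, and the ISO "YYYY-MM-DD" output format is an assumption; date_create() may format the new column differently.

```r
# Hypothetical toy data with character year/month/day columns, as in the output above
parts <- data.frame(
  year = c("2015", "2015"),
  month = c("01", "02"),
  day = c("14", "20"),
  stringsAsFactors = FALSE
)

# Join the parts, parse them as a Date to validate, then store as character
iso <- paste(parts$year, parts$month, parts$day, sep = "-")
parts$date <- as.character(as.Date(iso))

parts$date
```
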
Meta



This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.