Name: scrubr
Owner: rOpenSci
Description: Clean species occurrence records
Created: 2015-09-16 01:25:34.0
Updated: 2018-01-03 10:06:47.0
Pushed: 2018-01-05 19:52:49.0
Size: 257
Language: R
Clean Biological Occurrence Records
Clean using the following use cases (checkmarks indicate that functions exist, not necessarily that they are complete):
taxize (one method so far)
A note about examples: We think that a piping workflow with %>% makes code easier to build up and easier to understand. However, some examples are also shown without the pipe to demonstrate traditional usage.
Stable CRAN version
install.packages("scrubr")
Development version
devtools::install_github("ropensci/scrubr")
library("scrubr")
data("sampledata1")
Remove impossible coordinates (using sample data included in the pkg)
# coord_impossible(dframe(sample_data_1)) # w/o pipe
dframe(sample_data_1) %>% coord_impossible()
#> <scrubr dframe>
#> Size: 1500 X 5
#> Lat/Lon vars: latitude/longitude
#>                name  longitude latitude                date        key
#>               (chr)      (dbl)    (dbl)              (time)      (int)
#> 1  Ursus americanus  -79.68283 38.36662 2015-01-14 16:36:45 1065590124
#> 2  Ursus americanus  -82.42028 35.73304 2015-01-13 00:25:39 1065588899
#> 3  Ursus americanus  -99.09625 23.66893 2015-02-20 23:00:00 1098894889
#> 4  Ursus americanus  -72.77432 43.94883 2015-02-13 16:16:41 1065611122
#> 5  Ursus americanus  -72.34617 43.86464 2015-03-01 20:20:45 1088908315
#> 6  Ursus americanus -108.53674 32.65219 2015-03-29 17:06:54 1088932238
#> 7  Ursus americanus -108.53691 32.65237 2015-03-29 17:12:50 1088932273
#> 8  Ursus americanus -123.82900 40.13240 2015-03-28 23:00:00 1132403409
#> 9  Ursus americanus  -78.25027 36.93018 2015-03-20 21:11:24 1088923534
#> 10 Ursus americanus  -76.78671 35.53079 2015-04-05 23:00:00 1088954559
#> ..              ...        ...      ...                 ...        ...
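To make the idea of an "impossible" coordinate concrete, here is a minimal base R sketch. It assumes that impossible means a latitude outside [-90, 90] or a longitude outside [-180, 180], and that rows with missing coordinates are left alone at this step. The helper name drop_impossible and the toy data are hypothetical; this is not scrubr's implementation.

```r
# Hypothetical helper (not part of scrubr): keep rows whose coordinates
# fall inside the valid geographic range. Missing values are kept here.
drop_impossible <- function(x, lat = "latitude", lon = "longitude") {
  ok_lat <- is.na(x[[lat]]) | abs(x[[lat]]) <= 90
  ok_lon <- is.na(x[[lon]]) | abs(x[[lon]]) <= 180
  x[ok_lat & ok_lon, , drop = FALSE]
}

# Toy records: "b" has an out-of-range longitude, "c" an out-of-range latitude
occ <- data.frame(
  name      = c("a", "b", "c", "d"),
  longitude = c(-79.7, 200.0, -72.3, NA),
  latitude  = c(38.4, 35.7, -95.0, 43.9),
  stringsAsFactors = FALSE
)
cleaned <- drop_impossible(occ)
```

In this sketch, rows "b" and "c" are dropped, while row "d" (missing longitude) is retained for a later completeness check.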
Remove incomplete coordinates
# coord_incomplete(dframe(sample_data_1)) # w/o pipe
dframe(sample_data_1) %>% coord_incomplete()
#> <scrubr dframe>
#> Size: 1306 X 5
#> Lat/Lon vars: latitude/longitude
#>                name  longitude latitude                date        key
#>               (chr)      (dbl)    (dbl)              (time)      (int)
#> 1  Ursus americanus  -79.68283 38.36662 2015-01-14 16:36:45 1065590124
#> 2  Ursus americanus  -82.42028 35.73304 2015-01-13 00:25:39 1065588899
#> 3  Ursus americanus  -99.09625 23.66893 2015-02-20 23:00:00 1098894889
#> 4  Ursus americanus  -72.77432 43.94883 2015-02-13 16:16:41 1065611122
#> 5  Ursus americanus  -72.34617 43.86464 2015-03-01 20:20:45 1088908315
#> 6  Ursus americanus -108.53674 32.65219 2015-03-29 17:06:54 1088932238
#> 7  Ursus americanus -108.53691 32.65237 2015-03-29 17:12:50 1088932273
#> 8  Ursus americanus -123.82900 40.13240 2015-03-28 23:00:00 1132403409
#> 9  Ursus americanus  -78.25027 36.93018 2015-03-20 21:11:24 1088923534
#> 10 Ursus americanus  -76.78671 35.53079 2015-04-05 23:00:00 1088954559
#> ..              ...        ...      ...                 ...        ...
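As a rough illustration of the completeness check, the following base R sketch drops any record where either coordinate is missing. The helper name drop_incomplete and the toy data are hypothetical, and the sketch assumes "incomplete" simply means an NA latitude or longitude.

```r
# Hypothetical helper (not part of scrubr): keep only rows where both
# latitude and longitude are present.
drop_incomplete <- function(x, lat = "latitude", lon = "longitude") {
  x[stats::complete.cases(x[[lat]], x[[lon]]), , drop = FALSE]
}

occ <- data.frame(
  name      = c("a", "b", "c"),
  longitude = c(-79.7, NA, -72.3),
  latitude  = c(38.4, 35.7, NA),
  stringsAsFactors = FALSE
)
complete_occ <- drop_incomplete(occ)
```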
Remove unlikely coordinates (e.g., those at 0,0)
# coord_unlikely(dframe(sample_data_1)) # w/o pipe
dframe(sample_data_1) %>% coord_unlikely()
#> <scrubr dframe>
#> Size: 1488 X 5
#> Lat/Lon vars: latitude/longitude
#>                name  longitude latitude                date        key
#>               (chr)      (dbl)    (dbl)              (time)      (int)
#> 1  Ursus americanus  -79.68283 38.36662 2015-01-14 16:36:45 1065590124
#> 2  Ursus americanus  -82.42028 35.73304 2015-01-13 00:25:39 1065588899
#> 3  Ursus americanus  -99.09625 23.66893 2015-02-20 23:00:00 1098894889
#> 4  Ursus americanus  -72.77432 43.94883 2015-02-13 16:16:41 1065611122
#> 5  Ursus americanus  -72.34617 43.86464 2015-03-01 20:20:45 1088908315
#> 6  Ursus americanus -108.53674 32.65219 2015-03-29 17:06:54 1088932238
#> 7  Ursus americanus -108.53691 32.65237 2015-03-29 17:12:50 1088932273
#> 8  Ursus americanus -123.82900 40.13240 2015-03-28 23:00:00 1132403409
#> 9  Ursus americanus  -78.25027 36.93018 2015-03-20 21:11:24 1088923534
#> 10 Ursus americanus  -76.78671 35.53079 2015-04-05 23:00:00 1088954559
#> ..              ...        ...      ...                 ...        ...
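The 0,0 case mentioned above can be sketched in base R as follows. This covers only that one example of an unlikely coordinate; whether scrubr checks further cases is not something this sketch assumes. The helper name drop_unlikely and the toy data are hypothetical.

```r
# Hypothetical helper (not part of scrubr): drop records sitting exactly
# at latitude 0, longitude 0. Rows with missing coordinates are kept.
drop_unlikely <- function(x, lat = "latitude", lon = "longitude") {
  at_origin <- !is.na(x[[lat]]) & !is.na(x[[lon]]) &
    x[[lat]] == 0 & x[[lon]] == 0
  x[!at_origin, , drop = FALSE]
}

occ <- data.frame(
  name      = c("a", "b", "c", "d"),
  longitude = c(-79.7, 0, 0, NA),
  latitude  = c(38.4, 0, 35.7, 0),
  stringsAsFactors = FALSE
)
likely_occ <- drop_unlikely(occ)
```

Only row "b" is removed; a single zero coordinate (row "c") is a legitimate location on the equator or prime meridian.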
Do all three
dframe(sample_data_1) %>%
  coord_impossible() %>%
  coord_incomplete() %>%
  coord_unlikely()
#> <scrubr dframe>
#> Size: 1294 X 5
#> Lat/Lon vars: latitude/longitude
#>                name  longitude latitude                date        key
#>               (chr)      (dbl)    (dbl)              (time)      (int)
#> 1  Ursus americanus  -79.68283 38.36662 2015-01-14 16:36:45 1065590124
#> 2  Ursus americanus  -82.42028 35.73304 2015-01-13 00:25:39 1065588899
#> 3  Ursus americanus  -99.09625 23.66893 2015-02-20 23:00:00 1098894889
#> 4  Ursus americanus  -72.77432 43.94883 2015-02-13 16:16:41 1065611122
#> 5  Ursus americanus  -72.34617 43.86464 2015-03-01 20:20:45 1088908315
#> 6  Ursus americanus -108.53674 32.65219 2015-03-29 17:06:54 1088932238
#> 7  Ursus americanus -108.53691 32.65237 2015-03-29 17:12:50 1088932273
#> 8  Ursus americanus -123.82900 40.13240 2015-03-28 23:00:00 1132403409
#> 9  Ursus americanus  -78.25027 36.93018 2015-03-20 21:11:24 1088923534
#> 10 Ursus americanus  -76.78671 35.53079 2015-04-05 23:00:00 1088954559
#> ..              ...        ...      ...                 ...        ...
Don't drop bad data
dframe(sample_data_1) %>% coord_incomplete(drop = TRUE) %>% NROW
#> [1] 1306
dframe(sample_data_1) %>% coord_incomplete(drop = FALSE) %>% NROW
#> [1] 1500
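One way a drop switch like this could work is sketched below in base R. It assumes the flagged rows are recorded as an attribute on the result whether or not they are dropped; the helper name flag_incomplete and the attribute name "flagged" are hypothetical and are not taken from scrubr.

```r
# Hypothetical helper (not part of scrubr): identify incomplete rows,
# optionally drop them, and attach them to the result either way.
flag_incomplete <- function(x, drop = TRUE) {
  bad <- is.na(x$latitude) | is.na(x$longitude)
  flagged <- x[bad, , drop = FALSE]
  if (drop) x <- x[!bad, , drop = FALSE]
  attr(x, "flagged") <- flagged  # "flagged" is an assumed attribute name
  x
}

occ <- data.frame(
  name      = c("a", "b", "c"),
  longitude = c(-79.7, NA, -72.3),
  latitude  = c(38.4, 35.7, 43.9),
  stringsAsFactors = FALSE
)
dropped <- flag_incomplete(occ, drop = TRUE)
kept    <- flag_incomplete(occ, drop = FALSE)
```

With this design, NROW(dropped) and NROW(kept) differ by the number of flagged rows, and attr(kept, "flagged") still lets you inspect the problem records.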
Deduplicate
smalldf <- sample_data_1[1:20, ]
# create a duplicate record
smalldf <- rbind(smalldf, smalldf[10,])
row.names(smalldf) <- NULL
# make it slightly different
smalldf[21, "key"] <- 1088954555
NROW(smalldf)
#> [1] 21
dp <- dframe(smalldf) %>% dedup()
NROW(dp)
#> [1] 20
attr(dp, "dups")
#> <scrubr dframe>
#> Size: 1 X 5
#>               name longitude latitude                date        key
#>              (chr)     (dbl)    (dbl)              (time)      (dbl)
#> 1 Ursus americanus -76.78671 35.53079 2015-04-05 23:00:00 1088954555
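The example above catches a record that differs from another by a single digit, which suggests approximate rather than exact matching. The base R sketch below shows one way to do that: it collapses each row to a string and compares rows with an edit-distance similarity from utils::adist. This technique is swapped in for illustration; scrubr's own matching method may differ. The helper name find_near_dups, the tolerance value, and the toy data are all hypothetical.

```r
# Hypothetical helper (not part of scrubr): flag a row as a duplicate when
# its string form is at least `tolerance` similar to an earlier kept row.
find_near_dups <- function(x, tolerance = 0.9) {
  keys <- do.call(paste, c(unname(as.list(x)), sep = " "))
  n <- length(keys)
  # Normalised similarity: 1 means identical, 0 means entirely different
  sim <- 1 - utils::adist(keys) / outer(nchar(keys), nchar(keys), pmax)
  dup <- logical(n)
  for (i in seq_len(n)[-1]) {
    earlier <- seq_len(i - 1)
    dup[i] <- any(sim[i, earlier] >= tolerance & !dup[earlier])
  }
  list(kept = x[!dup, , drop = FALSE], dups = x[dup, , drop = FALSE])
}

recs <- data.frame(
  name      = rep("Ursus americanus", 3),
  longitude = c(-76.78671, -108.53674, -76.78671),
  latitude  = c(35.53079, 32.65219, 35.53079),
  key       = c(1088954559, 1088932238, 1088954555),
  stringsAsFactors = FALSE
)
res <- find_near_dups(recs)
```

Here the third record differs from the first only in the last digit of key, so it lands in res$dups, while the second record is different enough to be kept.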
Standardize/convert dates
df <- sample_data_1
# date_standardize(dframe(df), "%d%b%Y") # w/o pipe
dframe(df) %>% date_standardize("%d%b%Y")
#> <scrubr dframe>
#> Size: 1500 X 5
#>                name  longitude latitude      date        key
#>               (chr)      (dbl)    (dbl)     (chr)      (int)
#> 1  Ursus americanus  -79.68283 38.36662 14Jan2015 1065590124
#> 2  Ursus americanus  -82.42028 35.73304 13Jan2015 1065588899
#> 3  Ursus americanus  -99.09625 23.66893 20Feb2015 1098894889
#> 4  Ursus americanus  -72.77432 43.94883 13Feb2015 1065611122
#> 5  Ursus americanus  -72.34617 43.86464 01Mar2015 1088908315
#> 6  Ursus americanus -108.53674 32.65219 29Mar2015 1088932238
#> 7  Ursus americanus -108.53691 32.65237 29Mar2015 1088932273
#> 8  Ursus americanus -123.82900 40.13240 28Mar2015 1132403409
#> 9  Ursus americanus  -78.25027 36.93018 20Mar2015 1088923534
#> 10 Ursus americanus  -76.78671 35.53079 05Apr2015 1088954559
#> ..              ...        ...      ...       ...        ...
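The format string passed above uses strptime-style codes. A minimal base R sketch of the same idea follows, assuming the date column holds timestamps that as.POSIXct can parse. The helper name standardize_dates is hypothetical. The sketch uses a numeric format because "%b" produces month abbreviations that depend on the session locale.

```r
# Hypothetical helper (not part of scrubr): reformat a date column as
# character using a strptime-style format string.
standardize_dates <- function(x, fmt, date_col = "date") {
  x[[date_col]] <- format(as.POSIXct(x[[date_col]], tz = "UTC"), fmt)
  x
}

occ <- data.frame(
  date = c("2015-01-14 16:36:45", "2015-02-20 23:00:00"),
  stringsAsFactors = FALSE
)
out <- standardize_dates(occ, "%Y/%m/%d")
```

Note that the column becomes character after formatting, which matches the (chr) type shown for date in the output above.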
Drop records without dates
NROW(df)
#> [1] 1500
NROW(dframe(df) %>% date_missing())
#> [1] 1498
Create date field from other fields
dframe(sample_data_2) %>% date_create(year, month, day)
#> <scrubr dframe>
#> Size: 1500 X 8
#>                name  longitude latitude        key  year month   day
#>               (chr)      (dbl)    (dbl)      (int) (chr) (chr) (chr)
#> 1  Ursus americanus  -79.68283 38.36662 1065590124  2015    01    14
#> 2  Ursus americanus  -82.42028 35.73304 1065588899  2015    01    13
#> 3  Ursus americanus  -99.09625 23.66893 1098894889  2015    02    20
#> 4  Ursus americanus  -72.77432 43.94883 1065611122  2015    02    13
#> 5  Ursus americanus  -72.34617 43.86464 1088908315  2015    03    01
#> 6  Ursus americanus -108.53674 32.65219 1088932238  2015    03    29
#> 7  Ursus americanus -108.53691 32.65237 1088932273  2015    03    29
#> 8  Ursus americanus -123.82900 40.13240 1132403409  2015    03    28
#> 9  Ursus americanus  -78.25027 36.93018 1088923534  2015    03    20
#> 10 Ursus americanus  -76.78671 35.53079 1088954559  2015    04    05
#> ..              ...        ...      ...        ...   ...   ...   ...
#> Variables not shown: date (chr).
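Building a date from separate year, month, and day fields can be sketched in base R as below. The helper name build_date is hypothetical, and unlike the output above (where date is shown as chr) this sketch stores the result as a Date, so that invalid combinations surface as NA.

```r
# Hypothetical helper (not part of scrubr): combine year, month and day
# columns into a single Date column. Invalid dates become NA.
build_date <- function(x, year = "year", month = "month", day = "day") {
  combined <- paste(x[[year]], x[[month]], x[[day]], sep = "-")
  x$date <- as.Date(combined, format = "%Y-%m-%d")
  x
}

occ <- data.frame(
  year  = c("2015", "2015"),
  month = c("01", "02"),
  day   = c("14", "30"),   # 30 February does not exist
  stringsAsFactors = FALSE
)
with_date <- build_date(occ)
```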
Get citation information for scrubr in R doing citation(package = 'scrubr')