dplyr - R: Quickly Performing Operations on Subsets of a Data Frame, then Re-aggregating the Result Without an Inner Function
We have a large data frame df that can be split by factors. On each subset of the data frame created by this split, we need to perform an operation that increases the number of rows of that subset until it reaches the required length. Afterwards, we rbind the subsets to get a bigger version of df.

Is there a way of doing this quickly without using an inner function?
Let's say our subset operation (in a separate .R file) is:

    foo <- function(df) { magic }
We've come up with a few ways of doing this:

1)

    df <- split(df, factor)
    df <- lapply(df, foo)
    rbindlist(df)

2)

    assign('list.df', list(), envir=.GlobalEnv)
    assign('i', 1, envir=.GlobalEnv)
    df <- dplyr::group_by(df, factor)
    dplyr::mutate(df, foo.list(df.col))
    df <- rbindlist(list.df)
    rm('list.df', envir=.GlobalEnv)
    rm('i', envir=.GlobalEnv)

    (in a separate file)
    foo.list <- function(df.cols) {
        magic
        list.df[[i]] <<- magic.df
        i <<- i + 1
        return(dummy)
    }
The issue with the first approach is time. The lapply simply takes too long to be desirable (on the order of an hour with our data set).

The issue with the second approach is the extremely undesirable side effect of tampering with the user's global environment. It's faster, but this is something we'd rather avoid if we can.
We've also tried passing in the list and count variables and then trying to substitute them with the variables in the parent environment (a sort of hack to get around R's lack of pass-by-reference).
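One way to picture the pass-by-reference workaround without touching the global environment is to keep the accumulator in a private environment, since environments in R are modified in place. The following is only a minimal sketch: `make_collector`, `add`, and `result` are hypothetical names, and the built-in `mtcars` data stands in for the real subsets.

```r
# Hypothetical helper: the list lives in a private environment, which has
# reference semantics, so nothing is written to .GlobalEnv by the collector.
make_collector <- function() {
  store <- new.env(parent = emptyenv())
  store$items <- list()
  add <- function(piece) {
    # Append one processed subset to the shared list.
    store$items[[length(store$items) + 1L]] <- piece
    invisible(NULL)
  }
  # Bind everything collected so far into one data frame.
  result <- function() do.call(rbind, store$items)
  list(add = add, result = result)
}

collector <- make_collector()
for (piece in split(mtcars, mtcars$cyl)) {
  collector$add(head(piece, 2))  # stand-in for the real per-subset operation
}
combined <- collector$result()
```

Whether this is faster than the lapply approach would need to be measured on the real data; the sketch only addresses the side-effect concern.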
We've looked at a number of possibly-relevant questions (R applying a function to a subset of a data frame, Calculations on subsets of a data frame, R: Pass by reference, etc.), but none of them dealt with our question too well.
If you want to run code, here's something you can copy and paste:

    x <- runif(n=10, min=0, max=3)
    y <- sample(x=10, replace=FALSE)
    factors <- runif(n=10, min=0, max=2)
    factors <- floor(factors)
    df <- data.frame(factors, x, y)

    ## Group by factor, then run foo on the groups.
    foo <- function(df.subset) {
      min <- min(df.subset$y)
      max <- max(df.subset$y)
      ## Fill out df.subset to have everything between the min and
      ## max values of y. Assign the old values of df.subset
      ## to the corresponding spots.
      df.fill <- data.frame(x=rep(0, max-min+1),
                            y=min:max,
                            factors=rep(df.subset$factors[1], max-min+1))
      ## match() finds the row of df.fill that holds each original y.
      df.fill$x[match(df.subset$y, df.fill$y)] <- df.subset$x
      df.fill
    }
So then we can take the sample code in the first approach to build the new df (length 18).
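As a rough, self-contained illustration of that first approach on the sample data, the sketch below uses base R only, with `do.call(rbind, ...)` standing in for `rbindlist`. The seed is an arbitrary assumption added for repeatability, `df.big` is a hypothetical name, and the row count depends on the random draw, so it will not necessarily be 18.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable
x <- runif(n = 10, min = 0, max = 3)
y <- sample(x = 10, replace = FALSE)
factors <- floor(runif(n = 10, min = 0, max = 2))
df <- data.frame(factors, x, y)

# Same idea as foo in the question: pad each group so that y covers
# every value from the group's minimum to its maximum.
foo <- function(df.subset) {
  y.range <- min(df.subset$y):max(df.subset$y)
  df.fill <- data.frame(x = 0, y = y.range, factors = df.subset$factors[1])
  df.fill$x[match(df.subset$y, df.fill$y)] <- df.subset$x
  df.fill
}

pieces <- lapply(split(df, df$factors), foo)  # split, then apply
df.big <- do.call(rbind, pieces)              # combine
nrow(df.big)
```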
Using data.table, this doesn't take long thanks to its speedy functionality. If you can, rewrite your function to work with specific variables. The split-apply-combine processing may then get a performance boost:

    library(data.table)
    system.time(
      ## .SD is the subset of rows for the current group
      df2 <- setDT(df)[, foo(.SD), by = factors]
    )
    #  user  system elapsed
    #  1.63    0.39    2.03
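To make "work with specific variables" concrete, here is one possible sketch that skips `foo` entirely: it builds the full range of `y` per group and joins the original rows back on. It assumes the same column names as the sample data; `DT`, `grid`, and `filled` are hypothetical names, and no timing is claimed for it.

```r
library(data.table)

set.seed(42)  # arbitrary seed, only so the sketch is repeatable
DT <- data.table(
  factors = floor(runif(10, min = 0, max = 2)),
  x = runif(10, min = 0, max = 3),
  y = sample(10)
)

# One row per (factors, y) for every y between the group's min and max.
grid <- DT[, .(y = seq(min(y), max(y))), by = factors]

# Join the original rows onto the grid; y values absent from DT get NA in x.
filled <- DT[grid, on = .(factors, y)]

# Follow foo's convention of padding the new rows with x = 0.
filled[is.na(x), x := 0]
```

The grouped step touches only the `y` column, which is the kind of narrow, column-level operation data.table tends to handle well; it would still be worth benchmarking against the `foo(.SD)` version on the real data.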