dplyr - R: Quickly Performing Operations on Subsets of a Data Frame, then Re-aggregating the Result Without an Inner Function

- April 15, 2014

We have a large data frame df that can be split by factors. On each subset of the data frame created by the split, we need to perform an operation that increases the number of rows of that subset until it reaches a certain length. Afterwards, we rbind the subsets to get a bigger version of df.

Is there a way of doing this quickly without using an inner function?

Let's say our subset operation (in a separate .R file) is:

foo <- function(df) { magic }

We've come up with a few ways of doing this:

df <- split(df, factor)
df <- lapply(df, foo)
rbindlist(df)

assign('list.df', list(), envir=.GlobalEnv)
assign('i', 1, envir=.GlobalEnv)

dplyr::group_by(df, factor)
dplyr::mutate(df, foo.list(df.col))
df <- rbindlist(list.df)

rm('list.df', envir=.GlobalEnv)
rm('i', envir=.GlobalEnv)

(in a separate file)

foo.list <- function(df.cols) {
    magic;
    list.df[[i]] <<- magic.df
    i <<- i + 1
    return(dummy)
}

The issue with the first approach is time. The lapply takes longer than is desirable (on the order of an hour with our data set).

The issue with the second approach is the extremely undesirable side effect of tampering with the user's global environment. It's faster, but we'd rather avoid this if we can.

We've also tried passing in the list and count variables and then trying to substitute them with the variables in the parent environment (a sort of hack to get around R's lack of pass-by-reference).

We've looked at a number of possibly relevant questions (R applying a function to a subset of a data frame, Calculations on subsets of a data frame, R: Pass by reference, etc.), but none of them dealt with our question well.

If you want to run code, here's something you can copy and paste:
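One way to keep the accumulate-as-you-go idea without touching the global environment is a closure: the list lives in the enclosing function's environment, and `<<-` updates it there. This is only a sketch under our own assumptions; `make.collector`, `add` and `get` are hypothetical names, not part of dplyr or data.table.

```r
# Hypothetical helper: keeps the growing list in a private closure
# environment instead of list.df and i in .GlobalEnv.
make.collector <- function() {
  items <- list()                           # private to this closure
  add <- function(item) {
    # <<- finds 'items' in the enclosing environment, not the global one
    items[[length(items) + 1]] <<- item
    invisible(NULL)
  }
  get <- function() items
  list(add = add, get = get)
}

collector <- make.collector()
collector$add(data.frame(x = 1:2))          # stand-ins for per-group results
collector$add(data.frame(x = 3:5))
combined <- do.call(rbind, collector$get())
```

In the second approach, foo.list could then call something like collector$add(magic.df) in place of the two `<<-` assignments, so nothing needs to be created in or removed from the user's workspace.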

x <- runif(n=10, min=0, max=3)
y <- sample(x=10, replace=FALSE)
factors <- runif(n=10, min=0, max=2)
factors <- floor(factors)
df <- data.frame(factors, x, y)

The resulting df has length 10 (10 rows).

## Group by factor and run foo on the groups.
foo <- function(df.subset) {
    y.min <- min(df.subset$y)
    y.max <- max(df.subset$y)

    ## Fill out df.subset to have every y value between y.min and y.max.
    ## Assign the old values of df.subset to the corresponding spots.
    df.fill <- data.frame(x=rep(0, y.max-y.min+1),
                          y=y.min:y.max,
                          factors=rep(df.subset$factors[1], y.max-y.min+1))
    ## match() finds the row of df.fill that holds each original y.
    df.fill$x[match(df.subset$y, df.fill$y)] <- df.subset$x
    df.fill
}

So we can take the sample code from the first approach to build a new df (length 18).
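To make the first approach concrete, here is a minimal end-to-end sketch on a small fixed data frame, so the result is reproducible. The values are made up, `fill.subset` is a hypothetical compact stand-in for foo, and base `do.call(rbind, ...)` is swapped in for rbindlist so that nothing beyond base R is needed.

```r
# Made-up sample data: two groups, with gaps in y inside group 0.
df <- data.frame(factors = c(0, 0, 1, 1),
                 x = c(0.5, 1.5, 2.5, 0.7),
                 y = c(1L, 4L, 2L, 3L))

# Hypothetical stand-in for foo: one row per y in min(y):max(y),
# with x = 0 for the rows that did not exist before.
fill.subset <- function(df.subset) {
  y.all <- min(df.subset$y):max(df.subset$y)
  out <- data.frame(x = 0, y = y.all, factors = df.subset$factors[1])
  out$x[match(df.subset$y, y.all)] <- df.subset$x
  out
}

pieces <- split(df, df$factors)        # split: one data frame per group
pieces <- lapply(pieces, fill.subset)  # apply: grow each subset
df.big <- do.call(rbind, pieces)       # combine: stack the pieces again
rownames(df.big) <- NULL
```

Group 0 grows from 2 rows to 4 (y = 1 to 4) and group 1 stays at 2 rows, so df.big has 6 rows.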

Using data.table, this doesn't take long thanks to its speedy functionality. If you can, rewrite your function to work with specific variables. The split-apply-combine processing may get a performance boost:

library(data.table)
system.time(
    df2 <- setDT(df)[, foo(df), factors]
)
#   user  system elapsed
#   1.63    0.39    2.03
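As one possible reading of "rewrite your function to work with specific variables", the fill step can be expressed directly on the y column with a grouped data.table expression followed by a join, with no per-group data frame construction. This is a sketch under that assumption, on made-up data; `grid` and `filled` are hypothetical names.

```r
library(data.table)

# Made-up sample data: two groups, with gaps in y inside group 0.
dt <- data.table(factors = c(0, 0, 1, 1),
                 x = c(0.5, 1.5, 2.5, 0.7),
                 y = c(1L, 4L, 2L, 3L))

# One row per y value in each group's min(y):max(y) range.
grid <- dt[, .(y = min(y):max(y)), by = factors]

# Join the original rows onto the grid; y values that were missing
# come back with x = NA, which we then set to 0.
filled <- dt[grid, on = .(factors, y)]
filled[is.na(x), x := 0]
```

Here filled has 6 rows: y = 1 to 4 for group 0 and y = 2 to 3 for group 1. Whether this is faster than calling foo per group on real data would need to be measured.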
