r - Reading multiple files and calculating mean based on user input -

February 15, 2015

I am trying to write a function in R that takes three inputs:

directory
pollutant
id

I have a directory on my computer full of CSV files, i.e. over 300. The function is shown in the prototype below:

pollutantmean <- function(directory, pollutant, id = 1:332) {
        ## 'directory' is a character vector of length 1 indicating
        ## the location of the CSV files

        ## 'pollutant' is a character vector of length 1 indicating
        ## the name of the pollutant for which we will calculate the
        ## mean; either "sulfate" or "nitrate".

        ## 'id' is an integer vector indicating the monitor ID numbers
        ## to be used

        ## Return the mean of the pollutant across all monitors listed
        ## in the 'id' vector (ignoring NA values)
}

An example of the function's output is shown here:

source("pollutantmean.r")
pollutantmean("specdata", "sulfate", 1:10)
## [1] 4.064
pollutantmean("specdata", "nitrate", 70:72)
## [1] 1.706
pollutantmean("specdata", "nitrate", 23)
## [1] 1.281

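I can read the whole thing in one go with:

path = "c:/users/sean/documents/r projects/data/specdata"
# full paths of every CSV file in the directory
filelist = list.files(path = path, pattern = "\\.csv$", full.names = TRUE)
# read each file into its own data frame
all.files.data = lapply(filelist, read.csv, header = TRUE)
# stack all data frames into one
data = do.call("rbind", all.files.data)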

My issues are:

The user enters id either as a single value or as a range, e.g. if the user enters 1, the file name is 001.csv, and if the user enters the range 1:10, the file names are 001.csv ... 010.csv.
The user enters the column, i.e. "sulfate" or "nitrate", for which he/she wants the mean. There are a lot of missing values in these columns, which need to be omitted before calculating the mean.
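
Both points can be illustrated in isolation. This is only a minimal sketch, assuming the files are named with zero-padded three-digit monitor numbers as described above. The helper name `id_to_files` is hypothetical and not part of the assignment.

```r
# Hypothetical helper: map monitor ids to zero-padded file names.
id_to_files <- function(directory, id) {
  file.path(directory, sprintf("%03d.csv", id))
}

id_to_files("specdata", 1)      # "specdata/001.csv"
id_to_files("specdata", 9:10)   # "specdata/009.csv" "specdata/010.csv"

# Missing values are skipped by passing na.rm = TRUE to mean().
x <- c(1.5, NA, 2.5)
mean(x)                 # NA, because of the missing value
mean(x, na.rm = TRUE)   # 2
```

The same `sprintf` call handles a single id and a range alike, since it is vectorised over `id`.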

A summary of the whole data set:

summary(data)
         date            sulfate           nitrate             id
 2004-01-01:   250   Min.   : 0.0     Min.   : 0.0     Min.   :  1.0
 2004-01-02:   250   1st Qu.: 1.3     1st Qu.: 0.4     1st Qu.: 79.0
 2004-01-03:   250   Median : 2.4     Median : 0.8     Median :168.0
 2004-01-04:   250   Mean   : 3.2     Mean   : 1.7     Mean   :164.5
 2004-01-05:   250   3rd Qu.: 4.0     3rd Qu.: 2.0     3rd Qu.:247.0
 2004-01-06:   250   Max.   :35.9     Max.   :53.9     Max.   :332.0
 (Other)   :770587   NA's   :653304   NA's   :657738
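
Since the combined data frame carries an id column, one possible approach is to filter it by monitor id and average the requested column. The sketch below assumes the column names shown in the summary (`sulfate`, `nitrate`, `id`). The toy values and the helper name `mean_from_combined` are invented for illustration.

```r
# Toy stand-in for the combined data frame; the values are made up.
data <- data.frame(
  sulfate = c(1.0, NA, 3.0, 5.0),
  nitrate = c(NA, 0.5, 1.5, 2.5),
  id      = c(1, 1, 2, 3)
)

# Hypothetical helper: mean of one pollutant over the selected monitors.
mean_from_combined <- function(data, pollutant, id) {
  rows <- data$id %in% id                      # keep only requested monitors
  mean(data[rows, pollutant], na.rm = TRUE)    # ignore missing values
}

mean_from_combined(data, "sulfate", 1:2)   # mean of 1 and 3
mean_from_combined(data, "nitrate", 3)     # single monitor
```

The trade-off is that every file is read even when only a few monitors are requested.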

Any ideas on how to formulate this would be highly appreciated.

Cheers

So, you can simulate your situation like this:

# simulate data:
# create 332 data frames
set.seed(1)
df.list <- replicate(332, data.frame(sulfate = rnorm(100), nitrate = rnorm(100)), simplify = FALSE)
# generate names like 001.csv, 010.csv
file.names <- paste0('specdata/', sprintf('%03d', 1:332), '.csv')
# make sure the output directory exists, then write the files to disk
dir.create('specdata', showWarnings = FALSE)
invisible(mapply(write.csv, df.list, file.names))

And here is a function to read the files:

pollutantmean <- function(directory, pollutant, id = 1:332) {
  file.names <- list.files(directory)
  # strip the extension to get the monitor number of each file
  file.numbers <- as.numeric(sub('\\.csv$', '', file.names))
  # pick the files whose number is in 'id'; drop ids with no matching file
  selected.files <- na.omit(file.names[match(id, file.numbers)])
  selected.dfs <- lapply(file.path(directory, selected.files), read.csv)
  # pool the pollutant column of all selected files and average it
  mean(c(sapply(selected.dfs, function(x) x[, pollutant])), na.rm = TRUE)
}

pollutantmean('specdata', 'nitrate', c(1:100, 141))
# [1] -0.005450574
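
One caveat when moving from the simulated data to real files: `sapply` only simplifies to a matrix when every file has the same number of rows. Otherwise it returns a list, and `mean` then gives NA with a warning. Below is a minimal sketch of a variant that flattens explicitly with `unlist`. The name `pollutantmean2` and the temporary demo directory are illustrative assumptions, not part of the original answer.

```r
# Variant that builds the file names from the ids and flattens with unlist().
pollutantmean2 <- function(directory, pollutant, id = 1:332) {
  files <- file.path(directory, sprintf("%03d.csv", id))
  files <- files[file.exists(files)]   # skip ids that have no file
  values <- unlist(lapply(files, function(f) read.csv(f)[[pollutant]]))
  mean(values, na.rm = TRUE)
}

# Demo on two small files of different lengths in a temporary directory.
demo.dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo.dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, NA), nitrate = c(2, 4)),
          file.path(demo.dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(3, 5, 7), nitrate = c(NA, 6, 8)),
          file.path(demo.dir, "002.csv"), row.names = FALSE)

pollutantmean2(demo.dir, "sulfate", 1:2)   # mean of 1, 3, 5, 7
```

Building the names with `sprintf` also avoids listing the whole directory when only a few monitors are requested.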
