R - Reading multiple files and calculating mean based on user input
I am trying to write a function in R that takes 3 inputs:
- directory
- pollutant
- id
I have a directory on my computer full of CSV files, i.e. over 300. The function is shown in the prototype below:
    pollutantmean <- function(directory, pollutant, id = 1:332) {
      ## 'directory' is a character vector of length 1 indicating
      ## the location of the CSV files

      ## 'pollutant' is a character vector of length 1 indicating
      ## the name of the pollutant for which to calculate the
      ## mean; either "sulfate" or "nitrate".

      ## 'id' is an integer vector indicating the monitor ID numbers
      ## to be used

      ## Return the mean of the pollutant across all monitors listed
      ## in the 'id' vector (ignoring NA values)
    }

An example output of the function is shown here:
    source("pollutantmean.r")
    pollutantmean("specdata", "sulfate", 1:10)
    ## [1] 4.064
    pollutantmean("specdata", "nitrate", 70:72)
    ## [1] 1.706
    pollutantmean("specdata", "nitrate", 23)
    ## [1] 1.281

I can read the whole thing in one go by:
    path = "c:/users/sean/documents/r projects/data/specdata"
    filelist = list.files(path = path, pattern = "\\.csv$", full.names = TRUE)
    all.files.data = lapply(filelist, read.csv, header = TRUE)
    data = do.call("rbind", all.files.data)

My issues are:
- The user enters id either as a single value or as a range, e.g. if the user enters 1 the file name is 001.csv, or if the user enters the range 1:10 the file names are 001.csv ... 010.csv
- The column entered by the user, i.e. "sulfate" or "nitrate", is the one he/she is interested in getting the mean of. There are a lot of missing values in these columns (which I need to omit from the column before calculating the mean).
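As a rough sketch of the two points above (not necessarily how the final function should look): `sprintf()` can zero-pad the ids into file names, and `mean(..., na.rm = TRUE)` drops the missing values. The helper name `id_to_files`, the temporary directory, and the toy values are all made up for illustration; they are not part of the real specdata.

```r
# Hypothetical helper: turn monitor ids into zero-padded file paths.
id_to_files <- function(directory, id) {
  file.path(directory, sprintf("%03d.csv", id))
}

# Toy data in a temporary directory (made-up values, not the real specdata).
demo.dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo.dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, NA, 3), nitrate = c(NA, 2, 4)),
          file.path(demo.dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(5, NA), nitrate = c(6, NA)),
          file.path(demo.dir, "002.csv"), row.names = FALSE)

# A single id and a range go through the same code path.
basename(id_to_files(demo.dir, 1))      # "001.csv"
basename(id_to_files(demo.dir, 1:2))    # "001.csv" "002.csv"

# Read the selected files, pull out one column, and average without the NAs.
values <- unlist(lapply(id_to_files(demo.dir, 1:2),
                        function(f) read.csv(f)[["sulfate"]]))
sulfate.mean <- mean(values, na.rm = TRUE)   # (1 + 3 + 5) / 3 = 3
```

Building the names directly from the ids assumes every requested id has a file; if some may be missing, the file list has to be checked first.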
Here is a summary of all the data files combined:

    summary(data)
             date            sulfate           nitrate             id
     2004-01-01:   250   Min.   : 0.0     Min.   : 0.0     Min.   :  1.0
     2004-01-02:   250   1st Qu.: 1.3     1st Qu.: 0.4     1st Qu.: 79.0
     2004-01-03:   250   Median : 2.4     Median : 0.8     Median :168.0
     2004-01-04:   250   Mean   : 3.2     Mean   : 1.7     Mean   :164.5
     2004-01-05:   250   3rd Qu.: 4.0     3rd Qu.: 2.0     3rd Qu.:247.0
     2004-01-06:   250   Max.   :35.9     Max.   :53.9     Max.   :332.0
     (Other)   :770587   NA's   :653304   NA's   :657738

Any ideas on how to formulate this would be highly appreciated...
Cheers
So, you can simulate your situation like this:
    # Simulate some data:
    # create 332 data frames
    set.seed(1)
    df.list <- replicate(332, data.frame(sulfate = rnorm(100), nitrate = rnorm(100)), simplify = FALSE)
    # generate names like 001.csv and 010.csv
    file.names <- paste0('specdata/', sprintf('%03d', 1:332), '.csv')
    # make sure the target directory exists, then write the files to disk
    dir.create('specdata', showWarnings = FALSE)
    invisible(mapply(write.csv, df.list, file.names))

And here is a function to read the files:
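    pollutantmean <- function(directory, pollutant, id = 1:332) {
      # list the files and recover the monitor number from each file name
      file.names <- list.files(directory)
      file.numbers <- as.numeric(sub('\\.csv$', '', file.names))
      # keep only the files whose number was requested; ids without a file are dropped
      selected.files <- na.omit(file.names[match(id, file.numbers)])
      selected.dfs <- lapply(file.path(directory, selected.files), read.csv)
      # unlist(lapply(...)) also works when the files have different numbers of rows
      mean(unlist(lapply(selected.dfs, function(x) x[, pollutant])), na.rm = TRUE)
    }

    pollutantmean('specdata', 'nitrate', c(1:100, 141))
    # [1] -0.005450574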
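To see why the function tolerates ids that have no matching file, here is a small sketch of the `match()` step on its own. The file names below are made up for illustration and nothing is read from disk.

```r
# Made-up directory listing: monitor 3 has no file.
file.names   <- c("001.csv", "002.csv", "004.csv")
file.numbers <- as.numeric(sub("\\.csv$", "", file.names))   # 1 2 4

# match() returns the position of each id in file.numbers, or NA if absent.
id  <- c(2, 3, 4)
pos <- match(id, file.numbers)                               # 2 NA 3

# Indexing with NA yields NA, which na.omit() then drops.
selected.files <- na.omit(file.names[pos])
as.character(selected.files)                                 # "002.csv" "004.csv"
```

Because missing ids are dropped silently, it may be worth warning the user when `length(selected.files)` is smaller than `length(id)`.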