Integration with FFCRegressionImputation

Connor Gilroy


This vignette demonstrates the integration of the ffc-data-processing methods with the imputation methods from the FFCRegressionImputation package by Anna Filippova.


First, install the FFCRegressionImputation package from GitHub using devtools.

if (!"devtools" %in% installed.packages()) install.packages("devtools")

if (!"FFCRegressionImputation" %in% installed.packages()) {


Then, source the required packages and functions from ffc-data-processing.


Get variable types

First, we use the Stata background.dta file to extract information about variable types. For this vignette, only the constructed variables are processed. Restricting the data to this subset keeps the later steps computationally feasible.

This function chain is very similar to the code in the vignette Example 2, but broken out into individual steps.

The results of the data processing are summarized below. Categorical variables are now factors, and continuous variables (as well as the challengeID identifier) are numeric.

# get covariate variable type information from Stata file
background_dta <- read_dta("data/background.dta")

background_variable_types <- 
  background_dta %>%
  subset_vars_keep(get_vars_constructed) %>%
  recode_na_character() %>%
  labelled_to_factor() %>%
  labelled_to_numeric() %>%
  character_to_factor() %>%
  character_to_numeric() %>%
  # use a less conservative threshold than default
  character_to_factor_or_numeric(threshold = 29) 

##  factor numeric 
##     269      79
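The role of the threshold can be illustrated with a toy helper. This is only a sketch: `classify_by_unique` is a hypothetical function, not part of ffc-data-processing, and it assumes that a column with more than `threshold` distinct values is treated as numeric and anything else as a factor. The actual rule used by character_to_factor_or_numeric may differ.

```r
# Hypothetical helper (not from ffc-data-processing): call a column "numeric"
# if it has more than `threshold` distinct non-missing values, else "factor".
classify_by_unique <- function(x, threshold = 29) {
  n_unique <- length(unique(x[!is.na(x)]))
  if (n_unique > threshold) "numeric" else "factor"
}

# Toy data: one column with 3 distinct values, one with 60
toy <- data.frame(
  few  = rep(c("1", "2", "3"), length.out = 60),
  many = as.character(seq_len(60)),
  stringsAsFactors = FALSE
)

# Classify every column and tabulate the result
toy_types <- vapply(toy, classify_by_unique, character(1))
table(toy_types)
```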

Using this processed data set, we store the names of the categorical and continuous variables separately.

categorical_vars <- get_vars_categorical(background_variable_types)
continuous_vars <- get_vars_continuous(background_variable_types)
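For intuition, a rough base R equivalent of these two calls on a toy data frame might look like the following. This is a sketch, not the package implementation; it assumes that the helpers classify columns by their class and that challengeID is left out of the continuous list, which matches the counts reported in this vignette.

```r
# Toy data frame with one identifier, one factor, and one numeric column
toy <- data.frame(
  challengeID = 1:4,
  cat_var = factor(c("a", "b", "a", NA)),
  num_var = c(1.5, NA, 3.2, 4.0)
)

# Names of factor columns
toy_categorical <- names(toy)[vapply(toy, is.factor, logical(1))]

# Names of numeric columns, excluding the identifier
toy_continuous <- setdiff(
  names(toy)[vapply(toy, is.numeric, logical(1))],
  "challengeID"
)
```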

It’s important to check that we’ve classified the variables appropriately; we can do this using the metadata provided by the summarize_variables function.

variable_metadata <- 
  background_dta %>%
  subset_vars_keep(get_vars_constructed) %>%
  summarize_variables(background_variable_types) %>%
  arrange(variable_type, desc(unique_values))

variable_metadata %>% head(10) %>% print(width = 91)
## # A tibble: 10 x 4
##    variable                                                            label unique_values
##       <chr>                                                            <chr>         <dbl>
##  1   cf4age                               Constructed - Father's age (years)            41
##  2   cm4age                               Constructed - Mother's age (years)            34
##  3  cm3amrf             Constructed - Mother age when married father (years)            26
##  4  cm4amrf             Constructed - Mother age when married father (years)            26
##  5  cm3alvf Constructed - Mother age when started living with father (years)            24
##  6  cm4alvf Constructed - Mother age when started living with father (years)            24
##  7 cm4b_age     Constructed - Child age at time of mother interview (months)            17
##  8 cf4b_age     Constructed - Child age at time of father interview (months)            17
##  9  cm2relf    Constructed - Mother's romantic relationship w/BF at one-year            10
## 10  cm3relf      Constructed - Mother relationship with father at year three            10
## # ... with 1 more variables: variable_type <chr>

It turns out that some censored continuous variables are converted to factors rather than to numeric variables. For this example, we want to discard the information about censoring for the first 8 variables shown above. In other words, we are going to treat someone who is labelled as “20 and younger” as if we knew they were exactly 20 years old.

We do this by removing those variable names from the categorical_vars list, and adding them to the continuous_vars list.

censored_vars <- variable_metadata$variable[1:8]
categorical_vars <- 
  categorical_vars[!categorical_vars %in% censored_vars]
continuous_vars <- c(continuous_vars, censored_vars)
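After moving names between the two lists, it is worth confirming that no variable ends up in both. The sketch below does this on toy vectors; `some_income_var` is a made-up placeholder name, and `setdiff` and `union` are simply base R alternatives to the indexing used above.

```r
# Toy name vectors; some_income_var is a placeholder, not a real variable
toy_categorical <- c("cm2relf", "cf4age", "cm4age")
toy_continuous <- c("some_income_var")
toy_censored <- c("cf4age", "cm4age")

# Move the censored variables from the categorical to the continuous list
toy_categorical <- setdiff(toy_categorical, toy_censored)
toy_continuous <- union(toy_continuous, toy_censored)

# The two lists should now be disjoint
stopifnot(length(intersect(toy_categorical, toy_continuous)) == 0)
```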

We now have 261 categorical variables and 86 continuous variables, in addition to the challengeID identifier.

Do imputation

These steps make use of functions from FFCRegressionImputation to do logical and regression-based imputation.

Initial processing

The initialization function from FFCRegressionImputation reads in and operates on the background.csv file rather than on background.dta. That isn’t a problem, because we’ve already retrieved and stored the information we need from the background.dta file.

The function logically imputes some missing values for age-related variables. It also generates variables with NA count information, but these aren’t included in subsequent steps of this vignette.

# run initial cleaning and imputation for all variables on csv file
background_csv <- initImputation(data = "data/background.csv")
## Importing data...
## Run logical imputation...
## [1] "Generating refusalcount, dontknowcount, nacount..."
## [1] "Running logical age imputation ... "
## [1] "Done with logical age imputation!"
## Drop missing data...
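To give a sense of what the count variables might contain, here is a base R sketch that tallies missingness codes per row. It is not the initImputation implementation; it assumes that -1 codes a refusal and -2 codes "don't know", and the toy columns v1 to v3 are invented for illustration.

```r
# Toy data with assumed missingness codes: -1 = refused, -2 = don't know
toy <- data.frame(
  challengeID = 1:3,
  v1 = c(-1, 2, NA),
  v2 = c(-2, -1, 5),
  v3 = c(3, NA, NA)
)

# Count over every column except the identifier
vars <- setdiff(names(toy), "challengeID")

toy_counts <- data.frame(
  challengeID = toy$challengeID,
  refusalcount = rowSums(toy[vars] == -1, na.rm = TRUE),
  dontknowcount = rowSums(toy[vars] == -2, na.rm = TRUE),
  nacount = rowSums(is.na(toy[vars]))
)
```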

Subset and split variables

Next, use the previously calculated information about which variables are categorical and which are continuous to split the new data frame into two sets of constructed variables. In this vignette, we will only run regression-based imputation on the continuous variables.

Note that initImputation renames three constructed variables by prefixing them with c (co5oint, ct5int, and cn5d2_age). These variables are included manually in get_vars_constructed and not renamed there, so the renamed variables need to be explicitly included here.

# split csv file using variable type information
background_categorical <- 
  background_csv %>%
  select(challengeID, one_of(categorical_vars), co5oint, ct5int)

background_continuous <- 
  background_csv %>%
  select(challengeID, one_of(continuous_vars), cn5d2_age)

# not used, included for illustrative purposes only
background_other <- 
  background_csv %>%
  select(-one_of(continuous_vars, categorical_vars))

Calculate correlation matrix

The corMatrix function produces a matrix of correlations between covariates, and an object containing both this matrix and a data frame for use in subsequent imputation.

This is the most computationally intensive step, and subsetting the background covariates down to the constructed continuous covariates speeds it up considerably. It can also be run in parallel, as it is here.

# calculate correlations between numeric covariates for regression imputation
output <- corMatrix(data = background_continuous, parallel = 1)
## Enabling parallelization
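The underlying idea can be sketched in base R with `cor()`. This is an illustration on simulated data, not the corMatrix implementation; in particular, the use of pairwise-complete observations is an assumption made here to cope with missing values.

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)

# Toy covariates: x2 closely tracks x1, x3 is unrelated noise
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),
  x3 = rnorm(n)
)
toy$x2[sample(n, 20)] <- NA  # introduce some missing values

# Correlations computed on the rows observed for each pair of variables
toy_cor <- cor(toy, use = "pairwise.complete.obs")

# Rank the candidate predictors of x1 by absolute correlation
ranked <- sort(abs(toy_cor["x1", colnames(toy_cor) != "x1"]), decreasing = TRUE)
names(ranked)[1]
```

Ranking by absolute correlation is what lets a later step pick a small set of predictors for each variable instead of using every covariate.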

Do regression imputation

regImputation uses the most correlated predictors to impute missing values in the given data frame. Here, we impute only the continuous variables.

By default, this uses an OLS linear model; it can run a LASSO with the polywog package instead.

background_continuous_imputed <- 
  regImputation(output$df, output$corMatrix, parallel = 1)
## Using lm...
## Enabling parallelization
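As a minimal sketch of what OLS-based imputation means for a single variable, the following base R example fits `lm` on the observed rows and fills the missing rows with predictions. The data are simulated and the predictors are fixed by hand; this illustrates the technique and is not the regImputation code.

```r
set.seed(2)
n <- 100

# Toy data: y depends linearly on x1 and x2
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 * toy$x1 - toy$x2 + rnorm(n, sd = 0.1)
true_y <- toy$y

# Remove 10 values of y at random
missing_rows <- sample(n, 10)
toy$y[missing_rows] <- NA

# Fit on observed rows (lm drops incomplete rows by default)
fit <- lm(y ~ x1 + x2, data = toy)

# Fill only the missing entries with model predictions
toy_imputed <- toy
toy_imputed$y[missing_rows] <- predict(fit, newdata = toy[missing_rows, ])
```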

Merge data

Now, we can join the categorical variables we set aside and the newly imputed continuous variables back together. For consistency’s sake, we rearrange them into the original order.

Some variables have been dropped due to lack of variation. Selecting them by name now generates warnings, which can be ignored.

# merge continuous and categorical variables back together
background_imputed <- 
  full_join(background_categorical, background_continuous_imputed, 
            by = "challengeID") %>%
  # put variables in order
  select(challengeID, one_of(get_vars_constructed(background_csv)))
## Warning: Unknown variables: `cm3alvf`
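If the warning is distracting, one option is to intersect the desired column order with the columns that actually exist before selecting. The base R sketch below shows the pattern on toy data; `dropped_var` stands in for a variable that was removed during imputation, and `merge(..., all = TRUE)` plays the role of a full join.

```r
# Toy categorical and continuous pieces with partially overlapping IDs
toy_cat <- data.frame(challengeID = 1:3, b = factor(c("x", "y", "x")))
toy_num <- data.frame(challengeID = c(1, 2, 4), a = c(0.1, 0.2, 0.4))

# Full outer join on the identifier
toy_merged <- merge(toy_cat, toy_num, by = "challengeID", all = TRUE)

# Desired order, including a variable that no longer exists
original_order <- c("a", "dropped_var", "b")

# Keep only the names that are present, preserving the desired order
keep <- intersect(original_order, names(toy_merged))
toy_merged <- toy_merged[c("challengeID", keep)]
```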

Finally, read in and merge the outcome data. (merge_train is a wrapper around dplyr::full_join with special handling for the joining variable challengeID.)

# merge background data with train
train <- read_csv("data/train.csv")
## Parsed with column specification:
## cols(
##   challengeID = col_integer(),
##   gpa = col_double(),
##   grit = col_double(),
##   materialHardship = col_double(),
##   eviction = col_integer(),
##   layoff = col_integer(),
##   jobTraining = col_integer()
## )
ffc <- merge_train(background_imputed, train)

To build a model of these outcomes, you would want to treat the missing values in the categorical variables as well, either by treating them identically to the continuous variables and running them through regImputation or by using some other imputation strategy.
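As one example of another strategy, missing values in a factor can be turned into an explicit level so that models treat missingness as its own category. The sketch below uses base R on a made-up column; it is a simple alternative, not something done by either package in this vignette.

```r
# Toy factor with missing values
toy <- data.frame(
  cat_var = factor(c("married", NA, "cohabiting", NA))
)

# Add NA as a level, then give that level an explicit name
toy$cat_var <- addNA(toy$cat_var)
levels(toy$cat_var)[is.na(levels(toy$cat_var))] <- "missing"

table(toy$cat_var)
```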