Feature engineering and variable imputation

Playing with multi-variable imputation on anonymized data from 10,000 people representing the general US population. This is not a tutorial but a personal learning exercise: I am learning to work with medical datasets and to establish a build workflow from R to the web.

This blog post contains the preprocessing steps for another, still unfinished, blog post about lasso regression.
I know that imputation is currently done incorrectly here.

This is a follow-up to an exploratory data-analysis post of mine. When I was nearly done with writing that post, some new questions came to my mind.

Here I’ll continue with preprocessing. The basic question is: can I identify other variables (out of the 76) that apparently influence blood pressure in some way?

I am not trying to do groundbreaking medical research here. Hey I’m just a blogger, playing with a dataset for personal entertainment, trying out some of those fancy Machine Learning algorithms that I encountered during MOOC homework assignments. This time I’ll try these techniques out on a larger dataset that I find interesting.

Attribute selection and variable imputation are topics worth studying, especially on a more complex dataset such as the NHANES data.

The NHANES dataset

NHANES is the US National Health and Nutrition Examination Survey. It is a carefully curated, large medical survey that aims to obtain a representative sample of the general US population. The survey is carried out periodically.

The NHANES dataset is available as an R package on CRAN. I have used package version 2.1.0, specifically “NHANES 2009-2012 with adjusted weighting”.

Table 1 shows the column names. Further below, Table 5 shows some other important metadata of the NHANES dataset.

This dataset contains data from two collection periods, 2009-2010 and 2011-2012. Some attributes were collected during period 1 but not in period 2, or they were collected in different ways. The dataset therefore contains many NA values for this reason alone. Moreover, some attributes are valid only for women (“nBabies”), and other features, such as “HeadCirc” (head circumference), were only collected for babies and children.

Feature Engineering

These are the feature selection steps I will apply. First, remove attributes that were collected only for babies, children and, generally, people younger than 20 years of age. Second, remove redundant attributes: for example, age in years, age in months and age decade all carry the same information. Likewise, many blood-pressure values are redundant, so I keep only the average values. This leads to the selection of columns marked as “drop” in Table 1 below.

Table 1: Column names in NHANES data set, and columns that will be dropped
	Column	Meaning	Action
1	SurveyYr	Which survey the participant participated in.	drop
2	ID	Participant identifier.	drop
3	Gender	Gender (sex) of study participant coded as male or female
4	Age	Age in years at screening of study participant. Note: Subjects 80 years or older were recorded as 80.
5	AgeDecade	Categorical variable derived from age with levels 0-9, 10-19, … 70+	drop
6	AgeMonths	Age in months at screening of study participant. Reported for participants aged 0 to 79 years for 2009 to 2010 data. Reported for participants aged 0 to 2 years for 2011 to 2012 data.	drop
7	Race1	Reported race of study participant: Mexican, Hispanic, White, Black, or Other.
8	Race3	Reported race of study participant, including non-Hispanic Asian category: Mexican, Hispanic, White, Black, Asian, or Other. Not available for 2009-10.	drop
9	Education	Educational level of study participant. Reported for participants aged 20 years or older. One of 8thGrade, 9-11thGrade, HighSchool, SomeCollege, or CollegeGrad.
10	MaritalStatus	Marital status of study participant. Reported for participants aged 20 years or older. One of Married, Widowed, Divorced, Separated, NeverMarried, or LivePartner (living with partner).	drop
11	HHIncome	Total annual gross income for the household in US dollars. One of 0 - 4999, 5000 - 9,999, 10000 - 14999, 15000 - 19999, 20000 - 24,999, 25000 - 34999, 35000 - 44999, 45000 - 54999, 55000 - 64999, 65000 - 74999, 75000 - 99999, or 100000 or More.	drop
12	HHIncomeMid	Numerical version of HHIncome derived from the middle income in each category
13	Poverty	A ratio of family income to poverty guidelines. Smaller numbers indicate more poverty
14	HomeRooms	How many rooms are in home of study participant (counting kitchen but not bathroom). 13 rooms = 13 or more rooms.
15	HomeOwn	One of Own, Rent, or Other indicating whether the home of study participant or someone in their family is owned, rented or occupied by some other arrangement.
16	Weight	Weight in kg
17	Length	Recumbent length in cm. Reported for participants aged 0 - 3 years.	drop
18	HeadCirc	Head circumference in cm. Reported for participants aged 0 years (0 - 6 months).	drop
19	Height	Standing height in cm. Reported for participants aged 2 years or older.
20	BMI	Body mass index (weight/height^2 in kg/m^2). Reported for participants aged 2 years or older.
21	BMICatUnder20yrs	Body mass index category. Reported for participants aged 2 to 19 years. One of UnderWeight (BMI < 5th percentile) NormWeight (BMI 5th to < 85th percentile), OverWeight (BMI 85th to < 95th percentile), Obese (BMI >= 95th percentile).	drop
22	BMI_WHO	Body mass index category. Reported for participants aged 2 years or older. One of 12.0_18.4, 18.5_24.9, 25.0_29.9, or 30.0_plus.	drop
23	Pulse	60 second pulse rate
24	BPSysAve	Combined systolic blood pressure reading, following the procedure outlined for BPXSAR.
25	BPDiaAve	Combined diastolic blood pressure reading, following the procedure outlined for BPXDAR.
26	BPSys1	Systolic blood pressure in mm Hg – first reading	drop
27	BPDia1	Diastolic blood pressure in mm Hg – first reading	drop
28	BPSys2	Systolic blood pressure in mm Hg – second reading (consecutive readings)	drop
29	BPDia2	Diastolic blood pressure in mm Hg – second reading	drop
30	BPSys3	Systolic blood pressure in mm Hg – third reading (consecutive readings)	drop
31	BPDia3	Diastolic blood pressure in mm Hg – third reading (consecutive readings)	drop
32	Testosterone	Testosterone total (ng/dL). Reported for participants aged 6 years or older. Not available for 2009-2010.
33	DirectChol	Direct HDL cholesterol in mmol/L. Reported for participants aged 6 years or older.
34	TotChol	Total cholesterol in mmol/L. Reported for participants aged 6 years or older.
35	UrineVol1	Urine volume in mL – first test. Reported for participants aged 6 years or older.
36	UrineFlow1	Urine flow rate (urine volume/time since last urination) in mL/min – first test. Reported for participants aged 6 years or older.
37	UrineVol2	Urine volume in mL – second test. Reported for participants aged 6 years or older.
38	UrineFlow2	Urine flow rate (urine volume/time since last urination) in mL/min – second test. Reported for participants aged 6 years or older.
39	Diabetes	Study participant told by a doctor or health professional that they have diabetes. Reported for participants aged 1 year or older as Yes or No.
40	DiabetesAge	Age of study participant when first told they had diabetes. Reported for participants aged 1 year or older.
41	HealthGen	Self-reported rating of participant’s health in general. Reported for participants aged 12 years or older. One of Excellent, Vgood, Good, Fair, or Poor.
42	DaysPhysHlthBad	Self-reported number of days participant’s physical health was not good out of the past 30 days. Reported for participants aged 12 years or older.
43	DaysMentHlthBad	Self-reported number of days participant’s mental health was not good out of the past 30 days. Reported for participants aged 12 years or older.
44	LittleInterest	Self-reported number of days where participant had little interest in doing things. Reported for participants aged 18 years or older. One of None, Several, Majority (more than half the days), or AlmostAll.
45	Depressed	Self-reported number of days where participant felt down, depressed or hopeless. Reported for participants aged 18 years or older. One of None, Several, Majority (more than half the days), or AlmostAll.
46	nPregnancies	How many times participant has been pregnant. Reported for female participants aged 20 years or older.
47	nBabies	How many of participant’s deliveries resulted in live births. Reported for female participants aged 20 years or older.
48	PregnantNow	Pregnancy status at the time of the health examination was ascertained for females 8-59 years of age. Due to disclosure risks pregnancy status was only released for women 20-44 years of age. The information used included urine pregnancy test results and self-reported pregnancy status. Urine pregnancy tests were performed prior to the dual energy x-ray absorptiometry (DXA) exam. Persons who reported they were pregnant at the time of exam were assumed to be pregnant. As a result, if the urine test was negative, but the subject reported they were pregnant, the status was coded as “Yes”. If the urine pregnancy results were negative and the respondent stated that they were not pregnant, the respondent was coded as “No”. If the urine pregnancy results were negative and the respondent did not know her pregnancy status, the respondent was coded “unknown”. Persons who were interviewed, but not examined also have a value of “unknown”. In addition there are missing values.
49	Age1stBaby	Age of participant at time of first live birth. 14 years or under = 14, 45 years or older = 45. Reported for female participants aged 20 years or older.
50	SleepHrsNight	Self-reported number of hours study participant usually gets at night on weekdays or workdays. Reported for participants aged 16 years and older.
51	SleepTrouble	Participant has told a doctor or other health professional that they had trouble sleeping. Reported for participants aged 16 years and older. Coded as Yes or No.
52	PhysActive	Participant does moderate or vigorous-intensity sports, fitness or recreational activities (Yes or No). Reported for participants 12 years or older.
53	PhysActiveDays	Number of days in a typical week that participant does moderate or vigorous-intensity activity. Reported for participants 12 years or older.
54	TVHrsDay	Number of hours per day on average participant watched TV over the past 30 days. Reported for participants 2 years or older. One of 0_to_1hr, 1_hr, 2_hr, 3_hr, 4_hr, More_4_hr. Not available 2009-2010.	drop
55	CompHrsDay	Number of hours per day on average participant used a computer or gaming device over the past 30 days. Reported for participants 2 years or older. One of 0_hrs, 0_to_1hr, 1_hr, 2_hr, 3_hr, 4_hr, More_4_hr. Not available 2009-2010.	drop
56	TVHrsDayChild	Number of hours per day on average participant watched TV over the past 30 days. Reported for participants 2 to 11 years. Not available 2011-2012.	drop
57	CompHrsDayChild	Number of hours per day on average participant used a computer or gaming device over the past 30 days. Reported for participants 2 to 11 years old. Not available 2011-2012.	drop
58	Alcohol12PlusYr	Participant has consumed at least 12 drinks of any type of alcoholic beverage in any one year. Reported for participants 18 years or older as Yes or No.
59	AlcoholDay	Average number of drinks consumed on days that participant drank alcoholic beverages. Reported for participants aged 18 years or older.
60	AlcoholYear	Estimated number of days over the past year that participant drank alcoholic beverages. Reported for participants aged 18 years or older.
61	SmokeNow	Study participant currently smokes cigarettes regularly. Reported for participants aged 20 years or older as Yes or No, provided they answered Yes to having smoked 100 or more cigarettes in their lifetime. All subjects who have not smoked 100 or more cigarettes are listed as NA here.
62	Smoke100	Study participant has smoked at least 100 cigarettes in their entire life. Reported for participants aged 20 years or older as Yes or No.
63	SmokeAge	Age study participant first started to smoke cigarettes fairly regularly. Reported for participants aged 20 years or older.
64	Marijuana	Participant has tried marijuana. Reported for participants aged 18 to 59 years as Yes or No.
65	RegularMarij	Participant has been/is a regular marijuana user (used at least once a month for a year). Reported for participants aged 18 to 59 years as Yes or No.
66	AgeRegMarij	Age of participant when first started regularly using marijuana. Reported for participants aged 18 to 59 years.
67	HardDrugs	Participant has tried cocaine, crack cocaine, heroin or methamphetamine. Reported for participants aged 18 to 69 years as Yes or No.
68	SexEver	Participant had had vaginal, anal, or oral sex. Reported for participants aged 18 to 69 years as Yes or No.
69	SexAge	Age of participant when had sex for the first time. Reported for participants aged 18 to 69 years.
70	SexNumPartnLife	Number of opposite sex partners participant has had any kind of sex with over their lifetime. Reported for participants aged 18 to 69 years.
71	SexNumPartYear	Number of opposite sex partners participant has had any kind of sex with over the past 12 months. Reported for participants aged 18 to 59 years.
72	SameSex	Participant has had any kind of sex with a same sex partner. Reported for participants aged 18 to 69 years as Yes or No.
73	SexOrientation	Participant’s sexual orientation (self-described). Reported for participants aged 18 to 59 years. One of Heterosexual, Homosexual, Bisexual.
74	WTINT2YR, WTMEC2YR, SDMVPSU, SDMVSTRA	Sample weighting variables. For more details see one of the following. http://www.cdc.gov/Nchs/tutorials/environmental/orientation/sample_design/index.htm http://www.cdc.gov/nchs/nhanes/nhanes2009-2010/DEMO_F.htm#WTINT2YR and http://www.cdc.gov/nchs/nhanes/nhanes2011-2012/DEMO_G.htm#WTINT2YR	drop
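
The selection marked in Table 1 could be sketched roughly as follows. This is only an illustration on a tiny made-up data frame (`toy` and all of its values are hypothetical, not NHANES data), and it assumes that rows are also restricted to participants aged 20 or older.

```r
# Tiny made-up stand-in for the NHANES data frame (values are hypothetical).
toy <- data.frame(
  ID        = 1:5,
  SurveyYr  = c("2009_10", "2009_10", "2011_12", "2011_12", "2011_12"),
  Age       = c(3, 17, 25, 44, 80),
  AgeDecade = c("0-9", "10-19", "20-29", "40-49", "70+"),
  HeadCirc  = c(41.2, NA, NA, NA, NA),
  BPSys1    = c(NA, 108, 118, 126, 142),
  BPSysAve  = c(NA, 106, 117, 125, 140)
)

# Columns marked "drop" in Table 1 (only those present in the toy data).
drop_cols <- c("ID", "SurveyYr", "AgeDecade", "HeadCirc", "BPSys1")

# Assumption of this sketch: keep adults only, then remove the dropped columns.
adults   <- toy[toy$Age >= 20, ]
selected <- adults[, setdiff(names(adults), drop_cols)]

print(selected)
```

Using `setdiff()` on the column names keeps the remaining columns in their original order and does not fail if a listed column is absent.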

These are the remaining columns in the NHANES dataset and the number of NA values in each column, respectively:

Table 2: Column names in NHANES data set, after removal of redundant and inapplicable columns, and number of NAs. (PregnantNow is the most extreme example, and is kept on purpose.)
	Column Name	Number of NAs
1	Gender	0
2	Age	0
3	Race1	0
4	Education	14
5	HHIncomeMid	603
6	Poverty	537
7	HomeRooms	55
8	HomeOwn	49
9	Work	1
10	Weight	57
11	Height	53
12	BMI	63
13	Pulse	254
14	BPSysAve	264
15	BPDiaAve	264
16	DirectChol	390
17	TotChol	390
18	UrineVol1	97
19	UrineFlow1	489
20	Diabetes	2
21	HealthGen	757
22	DaysPhysHlthBad	764
23	DaysMentHlthBad	762
24	LittleInterest	798
25	Depressed	794
26	SleepHrsNight	17
27	SleepTrouble	0
28	PhysActive	0
29	Alcohol12PlusYr	772
30	Smoke100	0
31	Smoke100n	0
32	PregnantNow	5539
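
Per-column NA counts like those in Table 2 can be obtained in base R with `colSums(is.na(...))`. A minimal sketch on a made-up data frame (the `toy` values are hypothetical):

```r
# Made-up example data; PregnantNow is NA for every male, as in the real data.
toy <- data.frame(
  Gender      = factor(c("female", "male", "female", "male")),
  Pulse       = c(72, NA, 80, 64),
  PregnantNow = factor(c("No", NA, NA, NA), levels = c("Yes", "No", "Unknown"))
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy))

na_table <- data.frame(Column = names(na_counts), NAs = as.integer(na_counts))
print(na_table)
```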

Variable imputation

Imputation is done with the MICE method: “Multiple Imputation by Chained Equations”.

The MICE method replaces all NA values with guesses predicted from the other columns. Here I have called the mice() function with default parameters, which is why it created nonsensical values in places, e.g. pregnant fathers (see below).
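
For readers unfamiliar with the package, a call with default settings looks roughly like this. The sketch uses the small `nhanes` example data that ships with mice (25 rows, not the NHANES package data used in this post); `seed` and `printFlag` are only set for reproducibility and quieter output.

```r
library(mice)

# `nhanes` is a small example data set bundled with mice
# (not the NHANES package data used in this post).
colSums(is.na(nhanes))

# Defaults: m = 5 imputed data sets, maxit = 5 iterations each.
imp <- mice(nhanes, seed = 123, printFlag = FALSE)

# Extract the first of the five completed data sets.
completed <- complete(imp, 1)

# No NA values should remain after imputation.
anyNA(completed)
```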

Table 3: Column ‘PregnantNow’, Before Imputation:
	Yes	No	Unknown	NA
female	72	1573	51	1987
male	0	0	0	3552
NA	0	0	0	0

Table 4: Column ‘PregnantNow’, After Imputation:
	Yes	No	Unknown
female	362	2930	391
male	197	2910	445
NA	0	0	0

PregnantNow is the most extreme example: 5539 of its 7235 values (about 77%) were NA.

This demonstrates that the default choices that the mice() function makes are not always suitable for this dataset. For now, I’ll leave these new values inside the imputed dataset though.
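
One possible way to avoid impossible values like these, sketched on made-up data, is to fill in structurally known cells before calling mice(), so that only genuinely unknown cells are left to impute. The recoding rule below (every male gets "No") is an assumption of this sketch, not something done in this post.

```r
# Made-up example data (hypothetical values).
toy <- data.frame(
  Gender      = factor(c("female", "male", "female", "male", "female")),
  PregnantNow = factor(c("Yes", NA, NA, NA, "No"),
                       levels = c("Yes", "No", "Unknown"))
)

# Before: all male values are NA, as in Table 3.
table(toy$Gender, toy$PregnantNow, useNA = "ifany")

# Assumption: a male participant is never pregnant, so recode instead of imputing.
toy$PregnantNow[toy$Gender == "male" & is.na(toy$PregnantNow)] <- "No"

# After: only the NA values of female participants remain to be imputed.
table(toy$Gender, toy$PregnantNow, useNA = "ifany")
```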

According to the authors of the mice package these decisions need to be made:

Assumption: Missing at random satisfied?
Target Model / intended Use of the data
Set of variables
Include dependent variables
Order of imputation
Initializations and number of iterations
Number of imputations (how many multiple imputations)
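
Several of these decisions map onto arguments of mice(). The sketch below uses the small `nhanes` example data bundled with mice, not the data from this post. The specific choices (dropping `hyp` as a predictor, a monotone visit sequence, 10 imputations, 10 iterations) are arbitrary placeholders, not recommendations.

```r
library(mice)

# Dry run with zero iterations to obtain the default settings.
ini  <- mice(nhanes, maxit = 0, printFlag = FALSE)
pred <- ini$predictorMatrix   # which variables predict which
meth <- ini$method            # imputation method per column

# Placeholder choice: do not use `hyp` to predict any other variable.
pred[, "hyp"] <- 0

imp <- mice(
  nhanes,
  predictorMatrix = pred,        # set of variables
  method          = meth,        # imputation model per column
  visitSequence   = "monotone",  # order of imputation
  maxit           = 10,          # number of iterations
  m               = 10,          # number of imputed data sets
  seed            = 123,
  printFlag       = FALSE
)
```

Starting from a `maxit = 0` run and editing its `predictorMatrix` and `method` avoids having to write these objects from scratch.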

(To be continued: imputation done better. The mice package is a lot more flexible and offers more options for achieving better imputations.)

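library(mice)
# quickpred() builds a sparse predictor matrix; minpuc and include are quickpred() arguments, not mice() arguments.
pred_qik <- quickpred(NHANES, minpuc = 0.25, include = "Age")
NHANES_imp_qik <- mice(NHANES, predictorMatrix = pred_qik, maxit = 1)
NHANES_imp_qik_complete <- mice::complete(NHANES_imp_qik)
# Check whether impossible combinations (e.g. pregnant males) remain:
# table(NHANES_imp_qik_complete$PregnantNow, NHANES_imp_qik_complete$Gender)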

NOTE: The imputation process with default values did not really produce meaningful values. Moreover, using all 32 remaining columns is a bit too much, both for modeling and for imputation. However, my goal was to try out model selection algorithms, and imputing correctly was less of a concern.

However, after imputation there are no more NAs in any column of the NHANES Dataset.

Supplementary Materials

These metadata show, for example, the distinct values of columns where it makes sense to report them (e.g. female/male for column “Gender”).

Table 5: NHANES Column data types, and some values of categorical variables.
	Levels	Storage
Gender	2	integer
Age	0	integer
Race1	5	integer
Education	5	integer
HHIncomeMid	0	integer
Poverty	0	double
HomeRooms	0	integer
HomeOwn	3	integer
Work	3	integer
Weight	0	double
Height	0	double
BMI	0	double
Pulse	0	integer
BPSysAve	0	integer
BPDiaAve	0	integer
DirectChol	0	double
TotChol	0	double
UrineVol1	0	integer
UrineFlow1	0	double
Diabetes	2	integer
HealthGen	5	integer
DaysPhysHlthBad	0	integer
DaysMentHlthBad	0	integer
LittleInterest	3	integer
Depressed	3	integer
SleepHrsNight	0	integer
SleepTrouble	2	integer
PhysActive	2	integer
Alcohol12PlusYr	2	integer
Smoke100	2	integer
Smoke100n	2	integer
PregnantNow	3	integer

## $Gender
## [1] "female" "male"  
## 
## $Race1
## [1] "Black"    "Hispanic" "Mexican"  "White"    "Other"   
## 
## $Education
## [1] "8th Grade"      "9 - 11th Grade" "High School"    "Some College"  
## [5] "College Grad"  
## 
## $HomeOwn
## [1] "Own"   "Rent"  "Other"
## 
## $Work
## [1] "Looking"    "NotWorking" "Working"   
## 
## $Diabetes
## [1] "No"  "Yes"
## 
## $HealthGen
## [1] "Excellent" "Vgood"     "Good"      "Fair"      "Poor"     
## 
## $LittleInterest
## [1] "None"    "Several" "Most"   
## 
## $Depressed
## LittleInterest
## 
## $SleepTrouble
## Diabetes
## 
## $PhysActive
## Diabetes
## 
## $Alcohol12PlusYr
## Diabetes
## 
## $Smoke100
## Diabetes
## 
## $Smoke100n
## [1] "Non-Smoker" "Smoker"    
## 
## $PregnantNow
## [1] "Yes"     "No"      "Unknown"
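
A summary like Table 5 and the level listing above can be approximated in base R. This sketch runs on a made-up data frame (hypothetical values), so the output format will differ from the listing shown here.

```r
# Made-up example data (hypothetical values).
toy <- data.frame(
  Gender = factor(c("female", "male", "female")),
  Age    = c(34L, 51L, 27L),
  BMI    = c(22.5, 30.1, 26.4)
)

# Number of factor levels (0 for non-factors) and storage type per column.
meta <- data.frame(
  Levels  = sapply(toy, nlevels),
  Storage = sapply(toy, typeof)
)
print(meta)

# Distinct values of the categorical columns only.
factor_levels <- lapply(Filter(is.factor, toy), levels)
print(factor_levels)
```

Factors are stored as integer codes, which is why categorical columns such as Gender show up with storage type integer in Table 5.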

Obligatory Disclaimer

(from the NHANES Package Documentation):

Please note that the data sets provided in this package are derived from the NHANES database and have been adapted for educational purposes. As such, they are NOT suitable for use as a research database.