Dummy Variables in R Programming
R is widely used for data mining, statistical analysis, and data visualization, and it supports common machine learning tasks such as regression and classification. Dummy coding is used in regression analysis to represent a categorical variable in numeric form. A dummy variable takes only the values 1 and 0: 1 indicates that an observation has a given characteristic and 0 indicates that it does not. Which category is coded as 1 is the analyst's choice. For example, a person may be recorded as either male or female, or discipline as either good or bad. For a gender column, two new columns are created: gender_m holds 1 if the person is male and 0 otherwise, and gender_f holds 1 if the person is female and 0 otherwise.
This article covers two ways to create dummy variables in R: the ifelse() function and the dummy_cols() function.
Using ifelse() function
The ifelse() function tests a condition on each element of a vector and returns one value where the condition is true and another where it is false, as given in its arguments. This makes it a simple way to create a dummy variable.
Syntax: ifelse(test, yes, no)
Parameters:
test: the condition to test
yes: the value returned where the condition is true
no: the value returned where the condition is false
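As a quick illustration of the syntax above, the sketch below applies ifelse() to a small made-up vector (the `gender` vector and the `is_male` name are hypothetical and not taken from any dataset). It shows that the test is evaluated element by element, and that a missing value in the input stays missing in the result.

```r
# Hypothetical input vector, used only for illustration
gender <- c("m", "f", NA, "m")

# The test runs on each element; an NA in the test gives NA in the result
is_male <- ifelse(gender == "m", 1, 0)

print(is_male)
```

If missing values should be coded as 0 instead, they need to be handled explicitly, for example by also checking is.na() in the test.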
Example 1:
```r
# Use the built-in PlantGrowth dataset
pg <- PlantGrowth

# Print the first 20 rows
cat("Original dataset:\n")
head(pg, 20)

# Create a dummy variable: 1 for the control group, 0 otherwise
pg$group_ctrl <- ifelse(pg$group == "ctrl", 1, 0)

# Print the first 20 rows again
cat("After creating dummy variable:\n")
head(pg, 20)
```
Output:
```
Original dataset:
   weight group
1    4.17  ctrl
2    5.58  ctrl
3    5.18  ctrl
4    6.11  ctrl
5    4.50  ctrl
6    4.61  ctrl
7    5.17  ctrl
8    4.53  ctrl
9    5.33  ctrl
10   5.14  ctrl
11   4.81  trt1
12   4.17  trt1
13   4.41  trt1
14   3.59  trt1
15   5.87  trt1
16   3.83  trt1
17   6.03  trt1
18   4.89  trt1
19   4.32  trt1
20   4.69  trt1
After creating dummy variable:
   weight group group_ctrl
1    4.17  ctrl          1
2    5.58  ctrl          1
3    5.18  ctrl          1
4    6.11  ctrl          1
5    4.50  ctrl          1
6    4.61  ctrl          1
7    5.17  ctrl          1
8    4.53  ctrl          1
9    5.33  ctrl          1
10   5.14  ctrl          1
11   4.81  trt1          0
12   4.17  trt1          0
13   4.41  trt1          0
14   3.59  trt1          0
15   5.87  trt1          0
16   3.83  trt1          0
17   6.03  trt1          0
18   4.89  trt1          0
19   4.32  trt1          0
20   4.69  trt1          0
```
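Since dummy coding is mainly used in regression analysis, here is a minimal sketch of how a 0/1 column like the one in Example 1 might enter a linear model. It uses the built-in PlantGrowth data and base R's lm(); the column name `is_ctrl` is introduced only for this illustration. With a single 0/1 predictor, the intercept equals the mean weight of the rows coded 0, and the slope equals the difference between the two group means.

```r
# Built-in dataset with a numeric weight and a three-level group factor
pg <- PlantGrowth

# Illustrative dummy: 1 for the control group, 0 for both treatment groups
pg$is_ctrl <- ifelse(pg$group == "ctrl", 1, 0)

# Fit a linear model with the dummy as the only predictor
fit <- lm(weight ~ is_ctrl, data = pg)
print(coef(fit))

# Compare with the group means computed directly
mean_other <- mean(pg$weight[pg$is_ctrl == 0])
mean_ctrl <- mean(pg$weight[pg$is_ctrl == 1])
print(c(intercept = mean_other, slope = mean_ctrl - mean_other))
```

The two printed vectors should agree, which is a useful sanity check on how the dummy was coded.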
Example 2:
```r
# Create a data frame
df <- data.frame(gender = c("m", "f", "m"),
                 age = c(19, 20, 20),
                 city = c("Delhi", "Mumbai", "Delhi"))

# Print the original data frame
print(df)

# Create one dummy variable per gender category
df$gender_m <- ifelse(df$gender == "m", 1, 0)
df$gender_f <- ifelse(df$gender == "f", 1, 0)

# Print the result
print(df)
```
Output:
```
  gender age   city
1      m  19  Delhi
2      f  20 Mumbai
3      m  20  Delhi
  gender age   city gender_m gender_f
1      m  19  Delhi        1        0
2      f  20 Mumbai        0        1
3      m  20  Delhi        1        0
```
Using dummy_cols() function
The dummy_cols() function comes from the fastDummies package. It creates dummy variables according to the arguments passed to it. If select_columns is not specified, dummy variables are created for every character and factor column in the data frame.
Syntax: dummy_cols(.data, select_columns = NULL)
Parameters:
.data: the data frame for which dummy columns are created
select_columns: the names of the columns to create dummy variables from
Example 1:
```r
# Install the required package (only needed once)
install.packages("fastDummies")

# Load the library
library(fastDummies)

# Use the built-in PlantGrowth dataset
data <- PlantGrowth

# Create dummy variables for the group column
data <- dummy_cols(data, select_columns = "group")

# Print the result
print(data)
```
Output:
```
   weight group group_ctrl group_trt1 group_trt2
1    4.17  ctrl          1          0          0
2    5.58  ctrl          1          0          0
3    5.18  ctrl          1          0          0
4    6.11  ctrl          1          0          0
5    4.50  ctrl          1          0          0
6    4.61  ctrl          1          0          0
7    5.17  ctrl          1          0          0
8    4.53  ctrl          1          0          0
9    5.33  ctrl          1          0          0
10   5.14  ctrl          1          0          0
11   4.81  trt1          0          1          0
12   4.17  trt1          0          1          0
13   4.41  trt1          0          1          0
14   3.59  trt1          0          1          0
15   5.87  trt1          0          1          0
16   3.83  trt1          0          1          0
17   6.03  trt1          0          1          0
18   4.89  trt1          0          1          0
19   4.32  trt1          0          1          0
20   4.69  trt1          0          1          0
21   6.31  trt2          0          0          1
22   5.12  trt2          0          0          1
23   5.54  trt2          0          0          1
24   5.50  trt2          0          0          1
25   5.37  trt2          0          0          1
26   5.29  trt2          0          0          1
27   4.92  trt2          0          0          1
28   6.15  trt2          0          0          1
29   5.80  trt2          0          0          1
30   5.26  trt2          0          0          1
```
Example 2:
```r
# Load the library
library(fastDummies)

# Create a data frame
df <- data.frame(gender = c("m", "f", "m"),
                 age = c(19, 20, 20),
                 city = c("Delhi", "Mumbai", "Delhi"))

# Create dummy variables.
# With select_columns = NULL (the default), all character
# and factor columns are used to create dummy variables.
df <- dummy_cols(df)

# Print the result
print(df)
```
Output:
```
  gender age   city gender_f gender_m city_Delhi city_Mumbai
1      m  19  Delhi        0        1          1           0
2      f  20 Mumbai        1        0          0           1
3      m  20  Delhi        0        1          1           0
```
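In a regression model with an intercept, a variable with k categories usually needs only k - 1 dummy columns, because the last one is fully determined by the others. The sketch below assumes the fastDummies package is installed and uses the remove_first_dummy argument of dummy_cols(), which is not covered in the syntax summary above; the `reduced` name is introduced only for this illustration.

```r
library(fastDummies)

# Same illustrative data frame as in Example 2
df <- data.frame(gender = c("m", "f", "m"),
                 age = c(19, 20, 20),
                 city = c("Delhi", "Mumbai", "Delhi"))

# Drop the first dummy of the selected column, keeping k - 1 dummies
reduced <- dummy_cols(df, select_columns = "gender", remove_first_dummy = TRUE)

print(reduced)
```

Here gender has two categories, so a single gender dummy column should remain alongside the original three columns.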
Example 3 (ifelse() with three categories):
```r
# Create a sample data frame
df <- data.frame(color = c("red", "green", "blue", "red", "green"))

# Create one dummy variable per color
df$color_red <- ifelse(df$color == "red", 1, 0)
df$color_green <- ifelse(df$color == "green", 1, 0)
df$color_blue <- ifelse(df$color == "blue", 1, 0)

# Show the updated data frame
df
```
Output:
```
  color color_red color_green color_blue
1   red         1           0          0
2 green         0           1          0
3  blue         0           0          1
4   red         1           0          0
5 green         0           1          0
```
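Writing one ifelse() call per category becomes repetitive as the number of categories grows. The sketch below generalizes the example above with a loop over the distinct values of a column, using base R only. The helper name `add_dummies` is hypothetical and not part of any package, and the sketch assumes the column has no missing values.

```r
# Sample data frame, as in the example above
df <- data.frame(color = c("red", "green", "blue", "red", "green"))

# Hypothetical helper: add one 0/1 column per distinct value of `column`
add_dummies <- function(data, column) {
  values <- as.character(unique(data[[column]]))
  for (value in values) {
    # Name each new column as <column>_<value>, e.g. color_red
    new_name <- paste(column, value, sep = "_")
    data[[new_name]] <- ifelse(data[[column]] == value, 1, 0)
  }
  data
}

df <- add_dummies(df, "color")
print(df)
```

The new columns appear in the order in which the values first occur in the data, so each row has exactly one dummy set to 1.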