Support for categorical data is important for any data analysis
tool. This summer I implemented categorical data capabilities for:
- Convenient and efficient data wrangling for categorical data in Daru
- Visualization of categorical data
- Multiple linear regression and generalized linear models (GLM) with categorical variables in Statsample and Statsample-GLM
Let's talk about each of them in detail.
Analyzing categorical data with Daru
Daru now recognizes categorical data and provides the procedures needed to work with it.
To analyze a categorical variable, simply convert the numerical vector to a categorical one and you are ready to go.
We will use, for demonstration purposes, animal shelter data taken
from the Kaggle Competition. It is
stored in shelter_data.
    # Tell Daru which variables are categorical
    shelter_data.to_category 'OutcomeType', 'AnimalType', 'SexuponOutcome', 'Breed', 'Color'

    # Or quantize a numerical variable to categorical
    shelter_data['AgeuponOutcome'] = shelter_data['AgeuponOutcome(Weeks)'].cut [0, 1, 4, 52, 260, 1500],
      labels: [:less_than_week, :less_than_month, :less_than_year, :one_to_five_years, :more_than__five_years]

    # Do your operations on categorical data
    shelter_data['AgeuponOutcome'].frequencies.sort ascending: false
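To make the quantization step concrete, here is a minimal plain-Ruby sketch of the idea, written without Daru. It is not Daru's implementation: the helper names `bin_label` and `label_frequencies`, the sample ages, the shortened labels and the choice of left-closed, right-open bins are all assumptions made for illustration.

```ruby
# Plain-Ruby sketch of quantizing numbers into labelled bins.
# Assumption: every bin includes its left edge and excludes its right edge,
# so the edges [0, 1, 4] describe the bins [0, 1) and [1, 4).
def bin_label(value, edges, labels)
  raise ArgumentError, 'need exactly one label per bin' unless labels.size == edges.size - 1

  # Position of the bin whose edges enclose the value, or nil if none does.
  idx = edges.each_cons(2).find_index { |left, right| value >= left && value < right }
  raise ArgumentError, "#{value} falls outside the bins" if idx.nil?

  labels[idx]
end

# Count how often each label occurs, most frequent first.
def label_frequencies(labels)
  counts = labels.each_with_object(Hash.new(0)) { |label, acc| acc[label] += 1 }
  counts.sort_by { |_label, count| -count }.to_h
end

EDGES  = [0, 1, 4, 52, 260, 1500].freeze
LABELS = %i[under_a_week under_a_month under_a_year one_to_five_years over_five_years].freeze

ages_in_weeks = [0.5, 3, 10, 60, 300, 2, 30] # made-up ages, not the shelter data
binned = ages_in_weeks.map { |age| bin_label(age, EDGES, LABELS) }
p label_frequencies(binned)
```

Values outside the outermost edges raise an error here rather than being dropped silently; that is a design choice of this sketch, not a statement about how Daru behaves.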
    shelter_data['Breed'].categories.size
    #=> 1380

    # Merge infrequent categories to make data analysis easy
    other_cats = shelter_data['Breed'].categories.select { |i| shelter_data['Breed'].count(i) < 10 }
    other_cats_hash = other_cats.zip(['other'] * other_cats.size).to_h
    shelter_data['Breed'].rename_categories other_cats_hash
    shelter_data['Breed'].frequencies

    # View the data
    shelter_data['Breed'].frequencies.sort(ascending: false).head(10)
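The same merging idea can be sketched in plain Ruby on an ordinary array, which may help when reasoning about what the renaming step does. The helper `merge_rare_categories`, its keyword arguments and the sample breeds are hypothetical and not part of Daru.

```ruby
# Plain-Ruby sketch: replace every category seen fewer than `min_count`
# times with a single catch-all label.
def merge_rare_categories(values, min_count:, other: 'other')
  counts = values.each_with_object(Hash.new(0)) { |value, acc| acc[value] += 1 }
  values.map { |value| counts[value] < min_count ? other : value }
end

# Made-up sample, not the shelter data.
breeds = %w[Labrador Poodle Labrador Akita Labrador Poodle Beagle]
merged = merge_rare_categories(breeds, min_count: 2)
puts merged.uniq.join(', ')
# prints: Labrador, Poodle, other
```

With a threshold of 2, the two breeds that occur only once collapse into `other`, so four categories become three.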
With the help of Nyaplot, GnuplotRB and Gruff, Daru can now visualize categorical data just as it does numerical data.
To plot a vector with Nyaplot, call the #plot method.
    # dv is a categorical vector
    dv = Daru::Vector.new(['III']*10 + ['II']*5 + ['I']*5,
      type: :category,
      categories: ['I', 'II', 'III'])

    dv.plot(type: :bar, method: :fraction) do |p, d|
      p.x_label 'Categories'
      p.y_label 'Fraction'
    end
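For intuition about the bar heights, the sketch below computes per-category fractions in plain Ruby for the same values. It assumes that the fraction method plots each category's share of all observations; the helper name `category_fractions` is made up for this example.

```ruby
# Plain-Ruby sketch: share of observations that falls into each category.
# Categories are passed explicitly so the result keeps their declared order.
def category_fractions(values, categories)
  counts = values.each_with_object(Hash.new(0)) { |value, acc| acc[value] += 1 }
  categories.map { |cat| [cat, counts[cat].fdiv(values.size)] }.to_h
end

values = ['III'] * 10 + ['II'] * 5 + ['I'] * 5
category_fractions(values, ['I', 'II', 'III']).each do |category, fraction|
  puts "#{category}: #{fraction}"
end
```

For this vector, categories I and II each account for 0.25 of the observations and III for 0.5.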
Given a dataframe, one can draw a scatter plot in which the color, shape and size of the points vary according to a categorical variable.
    # df is a dataframe with categorical variable :c
    df = Daru::DataFrame.new({
      a: [1, 2, 4, -2, 5, 23, 0],
      b: [3, 1, 3, -6, 2, 1, 0],
      c: ['I', 'II', 'I', 'III', 'I', 'III', 'II']
    })
    df.to_category :c

    df.plot(type: :scatter, x: :a, y: :b, categorized: {by: :c, method: :color}) do |p, d|
      p.xrange [-10, 10]
      p.yrange [-10, 10]
    end
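Conceptually, a categorized scatter plot splits the points into one series per category and gives each series its own style. The plain-Ruby sketch below shows that grouping for the same data; it is only an illustration of the idea, and `points_by_category`, `color_map` and the palette are hypothetical names and values, not what the plotting libraries do internally.

```ruby
PALETTE = %w[red green blue orange purple].freeze # hypothetical colors

# Group (x, y) pairs by the category of the row they came from.
def points_by_category(xs, ys, cats)
  xs.zip(ys, cats)
    .group_by { |_x, _y, cat| cat }
    .transform_values { |rows| rows.map { |x, y, _cat| [x, y] } }
end

# Assign each distinct category a color, cycling if the palette runs out.
def color_map(categories, palette = PALETTE)
  categories.uniq.each_with_index.map { |cat, i| [cat, palette[i % palette.size]] }.to_h
end

a = [1, 2, 4, -2, 5, 23, 0]
b = [3, 1, 3, -6, 2, 1, 0]
c = %w[I II I III I III II]

colors = color_map(c)
points_by_category(a, b, c).each do |category, points|
  puts "#{category} (#{colors[category]}): #{points.inspect}"
end
```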
Gnuplot and Gruff support plotting categorical variables in a similar manner.
I also added Gruff support to Daru, so vectors and dataframes can now be plotted with Gruff as well.
See more notebooks on visualizing categorical data with Daru
here.
Regression with categorical data
Now categorical data is supported in multiple linear regression and
generalized linear models (GLM) in
Statsample and
Statsample-GLM.
A new formula language (like that used in R or
Patsy) has been introduced to ease
the task of specifying regressions.
Now there’s no need to manually create a dataframe for regression.