SciRuby

Tools for Scientific Computing in Ruby

GSoC 2016: Adding Categorical Data Support

Support for categorical data is important for any data analysis tool. This summer I implemented categorical data capabilities for:

  • Convenient and efficient data wrangling for categorical data in Daru
  • Visualization of categorical data
  • Multiple linear regression and generalized linear models (GLM) with categorical variables in Statsample and Statsample-GLM

Let's talk about each of them in detail.

Analyzing categorical data with Daru

Daru now recognizes categorical data and provides the methods needed to work with it.

To analyze a categorical variable, simply convert the vector to a categorical one and you are ready to go.

We will use, for demonstration purposes, animal shelter data taken from the Kaggle Competition. It is stored in shelter_data.

# Tell Daru which variables are categorical
shelter_data.to_category 'OutcomeType', 'AnimalType', 'SexuponOutcome', 'Breed', 'Color'

# Or quantize a numerical variable to categorical
shelter_data['AgeuponOutcome'] = shelter_data['AgeuponOutcome(Weeks)'].cut [0, 1, 4, 52, 260, 1500],
    labels: [:less_than_week, :less_than_month, :less_than_year, :one_to_five_years, :more_than_five_years]

# Do your operations on categorical data
shelter_data['AgeuponOutcome'].frequencies.sort ascending: false
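
To make the quantization step concrete, here is a minimal plain-Ruby sketch of what binning a numerical value into labelled intervals amounts to. It is an illustration only, not Daru's implementation: the helper `bin_label` is hypothetical, and the sketch assumes each bin is the half-open interval [lower, upper), so Daru's handling of boundary values may differ.

```ruby
# Plain-Ruby illustration of binning; this is not Daru's implementation.
# Assumption: each bin is the half-open interval [lower, upper).
def bin_label(value, edges, labels)
  raise ArgumentError, 'need one label per interval' unless labels.size == edges.size - 1

  edges.each_cons(2).with_index do |(lower, upper), i|
    return labels[i] if value >= lower && value < upper
  end
  nil # the value falls outside every interval
end

edges  = [0, 1, 4, 52, 260, 1500]
labels = [:less_than_week, :less_than_month, :less_than_year,
          :one_to_five_years, :more_than_five_years]

# Hypothetical ages in weeks, one per bin
ages_in_weeks = [0.5, 3, 30, 104, 600]
binned = ages_in_weeks.map { |age| bin_label(age, edges, labels) }
```

With six edges there are five intervals, which is why the call above passes exactly five labels.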

shelter_data['Breed'].categories.size
#=> 1380
# Merge infrequent categories to make data analysis easy
other_cats = shelter_data['Breed'].categories.select { |i| shelter_data['Breed'].count(i) < 10 }
other_cats_hash = other_cats.zip(['other']*other_cats.size).to_h
shelter_data['Breed'].rename_categories other_cats_hash
shelter_data['Breed'].frequencies
# View the data
shelter_data['Breed'].frequencies.sort(ascending: false).head(10)
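
The idea behind merging infrequent categories can be sketched in plain Ruby without Daru. The helper `merge_rare` and the toy breed list below are hypothetical and only illustrate the technique: count each value, then relabel anything seen fewer than a threshold number of times.

```ruby
# Plain-Ruby illustration of merging rare categories; not Daru's implementation.
def merge_rare(values, min_count, other_label = 'other')
  counts = values.each_with_object(Hash.new(0)) { |v, acc| acc[v] += 1 }
  values.map { |v| counts[v] < min_count ? other_label : v }
end

# Hypothetical toy data standing in for the 'Breed' column
breeds = ['Pit Bull Mix'] * 3 + ['Siamese Mix'] * 2 + ['Akita', 'Beagle']
merged = merge_rare(breeds, 2)
# 'Akita' and 'Beagle' each occur once, so both are relabelled 'other'
```

Collapsing the long tail this way keeps the number of categories manageable, which matters later because each category can become its own regression coefficient.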

Please refer to this blog post to learn more.

Visualizing categorical data

With the help of Nyaplot, GnuplotRB and Gruff, Daru can now visualize categorical data just as it does numerical data.

To plot a vector with Nyaplot, call its #plot method.

# dv is a categorical vector
dv = Daru::Vector.new ['III']*10 + ['II']*5 + ['I']*5, type: :category, categories: ['I', 'II', 'III']

dv.plot(type: :bar, method: :fraction) do |p, d|
  p.x_label 'Categories'
  p.y_label 'Fraction'
end
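
For intuition about what a bar plot of fractions displays, the sketch below computes the share of each category in plain Ruby. The helper `category_fractions` is hypothetical, and I am assuming that the `:fraction` option plots count divided by total; this is an illustration, not Daru's plotting code.

```ruby
# Plain-Ruby illustration: share of each category among all values.
# Assumption: a "fraction" bar is count / total for that category.
def category_fractions(values)
  counts = values.each_with_object(Hash.new(0)) { |v, acc| acc[v] += 1 }
  total = values.size.to_f
  counts.map { |category, count| [category, count / total] }.to_h
end

values = ['III'] * 10 + ['II'] * 5 + ['I'] * 5
fractions = category_fractions(values)
# 'III' makes up half of the data; 'II' and 'I' a quarter each
```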

Given a dataframe, one can draw a scatter plot in which the color, shape, and size of the points vary according to a categorical variable.

# df is a dataframe with categorical variable :c
df = Daru::DataFrame.new({
  a: [1, 2, 4, -2, 5, 23, 0],
  b: [3, 1, 3, -6, 2, 1, 0],
  c: ['I', 'II', 'I', 'III', 'I', 'III', 'II']
  })
df.to_category :c

df.plot(type: :scatter, x: :a, y: :b, categorized: {by: :c, method: :color}) do |p, d|
  p.xrange [-10, 10]
  p.yrange [-10, 10]
end
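
Conceptually, a categorized scatter plot splits the points into one series per category and then styles each series differently. The plain-Ruby sketch below shows only that grouping step on the same data. The helper `points_by_category` is hypothetical and this is not how Daru or Nyaplot implement it.

```ruby
# Plain-Ruby illustration of grouping (x, y) points by a categorical column.
def points_by_category(xs, ys, categories)
  xs.zip(ys, categories)
    .group_by { |_x, _y, category| category }
    .map { |category, triples| [category, triples.map { |x, y, _c| [x, y] }] }
    .to_h
end

a = [1, 2, 4, -2, 5, 23, 0]
b = [3, 1, 3, -6, 2, 1, 0]
c = ['I', 'II', 'I', 'III', 'I', 'III', 'II']

series = points_by_category(a, b, c)
# Each key is a category; each value is the list of [x, y] points that a
# plotting library would draw with that category's color, shape, or size.
```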

Categorical variables can be plotted in a similar manner with GnuplotRB and Gruff.

I also integrated Gruff with Daru, so vectors and dataframes can now be plotted with Gruff as well.

See more notebooks on visualizing categorical data with Daru here.

Regression with categorical data

Categorical data is now supported in multiple linear regression and generalized linear models (GLM) in Statsample and Statsample-GLM.

A new formula language (like that used in R or Patsy) has been introduced to ease the task of specifying regressions.

Now there’s no need to manually create a dataframe for regression.

require 'statsample-glm'

formula = 'OutcomeType_Adoption~AnimalType+Breed+AgeuponOutcome(Weeks)+Color+SexuponOutcome'
glm_adoption = Statsample::GLM::Regression.new formula, train, :logistic
glm_adoption.model.coefficients :hash

#=> {:AnimalType_Cat=>0.8376443692275163, :"Breed_Pit Bull Mix"=>0.28200753488859803, :"Breed_German Shepherd Mix"=>1.0518504638731023, :"Breed_Chihuahua Shorthair Mix"=>1.1960242033878856, :"Breed_Labrador Retriever Mix"=>0.445803000000512, :"Breed_Domestic Longhair Mix"=>1.898703165797653, :"Breed_Siamese Mix"=>1.5248210169271197, :"Breed_Domestic Medium Hair Mix"=>-0.19504965010288533, :Breed_other=>0.7895601504638325, :"Color_Blue/White"=>0.3748263925801828, :Color_Tan=>0.11356334165122918, :"Color_Black/Tan"=>-2.6507089126322114, :"Color_Blue Tabby"=>0.5234717706465536, :"Color_Brown Tabby"=>0.9046099720184905, :Color_White=>0.07739310267363662, :Color_Black=>0.859906249787038, :Color_Brown=>-0.003740755055106689, :"Color_Orange Tabby/White"=>0.2336674067343927, :"Color_Black/White"=>0.22564205490196415, :"Color_Brown Brindle/White"=>-0.6744314269278774, :"Color_Orange Tabby"=>2.063785952843677, :"Color_Chocolate/White"=>0.6417921901449108, :Color_Blue=>-2.1969040091451704, :Color_Calico=>-0.08386525532631824, :"Color_Brown/Black"=>0.35936722899161305, :Color_Tricolor=>-0.11440457799048752, :"Color_White/Black"=>-2.3593561796090383, :Color_Tortie=>-0.4325130799770577, :"Color_Tan/White"=>0.09637439333330515, :"Color_Brown Tabby/White"=>0.12304448360566177, :"Color_White/Brown"=>0.5867441296328475, :Color_other=>0.08821407092892847, :"SexuponOutcome_Spayed Female"=>0.32626712478395975, :"SexuponOutcome_Intact Male"=>-3.971505056680895, :"SexuponOutcome_Intact Female"=>-3.619095491410668, :SexuponOutcome_Unknown=>-102.73807712615843, :"AgeuponOutcome(Weeks)"=>-0.006959545305620043}
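
Coefficient names such as :AnimalType_Cat suggest that each categorical variable is expanded into 0/1 indicator columns, with one level left out as the baseline. The plain-Ruby sketch below illustrates that kind of dummy coding. The helper `dummy_code` is hypothetical, and the choice of the first level encountered as the baseline is my assumption; the post does not say how Statsample picks it.

```ruby
# Plain-Ruby illustration of dummy coding; not Statsample's implementation.
# Assumption: the first level encountered is the baseline and gets no column.
def dummy_code(name, values)
  levels = values.uniq
  baseline = levels.first
  (levels - [baseline]).map do |level|
    ["#{name}_#{level}".to_sym, values.map { |v| v == level ? 1 : 0 }]
  end.to_h
end

# Hypothetical toy column standing in for 'AnimalType'
animal_type = %w[Dog Cat Cat Dog Cat]
columns = dummy_code('AnimalType', animal_type)
# A variable with k levels yields k - 1 indicator columns
```

Under this coding, a coefficient is read relative to the omitted baseline level, which is why merging rare breeds and colors beforehand keeps the model at a manageable size.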

Additionally, through the work of Alexej Grossmann, one can also predict on new data using the model.

predict = glm_adoption.predict test
predict.map! { |i| i < 0.5 ? 0 : 1 }
predict.head 5
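
The thresholding step above treats each prediction as a probability. As a rough sketch of where such a probability comes from in a logistic model, the plain-Ruby code below applies the logistic function to a linear predictor and then applies the same 0.5 cutoff. The helpers `logistic` and `classify` and the input values are hypothetical; this is not Statsample-GLM's code.

```ruby
# Plain-Ruby illustration of logistic prediction and thresholding.
def logistic(z)
  1.0 / (1.0 + Math.exp(-z))
end

def classify(probabilities, threshold = 0.5)
  probabilities.map { |prob| prob < threshold ? 0 : 1 }
end

# Hypothetical linear predictors (intercept plus the sum of coefficient * value)
linear_predictors = [-2.0, 0.0, 1.5]
probabilities = linear_predictors.map { |z| logistic(z) }
classes = classify(probabilities)
# A probability of exactly 0.5 is assigned to class 1, matching `i < 0.5 ? 0 : 1`
```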

This, I believe, makes Statsample-GLM very convenient to use.

See this for a complete example.

Other

In addition to the aforementioned, there are some other considerable changes:

Documentation

You can read about all my work in detail here. Additionally, my project page can be found here.

I hope these additions help you see your data more clearly with Daru.
