SciRuby

Tools for Scientific Computing in Ruby

GSoC 2017 : Data Visualization Using Daru-view

Hello, Rubyist! Are you looking for a good Ruby gem for interactive data visualization? Then you should try daru-view, a plugin gem for daru.

What makes daru-view different ?

  • daru-view is designed for interactive plotting of charts and tables. It provides different plotting tools like Nyaplot, HighCharts, GoogleCharts and DataTables, so you don’t have to write any JavaScript code yourself or shift to another language to get charts.

  • It can work with any Ruby web application framework like Rails/Sinatra/Nanoc/Hanami. If you want to try a few examples, please look into the daru-view/spec/dummy_* examples for Rails, Sinatra and Nanoc web applications.

  • Ruby developers now use the IRuby notebook for interactive programming, and daru-view supports the IRuby notebook as well. So if you just want to see a chart for some DataFrame or Array of data, you can use daru-view.

  • daru-view can generate chart images to download and save.

  • The daru-view adapters googlecharts and highcharts are able to generate 3D charts as well.

  • Tables have key features like pagination and search, with many more to be added. They are designed to load large data sets smoothly.

Introduction

Daru already does a good job of data analysis & manipulation, both in the IRuby notebook and in the backend of web applications. Ruby web application frameworks like Ruby on Rails, Sinatra and Nanoc are popular. So if Ruby developers have a gem like daru that can handle both data analysis and visualization within their applications, there is no need to shift to another language or reach for another gem.

My project for GSoC 2017 was to “make Daru more ready for integration with modern Web framework” in terms of visualization.

To improve the way data is viewed, daru-view, a plugin gem for daru, was created. daru-view is for easy and interactive plotting in web applications & the IRuby notebook. It works in frameworks like Rails, Sinatra and Nanoc, and hopefully in others too.

To see a quick overview of daru-view’s features, have a look at these examples:

Examples

This is how we can create a Plot class object:

Daru::View::Plot.new(data, options)

Note: You should have some knowledge of the plotting tool you want to use with daru-view, so that you can pass the correct options.

GoogleCharts:

Set the plotting library to :googlecharts to use this adapter. This will load the required js files in your webpage or IRuby notebook.

require 'daru/view'
Daru::View.plotting_library = :googlecharts

Let’s create a DataFrame :

idx = Daru::Index.new ['Year', 'Sales']
data_rows = [
          ['2004',  1000],
          ['2005',  1170],
          ['2006',  660],
          ['2007',  1030]
]
df_sale_exp = Daru::DataFrame.rows(data_rows)
df_sale_exp.vectors = idx

# perform data manipulations, if you want.

Now time to plot it:

line_basic_chart = Daru::View::Plot.new(df_sale_exp)
line_basic_chart.chart

This will return the chart object we created using GoogleCharts. In IRuby notebook, you will see this:

Basic line chart using GoogleCharts

You can find the IRuby notebook example in this link.

There are various chart types we can use, e.g. line, area, bar, bubble, candlestick, combo, histogram, org, pie, stepped area, timeline, treemap, gauge, column, scatter, etc. We can find the customization options on the Google Charts site.
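
For example, chart options can be passed as the second argument to Plot.new. The sketch below reuses the df_sale_exp DataFrame from above; the title, height and width keys follow Google Charts' configuration option names and are assumed to be forwarded to the chart as-is.

# Reusing df_sale_exp from the line chart example above
column_chart = Daru::View::Plot.new(
  df_sale_exp,
  type: :column, title: 'Yearly sales', height: 400, width: 800
)
column_chart.chart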

Let me try another chart type, Geo:

country_population = [
          ['Germany', 200],
          ['United States', 300],
          ['Brazil', 400],
          ['Canada', 500],
          ['France', 600],
          ['RU', 700]
]

df_cp = Daru::DataFrame.rows(country_population)
df_cp.vectors = Daru::Index.new(['Country', 'Population'])

geochart = Daru::View::Plot.new(
    df_cp, type: :geo, adapter: :googlecharts
)

Note: If you have already loaded the dependent JS files for the library, then you can simply pass adapter: :googlecharts in your Plot initialization.

Basic Geo chart using GoogleCharts

HighCharts:

Set the plotting library to :highcharts to use this adapter. This will load the required js files in your webpage or IRuby notebook.

require 'daru/view'
Daru::View.plotting_library = :highcharts

Let’s pass the data in the format HighCharts supports (we can pass a DataFrame as well):

data = [
    ['Firefox',   45.0],
    ['IE',       26.8],
    {
       :name=> 'Chrome',
       :y=> 12.8,
       :sliced=> true,
       :selected=> true
    },
    ['Safari',    8.5],
    ['Opera',     6.2],
    ['Others',   0.7]
]
plt_pie = Daru::View::Plot.new data, type: :pie

This will return the Plot object we created. In IRuby notebook, you will see this:

Basic pie chart using HighCharts

You can find the IRuby notebook example in this link.

There are various chart types we can use, e.g. line, area, bar, bubble, dynamic chart, pie, column, scatter, etc. We can find the customization options on the HighCharts site.
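
HighCharts options can be passed in the same way. The sketch below reuses the browser-share data from the pie chart example; the chart and title keys follow HighCharts' configuration names and are assumed to be forwarded to the library as-is.

# Reusing `data` from the pie chart example above
opts = {
  chart: { type: 'column' },
  title: { text: 'Browser market shares' }
}
column_chart = Daru::View::Plot.new(data, opts)
column_chart.chart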

Nyaplot

require 'daru/view'

#  set adapter
Daru::View.plotting_library = :nyaplot

# define dataframe
df = Daru::DataFrame.new({
  a: [1, 2, 4, -2, 5, 23, 0],
  b: [3, 1, 3, -6, 2, 1, 0],
  c: ['I', 'II', 'I', 'III', 'I', 'III', 'II']
  })
df.to_category :c

# creating scatter chart
scatter_chart = Daru::View::Plot.new(df, type: :scatter, x: :a, y: :b, categorized: {by: :c, method: :color})

In IRuby notebook:

Basic scatter chart using Nyaplot

GoogleChart data table

require 'daru/view'

#  set adapter
# You don't need this line if you are already using Google Charts for plotting.
# It is just for loading the dependent js files.
Daru::View.table_library = :googlechart

# Lets use array as `data` (we can pass Daru::DataFrame as well)
data = [
  ['Galaxy', 'Distance', 'Brightness'],
          ['Canis Major Dwarf', 8000, 230.3],
          ['Sagittarius Dwarf', 24000, 4000.5],
          ['Ursa Major II Dwarf', 30000, 1412.3],
          ['Lg. Magellanic Cloud', 50000, 120.9],
          ['Bootes I', 60000, 1223.1]
  ]
galaxy_table = Daru::View::Table.new(data)
galaxy_table.table

This will return the table object we created using GoogleCharts tool. In IRuby notebook, you will see this:

Basic table using GoogleCharts

We can create a table using Vectors as well.

dv = Daru::Vector.new [43934, 52503, 57177, 69658, 97031, 119931, 137133, 154175]

# adding pagination and some customization [optional]
opts_pagination = {
  width: '100%', height: '100%',
  pageSize: 5,
}

table_vec = Daru::View::Table.new(dv, opts_pagination)

In IRuby notebook:

Basic vector table using GoogleCharts

DataTable

require 'daru/view'

#  set adapter.
Daru::View.table_library = :datatables

# Lets use Daru::DataFrame as `data` (we can pass Array as well)
df = Daru::DataFrame.new(
  {
    b: [11,12,13,14,15],
    a: [1,2,3,4,5],
    c: [11,22,33,44,55]
  },
    order: [:a, :b, :c],
    index: [:one, :two, :three, :four, :five]
)
df_datatable = Daru::View::Table.new(df, pageLength: 3)

Currently there is a problem displaying it in the IRuby notebook, but in a web application you can see something like this using df_datatable.div:

Basic table using Datatables

How to use it in Ruby web application

As we know, we can get the HTML and JS code for a chart from a Daru::View::Plot or Daru::View::Table object using the #div method. We then just need to add that HTML and JS code in the proper place in the webpage.
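
For example (a minimal sketch; df_sale_exp stands for any DataFrame you have prepared, and the chart options are illustrative):

chart = Daru::View::Plot.new(df_sale_exp, type: :line, adapter: :googlecharts)

# String containing the <div> and <script> for this chart;
# embed it in the page body wherever the chart should appear.
chart_html = chart.div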

There are a few things to note:

  1. In the layout of the webpage, you have to load all the dependent JS files, so that the generated HTML and JS code works smoothly in that webpage. You can load the dependent JS files for the Nyaplot library using Daru::View.dependent_script(:nyaplot), and similarly for the other libraries (see the layout sketch after this list).

  2. If you are using multiple libraries in one webpage, load the dependent JS files of each of them in that webpage's layout (generally in the head tag).
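
In a Rails layout this could look roughly like the following ERB sketch (the library symbols passed to dependent_script are illustrative; load only the libraries you actually use):

<!-- app/views/layouts/application.html.erb -->
<head>
  <%= raw Daru::View.dependent_script(:googlecharts) %>
  <%= raw Daru::View.dependent_script(:highcharts) %>
</head>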

We can set the default adapter using Daru::View.plotting_library = :googlecharts, and we can also override it for a particular object at initialization, i.e. Daru::View::Plot.new(data, {adapter: :googlecharts}). We just have to make sure the dependent JS files for that adapter are loaded.

To make this easy, we have defined daru_chart (which works the same as Daru::View::Plot.new) and daru_table (which works the same as Daru::View::Table.new) for Rails applications.

So you can easily use them in a controller or view of the application. For reference you can check the demo Rails app.
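
A view using these helpers could look roughly like this (a sketch only; @df is assumed to be a DataFrame prepared in the controller, and the options shown are illustrative):

<!-- app/views/sales/index.html.erb -->
<%= raw daru_chart(@df, type: :line).div %>
<%= raw daru_table(@df, pageSize: 10).div %>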

Design of daru-view

daru-view currently uses Nyaplot, HighCharts and GoogleCharts for plotting charts. It also generates tables using DataTables and GoogleCharts, with pagination, search and various other features.

Design Pattern in daru-view

daru-view mainly uses the adapter design pattern and composite design pattern.

  • Why Adapter design pattern:

    • The motivation of the adapter pattern is that we can reuse existing gems by adapting their interfaces.

    • daru-view joins functionalities of independent or incompatible interfaces of different gems.

    • daru-view has Plot and Table classes, which delegate to an adapter once the adapter (the library to be used for plotting) is set for the Plot or Table instance (see the sketch after this list).

  • Why Composite design pattern:

    • To define common objects and use them to define composite objects.

    • In daru-view we try to write common functions in a module and include it whenever needed.
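
As a rough, hypothetical sketch of the adapter idea (not the actual daru-view source), a Plot-like class can simply look up an adapter object by name and forward calls to it:

# Hypothetical, simplified dispatch; class and method names are illustrative.
class SketchPlot
  attr_reader :adapter, :chart

  def initialize(data, options = {})
    lib = options.delete(:adapter) || :googlecharts
    # e.g. :googlecharts -> GooglechartsAdapter (a module implementing init/generate_html)
    @adapter = Object.const_get("#{lib.to_s.capitalize}Adapter")
    @chart   = @adapter.init(data, options)
  end

  def div
    @adapter.generate_html(@chart)
  end
end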

Implementation

daru-view ensures that its functions are usable both in the IRuby notebook and in Ruby web application frameworks.

The main thing we need in order to display something in a web application or the IRuby notebook is its HTML code. daru-view generates the HTML code of the chart or table, and the same code can be used for display in web applications & the IRuby notebook.

These are the libraries currently used in daru-view:

Nyaplot

Nyaplot is a good library for visualization, but on its own it works only in the IRuby notebook. When we use Nyaplot as the adapter in daru-view, it is usable in both the IRuby notebook and web applications. A Daru DataFrame or Vector is used as the data source of the chart, and it works similarly to daru's original plotting system.

If the user wants to use Nyaplot methods directly, that can be done on the Nyaplot object. We can get the Nyaplot object using daru_plot_obj.chart.

i.e.

daru_view_obj = Daru::View::Plot.new(
                  daru_dataframe, options={adapter: :nyaplot})
nyaplot_obj = daru_view_obj.chart

Now the user can call all the methods of the Nyaplot object. The same applies to every other adapter in daru-view.

HighCharts

To add the HighCharts features for plotting various chart types, daru-view uses the lazy_high_charts gem with additional features.

In this adapter the data source can be an Array of data, a Daru::DataFrame, a Daru::Vector, or the HTML table code of the data.

There are various options in HighCharts. One can see the usable options in the HighCharts demos, and they can be used directly in a daru-view Plot.

The HighCharts adapter can also work offline in daru-view. Developers can automatically update the JS files saved in daru-view using a rake task.

If you are familiar with the lazy_high_charts gem and want to use it to configure the chart, you can access the lazy_high_charts object using Daru::View::Plot#chart and perform the necessary operations.

GoogleCharts

To add the GoogleCharts features for plotting various chart types, daru-view uses the google_visualr gem with additional features (more new features have been added in this module).

We want the GoogleCharts adapter to be very robust, since the Google Charts tool is updated frequently and has excellent plotting features. As with the HighCharts module, we can use all the options described on the Google Charts website.

Users can access the google_visualr object using Daru::View::Plot#chart if they want to call google_visualr methods.

GoogleCharts as data table

One of the good things about the Google Charts tool is that it can also be used for generating tables, with pagination and other features, for web applications and the IRuby notebook.

Daru::View::Plot can take an Array, a Daru::DataFrame, a Daru::Vector, or a Daru::View::Table as its data source.

Daru::View::Table can take an Array, a Daru::DataFrame, or a Daru::Vector as its data source.
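
For instance, a paginated GoogleCharts table can be built straight from a DataFrame (a small sketch, assuming Daru::View.table_library = :googlechart has been set as shown earlier; pageSize is the same option used in the Vector example above):

df_sales = Daru::DataFrame.new(
  { year: %w[2004 2005 2006 2007], sales: [1000, 1170, 660, 1030] }
)

sales_table = Daru::View::Table.new(df_sales, pageSize: 2)
sales_table.table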

DataTables

DataTables adds interaction controls to any HTML table. It can handle large sets of data and has many cool features.

To use it, daru-view uses https://github.com/Shekharrajak/data_tables gem. [Note: the gem name will be changed in near future]

It basically takes the HTML table code and adds the features the user wants. So internally, the HTML table code of the Daru::DataFrame or Daru::Vector is passed as the data source parameter.

Future Work

daru-view will become more powerful and simpler in the near future. Developers can easily add more libraries to daru-view if required; to add a library, follow the steps given in CONTRIBUTING.md.

Conclusion

The aim of daru-view is to plot charts easily in the IRuby notebook and in Ruby web applications, so that developers do not have to use any other gem or language for visualization.

It works smoothly in the Rails/Sinatra/Nanoc web frameworks, and I hope it can work in other Ruby frameworks as well, because daru-view generates the HTML and JavaScript code for the chart, which is all a webpage fundamentally needs.

Why not use the plotting libraries directly?

If you are using the daru gem for analyzing data and want to visualize it, it is convenient to have data visualization within daru itself and to plot directly from daru's DataFrame/Vector objects.

daru-view helps in plotting charts and tables directly from a Daru::DataFrame or Daru::Vector. daru-view currently uses Nyaplot, HighCharts and Google Charts to plot charts, so the user can set the plotting library and get the chart accordingly.

Most plotting libraries don't provide the ability to plot charts in the IRuby notebook; they are designed only for web applications (mostly for Rails). But daru-view can plot charts in any Ruby web application as well as in the IRuby notebook.

Acknowledgements

I would like to thank my mentors Sameer Deshmukh, Lokesh Sharma and Victor Shepelev for their responsiveness and support, and I am very grateful to the Ruby Science Foundation for this golden opportunity.

I thank my fellow GSoC participants Athitya Kumar and Prasun Anand for their support and discussions on various topics.

Thanks to Google for conducting Google Summer of Code.

GSoC 2017 : Support to Import & Export of More Formats

Introduction

“Hello friend. Hello friend? That’s lame.” - S01E01 (Pilot), Mr.Robot

My name is Athitya Kumar, and I’m a 4th year undergrad from IIT Kharagpur, India. I was selected as a GSoC 2017 student developer by Ruby Science Foundation for project daru-io.

Daru-IO is a plugin gem to the Daru gem that adds support for many import and export methods on Daru::DataFrame. This gem is intended to help Rubyists who are into Data Analysis or Web Development, by serving as a general purpose conversion library.

Through this summer, I worked on adding support for various Importers and Exporters while also porting some existing modules. Feel free to find a comprehensive set of useful links in Final Work Submission and README. Before proceeding any further, you might also be interested in checking out a sample showcase of Rails example and the code making it work.

Mark Anthony’s Speech (ft. daru)

“Rubyists, Data Analysts and Web Developers, lend me your ears;

I come to write about my GSoC project, not to earn praise for it.”

For the uninitiated, Google Summer of Code (GSoC) 2017 is a 3-month program that focuses on introducing selected students to open-source software development. To know more about GSoC, feel free to click here.

daru is a Ruby gem that stands for Data Analysis in RUby. My initial proposal was to make daru easier to integrate with Ruby web frameworks through better import-export features (daru-io) and visualization methods (daru-view). However, as both Shekhar and I were selected for the same proposal, we split this amongst ourselves : daru-io was allocated to me and daru-view was allocated to Shekhar.

“The open-source contributions that people do, live after them;

But their private contributions, are oft interred with their bones.”

This is one of the reasons why I (and all open-source developers) are enthusiastic about open-source. In open-source, one’s work can be re-used in other projects in accordance with the listed LICENSE and attribution, compared to the restrictions and risk of Intellectual Property Right claims in private work.

“So be it. The noble Pythonistas and R developers;

Might not have chosen to try daru yet.”

It is quite understandable that Pythonistas and R developers feel that their corresponding languages have sufficient tools for Data Analysis. So, why would they switch to Ruby and start using daru?

“If it were so, it may have been a grievous fault;

Give daru a try, with daru-io and daru-view.”

First of all, I don’t mean any offense when I say “grievous fault”. But please, do give Ruby and daru family a try, with an open mind.

Voila - the daru family has two new additions, namely daru-io and daru-view. Ruby is a language which is extensively used in Web Development with multiple frameworks such as Rails, Sinatra, Nanoc, Jekyll, etc. With such a background, it only makes sense for daru to have daru-io and daru-view as separate plugins, thus making the daru family easily integrable with Ruby web frameworks.

“Here, for attention of Rubyists and the rest–

For Pandas is an honourable library;

So are they all, all honourable libraries and languages–

Come I to speak about daru-io’s inception.”

Sure, the alternatives in other languages like Python, R and Hadoop are also good data analysis tools. But, how readily can they be integrated into any web application? R & Hadoop don’t have a battle-tested web framework yet, and are usually pipelined into the back-end of any application to perform any analysis. I’m no one to judge such pipelines, but I feel that pipelines are hackish workarounds rather than being a clean way of integrating.

Meanwhile, though Python too has its own set of web frameworks (like Django, Flask and more), Pandas doesn’t quite work out-of-the-box with these frameworks and requires the web developer to write lines and lines of code to integrate Pandas with parsing libraries and plotting libraries.

“daru-io is a ruby gem, and open-sourced to all of us;

But some might think it was an ambitious idea;

And they are all honourable men.”

As described above, daru-io is open-sourced under the MIT License with attribution to myself and Ruby Science Foundation. Being a ruby gem, daru-io follows the best practices mentioned in the Rubygems guides and is all geared up with a v0.1.0 release.

Disclaimer - By “men”, I’m not stereotyping “them” to be all male, but I’m just merely retaining the resemblance to the original speech of Mark Anthony.

“daru-io helps convert data in many formats to Daru::DataFrame;

Whose methods can be used to analyze huge amounts of data.

Does this in daru-io seem ambitious?”

Daru has done a great job of encapsulating the two main structures of Data Analysis - DataFrames and Vectors - with a ton of functionalities that are growing day by day. But obviously, the huge amounts of data aren’t going to be manually fed into the DataFrames right?

One part of daru-io is the battalion of Importers that ship along with it. Importers are used to read from a file / Ruby instance, and create DataFrame(s). These are the Importers being supported by v0.1.0 of daru-io :

  • General file formats : CSV, Excel (xls and xlsx), HTML, JSON, Plaintext.
  • Special file formats : Avro, RData, RDS.
  • Database related : ActiveRecord, Mongo, Redis, SQLite, DBI.

For more specific information about the Importers, please have a look at the README and YARD Docs.
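
As a quick illustration, the CSV Importer follows the same require-then-read pattern as the JSON Importer shown below (a sketch; sales.csv is a hypothetical local file, and read_csv is assumed to be linked onto Daru::DataFrame analogously to read_json):

require 'daru/io/importers/csv'

# Builds a Daru::DataFrame from the rows of a local CSV file
df = Daru::DataFrame.read_csv('sales.csv')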

Let’s take a simple example of the JSON Importer, to import from GitHub’s API response. By default, the API response is paginated and 30 repositories are listed at the URL: https://api.github.com/users/#{username}/repos.

require 'daru/io/importers/json'

dataframe = %w[athityakumar zverok v0dro lokeshh].map do |username|
  Daru::DataFrame.read_json(
    "https://api.github.com/users/#{username}/repos",
    RepositoryName: '$..full_name',
    Stars: '$..stargazers_count',
    Size: '$..size',
    Forks: '$..forks_count'
  )
end.reduce(:concat)

#=> #<Daru::DataFrame(120x4)>
#      Repository   Stars   Size   Forks
#   0  athityakum       0      6       0
#   1  athityakum       0    212       0
#   2  athityakum       0    112       0
#  ...    ...          ...   ...     ...

“When working with a team of Pythonistas and R developers;

daru-io helps convert Daru::DataFrame to multiple formats.

Does this in daru-io seem ambitious?”

The second part of daru-io is the collection of Exporters that ship with it. Exporters are used to write the data in a DataFrame, to a file / database. These are the Exporters being supported by v0.1.0 of daru-io :

  • General file formats : CSV, Excel (xls), JSON.
  • Special file formats : Avro, RData, RDS.
  • Database related : SQL.

For more specific information about the Exporters, please have a look at the README and YARD Docs.

Let’s take a simple example of the RDS Exporter. Say, your best friend is a R developer who’d like to analyze a Daru::DataFrame that you have obtained, and perform further analysis. You don’t want to break your friendship, and your friend is skeptical of learning Ruby. No issues, simply use the RDS Exporter to export your Daru::DataFrame into a .rds file, which can be easily loaded by your friend in R.

require 'daru/io/exporters/rds'

dataframe # Say, the DataFrame is obtained from the above JSON Importer example

#=> #<Daru::DataFrame(120x4)>
#      Repository   Stars   Size   Forks
#   0  athityakum       0      6       0
#   1  athityakum       0    212       0
#   2  athityakum       0    112       0
#  ...    ...          ...   ...     ...

dataframe.write_rds('github_api.rds', 'github.api.dataframe')

“You all did see that in the repository’s README;

Codeclimate presented a 4.0 GPA;

Code and tests were humbly cleaned;

with help of rubocop, rspec, rubocop-rspec and saharspec.

Ambition shouldn’t have been made of humble stuff.

Yet some might think it is an ambitious idea;

And sure, they are all honourable men.”

Thanks to guidance from my mentors Victor Shepelev, Sameer Deshmukh and Lokesh Sharma, I’ve come to know about quite a lot of Ruby tools that could be used to keep the codebase sane and clean.

  • rubocop : A Ruby static code analyzer, which enforces specified Ruby style guidelines.
  • rspec : A unit-testing framework, which makes sure that blocks of code are doing what they’re logically supposed to do.
  • rubocop-rspec : A plugin gem to rubocop, that extends rspec-related rules.
  • saharspec : A gem with a punny name, that extends rspec-its with a few more readable features. For example, its_call.

“I speak not to disapprove of what other libraries do;

But here I am to speak what I do know.

Give daru-io a try and y’all will love it, not without cause:

Does anything withhold you then, from using daru-io?”

I really mean it when I specifically say “I speak not to disapprove of what other libraries do”. In the world of open-source, there should never be hate among developers regarding languages or libraries. Developers definitely have their (strong) opinions and preferences, and it’s understandable that differences of opinion do arise. But as long as there’s mutual respect for each other’s opinions and choices, all is well.

“O Ruby community! Thou should definitely try out daru-io,

With daru and daru-view. Bear with me;

My heart is thankful to the community of Ruby Science Foundation,

And I must pause till I write another blog post.”

If you’ve read all the way down here, I feel you’d be interested in trying out the daru family, after having seen the impressive demonstration of Importers & Exporters above, and the Rails example (Website | Code). I’m very thankful to mentors Victor Shepelev, Sameer Deshmukh and Lokesh Sharma for their timely Pull Request reviews and open discussions regarding features. Daru-IO would not have been possible without them and the active community of the Ruby Science Foundation, who provided useful feedback whenever they could. The community has been very supportive overall, and hence I’d definitely be interested in getting involved with SciRuby via more open-source projects.

GSoC 2016: Adding Categorical Data Support

Support for categorical data is important for any data analysis tool. This summer I implemented categorical data capabilities for:

  • Convenient and efficient data wrangling for categorical data in Daru
  • Visualization of categorical data
  • Multiple linear regression and generalized linear models (GLM) with categorical variables in Statsample and Statsample-GLM

Let's talk about each of them in detail.

Analyzing categorical data with Daru

Categorical data is now readily recognized by Daru and Daru has all the necessary procedures for dealing with it.

To analyze a categorical variable, simply turn the numerical vector into a categorical one and you are ready to go.

We will use, for demonstration purposes, animal shelter data taken from the Kaggle Competition. It is stored in shelter_data.

# Tell Daru which variables are categorical
shelter_data.to_category 'OutcomeType', 'AnimalType', 'SexuponOutcome', 'Breed', 'Color'

# Or quantize a numerical variable to categorical
shelter_data['AgeuponOutcome'] = shelter_data['AgeuponOutcome(Weeks)'].cut [0, 1, 4, 52, 260, 1500],
    labels: [:less_than_week, :less_than_month, :less_than_year, :one_to_five_years, :more_than_five_years]

# Do your operations on categorical data
shelter_data['AgeuponOutcome'].frequencies.sort ascending: false

shelter_data['Breed'].categories.size
#=> 1380
# Merge infrequent categories to make data analysis easy
other_cats = shelter_data['Breed'].categories.select { |i| shelter_data['Breed'].count(i) < 10 }
other_cats_hash = other_cats.zip(['other']*other_cats.size).to_h
shelter_data['Breed'].rename_categories other_cats_hash
shelter_data['Breed'].frequencies
# View the data
shelter_data['Breed'].frequencies.sort(ascending: false).head(10)

Please refer to this blog post to know more.

Visualizing categorical data

With the help of Nyaplot, GnuplotRB and Gruff, Daru now provides the ability to visualize categorical data just as it does numerical data.

To plot a vector with Nyaplot one needs to call the function #plot.

# dv is a categorical vector
dv = Daru::Vector.new ['III']*10 + ['II']*5 + ['I']*5, type: :category, categories: ['I', 'II', 'III']

dv.plot(type: :bar, method: :fraction) do |p, d|
  p.x_label 'Categories'
  p.y_label 'Fraction'
end

Given a dataframe, one can draw a scatter plot such that the points' color, shape and size vary according to a categorical variable.

# df is a dataframe with categorical variable :c
df = Daru::DataFrame.new({
  a: [1, 2, 4, -2, 5, 23, 0],
  b: [3, 1, 3, -6, 2, 1, 0],
  c: ['I', 'II', 'I', 'III', 'I', 'III', 'II']
  })
df.to_category :c

df.plot(type: :scatter, x: :a, y: :b, categorized: {by: :c, method: :color}) do |p, d|
  p.xrange [-10, 10]
  p.yrange [-10, 10]
end

In a similar manner, GnuplotRB and Gruff also support plotting of categorical variables.

An additional piece of work I did was integrating Gruff with Daru. Now one can also plot vectors and dataframes using Gruff.

See more notebooks on visualizing categorical data with Daru here.

Regression with categorical data

Now categorical data is supported in multiple linear regression and generalized linear models (GLM) in Statsample and Statsample-GLM.

A new formula language (like that used in R or Patsy) has been introduced to ease the task of specifying regressions.

Now there’s no need to manually create a dataframe for regression.

require 'statsample-glm'

formula = 'OutcomeType_Adoption~AnimalType+Breed+AgeuponOutcome(Weeks)+Color+SexuponOutcome'
glm_adoption = Statsample::GLM::Regression.new formula, train, :logistic
glm_adoption.model.coefficients :hash

#=> {:AnimalType_Cat=>0.8376443692275163, :"Breed_Pit Bull Mix"=>0.28200753488859803, :"Breed_German Shepherd Mix"=>1.0518504638731023, :"Breed_Chihuahua Shorthair Mix"=>1.1960242033878856, :"Breed_Labrador Retriever Mix"=>0.445803000000512, :"Breed_Domestic Longhair Mix"=>1.898703165797653, :"Breed_Siamese Mix"=>1.5248210169271197, :"Breed_Domestic Medium Hair Mix"=>-0.19504965010288533, :Breed_other=>0.7895601504638325, :"Color_Blue/White"=>0.3748263925801828, :Color_Tan=>0.11356334165122918, :"Color_Black/Tan"=>-2.6507089126322114, :"Color_Blue Tabby"=>0.5234717706465536, :"Color_Brown Tabby"=>0.9046099720184905, :Color_White=>0.07739310267363662, :Color_Black=>0.859906249787038, :Color_Brown=>-0.003740755055106689, :"Color_Orange Tabby/White"=>0.2336674067343927, :"Color_Black/White"=>0.22564205490196415, :"Color_Brown Brindle/White"=>-0.6744314269278774, :"Color_Orange Tabby"=>2.063785952843677, :"Color_Chocolate/White"=>0.6417921901449108, :Color_Blue=>-2.1969040091451704, :Color_Calico=>-0.08386525532631824, :"Color_Brown/Black"=>0.35936722899161305, :Color_Tricolor=>-0.11440457799048752, :"Color_White/Black"=>-2.3593561796090383, :Color_Tortie=>-0.4325130799770577, :"Color_Tan/White"=>0.09637439333330515, :"Color_Brown Tabby/White"=>0.12304448360566177, :"Color_White/Brown"=>0.5867441296328475, :Color_other=>0.08821407092892847, :"SexuponOutcome_Spayed Female"=>0.32626712478395975, :"SexuponOutcome_Intact Male"=>-3.971505056680895, :"SexuponOutcome_Intact Female"=>-3.619095491410668, :SexuponOutcome_Unknown=>-102.73807712615843, :"AgeuponOutcome(Weeks)"=>-0.006959545305620043}

Additionally, through the work of Alexej Grossmann, one can also predict on new data using the model.

predict = glm_adoption.predict test
predict.map! { |i| i < 0.5 ? 0 : 1 }
predict.head 5

This, I believe, makes Statsample-GLM very convenient to use.

See this for a complete example.

Other

In addition to the aforementioned, there are some other considerable changes:

Documentation

You can read about all my work in detail here. Additionally, my project page can be found here.

I hope with these additions one will be able to see data more clearly with Daru.

GSoC 2016: SpiceRub::KernelPool and Kernels

So this post comes rather late, but I’d still like to talk about what I did in the first week, and a little into the second week. Firstly, thanks to a pull request offered by my GSoC mentor John (@mohawkjohn), I now have rake tasks set up properly in the repository and rake seems to be working.

There are installation instructions in the readme and there are already some sample Kernel files in the spec folder to try out various routines on. Apart from cloning the repository, one additional thing you will need to do is download the SPICE toolkit. You may want to keep the entire compressed file for later, but you’ll only need the cspice.a file in the lib/ subdirectory. After this follow the instructions in the readme and you should be good to go. (Be sure to run bundle install to install any dependencies that you don’t already have.)

After you’re done compiling and installing, run rake pry in the gem root directory.

If you remember, almost any useful task involving the SPICE Toolkit is preceded by loading data through kernel files. The relevant routine to do this is called furnsh_c(), and the most direct way to access it through Ruby is by calling the function SpiceRub::Native.furnsh. (However, this is not recommended, because SpiceRub has a specific Ruby class unifying all the kernel-related methods, and also because SPICE maintains its own internal variables both for tracking loaded kernel files and for unloading them.)

Below is a crude dependency sequence:

SpiceRub::KernelPool.load ->  SpiceRub::Native.furnsh -> furnsh_c

Ruby Interface            ->  C Accessor              -> External library

Ruby                      ->  Native Ruby-C API       -> Native 3rd Party C

That’s the basic design of the wrapper, so here are a few simple examples on using the Kernel API. => denotes the interpreted output of pry.

First of all, the main KernelPool class is a singleton class; that means it can only be instantiated with the #instance method, and the usual #new is private. Any subsequent calls to #instance will produce the same object.

kernel_pool = SpiceRub::KernelPool.instance
=> #<SpiceRub::KernelPool:0x00000002db3218>

rogue_kernel_pool = SpiceRub::KernelPool.new
NoMethodError: private method `new' called for SpiceRub::KernelPool:Class
from (pry):2:in `__pry__'

rogue_kernel_pool = SpiceRub::KernelPool.instance
=> #<SpiceRub::KernelPool:0x00000002db3218>

rogue_kernel_pool == kernel_pool
=> true

This is to make sure there is only one KernelPool state being maintained at a time. Now we have a bunch of kernel files in the spec/data/kernels folder which we’ll use for this example. There is an accessor attribute called @path which can be used to point to a particular folder.

kernel_pool.path = 'spec/data/kernels'

From here on it’s required that you’re running pry or irb from the gem root folder in order for the paths to match in these examples.

Now let’s load a couple of kernel files. You can type system("ls", kernel_pool.path) into your console to get a list of all the test kernels available in that folder. The KernelPool object has a #load method to load kernel files. If the path variable is set, then you only need to enter the file name; otherwise the entire path needs to be provided. An integer denoting the index of the kernel in the pool is returned if the load is successful.

kernel_pool.load("naif0011.tls")
=> 0

Note that this is the same as providing the full relative path of spec/data/kernels/naif0011.tls when the path variable is not set or nil.

Let’s load two more files:

kernel_pool.load("moon_pa_de421_1900-2050.bpc")
=> 1

kernel_pool.load("de405_1960_2020.bsp")
=> 2

kernel_pool
=> #<SpiceRub::KernelPool:0x00000002db3218
 @path="spec/data/kernels",
 @pool=
  [#<SpiceRub::KernelPool::SpiceKernel:0x0000000270fab0 @loaded=true, @path_to="spec/data/kernels/naif0011.tls">,
   #<SpiceRub::KernelPool::SpiceKernel:0x000000026bd698 @loaded=true, @path_to="spec/data/kernels/moon_pa_de421_1900-2050.bpc">,
   #<SpiceRub::KernelPool::SpiceKernel:0x000000024ce698 @loaded=true, @path_to="spec/data/kernels/de405_1960_2020.bsp">]>

So now if you try to view the @pool member of kernel_pool, you’ll find three SpiceKernel objects with @loaded=true and their respective file paths. There isn’t much to do with a SpiceKernel object except unload it, or check its state. Note that you can only load kernel files into a KernelPool object and unload them via the SpiceKernel object.

You can access the kernel pool by calling the #[] operator and using the index that was returned on load:

kernel_pool[0]
=> #<SpiceRub::KernelPool::SpiceKernel:0x0000000270fab0 @loaded=true, @path_to="spec/data/kernels/naif0011.tls">

kernel_pool.count
=> 3

kernel_pool[0].unload!
=> true

kernel_pool.count
=> 2

So here we unload the first kernel and note that the count drops to 2. If you look up kernel_pool[0], you’ll find that the kernel is still present — but its @loaded variable has been set to false, which means that it has been removed from the CSPICE internal kernel pool.

kernel_pool[0]
=> #<SpiceRub::KernelPool::SpiceKernel:0x0000000270fab0 @loaded=false, @path_to="spec/data/kernels/naif0011.tls">

To unload all kernels simultaneously and delete the kernel pool, use #clear!

kernel_pool.clear!
=> true

kernel_pool.empty?
=> true

kernel_pool.pool
=> []

And that about wraps up this blog post on basic kernel handling. Since we know how to load data but not use it yet, I’ll cover that and the various kernel types in the next blog post. Thank you for reading.

GSoC 2016 : A Look at SpiceRub::Body

One of the main goals of this project was to make the ephemerides functions easily available to an end user. Ephemerides basically refer to objects in space whose motion is being tracked and observed. SPICE requires you to provide the following parameters to observe a body in space:

  • Target: The body of interest
  • Frame: A rotational frame of reference (Default is J2000 [Not to be confused with the J2000 epoch])
  • Observer: An observing body whose viewpoint is used to chart the vector
  • Epoch: An epoch in Ephemeris Time

SPICE has an integer-key convention for the kinds of bodies it supports. Each body can be referenced via a string or an integer ID. While there isn’t an actual strict range for integer ID classification, it is mentioned here and can be summed up in the following if and elsif clauses. (In Ruby, constant strings are better off as symbols, so the constructor takes either an integer ID or a string symbol.)

  if body_id > 2000000
    :asteroid
  elsif body_id > 1000000
    :comet
  elsif body_id > 1000
    :body
  elsif body_id > 10
    body_id % 100 == 99 ?
      :planet : :satellite
  elsif body_id == 10
    :star
  elsif body_id > 0
    :barycenter
  elsif body_id > -1000
    :spacecraft
  elsif body_id > -100000
    :instrument
  else
    :spacecraft
  end

It was very tempting to involve inheritance and extend a base Body class onto these potential classes, but I simply did not see the need for it at this point. The way it is at the moment, the Body object has a reader attribute type that stores some metadata about the body for the user’s convenience. Perhaps as coverage of SPICE improves, this minor thing can be changed later on.

To create a Body object, you instantiate with either a body name or a body ID. Certain bodies such as instruments will require additional kernels to be loaded. To proceed seamlessly, load a leap seconds kernel, a planetary constants kernel, and an ephemeris kernel. (All available in spec/data/kernels.)

SpiceRub::Body.new(399)
=> #<SpiceRub::Body:0x00000002769da8
     @code=399,
     @frame=:J2000,
     @name=:earth,
     @type=:planet>

SpiceRub::Body.new(:earth)
=> #<SpiceRub::Body:0x000000026c73c8
   @code=399,
   @frame=:J2000,
   @name=:earth,
   @type=:planet>

SpiceRub::Body.new(:moon)
=> #<SpiceRub::Body:0x0000000214ac88
   @code=301,
   @frame=:J2000,
   @name=:moon,
   @type=:satellite>

399 and :earth map to the same body in SPICE data. The frame of reference can also be specified as a named parameter during instantiation to set a custom default frame for that particular object.

SpiceRub::Body.new(399, frame: :IAU_EARTH)
=> #<SpiceRub::Body:0x000000020b1df8
     @code=399,
     @frame=:IAU_EARTH,
     @name=:earth,
     @type=:planet>

In SPICE, a state is a 6-length column vector that stores position and velocity in 3D Cartesian coordinates.

As a base case, let’s find out the position of the Earth with respect to itself.

earth.position_at(SpiceRub::Time.now, observer: earth)
=>
[
  [0.0]   [0.0]   [0.0]
]

The origin as seen from itself is still the origin, so this makes sense. The methods #velocity_at and #state_at take an identical set of parameters. While there is a bit of redundancy, splitting them makes the API more elegant; the basic relationship between these three vectors is the following:

state = [
          position[0],position[1],position[2],
          velocity[0],velocity[1],velocity[2]
        ]

One thing to note is that state/velocity/position vectors will always be returned as an NMatrix object, SciRuby’s numerical matrix core, to allow for calculations via the NMatrix API.

As an example that is used in the code, one line can turn a position vector into distance from origin (here using Euclidean distance):

  position = earth.position_at(SpiceRub::Time.now, observer: observer)
  Math.sqrt( (position ** 2).sum[0] )

As a simple, imprecise experiment, let’s find out how the speed of light can be “estimated” with this data.

distance = moon.distance_from(earth, now)
=> 367441.0260814745

time = moon.light_time_from(earth, now)
=> 1.2256513340354764

distance / time
=> 299792.458

The unit of distance here is kilometers, so the speed of light by this measurement comes out pretty close to the textbook figure of 3e+8 m/s.

There is also a function to check if a list of bodies are within a radial proximity from an observing body. We already calculated the distance of the moon to be about 367,000 km. The function within_proximity returns a list of all bodies that are within the specified radial distance from the calling body object.

#assuming venus and mercury are instantiated

earth.within_proximity([moon, venus, mercury], 400000, now)
=> [#<SpiceRub::Body:0x0000000191c4f8
  @code=301,
  @frame=:J2000,
  @name=:moon,
  @type=:satellite>]

Now that we’ve come to the end of the functionality, I would like to mention that there is another named argument aberration_correction which is basically an error reduction method to provide a more accurate result than the default observation. The default :none option for aberration correction basically provides the geometric observations without any corrections for reception or transmission of photons. For a list of various aberration correction methods available, have a look at the documentation for spkpos_c to find out if you need an aberration correction on SPICE data.

d1 = moon.distance_from(earth, SpiceRub::Time.now, aberration_correction: :none)
=> 369111.0550333138
d2 = moon.distance_from(earth, SpiceRub::Time.now, aberration_correction: :LT)
=> 369146.60640691273

d2 - d1
=> 35.55137359892251

If you want to look at it another way, no aberration correction would give you the textbook response of rigid geometry, while introducing an aberration correction would give you a somewhat more realistic output accounting for the errors that do happen when these observations are made.

Finally, if you need to generate a continuous time series for a body, then SpiceRub::Time has two functions to aid in that:

SpiceRub::Time.linear_time_series(now, now + 86400, 4)
=> [
    #<SpiceRub::Time:0x00000001fe8b60 @et=525180780.18277323>,
    #<SpiceRub::Time:0x00000001fe8a20 @et=525209580.18277323>,
    #<SpiceRub::Time:0x00000001fe88b8 @et=525238380.18277323>,
    #<SpiceRub::Time:0x00000001fe8750 @et=525267180.18277323>
   ]

In this case, I took a start time and an end time one day later and requested 4 linearly spaced epochs. This is basically an interface to NMatrix.linspace.
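
Conceptually, the linearly spaced epochs could be computed like this (a sketch of the idea only, not the gem's actual implementation):

# n evenly spaced ephemeris times between start_et and end_et (inclusive)
def linspace_et(start_et, end_et, n)
  step = (end_et - start_et) / (n - 1).to_f
  Array.new(n) { |i| start_et + i * step }
end

linspace_et(525180780.0, 525267180.0, 4)
#=> [525180780.0, 525209580.0, 525238380.0, 525267180.0]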

The other function requires you to input a start time and an end time and a step size that keeps getting added to the start time till the end time is reached. As a contrived example, we’ll take two epochs, five days apart and ask for a step size of a day, expecting six elements.

SpiceRub::Time.time_series(now, now + 5 * 86400, step: 86400)
=> [
    #<SpiceRub::Time:0x00000001646580 @et=525180780.18277323>,
    #<SpiceRub::Time:0x00000002f315b8 @et=525267180.18277323>,
    #<SpiceRub::Time:0x00000002f31590 @et=525353580.18277323>,
    #<SpiceRub::Time:0x00000002f31568 @et=525439980.18277323>,
    #<SpiceRub::Time:0x00000002f31540 @et=525526380.18277323>,
    #<SpiceRub::Time:0x00000002f31518 @et=525612780.18277323>
   ]

And that’s it for this blog post. I would appreciate any feedback, as I’ve been juggling the design back and forth very frequently. There is large potential for expansion of the Body class, particularly creating new subclasses when different kinds of Body objects need corresponding functions. (For example, the function getfov_c, which returns the field of view of an instrument, could be an instance method attached to an Instrument subclass of Body, but this is just a potential future expansion.)

GSoC 2016 : A Look at SpiceRub::Time

Many popular programming languages these days ship with powerful Time classes/interfaces to make the inevitable task of dealing with time-related computations a more DRY experience. Humans like their time representations to be stringy concatenations of various numbers and alphabets that enable a large variety of date and time formats to mean the same thing and still make instant sense to the human eye. It was the creation of time zones that made our convenient 24-hour clocks stay the way they are without experiencing day and night at the same “time” across the globe. So now 00:00 A.M. UTC is 5:30 A.M. IST for me, and the world remains sane.

But what happens when you accept the fact that you’re just a speck of micro-dust adjusting time relatively for an only slightly bigger speck of dust floating in the universe? Twenty-four hours in a day and thus we reset after 2300, but consider: how would a resident of Venus know when tea-time is on Venus if he had an Earth wristwatch that reset after twenty-four hours? Barely a tenth of Venus’ day is complete in that time! (If you know anybody intent on relocating to Mars, do not gift them a clock or watch.)

So a decimal floating point representation must be the answer for uniformity. Time zones can be dealt with; we’ll just pick a convenient point in time and count the seconds from there onwards so that the location on Earth doesn’t matter henceforth. It’ll drive humans insane with the arithmetic but machines will work just fine with this. This sort of a time system is called epoch time.

And so the internal time of most UNIX machines is the number of seconds after midnight on Thursday, 1 January 1970 UTC. (And this very convention is going to open a can of worms by 2038 if there is even a small set of critical machines that haven’t moved on from 32-bit architectures.)

But we’re still not okay universally. Try going on an infinite journey to space and you’ll find that counting seconds leads to some inconsistencies with your local time when you try to synchronize with Earth. How can the number of seconds after January 1970 be different in any case? Well, your MacBook Pro has not been adjusted for … relativity! Gravity bends light and thus the perception of time. There’s a lot more mass, and thus a lot more gravitational fields, in neighborhoods away from Earth. The exact details of how this works are beyond the scope of this blog post.

If the past few paragraphs were incessant and seemingly irrelevant, they were there to drive home the point that Earth time simply will not do when we step out of the ghetto to see what’s happening. But astronomy’s been around for way longer, and astrophysicists came forth with a time system adjusted for the relativity effects of the solar system, called Barycentric Dynamical Time, or TDB. Like our machines, it counts the seconds after a certain known reference time point, except that it adjusts for relativity and can become a standard for astronomical time.

There are many similar time scales like this, but SPICE has chosen to use TDB as the standard for most of their design. Within the SPICE API, TDB is the same as Ephemeris Time, which is the main system used to specify time points of astronomical bodies. Even though spacecraft have their clocks coordinated with UTC on Earth, you need that time expressed in Ephemeris Time to be able to calculate their positions and velocities using SPICE. SpiceRub::Time is built for this very purpose, to revolve around a main attribute @et for Ephemeris Time and provide many methods to convert to and from it.

If you’d like to proceed with the examples, you’ll need a Leap Second Kernel file to use SpiceRub::Time. This is a generic kernel, so you can easily use naif0011.tls in spec/data/kernels of the repository folder.

So Ephemeris Time is the number of seconds elapsed after Noon, January 1, 2000, TDB. This point in time is also known as the J2000 epoch. We find that out in an instant by using the Time.parse function, which is a wrapper for SPICE’s str2et_c that converts many formats of strings to Ephemeris Time. You can have a look at the various string formats supported in its documentation here.

 SpiceRub::Time.parse("12:00 Jan 1 2000 TDB")
=> #<SpiceRub::Time:0x0000000325b1f8 @et=0.0>

So as a base case, using the reference epoch gives us 0 seconds, as we would expect. Now would also be a good time to find out the discrepancy with UTC.

SpiceRub::Time.parse("12:00 Jan 1 2000 UTC")
=> #<SpiceRub::Time:0x000000031c28e0 @et=64.18392728473108>

So right away we know that UTC was 64-ish seconds off from TDB / ET at the time of the reference J2000 epoch. What would the difference be around right now?

now = SpiceRub::Time.now
=> #<SpiceRub::Time:0x000000030bf8f8 @et=525173312.1827749>

(now - now.to_utc).to_f
=> 68.18277490139008

Well, here’s a surprise, it’s 68.18 now. Before I explain why that is, here is a brief overview of what the above code does:

Time.now is a convenient way to specify the current time in UTC. It uses Ruby’s core Time.now method, so it is only good if you’re working in UTC or Earth-like timezones. For a similar purpose, the function Time.from_time lets you create a SpiceRub Time object from a Ruby Time object.
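
For example (a small sketch; the exact ephemeris value depends on the instant you convert):

ruby_time  = Time.utc(2016, 8, 23, 12, 0, 0)
spice_time = SpiceRub::Time.from_time(ruby_time)
# spice_time's @et now holds the equivalent TDB seconds past the J2000 epoch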

The +/- operators return a new Time object where the right operand is added to/subtracted from the left operand’s @et when it is a float or an integer. If a Time object is supplied, the same is done with the right operand’s ephemeris time instead. (Note that there isn’t really a significant meaning to a Time object whose @et is the difference/sum of two other epochs; however, you can shift a given epoch forwards or backwards by a constant offset of seconds.)

In our case we used #to_utc to convert from ephemeris time to UTC, and then the minus operator gave us a Time object that wasn’t really an epoch, but a difference of two epochs, so using #to_f got us exactly that.

It appears that UTC has changed by 4 seconds since 2000 with respect to ephemeris time. This is actually the adjustment of “leap seconds” that gets added to UTC to prevent it from falling too far behind other time systems. (Humans really like to hack everything, don’t they?)

To verify this yourself, open up the kernel naif0011.tls in your text editor and search for DELTET/DELTA_AT; you’ll find a list-like representation of the following sort:

DELTET/DELTA_AT        = ( 10,   @1972-JAN-1
                           ..,   ...........
                           32,   @1999-JAN-1
                           33,   @2006-JAN-1
                           34,   @2009-JAN-1
                           35,   @2012-JUL-1
                           36,   @2015-JUL-1 )

Here you can see that just before the year 2000, there were 32 leap seconds added to UTC, and in 2015 when the last leap second was added, there were 36. It’s an ongoing and indefinite process and so there really is no way to account for leap second errors far in the future for leap seconds that are yet to be added. As of now, the next scheduled addition is in December, 2016.

Coming back to our Time object, let’s look at its basic construction. One tricky task in the API was the option to specify different epochs of reference in different time scales, like International Atomic Time. As of now, Time.new requires that you have kept your word of using the J2000 epoch and allows you to use a named parameter seconds: for specifying the time scale. The use of scale as a key was avoided as it sometimes is also used to refer to the reference epoch used.

epochs = [:utc, :tdb, :tai].map do |scale|
  SpiceRub::Time.new(0, seconds: scale)
end

=> [#<SpiceRub::Time:0x00000002756fc8 @et=64.18392728466942>,
    #<SpiceRub::Time:0x00000002756eb0 @et=0>,
    #<SpiceRub::Time:0x00000002756cf8 @et=32.18392727400827>]

:tai here refers to International Atomic Time. For a list of more parameters and their keyword abbreviations, have a look at this SPICE documentation for the function that the conversion is wrapped on top of.

But there is also a way to reference other epochs without doing the manual conversions yourself: you can call the class method Time.at, which performs the same function as the constructor but with the option of a different reference epoch. The resulting Time object will, however, have its internal time referring to J2000.

A more readable way would involve step-by-step calculations, but that would consume runtime resources every time Time.at is called, so I've basically pre-calculated the ephemeris times of the reference epochs and subtracted them from the given epoch.

  case reference.downcase
  when :j2100
    new(offset + 3155760000.0, seconds: seconds)
  when :j2000, :et
    new(offset, seconds: seconds)
  when :j1950
    new(offset - 1577880000.0, seconds: seconds)
  when :j1900
    new(offset - 3155760000.0, seconds: seconds)
  when :gps
    new(offset - 630763148.8159368, seconds: seconds)
  when :unix
    new(offset - 946727958.8160644, seconds: seconds)
  end

To quickly verify the last one with the #to_s method:

SpiceRub::Time.new(-946727958.8160644).to_s
=> "Thu Jan 01 00:00:00 UTC 1970"

It’s exactly the UNIX epoch! Let’s try out 1 day (86400 seconds) after this epoch:

SpiceRub::Time.at(86400, :unix).to_s
=> "Thu Jan 01 23:59:59 UTC 1970"

Just a second short of heading into the next day, because we’ve added 86400 TDB seconds and converted the time into a UTC string.

There are some more functions provided to work in tandem with the Body class that I'll explain more about in the next blog post, but this more or less covers the core of SpiceRub::Time. Till then, thanks for reading.

GSoC 2016: NMatrix and JRuby

Introduction

I worked on “Port NMatrix to JRuby” in the context of the Google Summer of Code (GSoC) 2016 and I am pleased to announce that NMatrix can now be used in JRuby.

On JRuby, NMatrix, a linear algebra library, wraps Apache Commons Math for its most basic functionality. NMatrix supports dense matrices containing either doubles or Ruby objects as the data type. The performance of JRuby with Apache Commons Math is quite satisfactory (see below for performance comparisons), even without making use of JRuby's threading capabilities.

I have also ported the mixed_models gem, which uses NMatrix heavily at its core, to JRuby. This gem allowed us to test NMatrix-JRuby with real-life data.

This blog post summarizes my work on the project with SciRuby, and reports the final status of the project.

The original GSoC proposal, plan and application can be found here. Until merging is complete, commits are available here.

Performance

I have benchmarked some of the NMatrix functionalities. The following plots compare the performance between NMatrix-JRuby, NMatrix-MRI, and NMatrix-MRI using LAPACK/ATLAS libraries. (Note: MRI refers to the reference implementation of Ruby, for those who are new.)

Notes:

  1. LAPACK and ATLAS aren’t involved in most element-wise operations, such as addition and subtraction.
  2. NMatrix-MRI relies on LAPACK/ATLAS for calculating determinants and LU Decomposition (lud).

(Benchmark plots: Matrix Addition, Matrix Subtraction, Matrix Multiplication, Gamma operator, Determinant, LU Factorization.)

Result

  1. For two-dimensional matrices, NMatrix-JRuby is currently slower than NMatrix-MRI for matrix multiplication and matrix decomposition functionalities (calculating determinant and factoring a matrix). NMatrix-JRuby is faster than NMatrix-MRI for other functionalities of a two-dimensional matrix — like addition, subtraction, trigonometric operations, etc.

  2. NMatrix-JRuby is a clear winner when we are working with matrices of arbitrary dimensions.

Implementation

Storing n-dimensional matrices as one-dimensional arrays

The major components of an NMatrix are shape, elements, dtype and stype. When initialized, the dense type stores the elements as a one-dimensional array; in the JRuby port, the ArrayRealVector class is used to store the elements.

@s stores the elements, @shape stores the shape of the matrix, while @dtype and @stype store the data type and storage type respectively. Currently, NMatrix-JRuby is implemented only for the :float64 (double) and Ruby :object data types.
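
For instance, a 2x2 double matrix is created with the usual NMatrix API; on JRuby its four elements end up stored in a single ArrayRealVector (a small illustrative example):

require 'nmatrix'

m = NMatrix.new([2, 2], [1.0, 2.0, 3.0, 4.0], dtype: :float64)
m.shape  #=> [2, 2]
m.dtype  #=> :float64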

NMatrix-MRI uses a C struct to store the dim, shape, offset, count and src of an NMatrix. ALLOC and xfree are used to wrap the NMatrix attributes in C structs and to release memory that is no longer required.


Slicing and Rank

Implementing slicing was the toughest part of the NMatrix-JRuby implementation. NMatrix's @s stores the elements of a matrix as a one-dimensional array, and the elements along any dimension are accessed with the help of the stride. NMatrix#get_stride calculates the stride from the dimension and shape and returns an Array.

# Compute the stride for each dimension: stride[i] is the number of elements
# skipped in the flat array when index i increases by one.
def get_stride(nmatrix)
  stride = Array.new
  (0...nmatrix.dim).each do |i|
    stride[i] = 1
    (i+1...nmatrix.dim).each do |j|
      stride[i] *= nmatrix.shape[j]
    end
  end
  stride
end

NMatrix#[] and NMatrix#[]= are thus able to read and write the elements of a matrix. NMatrix-MRI, by contrast, uses the @s object, which stores the stride when the NMatrix is initialized.
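
As an illustration of how the stride is used, the flat position of a set of coordinates is simply the sum of each coordinate multiplied by its stride. A minimal sketch (the real dense_storage_pos used later in this post may differ in details):

# Sketch only: map coordinates plus strides to a position in the flat array.
def dense_storage_pos(coords, stride)
  coords.each_with_index.reduce(0) { |pos, (c, i)| pos + c * stride[i] }
end

stride = [6, 3, 1]                    # strides for a matrix of shape [2, 2, 3]
dense_storage_pos([1, 0, 2], stride)  # => 8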

NMatrix#[] calls #xslice, which calls #get_slice; the latter uses the stride to determine whether we are accessing a single element or multiple elements. If there are multiple elements, #dense_storage_get returns an NMatrix object with the elements along the dimension.

The NMatrix-MRI implementation differs from NMatrix-JRuby here in that it has to manage memory explicitly, so that it can be garbage collected properly.

def xslice(args)
  result = nil

  if self.dim < args.length
    raise(ArgumentError, "wrong number of arguments\
       (#{args} for #{effective_dim(self)})")
  else
    slice  = get_slice(@dim, args, @shape)
    stride = get_stride(self)
    if slice[:single]
      # A single element: look it up directly in the flat storage.
      if (@dtype == :object)
        result = @s[dense_storage_get(slice, stride)]
      else
        result = @s.getEntry(dense_storage_get(slice, stride))
      end
    else
      # Multiple elements: dense_storage_get builds an NMatrix slice.
      result = dense_storage_get(slice, stride)
    end
  end
  result
end

NMatrix#[]= calls #dense_storage_set, which calls #get_slice; the latter uses the stride to determine whether we are writing a single element or multiple elements. If there are multiple elements, #set_slice recursively sets the elements of the matrix and then returns an NMatrix object with the elements along the dimension.
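
For orientation, here is a small usage sketch of the reader and writer interface just described (standard NMatrix slicing syntax; the return values are illustrative):

m = NMatrix.new([3, 3], (1..9).map(&:to_f), dtype: :float64)
m[0, 0]         # => 1.0, a single element fetched via dense_storage_get
m[0..1, 0..1]   # => a 2x2 NMatrix holding the top-left block
m[2, 2] = 42.0  # writes through dense_storage_set / set_slice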

All the relevant code for slicing can be found here.

Enumerators

NMatrix-MRI uses C code to enumerate the elements of a matrix. Just as with slicing, NMatrix-JRuby uses pure Ruby code instead. Currently, all the enumerators for dense matrices with real data types have been implemented and are properly functional; enumerators for object matrices have not yet been implemented.

def each_with_indices
  nmatrix = create_dummy_nmatrix
  stride = get_stride(self)
  offset = 0
  # Create indices and initialize them to zero
  coords = Array.new(dim) { 0 }

  shape_copy = Array.new(dim)
  (0...size).each do |k|
    dense_storage_coords(nmatrix, k, coords, stride, offset)
    slice_index = dense_storage_pos(coords, stride)
    ary = Array.new
    if (@dtype == :object)
      ary << self.s[slice_index]
    else
      ary << self.s.toArray.to_a[slice_index]
    end
    (0...dim).each do |p|
      ary << coords[p]
    end

    # yield the array which now consists of the value and the indices
    yield(ary)
  end if block_given?
  nmatrix.s = @s

  return nmatrix
end
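
A brief usage sketch of the enumerator above (output illustrative):

m = NMatrix.new([2, 2], [1.0, 2.0, 3.0, 4.0], dtype: :float64)
m.each_with_indices do |val, row, col|
  puts "(#{row}, #{col}) => #{val}"
end
# (0, 0) => 1.0
# (0, 1) => 2.0
# (1, 0) => 3.0
# (1, 1) => 4.0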

Two-Dimensional Matrices

Linear algebra is mostly about two-dimensional matrices. In NMatrix-JRuby, when performing calculations on a two-dimensional matrix, the one-dimensional storage array is converted into a two-dimensional matrix. A two-dimensional matrix is stored as a BlockRealMatrix or an Array2DRowRealMatrix; each has its own advantages.

Getting a 2D Matrix


public class MatrixGenerator
{
  public static double[][] getMatrixDouble(double[] array, int row, int col)
  {
    double[][] matrix = new double[row][col];
    for (int index=0, i=0; i < row ; i++){
        for (int j=0; j < col; j++){
            matrix[i][j]= array[index];
            index++;
        }
    }
    return matrix;
  }
}

Convert a 2D-matrix to 1D-array

public class ArrayGenerator
{
  public static double[] getArrayDouble(double[][] matrix, int row, int col)
  {
    double[] array = new double[row * col];
    for (int index=0, i=0; i < row ; i++){
        for (int j=0; j < col; j++){
            array[index] = matrix[i][j];
            index++;
        }
    }
    return array;
  }
}

Why use a Java method instead of a Ruby method?

  1. Memory Usage and Garbage Collection: A scientific library is memory intensive, and hence every step counts. The JRuby interpreter doesn't need to guess the data type dynamically and uses less memory, typically around one-tenth as much. If memory is used efficiently, the garbage collector has less to clean up when it kicks in.

  2. Speed: Using a Java method improves speed by around 1000 times compared to the equivalent Ruby method.

Operators

All the operators from NMatrix-MRI have been implemented except modulus. The binary operators were easily implemented through Commons Math API and Java Math API.

def +(other)
  result = create_dummy_nmatrix
  if (other.is_a?(NMatrix))
    #check dimension
    raise(ShapeError, "Cannot add matrices with different dimension")\
    if (@dim != other.dim)
    #check shape
    (0...dim).each do |i|
      raise(ShapeError, "Cannot add matrices with different shapes") \
      if (@shape[i] != other.shape[i])
    end
    result.s = @s.copy.add(other.s)
  else
    result.s = @s.copy.mapAddToSelf(other)
  end
  result
end

Unary operators (trigonometric, exponentiation and log operators) have been implemented using the #mapToSelf method, which takes a univariate function as an argument. #mapToSelf maps every element of the ArrayRealVector object through the univariate function passed to it and returns self.

def sin
  result = create_dummy_nmatrix
  result.s = @s.copy.mapToSelf(Sin.new())
  result
end

NMatrix#method(arg) operations have been implemented using the bivariate functions provided by the Commons Math API and the Java Math API.

def gamma
  result = create_dummy_nmatrix
  result.s = ArrayRealVector.new MathHelper.gamma(@s.toArray)
  result
end

import org.apache.commons.math3.special.Gamma;

public class MathHelper{
  ...
  public static double[] gamma(double[] arr){
    double[] result = new double[arr.length];
    for(int i = 0; i< arr.length; i++){
      result[i] = Gamma.gamma(arr[i]);
    }
    return result;
  }
  ...
}

Decomposition

NMatrix-MRI relies on LAPACK and ATLAS for matrix decomposition and solving functionalities. Apache Commons Math provides a different set of APIs for decomposing a matrix and solving equations. For example, #potrf and other LAPACK-specific functions have not been implemented, as they are not required.

Calculating the determinant in NMatrix is tricky: the matrix is reduced to either a lower or an upper triangular matrix, the diagonal elements are multiplied together, and the correct sign of the result (positive or negative, depending on the row swaps) is tracked along the way. NMatrix-JRuby instead uses the Commons Math API to calculate the determinant.

def det_exact
  if @dim != 2 || @shape[0] != @shape[1]
    raise(ShapeError, "matrices must be square to have a determinant defined")
  end
  # Delegate to Commons Math: the LU decomposition exposes the determinant.
  LUDecomposition.new(self.twoDMat).getDeterminant()
end
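
For readers unfamiliar with the elimination approach described above, here is a generic pure-Ruby illustration (not taken from NMatrix) of reducing to an upper-triangular matrix, multiplying the diagonal, and tracking the sign changes caused by row swaps:

def det_via_elimination(rows)
  a = rows.map(&:dup)
  n = a.length
  sign = 1.0
  (0...n).each do |k|
    # Find a pivot row with a non-zero entry in column k and swap it up.
    pivot = (k...n).find { |i| a[i][k].abs > 1e-12 }
    return 0.0 unless pivot
    if pivot != k
      a[k], a[pivot] = a[pivot], a[k]
      sign = -sign          # each row swap flips the sign of the determinant
    end
    # Eliminate the entries below the pivot.
    ((k + 1)...n).each do |i|
      factor = a[i][k] / a[k][k]
      (k...n).each { |j| a[i][j] -= factor * a[k][j] }
    end
  end
  # The determinant is the signed product of the diagonal.
  sign * (0...n).reduce(1.0) { |prod, k| prod * a[k][k] }
end

det_via_elimination([[1.0, 2.0], [3.0, 4.0]])   # => -2.0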

The code below shows how Cholesky decomposition has been implemented using the Commons Math API.

Cholesky Decomposition

  def factorize_cholesky
    cholesky = CholeskyDecomposition.new(self.twoDMat)
    l = create_dummy_nmatrix
    twoDMat = cholesky.getL
    l.s = ArrayRealVector.new(ArrayGenerator.getArrayDouble\
        (twoDMat.getData, @shape[0], @shape[1]))

    u = create_dummy_nmatrix
    twoDMat = cholesky.getLT
    u.s = ArrayRealVector.new(ArrayGenerator.getArrayDouble\
      (twoDMat.getData, @shape[0], @shape[1]))
    return [u,l]
  end

Similarly, LU Decomposition and QR factorization have been implemented.

LU Decomposition

Code

QR Factorization

Code
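
As a rough sketch of the LU case (the linked code above is authoritative; this assumes the same create_dummy_nmatrix, twoDMat and ArrayGenerator helpers that appear in the Cholesky example), the structure could look like:

def factorize_lu
  lu = LUDecomposition.new(self.twoDMat)

  l = create_dummy_nmatrix
  l.s = ArrayRealVector.new(ArrayGenerator.getArrayDouble(
          lu.getL.getData, @shape[0], @shape[1]))

  u = create_dummy_nmatrix
  u.s = ArrayRealVector.new(ArrayGenerator.getArrayDouble(
          lu.getU.getData, @shape[0], @shape[1]))

  [l, u]
end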

NMatrix#solve

The solve method currently uses LU and Cholesky decomposition.

def solve(b, opts = {})
  raise(ShapeError, "Must be called on square matrix")\
     unless self.dim == 2 && self.shape[0] == self.shape[1]
  raise(ShapeError, "number of rows of b must equal number\
     of cols of self") if self.shape[1] != b.shape[0]
  raise(ArgumentError, "only works with dense matrices") if self.stype != :dense
  raise(ArgumentError, "only works for non-integer, non-object dtypes")\
     if integer_dtype? or object_dtype? or b.integer_dtype? or b.object_dtype?

  opts = { form: :general }.merge(opts)
  x    = b.clone
  n    = self.shape[0]
  nrhs = b.shape[1]

  nmatrix = create_dummy_nmatrix
  case opts[:form]
  when :general, :upper_tri, :upper_triangular, :lower_tri, :lower_triangular
    # LU solver
    solver = LUDecomposition.new(self.twoDMat).getSolver
    nmatrix.s = solver.solve(b.s)
    return nmatrix
  when :pos_def, :positive_definite
    # Cholesky solver for symmetric positive-definite matrices
    solver = CholeskyDecomposition.new(self.twoDMat).getSolver
    nmatrix.s = solver.solve(b.s)
    return nmatrix
  else
    raise(ArgumentError, "#{opts[:form]} is not a valid form option")
  end
end

NMatrix#matrix_solve

Suppose we need to solve a system of linear equations:

                    AX = B

where A is an n×n matrix and B and X are n×p matrices; we solve this equation by iterating through the columns of B.

NMatrix-MRI implements this functionality using the NMatrix::BLAS::cblas_trsm method. In NMatrix-JRuby, the analogous method is NMatrix#matrix_solve.

  def matrix_solve rhs
    if rhs.shape[1] > 1
      nmatrix = NMatrix.new :copy
      nmatrix.shape = rhs.shape
      res = []
      #Solve a matrix and store the vectors in a matrix
      (0...rhs.shape[1]).each do |i|
        res << self.solve(rhs.col(i)).s.toArray.to_a
      end
      #res is in col major format
      result = ArrayGenerator.getArrayColMajorDouble \
         res.to_java :double, rhs.shape[0], rhs.shape[1]
      nmatrix.s = ArrayRealVector.new result

      return nmatrix
    else
      return self.solve rhs
    end
  end

Currently, Hessenberg transformation for NMatrix-JRuby has not been implemented.

Other dtypes

I have tried implementing float dtypes using the FloatMatrix class provided by jblas. jblas was used instead of Commons Math because the latter uses Field Elements for floats, which caused issues with reflection and type erasure. However, using jblas resulted in precision errors.

Code Organisation and Deployment

To minimise conflicts with the MRI codebase, all the JRuby front-end code has been placed in the /lib/nmatrix/jruby directory. lib/nmatrix/nmatrix.rb decides whether to load nmatrix.so or nmatrix_jruby.rb after detecting the Ruby platform.
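
A minimal sketch of that dispatch (the actual file may differ in the exact paths it requires):

# lib/nmatrix/nmatrix.rb (sketch)
if RUBY_ENGINE == 'jruby'
  require 'nmatrix_jruby'   # the pure-Ruby/Java backend in lib/nmatrix/jruby
else
  require 'nmatrix.so'      # the compiled MRI backend
end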

The added advantage is that the Ruby interpreter does not have to decide which function to call at run time. The impact on performance can be seen when running programs that use NMatrix intensively for linear algebraic computations (e.g., mixed_models).

Spec Report

After the port, this final report summarizes the number of tests that pass successfully:

NMatrix

Spec file                  Total Tests  Success  Failure  Pending
00_nmatrix_spec                    188      139       43        6
01_enum_spec                        17        8        9        0
02_slice_spec                      144      116       28        0
03_nmatrix_monkeys_spec             12       11        1        0
elementwise_spec                    38       21       17        0
homogeneous_spec.rb                  7        6        1        0
math_spec                          737      541      196        0
shortcuts_spec                      81       57       24        0
stat_spec                           72       40       32        0
slice_set_spec                       6        2        4        0

Why do some tests fail?

  1. Complex dtype has not been implemented.
  2. Sparse matrices (list and yale) have not been implemented.
  3. Decomposition methods that are specific to LAPACK and ATLAS have not been implemented.
  4. The integer dtype is not properly assigned by the floor, ceil, and round methods.

Mixed_Models

Spec file                          Total Tests  Success  Failure  Pending
Deviance_spec                                4        4        0        0
LMM_spec                                   195      195        0        0
LMM_categorical_data_spec.rb                48       45        3        0
LMMFormula_spec.rb                           5        5        0        0
LMM_interaction_effects_spec.rb             82       82        0        0
LMM_nested_effects_spec.rb                  40       40        0        0
matrix_methods_spec.rb                      52       52        0        0
ModelSpecification_spec.rb                   7        7        0        0
NelderMeadWithConstraints_spec.rb            8        8        0        0

Future Work

NMatrix on JRuby offers speeds comparable to MRI. For specific computations it will be possible to leverage JRuby's threading support and speed things up using multiple cores.

Adding new functionality to NMatrix-JRuby will be easy from here. Personally, I am interested in adding OpenCL support to leverage the GPU computational capacity available on most machines today.

Conclusion

The main goal of this project was to gain from the performance JRuby offers, and to bring a unified linear algebra interface to MRI and JRuby.

By the end of GSoC, I have been able to successfully create a linear algebra library, NMatrix for JRuby users, which they can easily run on their machines — unless they want to use complex numbers, at least for now.

I have also ported the mixed_models gem to JRuby. Here, too, NMatrix-JRuby comes very close to NMatrix-MRI in terms of performance.

Acknowledgements

I would like to express my sincere gratitude to my mentor Pjotr Prins for his continuous support through the summer, and for his patience, motivation, enthusiasm, and immense knowledge. I could not have imagined having a better advisor and mentor for this project.

I am very grateful to Google and the Ruby Science Foundation for this golden opportunity.

I am very thankful to Charles Nutter, John Woods, Sameer Deshmukh, Kenta Murata and Alexej Gossmann, who mentored me through the project. It has been a great learning experience.

I thank my fellow GSoC participants Rajith, Lokesh and Gaurav who helped me with certain aspects of my project.

GSOC 2016 : Final Report (SpiceRub)

Finally, GSOC 2016 has come to an end. This last blog post will aim to sum up the work done in these past few months and how the project can proceed to hit more milestones. Below are some links to my work :-

Working Repository : GitHub Repo

List of Commits : List of commits to the above repository

Example iRuby Notebooks : A bunch of iRuby Notebook examples of the Ruby API

Firstly, I must admit that I was unable to meet all the goals of my proposal. In retrospect, perhaps it was a bit too ambitious for my skill level at the time, and I undoubtedly spent a lot of time learning about Ruby-C extensions, the SPICE Toolkit, and good Ruby code and API design.

General software requirements such as a shippable gem with an install script that downloads the correct binary dependencies and a dataset fetcher were among the targets in my proposal that I could not meet. If you’d like to read the full proposal, you can have a look at it here.

With that said, I am happy with what I did complete. Along with most of the functions in the proposal list being ported successfully, I added 3 Ruby classes to provide a better-abstracted experience while using the Ephemerides subsystem of SPICE. Given the vastness of SPICE itself, this is meagre coverage, but it is a stepping stone that I (and, I hope, others) will build upon.

The SpiceRub API

Please follow the blog links below to read the posts concerning the Ruby classes; they demonstrate the simple API that can be used for various tasks involving ephemerides. It helps to follow the order of the links below, as they build on top of each other.

KernelPool :- A Singleton class to handle the loading and unloading of SPICE Data (Kernels)

Time :- A class that references ephemeris time and has flexible construction functions

Body :- A class that represents a body in space whose motion can be observed with respect to another body.

I spent the majority of the post-midterm period on the latter two classes, while most of the time before that was spent writing C extensions to port the SPICE API to Ruby (and learning good spec manners, something that was not quick to grow on me but got ingrained towards the end).

I ran into a lot of bugs, learnt more about compiler flags and other build options, and got surprisingly good exposure to low-level language trivia for a high-level Ruby project. Honestly, I can type SpiceRub faster than most words on my keyboard now =)

Future Road Map

Things that I’ll tackle (after a short break) include :-

1) Installation Integrity so that gem install does everything needed for installation.

2) Complete Documentation

3) Test Coverage : The test coverage of SpiceRub::Time is currently lacking because I changed the API at the last minute. As most of these functions wrap native functions that have already been tested, this is not the highest priority, but it will be done.

4) Kernel Fetcher Script : Having to crawl the web for relevant Kernels was a massive headache during this project, and adding something that directly refers to NAIF’s FTP servers would be a neat addition.

5) Expand the API : There are many SPICE concepts I am not yet well versed in, and there are a bunch of ported functions that would work very well with a better API. (The most immediate task I can see is a better API for the Geometry Finder subsystem, many functions of which have already been ported in /ext/spice_rub/spice_geometry.c.)

Acknowledgements

I’d like to thank my mentors for this project, Dr. John Woods, Shaun Stewart, and Victor Shepelev, for their guidance and knowledge. John's and Shaun's expertise in astro research and Victor's vast knowledge of professional Ruby code have helped keep this whole ship together, and I'm looking forward to sailing it on a few more journeys.

Also, John made NMatrix, which is sort of the backbone for a lot of SpiceRub's functionality.

Finally, I would like to thank Andrew Annex, who wrote SpicePy, and Philip Rasch, who wrote spiceminer, two Python wrappers for the SPICE Toolkit. SpicePy has high port and test coverage (I lifted a lot of tests from there that weren't available in the SPICE documentation), and spiceminer had an OOP-style API which I referred to while designing the Body class.

It’s been an incredible summer, thank you for reading :)

Statistical Linear Mixed Models in Ruby With Mixed_models (GSoC2015)

Google Summer of Code 2015 is coming to an end. During this summer, I have learned too many things to list here about statistical modeling, Ruby and software development in general, and I had a lot of fun in the process!

Linear mixed models

My GSoC project is the Ruby gem mixed_models. Mixed models are statistical models which predict the value of a response variable as a result of fixed and random effects. The gem in its current version can be used to fit statistical linear mixed models and perform statistical inference on the model parameters as well as to predict future observations. A number of tutorials/examples in IRuby notebook format are accessible from the mixed_models github repository.

Linear mixed models are implemented in the class LMM. The constructor method LMM#initialize provides a flexible model specification interface, where an arbitrary covariance structure of the random effects terms can be passed as a Proc or a block.

A convenient user-friendly interface to the basic model fitting algorithm is LMM#from_formula, which uses the formula language of the R mixed models package lme4 for model specification. With the #from_formula method, the user can conveniently fit models with categorical predictor variables, fixed or random interaction effects, as well as multiple crossed or nested random effects, all with just one line of code.

Examples are given in the sections below.

Implementation

The parameter estimation in LMM#initialize is largely based on the approach developed by the authors of the R mixed models package lme4, which is delineated in the lme4 vignette. I have tried to make the code of the model fitting algorithm in LMM#initialize easy to read, especially compared to the corresponding implementation in lme4.

The lme4 code is largely written in C++, integrated into R via the packages Rcpp and RcppEigen. It uses CHOLMOD code for various sparse matrix tricks, involves passing pointers to C++ objects to R (and vice versa) many times, and passes different R environments from function to function. All this makes the lme4 code rather hard to read. Even Douglas Bates, the main developer of lme4, admits that “The end result is confusing (my fault entirely) and fragile”, because of all the utilized performance improvements. I analyzed the lme4 code in three blog posts (part 1, part 2 and part 3) before starting to work on my gem mixed_models.

The method LMM#initialize is written in a more functional style, which makes the code shorter and (I find) easier to follow. All matrix calculations are performed using the gem nmatrix, which has a quite intuitive syntax and contributes to the overall code readability as well. The Ruby gem loses with respect to memory consumption and speed in comparison to lme4, because it is written in pure Ruby and does not utilize any sparse matrix tricks. However, for the same reasons the mixed_models code is much shorter and easier to read than lme4. Moreover, the linear mixed model formulation in mixed_models is a little bit more general, because it does not assume that the random effects covariance matrix is sparse. More about the implementation of LMM#initialize can be found in this blog post.

Other existing tools

Popular existing software packages for mixed models include the R package lme4 (which is arguably the standard software for linear mixed models), the R package nlme (an older package developed by the same author as lme4, still widely used), Python's statsmodels, and the Julia package MixedModels.jl.

Below, I give a couple of examples illustrating some of the capabilities of mixed_models and explore how it compares to the alternatives.

A usage example and discussion

As an example, we use data from the UCI machine learning repository, which originate from blog posts from various sources in 2010-2012, in order to model (the logarithm of) the number of comments that a blog post receives. The linear predictors are the text length, the log-transform of the average number of comments at the hosting website, the average number of trackbacks at the hosting website, and the parent blog posts. We assume a random effect on the number of comments due to the day of the week on which the blog post was published. In mixed_models this model can be fit with

model_fit = LMM.from_formula(formula: "log_comments ~ log_host_comments_avg + host_trackbacks_avg + length + has_parent_with_comments + (1 | day)",
                              data: blog_data)

and we can display some information about the estimated fixed effects with

puts model_fit.fix_ef_summary.inspect(24)

which produces the following output:

                             coef                     sd                      z_score               WaldZ_p_value
intercept                    1.2847896684307731       0.030380582281933178    42.28983027737477     0.0
log_host_comments_avg        0.415586319225577        0.007848368759350875    52.95193586953086     0.0
host_trackbacks_avg         -0.07551588997745964      0.010915623834434068    -6.918146971979714    4.575895218295045e-12
length                       1.8245853808280765e-05   2.981631039432429e-06    6.119420400102211    9.391631916599863e-10
has_parent_with_comments    -0.4616662830553772       0.13936886611993773     -3.3125496095955715   0.0009244972814528296

We can also display the estimated random effects coefficients and the random effects standard deviation,

puts "Random effects coefficients:"
puts model_fit.ran_ef
puts "Random effects correlation structure:"
puts model_fit.ran_ef_summary.inspect

which produces

Random effects coefficients:
{:intercept_fr=>0.0, :intercept_mo=>0.0, :intercept_sa=>0.0, :intercept_su=>0.0, :intercept_th=>0.0, :intercept_tu=>0.0, :intercept_we=>0.0}
Random effects standard deviation:

#<Daru::DataFrame:70278348234580 @name = 8e11a27f-81b0-48a0-9771-085a8f30693d @size = 1>
                  day
       day        0.0

Interestingly, the estimates of the random effects coefficients and standard deviation are all zero! That is, we have a singular fit. Thus, our results imply that the day of the week on which a blog post is published has no effect on the number of comments that the blog post will receive.

It is worth pointing out that such a model fit with a singular covariance matrix is problematic with the current version of Python's statsmodels (described as “numerically challenging” in the documentation) and the R package nlme (“Singular covariance matrices correspond to infinite parameter values”, a mailing list reply by Douglas Bates, the author of nlme). However, mixed_models, lme4 and MixedModels.jl can handle singular fits without problems. In fact, like mixed_models above, lme4 estimates the random effects coefficients and standard deviation to be zero, as we can see from the following R output:

> mod <- lmer(log_comments ~ log_host_comments_avg + host_trackbacks_avg + length + has_parent_with_comments + (1|day), data = df)
Warning message:
Some predictor variables are on very different scales: consider rescaling 
> ranef(mod)
$day
   (Intercept)
fr           0
mo           0
sa           0
su           0
th           0
tu           0
we           0

> VarCorr(mod)
 Groups   Name        Std.Dev.
 day      (Intercept) 0.0000  
 Residual             1.2614

Unfortunately, mixed_models is rather slow when applied to such a large data set (blog_data is a data frame of size 22435×8), especially when compared to lme4 which uses many sparse matrix tricks and is mostly written in C++ (integrated in R via Rcpp) to speed up computation. The difference in performance between mixed_models and lme4 is on the order of hours for large data, and Julia’s MixedModels.jl promises to be even faster than lme4. However, there is no noticeable difference in performance speed for smaller data sets.

The full data analysis of the blog post data can be found in this IRuby notebook.

A second example and statistical inference on the parameter estimates

Often, the experimental design or the data suggests a linear mixed model whose random effects are associated with multiple grouping factors. A specification of multiple random effects terms which correspond to multiple grouping factors is often referred to as crossed random effects, or nested random effects if the corresponding grouping factors are nested in each other. A good reference on such models is Chapter 2 of Douglas Bates' lme4 book.

Like lme4, mixed_models is particularly well suited for models with crossed or nested random effects. The current release of statsmodels, however, does not support crossed or nested random effects (according to the documentation).

As an example we fit a linear mixed model with nested random effects to a data frame with 100 rows, of the form:

#<Daru::DataFrame:69912847885160 @name = 2b161c5d-00de-4240-be50-8fa84f3aed24 @size = 5>
                    a          b          x          y
         0         a3         b1 0.38842531 5.10364866
         1         a3         b2 0.44622300 6.23307061
         2         a3         b1 1.54993657 12.2050404
         3         a3         b1 1.52786614 12.0067595
         4         a3         b2 0.76011212 8.20054527

We consider the following model:

  • We take y to be the response and x its predictor.
  • We consider the factor b to be nested within the factor a.
  • We assume that the intercept varies due to variable a; that is, a different (random) intercept term for each level of a.
  • Moreover, we assume that the intercept varies due to the factor b which is nested in a; that is, different (random) intercept for each combination of levels of a and b.

That is, mathematically the model can be expressed as

y = beta_0 + beta_1 * x + gamma(a) + delta(a,b) + epsilon

where gamma(a) ~ N(0, phi**2) and delta(a,b) ~ N(0, psi**2) are normally distributed random variables which assume different realizations for different values of a and b, and where epsilon is a random Gaussian noise term with variance sigma**2. The goal is to estimate the parameters beta_0, beta_1, phi, psi and sigma.

We fit this model in mixed_models, and display the estimated random effects correlation structure with

mod = LMM.from_formula(formula: "y ~ x + (1|a) + (1|a:b)",
                       data: df, reml: false)
puts mod.ran_ef_summary.inspect

which produces the output

                  a      a_and_b
a        1.34108300          nil
a_and_b         nil   0.97697500

The correlation between the factor a and the nested random effect a_and_b is denoted as nil, because the random effects in the model at hand are assumed to be independent.

An advantage of mixed_models over some other tools is the simplicity with which p-values and confidence intervals for the parameter estimates can be calculated using a multitude of available methods. Such methods include a likelihood ratio test implementation, multiple bootstrap based methods (which run in parallel by default), and methods based on the Wald Z statistic.

We can compute five types of 95% confidence intervals for the fixed effects coefficients with the following line of code:

mod.fix_ef_conf_int(method: :all, nsim: 1000)

which yields the result

                                              intercept                                        x
wald_z      [-1.0442515623151203, 2.433416817887737]    [4.302419420148841, 5.038899876985704]
boot_basic  [-0.9676586601496888, 2.486799230544233]    [4.30540212917657, 5.028701160534481]
boot_norm   [-1.0575520080398213, 2.4667867000424115]   [4.295959190826356, 5.043382379744274]
boot_t      [-0.9676586601496886, 2.486799230544233]    [4.30540212917657, 5.028701160534481]
boot_perc   [-1.0976339749716164, 2.3568239157223054]   [4.312618136600064, 5.035917167957975]

For example, we see here that the intercept term is likely not significantly different from zero. We could proceed now by performing hypotheses tests using #fix_ef_p or #likelihood_ratio_test, or by refitting a model without an intercept using #drop_fix_ef.

We can also test the nested random effect for significance, in order to decide whether we should drop that term from the model to reduce model complexity. We can use a bootstrap-based version of the likelihood ratio test as follows.

mod.ran_ef_p(variable: :intercept, grouping: [:a, :b],
             method: :bootstrap, nsim: 1000)

We get a p-value of 9.99e-4, suggesting that we probably should keep the term (1|a:b) in the model formula.

A third example — a less conventional model fit

Another advantage of mixed_models over comparable tools is the ease of fitting models with arbitrary covariance structures of the random effects, which are not covered by the formula interface of lme4. This can be done in a user-friendly manner by providing a block or a Proc to the LMM constructor. This feature of the Ruby language makes the implementation and usage of the method incredibly convenient. A danger of allowing arbitrary covariance structures is, of course, that such flexibility gives the user the freedom to specify degenerate and computationally unstable models.

As an example we look at an application to genetics, namely to SNP data (single-nucleotide polymorphism) with known pedigree structures (family relationships of the subjects). The family information is prior knowledge that we can model in the random effects of a linear mixed effects model.

We model the quantitative trait y (a vector of length 1200) as

y = X * beta + b + epsilon,

where X is a 1200 x 130 matrix containing the genotypes (i.e. 130 SNPs for each of the 1200 subjects); epsilon is a vector of independent random noise terms with variances equal to sigma**2; beta is a vector of unknown fixed effects coefficients measuring the contribution of each SNP to the quantitative trait y; and b is a vector of random effects.

If we denote the kinship matrix by K, then we can express the probability distribution of b as b ~ N(0, delta**2 * 2 * K), where we multiply K by 2 because the diagonal of K is constant 0.5, and where delta**2 is an unknown scaling factor.

The goal is to estimate the unknown parameters beta, sigma, and delta, and to determine which of the fixed effects coefficients are significantly different from 0 (i.e. which SNPs are possibly causing the variability in the trait y).

In order to specify the covariance structure of the random effects, we need to pass a block or Proc that produces the upper triangular Cholesky factor of the covariance matrix of the random effects from an input Array. In this example, that would be the multiplication of the prior known Cholesky factor of the kinship matrix with a scaling factor.

Having all the model matrices and vectors, we compute the Cholesky factor of the kinship matrix and fit the model with

# upper triangular Cholesky factor
kinship_mat_cholesky_factor = kinship_mat.factorize_cholesky[0]

# Fit the model
model_fit = LMM.new(x: x, y: y, zt: z,
                    x_col_names: x_names,
                    start_point: [2.0],
                    lower_bound: [0.0]) { |th| kinship_mat_cholesky_factor * th[0] }

Then we can use the available hypotheses test and confidence interval methods to determine which SNPs are significant predictors of the quantitative trait. Out of the 130 SNPs in the model, we find 24 to be significant as linear predictors.

See this blog post for a full analysis of this data with mixed_models.

Room for improvement and future work

  • Writing the formula language interpretation code used by LMM#from_formula from scratch was not easy. Much of the code can be reorganized to be easier to read and to use in other projects. Possibly, the formula interface should be separated out, similar to how it is done with the Python package patsy. Also, some shortcut symbols (namely *, /, and ||) in the model specification formula language are currently not implemented.

  • I plan to add linear mixed models for high-dimensional data (i.e. more predictors than observations) to mixed_models, because that work would be in line with my current PhD research.

  • I plan to add generalized linear mixed models capabilities to mixed_models, which can be used to fit mixed models to discrete data (such as binary or count data).

Acknowledgement

I want to thank Google and the Ruby Science Foundation for giving me this excellent opportunity! I especially want to thank Pjotr Prins who was my mentor for the project for much helpful advice and suggestions as well as his prompt responses to any of my concerns. I also want to thank my fellow GSoC participants Will, Ivan, and Sameer for their help with certain aspects of my project.

GSoC 2015: New NMatrix Gems for Advanced Linear Algebra Features

My Google Summer of Code project was working on the NMatrix project, moving functionality that depends on external libraries from the core nmatrix gem to optional plugin gems. NMatrix is a Ruby library for linear algebra, used by many other projects. In addition to the code that was part of NMatrix proper, NMatrix previously required the ATLAS library, which implemented fast versions of common matrix operations like multiplication and inversion, as well as more advanced operations like eigenvalue decomposition and Cholesky decomposition.

There were two separate but related motivations for my project. The first was to simplify the NMatrix installation process. ATLAS can be difficult to install, so the installation process for NMatrix was complicated, especially on OS X, and may have discouraged people from using NMatrix. The second motivation was that by separating out the ATLAS code from the main NMatrix code, it would be easier to add new linear algebra backends which provide similar features. Indeed, I implemented a second backend this summer.

The end result of my summer’s work:

  • The core nmatrix gem does not depend on any external linear matrix libraries. It provides non-optimized implementations of common matrix operations.
  • All code that requires ATLAS has been moved into the new nmatrix-atlas gem, so that those who are only interested in the core functionality are not required to install ATLAS. nmatrix-atlas provides optimized implementations of common matrix operations, as well as advanced functions not available in nmatrix. I wrote a blog post describing the setup for releasing multiple gems from the same repository, which this required.
  • A new gem nmatrix-lapacke, which provides the same features as nmatrix-atlas, but instead of depending specifically on the ATLAS library, requires any generic LAPACK and BLAS implementation. This should be easier to use for many users as they may already have LAPACK installed (it comes pre-installed with OS X and is commonly used in Linux systems), but not ATLAS.
  • The installation procedure is simplified, especially for those installing just the nmatrix gem. Compare the new installation instructions to the old ones.

The one deviation from my original proposal was that I originally intended to remove all the ATLAS code and release only the nmatrix-lapacke plugin, so that we would only have one interface to the advanced linear algebra functions, but I decided to keep the ATLAS code, since the nmatrix-lapacke code is new and has not had a chance to be thoroughly tested.

Usage

require 'nmatrix'
# create a 3-by-3 matrix
a = NMatrix.new([3,3], [1,2,3, 4,5,6, 7,8,9], dtype: :float64)
#invert it using non-optimized NMatrix-internal implementation
a.invert!

require 'nmatrix'
require 'nmatrix/atlas' #or require 'nmatrix/lapacke'
# create a 3-by-3 matrix
a = NMatrix.new([3,3], [1,2,3, 4,5,6, 7,8,9], dtype: :float64)
#invert it using optimized implementation provided by ATLAS
a.invert!

For advanced functions not provided by the core nmatrix gem, for example gesvd, nmatrix-atlas and nmatrix-lapacke provide a common interface:

require 'nmatrix'
require 'nmatrix/atlas'
a = NMatrix.new([4,5], [1,0,0,0,2, 0,0,3,0,0, 0,0,0,0,0, 0,4,0,0,0],
                dtype: :float64)
u, s, vt = a.gesvd

# Identical to the above, except for the require
require 'nmatrix'
require 'nmatrix/lapacke'
a = NMatrix.new([4,5], [1,0,0,0,2, 0,0,3,0,0, 0,0,0,0,0, 0,4,0,0,0],
                dtype: :float64)
u, s, vt = a.gesvd

If the developer wants to use an advanced feature, but does not care whether the user is using nmatrix-atlas or nmatrix-lapacke, they can require nmatrix/lapack_plugin, which will require whichever of the two is available, instead of being forced to choose between the two.
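
For example, a library author could write the following and let the user's installed plugin decide which backend does the work (the matrix and gesvd call mirror the examples above):

require 'nmatrix'
require 'nmatrix/lapack_plugin'   # loads nmatrix/atlas or nmatrix/lapacke,
                                  # whichever is available

a = NMatrix.new([4,5], [1,0,0,0,2, 0,0,3,0,0, 0,0,0,0,0, 0,4,0,0,0],
                dtype: :float64)
u, s, vt = a.gesvd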

As a fun test of the new gems, I did a very simple benchmark, just testing how long it took to invert a 1500-by-1500 matrix in place using NMatrix#invert!:

  • nmatrix (no external libraries): 3.67s
  • nmatrix-atlas: 0.96s
  • nmatrix-lapacke with ATLAS: 0.99s
  • nmatrix-lapacke with OpenBLAS (multithreading enabled): 0.39s
  • nmatrix-lapacke with reference implementations of LAPACK and BLAS: 3.72s

This is not supposed to be a thorough or realistic benchmark (performance will depend on your system, on how you built the libraries, and on the exact functions that you use), but there are still a few interesting conclusions we can draw from it:

  • Performance is much better using the two highly optimized libraries (ATLAS and OpenBLAS) than using either the NMatrix internal implementation or the reference implementation.
  • When using ATLAS, performance is similar whether using nmatrix-atlas or nmatrix-lapacke (this means we could consider deprecating the nmatrix-atlas gem).

Overall, my summer has been productive. I implemented everything that I proposed and feedback from testers so far has been positive. I plan to stay involved with NMatrix, especially to follow up on any issues related to my changes. Although I won’t be a student next summer, I would certainly consider participating in Google Summer of Code in the future as a mentor. I’d like to thank my mentor John Woods and the rest of the SciRuby community for support and feedback throughout the summer.