Tools for Scientific Computing in Ruby

Ruby Science Foundation Selected for GSoC 2014

We’re excited to announce that the Ruby Science Foundation has been selected as a mentoring organization for Google Summer of Code 2014!

Last year was our first year as a mentoring organization, and we had a great group of students working with us on machine learning, timeseries statistics, the semantic web, and scientific plotting.

This year we’ve got a super set of possible projects including more flexible matrix computations, automatic Ruby interface generation for scientific libraries, a dataframe library for structuring and manipulating datasets, interactive plotting, a scientific notebook, high-performance minimization and integration libraries, and a semantic web datastore backend for scientific computing.

If you’re interested in applying as a student, learning more, or even contributing independent of GSoC, head over to our GSoC 2014 ideas page to see what projects we think are great. Don’t hesitate to tell us if you’ve got an amazing idea for a different project, too! If you’re still left wondering where to start, check out the issue tracker for NMatrix, the matrix computation library used as the basis for a number of our projects, and our top priority at the moment.

Good luck to all the GSoC applicants out there, and happy coding!

Some Words From GSoC 2013 Alumni

In 2013, SciRuby was a mentoring organization for the Google Summer of Code. We asked our alumni students:

1) How did you experience GSoC/SciRuby and what has it brought you?

2) What advice would you give new applicants?

Monica Dragan from Romania worked on gene validation (see also her blog). Monica was actually part of a different GSoC organization, PhyloSoC, but she also participated in our Ruby-centric meetings and code reviews. She shared her GSoC experience:

Monica: During the GSoC period I developed a bioinformatics tool written in Ruby. First of all, I learned a new programming language, as I had no experience with Ruby before. Through GSoC I had the opportunity to get in touch with the community, and I met people passionate about their work, with whom I continued to collaborate afterwards. But what I really gained from this experience is that it increased my enthusiasm for bioinformatics and confirmed to me that this is the field I want to focus on in the coming years.

Alberto Arrigoni from Italy worked on data mining and machine learning algorithms for Ruby and shared his GSoC experience:

Alberto: As a PhD student in the field of bioinformatics, my GSoC experience was very exciting and useful on several levels. On a training level, I had the unique chance to learn in more depth some topics of machine learning I had wanted to explore in the past, but never had quite the opportunity or the resources. On a more technical level, I appreciated the support of the GSoC mentors and the SciRuby community, which counts numerous experts and a very active mailing list.

Ankur Goel from India worked on statsample-timeseries for Ruby. Ankur shared,

Ankur: It was the best learning experience. I learnt quite a lot of statistics while working on my TimeSeries extension; after GSoC, I picked up a machine learning course and was able to relate to it very easily after working on regression techniques in the GLM extension. I can't give enough thanks for the opportunity provided and the trust my mentor placed in me. Learning to write quality code and getting reviews was the cherry on the cake!

Will Strinz from Madison, USA, worked on RDF generators for Ruby for the semantic web:

Will: GSoC 2013 was a new experience for me in terms of managing my own time, planning my own project, and keeping up consistent interaction with my mentors across time zones. Despite a decent amount of prior experience with Ruby, it was also a challenge and an opportunity for me to really understand the tools and practices I knew, and learn to use the ones I wasn't familiar with.

As for what it's brought me: aside from a job I secured partly through the skills and project portfolio I gained during GSoC, and the power of knowing how to do just about any programming task using Ruby, I learned a lot about how to manage a project and interact with people in the real world. Communicating properly and in a timely manner over email and other asynchronous services is absolutely critical to the work I do now, and a lot harder than people make it out to be. Staying in touch with my mentors and making sure we were all on the same page about my project was something I spent a lot of time on, and in doing so I gained a lot of comfort with the process.

Additionally, GSoC was my first true experience designing a large piece of software, where I couldn't just give up and trash it when the code started getting messy or confusing. It really forced me to adopt good practices around testing and organization, especially since I had better programmers than myself looking over my work. Software architecture is something you just don't learn in college-level CS courses, and by the time I'd graduated, I'd started hearing a lot of my CS professors saying this too. Some day in the future, maybe soon, there will be classes taught on just this subject, but for now there's no better way to learn about it than by working on a real project, with some accountability and motivation to actually get it done.

Our alumni give new GSoC applicants the following advice:

Monica: GSoC is a great experience that you should try as a student! What is cool about GSoC is that you work on the project you are keen on and manage your time as you wish. Also, working remotely involves additional challenges. In the end you improve your experience and get to know a lot of new and great people.

Alberto: I think one of the best features offered by GSoC is the possibility to collaborate with (and learn from) people who share the same scientific interests but have very different backgrounds and skills. Though this may be somewhat 'expected' of mentors, I was also lucky to find other GSoC students willing to bond and share experiences and opinions. My advice is to be cooperative and try to learn as much as possible from and with them!

Ankur: Work really hard. Do your homework before you ask questions or quote anything in your proposal. Writing a good proposal is necessary, and you must really be aware of what you are writing; good research is essential. SciRuby community members are readily available to help you on the mailing list and in the #sciruby channel. A thorough discussion with your mentor will help you out.

Will: To new applicants this year I'd stress one thing above all else: get in touch with people on the SciRuby mailing list. Introduce yourself as soon as possible, and start discussing your project ideas when you have something in mind. People on the mailing list are very friendly and helpful, so don't be afraid to start a conversation, but also expect constructive criticism of your proposals. Answering any questions or concerns promptly and thoroughly not only shows that you know your stuff and are passionate about your project, it also shows that you are a good fit for GSoC in general. Don't assume you're in just because you've had a good dialogue, but plan and communicate as though you are; don't wait for the project to start to fill in details or to contact your prospective mentors personally. Once you've submitted a proposal, all of this goes double. The closer you get to the deadline, the less time there will be to polish your application and respond to questions, so make sure you're doing it quickly and effectively.

Our SciRuby GSoC alumni added:

Monica: If I don’t join this year, I wish you good luck with the new students!

Ankur: I will be happy to sign up again as a student this year!

Will: I know I’ve said this already, but GSoC last year was a defining moment in my path to becoming a software developer, career-wise sure, but more importantly in the coder vs hacker vs computer scientist vs software developer sense. If there’s anything I can do to get involved this year, I’ll be available.

Statistics With Ruby: Time Series and General Linear Models

Editor’s Note: This is the third of four blog posts detailing our Google Summer of Code 2013 students’ work, edited by John Woods.


Statsample is a statistics suite in Ruby covering both basic and advanced methods. It aims to support JRuby and MRI/YARV equally, and provides pure-Ruby implementations of many functions.

Statsample is the perfect library for anyone who is (a) interested in exploring statistics and (b) at least a little familiar with (or interested in) Ruby.

It offers a rich API; for problems involving time series and generalized linear models (GLM), however, the functionality was rather basic.

So, during Google Summer of Code 2013, working with the SciRuby Project, I released two extensions: Statsample TimeSeries and Statsample GLM.

These gems aim to take Statsample further and incorporate various functionalities and estimation techniques on continuous data.

Statsample TimeSeries

Statsample TimeSeries is equipped with a variety of operations, which are directly available on the Series object. A few of those functionalities are:

  • Autocorrelation of series: for finding repeating patterns (like a periodic signal) in noisy data, or for identifying persistence (if it rained today, will it rain tomorrow?).
  • Autoregressive and moving average models (AR, MA, ARMA): useful for describing random processes, such as those found in nature and economics, that are believed to be predictable from past behavior (e.g., El Niño, the stock market).
  • Partial autocorrelation with Yule–Walker, a method for calculating the coefficients of autoregressive models.
  • Levinson–Durbin estimation: for solving linear equations involving a Toeplitz matrix, such as in signal processing or cyclic signals.
  • Kalman filtering (or linear quadratic estimation): often used for determining position and motion of a moving object based on sensor information (e.g., for drawing a vehicle’s position on a map using GPS data, or for aircraft or spacecraft navigation based on sensor inputs)
  • Moving averages, differences and autocovariances.
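To make the first of these concrete: sample autocorrelation at a given lag compares a series with a shifted copy of itself. Here is a plain-Ruby sketch (a simplified estimator for illustration, not the statsample-timeseries implementation):

```ruby
# Sample autocorrelation at a given lag (simplified, biased estimator).
def autocorrelation(series, lag)
  n = series.length
  mean = series.sum.to_f / n
  var = series.sum { |x| (x - mean)**2 }
  cov = (0...(n - lag)).sum { |i| (series[i] - mean) * (series[i + lag] - mean) }
  cov / var
end

# A periodic test signal with period 20.
signal = (0...100).map { |i| Math.sin(i * Math::PI / 10) }

autocorrelation(signal, 0)   # effectively 1.0: a series correlates perfectly with itself
autocorrelation(signal, 20)  # strongly positive: the lag matches the period
autocorrelation(signal, 10)  # strongly negative: half a period out of phase
```

Plotting these values over many lags gives the correlogram that `acf` summarizes for you.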

To get your hands dirty,

  • Install Statsample with gem install statsample.
  • Next, install the TimeSeries extension with gem install statsample-timeseries.

Now, let’s make a simple TimeSeries object:

require 'statsample-timeseries'
include Statsample::TimeSeries # Enable the DSL
# Create a randomized time series of 100 continuous elements
ts = { rand(100) }.to_ts

# Get the autocorrelation of the series
ts.acf

# Get the partial autocorrelation of the series
ts.pacf

# Partial autocorrelation with 11 lags by maximum likelihood estimation
ts.pacf(11, 'mle')

# Fit an ARIMA(2, 1, 1) model
k_obj = TimeSeries.arima(ts, 2, 1, 1)
k_obj.ar # Gives the autoregressive coefficients # Gives the moving average coefficients

You can go through the documentation and API for more information.

Statsample GLM

Statsample GLM includes a number of helpful regression techniques for analyzing data, among them logistic (binomial) and Poisson regression.

The top level module for regression techniques is Statsample::Regression.

Using it is as simple as ever:

  • First, install Statsample with gem install statsample.
  • Then, install the GLM extension with gem install statsample-glm.

Let’s get started:

require 'statsample-glm'
# Create the datasets:
x1 =[0.537322309644812,-0.717124209978434,-0.519166718891331,0.434970973986765,-0.761822002215759,1.51170030921189,0.883854199811195,-0.908689798854196,1.70331977539793,-0.246971150634099,-1.59077593922623,-0.721548040910253,0.467025703920194,-0.510132788447137,0.430106510266798,-0.144353683251536,-1.54943800728303,0.849307651309298,-0.640304240933579,1.31462478279425,-0.399783455165345,0.0453055645017902,-2.58212161987746,-1.16484414309359,-1.08829266466281,-0.243893919684792,-1.96655661929441,0.301335373291024,-0.665832694463588,-0.0120650855753837,1.5116066367604,0.557300353673344,1.12829931872045,0.234443748015922,-2.03486690662651,0.275544751380246,-0.231465849558696,-0.356880153225012,-0.57746647541923,1.35758352580655,1.23971669378224,-0.662466275100489,0.313263561921793,-1.08783223256362,1.41964722846899,1.29325100940785,0.72153880625103,0.440580131022748,0.0351917814720056,-0.142353224879252], :scale)
x2 =[-0.866655707911859,-0.367820249977585,0.361486610435,0.857332626245179,0.133438466268095,0.716104533073575,1.77206093023382,-0.10136697295802,-0.777086491435508,-0.204573554913706,0.963353531412233,-1.10103024900542,-0.404372761837392,-0.230226345183469,0.0363730246866971,-0.838265540390497,1.12543549657924,-0.57929175648001,-0.747060244805248,0.58946979365152,-0.531952663697324,1.53338594419818,0.521992029051441,1.41631763288724,0.611402316795129,-0.518355638373296,-0.515192557101107,-0.672697937866108,1.84347042325327,-0.21195540664804,-0.269869371631611,0.296155694010096,-2.18097898069634,-1.21314663927206,1.49193669881581,1.38969280369493,-0.400680808117106,-1.87282814976479,1.82394870451051,0.637864732838274,-0.141155946382493,0.0699950644281617,1.32568550595165,-0.412599258349398,0.14436832227506,-1.16507785388489,-2.16782049922428,0.24318371493798,0.258954871320764,-0.151966534521183], :scale)

y =[0,0,1,0,1,1,1,1,0,1,1,1,1,0,1,0,1,1,0,1,0,1,1,1,1,0,0,1,1,0,0,1,0,0,1,1,0,0,1,1,0,1,1,1,1,0,0,0,1,1], :scale)

# The intercept term: a column of ones
intercept =[1] * x1.size, :scale)

x ={"i" => intercept, "x1" => x1, "x2" => x2})

obj = Statsample::Regression.glm(x, y, :binomial)
# => Returns a logistic regression object

The documentation and API details are available here.

We have more plans for the GLM module. First on the list is to make the algorithms work with singular value decomposition, because manual inversion of matrices is no fun for the larger values encountered in a Poisson regression.
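The motivation is the standard numerical one: factor the system and solve it, rather than forming an explicit inverse. A small generic illustration with Ruby's standard-library Matrix (not Statsample GLM code):

```ruby
require 'matrix' # Ruby standard library

# Solve A * x = b two ways: by explicit inversion, and by LUP factorization.
a = Matrix[[4.0, 2.0], [2.0, 3.0]]
b = Vector[10.0, 9.0]

x_inv = a.inverse * b   # forms A^-1 explicitly: simple, but fragile for
                        # large or ill-conditioned systems
x_lup = a.lup.solve(b)  # factor once, then back-substitute

# Both routes recover the solution x = (1.5, 2.0) here; for bigger,
# nearly singular systems, the factorization route is the stable one.
```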


I have blogged about most of the functionalities; additional information is available there.

Please explore and use the libraries; I eagerly await your input, suggestions and questions. Feel free to leave any questions on the Statsample GLM tracker or the Statsample TimeSeries tracker.

I had an amazing summer!

Stay tuned and Enjoy.

Call for Funding: More Women Needed in Open Source Science Software

Women make up 51% of the American workforce, and yet only 20% of software engineers are female. Worldwide, the situation is similar. In open source software engineering, the statistics are worse: only 1.5–5% are female.

One of the organizations which presented at the Google Summer of Code Mentor Summit was the GNOME Foundation’s Outreach Program for Women (OPW). OPW is similar to GSoC, except that OPW doesn’t require its applicants to be students — or know how to program when the coding period begins. The pay is competitive with GSoC. And of course, only women can apply.

In the process of our Google Code-In 2013 application, I recruited several female mentors to work with female GCI students — not a requirement, but I think it helps to have supportive people involved with whom one can identify. Unfortunately, we weren’t selected for the Code-In (not too disappointing given the several venerable and accomplished organizations that were chosen). But we want to have another go, this time by applying for the Outreach Program for Women.

Here’s where we need your help.

Work for a company that might want to support this goal? Show this to your boss. Have him or her get in touch with us (sciruby.project at gmail dot com).

If you don’t work for such a company, but would still like to help, you can also get in touch at the same email address. As a general rule of thumb, you can always donate via Pledgie, even if you don’t have access to tons of money.

By the way, here’s a blog post by one of our mentors, Anna Belak.

Mentoring Future Computational Power Women for GCI 2013

My name is Anna, and I'm an engineer and a scientist. I study Materials Science and Engineering at the University of Michigan, and I graduated from Virginia Tech with a degree in Physics. My thesis project deals with computationally predicting the properties of materials used in lithium-ion batteries and airplane turbine blades.

I’m writing this blog post for SciRuby, but I’m really writing it for the young women out there. SciRuby is applying to participate in Google’s Code-In 2013, which aims to get high school students involved in the open source movement, and specifically in coding. It’s great preparation for college and beyond — whatever you might choose to study.

I became involved in science because I love figuring stuff out and working with smart, interesting people, and I want you to get involved, too. Modern science wouldn’t be possible without open source software. Moreover, if you learn how to code, you’ll always have a job. Always.

And the cool part is that you don’t have to know how to code in order to participate. Many of the tasks involve research and documentation in science and mathematics — which is a great way to obtain valuable experience for college applications.

Ruby and the Semantic Web, RDF, and SPARQL: PubliSci

Editor’s Note: This is the second of four blog posts detailing our Google Summer of Code 2013 students’ work. I edited it to include a very incomplete list of public RDF repositories. —John Woods


Across all fields and disciplines, contemporary scientists are faced with a massive and growing volume of data. It often won’t fit in a lab notebook, and there is a pressing need to share it more quickly and widely than publication in a journal article would allow for. Database software is one great solution for storage of such data, but relational databases become brittle in the face of changes or new information, do not play nicely with other databases or data derived from such databases, and may not be fully machine (or human) readable without pre-existing knowledge.

Meanwhile, the Internet is an extremely useful place to discover and share useful information, but it is essentially built around linked documents, rather than pure data, and so our primary mechanism for sharing data is as HTML or text.

RDF and related technologies propose to provide the means to move beyond a web of documents to a web of data. Along the way, these technologies may address many of the problems with conventional relational databases (e.g., SQL). At its core, RDF defines an extremely flexible method for representing data on the web — data which is nonetheless unambiguously defined, without any external context, and can be linked to other data as web documents link to each other. Because RDF data can be understood as either a set of subject–predicate–object statements or a directed graph with labeled edges, a number of supporting standards and tools have grown up around it, providing powerful storage and access methods that are often easier to implement and use than those associated with relational databases and the document-based web.

Enter PubliSci

This summer I created a Ruby gem, PubliSci, to facilitate data publication and interaction using the Semantic Web. RDF offers a unified way to share and combine information from multiple sources, support for machine learning tools, a flexible query language that makes application integration easy, and the backing of the World Wide Web Consortium (W3C) and other standards bodies.

The PubliSci gem comprises a set of parsers for converting various input formats to RDF using the Data Cube vocabulary, and a Ruby interface for defining new ones. Since the relationship between external datasets and semantic web formats is sometimes open to interpretation, a domain-specific language is included to allow end users to resolve ambiguities and provide additional metadata.

Along with the conversion tool, a standalone server is available as an extension to the gem that simplifies setting up and interacting with RDF data stores. The server allows import, export, querying, and management of external triplestores such as 4store, and supports both cross-domain access and content negotiation, so the gem can be accessed from JavaScript or other applications.

Triplestores are databases for the storage and retrieval of triples, which are typically subject–object–predicate relationships (e.g., Bob knows Fred).
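The concept is small enough to sketch in plain Ruby. A toy in-memory triplestore with wildcard pattern matching (purely illustrative; PubliSci delegates this work to real stores such as 4store) might look like:

```ruby
# A toy triplestore: stores subject-predicate-object statements and
# answers pattern queries, where nil acts as a wildcard.
class TinyTripleStore
  def initialize
    @triples = []
  end

  def insert(subject, predicate, object)
    @triples << [subject, predicate, object]
  end

  def query(s = nil, p = nil, o = nil)
    @triples.select do |ts, tp, to|
      (s.nil? || ts == s) && (p.nil? || tp == p) && (o.nil? || to == o)
    end
  end
end

store = TinyTripleStore.new
store.insert('Bob',  'knows',    'Fred')
store.insert('Bob',  'authored', 'dataset1')
store.insert('Fred', 'knows',    'Bob')

store.query('Bob')         # everything we know about Bob (two triples)
store.query(nil, 'knows')  # every "knows" relationship (two triples)
```

A SPARQL pattern like `?s ?p ?o` is this same idea of wildcard matching, generalized with variables, joins, and filters.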

If you’d like to contribute, the source code is available on GitHub, and a broad outline of the to-do list is available as well.


Once you’ve done gem install publisci, you can require the gem in the normal way (require 'publisci'). To invoke the domain-specific language, you’ll also want to include the DSL module:

require 'publisci'
include PubliSci::DSL

Input data can be specified like so:

# Specify input data
data do
  # Use local or remote paths to point to the data file you want to load:
  source ''

  # Specify datacube properties.
  dimension 'producer', 'pricerange'
  measure 'chunkiness'

  # Set parser-specific options.
  option 'label_column', 'producer'
end

You can provide metadata on your dataset as well.

metadata do
  dataset 'bacon'
  title 'Bacon dataset'
  creator 'Will Strinz'
  description 'some data about bacon'
  date '1-10-2010'
end

Sending the data to a repository is simple.

# Send output to an RDF::Repository
#  can also use 'generate_n3' to output a turtle string
repo = to_repository

SPARQL queries can be run on the dataset using the QueryHelper module.

# run SPARQL queries on the dataset
PubliSci::QueryHelper.execute('select * where {?s ?p ?o} limit 5', repo)

Finally, data can be exported in other formats, such as ARFF:

# export in other formats

Some places to look for RDF repositories

GSOC 2013: Data Mining in JRuby With Ruby Band

Editor’s Note: This is the first of four blog posts detailing our Google Summer of Code 2013 students’ work, edited by John Woods.

In the context of Google Summer of Code 2013, I developed a Ruby gem called Ruby Band that makes selected Java software for data mining and machine learning available to JRuby users. This project complements existing SciRuby software by adding support for the Weka toolkit and for general functions included in the Apache Commons Math library.

Like Weka, Ruby Band features a comprehensive collection of data preprocessing and modeling techniques, and supports several standard data mining tasks: data pre-processing (filtering), clustering, classification, regression, and feature selection.

All of Ruby Band’s techniques have been built on the assumption that the data is available as a single flat file or relation, where each datum is described by a fixed number of attributes. The loaded datasets are stored in Weka Instances objects, which are used as ‘core’ data types for the interactions with other software (Apache Commons Math) or platforms.

Originally, Ruby Band was called Ruby Mining. I renamed it Ruby Band, as I imagine different software sources (Weka, Apache, etc.) working as a whole, like in a real band. Ruby Band has been designed with a modular structure, so that it can be easily extended and integrated with other Java software. The Core module allows data types from different sources to be defined and integrated using converter methods; functionality from each piece of additional software is imported independently. This structure increases the extensibility of the product, as other developers may later add functionality according to their needs.

Though Ruby Band provides full support for most of the Weka APIs, some areas still need to be addressed properly. As I coded, I used the Weka Java API documentation as my roadmap; if you want to contribute, go see what is still missing. The best and easiest way to introduce new functionality into the Ruby Band framework is to write a pertinent Cucumber test, as a number of Weka recipes have been posted online by its creators.

The beta version of Ruby Band has already been released for general use on Rubygems (gem install ruby-band).

This is a quick example of how to train a classifier on a dataset parsed from an ARFF file:

require 'ruby-band'

# parse a dataset
dataset = Core::Parser.parse_ARFF('my_file.arff')

# initialize and train a classifier
classifier = do
  set_options '-M d'
  set_data dataset
  set_class_index 4
end

# cross-validate the trained classifier
puts classifier.cross_validate(3)

NMatrix Nearing Beta Release

As of this writing, NMatrix v0.0.9 is available on RubyGems. It is likely that the next version will be a beta release candidate, as there’s only one critical feature still missing (== between matrices of different storage types).

An enormous amount has changed since my last entry.

New, friendlier constructor

First and foremost, NMatrix sports a new constructor, based on helpful comments from folks on the listserv. Here are some examples of construction:

NMatrix.new([4,4], 0)                                 # 4x4 dense matrix of :int32, all 0
NMatrix.new([4,4], 0.0)                               # 4x4 dense matrix of :float64, all 0.0
NMatrix.new([4,4], 0.0, dtype: :complex64)            # 4x4 dense matrix of :complex64
NMatrix.new([1,4], stype: :yale)                      # size 4 sparse (Yale) row vector of 0s
NMatrix.new([4,3], [0,1,2])                           # 4 rows, 3 columns: gradient across each row from 0 to 2 (int32)
NMatrix.new([4,3], [0,1,2], stype: :yale, default: 0) # same as above, but Yale storage (int32)
NMatrix.new([4,1], stype: :list, dtype: :rational128) # size 4 sparse (list) column vector of rational 0s
NMatrix.new([4,4])                                    # 4x4 dense matrix containing nils
NMatrix.new([4,4], dtype: :int64)                     # 4x4 uninitialized dense matrix of 64-bit integers

The different storage types (stypes) have slightly different behaviors when no initialization value is provided. This may change in the future, but addresses the somewhat different use-cases of these storage types.

I show a variety of examples above, several of which are not particularly wise uses — for example, the [0,1,2] Yale gradient, which uses 32-bit integers and must store 11 column indices and pointers (which are most likely unsigned long integers) in addition to the 11 entries (4 for the always-stored diagonal, 1 for the default, and 6 for the non-diagonal non-zeros).

The key to understanding the differing construction is the default value. All sparse matrices need a default. Usually we think of sparse matrices as being zero-based (only non-zero values are stored), but NMatrix now allows other defaults. However, dense matrices don’t need defaults, and in fact sometimes it’s more efficient to allocate them without initializing them — such as if they’re about to store the results of some LAPACK function call. But if they contain Ruby objects, they have to be initialized, which is why they become nil.
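The default-value idea can be sketched with a toy Hash-backed store (an illustration of the concept, not NMatrix's actual C implementation):

```ruby
# Sketch of a default-value sparse matrix: only entries that differ from
# the default are stored; every other coordinate implies the default.
class SparseSketch
  attr_reader :default

  def initialize(rows, cols, default = 0)
    @rows, @cols, @default = rows, cols, default
    @stored = {} # {[i, j] => value}
  end

  def [](i, j)
    @stored.fetch([i, j], @default)
  end

  def []=(i, j, value)
    if value == @default
      @stored.delete([i, j]) # keep storage minimal
    else
      @stored[[i, j]] = value
    end
  end

  def stored_count
    @stored.size
  end
end

m = SparseSketch.new(1000, 1000, 0)
m[3, 4] = 7
m[3, 4]        # => 7
m[999, 999]    # => 0 (the default, never stored)
m.stored_count # => 1, out of a million logical entries
```

A nonzero default works the same way: storage cost depends on how many entries differ from it, not on how many are nonzero.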

We’re still working out a few bugs in construction. If you find any, please report them in our issue tracker.

Elimination of NVectors

We removed the NVector class. Frankly, it didn’t make sense. A vector can’t have an orientation unless it has multiple dimensions — even if some of those dimensions have length 1. Now, vectors and matrices are treated the same in the code.

The good news is that element access ([]) has been rewritten so that the programmer only needs to provide the coordinates whose dimensions are not 1. For example:

n = NMatrix.new([4,1,3], 0) # 3-dimensional matrix of 0s
n[2,1] == n[2,0,1]
=> true

The same rule applies to slicing.
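The rule amounts to re-inserting a 0 coordinate for every dimension of length 1. A plain-Ruby sketch of the idea (illustrative only; expand_coords is a hypothetical helper, not NMatrix API):

```ruby
# Expand coordinates given only for the non-1 dimensions of a shape
# into full coordinates, inserting 0 for each length-1 dimension.
def expand_coords(shape, coords)
  coords = coords.dup
  shape.map { |dim| dim == 1 ? 0 : coords.shift }
end

expand_coords([4, 1, 3], [2, 1]) # => [2, 0, 1], so n[2,1] reaches n[2,0,1]
```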

Slicing improvements

Aleksey Timin contributed the framework for matrix slicing. Matrices can be sliced by copying or by reference. Modifying a copy-slice does not modify the original matrix; but modifying a reference slice does.

A copy of some portion of the matrix may be made using NMatrix#slice:

new_matrix = n.slice(0..1,0..2)

And what appears to be a copy, but is actually a reference, may be made using brackets:

ref = n[0..1,0..2]

Reference slices may also be modified in a variety of ways:

n[0..4,2..8] = 0   # change all entries to 0
n[0..4,2..8] = [0,1,2]  # change the selected entries to [0,1,2,0,1,2,0,1,2...]
n[0..4,2..8] = NMatrix.new([3,3], [0,1,2]) # same as above

For the sake of speed, each of these []= returns the right-hand value, whether that value is an array, matrix, or single value. So, you can do this:

x = m[0..4,2..8] = n[0..1,0..2] = [1,2,3]

and x will be equal to [1,2,3] after the evaluation.
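This chaining is possible because a Ruby assignment expression evaluates to its right-hand side, regardless of what the []= method itself returns. A quick illustration:

```ruby
# An assignment expression evaluates to its right-hand side,
# even when the []= method returns something else entirely.
class DiscardingStore
  def []=(key, value)
    :ignored # this return value is discarded by the assignment expression
  end
end

result = (DiscardingStore.new[:a] = [1, 2, 3])
result # => [1, 2, 3], not :ignored
```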


Enumeration

Matrices are now enumerable. There are a variety of enumerators — namely each_with_indices (and each), each_stored_with_indices, and each_ordered_stored_with_indices. The first, each, is guaranteed to produce the same iteration regardless of the storage type. The other two iterate only across the stored entries.

Elementwise comparisons

A regular matrix comparison, returning a single boolean, can be accomplished using == or !=. In earlier versions of NMatrix, the results of the element-wise versions, =~, !~, >, <, >=, and <=, were matrices of 0s and 1s, stored using the :byte dtype.

Now, these comparisons return matrices of Ruby objects:

n < m
=> [ true,  false, false, true,
     false, false, false, true,
     true,  true,  true,  true  ]

Try experimenting with sparse matrices to see how the default value (#default_value) is initialized on the result during an elementwise comparison.

Chunky bacon

I’m really excited about all of the chunky bacon in our latest release. I feel like things are really coming together for our library. I’m also glad to see that people are using it.

If you’re thinking of using NMatrix yourself, I strongly encourage it. Although I’m writing my dissertation, I plan to prioritize troubleshooting ahead of just about everything else. I want using NMatrix to be an easy choice for every Rubyist.

Thanks for reading!