Support for categorical data is important for any data analysis
tool. This summer I implemented categorical data capabilities for:
Convenient and efficient data wrangling for categorical data in Daru
Visualization of categorical data
Multiple linear regression and generalized linear models (GLM) with categorical variables in Statsample and Statsample-GLM
Lets talk about each of them in detail.
Analyzing catgorical data with Daru
Categorical data is now readily recognized by
Daru and Daru has all the necessary
procedures for dealing with it.
To analyze categorical variable, simply turn the numerical vector to categorical and you are ready to go.
We will use, for demonstration purposes, animal shelter data taken
from the Kaggle Competition. It is
stored in shelter_data.
123456789
# Tell Daru which variables are categoricalshelter_data.to_category'OutcomeType','AnimalType','SexuponOutcome','Breed','Color'# Or quantize a numerical variable to categoricalshelter_data['AgeuponOutcome']=shelter_data['AgeuponOutcome(Weeks)'].cut[0,1,4,52,260,1500],labels:[:less_than_week,:less_than_month,:less_than_year,:one_to_five_years,:more_than__five_years]# Do your operations on categorical datashelter_data['AgeuponOutcome'].frequencies.sortascending:false
123456789
small['Breed'].categories.size#=> 1380# Merge infrequent categories to make data analysis easyother_cats=shelter_data['Breed'].categories.select{|i|shelter_data['Breed'].count(i)<10}other_cats_hash=other_cats.zip(['other']*other_cats.size).to_hshelter_data['Breed'].rename_categoriesother_cats_hashshelter_data['Breed'].frequencies# View the datasmall['Breed'].frequencies.sort(ascending:false).head(10)
With the help of Nyaplot, GnuplotRB and Gruff, Daru now provides ability to visualize categorical data as it does with numerical data.
To plot a vector with Nyaplot one needs to call the function #plot.
1234567
# dv is a caetgorical vectordv=Daru::Vector.new['III']*10+['II']*5+['I']*5,type::category,categories:['I','II','III']dv.plot(type::bar,method::fraction)do|p,d|p.x_label'Categories'p.y_label'Fraction'end
Given a dataframe, one can plot the scatter plot such that the points
color, shape and size can be varied acording to a categorical
variable.
123456789101112
# df is a dataframe with categorical variable :cdf=Daru::DataFrame.new({a:[1,2,4,-2,5,23,0],b:[3,1,3,-6,2,1,0],c:['I','II','I','III','I','III','II']})df.to_category:cdf.plot(type::scatter,x::a,y::b,categorized:{by::c,method::color})do|p,d|p.xrange[-10,10]p.yrange[-10,10]end
In a similar manner Gnuplot and Gruff also support plotting of categorical variables.
An additional work I did was to add Gruff with Daru. Now one can plot
vectors and dataframes also using Gruff.
See more notebooks on visualizing categorical data with Daru
here.
Regression with categorical data
Now categorical data is supported in multiple linear regression and
generalized linear models (GLM) in
Statsample and
Statsample-GLM.
A new formula language (like that used in R or
Patsy) has been introduced to ease
the task of specifying regressions.
Now there’s no need to manually create a dataframe for regression.
So this post comes rather late, but I’d still like to talk about what I did in the first week, and a little into the second week.
Firstly, thanks to a pull request offered by my GSoC mentor John (@mohawkjohn), I now have rake tasks set up properly in the repository and rake seems to be working.
There are installation instructions in the readme and there are already some sample Kernel files in
the spec folder to try out various routines on. Apart from cloning the repository, one additional thing you will need to do is download the SPICE toolkit. You may want to keep the entire compressed file for later, but you’ll only need the cspice.a file in the lib/ subdirectory. After this follow the instructions in the readme and you should be good to go. (Be sure to run bundle install to install any dependencies that you don’t already have.)
After you’re done compiling and installing, run rake pry in the gem root directory.
If you remember, almost any useful task involving the SPICE Toolkit is preceded by loading data through kernel files. The relevant routine to do this is called furnsh_c() , and the most direct way to access it through Ruby is by calling the function SpiceRub::Native.furnsh. (However, this is not recommended because SpiceRub has a specific Ruby class unifying all the kernel related methods, and also because SPICE maintains its own internal variables for both tracking loaded kernel files and unloading them.)
That’s the basic design of the wrapper, so here are a few simple examples on using the Kernel API. => denotes the interpreted output of pry.
First of all, the main KernelPool class is a singleton class, that means it can only be instantiated with the #instance method and the usual #new is private.
Any subsequent calls to #instance will produce the same object.
123456789101112
kernel_pool=SpiceRub::KernelPool.instance=>#<SpiceRub::KernelPool:0x00000002db3218>rogue_kernel_pool=SpiceRub::KernelPool.newNoMethodError:privatemethod`new' called for SpiceRub::KernelPool:Classfrom (pry):2:in `__pry__'rogue_kernel_pool=SpiceRub::KernelPool.instance=>#<SpiceRub::KernelPool:0x00000002db3218>rogue_kernel_pool==kernel_pool=>true
This is to make sure there is only one KernelPool state being maintained at a time. Now we have a bunch of kernel files in the spec/data/kernels folder which we’ll use for this example. There is an accessor attribute called @path which can be used to point to a particular folder.
1
kernel_pool.path='spec/data/kernels'
From here on it’s required that you’re running pry or irb from the gem root folder in order for the paths to match in these examples.
Now let’s load a couple of kernel files, you can type system("ls", kernel_pool.path) into your console to get a list of all the test kernels available in that folder. The KernelPool object has a #load method to load kernel files. If the variable path is set, then you only need to enter the file name, otherwise the entire path needs to be provided. An integer denoting the index of the kernel in the pool is returned if the load is successful.
12
kernel_pool.load("naif0011.tls")=>0
Note that this is the same as providing the full relative path of spec/data/kernels/naif0011.tls when the path variable is not set or nil.
So now if you try to view the @pool member of kernel_pool, you’ll find three SpiceKernel objects with @loaded=true and their respective file paths.
There isn’t much to do with a SpiceKernel object except unload it, or check its state. Note that you can only load kernel files into a KernelPool object and unload them via the SpiceKernel object.
You can access the kernel pool by calling the #[] operator and using the index that was returned on load:
So here we unload the first kernel and note that the count drops to 2. If you look up kernel_pool[0], you’ll find that the kernel is still present — but its @loaded variable has been set to false, which means that it has been removed from the CSPICE internal kernel pool.
And that about wraps up this blog post on basic kernel handling. Since we know how to load data but not use it yet, I’ll cover that and the various kernel types in the next blog post. Thank you for reading.
One of the main goals of this project was to make the ephemerides
functions easily available to an end user. Ephemerides basically
refers to an object in space whose motion is being tracked and
observed. SPICE required you to provide the following parameters to
observe a body in space.
Target: The body of interest
Frame: A rotational frame of reference (Default is J2000 [Not to be confused with the J2000 epoch])
Observer: An observing body whose viewpoint is used to chart the vector
Epoch : An epoch in Ephemeris Time
SPICE has an integer-key convention for the kind of bodies that it
has support for. Each body can be referenced via a string or an
integer id. While there isn’t an actual strict range for integer ID
classification, it is mentioned here and can be summed up
in the following if and elsif clauses. (In Ruby constant strings
are better off as symbols, so the constructor takes either an integer
ID of a string symbol)
It was very tempting to involve inheritance and extend a base Body
class onto these potential classes, but I simply did not see the need
for it at this point. The way it is at the moment, the Body object
has a reader attribute type that stores some metadata about the body
for the user’s convenience. Perhaps as coverage of SPICE improves,
this minor thing can be changed later on.
To create a Body object, you instantiate with either a body name or a
body id. Certain bodies such as instruments will require additional
kernels to be loaded. To proceed seamlessly, load a leap seconds
kernel, a planetary constants kernel, and an ephemeris kernel. (All
avaialable in spec/data/kernels)
399 and :earth map to the same body in SPICE data. The frame of
reference can also be specified as a named parameter during
instantiation to set a custom default frame for that particular
object.
The origin as seen from itself is still the origin, so this makes
sense. The methods #velocity_at and #state_at take an identical
set of parameters. While there is a bit of redundancy going on,
splitting them makes the API more elegant, but the basic relationship
between these three vectors is the following :-
One thing to note is that state/velocity/position vectors will always
be returned as an NMatrix object, SciRuby’s numerical matrix core,
to allow for calculations via the NMatrix API.
As an example that is used in the code, one line can turn a position
vector into distance from origin (here using Euclidean distance):
The unit of distance here is kilometers, so the speed of light by this
measurement is about pretty close to the textbook figure of 3e+8 m/s.
There is also a function to check if a list of bodies are within a
radial proximity from an observing body. We already calculated the
distance of the moon to be about 367,000 km. The function
within_proximity returns a list of all bodies that are within the
specified radial distance from the calling body object.
12345678
#assuming venus and mercury are instantiatedearth.within_proximity([moon,venus,mercury],400000,now)=>[#<SpiceRub::Body:0x0000000191c4f8@code=301,@frame=:J2000,@name=:moon,@type=:satellite>]
Now that we’ve come to the end of the functionality, I would like to
mention that there is another named argument aberration_correction
which is basically an error reduction method to provide a more
accurate result than the default observation. The default :none
option for aberration correction basically provides the geometric
observations without any corrections for reception or transmission of
photons. For a list of various aberration correction methods
available, have a look at the documentation for spkpos_c to
find out if you need an aberration correction on SPICE data.
If you want to look at it another way, no aberration correction would
give you the textbook response of rigid geometry, while introducing an
aberration correction would give you a somewhat more realistic output
accounting for the errors that do happen when these observations are
made.
Finally, if you need to generate a continuous time series for a body,
then SpiceRub::Time has two functions to aid in that:
In this case, I took a start time and an end time that was one day
after and requested 4 linearly spaced epochs. This is basically an
interface to NMatrix.linspace.
The other function requires you to input a start time and an end time
and a step size that keeps getting added to the start time till the
end time is reached. As a contrived example, we’ll take two epochs,
five days apart and ask for a step size of a day, expecting six
elements.
And that’s it for this blog post. I would appreciate any feedback
regarding this as I’ve been juggling the design back and forth very
frequently. There is large potential of expansion of the Body class,
particularly creating new classes as when different Bobdy objects
would have a corresponding function. (For example, the function
getfov_c which returns the field of view of an instrument could be
an instance function attached to the Instrument subclass of Body,
but this is just potential expansions in the future.)
Many popular programming languages these days ship with powerful Time
classes/interfaces to make the inevitable task of dealing with
time-related computations a more DRY experience. Humans like their
time representations to be stringy concatenations of various numbers
and alphabets that enable a large variety of date and time formats to
mean the same thing and still make instant sense to the human eye. It
was the creation of time zones that made our convenient 24-hour clocks
stay the way they are without experiencing day and night at the same
“time” across the globe. So now 00:00 A.M. UTC is 5:30 A.M. IST
for me, and the world remains sane.
But what happens when you accept the fact that you’re just a speck of
micro-dust adjusting time relatively for an only slightly bigger speck
of dust floating in the universe? Twenty-four hours in a day and thus we reset
after 2300, but consider: how would a resident of Venus know
when tea-time is on Venus if he had an Earth wristwatch that reset
after twenty-four hours? Barely a tenth of Venus’ day is complete in that time!
(If you know anybody intent on relocating to Mars, do not gift them a
clock or watch.)
So a decimal floating point representation must be the answer for
uniformity. Time zones can be dealt with; we’ll just pick a convenient
point in time and count the seconds from there onwards so that the
location on Earth doesn’t matter henceforth. It’ll drive humans insane
with the arithmetic but machines will work just fine with this. This
sort of a time system is called epoch time.
And so the internal time of most UNIX machines is the number of
seconds after midnight on Thursday, 1 January 1970 UTC. (And this
very convention is going to open a can of worms by
2038 if there is even a small set of critical machines
that haven’t moved on from 32-bit architectures.)
But we’re still not okay universally. Try going on an infinite journey
to space and you’ll find that counting seconds leads to some
inconsistencies with your local time when you try to synchronize with
Earth. How can the number of seconds after January 1970 be different
in any case? Well, your MacBook Pro has not been adjusted for
… relativity! Gravity bends light and thus the perception of
time. There’s a lot more mass, and thus a lot more gravitatonal fields
in neighborhoods away from Earth. The exact details of how this works
is beyond the scope of this blog post.
If the past few paragraphs were incessant and seemingly irrelevant,
they were there to drive home the point that Earth time simply will
not do when we step out of the ghetto to see what’s happening. But
astronomy’s been around for way longer, and astrophysicists came
forth with a time system adjusted for the relativity effects of the
solar system, called Barycentric Dynamical Time, or TDB. Like our
machines, it counts the seconds after a certain known reference time
point, except that it adjusts for relativity and can become a standard
for astronomical time.
There are many similar time scales like this, but SPICE has chosen to
use TDB as the standard for most of their design. Within the SPICE
API, TDB is the same as Ephemeris Time which is the main system used
to specify time points of astronomical bodies. Even though spacecrafts
have their clocks coordinated with UTC on Earth, you would require
that time in Ephemeris Time to be able to calculate their positions and
velocities using SPICE. SpiceRub::Time is built for this very purpose,
to revolve around a main attribute @et for Ephemeris Time and
provide many methods to convert to and from.
If you’d like to proceed with the examples, you’ll need a Leap Second
Kernel file to use SpiceRub::Time. This is a generic kernel, so you
can easily use naif0011.tls in spec/data/kernels of the repository
folder.
So Ephemeris Time is the number of seconds elapsed after Noon, January 1, 2000, TDB. This point in time is also known as the J2000
epoch. We find that out in an instant by using the Time.parse
function which is a wrapper function for SPICE’s str2et_c that
converts many formats of strings to Ephemeris Time. You can have a
look at the various string formats supported in its documentation
here
12
SpiceRub::Time.parse("12:00 Jan 1 2000 TDB")=>#<SpiceRub::Time:0x0000000325b1f8 @et=0.0>
So as a base case, using the reference epoch gives us 0 seconds as we would expect. Now would also be a good time to find out the discrepancy in UTC as well.
12
SpiceRub::Time.parse("12:00 Jan 1 2000 UTC")=>#<SpiceRub::Time:0x000000031c28e0 @et=64.18392728473108>
So right away we know that UTC was 64-ish seconds off from TDB / ET at the time of the reference J2000 epoch. What would the difference be around right now?
Well, here’s a surprise, it’s 68.18 now. Before I explain why that is,
here is a brief overview of what the above code does:
Time.now is a convenient way to specify your current UTC
timezone. It uses Ruby’s core Time.now method so this method is only
good if you’re working in UTC or Earth like Timezones. For a similar
purpose, the function Time.from_time let’s you create a SpiceRub
Time object from a Ruby Time object.
The +/- operators return a new Time object where the right operand
is added/subtracted to the left operand’s @et when it is a float or
integer. If a Time object is supplied , then it does the same with the
right operand’s ephemeris time instead. (Note that there really isn’t
a significant meaning to having a Time object whose @et is the
difference/sum of two other epochs, however you can increase a certain
epoch or decrease it by a constant offset of seconds)
In our case we used #to_utc to convert from ephemeris time to UTC,
and then the minus operator gave us a Time object that wasn’t really
an epoch, but a difference of two epochs, so using #to_f got us
exactly that.
It appears that UTC has changed by 4 seconds since 2000 with respect
to ephemeris time. This is actually the adjustment of “leap seconds”
that gets added to UTC to prevent it from falling too far behind other
time systems. (Humans really like to hack everything, don’t they?)
To verify this yourself, if you open up the kernel naif0011.tls in your
text editor and search for DELTET/DELTA_AT, you’ll find a list like
representation of the following sort :-
Here you can see that just before the year 2000, there were 32 leap
seconds added to UTC, and in 2015 when the last leap second was added,
there were 36. It’s an ongoing and indefinite process and so there
really is no way to account for leap second errors far in the future
for leap seconds that are yet to be added. As of now, the next
scheduled addition is in December, 2016.
Coming back to our Time object, let’s look at its basic
construction. One tricky task in the API was the option to specify
different epochs of reference in different time scales, like
International Atomic Time. As of now, Time.new requires that you
have kept your word of using the J2000 epoch and allows you to use a
named parameter seconds: for specifying the time scale. The use of
scale as a key was avoided as it sometimes is also used to refer to the
reference epoch used.
:tai here refers to International Atomic Time. For a list of more
parameters and their keyword abbreviations, have a look at
this SPICE documentation for the function that the
conversion is wrapped on top of.
But there is also a way to reference other epochs without doing the
manual conversions yourself, you can call the class method Time.at
to perform the same function as the constructor, with the option of a
different reference epoch. The resultant Time object will however have
its internal time referring to J2000.
A more readable way would involve step by step calculations, but that
would consume runtime resources everytime Time.at is called, so I’ve
basically pre-calculated the ephemeris times of the reference epochs
and subtracted them from the epoch.
To quickly verify the last one with the #to_s method:
12
SpiceRub::Time.new(-946727958.8160644).to_s=>"Thu Jan 01 00:00:00 UTC 1970"
It’s exactly the UNIX epoch! Let’s try out 1 day (86400 seconds) after
this epoch:
12
SpiceRub::Time.at(86400,:unix).to_s=>"Thu Jan 01 23:59:59 UTC 1970"
Just a second short of heading into the next day, because we’ve added
86400 TDB seconds and converted the time into a UTC string.
There are some more functions provided to work in tandem with the
Body class that I’ll explain more about in the next blog post, but
this more or less covers the core of SpiceRub:Time. Till then,
thanks for reading.
I worked on “Port NMatrix to JRuby” in the context of the Google
Summer of Code (GSoC) 2016 and I am pleased to announce that NMatrix
can now be used in JRuby.
With JRuby NMatrix, a linear algebra library, wraps Apache Commons Math for its most
basic functionalities. NMatrix supports dense matrices containing
either doubles or Ruby objects as the data type. The performance of JRuby
with Apache Commons Maths is quite satisfactory (see below for performance
comparisons) even without making use of JRuby threading capabilities.
I have also ported the mixed_models gem, which
uses NMatrix heavily at its core, to JRuby. This gem allowed us
to test NMatrix-JRuby with real-life data.
This blog post summarizes my work on the project with SciRuby, and
reports the final status of the project.
The original GSoC proposal, plan and application can be found here. Until merging
is complete, commits are available here.
Performance
I have benchmarked some of the NMatrix functionalities. The following
plots compare the performance between NMatrix-JRuby, NMatrix-MRI, and
NMatrix-MRI using LAPACK/ATLAS libraries. (Note: MRI refers to the
reference implementation of Ruby, for those who are new.)
Notes:
LAPACK and ATLAS aren’t involved in most element-wise operations, such as addition and subtraction.
NMatrix-MRI relies on LAPACK/ATLAS for calculating determinants and LU Decomposition (lud).
Result
For two-dimensional matrices, NMatrix-JRuby is currently slower than NMatrix-MRI for matrix multiplication and matrix decomposition functionalities (calculating determinant and factoring a matrix). NMatrix-JRuby is faster than NMatrix-MRI for other functionalities of a two-dimensional matrix — like addition, subtraction, trigonometric operations, etc.
NMatrix-JRuby is a clear winner when we are working with matrices of arbitrary dimensions.
Implementation
Storing n-dimensional matrices as one-dimensional arrays
The major components of an NMatrix are shape, elements, dtype and
stype. When initialized, the dense type stores the elements as a one-dimensional
array; in the JRuby port, the ArrayRealVector class is used to store
the elements.
@s stores elements, @shape stores the shape of the matrix, while
@dtype and @stype store the data type and storage type
respectively. Currently, I have nmatrix-jruby implemented only for
:float64 (double) and Ruby :object data types.
NMatrix-MRI uses struct as a type to store dim, shape, offset, count, src
of an NMatrix. ALLOC and xfree are used to wrap the NMatrix attributes to C structs
and release the unrequired memory.
Slicing and Rank
Implementing slicing was the toughest part of NMatrix-JRuby
implementation. NMatrix@s stores the elements of a matrix as a
one-dimensional array. The elements along any dimension are accessed with the
help of the stride. NMatrix#get_stride calculates the stride with
the help of the dimension and shape and returns an Array.
NMatrix#[] and NMatrix#[]= are thus able to read and write the
elements of a matrix. NMatrix#MRI uses the @s object which stores
the stride when the nmatrix is initialized.
NMatrix#[] calls the #xslice operator which calls #get_slice
operator that use the stride to determine whether we are accessing a
single element or multiple elements. If there are multiple elements,
#dense_storage_get returns an NMatrix object with the elements along
the dimension.
NMatrix-MRI differs from NMatrix-JRuby implementation as it makes sure
that memory is properly utilized as the memory needs to be properly
garbage collected.
1234567891011121314151617181920212223
defxslice(args)result=nilifself.dim<args.lengthraise(ArgumentError,"wrong number of arguments\ (#{args} for #{effective_dim(self)})")elseresult=Array.new()slice=get_slice(@dim,args,@shape)stride=get_stride(self)ifslice[:single]if(@dtype==:object)result=@s[dense_storage_get(slice,stride)]elses=@s.toArray().to_aresult=@s.getEntry(dense_storage_get(slice,stride))endelseresult=dense_storage_get(slice,stride)endendreturnresultend
NMatrix#[]= calls the #dense_storage_set operator which calls
#get_slice operator that use the stride to find out whether we are
accessing a single element or multiple elements. If there are
multiple elements #set_slice recursively sets the elements of the
matrix then returns an NMatrix object with the elements along the
dimension.
All the relevant code for slicing can be found here.
Enumerators
NMatrix-MRI uses the C code for enumerating the elements of a
matrix. Just as with slicing, the NMatrix-JRuby uses pure Ruby code in
place of the C code. Currently, all the enumerators for dense matrices
with real data-type have been implemented and are properly
functional. Enumerators for objects have not yet been implemented.
12345678910111213141516171819202122232425262728
defeach_with_indicesnmatrix=create_dummy_nmatrixstride=get_stride(self)offset=0#Create indices and initialize them to zerocoords=Array.new(dim){0}shape_copy=Array.new(dim)(0...size).eachdo|k|dense_storage_coords(nmatrix,k,coords,stride,offset)slice_index=dense_storage_pos(coords,stride)ary=Array.newif(@dtype==:object)ary<<self.s[slice_index]elseary<<self.s.toArray.to_a[slice_index]end(0...dim).eachdo|p|ary<<coords[p]end# yield the array which now consists of the value and the indicesyield(ary)endifblock_given?nmatrix.s=@sreturnnmatrixend
Two-Dimensional Matrices
Linear algebra is mostly about two-dimensional matrices. In NMatrix,
when performing calculations in a two-dimensional matrix, a one-dimensional array
is converted to a two-dimensional matrix. A two-dimensional matrix is
stored in the JRuby implementation as a BlockRealMatrix or
Array2DRowRealMatrix. Each has its own advantages.
Memory Usage and Garbage Collection: A scientific library is memory intensive and hence, every step counts. The JRuby interpreter doesn’t need to dynamically guess the data type and uses less memory, typically around one-tenth of it. If the memory is properly utilized, when the GC kicks in, the GC has to clear less used memory space.
Speed: Using java method greatly improves the speed — by around 1000 times, when compared to using the Ruby method.
Operators
All the operators from NMatrix-MRI have been implemented except
modulus. The binary operators were easily implemented through Commons
Math API and Java Math API.
1234567891011121314151617
def+(other)result=create_dummy_nmatrixif(other.is_a?(NMatrix))#check dimensionraise(ShapeError,"Cannot add matrices with different dimension")\if(@dim!=other.dim)#check shape(0...dim).eachdo|i|raise(ShapeError,"Cannot add matrices with different shapes")\if(@shape[i]!=other.shape[i])endresult.s=@s.copy.add(other.s)elseresult.s=@s.copy.mapAddToSelf(other)endresultend
Unary Operators (Trigonometric, Exponentiation and Log operators) have been implemented using #mapToSelf method that takes a Univariate function as an argument. #mapToSelf maps every element of ArrayRealVector object to the Univariate function, that is passed to it and returns self object.
NMatrix-MRI relies on LAPACK and ATLAS for matrix decomposition and
solving functionalities. Apache Commons Math provides a different set
of API for decomposing a matrix and solving an equation. For example,
#potrf and other LAPACK specific functions have not been implemented
as they are not required at all.
Calculating determinant in NMatrix is tricky where a matrix is reduced
either to a lower or upper matrix and the diagonal elements of the
matrix are multiplied to get the result. Also, the correct sign of the
result (whether positive or negative) is taken into account while
calculating the determinant. However, NMatrix-JRuby uses Commons Math
API to calculate the determinant.
1234567
defdet_exactif(@dim!=2||@shape[0]!=@shape[1])raise(ShapeError,"matrices must be square to have a determinant defined")returnnilendto_return=LUDecomposition.new(self.twoDMat).getDeterminant()end
Given below is code that shows how Cholesky decomposition has been
implemented by using Commons Math API.
defsolve(b,opts={})raise(ShapeError,"Must be called on square matrix")\unlessself.dim==2&&self.shape[0]==self.shape[1]raise(ShapeError,"number of rows of b must equal number\ of cols of self")ifself.shape[1]!=b.shape[0]raise(ArgumentError,"only works with dense matrices")ifself.stype!=:denseraise(ArgumentError,"only works for non-integer, non-object dtypes")\ifinteger_dtype?orobject_dtype?orb.integer_dtype?orb.object_dtype?opts={form::general}.merge(opts)x=b.clonen=self.shape[0]nrhs=b.shape[1]nmatrix=create_dummy_nmatrixcaseopts[:form]when:general,:upper_tri,:upper_triangular,:lower_tri,:lower_triangular#LU solversolver=LUDecomposition.new(self.twoDMat).getSolvernmatrix.s=solver.solve(b.s)returnnmatrixwhen:pos_def,:positive_definitesolver=Choleskyecomposition.new(self.twoDMat).getSolvernmatrix.s=solver.solve(b.s)returnnmatrixelseraise(ArgumentError,"#{opts[:form]} is not a valid form option")endend
NMatrix#matrix_solve
Suppose we need to solve a system of linear equations:
AX = B
where A is an m×n matrix, B and X are n×p matrices, we need to solve this equation by iterating through B.
NMatrix-MRI implements this functionality using NMatrix::BLAS::cblas_trsm method. However, for NMatrix-JRuby, NMatrix#matrix_solve is the analogous method used.
12345678910111213141516171819
defmatrix_solverhsifrhs.shape[1]>1nmatrix=NMatrix.new:copynmatrix.shape=rhs.shaperes=[]#Solve a matrix and store the vectors in a matrix(0...rhs.shape[1]).eachdo|i|res<<self.solve(rhs.col(i)).s.toArray.to_aend#res is in col major formatresult=ArrayGenerator.getArrayColMajorDouble\res.to_java:double,rhs.shape[0],rhs.shape[1]nmatrix.s=ArrayRealVector.newresultreturnnmatrixelsereturnself.solverhsendend
Currently, Hessenberg transformation for NMatrix-JRuby has not been implemented.
Other dtypes
I have tried implementing float dtypes using FloatMatrix class
provide by jblas. jblas was used instead of Commons Math as the
latter uses Field Elements for Floats and it had some issues
with Reflection and Type Erasure.
However, using jblas resulted in errors due to precision.
Code Organisation and Deployment
To minimise conflict with the MRI codebase all the JRuby front end
code has been placed in the /lib/nmatrix/jruby
directory. lib/nmatrix/nmatrix.rb decides whether to load
nmatrix.so or nmatrix_jruby.rb after detecting the Ruby platform.
The added advantage is that the Ruby interpreter must not decide which
function to call at run-time. The impact on performance can be seen
when programs which intensively use NMatrix for linear algebraic
computations (e.g., mixed_models) are run.
Spec Report
After the port; this is the final report that summarizes the number of tests that successfully pass:
NMatrix
Spec file
Total Tests
Success
Failure
Pending
00_nmatrix_spec
188
139
43
6
01_enum_spec
17
8
09
0
02_slice_spec
144
116
28
0
03_nmatrix_monkeys_spec
12
11
01
0
elementwise_spec
38
21
17
0
homogeneous_spec.rb
07
06
01
0
math_spec
737
541
196
0
shortcuts_spec
81
57
24
0
stat_spec
72
40
32
0
slice_set_spec
6
2
04
0
Why do some tests fail?
Complex dtype has not been implemented.
Sparse matrices (list and yale) have not been implemented.
Decomposition methods that are specific to LAPACK and ATLAS have not been implemented.
Integer dtype is not properly assigned to floor, ceil, and round methods.
Mixed_Models
Spec file
Total Test
Success
Failure
Pending
Deviance_spec
04
04
0
0
LMM_spec
195
195
0
0
LMM_categorical_data_spec.rb
48
45
3
0
LMMFormula_spec.rb
05
05
0
0
LMM_interaction_effects_spec.rb
82
82
0
0
LMM_nested_effects_spec.rb
40
40
0
0
matrix_methods_spec.rb
52
52
0
0
ModelSpecification_spec.rb
07
07
0
0
NelderMeadWithConstraints_spec.rb
08
08
0
0
Future Work
NMatrix on JRuby offers comparable speeds to MRI. For specific
computations it will be possible to leverage the threading support of
JRuby and speed up things using multiple cores.
Adding new functionality to NMatrix-JRuby will be easy from
here. Personally, I am interested to add OpenCL support to leverage
the GPU computational capacity available on most machines today.
Conclusion
The main goal of this project was to to gain from the performance JRuby offers,
and bring a unified interface for linear algebra between MRI and JRuby.
By the end of GSoC, I have been able to successfully create a linear
algebra library, NMatrix for JRuby users, which they can easily run on
their machines — unless they want to use complex numbers, at
least for now.
I have mixed_models gem simultaneously ported to JRuby. Even here,
NMatrix-JRuby is very close to NMatrix-MRI, considering the performance .
Acknowledgements
I would like to express my sincere gratitude to my mentor Pjotr Prins
for the continuous support through the summers, and for his patience,
motivation, enthusiasm, and immense knowledge. I could not have
imagined having a better advisor and mentor, for this project.
I am very grateful to Google and the Ruby Science Foundation for this
golden opportunity.
I am very thankful to Charles Nutter, John Woods, Sameer Deshmukh,
Kenta Murata and Alexej Gossmann, who mentored me through the
project. It has been a great learning experience.
I thank my fellow GSoC participants Rajith, Lokesh and Gaurav who
helped me with certain aspects of my project.
So this post comes rather late, but I’d still like to talk about what I did in the first week, and a little into the second week.
Firstly, thanks to a pull request offered by my GSoC mentor John (@mohawkjohn), I now have rake tasks set up properly in the repository and rake seems to be working.
There are installation instructions in the ReadMe and there are already some sample Kernel files in
the spec folder to try out various routines on. Apart from cloning the repository, one additional thing you will need to do is download the SPICE toolkit from here. You may want to keep the entire compressed file for later, but you’ll only need the cspice.a file in the lib/ subdirectory. After this follow the instructions in the ReadMe and you should be good to go. Be sure to run bundle install to install any dependencies that you don’t already have.
After you’re done compiling/installing, run rake pry in the gem root directory. If you remember, almost any useful task involving the SPICE Toolkit is preceded by loading data through kernel files. The relevant routine to do this is called furnsh_c() , and the most direct way to access it through Ruby is by calling the function SpiceRub::Native.furnsh. This is not recommended because SpiceRub has a specific Ruby class unifying all the kernel related methods and also because SPICE maintains its own internal variables for tracking loaded kernel files / unloading them. Below is a crude dependency sequence :-
That’s the basic design of the wrapper, so here’s a few simple examples on using the Kernel API. => denotes the interpreted output of pry.
First of all, the main KernelPool class is a singleton class, that means it can only be instantiated with the #instance method and the usual #new is private.
Any subsequent calls to #instance will produce the same object.
123456789101112
kernel_pool=SpiceRub::KernelPool.instance=>#<SpiceRub::KernelPool:0x00000002db3218>rogue_kernel_pool=SpiceRub::KernelPool.newNoMethodError:privatemethod`new' called for SpiceRub::KernelPool:Classfrom (pry):2:in `__pry__'rogue_kernel_pool=SpiceRub::KernelPool.instance=>#<SpiceRub::KernelPool:0x00000002db3218>rogue_kernel_pool==kernel_pool=>true
This is to make sure there is only one KernelPool state being maintained at a time. Now we have a bunch of kernel files in the spec/data/kernels
folder which we’ll use for this example. There is an accessor attribute called @path which can be used to point to a particular folder.
1
kernel_pool.path='spec/data/kernels'
From here on it’s required that you’re running pry or irb from the gem root folder in order for the paths to match in these examples.
Now let’s load a couple of kernel files, you can type system("ls", kernel_pool.path) into your console to get a list of all the test kernels available
in that folder. The KernelPool object has a #load method to load kernel files. If the variable path is set, then you only need to enter the file name,
otherwise the entire path needs to be provided. An integer denoting the index of the kernel in the pool is returned if the load is successful.
12
kernel_pool.load("naif0011.tls")=>0
Note that this is the same as providing the full relative path of spec/data/kernels/naif0011.tls when the path variable is not set or nil.
So now if you try to view the @pool member of kernel_pool, you’ll find 3 SpiceKernel objects with @loaded=true and their respective file paths.
There isn’t much to do with a SpiceKernel object except unload it, or check its state. Note that you can only load kernel files into a KernelPool object and unload
them via the SpiceKernel object.
You can access the kernel pool by calling the #[] operator and using the index that was returned on load :-
So here we unload the first kernel and note that the count drops to 2. If you look up kernel_pool[0], you’ll find that the kernel
is still present but it’s @loaded variable has been set to false which means that it has been removed from the CSPICE internal kernel pool.
And that about wraps up this blog post on basic kernel handling. Since we know how to load data but not use it yet, I’ll cover that and the various kernel types
in the next blog post. Thank you for reading :)
Google Summer of Code 2015 is coming to an end. During this summer, I have learned too many things to list here about statistical modeling, Ruby and software development in general, and I had a lot of fun in the process!
Linear mixed models
My GSoC project is the Ruby gem mixed_models. Mixed models are statistical models which predict the value of a response variable as a result of fixed and random effects. The gem in its current version can be used to fit statistical linear mixed models and perform statistical inference on the model parameters as well as to predict future observations. A number of tutorials/examples in IRuby notebook format are accessible from the mixed_modelsgithub repository.
Linear mixed models are implemented in the class LMM. The constructor method LMM#initialize provides a flexible model specification interface, where an arbitrary covariance structure of the random effects terms can be passed as a Proc or a block.
A convenient user-friendly interface to the basic model fitting algorithm is LMM#from_formula, which uses the formula language of the R mixed models package lme4 for model specification. With the #from_formula method, the user can conveniently fit models with categorical predictor variables, interaction fixed or random effects, as well as multiple crossed or nested random effects, all with just one line of code.
Examples are given in the sections below.
Implementation
The parameter estimation in LMM#initialize is largely based on the approach developed by the authors of the R mixed models package lme4, which is delineated in the lme4vignette. I have tried to make the code of the model fitting algorithm in LMM#initialize easy to read, especially compared to the corresponding implementation in lme4.
The lme4 code is largely written in C++, which is integrated in R via the packages Rcpp and RcppEigen. It uses CHOLMOD code for various sparse matrix tricks, and it involves passing pointers to C++ object to R (and vice versa) many times, and passing different R environments from function to function. All this makes the lme4 code rather hard to read. Even Douglas Bates, the main developer of lme4, admits that “The end result is confusing (my fault entirely) and fragile”, because of all the utilized performance improvements. I have analyzed the lme4 code in three blog posts (part 1, part 2 and part 3) before starting to work on my gem mixed_models.
The method LMM#initialize is written in a more functional style, which makes the code shorter and (I find) easier to follow. All matrix calculations are performed using the gem nmatrix, which has a quite intuitive syntax and contributes to the overall code readability as well.
The Ruby gem loses with respect to memory consumption and speed in comparison to lme4, because it is written in pure Ruby and does not utilize any sparse matrix tricks. However, for the same reasons the mixed_models code is much shorter and easier to read than lme4. Moreover, the linear mixed model formulation in mixed_models is a little bit more general, because it does not assume that the random effects covariance matrix is sparse. More about the implementation of LMM#initialize can be found in this blog post.
Other existing tools
Popular existing software packages for mixed models include the R package lme4 (which is arguably the standard software for linear mixed models), the R package nlme (an older package developed by the same author as lme4, still widely used), Python’s statmodels, and the Julia package MixedModels.jl.
Below, I give a couple of examples illustrating some of the capabilities of mixed_models and explore how it compares to the alternatives.
A usage example and discussion
As an example, we use data from the UCI machine learning repository, which originate from blog posts from various sources in 2010-2012, in order to model (the logarithm of) the number of comments that a blog post receives. The linear predictors are the text length, the log-transform of the average number of comments at the hosting website, the average number of trackbacks at the hosting website, and the parent blog posts. We assume a random effect on the number of comments due to the day of the week on which the blog post was published. In mixed_models this model can be fit with
Interestingly, the estimates of the random effects coefficients and standard deviation are all zero!
That is, we have a singular fit. Thus, our results imply that the day of the week on which a blog post is published has no effect on the number of comments that the blog post will receive.
It is worth pointing out that such a model fit with a singular covariance matrix is problematic with the current version of Python’s statmodels (described as “numerically challenging” in the documentation) and the R package nlme (“Singular covariance matrices correspond to infinite parameter values”, a mailing list reply by Douglas Bates, the author of nlme). However, mixed_models, lme4 and MixedModels.jl can handle singular fits without problems.
In fact, like mixed_models above, lme4 estimates the random effects coefficients and standard deviation to be zero, as we can see from the following R output:
123456789101112131415161718
> mod <- lmer(log_comments ~ log_host_comments_avg + host_trackbacks_avg + length + has_parent_with_comments +(1|day), data = df)Warning message:Some predictor variables are on very different scales: consider rescaling > ranef(mod)$day (Intercept)fr 0mo 0sa 0su 0th 0tu 0we 0> VarCorr(mod) Groups Name Std.Dev. day (Intercept) 0.0000 Residual 1.2614
Unfortunately, mixed_models is rather slow when applied to such a large data set (blog_data is a data frame of size 22435×8), especially when compared to lme4 which uses many sparse matrix tricks and is mostly written in C++ (integrated in R via Rcpp) to speed up computation. The difference in performance between mixed_models and lme4 is on the order of hours for large data, and Julia’s MixedModels.jl promises to be even faster than lme4. However, there is no noticeable difference in performance speed for smaller data sets.
A second example and statistical inference on the parameter estimates
Often, the experimental design or the data suggests a linear mixed model whose random effects are associated with multiple grouping factors. A specification of multiple random effects terms which correspond to multiple grouping factors is often referred to as crossed random effect, or nested random effects if the corresponding grouping factors are nested in each other.
A good reference on such models is Chapter 2 of Douglas Bates’ lme4 book.
Like lme4, mixed_models is particularly well suited for models with crossed or nested random effects. The current release of statmodels, however, does not support crossed or nested random effects (according to the documentation).
As an example we fit a linear mixed model with nested random effects to a data frame with 100 rows, of the form:
We consider the factor b to be nested within the factor a.
We assume that the intercept varies due to variable a; that is, a different (random) intercept term for each level of a.
Moreover, we assume that the intercept varies due to the factor b which is nested in a; that is, different (random) intercept for each combination of levels of a and b.
That is, mathematically the model can be expressed as
1
y=beta_0+beta_1*x+gamma(a)+delta(a,b)+epsilon
where gamma(a) ~ N(0, phi**2) and delta(a,b) ~ N(0, psi**2) are normally distributed random variables which assume different realizations for different values of a and b, and where epsilon is a random Gaussian noise term with variance sigma**2. The goal is to estimate the parameters beta_0, beta_1, phi, psi and sigma.
We fit this model in mixed_models, and display the estimated random effects correlation structure with
123
mod=LMM.from_formula(formula:"y ~ x + (1|a) + (1|a:b)",data:df,reml:false)putsmod.ran_ef_summary.inspect
which produces the output
123
aa_and_ba1.34108300nila_and_bnil0.97697500
The correlation between the factor a and the nested random effect a_and_b is denoted as nil, because the random effects in the model at hand are assumed to be independent.
An advantage of mixed_models over some other tools is the simplicity with which p-values and confidence intervals for the parameter estimates can be calculated using a multitude of available methods. Such methods include a likelihood ratio test implementation, multiple bootstrap based methods (which run in parallel by default), and methods based on the Wald Z statistic.
We can compute five types of 95% confidence intervals for the fixed effects coefficients with the following line of code:
For example, we see here that the intercept term is likely not significantly different from zero. We could proceed now by performing hypotheses tests using #fix_ef_p or #likelihood_ratio_test, or by refitting a model without an intercept using #drop_fix_ef.
We can also test the nested random effect for significance, in order to decide whether we should drop that term from the model to reduce model complexity. We can use a bootstrap based version of likelihood ratio test as follows.
We get a p-value of 9.99e-4, suggesting that we probably should keep the term (1|a:b) in the model formula.
A third example — a less conventional model fit
Another advantage of mixed_models against comparable tools is the ease of fitting models with arbitrary covariance structures of the random effects, which are not covered by the formula interface of lme4. This can be done in a user-friendly manner by providing a block or a Proc to the LMM constructor. This unique feature of the Ruby language makes the implementation and usage of the method incredibly convenient. A danger of allowing for arbitrary covariance structures is, of course, that such a flexibility gives the user the freedom to specify degenerate and computationally unstable models.
As an example we look at an application to genetics, namely to SNP data (single-nucleotide polymorphism) with known pedigree structures (family relationships of the subjects). The family information is prior knowledge that we can model in the random effects of a linear mixed effects model.
We model the quantitative trait y (a vector of length 1200) as
1
y=X*beta+b+epsilon,
where X is a 1200 x 130 matrix containing the genotypes (i.e. 130 SNPs for each of the 1200 subjects); epsilon is a vector of independent random noise terms with variances equal to sigma**2; beta is a vector of unknown fixed effects coefficients measuring the contribution of each SNP to the quantitative trait y; and b is a vector of random effects.
If we denote the kinship matrix by K, then we can express the probability distribution of b as b ~ N(0, delta**2 * 2 * K), where we multiply K by 2 because the diagonal of K is constant 0.5, and where delta**2 is a unknown scaling factor.
The goal is to estimate the unknown parameters beta, sigma, and delta, and to determine which of the fixed effects coefficients are significantly different from 0 (i.e. which SNPs are possibly causing the variability in the trait y).
In order to specify the covariance structure of the random effects, we need to pass a block or Proc that produces the upper triangular Cholesky factor of the covariance matrix of the random effects from an input Array. In this example, that would be the multiplication of the prior known Cholesky factor of the kinship matrix with a scaling factor.
Having all the model matrices and vectors, we compute the Cholesky factor of the kinship matrix and fit the model with
12345678
# upper triangulat Cholesky factorkinship_mat_cholesky_factor=kinship_mat.factorize_cholesky[0]# Fit the modelmodel_fit=LMM.new(x:x,y:y,zt:z,x_col_names:x_names,start_point:[2.0],lower_bound:[0.0]){|th|kinship_mat_cholesky_factor*th[0]}
Then we can use the available hypotheses test and confidence interval methods to determine which SNPs are significant predictors of the quantitative trait. Out of the 130 SNPs in the model, we find 24 to be significant as linear predictors.
See this blog post for a full analysis of this data with mixed_models.
Room for improvement and future work
Writing the formula language interpretation code used by LMM#from_formula from scratch was not easy. Much of the code can be reorganized to be easier to read and to use in other projects. Possibly, the formula interface should be separated out, similar to how it is done with the Python package patsy. Also, some shortcut symbols (namely *, /, and ||) in the model specification formula language are currently not implemented.
I plan to add linear mixed models for high-dimensional data (i.e. more predictors than observations) to mixed_models, because that work would be in line with my current PhD research.
I plan to add generalized linear mixed models capabilities to mixed_models, which can be used to fit mixed models to discrete data (such as binary or count data).
Acknowledgement
I want to thank Google and the Ruby Science Foundation for giving me this excellent opportunity! I especially want to thank Pjotr Prins who was my mentor for the project for much helpful advice and suggestions as well as his prompt responses to any of my concerns. I also want to thank my fellow GSoC participants Will, Ivan, and Sameer for their help with certain aspects of my project.
My Google Summer of Code project was working on the NMatrix project, moving
functionality that depends on external libraries from the core nmatrix
gem to optional plugin gems. NMatrix is a Ruby library for linear algebra,
used by many other projects.
In addition to the code that was part of
NMatrix proper, NMatrix previously required the ATLAS library, which
implemented fast versions of common matrix operations like multiplication
and inversion, as well as more advanced operations like eigenvalue
decomposition and Cholesky decomposition.
There were two separate but related motivations for my project. The
first was to simplify the NMatrix installation
process. ATLAS can be difficult to install, so the installation
process for NMatrix was complicated, especially on
OS X, and may have discouraged people from using NMatrix.
The second motivation was that by separating out the ATLAS code from the
main NMatrix code, it would be easier to add new linear algebra backends
which provide similar features. Indeed, I implemented a second backend this
summer.
The end result of my summer’s work:
The core nmatrix gem does not depend on any external linear matrix
libraries. It provides non-optimized implementations of common matrix
operations.
All code that requires ATLAS has been moved into the new nmatrix-atlas
gem, so that
those who are only interested in the core functionality are not required to
install ATLAS. nmatrix-atlas provides optimized implementations of common matrix
operations, as well as advanced functions not available in nmatrix.
I wrote a blog post describing the setup for releasing multiple gems from the same repository, which this required.
A new gem nmatrix-lapacke, which provides the same features as
nmatrix-atlas, but instead of depending specifically on the ATLAS
library, requires any generic LAPACK and
BLAS
implementation. This should be easier to use for many users as they may
already have LAPACK installed (it comes pre-installed with OS X and is
commonly used in Linux systems), but not ATLAS.
The installation procedure is simplified, especially for those installing
just the nmatrix gem. Compare the new installation instructions
to the old ones.
The one deviation from my original proposal was that I originally intended to remove
all the ATLAS code and release only the nmatrix-lapacke plugin, so that we
would only have one interface to the advanced linear algebra functions, but I
decided to keep the ATLAS code, since the nmatrix-lapacke code is new and
has not had a chance to be thoroughly tested.
Usage
12345
require'nmatrix'# create a 3-by-3 matrixa=NMatrix.new([3,3],[1,2,3,4,5,6,7,8,9],dtype::float64)#invert it using non-optimized NMatrix-internal implementationa.invert!
123456
require'nmatrix'require'nmatrix/atlas'#or require 'nmatrix/lapacke'# create a 3-by-3 matrixa=NMatrix.new([3,3],[1,2,3,4,5,6,7,8,9],dtype::float64)#invert it using optimized implementation provided by ATLASa.invert!
For advanced functions not provided by the core nmatrix gem, for example
gesvd, nmatrix-atlas and nmatrix-lapacke
provide a common interface:
#Identical to the above, except for the requirerequire'nmatrix'require'nmatrix/lapacke'a=NMatrix.new([4,5],[1,0,0,0,2,0,0,3,0,0,0,0,0,0,0,0,4,0,0,0],dtype:dtype)u,s,vt=a.gesvd
If the developer wants to use an advanced feature, but does not care
whether the user is using nmatrix-atlas
or nmatrix-lapacke, they can require nmatrix/lapack_plugin, which will
require whichever of the two is available, instead of being forced to
choose between the two.
As a fun test of the new gems, I did a very simple benchmark, just
testing how long it took to invert a
1500-by-1500 matrix in place using NMatix#invert!:
nmatrix (no external libraries): 3.67s
nmatrix-atlas: 0.96s
nmatrix-lapacke with ATLAS: 0.99s
nmatrix-lapacke with OpenBLAS (multithreading enabled): 0.39s
nmatrix-lapacke with reference implementations of LAPACK and BLAS: 3.72s
This is not supposed to be a thorough or realistic benchmark (performance will
depend on your system, on how you built the libraries, and on the exact
functions that you use), but there
are still a few interesting conclusions we can draw from it:
Performance is much better using the two highly optimized libraries
(ATLAS and OpenBLAS) than using either the NMatrix
internal implementation or the reference implementation.
When using ATLAS, performance is similar whether using nmatrix-atlas
and nmatrix-lapacke (this means we could consider deprecating
the nmatix-atlas gem).
Overall, my summer has been productive. I implemented everything that I
proposed and feedback from testers so far has been positive.
I plan to stay involved with NMatrix, especially to follow up on any issues
related to my changes.
Although I won’t be a student next summer, I would certainly consider
participating in Google Summer of Code in the future as a mentor.
I’d like to
thank my mentor John Woods and the rest of the SciRuby community for support
and feedback throughout the summer.
This summer I’ve been participating in Google Summer of Code
with GnuplotRB project (plotting tool for Ruby users based on Gnuplot)
for SciRuby. GSoC is almost over and I’m releasing v0.3.1 of GnuplotRB as a gem.
In this blog post I want to introduce the gem and highlight some of its capabilities.
Features
There are several existing plotting tools for Ruby such as Nyaplot, Plotrb, Rubyvis
and Gnuplot gem. However they are not designed for large datasets and have fewer
plotting styles and options than Gnuplot. Gnuplot gem was developed long ago and nowadays consists
mostly of hacks and does not support modern Gnuplot features such as multiplot.
Therefore my goal was to develop new gem for Gnuplot which would allow full use of its
features in Ruby. I was inspired to build an easy-to-use interface for the most commonly
used features of Gnuplot and allow users to customize their plots with
Gnuplot options as easily as possible in Rubyesque way.
I am sure you have heard about Wolfram Alpha. Some of you would have used its Mathematics section at some point of time to cross check your solution to a maths problem. If you haven’t, please do. A computer algebra system (CAS) does the same thing. Algebraic computation or symbolic computation are all used interchangeably. A CAS solves problems the same way a human does, but way more quickly and precisely. There is very less chance for an error, that we humans often make.
Introduction
My project was to write the Ruby extensions for the library SymEngine and come up with a Ruby-ish interface, after which we can use the features of SymEngine from Ruby.
SymEngine is a library for symbolic computation in C++. You may ask, why SymEngine? There are other CASs that I know of. This question was indeed asked. At the beginning, the idea was to use ruby wrappers for sage (a mathematics software system) which uses Pynac, an interface to GiNaC (another CAS). As it turns out from the benchmarks, SymEngine is much faster than Pynac. What about directly wrapping GiNaC? SymEngine is also a bit faster than GiNaC.
The motivation for SymEngine itself is to develop it once in C++ and then use it from other languages rather than doing the same thing all over again for each language that it is required in. In particular, a long term goal is to make Sage as well as SymPy use it by default, thus unifying the Python CAS communities. The goal of implementing the Ruby wrappers is to provide a CAS for the Ruby community.
How will this be useful?
There are times when we might need a symbolic computation library. Here is an incomplete list of some of the situations:
We need to do some algebra to get systems of equations in suitable form for numerical computation.
We need to make substitutions for some variables and don’t want to risk a math error by hand. There might be times when we want to partially simplify an expression by substituting only a few of its variables.
We have a situation where we need to find the optimum number of something to maximise our profits, and the mathematical model devised is just too complicated to solve by hand.
We need to perform non-trivial derivatives or integrals.
We are trying to understand the domain of a new function.
Most root finding algorithms search for a single root, whereas often there are more than one.
We need to solve a linear system or manipulate a symbolic matrix exactly.
We need to manipulate exact expressions and solutions, as opposed to approximate ones using a numerical method.
With that said, a symbolic manipulation library is indispensable for scientists and students. Ruby has gained a great deal of popularity over the years, and a symbolic manipulation library gem like this project in Ruby might prove to be the foundation for a computer algebra system in Ruby. With many efforts like these, Ruby might become the first choice for academicians given how easy it is to code your logic in Ruby.
How to install the gem?
To install, please follow the compile instructions given in the README. After you are done, I would suggest to test the extensions. To run the test suite execute rspec spec on the command line, from the symengine/ruby dir.
The gem is still in alpha release. Please help us out by reporting any issues in the repo issue tracker.
What can I do with the gem?
Currently, the following features are available in the gem:
- Construct expressions out of variables (mathematical).
- Simplify the expressions.
- Carry out arithmetic operations like +, -, *, /, ** with the variables and expressions.
- Extract arguments or variables from an expression.
- Differentiate an expression with respect to another.
- Substitute variables with other expressions.
Features that will soon be ported to the SymEngine gem
- Functions, including trigonometric, hyperbolic and some special functions.
- Matrices, and their operations.
- Basic number-theoretic functions.
I have developed a few IRuby notebooks that demonstrate the use of the new SymEngine module in ruby.
Below is an example taken from the notebooks.
Using the SymEngine Gem
SymEngine is a module in the extensions, and the classes are a part of it. So first you fire up the interpreter or an IRuby notebook and load the file:
This shows that we have successfully loaded the module.
SymEngine::Symbol
Just like there are variables like x, y, and z in a mathematical expression or equation, we have SymEngine::Symbol in SymEngine to represent them. To use a variable, first we need to make a SymEngine::Symbol object with the string we are going to represent the variable with.:
Now that we have an expression, we would like to see it’s expanded form using #expand:
123
f=e.expand()f.to_s=>"x**(1 + y)/z - x**y*y/z"
Or check if two expressions are same:
12
f==-(x**y*y/z)+(x**y*x/z)=>true
But e and f are not equal since they are only mathematically equal, not structurally:
12
e==f=>false
Let us suppose you want to know what variables/symbols your expression has. You can do that with the #free_symbols method, which returns a set of the symbols that are in the expression.:
You can make objects of class SymEngine::Integer. It’s like regular Integer in ruby kernel, except it can do all the operations a Basic object can — such as arithmetic operations, etc.:
Like any other Basic object arithmetic operations can be done on this rational type too.:
12
(a-d).to_s=>"34/3"
You need not create an instance of SymEngine::Integer or SymEngine::Rational, every time you want to use them in an expression that uses many Integers. Let us say you already have Integer/Rational object. Even then you can use them without having to create a new SymEngine object.:
123
k=(1/(x*y)-x*y+2)*(c+x*y)# c is a Rational object, not SymEngine::Rationalk.to_s=>"(2/3 + x*y)*(2 + 1/(x*y) - x*y)"
As you can see, ruby kernel Integers and Rationals interoperate seamlessly with the SymEngine objects.
In the rest of the post, I would like to summarise my work and what I learned as a participant of Google Summer of Code 2015.
Pre-midterm Evaluations
I am a newbie when it comes to Ruby, and it took me a while to setup the gem and configure files for the building of extensions.
The struggle between shared, static and dynamic libraries
I faced a lot of problem in the early stages, when I was trying to build the extensions. Ondrej, my mentor, and Isuru, a fellow GSoC student, helped me a lot. There were many C flags that were being reported as missing. Some flags cmake added by default but extconf.rb didn’t, the same one that was required to be added to build it as a shared library. I am still confused about the details, some of which are explored in greater detail in my personal blog. Finally, the library had to be built as a dynamic one. The problem of missing C flags was resolved later by hooking the process to cmake rather than mkmf.
Load Errors and problems in linking
Many LoadErrors popped up, but were eventually solved. Ivan helped a lot in debugging the errors. In the end, it turned out to be a simple file missing in the gemspec, that was not being installed.
Reconfiguring building
One of our aims during developing this was to get rid of unessential dependencies. The ones we already had the tools for. Like later the file extconf.rb, that is used to generate Makefile for the extension was removed, because that could also be done by cmake. Flags were added to cmake for building the Ruby extensions, like the flag -DWITH_RUBY=yes. The Makefile then generates the library symengine.so in the directory lib/symengine.Along with extconf.rb, the file extconf.h was also gone. Along these lines, the dependency on rake was also removed, and with that the Rakefile. Any task automation will most probably be done in python. So, the Rake::ExtensionTask was done by cmake and the Rake::GemPackageTask was replaced by the manual method of gem build symengine.gemspec and gem install symengine-0.0.0.gem
Travis setup
Not many projects have travis-ci setup for multiple languages. Not even the tutorials had clearly mentioned about setting up for multiple languages. But I did know about one of them, which is Shogun, the machine-learning toolbox. I referred to their .travis.yml and setup it up. If something like this wouldn’t have worked the plan was to manually install the required version of ruby and then execute the shell commands.
Making a basic object
Finally, I was able to successfully build the extensions, link the extensions with the SymEngine library, load the ruby-extension library in the interpreter and successfully instantiate an object of type Basic.
Inheritance and Symbols
At this time, the way inheritance works(like the sequence of formation and destruction of objects of a class that had a superclass) with the Ruby C API, was confusing for all of us. I designed an experiment
to check what was actually happening. That cleared things out, and made the it easier to wrap things from now on. I also wrapped the Symbol class during the course.
Post-midterm Evaluations
Redesign of the C interface
We had to design an ugly function to wrap vector in C. That led us to redesign the C interface. This approach had no reinterpret casting that was being done earlier. Each data structure had a type that was determined at compile time. For C, it was an opaque structure, while for C++ the opaque structure declared in the shared header file was implemented in the source file that had C++ data types. This blog post explains it further.
Integer and Rational
While trying to port the SymEngine classes, Integer and Rational, I had to port many methods in Basic before that. I also replicated the rake tasks in NMatrix, for detection of memory leaks, in form of bash scripts.
Common enumeration
Since all objects in the Ruby C API are of the type CBasic, we needed a function that would give us the typename during the runtime for the corresponding objects to be wrapped in ruby, as an object of the correct Class in ruby. Since, this was achieved with enum in C++, the same thing could be done in C too, with all the classes written manually again. But there was no guarantee for this to be consistent, if ever the features required to be wrapped for a new language, and also manually adding the class in all the enum list everytime a new class is added was prone to errors. So, to make this DRY, we automated this by sharing the list of enums. More details for the implementation can be found here.
Class coercion and interoperability
To support interoperability with the builtin ruby types, I had to overload the methods in builtin classes earlier(this was not continued). Overriding all the existing binary operations for a ruby class to support SymEngine types, violated the open/closed principle. There was indeed another way, which is ‘Class Coercion’. It was suggested by Isuru. After that, SymEngine types could seamlessly interoperate between the ruby types.
Arithmetic operations
After this, all the arithmetic operations had been successfully ported. Each Basic object can now perform arithmetic operations with other Basic object(sometimes even ruby objects like Integer). The test file in python, that had all the corresponding test cases was ported to its RSpec counterpart.
Substitutions
Recently I completed porting the substitutions module to the extensions(#subs). This feature has added a lot of convenience as now you can substitute a SymEngine::Symbol with some other value in an expression and then #expand to get the result.
Trigonometric functions
Currently, I am working on porting the trigonometric functions in SymEngine to the extensions. This would first require to wrap the Function class and then the TrigFunction class in SymEngine.
Integration of other Ruby gems
I also have plans to integrate the ruby bindings for gmp, mpfr and mpc libraries, that are already available as gems, with ruby bindings for our library. I have created an issue here. Feel free to drop any suggestions.
There is much scope for improvement in both the projects. For SymEngine, to support more features like polynomials and series-expansion in the near future, and improving the user interface and the exception handling for the extensions. In short, making the extensions more ruby-ish.
I am grateful to my mentor, Mr. Ondřej Čertík, the Ruby Science Foundation and the SymPy Organisation for the opportunity that they gave me and guiding me through the project, and my team-mates for helping me with the issues. I hope more people will contribute to the project and together we will give a nice symbolic manipulation gem to the Ruby community.