KDD Cup 2015: The story of how I built hundreds of predictive models….And got so close, yet so far away from 1st place!

Data Until I Die!

The challenge from the KDD Cup this year was to use their data relating to student enrollment in online MOOCs to predict who would drop out vs who would stay.

The short story is that using H2O and a lot of my free time, I trained several hundred GBM models looking for the final one, which eventually got me an AUC score of 0.88127 on the KDD Cup leaderboard and, at the time of this writing, landed me in 120th place. My score is 2.6% away from 1st place, but there are 119 people above me!

Here are the main characters of this story:

MariaDB
MySQL Workbench
R
H2O
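For a concrete flavour of the modelling step, here is a minimal sketch of training a single GBM with H2O and reading off its AUC. The original work was done in R; this sketch uses H2O's Python API instead, and the file name, column names, and hyperparameters are placeholders of mine, not the competition's actual features or the author's models.

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Hypothetical feature table exported from MariaDB; names are assumptions.
train = h2o.import_file("enrollment_features_train.csv")
valid = h2o.import_file("enrollment_features_valid.csv")

target = "dropout"                      # assumed binary label: 1 = dropped out
features = [c for c in train.columns if c != target]
train[target] = train[target].asfactor()
valid[target] = valid[target].asfactor()

gbm = H2OGradientBoostingEstimator(ntrees=500, max_depth=5,
                                   learn_rate=0.05, seed=42)
gbm.train(x=features, y=target, training_frame=train, validation_frame=valid)

print("validation AUC:", gbm.auc(valid=True))

Sweeping settings like ntrees, max_depth, and learn_rate while keeping the best validation AUC is presumably how one ends up with several hundred candidate models.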

It started with my obsessive drive to find an analytics project to work on. I happened upon the KDD Cup 2015 competition and decided to give it a go. It had the characteristics of a project that I wanted to get…

View original post 1,843 more words

Reviews: Machine Learning for Music Discovery, ICML 2016 Workshop, New York [1/2: Invited Talks]

Keunwoo Choi

I attended this amazing workshop again this year, Machine Learning for Music Discovery at the International Conference on Machine Learning (ICML) 2016. ICML is one of the biggest conferences in machine learning (ICML: THE summer ML conference; NIPS: THE winter ML conference; or the opposite if it happens somewhere in the southern hemisphere). The whole conference was massive! The committee expected ~3,000 attendees. The ML4MD workshop was also rather packed, though the room was not as large as the deep learning workshop's.

The workshop had one keynote (1 hr), 5 invited talks, 8 accepted talks, and happy hours.

Project Magenta: Can Music Generation be Solved with Music Recommendation?

By Douglas Eck, Google Brain

Douglas Eck gave this presentation about a rather hot topic: Project Magenta by Google Brain. If you haven’t heard of it, please check out the website. The current examples are not the state-of-the-art-as-Google-does-all-the-time kind of result, but it is a project that just started…

View original post 1,359 more words

Singular Value Decomposition Part 2: Theorem, Proof, Algorithm

Math ∩ Programming

I’m just going to jump right into the definitions and rigor, so if you haven’t read the previous post motivating the singular value decomposition, go back and do that first. This post will be theorem, proof, algorithm, data. The data set we test on is a thousand-story CNN news data set. All of the data, code, and examples used in this post are in a GitHub repository, as usual.

We start with the best-approximating $latex k$-dimensional linear subspace.

Definition: Let $latex X = \{ x_1, \dots, x_m \}$ be a set of $latex m$ points in $latex \mathbb{R}^n$. The best approximating $latex k$-dimensional linear subspace of $latex X$ is the $latex k$-dimensional linear subspace $latex V \subset \mathbb{R}^n$ which minimizes the sum of the squared distances from the points in $latex X$ to $latex V$.

Let me clarify what I mean by minimizing the sum of squared distances. First we’ll start with the…
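As a quick numerical illustration of the definition (my own sketch, on random data rather than the post's data set): the top $latex k$ right singular vectors of the matrix whose rows are the points span the best-approximating subspace, and the leftover squared distance is the sum of the squared singular values beyond the $latex k$-th.

import numpy as np

# m hypothetical points in R^n, stacked as the rows of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
k = 3

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k]                        # orthonormal basis of the best k-dim subspace

proj = X @ V_k.T @ V_k              # orthogonal projection of each point onto it
residual = np.sum((X - proj) ** 2)  # sum of squared distances to the subspace

# This matches the sum of the squared singular values beyond the k-th.
assert np.isclose(residual, np.sum(s[k:] ** 2))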

View original post 5,066 more words

Dynamic Time Warping averaging of time series allows faster and more accurate classification

the morning paper

Dynamic Time Warping averaging of time series allows faster and more accurate classification – Petitjean et al. ICDM 2014

For most time series classification problems, using the Nearest Neighbour algorithm (find the nearest neighbour within the training set to the query) is the technique of choice. Moreover, when determining the distance to neighbours, we want to use Dynamic Time Warping (DTW) as the distance measure.
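To make the setup concrete, here is a minimal, unoptimised sketch of DTW plus 1-NN classification in Python (my illustration, without the warping-window constraints or lower bounds that the speed-up literature relies on):

import numpy as np

def dtw_distance(a, b):
    # Classic O(len(a) * len(b)) dynamic-programming DTW with squared pointwise cost.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j],       # insertion
                                 D[i, j - 1],       # deletion
                                 D[i - 1, j - 1])   # match
    return np.sqrt(D[n, m])

def nn_classify(query, train_series, train_labels):
    # 1-NN under DTW: return the label of the closest training series.
    dists = [dtw_distance(query, s) for s in train_series]
    return train_labels[int(np.argmin(dists))]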

Despite the optimisations we looked at earlier this week to improve the efficiency of DTW, the authors argue there remain situations where DTW (or even Euclidean distance) has severe tractability issues, particularly on resource constrained devices such as wearable computers and embedded medical devices. There is a great example of this in the evaluation section, where recent work has shown that it is possible to classify flying insects with high accuracy by converting the audio of their flight (buzzing bees, and so on) into time series.

The…

View original post 1,398 more words

Singular Value Decomposition Part 1: Perspectives on Linear Algebra

Math ∩ Programming

The singular value decomposition (SVD) of a matrix is a fundamental tool in computer science, data analysis, and statistics. It’s used for all kinds of applications from regression to prediction, to finding approximate solutions to optimization problems. In this series of two posts we’ll motivate, define, compute, and use the singular value decomposition to analyze some data.

I want to spend the first post entirely on motivation and background. As part of this, I think we need a little reminder about how linear algebra equivocates linear subspaces and matrices. I say “I think” because what I’m going to say seems rarely spelled out in detail. Indeed, I was confused myself when I first started to read about linear algebra applied to algorithms, machine learning, and data science, despite having a solid understanding of linear algebra from a mathematical perspective. The concern is the connection between matrices as transformations and matrices as a “convenient” way to organize data.

Data…

View original post 2,673 more words

Guest Post: ROB TIBSHIRANI

Normal Deviate

GUEST POST: ROB TIBSHIRANI

Today we have a guest post by my good friend Rob Tibshirani. Rob has a list of nine great statistics papers. (He is too modest to include his own papers.) Have a look and let us know what papers you would add to the list. And what machine learning papers would you add? Enjoy.

9 Great Statistics papers published after 1970
Rob Tibshirani

I was thinking about influential and awe-inspiring papers in Statistics and thought it would be fun to make a list. This list will show my bias in favor of practical work, and by its omissions, my ignorance of many important subfields of Statistics. I hope that others will express their own opinions.

  1. Regression models and life tables (with discussion) (Cox 1972). A beautiful and elegant solution to an extremely important practical problem. Has had an enormous impact in medical science. David Cox…

View original post 501 more words

Misleading modelling: overfitting, cross-validation, and the bias-variance trade-off

Cambridge Coding Academy

Introduction

In this post you will get to grips with what is perhaps the most essential concept in machine learning: the bias-variance trade-off. The main idea here is that you want to create models that are as good at prediction as possible but that are still applicable to new data (i.e. they are generalizable). The danger is that you can easily create models that overfit to the local noise in your specific dataset, which isn’t too helpful and leads to poor generalizability since the noise is random and therefore different in each dataset. Essentially, you want to create models that capture only the useful components of a dataset. On the other hand, models that generalize very well but are too inflexible to generate good predictions are the other extreme you want to avoid (this is called underfitting).

We discuss and demonstrate these concepts using the k-nearest neighbors algorithm…
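A minimal sketch of that demonstration, using scikit-learn's k-nearest neighbours on synthetic data (my own toy example, not the post's): small k fits the training data almost perfectly but generalises worse, while very large k underfits both.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical noisy data set, just to show the pattern.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.1,
                           random_state=0)

for k in (1, 5, 25, 125):
    model = KNeighborsClassifier(n_neighbors=k)
    train_acc = model.fit(X, y).score(X, y)              # accuracy on the data it saw
    cv_acc = cross_val_score(model, X, y, cv=10).mean()  # 10-fold cross-validated estimate
    print(f"k={k:3d}  train={train_acc:.3f}  cv={cv_acc:.3f}")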

View original post 2,590 more words

Sunday Bayes: A brief history of Bayesian stats

The Etz-Files

The following discussion is essentially nontechnical; the aim is only to convey a little introductory “feel” for our outlook, purpose, and terminology, and to alert newcomers to common pitfalls of understanding.

Sometimes, in our perplexity, it has seemed to us that there are two basically different kinds of mentality in statistics; those who see the point of Bayesian inference at once, and need no explanation; and those who never see it, however much explanation is given.

–Jaynes, 1986 (pdf link)

Sunday Bayes

The format of this series is short and simple: Every week I will give a quick summary of a paper while sharing a few excerpts that I like. If you’ve read our eight easy steps paper and you’d like to follow along on this extension, I think a pace of one paper per week is a perfect way to ease yourself into the Bayesian sphere.

Bayesian Methods: General Background

The necessity of…

View original post 1,230 more words

Learning Python For Data Science

Python Tips

For those of you who wish to begin learning Python for Data Science, here is a list of various resources that will get you up and running. Included are things like online tutorials and short interactive courses, MOOCs, newsletters, books, useful tools and more. We decided to put this together so that you can begin learning Data Science with Python right off the bat, without having to spend hours surfing the web in search of resources. Please note that while we believe the list is comprehensive, it is by no means exhaustive. We probably have missed out on a couple of nice resources, so feel free to mention them in the comments if you are so inclined. 🙂

View original post 1,466 more words

Nonlocality and statistical inference

Low Dimensional Topology

It doesn’t have much to do with topology, but I’d like to share with you something Avishy Carmi and I have been thinking about quite a bit lately: the EPR paradox and the meaning of (non)locality. Avishy and I have a preprint about this:

A.Y. Carmi and D.M., Statistics Limits Nonlocality, arXiv:1507.07514.

It offers a statistical explanation for a Physics inequality called Tsirelson’s bound (perhaps to be compared to a known explanation called Information Causality). Behind the fold I will sketch how it works.
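For readers who haven't seen it, Tsirelson's bound is usually stated in the CHSH setting (this summary is mine, not from the preprint): with correlators $latex E(a,b)$ for measurement settings $latex a, a'$ on one side and $latex b, b'$ on the other, define

$latex S = E(a,b) + E(a,b') + E(a',b) - E(a',b')$.

Local hidden-variable theories give $latex |S| \le 2$ (the CHSH inequality), quantum mechanics gives $latex |S| \le 2\sqrt{2}$ (Tsirelson's bound), and general non-signalling correlations only force $latex |S| \le 4$.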

View original post 2,946 more words

Inferring Causal Impact Using Bayesian Structural Time-Series Models

the morning paper

Inferring Causal Impact Using Bayesian Structural Time-Series Models – Brodersen et al. (Google) 2015

Today’s paper comes from ‘The Annals of Applied Statistics’ – not one of my usual sources (!), and a good indication that I’m likely to be well out of my depth again for parts of it. Nevertheless, it addresses a really interesting and relevant question for companies of all shapes and sizes: how do I know whether a given marketing activity ‘worked’ or not? Or more precisely, how do I accurately measure the impact that a marketing activity had, so that I can figure out whether or not it had a good ROI and hence guide future actions? This also includes things like assessing the impact of the rollout of a new feature, so you can treat the word marketing fairly broadly in this context.

…we focus on measuring the impact of a discrete marketing event…
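To give a rough feel for the approach (this is my own non-Bayesian approximation using statsmodels, not the paper's CausalImpact model, and the file and column names are assumptions): fit a structural time-series model with a control regressor on the pre-event period only, forecast the counterfactual over the post-event period, and read the impact off the difference.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical daily data: y = metric of interest (e.g. clicks),
# x = a control series assumed to be unaffected by the marketing event.
df = pd.read_csv("metric_and_control.csv")
pre, post = df.iloc[:70], df.iloc[70:]      # event assumed to happen on day 70

# Local level plus regression on the control, fitted on the pre-period only.
model = sm.tsa.UnobservedComponents(pre["y"], level="local level", exog=pre[["x"]])
fit = model.fit(disp=False)

# Counterfactual: what y would have looked like without the event.
forecast = fit.get_forecast(steps=len(post), exog=post[["x"]])
counterfactual = np.asarray(forecast.predicted_mean)

impact = post["y"].to_numpy() - counterfactual
print("estimated cumulative impact:", impact.sum())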

View original post 1,308 more words

Spectral Clustering: A quick overview

A lot of my ideas about Machine Learning come from Quantum Mechanical Perturbation Theory.  To provide some context, we need to step back and understand that the familiar techniques of Machine Learning, like Spectral Clustering, are, in fact, nearly identical to Quantum Mechanical Spectroscopy.   As usual, this will take several blogs.

Here, I give a brief tutorial on the theory of Spectral Clustering and how it is implemented in open source packages.

At some point I will rewrite some of this and add a review of this recent paper: Robust and Scalable Graph-Based Semisupervised Learning.

Spectral (or Subspace) Clustering

The goal of spectral clustering is to cluster data that is connected but not necessarily compact or clustered within convex boundaries.

The basic idea:

  1. project your data into $latex R^{n}$
  2. define an Affinity matrix $latex A$, using a Gaussian Kernel $latex K$ or, say, just an Adjacency matrix…
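A minimal sketch of that recipe using scikit-learn (my illustration, not from the original post): the two-moons data is connected but far from convex, which is exactly the case where spectral clustering shines and k-means struggles.

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Two interleaving half-moons: connected clusters with non-convex boundaries.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# affinity="rbf" builds the Gaussian-kernel affinity matrix A; gamma is the kernel width.
model = SpectralClustering(n_clusters=2, affinity="rbf", gamma=20.0,
                           assign_labels="kmeans", random_state=0)
labels = model.fit_predict(X)
print(np.bincount(labels))     # roughly 150 points per cluster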

View original post 1,778 more words

Graph Encryption: Going Beyond Encrypted Keyword Search

Outsourced Bits


This is a guest post by Xianrui Meng from Boston University about a paper he presented at CCS 2015, written in collaboration with Kobbi Nissim, George Kollios and myself. Note that Xianrui is on the job market.

Encrypted search has attracted a lot of attention from practitioners and researchers in academia and industry. In previous posts, Seny already described different ways one can search on encrypted data. Here, I would like to discuss search on encrypted graph databases which are gaining a lot of popularity.

1. Graph Databases and Graph Privacy

As today’s data is getting bigger and bigger, traditional relational database management systems (RDBMS) cannot scale to the massive amounts of data generated by end users and organizations. In addition, RDBMSs cannot effectively capture certain data relationships, for example those in object-oriented data structures, which are used in many applications. Today, NoSQL (Not Only SQL) has emerged as a good…

View original post 2,568 more words

TensorFlow Meets Microsoft’s CNTK

The eScience Cloud

CNTK is Microsoft’s Computational Network Toolkit for building deep neural networks and it is now available as open source on GitHub. Because I recently wrote about TensorFlow, I thought it would be interesting to study the similarities and differences between these two systems. After all, CNTK seems to be the reigning champ of many of the image recognition challenges. To be complete, I should also look at Theano, Torch and Caffe. These three are also extremely impressive frameworks. While this study will focus on CNTK and TensorFlow, I will try to return to the others in the future. Kenneth Tran has a very nice top-level (but admittedly subjective) analysis of all five deep learning toolkits here. This will not be a tutorial about CNTK or TensorFlow. Rather my goal is to give a high level feel for how they compare from the programmer’s perspective. …

View original post 3,926 more words