Sunday, April 20, 2014

How Fast Would the Fastest Human Run 100m?

People have used extreme value theory to predict records in various sports. Here is an article that provides code to visualize such predictions. One can update the dataset to take the latest records into account. It's interesting to see how this update affects the estimates:

Where nobody lives

Despite having a population of more than 310 million people, 47 percent of the USA remains unoccupied. Here is a map showing places where nobody lives:

Vectorization in R: Why?

Beginning R users are often told to “vectorize” their code. Here is an attempt to explain why vectorization can be advantageous in R by showing how R works under the hood:

Checking (G)LM assumptions in R

(Generalized) Linear models make some strong assumptions concerning the data structure. Here is how to verify those assumptions in R:

Wednesday, April 16, 2014

Mapping a century of earthquakes

Did you know that the United States Geological Survey maintains an ever-growing archive of earthquakes detected around the world, and that they make it easy to query and download?
Here is how you can map that data using R:

Benefits of using Open Source Software

Why should public universities use open source software? Read the reasons at:

Monday, April 14, 2014

The Median Isn't the Message

The Median Isn't the Message is the wisest, most humane thing ever written about cancer and statistics. It is the antidote both to those who say that, "the statistics don't matter," and to those who have the unfortunate habit of pronouncing death sentences on patients who face a difficult prognosis. Anyone who researches the medical literature will confront the statistics for their disease. Anyone who reads this will be armed with reason and with hope:

Thursday, April 10, 2014

Beeps and progress alerts to your phone

Would you like your R program to alert you with a beep or ping as soon as the execution is over? Then here is the way to do it:

R Help about Symbols

How to open R help about a symbol or punctuation mark like ( parenthesis or [ bracket:

Saturday, April 5, 2014

God is a Gambler

"All the evidence shows that God was actually quite a gambler, and the universe is a great casino, where dice are thrown, and roulette wheels spin on every occasion."

- Stephen Hawking

Friday, April 4, 2014

Some R Resources for GLMs

It is relatively easy to figure out how to code a GLM in R. Even a total newcomer to R is likely to figure out that the glm() function is part of the core R language within a minute or so of searching. Thereafter, though, it gets more difficult to find the other GLM-related stuff that R has to offer. Here is a far from complete, but hopefully helpful, list of resources.

Thursday, April 3, 2014

Does R have too many packages?

Most of us agree that the availability of thousands of packages on CRAN is often life saving. But there are a few who feel that there are rather too many packages for R. Read this post to know what makes them think that way:

Seven quick facts about R

Here are some key facts about the growth of R:

Tuesday, April 1, 2014

Probability of Extreme Events like 9/11

An interesting article on estimating the probability of large terrorist events like 9/11:

Global Flow of People

A very cool representation of global migration flows between all countries using R:

Don't miss this link and check out the migration pattern of your country:

Friday, March 28, 2014

Why use R?

Why should R be preferred over other statistical software? Read the answer in the words of an extensive user of both a proprietary statistical programming language and the open source alternative.

Add new colors to your R-charts

If you are a big fan of Wes Anderson's movies and if you love the quirky characters and stories, the distinctive cinematography, and the unique visual style, then you can bring some of that style to your own R charts, by making use of these Wes Anderson inspired palettes.

Wednesday, March 26, 2014

Statistics reveal a prescription drug epidemic

After the tragically early death of actor Philip Seymour Hoffman last month, Carlos Grajales finds that the statistics reveal a prescription drug epidemic in the US. Can you believe that in 2010 drug overdoses caused more deaths than motor vehicle traffic crashes?

Tuesday, March 25, 2014

Overlapping Clusters

Aren't all of us used to seeing the well-separated clusters displayed in textbooks and papers? But that doesn't happen in reality. So, what should one do in such cases? Read about how to deal with such situations at:

Handling Character data in R

In today's data-centric world, a statistician can't escape from text data. It's not a very difficult task, if we start in time. So, let's learn about handling character data in R with this free e-book:

Saturday, March 22, 2014

About Normality and Testing for Normality

It is often said that with small sample sizes, everything looks normal, as the normality tests are, indeed, very sensitive to what goes on in the extreme tails. In other words, if we have enough data to fail a normality test, we always will because our real-world data won’t be clean enough. If we don’t have enough data to reliably fail a normality test, then there’s no point in performing the test, and we have to rely on the fat pencil test or our own understanding of the underlying processes. Read the detailed reasoning at:

Why Shouldn't One Use Bivariate Correlations for Variable Selection?

In applied statistics, what typically happens is that a researcher sits down with their statistical software of choice and computes the correlation between their response variable and each of their possible predictors. From here, they toss out potential predictors that either have low correlation or for which the correlation is not significant. The concern here is that it is possible for the marginal correlation between the response and a predictor to be almost zero or non-significant and for that predictor to still be an important element in the data generating pathway. Read more about why we shouldn't be using bivariate correlations for variable selection:

Friday, March 21, 2014

Teaching for Modern Generation

Dr. Rajeeva Karandikar speaks about how teaching should be transformed for the modern generation, which is an instant generation: the Facebook/WhatsApp/Twitter generation, the generation for which sending email is too slow. All those who are teachers, or who aspire to become one, should not miss it ...

Thursday, March 20, 2014

80/20 Rule of Statistical Methods Development

Developing statistical methods is hard and often frustrating work. One of the underappreciated rules in statistical methods development is the 80/20 rule. The basic idea is that the first reasonable thing you can do to a set of data often gets you 80% of the way to the optimal solution. Everything after that is working on getting the last 20%. The hard part of deciding whether to create a new method is judging whether that 20% is worth it. This is obviously application specific. Here is an interesting discussion about the 80/20 rule of statistical methods development.

The Improbability Principle

The video and slides from David Hand's lecture on the subject of his new book 'The Improbability Principle'.

It is about extraordinarily improbable events. It’s about events which are so unlikely that we wouldn’t expect to see them during our entire lifetimes - or even the lifetime of the human race or the universe itself. And it’s about why, despite all that, we do see such events; and more, it’s about why we see them again and again.

Secrets of Teaching R: An Interview with Bob Muenchen

It is of interest to see what makes R so popular, yet ‘quirky’ to learn. To get some insight from a real pro, here is an interview with Bob Muenchen. Bob is the author of 'R for SAS and SPSS Users'. He is also the creator of a popular web site devoted to analyzing trends in analytics software and helping people learn the R language.

Google Drive in R

Want to retrieve all direct links to your Google Documents? R can help you out. Check out the details at:

Bayesian First Aid

Bayesian First Aid is an attempt at implementing reasonable Bayesian alternatives to the classical hypothesis tests in R. Here are a few of them:
Here are a few more introductory articles:

Thursday, March 13, 2014

Magical Wolfram Language

Examples of what can be done with the knowledge-based Wolfram Language..
Right from blurring faces in an image to hiding secret messages in images and making a you-centric world map.. Do check out the complete list!!

Mathematical Character Curves

Check out how various shapes are represented through mathematical equations and inequalities..
We're glad to see that people have been enjoying our mathematical character curves!

Check out how you can play with your favorite cartoon characters using Wolfram Mathematica

Wednesday, March 12, 2014

A Hack to Create Matrices in R, Matlab style!!

The Matlab syntax for creating matrices is pretty and convenient. Its R counterpart is functional but not as pretty, and its default is to fill in the values column-wise. Using meta-programming, we can hack together a function that allows us to create matrices in a similar way as in Matlab. Read more at:

Thursday, March 6, 2014

The Magical Mind of Persi Diaconis

When Diaconis first came to Stanford, he planned to keep his magic background a secret from his academic colleagues.. fearing they wouldn't take seriously a man of hocus-pocus who did research on card shuffling.
Then he stumbled upon a book that described an experiment by the French mathematician Paul Lévy, analyzing the phenomenon known as perfect shuffling - in which a standard deck of cards is carefully shuffled eight times and ends up returning precisely to its starting arrangement. Diaconis says, "I thought, If Paul Lévy can study perfect shuffling, I can say I study perfect shuffling. So I wrote up my work on perfect shuffling, and it got on the front page of The New York Times."
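
The eight-shuffle claim is easy to check by simulation. Here is a minimal Python sketch, assuming that "perfect shuffling" means the out-shuffle (the deck is cut exactly in half and interleaved so that the original top card stays on top); the function names are made up for illustration:

```python
def out_shuffle(deck):
    """Perfect riffle (out-shuffle): cut the deck exactly in half and
    interleave the two halves, keeping the original top card on top."""
    half = len(deck) // 2
    top, bottom = deck[:half], deck[half:]
    shuffled = []
    for a, b in zip(top, bottom):
        shuffled.extend([a, b])
    return shuffled


def shuffles_to_restore(n_cards=52):
    """Count how many out-shuffles bring an even-sized deck back to its start."""
    start = list(range(n_cards))
    deck = out_shuffle(start)
    count = 1
    while deck != start:
        deck = out_shuffle(deck)
        count += 1
    return count


print(shuffles_to_restore(52))  # prints 8
```

Under that assumption, a 52-card deck returns to its starting order after exactly eight shuffles, which matches the description above.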

Forecasting weekly data

What would you do if the seasonal period is rather long and non-integer? For example, if you have weekly data, ARIMA models do not tend to give good results. The simplest approach in such a situation is a regression with ARIMA errors. Here is an example using weekly data on US finished motor gasoline products supplied (in thousands of barrels per day) from February 1991 to May 2005.

Wednesday, March 5, 2014

Beauty is the First Test

"Beauty is the first test; there is no permanent place in the world for ugly mathematics."
- G. H. Hardy

Why is mathematics beautiful and why does it matter? Here is a Huffington Post article:

No need for SPSS – Now beautiful output in R as well

Many social scientists don't want to move to R as it doesn't give a simple table view like the SPSS output window. The articles below discuss ways to put the results of certain statistics in HTML tables in R. These tables can be saved to disk or, even better for quick inspection, shown in a web browser or viewer pane... and then R output will be at least as beautiful as the SPSS output.

Tuesday, March 4, 2014

Photoshop via Clustering

"Do not believe anything: what artists really do is to hang around all day."
-Paco de Lucia
It seems clustering is the new way to Photoshop.. one gets different variations with different numbers of clusters..
PS: Don't miss the video link in the end.

Oldies but Goldies: Some Classical Books on Statistical Graphics

The article below highlights some interesting things about three classical books on statistical graphics. The books are old but still relevant and together they give a sense of the development of exploratory graphics in general and the graphics system in R specifically as all three books were written at Bell Labs where the S-language was developed.

Monday, March 3, 2014

Movies and Statistics

It’s Oscars season again, so why shouldn't statisticians enjoy this movie fever...

Here is some number crunching with IMDb data, using R..

Some tools on predicting Academy Awards..

Thursday, February 27, 2014

Why You Should Always Get The Bigger Pizza

When one looks at thousands of pizza prices, one can see that you almost always get a much, much better deal when you buy a bigger pizza. The math of why bigger pizzas are such a good deal is simple: A pizza is a circle, and the area of a circle increases with the square of the radius.
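
To make the arithmetic concrete, here is a small Python sketch. The menu diameters and prices are hypothetical numbers chosen only for illustration (they are not from the article), and the function names are made up:

```python
import math


def pizza_area(diameter_inches):
    """Area of a round pizza in square inches."""
    return math.pi * (diameter_inches / 2) ** 2


def area_per_dollar(diameter_inches, price):
    """Square inches of pizza bought per dollar."""
    return pizza_area(diameter_inches) / price


# Hypothetical menu: price grows in step with diameter, area grows with its square.
menu = {8: 8.00, 12: 12.00, 16: 16.00}
for diameter, price in menu.items():
    print(diameter, round(area_per_dollar(diameter, price), 2))
```

Doubling the diameter quadruples the area, so on a menu like this one the largest pizza gives the most area per dollar.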

Job Trends in the Analytics Market

This article presents various ways of measuring the popularity or market share of software for analytics including R, SAS, SPSS etc. It's interesting to note that analytics jobs for SPSS have not changed much over the years, while those for R have been steadily increasing. The jobs for R finally crossed over and exceeded those for SPSS toward the middle of 2012.

Monday, February 24, 2014

Modelers' Hippocratic Oath

For those who make a living by building some kind of Analytical models... Here comes a modelers' Hippocratic Oath, from the book "The Quants"

1. I will remember that I didn't make the world, and it doesn't satisfy my equations

2. Though I will use models boldly to estimate value, I will not be overly impressed by Mathematics

3. I will never sacrifice reality for elegance without explaining why I have done so.

4. Nor will I give the people who use my model false comfort about its accuracy. Instead, I will make explicit its assumptions and oversights.

5. I understand that my work may have enormous effects on society and the economy, many of them beyond my comprehension

Thursday, February 20, 2014

A Delicious Analysis !!

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. The article below discusses the use of topic modelling to find the relevance of various ingredients, using data on recipes..

Coloured Noise

Have you ever wondered whether there could be other colors besides our all-time favorite White Noise... like Red, Pink or Green? Read more about these coloured noises at:

Wednesday, February 19, 2014

Princeton vs Facebook

Whoa!!! It seems two biggies had a tussle here. When Princeton claimed a rapid decline in Facebook, Facebook retorted by debunking Princeton. Enjoy reading, and don't forget to take away the message about how cautious one should be while doing data analysis..

Here are some third party views on this debate:

Monday, February 17, 2014

Communication Skills in Academics and Research

"The first rule of writing is not to omit the thing you meant to say." - Ralph Waldo
Here is Terry Speed, discussing the importance of communication skills in academics and research..

Thursday, February 13, 2014

Wednesday, February 12, 2014

Handling Dates and Times in R

Here is a small tutorial on the various ways to handle dates and times in R:

Does commuting affect our well-being?

Does commuting affect our well-being? Definitely, according to the Office for National Statistics!
Data analysis from the Annual Population Survey revealed that commuting has a negative impact on personal well-being with the worst effects on happiness and anxiety. Read more at:

Monday, February 10, 2014

Banknotes featuring Scientists and Mathematicians

One can find a collection of currency featuring scientists and mathematicians from all over the world at the link below:

Saturday, February 8, 2014

Getting to The Heart of it With Monte Carlo

You only need two functions to draw a heart mathematically. Once you have drawn a heart using these two functions, you can easily compute its area using Monte Carlo techniques. Details can be found at:
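
The linked post's exact functions are not reproduced here, so the following Python sketch assumes one common pair of heart curves, y = |x|^(2/3) - sqrt(1 - x^2) for the lower boundary and y = |x|^(2/3) + sqrt(1 - x^2) for the upper one. The function names, bounding box and sample size are likewise illustrative choices:

```python
import math
import random


def inside_heart(x, y):
    """True if (x, y) lies between the assumed lower and upper heart curves."""
    if abs(x) > 1:
        return False
    centre = abs(x) ** (2 / 3)
    half_height = math.sqrt(1 - x * x)
    return centre - half_height <= y <= centre + half_height


def heart_area(n_points=200_000, seed=1):
    """Monte Carlo area: fraction of random points in a box that land in the heart."""
    rng = random.Random(seed)
    x_min, x_max, y_min, y_max = -1.0, 1.0, -1.5, 2.0  # box enclosing the heart
    hits = sum(
        inside_heart(rng.uniform(x_min, x_max), rng.uniform(y_min, y_max))
        for _ in range(n_points)
    )
    box_area = (x_max - x_min) * (y_max - y_min)
    return box_area * hits / n_points


print(heart_area())
```

For this particular pair of curves the vertical gap between them is 2 * sqrt(1 - x^2), so the exact area is pi, which gives a handy check on the estimate.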

Friday, February 7, 2014

Linear Modeling and Logistic Regression with R

If you're new to the R language but keen to get started with linear modeling or logistic regression in the language, take a look at the link below. It works through a series of examples to teach by demonstration. All of the datasets used in the guide are available online, so it's easy to follow along from home.

Thursday, February 6, 2014

Wolfram Personal Analytics for Facebook

Recently you might have been enjoying personal #FacebookIs10 videos.. Now it is time to take a look at the stats behind your Facebook profile with Wolfram Personal Analytics..
Right from Statistics of your posts, their weekly distribution, post lengths to word cloud, top commenters and sharers..

Interview with Kanti Mardia

Statistics provides a challenge somewhat akin to Sherlock Holmes’ task: how to find hidden truth in any data, from small to big.
- Kanti Mardia

Read the complete interview with Samuel S. Wilks Award winner Kanti Mardia, at:

Saturday, February 1, 2014

Amount of snow to cancel school

The following link gives an interesting visualization showing the estimated amount of snow required to close school for the day, by county. It's not simply proportional to the amount of snowfall, as school cancellation is the result of more snow than an area is used to handling.

Thursday, January 30, 2014

Comparing multiple (g)lm in one graph

It is already possible to compare multiple models as table output; here the author has built a function that plots several (g)lm-objects in a single ggplot-graph:

Recurrent events analysis, not so straightforward!

Heart failure hospitalizations are associated with an increased risk of cardiovascular death, so if an individual dies during follow-up, this isn't necessarily independent of the event process of interest. Dependent censoring needs to be accounted for in any analyses that are carried out, and this renders standard methods unsuitable. Here is some discussion about the alternative approaches:

History through the president’s words

Studying presidents' choice of words over time provides glimpses of change in American politics. Check out the different tabs.. e.g. Foreign policy gives a very clear picture of how relations evolved with various countries..

Free books on statistical learning

Here are some books related to statistical learning, freely available online:

Story Competition

People first started talking about the Normal Distribution nearly 300 years ago. The scientific community used their understanding of the Normal Curve to model and give meaning to the results of their experiments. Today, we owe much of our modern technology and modern world to the discoveries made possible by the Normal Curve. So, what would the world be like if the Normal Curve had never been discovered? Submit your story for a chance at $3500 in cash prizes!

Wednesday, January 29, 2014

Charts That Don’t Start at Zero

A statistician throws light on how an improper usage of statistical tools can lead to misleading conclusions:

Tuesday, January 28, 2014

Interview with Inventor of S and R

John Chambers (creator of S programming language & core member of R programming language project) recounts the history of S and R in the following interview:

John Chambers talks about his involvement in the birth of the S language in 1976, and how it evolved over the years to become the inspiration for the R language.

Monday, January 27, 2014

Public transit times in major cities

Here is an interesting visualization..

You can select the time of the day and day of the week, and get a realistic estimate of how long it takes to get from point A to point B. There is also an interesting comparison option, which lets you choose two locations to see which area will get you somewhere else faster.

Musings on Random Walk

"A drunk man will find his way home, but a drunk bird may get lost forever."
- Shizuo Kakutani
Want to know why? Read at:

R Tricks for Kids

Here is an article from 'Teaching Statistics', which describes simulation models of real-world phenomena that can be used to engage middle-school students with probability. Links to R instructional material and easy-to-use code are provided to facilitate implementation in the classroom.

Friday, January 24, 2014

An interview with Sir David Cox

Sir David Cox is arguably one of the world’s leading living statisticians. He has made pioneering and important contributions to numerous areas of statistics and applied probability over the years, of which perhaps the best known is the proportional hazards model, which is widely used in the analysis of survival data. In this interview, he says, "I would like to think of myself as a scientist, who happens largely to specialise in the use of statistics."

Read the complete interview at:

Wednesday, January 22, 2014

Does 1+2+3… really equal -1/12?

A recent Numberphile video claims that the sum of all the positive integers is -1/12. Bothered by that, Evelyn Lamb talks about what it means to assign a value to an infinite series and explains different ways of doing this.

A century of passenger air travel

Kiln and the Guardian explored the 100-year history of passenger air travel, and the interactive kicks off with a map that uses live flight data from FlightStats. The map shows all flights in the air right now. Be sure to click through all the tabs. They're worth the watch and listen, with a combination of narration, interactive charts, and old photos.

Tuesday, January 21, 2014

Solving water resource problems using Statistics

In an exclusive interview, Dr. Upmanu Lall, Director of Columbia Water Center discusses how he uses Statistics and an understanding of climate, agriculture, commerce, engineering, technology, and politics to solve some of the world’s most pressing water problems:

Sunday, January 19, 2014

Not Missing at Random

Not Missing at Random (NMAR) is data that is missing for a specific reason..
Here is an interesting example of NMAR data.. with the message that one shouldn't be sad and low after reading on Facebook about the abnormally flattering lives of one's friends..

The Music Timeline

The Music Timeline shows genres of music waxing and waning using a stacked area chart. Each stripe on the graph represents a genre; the thickness of the stripe tells you roughly the popularity of music released in a given year in that genre.

An Interview with Larry Wasserman

Professor Larry Wasserman is currently Professor in the Department of Statistics and Machine Learning at Carnegie Mellon University. His research interests include nonparametric inference, machine learning, statistical topology and astrostatistics. Here is a link to his interview where he talks about statistics and his career in statistics.

R is the most-used tool

O'Reilly has just published the results of the Data Scientist Salary Survey, based on data collected from attendees of the O'Reilly Strata conferences in 2012 and 2013. Each respondent listed multiple tools that they used both in data roles and non-data roles. R topped the list of Statistical Software beating SAS, SPSS, Excel etc.

Thursday, January 16, 2014

Competitions to celebrate 175th Anniversary of ASA

The American Statistical Association is celebrating its 175th anniversary. You may celebrate with them by doing any of the following:
  • Entering ASA's Got Talent, the ASA's unique talent competition
  • Looking for clues in Amstat News and playing ASA's Trivia Challenge
  • Sending in your design for the ASA's official 175th anniversary T-shirt

Submit your entries before 30th April 2014. More details can be found at: 

Wednesday, January 15, 2014

Timeline of Statistics

Check out this concise yet detailed "timeline of statistics" published by Significance magazine to celebrate its 10th anniversary..

Regression with Gradient Descent

Here is an overview of the gradient descent algorithm, which offers some intuition on why the algorithm works and where it comes from, and provides examples of implementing it for ordinary least squares and logistic regression in R:
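
To give a flavour of the idea without R, here is a minimal Python sketch of gradient descent for ordinary least squares with a single predictor. It is not the linked article's code; the function name, learning rate, step count and toy data are all assumptions made for illustration:

```python
def fit_line_gd(xs, ys, learning_rate=0.05, n_steps=5000):
    """Fit y = intercept + slope * x by gradient descent on the mean squared error."""
    n = len(xs)
    intercept, slope = 0.0, 0.0
    for _ in range(n_steps):
        # Residuals of the current fit.
        residuals = [intercept + slope * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_intercept = 2 * sum(residuals) / n
        grad_slope = 2 * sum(r * x for r, x in zip(residuals, xs)) / n
        # Step against the gradient.
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope


# Toy data lying exactly on the line y = 1 + 2x.
xs = [0, 1, 2, 3, 4]
ys = [1 + 2 * x for x in xs]
print(fit_line_gd(xs, ys))
```

On this toy data the estimates settle very close to an intercept of 1 and a slope of 2. The learning rate matters: a step that is too large makes the iterations diverge instead of converge.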

Lexical distance between European languages

So why is English still considered a Germanic language and not a Latinate one? How do you measure proximity within linguistic families? Read more at:

Tuesday, January 14, 2014

n vs n-1

People keep on wondering “Why is the denominator in the sample mean n, but the denominator for the sample variance is n−1?” All of us have had to answer this question at some time in our careers, either for our students or for ourselves. How do you answer it, and how helpful is your answer? Do you feel obliged to introduce distinctions such as populations vs samples, description vs inference, parameters vs statistics, Greek vs Roman letters? Or more advanced concepts, such as degrees of freedom, dimensions of subspaces, unbiasedness or maximum likelihood? Read more at:
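
One way to answer the question is to let a simulation speak. The Python sketch below is only an illustration under stated assumptions (standard normal data with true variance 1, samples of size 5, a made-up function name): it averages both versions of the sample variance over many repeated samples.

```python
import random


def average_variance_estimates(sample_size=5, n_repeats=20_000, seed=1):
    """Average the divide-by-n and divide-by-(n-1) variance estimates
    over many samples drawn from a standard normal (true variance 1)."""
    rng = random.Random(seed)
    total_div_n = 0.0
    total_div_n_minus_1 = 0.0
    for _ in range(n_repeats):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        mean = sum(sample) / sample_size
        sum_sq = sum((v - mean) ** 2 for v in sample)
        total_div_n += sum_sq / sample_size
        total_div_n_minus_1 += sum_sq / (sample_size - 1)
    return total_div_n / n_repeats, total_div_n_minus_1 / n_repeats


biased, unbiased = average_variance_estimates()
print(biased, unbiased)
```

With samples of size 5, the divide-by-n average should come out near (n-1)/n = 0.8 of the true variance, while the divide-by-(n-1) average should sit near 1.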

How Much Time to Conceive?

One of the most important questions that people ask when they make the decision to have a child is: how long is it going to take us to get pregnant? The probabilities mentioned by doctors provide an answer to this question. But these probabilities are estimates at best (albeit, no doubt, educated estimates!) and are associated with some not insignificant uncertainties. Here is an approach to judge how important the value of the monthly probability is in determining the time to conception, using basic probability distributions and R visualisations:
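
As a rough illustration of the kind of calculation involved, here is a Python sketch that assumes the simplest possible model: each month is an independent trial with the same probability of conception, so the waiting time follows a geometric distribution. The monthly probabilities and function names below are hypothetical, not figures from the article:

```python
def prob_conceive_within(months, monthly_prob):
    """P(conception within `months` months), assuming independent months
    with a constant monthly probability (geometric waiting time)."""
    return 1 - (1 - monthly_prob) ** months


def expected_months(monthly_prob):
    """Mean of the geometric waiting time, in months."""
    return 1 / monthly_prob


# Hypothetical monthly probabilities, chosen only to show the sensitivity.
for p in (0.05, 0.10, 0.20):
    print(p, round(prob_conceive_within(12, p), 3), expected_months(p))
```

Even under this simple assumption, the probability of conceiving within a year changes a great deal as the monthly probability moves across this range.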

Monday, January 13, 2014

Hidden History

A modern statistician needs to appreciate the historical roots of the profession, argues Terry Speed:

So look to your statistical roots!

Friday, January 10, 2014

Are you saving too much?

The only hard-and-fast rule for how much retirement income you will need is that there is no hard-and-fast rule. New research shows that many retirees can live well on less than the amount suggested by the financial industry, but others rack up higher expenses through travel, expensive hobbies or medical costs that can't be avoided. Read more at:

Thursday, January 9, 2014

From spreadsheet thinking to R thinking

One may feel some inertia in switching from spreadsheets to R. Here is a post to help overcome it:

Statistics and The War

We all agree that wars are terrible and to be avoided to the greatest extent possible, yet it is hard not to concede that wars can bring scientific, technological, industrial, cultural, political, even economic benefits. This is one of the many paradoxes of war. Statistics is no exception. Not only was there extremely rapid development of some areas of statistics, especially industrial statistics, but also a large proportion of the leaders in our subject in the 40 years following World War 2 met it for the first time during the War. Most of them would not have become statisticians but for the War.

Wednesday, January 8, 2014

Paul Erdös, The Maverick Genius

One of the finest minds in the history of mathematics, Erdös chose as his epitaph the self-deprecating Hungarian phrase “Finally I am becoming stupider no more.” Read more about him at:

Friday, January 3, 2014

Bodily maps of emotions

Emotions are often felt in the body, and somatosensory feedback has been proposed to trigger conscious emotional experiences. As one would expect, the body looks like it shuts down with depression, and it lights up with happiness, but it's the subtle differences that are most interesting. Read how statistical classifiers were used to distinguish emotion-specific activation maps accurately:

Thursday, January 2, 2014

Generalized linear models for predicting rates

We often need to build a predictive model that estimates rates. A simple example is estimating default rates of mortgages or credit cards. One could try linear regression, but specialized tools often do much better. Here is a discussion of how to do such things in R:

Experience the thrill of touching real data

The story of one man's efforts to re-analyse the stats behind a BBC report on bowel cancer is a heartwarmingly nerdy one:

Wednesday, January 1, 2014

Parallelisation may not always be better than sequential processing

Parallelisation incurs some overhead: information needs to be distributed over the nodes, and the result from each node needs to be collected and aggregated into the resulting object. This overhead is one of the main reasons why in certain cases parallel processing takes longer than sequential processing. Read more at:

Animation of the Construction of a Confidence Interval

The confidence interval is one of the more tricky statistical concepts. A way of explaining confidence intervals is as the region of possible null hypotheses resulting in corresponding significance tests that are not rejected. Turns out it is not easy to make a corresponding nice explanatory animation either, but that’s what has been tried here:
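
The "region of null hypotheses that are not rejected" view can also be checked numerically. Below is a minimal Python sketch, assuming a two-sided z-test for a mean with known standard deviation; the summary numbers (mean 10, sigma 2, n = 25), the grid step and the function names are made up for illustration:

```python
from statistics import NormalDist


def ci_by_test_inversion(sample_mean, sigma, n, level=0.95, grid_step=0.001):
    """Scan candidate null means and keep those a two-sided z-test does not reject."""
    se = sigma / n ** 0.5
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Candidate null hypotheses on a grid spanning 5 standard errors either side.
    n_grid = int(round(10 * se / grid_step))
    candidates = [sample_mean - 5 * se + i * grid_step for i in range(n_grid + 1)]
    accepted = [mu0 for mu0 in candidates if abs(sample_mean - mu0) / se <= z_crit]
    return min(accepted), max(accepted)


def ci_closed_form(sample_mean, sigma, n, level=0.95):
    """Textbook interval: mean plus or minus z times the standard error."""
    se = sigma / n ** 0.5
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return sample_mean - z_crit * se, sample_mean + z_crit * se


print(ci_by_test_inversion(10.0, 2.0, 25))
print(ci_closed_form(10.0, 2.0, 25))
```

The two intervals should agree up to the grid resolution, which is the point of the explanation: the confidence interval collects exactly those null values that the test would not reject.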