Data Science


Many folks are still a little too attached to the oldest principles of statistical science, developed roughly a century ago by mathematicians testing theories on small samples. Very large data sets, used all at once in a single model, often distort traditional statistical measures - coefficient size and significance, correlations, explained variance, likelihood measures, etc. With millions of observations, even a trivially small effect comes back as highly "significant," so those measures stop telling you whether the effect actually matters.
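As a quick illustration (a minimal sketch on simulated data, not anything from the discussion above), a correlation far too small to matter in practice still produces an astronomically small p-value once the sample is large enough:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)

n = 1_000_000                      # "big data" sized sample
x = rng.normal(size=n)
y = 0.01 * x + rng.normal(size=n)  # true effect is negligible (r is about 0.01)

r, p = pearsonr(x, y)
print(f"correlation r = {r:.4f}")  # roughly 0.01: practically nothing
print(f"p-value       = {p:.2e}")  # astronomically small: "highly significant" anyway
```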

Bootstrapping, ensembling, variance shrinkage and feature selection methods often help work around such problems with big data sets. More importantly, models should be judged primarily on their ability to predict a representative, held-out test sample after being fit ("trained") on the rest of the data. This recommendation comes primarily from some folks at Stanford, themselves premier mathematical statisticians, who led the way in creating the discipline of data science/machine learning.
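A minimal sketch of that workflow, using scikit-learn on a synthetic regression task (the data set and parameter choices here are purely illustrative): shrinkage and feature selection via a cross-validated lasso, ensembling via a random forest, and both judged only on a held-out test sample.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic data: many candidate features, only a few of them truly informative.
X, y = make_regression(n_samples=20_000, n_features=100, n_informative=10,
                       noise=25.0, random_state=0)

# Hold out a representative test sample; fit ("train") on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Shrinkage + feature selection: lasso with its penalty chosen by cross-validation.
lasso = LassoCV(cv=5).fit(X_train, y_train)

# Ensembling: a random forest averages many trees grown on bootstrapped samples.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Judge both models only on data they never saw during fitting.
print("lasso  test R^2:", round(r2_score(y_test, lasso.predict(X_test)), 3))
print("forest test R^2:", round(r2_score(y_test, forest.predict(X_test)), 3))
```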

Also, don’t ever tell your boss that you were only able to get an R-squared of 0.05 when the top two deciles of your model’s predicted values identify the target 200% better than random-chance (or global-rate) targeting - that is, a lift of 3. Assuming management can target those higher-value customer types, you’ve already tripled their sales; and, given enough data, within a very small margin of error.
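Here is a rough sketch of that decile-lift calculation on simulated customer data (the base rate, the scoring rule, and the roughly threefold lift are all artifacts of this simulation, chosen only to mirror the scenario above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 100_000

# A noisy score that only weakly predicts a fairly rare purchase outcome.
score = rng.normal(size=n)
purchase_prob = 1 / (1 + np.exp(-(score - 3.0)))        # low overall base rate
purchase = rng.binomial(1, purchase_prob)

df = pd.DataFrame({"score": score, "purchase": purchase})
df["decile"] = pd.qcut(df["score"], 10, labels=False)    # 9 = highest-scoring decile

global_rate = df["purchase"].mean()
top2_rate = df.loc[df["decile"] >= 8, "purchase"].mean()

print(f"global purchase rate : {global_rate:.3%}")
print(f"top-2-decile rate    : {top2_rate:.3%}")
print(f"lift vs. global      : {top2_rate / global_rate:.1f}x")
```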

Finally, look to the horizon for networked computing to achieve more miracles in applications of artificial intelligence - Robotic Process Automation, Computational Cognition and more.