Back in 2001 the term ‘Data Science’ was first used in a publication by William Cleveland; fast forward to 2012 and the Harvard Business Review hails the data scientist as having ‘the sexiest job of the 21st Century’; fast forward to today and every business wants to employ data scientists. What is this thing called data science, where has it come from, and why is it so popular? Read on and I will reveal all.
A one-line definition of what a data scientist does is a difficult thing to write. The job has grown to encompass trend analysis, plotting, informatics, machine learning, artificial intelligence, numerical analysis, text analysis, business analytics, web site click analysis and more. It can mean different things in different roles. Instead of a one-line definition, think more in terms of a short discussion:
• Data Science extracts knowledge and insights from data, often information that would not be available from just one set of data.
• The roots of data science are in scientific methods and algorithms, and these are applied rigorously. The output has to be verified as being plausible.
• Most data scientists spend the majority of their time cleaning data before it can be used. This is to remove the incorrect and partial data.
• It is multi-disciplinary and often collaborative, being part physics, statistics, maths, computer science, coding, big data, business. To be a good data scientist you have to understand all aspects.
The roots of data science date back to the areas of physics and astronomy, where large data sets were first collected and scientifically analysed. There are many good examples of data science in the world of physics, but I believe the first example is that of Tycho Brahe and Johannes Kepler. Tycho took a large number of superbly accurate astronomical observations by eye; he lived in the 16th Century, before the telescope was invented. To quantify how accurate, his measurements were 5 times more precise than those of his contemporaries. He had created an accurate, clean data set. He collaborated with Kepler, who applied numerical rigour to turn the observations into scientific discoveries. Kepler used Tycho’s observations, scientific methods and algorithms to develop his laws of planetary motion.
Another good example is measuring the distance from the earth to the sun. Today, this is a relatively simple task using radar or lasers, but this was not the case when measurements were first starting to be made by Archimedes in the 3rd Century BC. Mankind continued to try to measure this distance in earnest, obtaining a wide range of different results, until Simon Newcomb came along in 1895. You guessed it, he applied data science techniques to get an incredibly accurate answer with quite limited resources. The main difference in Newcomb’s approach was the use of multiple data sets: he didn’t just use the really popular transit of Venus to get an answer, he also used aberration, the speed of light and the Gaussian gravitational constant. He made sure the data sets he was working with were clean and he collaborated with other scientists to ensure proper scientific rigour in his work. This all paid off with a highly accurate calculated value of 0.9994 AU; it would take another 50 years for us to get closer to the correct answer.
More recent examples include the way large particle physics projects analyse their data, and the way places like CERN, which generate huge amounts of data, handle their work within a consortium. However, these analytics skills, often honed in physics or astrophysics degrees, are becoming valuable to the world outside academic physics. These skills are not only applicable, but highly desirable and sought after.
If you come along to one of my talks on data science you will learn about all these things in more detail and get to experience the thrill of live coding. From scratch I write a simple machine learning algorithm to predict whether a person would have survived the maiden voyage of the Titanic. This will give you an insight into the kind of work that data scientists do. Writing in Python and using open source data from Kaggle, I will take you through the initial look at the data, how to clean it, how to build a machine learning algorithm and how to interpret the results. We will work out whether you would want to be a wealthy female or a poor male to get the best odds of survival!
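To give a flavour of those steps (look at the data, clean it, build a model, interpret the results), here is a minimal sketch in plain Python. It is not the code from the talk: the rows are hypothetical toy data invented for illustration, not real passenger records, and the "model" is a deliberately simple stand-in that learns the survival rate for each group of sex and ticket class. The function names clean, fit and predict are my own choices for this sketch; only the column names Survived, Pclass and Sex mirror the Kaggle Titanic file.

```python
from collections import defaultdict

# Hypothetical toy rows, invented for illustration; NOT real passenger records.
# Column names mirror the Kaggle Titanic CSV (Survived, Pclass, Sex).
TOY_ROWS = [
    {"Survived": "1", "Pclass": "1", "Sex": "female"},
    {"Survived": "1", "Pclass": "1", "Sex": "female"},
    {"Survived": "0", "Pclass": "3", "Sex": "male"},
    {"Survived": "0", "Pclass": "3", "Sex": "male"},
    {"Survived": "1", "Pclass": "3", "Sex": "female"},
    {"Survived": "0", "Pclass": "3", "Sex": "female"},
    {"Survived": "0", "Pclass": "1", "Sex": "male"},
    {"Survived": "1", "Pclass": "1", "Sex": "male"},
    {"Survived": "0", "Pclass": "1", "Sex": "male"},
    {"Survived": "", "Pclass": "2", "Sex": "male"},    # partial row: missing label
    {"Survived": "1", "Pclass": "", "Sex": "female"},  # partial row: missing class
]

FIELDS = ("Survived", "Pclass", "Sex")


def clean(rows):
    """Drop rows with any missing field and convert numeric fields to int."""
    cleaned = []
    for row in rows:
        if any(not row.get(field, "").strip() for field in FIELDS):
            continue  # remove partial data
        cleaned.append({
            "Survived": int(row["Survived"]),
            "Pclass": int(row["Pclass"]),
            "Sex": row["Sex"].strip().lower(),
        })
    return cleaned


def fit(rows):
    """Learn the observed survival rate for each (Sex, Pclass) group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [survivors, total]
    for row in rows:
        key = (row["Sex"], row["Pclass"])
        counts[key][0] += row["Survived"]
        counts[key][1] += 1
    return {key: survivors / total for key, (survivors, total) in counts.items()}


def predict(model, sex, pclass, default=0):
    """Predict 1 (survived) if the group's learned rate exceeds 0.5, else 0."""
    rate = model.get((sex, pclass))
    if rate is None:
        return default  # group never seen during fitting
    return int(rate > 0.5)


model = fit(clean(TOY_ROWS))
for group, rate in sorted(model.items()):
    print(group, round(rate, 2))
```

To try the same idea on the real data, one option would be to read the Kaggle training file with csv.DictReader from the standard library and pass the resulting rows to clean in place of TOY_ROWS. Interpreting the results then amounts to comparing the learned rates between groups.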
Because data science identifies and predicts trends, it has become a staple in the financial world. It is quickly spreading across the business world, where it is used to build tools such as recommender engines, business analytics and online advertising. People in all different kinds of analytics roles are being re-branded as data scientists to keep up with the demand. Exactly where this sudden demand and the response to it will take the world of data science will be interesting to see as it unfolds over the next few years.
Thanks to the Institute of Physics for collaborating with Swamphen Enterprises to produce this blog post. If you want to learn more about the topic, then come along to one of my talks hosted by the IOP; registration details are on my events web page, or go to https://events.iop.org/ and search for ‘Tamara’. If you want to read the IOP published version, then visit https://beta.iop.org/rise-and-rise-data-science.