The A-Z of data science


28 Sep 2015

So you can understand what it means to mine through yottabytes of unstructured data, here’s our glossary of data science terms.

From all-encompassing analytics to the myth that is zero bias, here’s a crash course in data science to help you get to grips with this fascinating field.

Ireland is on the cusp of a data science tsunami and, with a week of content focusing on data science, we’re going to explore, examine and extract knowledge from this rapidly-growing field.

But first, a primer on some of the core aspects of data science and its varied applications.

A is for analytics

Analytics and data science go hand-in-hand, with analysis unearthing and interpreting meaningful patterns within data. The field of analytics combines statistics, computer programming and operations research. Results from data analysis can guide decision-making, and it is frequently used by businesses to assess, predict and improve performance.

B is for big data

In the strictest sense, big data relates to datasets so large or complex that they cannot be run through traditional data-processing applications. However, use of this term has become so popular it has mutated to take on broader contexts. These days, big data can refer to any data used in conjunction with advanced methods of analysis, regardless of the dataset’s size.

C is for climate modelling

Now we’ve covered the basics, we can delve into the applications of data science, such as climate modelling.

Researchers have begun to use big data to map, model and interpret the predicted environmental effects of climate change. These researchers use recently gathered extensive and complex data as well as older datasets to create predictive models of how climate change will manifest over time.

D is for data mining

Similar to analytics, data mining involves the discovery of patterns in data. Data mining, however, specifically applies to big data.

The goal is to extract understandable information from multiple large, related datasets, highlighting correlations and relationships between them.
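
To make that concrete, here’s a minimal sketch in Python using pandas – the datasets, keys and figures are invented for illustration – showing the basic move of joining related datasets on a shared key and checking for a correlation between them:

```python
import pandas as pd

# Two small, related datasets sharing a customer_id key (invented for illustration)
purchases = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "monthly_spend": [120, 250, 80, 310, 95],
})
web_activity = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "site_visits": [4, 11, 2, 14, 3],
})

# Join the datasets on the shared key, then look for a relationship
merged = purchases.merge(web_activity, on="customer_id")
print(merged["monthly_spend"].corr(merged["site_visits"]))  # Pearson correlation
```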

E is for econometrics

Econometrics is analytics for the economy. It utilises mathematics, statistics and computer science to interpret economic data, uncover relationships and patterns, and draw conclusions about economic questions.
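
As a toy illustration of the statistical side, the sketch below fits a simple linear consumption function to invented income data using NumPy – real econometric work involves far more careful model specification:

```python
import numpy as np

# Toy economic data (invented): household income vs consumption, in €1,000s
income      = np.array([20, 30, 40, 50, 60, 70])
consumption = np.array([18, 25, 33, 40, 46, 54])

# Fit a simple linear consumption function: consumption = a + b * income
b, a = np.polyfit(income, consumption, 1)
print(f"marginal propensity to consume ~ {b:.2f}, intercept ~ {a:.2f}")
```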

F is for forecasting

Forecasting is basically the modern-day, data science-based equivalent of Nostradamus. Through forecasting, predictions are made based on existing data and the analysis of trends.

For all its rigour, forecasting remains an unpredictable business. Even with complex, nuanced statistical models and the most up-to-date data, forecasters are still trying to predict the future, and one can never account for all the variables.
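
Here’s a bare-bones sketch of the idea in Python – fitting a trend to invented historical figures and extrapolating it forward, with all the caveats above applying:

```python
import numpy as np

# Invented monthly sales figures for the example
months = np.arange(1, 13)
sales  = np.array([100, 104, 109, 112, 118, 121, 127, 131, 136, 140, 146, 150])

# Fit a linear trend to the observed data...
slope, intercept = np.polyfit(months, sales, 1)

# ...and extrapolate three months ahead. The further out, the less reliable.
for m in [13, 14, 15]:
    print(f"month {m}: forecast ~ {slope * m + intercept:.0f}")
```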

G is for gaming

Why are Riot Games, Zynga et al recruiting data scientists right now? Well, when you have millions of players of an online game, you potentially have a vast mine of data to tap into to inform your future decisions.

This information may be used to push advertising to the right target demographics, or to improve the game based on players’ needs and expectations. Sustaining gamers’ attention is tricky business, so anything that keeps players playing is incredibly valuable in what is a highly competitive and lucrative industry.

H is for Hadoop

To the uninitiated, Hadoop may sound like a nonsense word, but to those working in and around data science, it’s one of the most commonly used terms in their vernacular.

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large datasets on computer clusters built from commodity hardware. The architecture is important because it means that nodes can process blocks of data in parallel, making it far more efficient for deep, intensive analytics.
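
The programming model at Hadoop’s core is MapReduce: a map phase runs in parallel across the nodes holding each block of data, and a reduce phase aggregates the results. The canonical example is a word count; below is a minimal single-machine Python sketch of the same pattern, not actual Hadoop code:

```python
from itertools import groupby

# Invented input: each string stands in for a block of a large file
documents = ["big data needs big clusters", "hadoop splits big jobs"]

# Map phase: emit a (word, 1) pair for every word; on a real cluster
# each node runs this over the data blocks it stores locally
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the pairs by key so each reducer sees one word
mapped.sort(key=lambda pair: pair[0])

# Reduce phase: sum the counts for each word
for word, pairs in groupby(mapped, key=lambda pair: pair[0]):
    print(word, sum(count for _, count in pairs))
```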

I is for internet of things (IoT)

The internet of things presents unique challenges and opportunities in data science. In the oncoming IoT age, all kinds of hardware and everyday objects will be connected to the internet and each other via machine-to-machine communications.

But with more connected devices comes even more data, with Cisco expecting global mobile data traffic to grow to 11.2 exabytes per month by 2017, from less than one exabyte in 2012. What we consider big data today will pale in comparison to the whopping datasets the IoT era will usher in.


J is for jobs

As more and more businesses start dealing in data, the role of data scientist is popping up across a wide range of industries. However, there aren’t enough active, qualified data scientists to fill these positions, and the shortage is set to get worse.

The McKinsey Global Institute projects that the US alone will face a shortage of 140,000 to 190,000 data scientists by 2018.

K is for KDD, or knowledge discovery in databases

The process of discovering useful knowledge from a collection of data is known as knowledge discovery in databases, or KDD for short. The ultimate goal of KDD is to extract high-level knowledge from low-level data, and the process involves preparation, selection, cleansing and interpretation of data.

KDD isn’t new, but the technology to manage these tasks at scale is. These days, with unfathomably large datasets, data-mining software and artificial intelligence have become integral to the process.
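
As a tiny illustration of the preparation and cleansing steps, here’s a sketch in Python using pandas, with an invented ‘low-level’ dataset:

```python
import pandas as pd

# A messy, low-level dataset (invented): duplicates, missing values, bad types
raw = pd.DataFrame({
    "sensor_id": [1, 1, 2, 3, 3],
    "reading":   ["20.5", "20.5", None, "19.8", "21.1"],
})

# Selection and cleansing: drop exact duplicates, discard missing readings,
# coerce readings to numbers so they can be analysed
clean = (raw.drop_duplicates()
            .dropna(subset=["reading"])
            .assign(reading=lambda df: pd.to_numeric(df["reading"])))

# Interpretation: summarise the cleaned data into higher-level knowledge
print(clean.groupby("sensor_id")["reading"].mean())
```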

L is for life sciences

Life sciences, the study of living organisms, relies hugely on data science. Because so many disciplines fall under this catch-all term, the correct documenting, storing and collating of data is a prerequisite for any sort of advancement.

In terms of healthcare, this is one area of data science in which the public has shown a lot of interest and engagement. Wearable tech and wellness trackers are now common devices picked up by consumers, all of them with the potential to augment medical data with new, useful information. Then there’s that giant dataset that holds the secret to our make-up – the genome – which has been opened up for analysis using data science.

M is for machine learning

On the face of it, machine learning sounds like the premise of many a horror sci-fi movie. It’s the science of making computers act for themselves, without the need for explicit programming – think driverless cars, drone deliveries and the like.

At the heart of it, machine learning is a form of artificial intelligence that evolves using algorithms that learn from data and make data-driven predictions or decisions.
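
For a concrete taste, here’s a minimal sketch using scikit-learn (assuming the library is installed): the model learns to classify flowers from labelled examples rather than from hand-written rules:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The model learns the mapping from measurements to species from examples,
# rather than being explicitly programmed with classification rules
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"accuracy on unseen data: {model.score(X_test, y_test):.2f}")
```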

N is for NoSQL

For large-scale data distribution, NoSQL databases are the way to go. The rise of big data, and the reliance on such by some of the major technology companies around the world, has made NoSQL a common skill among software developers. The name comes from ‘not only SQL’, which emphasises the versatility of the database structure.
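
As an illustration of that versatility, here’s a sketch using pymongo, the Python driver for the MongoDB document database – it assumes a MongoDB server running on localhost, and the records are invented:

```python
from pymongo import MongoClient

# Assumes a MongoDB server listening on the default local port
db = MongoClient("mongodb://localhost:27017")["shop"]

# Documents in the same collection need not share a fixed schema --
# the second record adds a field without any ALTER TABLE-style migration
db.customers.insert_one({"name": "Ada", "email": "ada@example.com"})
db.customers.insert_one({"name": "Bob", "email": "bob@example.com",
                         "loyalty_points": 120})

print(db.customers.find_one({"name": "Bob"}))
```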

O is for open data

Open data is the idea that some datasets – such as those held by public bodies – should be available online and be free to use and redistribute within reason. In Ireland, for example, Data.gov.ie lists the datasets openly accessible to the public from over 80 Government departments and public bodies. Advocates of open data believe these datasets have the potential to deliver economic, social and democratic benefits.

P is for privacy

Privacy can mean many things. However, when it comes to data science, it primarily concerns the protection of the personal information an individual has digitally stored somewhere.

Bodies dedicated to data protection exist in many countries, and the area has grown on the back of an explosion in social media services. It’s a hot-button issue in a data-driven age, illustrated by high-profile legal cases such as Europe v Facebook.


Q is for quality of data

The quality of data analysis is irrelevant if the data that is being analysed is littered with inaccuracies.

To take a real-life example, the US Postal Service (USPS) estimated in 2013 that there were approximately 6.8bn pieces of mail that could not be delivered as addressed. Forwarding, returning and disposing of all this mail cost USPS US$1.5bn, and that’s before counting the money wasted by the businesses whose post went undelivered, bringing the total cost of incorrect address data to a rough estimate of US$3.4bn.
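
Here’s a small sketch of the kind of cleansing that heads off such costs – normalising an invented mailing list with pandas so that near-duplicate and incomplete addresses can be flagged:

```python
import pandas as pd

# Invented mailing list with the kind of inconsistencies that derail delivery
addresses = pd.DataFrame({
    "name":   ["J. Smith", "J. Smith", "A. Byrne"],
    "street": ["1 Main St.", "1 main street", None],
})

# Normalise the free-text field so near-duplicates line up
addresses["street_norm"] = (addresses["street"]
                            .str.lower()
                            .str.replace(r"\bst\.?\b", "street", regex=True))

# Flag records that are unusable or redundant before they cost money
print("missing street:", addresses["street"].isna().sum())
print("duplicates:", addresses.duplicated(subset=["name", "street_norm"]).sum())
```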

R is for risk management

Effective data analysis can help a company reduce its risks as well as the costs associated with risk management.

For example, IBM and Deloitte recently came together to develop a system that can parse complex government regulations on financial matters and compare them against a company’s plans for meeting the requirements. This could help companies cut the cost of complying with new regulatory requirements, which can often be very expensive.

S is for sensors

While ‘internet of things’ is the buzz phrase at the moment, it would be more accurately described as an ‘internet of sensors’. Gartner forecasts that there will be 25bn connected things in the world by 2020 and this proliferation of sensors will contribute to the creation of smart cities where information on everything from noise and air quality to traffic congestion is constantly collected.

Dublin has already been earmarked by Intel for the smart cities treatment, while Croke Park is set to become one of the world’s first IoT stadiums, equipped with sensors to measure pitch quality and stadium microclimate, queueing times at refreshment stalls, and traffic to and from the stadium.

T is for training

Anyone who wishes to be a data scientist needs a strong understanding of – and even a love for – statistics. Most data scientists have qualifications in maths, statistics, computer science or engineering, with many educated to master’s level.

It is also essential that a data scientist has a good knowledge of programming languages such as Python and R. Probably the most popular of them all is R, an open-source language that provides a wide variety of statistical and graphical techniques for data handling, analysis and presentation.

A knowledge of SQL, one of the standard languages used to access databases, would also be required. Data scientists also need to have good communication skills as they will often be required to present data to other parts of the business.
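
To give a flavour of the SQL side, here’s a minimal sketch using Python’s built-in sqlite3 module, so it needs no database server – the table and figures are invented:

```python
import sqlite3

# An in-memory database stands in for a real company data store
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("Dublin", 120.0), ("Cork", 80.0), ("Dublin", 200.0)])

# Aggregate with SQL before pulling results into the analysis environment
for row in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(row)
```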

U is for unstructured data

Essentially, most data is unstructured data, which is any data – be it text files, photos, audio or video files – that is not contained in a database or some kind of data structure. All content contained in emails, Word documents, instant messages or on social media is unstructured data.

Specific enterprise search products are often used to search through and parse a business’s unstructured data – with access sometimes required in the case of lawsuits, or when a company wants to use its unstructured data to gain insight into who its customers are and what they want.
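
Here’s a toy sketch of pulling a simple signal out of free text – the kind of first step such tools automate, using invented customer emails:

```python
from collections import Counter
import re

# Invented unstructured text: the body of two customer emails
emails = [
    "The delivery was late and the package was damaged",
    "Great service, delivery arrived a day early",
]

# Tokenise the raw text and count term frequency across all messages
words = re.findall(r"[a-z']+", " ".join(emails).lower())
print(Counter(words).most_common(3))
```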

V is for visualisation

With the aforementioned billions of devices expected to be online by 2020, incomprehensible amounts of potentially useful data will be created. Data visualisation plays a crucial role in turning raw data into something that’s legible for everyone through graphs, maps and infographics.

For businesses and organisations, these visualisations can reveal patterns that they never knew existed and, for that reason, data visualisation platform providers are in constant demand.
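
As a minimal sketch, the snippet below plots an invented trend with matplotlib (assuming the library is installed) – a pattern that is obvious on a chart but easy to miss in a table of numbers:

```python
import matplotlib.pyplot as plt

# Invented monthly readings that are hard to scan as raw numbers
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
connected_devices = [1.2, 1.4, 1.9, 2.6, 3.1, 4.0]  # billions, illustrative

plt.plot(months, connected_devices, marker="o")
plt.ylabel("Connected devices (bn)")
plt.title("Growth in connected devices (illustrative)")
plt.savefig("trend.png")  # or plt.show() in an interactive session
```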

W is for warehouses

Major corporations produce a lot of data spread across different departments, but where does this data go once it’s generated?

Rather than data being stored in various locations – a server in sales, another in accounting – a data warehouse stores all of this information in one place for later use, easing the network’s workload. Alternatively, a data warehouse can act as a top-down repository from which datasets are passed down to groups within a company.

X is for XML

Standing for Extensible Markup Language, XML is one of the core ways in which data is turned into understandable information, and it is the foundation of many next-generation web technologies.

XML data is known as self-describing or self-defining, which means that the structure of the data is embedded with the data. By applying XML, once-chaotic datasets become navigable without the need to pre-build a structure to store the data.
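
Here’s a short illustration using Python’s standard library: the parser can walk an invented XML record without any pre-built schema, because the structure travels with the data:

```python
import xml.etree.ElementTree as ET

# An invented, self-describing record: element and attribute names
# document what each value means
record = """
<dataset>
  <reading sensor="air-quality"><value>42</value><unit>ppm</unit></reading>
  <reading sensor="noise"><value>71</value><unit>dB</unit></reading>
</dataset>
"""

root = ET.fromstring(record)
for reading in root.iter("reading"):
    print(reading.get("sensor"), reading.findtext("value"),
          reading.findtext("unit"))
```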

Y is for yottabyte

In the world of incomprehensibly large numbers, the yottabyte (YB) is one of the largest units used to quantify stored data.

To put things into perspective, 2015 was dubbed the year of the zettabyte (ZB) in April, with Cisco predicting that we would pass the 1ZB threshold for data generated online. Yet 1YB is the equivalent of 1,000ZB, and if we were to build a 1YB hard drive today, it would cost US$100trn.
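
The back-of-the-envelope arithmetic behind those figures, as a quick sketch – the US$100-per-terabyte drive price is an illustrative assumption consistent with the estimate above:

```python
ZB = 10 ** 21   # bytes in a zettabyte
YB = 10 ** 24   # bytes in a yottabyte

print(YB // ZB)                 # -> 1000 zettabytes per yottabyte

terabytes_per_yb = YB // 10 ** 12
cost = terabytes_per_yb * 100   # assuming ~US$100 per terabyte of storage
print(f"US${cost:,}")           # -> US$100,000,000,000,000 (US$100trn)
```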

Z is for zero bias and zero variance

In predictive modelling, errors can be caused by either bias or variance. High bias can miss relevant relations between features and target outputs, resulting in an ‘under-fitted’ model, while high variance can cause ‘over-fitting’.

While the aim of the game is to limit these sources of error as much as possible – i.e. zero bias and zero variance – it is generally accepted that there is a trade-off in terms of a model’s ability to simultaneously minimise both. Understanding this bias-variance trade-off can help to create more accurate models.
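
A small numerical illustration with NumPy: polynomials of increasing degree are fitted to a noisy invented signal and their error measured on unseen data – too low a degree under-fits, too high a degree over-fits:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sine(n):
    # Invented ground truth: a sine wave plus measurement noise
    x = np.sort(rng.uniform(0, 1, n))
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

x_train, y_train = noisy_sine(15)
x_test, y_test = noisy_sine(200)

# Degree 1 is too rigid (high bias, under-fitting); degree 12 chases the
# noise in the 15 training points (high variance, over-fitting)
for degree in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    test_error = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    print(f"degree {degree:2d}: error on unseen data {test_error:.3f}")
```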

Siliconrepublic.com’s Data Science Week brings you special coverage of this rapidly growing field from 28 September to 2 October 2015. Don’t miss an entry worth your analysis by subscribing to our news alerts or following @siliconrepublic and the hashtag #DataScienceWeek on Twitter.

Alphabet image by beeboys via Shutterstock