We anticipate that 2020 will be the year of data science. The data science industry has already skyrocketed, and predictions point to continued steep growth.
The role of the data scientist has already become not only one of the most important in an organization but also one of the most lucrative.
Below you’ll find a dictionary of commonly used terms in data science. The list will continue to expand, so don’t forget to keep checking in for updates. First, we’ll start with the basics.
Data Science:
The practice of extracting meaningful insights from data.
Data science projects are often accomplished using machine learning, programming languages like Python or R, data mining, data visualization, and advanced statistical techniques. The problems data science addresses are usually, but not always, prediction problems.
Data Scientist:
One who analyzes, processes, and models data, and then interprets the results for actionable insights.
The role of a data scientist can mean many different things to different organizations and sometimes even within the same organization. For example, companies like Google and Facebook employ data scientists on product teams, core business strategy teams, and research teams.
Data scientists who are focused on digital products are typically very similar to highly skilled analysts. Conversely, research data scientists focus mainly on developing new technologies, developing software and tools, and publishing research.
Data scientists also come from very diverse backgrounds. Fields like software engineering, mathematics, physics, and computer science each produce data scientists with different strengths and weaknesses.
A data scientist from a software engineering background might be extraordinarily talented in deploying data science projects, whereas a data scientist coming from a math background may be talented in algorithmic development.
A/B Testing:
A randomized, controlled experiment that compares two or more variations of a digital experience, machine learning model, or other asset to determine which version is more effective.
Building a robust testing process and a culture that values A/B testing is essential to ensuring that every change an organization makes has a positive effect on the identified KPIs. A/B testing is also an essential component of becoming a data-driven company, as decisions based on speculation are no longer an option for businesses today.
The primary goal of A/B testing tools is to bucket users properly and distribute variations correctly, so that each variation receives a representative sample of your traffic while the impact of bad variations is minimized.
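The bucketing step described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular A/B testing tool works: the function name `assign_variation`, the experiment name, and the user IDs are all hypothetical, and hashing is only one common way to get a stable, roughly even split.

```python
import hashlib
from typing import Sequence

def assign_variation(user_id: str, experiment: str, variations: Sequence[str]) -> str:
    """Deterministically map a user to one variation of an experiment."""
    # Hash the experiment name together with the user id so the same user
    # can land in different buckets in different experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(variations)
    return variations[bucket]

# Hypothetical experiment with two variations and 10,000 made-up users.
variations = ["control", "treatment"]
counts = {name: 0 for name in variations}
for i in range(10_000):
    counts[assign_variation(f"user-{i}", "checkout-button", variations)] += 1

print(counts)  # the two counts should be close to each other
```

Because the assignment depends only on the user ID and the experiment name, a returning user always sees the same variation without any stored state.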
Algorithm:
A set of explicit instructions.
In mathematics and computer science, an algorithm is a finite sequence of well-defined, computer-implementable instructions, typically to solve a class of problems or to perform a computation. Algorithms are unambiguous specifications for performing calculations, data processing, automated reasoning, and other tasks.
A simple example of an algorithm would be a recipe for making dinner. Algorithms are used to implement machine learning models, as well as ETL processes in data engineering.
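For a concrete computational example, here is Euclid's algorithm for the greatest common divisor, written in Python. It is chosen only as a small illustration of a finite sequence of unambiguous steps; it is not tied to any specific data science workflow.

```python
def gcd(a: int, b: int) -> int:
    """Euclid's algorithm: greatest common divisor of two non-negative integers."""
    # Repeatedly replace the pair (a, b) with (b, a mod b) until b reaches zero.
    while b != 0:
        a, b = b, a % b
    return a

print(gcd(48, 18))  # 6
```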
Application Programming Interface (API):
Code that enables structured communication from one software product to another.
APIs accomplish this task by formalizing a contract between a client and a server, and web APIs typically follow the RESTful architectural style. APIs are useful because they allow a client to always get an expected, structured response whenever the server receives a valid request.
An API may be for a web-based system, operating system, database system, computer hardware, or software library. An API specification can take many forms, but often includes specifications for routines, data structures, object classes, variables, or remote calls.
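The idea of a contract between client and server can be sketched without any network code. The handler below is purely hypothetical (the `handle_request` name, the `user_id` field, and the response shape are assumptions made for illustration): a valid request always yields a response with the same structure, and an invalid one yields a structured error.

```python
import json

def handle_request(raw_request: str) -> str:
    """Hypothetical server-side handler: JSON request in, JSON response out."""
    try:
        request = json.loads(raw_request)
        user_id = int(request["user_id"])
    except (ValueError, KeyError, TypeError):
        # Anything that breaks the contract gets a structured error, not a crash.
        return json.dumps({"status": 400, "error": "expected JSON with an integer 'user_id'"})
    # A valid request always receives a response of this same shape.
    return json.dumps({"status": 200, "data": {"user_id": user_id, "name": f"user-{user_id}"}})

print(handle_request('{"user_id": 7}'))
print(handle_request("not json"))
```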
Artificial Intelligence (AI):
Any agent that can perceive its environment and then take actions to maximize its chance of achieving a predefined goal. Colloquially, AI is the study of how to mimic human-like intelligence in machines.
Big Data:
A collection of traditional and digital data that makes up large data sets and can be analyzed to reveal patterns, trends, and associations.
Big data is used to drive analytics results in many industries. Used well, it can support better business decisions, increased revenue, and decreased operating costs. Big data and its associated practices can transform existing businesses or create new ones.
Here are the four characteristics that help define it.
- Volume: This is the scale of how much data there is and can help define whether or not your data is actually considered “big.”
- Variety: Big data can come in many forms, including structured data, unstructured data, video, text, etc.
- Velocity: This refers to the speed at which your data is generated and used. Much big data arrives in real time or close to it.
- Veracity: This is the quality of the data that is captured. You want data that is accurate so that the results of your analysis can be trusted.
Clustering:
The task of grouping a set of objects in such a way that objects in the same group (a cluster) are more similar to each other than to those in other clusters.
Clustering is frequently used in exploratory data analysis, which is an important step in the data science workflow. Additionally, clustering is used on its own to predict how data fits into clusters as part of unsupervised learning.
Clustering is used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
There are many different algorithms used in clustering, and each has its own pros and cons. Some examples of the criteria used for clustering are the distance between data points, the density of data points, or certain distributions or boundary metrics of the data.
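As one example of a distance-based criterion, the sketch below implements a bare-bones version of k-means in plain Python. k-means is only one of the many clustering algorithms mentioned above, and the six sample points, the iteration count, and the seed are made up for illustration.

```python
import math
import random

def k_means(points, k, iterations=20, seed=0):
    """Minimal k-means: returns (centroids, labels) for a list of (x, y) points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

# Two visually obvious groups of made-up points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = k_means(points, k=2)
print(labels)
```

Note that no labels are supplied: the grouping comes only from the distances between points, which is what makes this an unsupervised technique.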
Cross-Validation:
Any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set.
Cross-validation, sometimes called rotation estimation or out-of-sample testing, is largely used in settings where the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a fully labeled dataset is split into training and testing sets. The model is trained on the training set, and its predictions for the test set are then compared to the known labels in the test set. In this way, the test set serves as an independent validation set that verifies how the model’s predictions generalize to unseen data.
The goal of cross-validation is to test the model’s ability to predict new data that was not used in estimating it, in order to flag problems like overfitting or selection bias and to give insight into how the model will generalize to an independent dataset.
In summary, cross-validation combines (averages) measures of fitness in prediction to derive a more accurate estimate of model prediction performance.
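The averaging idea can be sketched as a k-fold procedure in plain Python. This is a simplified illustration under stated assumptions: the "model" is deliberately trivial (it predicts the mean label of the training fold), the labels are made-up numbers, and the folds are contiguous rather than shuffled, which real projects usually avoid.

```python
from statistics import mean

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, roughly equal folds."""
    indices = list(range(n_samples))
    start = 0
    for fold in range(k):
        # The first (n_samples % k) folds take one extra sample.
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(labels, k=5):
    """Return the average held-out error and the per-fold errors."""
    fold_errors = []
    for train, test in k_fold_indices(len(labels), k):
        # "Training": the toy model memorizes the mean label of the training fold.
        prediction = mean(labels[i] for i in train)
        # Evaluation uses only the held-out fold, never the training rows.
        fold_errors.append(mean(abs(labels[i] - prediction) for i in test))
    return mean(fold_errors), fold_errors

labels = [3.1, 2.9, 3.4, 3.0, 2.8, 3.3, 3.2, 2.7, 3.0, 3.1]  # made-up values
average_error, fold_errors = cross_validate(labels, k=5)
print(f"cross-validated mean absolute error: {average_error:.3f}")
```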
Data Governance:
A set of processes and frameworks for ensuring that data within an organization meets standards of quality, processes are properly managed, and the correct technologies are employed. Data Governance is the control of data assets.
Like many practices in data fields, data governance requires the correct people, processes, and technology for the proper handling of data within an organization. Data Governance is key to ensuring data assets are transformed into meaningful insights.
Data Set:
A collection of data.
In the case of tabular data, a data set corresponds to one or more database tables where every column of a table represents a particular variable and each row corresponds to a given record of the data set in question.
The data set lists values for each of the variables, such as the height and weight of an object, for each member of the data set. Each value is known as a datum. Data sets can also consist of a collection of documents or files.
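A tiny tabular data set can be written directly in Python to make the vocabulary concrete. The names and measurements below are invented for illustration; each dictionary is a record (row), each key is a variable (column), and each individual value is a datum.

```python
# A made-up data set with three records and three variables.
data_set = [
    {"name": "Ann", "height_cm": 165.0, "weight_kg": 60.0},
    {"name": "Bob", "height_cm": 180.0, "weight_kg": 82.5},
    {"name": "Cho", "height_cm": 172.0, "weight_kg": 70.0},
]

variables = list(data_set[0].keys())                      # the columns
heights = [record["height_cm"] for record in data_set]    # one variable across all records
datum = data_set[1]["weight_kg"]                          # a single value

print(variables, heights, datum)
```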
Data Visualization:
The practice of representing data visually. As data becomes more complex, insights from data can become harder to discover.
Data visualization is an incredibly important part of machine learning, data science, and analytics. Important analyses or insights are often overlooked because they are not presented in a way that is easy to understand.
Using meaningful visualizations can drastically improve one’s ability to derive insights from data. This is an essential skill for any data professional and should be practiced as much as possible.
Decision Tree:
A flowchart or tree-like structure that requires a decision at each node, or “branch point,” based on a test of a given criterion. Given the outcome of the test, a branch is chosen to follow, and the path ultimately terminates.
Decision trees are used in machine learning as predictive models. To accomplish this, a conditional test on the data is placed at each node. Which criteria are tested, and in what order, can be determined by several methods, each of which evaluates how well a candidate test splits the data that flows through the tree.
Decision trees are an example of a supervised machine learning algorithm and need to be trained on data that has been labeled.
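To show how a test at a node can be chosen from labeled data, the sketch below trains the smallest possible decision tree, a single-node "stump", using Gini impurity as the splitting criterion. Gini impurity is one of several methods in common use, and the study-hours data and the function names are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(values, labels):
    """Find the threshold on one feature that minimizes weighted Gini impurity."""
    best = None
    for threshold in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        if not left or not right:
            continue  # a split must send data down both branches
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, threshold, majority(left), majority(right))
    if best is None:
        raise ValueError("need at least two distinct feature values")
    _, threshold, left_label, right_label = best
    return threshold, left_label, right_label

def predict(stump, value):
    """Follow the single branch point to a predicted label."""
    threshold, left_label, right_label = stump
    return left_label if value <= threshold else right_label

# Made-up labeled training data: hours studied and whether the exam was passed.
hours = [1, 2, 3, 6, 7, 8]
passed = ["no", "no", "no", "yes", "yes", "yes"]
stump = fit_stump(hours, passed)
print(stump, predict(stump, 2), predict(stump, 7))
```

A full decision tree repeats the same search recursively on the data that reaches each branch.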
Deep Learning:
A specific subset of techniques, models, and algorithms within the broader set of artificial neural networks, which itself is a subset of machine learning. Deep learning algorithms can be supervised, semi-supervised, or unsupervised.
Different tasks call for different deep neural network architectures, and different flavors of deep learning networks can be used for specific or “narrow” tasks. For example, recurrent neural networks and convolutional neural networks are being applied to computer vision, speech recognition, and natural language processing, and on some narrow tasks they can match or outperform humans.
Hypothesis:
A testable prediction or idea.
In order for an idea or prediction to be a real hypothesis, it must be testable. In the fields of statistics, analytics, machine learning, and data science, hypotheses are tested by observing a process that is modeled via a set of random variables or modeled by a machine learning model.
The testing of these hypotheses leads to making statistically rigorous inferences. Hypothesis testing occurs in A/B testing, machine learning model prediction validation, statistical tests, and many other settings in data-driven organizations.
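As an illustration of hypothesis testing in an A/B test, the sketch below runs a two-proportion z-test using only the Python standard library. The choice of test, the function name, and the conversion counts are assumptions made for this example; the hypothesis being tested is that both variations convert at the same rate.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conversions_a, n_a, conversions_b, n_b):
    """Two-sided z-test for the hypothesis that two conversion rates are equal."""
    rate_a, rate_b = conversions_a / n_a, conversions_b / n_b
    # Under the null hypothesis both groups share one pooled conversion rate.
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up results: 200 of 5,000 users converted on A, 260 of 5,000 on B.
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the observed difference would be unlikely if the two variations truly performed the same; the threshold for "small" is a choice made before the experiment runs.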
Machine Learning:
The science of getting computer systems to accomplish a task without being explicitly programmed to do so.
AI and machine learning seem very similar and can be hard to distinguish. To clarify, machine learning is a subset or more specific field of study that is contained within AI.
Machine Learning Model:
A specific instance of an algorithm that has been “trained” on, or has “learned” from, data.
It’s important that we draw a distinction between machine learning models and relational data models because they are drastically different. A machine learning model uses statistical and mathematical methods to accomplish learning. Given new data to act upon, such a model outputs predictions.
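The difference between an algorithm and a trained model can be seen in a small Python sketch. Ordinary least squares for a straight line is used here only because it is short; the function name and the four data points are made up. The training function is the algorithm, and the object it returns is the model, which can then be given new data.

```python
from statistics import mean

def train_linear_model(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares and return the model."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar

    def model(x):
        # The learned parameters are captured in the returned function.
        return slope * x + intercept

    return model

# Made-up training data that follows y = 2x + 1 exactly.
model = train_linear_model([1, 2, 3, 4], [3, 5, 7, 9])
print(model(10))  # prediction for an input the model has not seen
```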
Python:
An interpreted, interactive, object-oriented programming language.
Python incorporates modules, exceptions, dynamic typing, very high-level dynamic data types, and classes. It has become the leading language for data science and machine learning due to its readability, and it is used not only to develop models but also to build the other applications needed to deploy machine learning into production systems.
The development community has embraced Python rapidly, and an immense number of libraries and packages exist to accomplish a wide range of tasks in data engineering, data science, analytics, machine learning, deep learning, web development, DevOps, and more.
R:
A language and environment for statistical computing and graphics.
R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical techniques, and is highly extensible.
Relational Data Model:
A way to organize data into tables of rows and columns so that it can be structured and accessed by logic.
This structure and its associated query language have been essential to the development of data-intensive fields, although there are alternatives to relational data models, such as hierarchical and network data models.
Relational data models are the most common. When a person interacts with a relational database, they typically use Structured Query Language (SQL) to transact with the data inside it.
There are many different relational database technologies. These different technologies can be thought of as different “flavors” of the same thing. The specifics of how to access data may vary slightly amongst them, but their form and function are very similar.
Structured Query Language (SQL):
A domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS).
SQL is particularly useful in handling structured data (i.e., data incorporating relations among entities and variables).
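A short example shows SQL working with related tables. It uses Python's built-in `sqlite3` module with an in-memory database, which is just one of the many relational "flavors"; the table names, columns, and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary database that lives only in memory

# Two related tables: each order points at a customer through customer_id.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    );
    INSERT INTO customers (id, name) VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders (customer_id, amount) VALUES (1, 20.0), (1, 35.5), (2, 12.0);
""")

# Join the tables on their relation and summarize the orders per customer.
rows = conn.execute("""
    SELECT c.name, COUNT(o.id), SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(rows)
conn.close()
```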
Unstructured Data:
Information that either does not have a predefined data model or is not organized in a predefined manner.
Unstructured data can come from many sources and take many forms. Examples are photos, videos, text, audio, etc. A vast majority of existing data and data generated today is unstructured. This results in irregularities and ambiguities that make it difficult to understand using traditional programs compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.