Mastodon C

Big Data Done Better


a leader’s A-Z guide to common data terms

This glossary of terms is designed to help leaders stay current with the ways that data is talked about and used in both public and private sectors.

This list provides definitions of key terms. If you’re interested in additional information, like use cases and examples, or want a copy to print out or share, you can download the full version here or by clicking on the link below.

If you have suggestions for terms or definitions we should add, feel free to email us.

The glossary is published with a Creative Commons CC BY-NC 4.0 licence. This means that it can be reused for non-commercial purposes, as long as Mastodon C are credited. License details can be found here.

Download a free A-Z of data terms


    Algorithm

    A process or set of rules to carry out a particular task, for example data analysis algorithms. Often expressed in computer code.
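For instance, a tiny algorithm for computing an average might be written like this in Python (an illustrative sketch, not taken from the glossary's sources):

```python
# A small illustrative algorithm: compute the arithmetic mean of a list.
def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    total = 0
    for v in values:
        total += v          # step through every value, accumulating a sum
    return total / len(values)

print(mean([3, 5, 10]))     # 6.0
```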


    Analytics

    The discovery, interpretation and communication of meaningful patterns and insights in data.


    Anonymisation

    The process of removing detail from or otherwise transforming data, to avoid any identification of individuals or organisations.
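As a rough illustration, one common step is to drop direct identifiers and coarsen the remaining detail. The records and field names below are invented for the example, and real anonymisation also requires an assessment of re-identification risk:

```python
# Hypothetical sketch: drop a direct identifier ('name') and coarsen a
# quasi-identifier (exact age -> ten-year age band). Invented data.
records = [
    {"name": "Ada", "age": 36, "city": "Leeds"},
    {"name": "Grace", "age": 41, "city": "Leeds"},
]

def anonymise(record):
    decade = (record["age"] // 10) * 10
    banded = f"{decade}-{decade + 9}"
    return {"age_band": banded, "city": record["city"]}  # 'name' is dropped

print([anonymise(r) for r in records])
```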

    Artificial Intelligence (AI)

    “Intelligent behaviour” exhibited by machines, for example learning and problem solving.

    Big Data

    Any form of data that due to its size, velocity (rate of change) or complexity pushes the limits of current storage and analytical capability.


    Cleaning

    The task of preparing data so that it can be used for a specific purpose, whether that's analysis or sharing with others.


    Clojure

    A general purpose programming language used to work on data projects.

    Cloud storage

    Storing data on machines accessed remotely over an internet connection, as opposed to on a machine or server housed in your own building.


    CSV

    A CSV (Comma Separated Values) file allows data to be saved in a table-structured format. CSVs look similar to a normal spreadsheet, but are reliably usable in more contexts.
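A CSV file is just plain text, so it can be read by almost any tool; for example, Python's built-in csv module can parse a small table like this (the place names and figures are made up for illustration):

```python
import csv
import io

# A minimal CSV file as text: a header row plus two data rows (invented figures).
text = "name,region,population\nLeeds,Yorkshire,793000\nBristol,South West,472000\n"

# csv.DictReader parses each row into a dict keyed by the header row.
rows = list(csv.DictReader(io.StringIO(text)))
print(rows[0]["name"], rows[0]["population"])   # Leeds 793000
```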


    Data

    “A set of values of qualitative or quantitative variables.” (Wikipedia). Information in raw form (such as letters, numbers, or symbols) that refers to, or represents, conditions, ideas, or objects. In the context of computing, data can be thought of as information that is transmitted or stored.


    Database

    A digital collection of data and the structure around which the data is organised.

    Data Science

    Data science is an interdisciplinary exercise that aims to find useful answers and insights in data by combining mathematical, scientifically robust approaches with computer programming techniques.

    Data Mining

    Data mining is the computing process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.

    Deep Learning

    Deep Learning involves feeding a computer system a lot of data, which it can use to make decisions (Forbes). A subset of machine learning in Artificial Intelligence (AI), Deep Learning develops networks which are capable of learning unsupervised from data that is unstructured or unlabeled. Also known as Deep Neural Learning or Deep Neural Network (Investopedia).

    ETL (Extract, Transform and Load)

    A process used in databases and data warehousing: extracting data from outside sources, transforming it to fit operational needs, and loading it into a database.
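A toy sketch of the three steps in Python; the source data and the "database" here are stand-ins for illustration, not a real pipeline:

```python
# Toy ETL sketch (all names and values are illustrative).
source = ["12.5", "7.0", "bad-row", "3.5"]    # extract: raw strings from a feed

def transform(raw):
    """Coerce raw strings to floats, discarding rows that don't parse."""
    cleaned = []
    for item in raw:
        try:
            cleaned.append(float(item))       # fit the data to our storage type
        except ValueError:
            pass                              # drop unparseable rows
    return cleaned

database = []                                 # load target: stand-in for a real DB
database.extend(transform(source))
print(database)                               # [12.5, 7.0, 3.5]
```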


    Format

    How data is structured and stored.


    Geographic Information System (GIS)

    A geographic information system (GIS) is a system designed to capture, store, manipulate, analyse, manage and present spatial or geographic data.


    Governance

    The processes, policies and tools that ensure data is formally managed, so that an organisation meets policy, legal and statutory requirements, and so data can serve the mission and goals of an organisation.


    Hadoop

    An open-source software framework used for storage and processing of (typically) large datasets.

    IoT (Internet of Things)

    The connection of ordinary, everyday devices and physical objects to the internet, so that they can interact with other systems and their data can be used and analysed.

    Linked Data

    A method of publishing structured data so that it can be interlinked and become more useful.


    Licensing

    A data licence tells someone what they can and can't legally do with a piece of data or software.


    Metadata

    A set of data that describes and gives information about other data, for example a dataset's source, structure or date of collection.

    Machine Learning

    A subfield of artificial intelligence that gives computers the ability to learn without being explicitly programmed.
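As a minimal illustration of "learning" rather than explicit programming, the sketch below recovers the slope and intercept of a straight line from example points using the closed-form least-squares formula; this is a teaching toy, not a production method:

```python
# Fit y = a*x + b to example points by least squares: the program derives
# the rule from the data instead of having it written in by hand.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                 # these points were generated by y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x
print(a, b)                       # 2.0 1.0
```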


    Model

    An abstract construct that organises elements of data and standardises how they relate to one another and to properties of real-world entities.

    Natural Language Processing

    Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with using computers to make sense of human language.

    Open Data

    Data that anyone can access, use and share, for any purpose, without cost.

    Open Source

    Software for which the original source code or information is made freely available and may be redistributed and modified in various ways depending on its licence.

    Predictive Analytics

    Using data to predict what will happen next - for example what someone is likely to buy or visit, or how something will behave.

    Data Platform

    A place where data is published for use by others.


    Python

    Python is a widely used high-level programming language for general-purpose programming.
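A small taste of the language's concise style for everyday data work (the values are illustrative only):

```python
# A dict comprehension: build a lookup table in one readable line.
cities = ["Leeds", "York", "Hull"]
lengths = {city: len(city) for city in cities}
print(lengths)   # {'Leeds': 5, 'York': 4, 'Hull': 4}
```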


    R

    R is an open source programming language and software environment for statistical computing.


    RDF

    RDF (Resource Description Framework) is a data format for describing resources on the web, designed to be read and understood by computers.

    Sentiment Analysis

    Using data and algorithms against unstructured text, to provide insights into what people are thinking and feeling.
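A deliberately naive sketch using a hand-written word list; real sentiment analysis uses trained models over much richer features, so treat this purely as an illustration of the idea:

```python
# Toy lexicon-based sentiment scorer (illustrative word lists, not a real model).
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}

def score(text):
    """Count positive words minus negative words in a piece of text."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(score("I love this great product"))   # 2
```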

    Software-as-a-Service (SAAS)

    A software tool that you access from your browser rather than one that is downloaded and installed onto your device.


    Spark

    Apache Spark is a “fast and general engine for large-scale data processing” (Apache). It was built for speed, ease of use, and analytics.

    Structured data

    Data that is identifiable and easy to use, as it is pre-organised in a structure such as rows and columns.

    Transactional data

    A very common kind of data which describes events such as payments, events in a system, or appointments, often held in a large database or data warehouse.

    Unstructured data

    Unstructured data is data that is generally text-heavy, but may also contain dates, numbers and facts.


    Visualisation

    Representing data, or relationships between data, in a visual manner so as to communicate a finding, relationship or story.


    Velocity

    The speed at which the data is created, stored, analysed and visualised.


    XML

    XML (Extensible Markup Language) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
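For example, Python's standard library can parse a small XML document (the document below is invented for illustration):

```python
import xml.etree.ElementTree as ET

# A minimal XML document: nested, named elements with an attribute.
doc = "<dataset><record id='1'><name>Leeds</name></record></dataset>"
root = ET.fromstring(doc)
record = root.find("record")
print(record.get("id"), record.find("name").text)   # 1 Leeds
```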
