Introduction to Datasets

Overview

Datasets are central to managing and processing data within Vue.ai. They connect data sources (such as APIs, files, databases, or data warehouses) to the destinations where this data is stored, analyzed, or used. This document introduces the key concepts, configurations, and modes of data transfer involved in data integration. With these in hand, users can set up and manage datasets to meet their specific business needs.

Source

A source refers to any system from which data is ingested, such as an API, file, database, or data warehouse. Setting up a source involves configuring the necessary variables that enable the connector to access and retrieve data. The specific configuration fields vary by connector type but typically include authentication credentials (e.g., username and password, API key) and parameters that define the data to extract, such as a start date for syncing records or a search query for matching records.
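As a rough illustration of the two kinds of fields described above, the sketch below groups a source's settings into credentials and extraction parameters. The class name `SourceConfig`, its field names, and the example values are assumptions made for illustration; they are not the Vue.ai configuration schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SourceConfig:
    """Hypothetical container for a source's settings (not the Vue.ai schema)."""
    connector_type: str                  # e.g. "api", "file", "database", "warehouse"
    credentials: Dict[str, str]          # e.g. an API key, or a username and password
    parameters: Dict[str, Any] = field(default_factory=dict)  # what data to extract

    def missing_fields(self, required: List[str]) -> List[str]:
        """Return the required keys found in neither credentials nor parameters."""
        present = set(self.credentials) | set(self.parameters)
        return [name for name in required if name not in present]


# Example: an API source that syncs records from a start date onward.
api_source = SourceConfig(
    connector_type="api",
    credentials={"api_key": "<your-api-key>"},
    parameters={"start_date": "2024-01-01", "search_query": "status:active"},
)
```

Keeping credentials apart from extraction parameters makes it easier to check that a connector's required fields are filled in before any connection is attempted.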

Source Definition

The definition of a source depends on the type of system (API, database, file, or data warehouse) and the specific parameters required for a secure connection or authentication. Configuration fields are determined by the connector type and the security protocols needed to access the data.

Test Connection

After providing the source configuration details, the Test Connection function is used to validate whether the authentication information is correct. This function ensures that the system can successfully connect to the source with the supplied credentials.
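Conceptually, a connection test attempts to connect with the supplied credentials and reports success or failure without reading any data. The following is a minimal sketch of that idea, not the Vue.ai implementation: `check_connection` and `fake_connect` are hypothetical names, and the stand-in connector simply compares an API key against a fixed value.

```python
from typing import Callable, Dict, Tuple


def check_connection(
    config: Dict[str, str],
    connect: Callable[[Dict[str, str]], None],
) -> Tuple[bool, str]:
    """Try to connect with the given configuration and report the outcome."""
    try:
        connect(config)  # a real connector would open and close a session here
    except Exception as exc:
        return False, f"Connection failed: {exc}"
    return True, "Connection succeeded"


def fake_connect(config: Dict[str, str]) -> None:
    """Stand-in connector (assumption): accepts exactly one known API key."""
    if config.get("api_key") != "valid-key":
        raise PermissionError("invalid API key")


ok_result = check_connection({"api_key": "valid-key"}, fake_connect)
bad_result = check_connection({"api_key": "wrong-key"}, fake_connect)
```

Returning a status and a message, rather than raising, lets a caller show the reason for a failed test next to the configuration form.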

Stream

A stream represents a collection of related records. Depending on the destination, it may be referred to as a table, file, or blob. The term "stream" generalizes the flow of data across different destinations.

Examples of Streams:

  • A table in a relational database
  • A resource or endpoint in a REST API
  • Records from a directory containing multiple files in a filesystem

Record

A record is an individual entry or unit of data, often referred to as a "row." Each record is unique and encapsulates information related to a specific entity, such as a customer or transaction.

Examples of Records:

  • A row in a relational database table
  • A line within a data file
  • A data unit retrieved from an API response

Batch

A batch is a group of records that are processed and transferred together as a single unit. Batching makes it efficient to transfer large volumes of data, because records move in groups rather than one at a time.

Examples of Batches:

  • A collection of rows in a relational database updated simultaneously
  • A set of files transferred together during a data migration
  • Multiple data entries sent in a single API request
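The relationship between streams, records, and batches can be sketched in a few lines. In this illustrative example, a stream is modeled as a list of records (plain dictionaries) and split into fixed-size batches. The helper name `batched` and the batch size of 2 are assumptions chosen for the example.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, object]  # one record, e.g. a row keyed by column name


def batched(records: Iterable[Record], batch_size: int) -> Iterator[List[Record]]:
    """Yield consecutive groups of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# A stream of five records, transferred in batches of two.
stream = [{"id": i} for i in range(1, 6)]
batches = list(batched(stream, batch_size=2))
```

The last batch may be smaller than the others; here the five records are split into groups of two, two, and one.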

Sync Modes

Sync modes define how data is retrieved from a source and transferred to a destination. They consist of two components: Source Sync Mode and Dataset Sync Mode.

Source Sync Mode

This component describes how data is read from the source.

Mode | Description
Incremental | Reads only the records added since the last sync. The first sync acts as a Full Refresh.
Full Refresh | Reads all records from the source, regardless of previous syncs.
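The difference between the two source sync modes can be sketched with a simple cursor. This example assumes each record carries a numeric `created_at` value and that the highest value seen so far is stored between syncs. The source does not say how new records are detected, so the cursor field and the function name `read_source` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, int]


def read_source(
    records: List[Record],
    mode: str,
    cursor: Optional[int] = None,
) -> Tuple[List[Record], Optional[int]]:
    """Return the records to sync and the cursor to store for the next sync."""
    if mode not in ("incremental", "full_refresh"):
        raise ValueError(f"unknown source sync mode: {mode}")
    if mode == "full_refresh" or cursor is None:
        # Full Refresh, or the first Incremental sync: read everything.
        selected = list(records)
    else:
        # Incremental: read only records added after the stored cursor.
        selected = [r for r in records if r["created_at"] > cursor]
    new_cursor = max((r["created_at"] for r in selected), default=cursor)
    return selected, new_cursor


source = [{"id": 1, "created_at": 100}, {"id": 2, "created_at": 200}]
first_sync, cursor = read_source(source, "incremental")           # no cursor yet
source.append({"id": 3, "created_at": 300})
second_sync, cursor = read_source(source, "incremental", cursor)  # only the new record
full_sync, _ = read_source(source, "full_refresh")                # all records
```

The first incremental sync has no cursor, so it reads every record, which matches the note in the table that the first sync acts as a Full Refresh.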

Dataset Sync Mode

This component specifies how data is written to the destination.

Mode | Description
Overwrite | Replaces existing data in the destination with new data.
Append | Adds new data to existing tables without altering any pre-existing records.
Append Dedup | Appends data to existing tables while keeping a history of changes. The final table is de-duplicated using a primary key.
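The three dataset sync modes can be compared on a small in-memory table. The sketch below is illustrative only: `write_dataset` is a hypothetical name, the primary key is assumed to be `id`, and de-duplication is assumed to keep the most recently written record for each key, which the source does not specify.

```python
from typing import Dict, List

Record = Dict[str, object]


def write_dataset(
    existing: List[Record],
    incoming: List[Record],
    mode: str,
    primary_key: str = "id",
) -> List[Record]:
    """Return the final destination table after applying one sync."""
    if mode == "overwrite":
        return list(incoming)            # existing data is replaced
    if mode == "append":
        return existing + incoming       # existing rows are left untouched
    if mode == "append_dedup":
        latest: Dict[object, Record] = {}
        for record in existing + incoming:
            latest[record[primary_key]] = record  # assumption: later record wins
        return list(latest.values())
    raise ValueError(f"unknown dataset sync mode: {mode}")


existing = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
incoming = [{"id": 2, "name": "B"}, {"id": 3, "name": "c"}]

overwritten = write_dataset(existing, incoming, "overwrite")
appended = write_dataset(existing, incoming, "append")
deduped = write_dataset(existing, incoming, "append_dedup")
```

In this sketch the Append result doubles as the history of changes (both versions of record 2 are present), while the Append Dedup result keeps one row per primary key.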