Introduction to Datasets
Overview
Datasets are central to managing and processing data within Vue.ai. They connect data sources (such as APIs, files, databases, or data warehouses) to the destinations where that data is stored, analyzed, or used. This document introduces the key concepts, configurations, and sync modes involved in moving data between the two. With these concepts in hand, users can set up and manage datasets to meet their specific business needs.
Source
A source refers to any system from which data is ingested, such as an API, file, database, or data warehouse. Setting up a source involves configuring the necessary variables that enable the connector to access and retrieve data. The specific configuration fields vary by connector type but typically include authentication credentials (e.g., username and password, API key) and parameters that define the data to extract, such as a start date for syncing records or a search query for matching records.
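To make the shape of a source configuration concrete, here is a minimal Python sketch. It is not the Vue.ai configuration schema: the `SourceConfig` class, its field names, and the sample values are all hypothetical, chosen only to mirror the credentials and extraction parameters described above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SourceConfig:
    """Hypothetical source configuration; every field name is illustrative."""
    connector_type: str                      # e.g. "api", "file", "database", "warehouse"
    credentials: Dict[str, str]              # e.g. an API key, or a username and password
    parameters: Dict[str, Any] = field(default_factory=dict)  # what data to extract

    def missing_fields(self, required: List[str]) -> List[str]:
        """Return the required credential keys that are absent or empty."""
        return [key for key in required if not self.credentials.get(key)]


# An API source that authenticates with a key and syncs records from a start date.
api_source = SourceConfig(
    connector_type="api",
    credentials={"api_key": "YOUR_API_KEY"},
    parameters={"start_date": "2024-01-01", "search_query": "status:active"},
)

print(api_source.missing_fields(["api_key"]))               # nothing missing
print(api_source.missing_fields(["username", "password"]))  # both missing
```

Keeping credentials separate from extraction parameters reflects the two kinds of fields named above; a real connector would define its own required fields.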
Source Definition
The definition of a source depends on the type of system (API, database, file, or data warehouse) and the specific parameters required for a secure connection or authentication. Configuration fields are determined by the connector type and the security protocols needed to access the data.
Test Connection
After you provide the source configuration details, use the Test Connection function to validate that the authentication information is correct. It confirms that the system can connect to the source with the supplied credentials.
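The idea behind a connection test can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's implementation: `check_connection` and `fake_connect` are hypothetical names, and the stand-in connector simply compares an API key to a fixed value.

```python
from typing import Callable, Dict, Tuple


def check_connection(
    credentials: Dict[str, str],
    connect: Callable[[Dict[str, str]], None],
) -> Tuple[bool, str]:
    """Try to connect with the given credentials and report the outcome."""
    try:
        connect(credentials)
    except Exception as exc:  # any failure means the test did not pass
        return False, f"Connection failed: {exc}"
    return True, "Connection succeeded"


def fake_connect(credentials: Dict[str, str]) -> None:
    """Stand-in for a real connector; accepts only one hard-coded key."""
    if credentials.get("api_key") != "valid-key":
        raise PermissionError("invalid API key")


print(check_connection({"api_key": "valid-key"}, fake_connect))
print(check_connection({"api_key": "wrong"}, fake_connect))
```

Returning a status and a message, rather than raising, lets a setup form show the failure reason next to the credentials that caused it.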
Stream
A stream represents a collection of related records. Depending on the source or destination, it may be called a table, file, or blob. "Stream" is the general term used for all of these.
Examples of Streams:
- A table in a relational database
- A resource or endpoint in a REST API
- Records from a directory containing multiple files in a filesystem
Record
A record is an individual entry or unit of data, often referred to as a "row." Each record is unique and encapsulates information related to a specific entity, such as a customer or transaction.
Examples of Records:
- A row in a relational database table
- A line within a data file
- A data unit retrieved from an API response
Batch
A batch is a group of records that are processed and transferred together as a single unit. Batching moves large volumes of data more efficiently than processing records one at a time.
Examples of Batches:
- A collection of rows in a relational database updated simultaneously
- A set of files transferred together during a data migration
- Multiple data entries sent in a single API request
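
The relationship between streams, records, and batches can be sketched as follows. The `batches` helper, the `customers_stream` data, and the batch size are hypothetical; the sketch only shows how a stream of records might be split into fixed-size groups.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]  # one record, e.g. a row describing a customer


def batches(records: Iterable[Record], batch_size: int) -> Iterator[List[Record]]:
    """Yield consecutive groups of at most `batch_size` records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# A stream: a collection of related records (here, five customers).
customers_stream = [{"id": i, "name": f"customer-{i}"} for i in range(1, 6)]

for batch in batches(customers_stream, batch_size=2):
    print([record["id"] for record in batch])
# prints [1, 2], then [3, 4], then [5]
```

The last batch may be smaller than the others, so downstream code should not assume a fixed size.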
Sync Modes
Sync modes define how data is retrieved from a source and transferred to a destination. They consist of two components: Source Sync Mode and Dataset Sync Mode.
Source Sync Mode
This component describes how data is read from the source.
| Mode | Description |
|---|---|
| Incremental | Reads only the records added since the last sync. The first sync acts as a Full Refresh. |
| Full Refresh | Reads all records from the source, regardless of previous syncs. |
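
The difference between the two source sync modes can be illustrated with a small Python sketch. It assumes each record carries a `created_at` value that serves as a cursor; that field, the `read_source` function, and the mode strings are assumptions made for illustration, not the platform's API.

```python
from typing import Any, Dict, List, Optional, Tuple

Record = Dict[str, Any]


def read_source(
    records: List[Record],
    mode: str,
    cursor_field: str = "created_at",   # assumed cursor column
    last_cursor: Optional[str] = None,  # highest cursor value seen in the previous sync
) -> Tuple[List[Record], Optional[str]]:
    """Return the records to sync and the cursor to store for the next sync."""
    if mode not in ("full_refresh", "incremental"):
        raise ValueError(f"unknown sync mode: {mode}")
    if mode == "full_refresh" or last_cursor is None:
        # A first incremental sync has no cursor yet, so it behaves like a full refresh.
        selected = list(records)
    else:
        selected = [r for r in records if r[cursor_field] > last_cursor]
    new_cursor = max((r[cursor_field] for r in selected), default=last_cursor)
    return selected, new_cursor


source = [
    {"id": 1, "created_at": "2024-01-01"},
    {"id": 2, "created_at": "2024-01-02"},
]
first, cursor = read_source(source, "incremental")           # reads both records
source.append({"id": 3, "created_at": "2024-01-03"})
second, cursor = read_source(source, "incremental", last_cursor=cursor)
print([r["id"] for r in first], [r["id"] for r in second])   # [1, 2] [3]
```

ISO-formatted date strings compare correctly as plain strings, which is why the sketch can use `>` and `max` on them directly.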
Dataset Sync Mode
This component specifies how data is written to the destination.
| Mode | Description |
|---|---|
| Overwrite | Replaces existing data in the destination with new data. |
| Append | Adds new data to existing tables without altering any pre-existing records. |
| Append Dedup | Appends data to existing tables while keeping a history of changes. The final table is de-duplicated using a primary key. |
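
The three write behaviors can be sketched in Python as follows. The function names, mode strings, and sample data are hypothetical, and the rule that the most recently written record wins during de-duplication is an assumption; the table above states only that the final table is de-duplicated by primary key.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def write_destination(existing: List[Record], incoming: List[Record], mode: str) -> List[Record]:
    """Return the stored rows after a sync in the given mode."""
    if mode == "overwrite":
        return list(incoming)                   # previous data is replaced
    if mode in ("append", "append_dedup"):
        return list(existing) + list(incoming)  # previous data is kept as history
    raise ValueError(f"unknown sync mode: {mode}")


def deduplicate(history: List[Record], primary_key: str = "id") -> List[Record]:
    """Build the final table for Append Dedup: one row per primary key.

    Assumption: when a key appears more than once, the latest row wins.
    """
    latest: Dict[Any, Record] = {}
    for record in history:
        latest[record[primary_key]] = record
    return list(latest.values())


existing = [{"id": 1, "plan": "free"}, {"id": 2, "plan": "free"}]
incoming = [{"id": 2, "plan": "pro"}, {"id": 3, "plan": "free"}]

print(len(write_destination(existing, incoming, "overwrite")))  # 2 rows
print(len(write_destination(existing, incoming, "append")))     # 4 rows
history = write_destination(existing, incoming, "append_dedup")
print(deduplicate(history))                                     # 3 rows, id 2 is "pro"
```

Here the appended history still holds both versions of record 2, while the de-duplicated final table keeps only one row for it.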