Data Ingestion

Data Ingestion is the process of collecting, importing, and processing data from various sources into a centralized data repository or system, making it ready for analysis and utilization. This step is critical in any data pipeline as it ensures that data is available and accessible in the desired format for subsequent processing, analysis, and decision-making.
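
To make the flow concrete, here is a minimal sketch of an ingest step in Python: it pulls rows from one source (a CSV file) and lands them in a central repository (a SQLite table). The file, database, and table names are illustrative only, not part of VDA.

```python
import csv
import sqlite3

def ingest_csv(source_path: str, db_path: str, table: str) -> int:
    """Load every row of a CSV file into a SQLite table; return the row count."""
    with open(source_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # first row carries the column names
        rows = list(reader)

    conn = sqlite3.connect(db_path)
    # Create the target table from the CSV header (all TEXT, for simplicity).
    cols = ", ".join(f'"{c}" TEXT' for c in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    conn.close()
    return len(rows)

print(ingest_csv("orders.csv", "warehouse.db", "orders"), "rows ingested")
```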

Key Components of Data Ingestion

  1. Source Systems

    • Definition: The origins of the data being ingested. These can include databases, APIs, flat files, cloud storage, sensors, and more.

    • Variety: Data can come from structured sources like SQL databases, semi-structured sources like JSON files, or unstructured sources like text documents; a short sketch of each follows below.
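
The three source classes read quite differently in code. A minimal sketch in Python; the file names, table, and query are placeholders, not VDA APIs.

```python
import json
import sqlite3

# Structured: a SQL database exposes typed rows behind a fixed schema.
conn = sqlite3.connect("warehouse.db")
first_order = conn.execute("SELECT * FROM orders LIMIT 1").fetchone()

# Semi-structured: JSON carries its schema per record and may nest freely.
with open("events.json") as f:
    events = json.load(f)

# Unstructured: plain text has no schema; structure is imposed downstream.
with open("notes.txt") as f:
    notes = f.read()
```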

VDA connectors:

  • Any engine from the List of Available Connectors below, and anything built over it

  • CSV

  • Any database reachable through dbapi or sql_alchemy

List of Available Connectors

  • Amazon Athena
  • Amazon EventBridge
  • Amazon Glue
  • Amazon Redshift
  • Apache Cassandra
  • Apache Druid
  • Apache Hive
  • dbt
  • Delta Lake
  • Elasticsearch
  • Google BigQuery
  • IBM DB2
  • Kafka Schema Registry
  • Microsoft SQL Server
  • MySQL
  • Oracle
  • PostgreSQL
  • PrestoDB
  • Trino (formerly Presto SQL)
  • Vertica
  • Snowflake
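
Since many of these engines are reached through sql_alchemy, a quick connectivity check can be run from a SQLAlchemy engine before registering the source in VDA. A minimal sketch, assuming a PostgreSQL source; the URL, credentials, and database name are placeholders:

```python
from sqlalchemy import create_engine, text

# Placeholder URL: swap in the host, credentials, and database of your source.
engine = create_engine("postgresql+psycopg2://user:secret@localhost:5432/analytics")

with engine.connect() as conn:
    # A cheap probe that confirms the source answers before it is registered.
    print(conn.execute(text("SELECT version()")).scalar())
```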

Create the Data Source from which data is to be ingested

Navigate to the Datasource tab and click on the desired Data Source to find its list of associated Datasets.

To create a new Ingestion Workbook, navigate to Workbook and click Create under Ingestion Book.

  2. Ingestion Methods

    • Batch Processing: Data is collected and processed in large chunks at scheduled intervals.

      • Use Cases: Suitable where real-time data is not necessary, such as end-of-day reports or periodic data archiving.

      • A batch run can be one of the following (see the sketch after this list):

        • Full refresh

        • Incremental

        • Historical
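
A sketch of how the three batch modes differ, assuming a hypothetical three-column orders table that carries an updated_at timestamp; none of these names come from VDA.

```python
import sqlite3

def batch_ingest(src: sqlite3.Connection, dst: sqlite3.Connection,
                 mode: str, watermark: str | None = None) -> None:
    if mode == "full_refresh":
        # Rebuild the target from scratch on every scheduled run.
        dst.execute("DELETE FROM orders")
        rows = src.execute("SELECT * FROM orders").fetchall()
    elif mode == "incremental":
        # Only rows changed since the last successful run (the watermark).
        rows = src.execute(
            "SELECT * FROM orders WHERE updated_at > ?", (watermark,)
        ).fetchall()
    elif mode == "historical":
        # One-time backfill of all rows up to a chosen cutoff.
        rows = src.execute(
            "SELECT * FROM orders WHERE updated_at <= ?", (watermark,)
        ).fetchall()
    else:
        raise ValueError(f"unknown ingestion mode: {mode}")

    dst.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    dst.commit()
```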

To create a Schedule, navigate to Schedule, click the plus icon, enter the details of the Ingestion Workbook, and submit:

  • Name of Schedule

  • Name of Ingestion Workbook

  • Frequency

  • Start Date
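
For intuition, here is a hypothetical sketch of those four fields as a data structure with a next-run calculation. VDA captures them through the UI, so every name below is illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class IngestionSchedule:
    name: str             # Name of Schedule
    workbook: str         # Name of Ingestion Workbook
    frequency: timedelta  # how often the workbook runs
    start_date: datetime  # first scheduled run

    def next_run(self, after: datetime) -> datetime:
        """First scheduled run strictly after the given moment."""
        if after < self.start_date:
            return self.start_date
        periods = (after - self.start_date) // self.frequency + 1
        return self.start_date + periods * self.frequency

daily = IngestionSchedule(
    name="orders-nightly",
    workbook="orders_ingestion_book",
    frequency=timedelta(days=1),
    start_date=datetime(2024, 1, 1, 2, 0),
)
print(daily.next_run(datetime(2024, 3, 5, 12, 0)))  # 2024-03-06 02:00
```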
