Data Ingestion
Data Ingestion is the process of collecting, importing, and processing data from various sources into a centralized data repository or system, making it ready for analysis and utilization. This step is critical in any data pipeline as it ensures that data is available and accessible in the desired format for subsequent processing, analysis, and decision-making.
Key Components of Data Ingestion
Source Systems
Definition: The origins of the data being ingested. These can include databases, APIs, flat files, cloud storage, sensors, and more.
Variety: Data can come from structured sources like SQL databases, semi-structured sources like JSON files, or unstructured sources like text documents.
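As a minimal sketch of handling this variety, the snippet below parses a structured CSV payload and a semi-structured JSON payload into a common list-of-records shape. The sample payloads and function names are illustrative assumptions, not part of the platform:

```python
import csv
import io
import json

# Hypothetical sample payloads standing in for real source systems.
csv_payload = "id,name\n1,alpha\n2,beta\n"
json_payload = '[{"id": "3", "name": "gamma"}]'

def ingest_csv(text):
    """Parse a structured CSV source into a list of dict records."""
    return list(csv.DictReader(io.StringIO(text)))

def ingest_json(text):
    """Parse a semi-structured JSON source into a list of dict records."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]

# Both sources now land in one uniform shape, ready for downstream steps.
records = ingest_csv(csv_payload) + ingest_json(json_payload)
```

Normalizing every source to the same record shape early is what lets later pipeline stages stay source-agnostic.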
VDA connectors:
List of Available Connectors
AWS Glue and anything built on top of it
CSV
Oracle (through DB-API or SQLAlchemy)
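The DB-API route mentioned above can be sketched as follows. To keep the example self-contained, the stdlib sqlite3 driver (itself a DB-API 2.0 module) stands in for an Oracle driver; with a real Oracle source, only the connect() call would differ:

```python
import sqlite3

# sqlite3 is used here purely as a stand-in DB-API 2.0 driver so the
# sketch runs anywhere; an Oracle source would swap in an Oracle
# driver's connect() call but keep the same cursor/fetch pattern.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 12.0)])

# The ingestion step: pull rows from the source table in one query.
rows = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
conn.close()
```

The same cursor-and-fetch pattern applies regardless of which DB-API driver the connector uses underneath.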
Create the Data Source from which data is to be ingested
Navigate to the Datasource tab and click the desired Data Source to view its list of associated Datasets
To create a new Ingestion Workbook, navigate to Workbook and click Ingestion Book Create
Ingestion Methods
Batch Processing: Data is collected and processed in large chunks at scheduled intervals.
Use Cases: Suitable where real-time data is not necessary, such as end-of-day reports or periodic data archiving.
It can be:
Full refresh
Incremental
Historical
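The difference between a full refresh and an incremental load can be sketched with a watermark column. The in-memory tables and the `updated_at` field below are assumptions for illustration, not the platform's actual schema:

```python
# Hypothetical in-memory "source" rows; updated_at is an assumed
# watermark column used to drive the incremental mode.
source = [
    {"id": 1, "updated_at": 10},
    {"id": 2, "updated_at": 20},
    {"id": 3, "updated_at": 30},
]

def full_refresh(src):
    """Replace the target with a complete copy of the source."""
    return list(src)

def incremental(src, watermark):
    """Load only rows newer than the last ingested watermark."""
    return [row for row in src if row["updated_at"] > watermark]

target = full_refresh(source)              # initial load: all rows
delta = incremental(source, watermark=20)  # later run: only new rows
```

A historical load follows the same shape as an incremental one, except the watermark is pushed back to cover a past date range instead of only the most recent run.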
To create a Schedule, navigate to Schedule, click the Plus button, enter the following details for the Ingestion Workbook, and click Submit:
Name of Schedule
Name of Ingestion workbook
Frequency
Start Date
Last updated
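The schedule fields above can be modeled as a small record. The class and field names below are assumptions for illustration (the platform's real schema is not shown here); frequency is simplified to a day count, and the next-run computation is one plausible interpretation:

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical record mirroring the schedule fields listed above;
# names and types are assumptions, not the platform's actual schema.
@dataclass
class IngestionSchedule:
    name: str                 # Name of Schedule
    workbook: str             # Name of Ingestion Workbook
    frequency_days: int       # Frequency, e.g. 1 = daily, 7 = weekly
    start_date: date          # Start Date
    last_updated: Optional[date] = None

    def next_run(self, today: date) -> date:
        """First scheduled run on or after `today`."""
        if today <= self.start_date:
            return self.start_date
        elapsed = (today - self.start_date).days
        periods = -(-elapsed // self.frequency_days)  # ceiling division
        return self.start_date + timedelta(days=periods * self.frequency_days)

sched = IngestionSchedule("weekly-orders", "orders_ingestion", 7, date(2024, 1, 1))
```

For example, with a weekly frequency starting 2024-01-01, a run queried on 2024-01-10 lands on 2024-01-15, the next 7-day boundary.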