Using Airflow
If you are using Apache Airflow for your scheduling, you might want to use it to schedule your ingestion recipes as well. For any Airflow-specific questions, refer to the Airflow documentation.
We've provided a few examples of how to configure your DAG:
- mysql_sample_dag embeds the full MySQL ingestion configuration inside the DAG (a sketch of this pattern follows this list).
- snowflake_sample_dag avoids embedding credentials inside the recipe, and instead fetches them from Airflow's Connections feature. You must configure your connections in Airflow to use this approach.
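For reference, here is a minimal sketch of what the ingestion callable might look like for the embedded-recipe approach, using DataHub's programmatic Pipeline API. The host, credentials, database name, and sink address are placeholders; the exact config fields depend on your source.

def datahub_mysql_ingestion():
    # Imports must live inside the callable, because PythonVirtualenvOperator
    # runs it in a fresh virtualenv that only contains the listed requirements.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "localhost:3306",  # placeholder values --
                    "username": "datahub",          # substitute your own, or
                    "password": "datahub",          # fetch them from Airflow
                    "database": "my_database",      # Connections instead
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()

The snowflake_sample_dag follows the same shape, except that it reads the credentials from an Airflow Connection instead of hard-coding them in the recipe.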
These example DAGs use the PythonVirtualenvOperator to run the ingestion. This is the recommended approach, since it ensures that DataHub's dependencies do not conflict with the rest of your Airflow environment.
When configuring the task, it's important to include your source's extra in the requirements and to set the system_site_packages option to False.
ingestion_task = PythonVirtualenvOperator(
    task_id="ingestion_task",
    requirements=[
        "acryl-datahub[<your-source>]",
    ],
    system_site_packages=False,
    python_callable=your_callable,
)
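Putting it together, a complete DAG file might look roughly like the sketch below (assuming Airflow 2.x). The dag_id, schedule, and start date are placeholders, and datahub_mysql_ingestion is the callable from the earlier sketch, which would need to be defined in or imported into the same file.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator

with DAG(
    dag_id="datahub_mysql_ingestion_dag",  # hypothetical DAG id
    schedule_interval="0 5 * * *",         # run the ingestion daily at 05:00
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    ingestion_task = PythonVirtualenvOperator(
        task_id="ingestion_task",
        requirements=[
            "acryl-datahub[mysql]",        # match the extra to your source
        ],
        system_site_packages=False,
        python_callable=datahub_mysql_ingestion,  # callable from the sketch above
    )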