Feasible Workflow for Parquet Export: A Database Guide

Bruno Peixoto
5 min read · Jun 12, 2023

In today’s data-driven world, Parquet has emerged as a popular columnar storage format for efficient data processing and analytics. In this blog post, we will explore feasible workflows for exporting data into Parquet format across a range of databases. We’ll cover each database individually, highlighting the necessary libraries and tools for seamless Parquet export.

[Image: a parquet mosaic]

MySQL

Libraries: PyMySQL, Pandas, PyArrow

Workflow:

  • Connect to the MySQL database using PyMySQL.
  • Fetch the data using SQL queries and load it into a Pandas DataFrame.
  • Convert the DataFrame to a PyArrow Table.
  • Write the PyArrow Table to Parquet format using the PyArrow library.
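A minimal sketch of this workflow, assuming a hypothetical table named orders and placeholder connection details:

```python
import pymysql
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Connect to MySQL (placeholder host, credentials, and database name)
conn = pymysql.connect(
    host="localhost",
    user="user",
    password="secret",
    database="shop",
)

try:
    # Run the query and load the result set into a pandas DataFrame
    df = pd.read_sql("SELECT * FROM orders", conn)
finally:
    conn.close()

# Convert the DataFrame to a PyArrow Table and write it out as Parquet
table = pa.Table.from_pandas(df)
pq.write_table(table, "orders.parquet")
```

Recent pandas versions warn that read_sql prefers SQLAlchemy connections; the query still runs, and a SQLAlchemy engine can be substituted if you prefer. The same DataFrame-to-Table-to-Parquet tail applies to most of the workflows below.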

PostgreSQL

Libraries: Psycopg2, Pandas, PyArrow

Workflow:

  • Establish a connection to the PostgreSQL database using Psycopg2.
  • Retrieve the data using SQL queries and load it into a Pandas DataFrame.
  • Convert the DataFrame to a PyArrow Table.
  • Save the PyArrow Table in Parquet format using the PyArrow library.
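A minimal sketch under the same assumptions (a hypothetical orders table, placeholder credentials):

```python
import psycopg2
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Connect to PostgreSQL (placeholder connection details)
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="shop",
    user="user",
    password="secret",
)

try:
    df = pd.read_sql("SELECT * FROM orders", conn)
finally:
    conn.close()

# Snappy is PyArrow's default codec; it is spelled out here for visibility
table = pa.Table.from_pandas(df)
pq.write_table(table, "orders.parquet", compression="snappy")
```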

Oracle Database

Libraries: cx_Oracle, Pandas, PyArrow

Workflow:

  • Create a connection to the Oracle database using cx_Oracle.
  • Fetch the required data using SQL queries and store it in a Pandas DataFrame.
  • Convert the DataFrame to a PyArrow Table.
  • Export the PyArrow Table to Parquet format using the PyArrow library.
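A minimal sketch, assuming placeholder credentials and an Easy Connect DSN; cx_Oracle's successor, python-oracledb, accepts the same connect arguments:

```python
import cx_Oracle
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Connect to Oracle (placeholder user, password, and Easy Connect string)
conn = cx_Oracle.connect(
    user="user",
    password="secret",
    dsn="dbhost.example.com:1521/ORCLPDB1",
)

try:
    df = pd.read_sql("SELECT * FROM orders", conn)
finally:
    conn.close()

table = pa.Table.from_pandas(df)
pq.write_table(table, "orders.parquet")
```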

Microsoft SQL Server

Libraries: pyodbc, Pandas, PyArrow

Workflow:

  • Establish a connection to the SQL Server database using pyodbc.
  • Retrieve the data using SQL queries and load it into a Pandas DataFrame.
  • Convert the DataFrame to a PyArrow Table.
  • Save the PyArrow Table in Parquet format using the PyArrow library.
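A minimal sketch, assuming the Microsoft ODBC driver is installed locally and using placeholder connection details:

```python
import pyodbc
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Build the ODBC connection string (the driver name depends on what is installed)
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=shop;UID=user;PWD=secret"
)

try:
    df = pd.read_sql("SELECT * FROM orders", conn)
finally:
    conn.close()

table = pa.Table.from_pandas(df)
pq.write_table(table, "orders.parquet")
```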

MongoDB

Libraries: PyMongo, Pandas, PyArrow

Workflow:

  • Connect to the MongoDB database using PyMongo.
  • Query the required data and load it into a Pandas DataFrame.
  • Convert the DataFrame to a PyArrow Table.
  • Export the PyArrow Table to Parquet format using the PyArrow library.
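A minimal sketch, assuming a hypothetical orders collection in a shop database; the projection drops MongoDB's _id field so the resulting schema stays simple:

```python
from pymongo import MongoClient
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Connect to MongoDB (placeholder URI, database, and collection names)
client = MongoClient("mongodb://localhost:27017")
collection = client["shop"]["orders"]

# Materialize the query result; exclude _id to keep the schema tabular
docs = list(collection.find({}, {"_id": 0}))
df = pd.DataFrame(docs)

table = pa.Table.from_pandas(df)
pq.write_table(table, "orders.parquet")
```

Deeply nested documents may need flattening first, for example with pandas.json_normalize, so that every column maps cleanly to a Parquet type.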

IBM Db2

Libraries: ibm_db, Pandas, PyArrow

Workflow:

  • Establish a connection to the Db2 database using ibm_db.
  • Fetch the data using SQL queries and store it in a Pandas DataFrame.
  • Convert the DataFrame to a PyArrow Table.
  • Write the PyArrow Table to Parquet format using the PyArrow library.
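A minimal sketch, assuming the DB-API wrapper ibm_db_dbi that ships with the ibm_db package, plus placeholder connection details:

```python
import ibm_db_dbi  # DB-API wrapper distributed with the ibm_db package
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Connection string with placeholder values
conn = ibm_db_dbi.connect(
    "DATABASE=shop;HOSTNAME=localhost;PORT=50000;PROTOCOL=TCPIP;"
    "UID=user;PWD=secret;",
    "",
    "",
)

try:
    df = pd.read_sql("SELECT * FROM orders", conn)
finally:
    conn.close()

table = pa.Table.from_pandas(df)
pq.write_table(table, "orders.parquet")
```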

Amazon Redshift

Libraries: Psycopg2, Pandas, PyArrow

Workflow:

  • Connect to the Redshift database using Psycopg2.
  • Retrieve the required data using SQL queries and load it into a Pandas DataFrame.
  • Convert the DataFrame to a PyArrow Table.
  • Save the PyArrow Table in Parquet format using the PyArrow library.
  • Optionally, for large tables, use Redshift's UNLOAD command with FORMAT AS PARQUET or an AWS Glue job to write Parquet files directly to Amazon S3; Redshift Spectrum can then query those files in place.
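A minimal sketch of the direct-connection path, with a placeholder cluster endpoint; note that Redshift listens on port 5439 by default:

```python
import psycopg2
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Redshift speaks the PostgreSQL wire protocol (placeholder endpoint below)
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="user",
    password="secret",
)

try:
    df = pd.read_sql("SELECT * FROM orders", conn)
finally:
    conn.close()

table = pa.Table.from_pandas(df)
pq.write_table(table, "orders.parquet")
```

This path pulls the entire result set through a single connection, so it suits modest extracts; very large tables are better served by UNLOAD writing Parquet straight to S3.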

Google BigQuery

Tools: Google Cloud SDK (BQ CLI)

Workflow:

  • Use a BigQuery extract job (for example, bq extract) to export the data from BigQuery to Cloud Storage in CSV or JSON format; the extract job can also write Parquet directly via --destination_format=PARQUET, which makes the remaining steps optional.
  • Use a tool like Apache Arrow or pandas to load the exported data into a DataFrame.
  • Convert the DataFrame to a PyArrow Table.
  • Write the PyArrow Table to Parquet format using the PyArrow library.
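A minimal sketch of the conversion step, assuming the export job produced a CSV file that has already been copied locally; the project, dataset, bucket, and file names are placeholders:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Assumed preceding steps, run in a shell (all names are placeholders):
#   bq extract --destination_format=CSV 'my_project:my_dataset.orders' gs://my-bucket/orders.csv
#   gsutil cp gs://my-bucket/orders.csv .

# Load the exported CSV and write it back out as Parquet
df = pd.read_csv("orders.csv")
table = pa.Table.from_pandas(df)
pq.write_table(table, "orders.parquet")
```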

Apache Cassandra

Tools: DataStax Bulk Loader (dsbulk), Pandas, PyArrow

Workflow:

  • Use the DataStax Bulk Loader (dsbulk unload) to export data from Apache Cassandra to CSV or JSON files.
  • Load the exported files into a Pandas DataFrame.
  • Convert the DataFrame to a PyArrow Table.
  • Save the PyArrow Table in Parquet format using the PyArrow library.

SQLite

Libraries: sqlite3, Pandas, PyArrow

Workflow:

  • Connect to the SQLite database using the sqlite3 library.
  • Retrieve the data using SQL queries and load it into a Pandas DataFrame.
  • Convert the DataFrame to a PyArrow Table.
  • Export the PyArrow Table to Parquet format using the PyArrow library.
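A minimal sketch using only the standard-library sqlite3 module plus pandas and PyArrow; the database path and table name are placeholders:

```python
import sqlite3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Open the database file (placeholder path)
conn = sqlite3.connect("shop.db")

try:
    df = pd.read_sql("SELECT * FROM orders", conn)
finally:
    conn.close()

table = pa.Table.from_pandas(df)
pq.write_table(table, "orders.parquet")
```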

MariaDB

Libraries: PyMySQL, Pandas, PyArrow

Workflow:

  • Connect to the MariaDB database using PyMySQL.
  • Fetch the data using SQL queries and load it into a Pandas DataFrame.
  • Convert the DataFrame to a PyArrow Table.
  • Write the PyArrow Table to Parquet format using the PyArrow library.

SAP HANA

Tools: SAP HANA Spark Controller, Apache Arrow, PyArrow

Workflow:

  • Utilize the SAP HANA Spark Controller to expose SAP HANA data to Apache Spark and read it into a Spark DataFrame.
  • Write the Spark DataFrame to Parquet with Spark's built-in writer, or convert smaller result sets to a pandas DataFrame.
  • In the pandas case, convert the DataFrame to a PyArrow Table.
  • Save the PyArrow Table in Parquet format using the PyArrow library.

Teradata

Tools: Teradata QueryGrid, Apache Arrow, PyArrow

Workflow:

  • Utilize Teradata QueryGrid to export data from Teradata to a supported target (e.g., Apache Hadoop, Amazon S3).
  • Load the exported data into a DataFrame using Apache Arrow or pandas.
  • Convert the DataFrame to a PyArrow Table.
  • Export the PyArrow Table to Parquet format using the PyArrow library.

Couchbase

Libraries: Couchbase Python SDK, Pandas, PyArrow

Workflow:

  • Use the Couchbase Python SDK to connect to the Couchbase cluster.
  • Retrieve the data from Couchbase using N1QL queries and load it into a Pandas DataFrame.
  • Convert the DataFrame to a PyArrow Table.
  • Export the PyArrow Table to Parquet format using the PyArrow library.
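A minimal sketch against the Couchbase Python SDK 4.x API, assuming a hypothetical shop bucket whose documents carry a type field; the address and credentials are placeholders:

```python
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Connect to the cluster (placeholder address and credentials)
auth = PasswordAuthenticator("user", "secret")
cluster = Cluster("couchbase://localhost", ClusterOptions(auth))

# Run a N1QL / SQL++ query and collect the resulting rows
result = cluster.query("SELECT o.* FROM `shop` AS o WHERE o.type = 'order'")
df = pd.DataFrame([row for row in result])

table = pa.Table.from_pandas(df)
pq.write_table(table, "orders.parquet")
```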

Neo4j

Libraries: Neo4j Python Driver, Pandas, PyArrow

Workflow:

  • Connect to the Neo4j database using the Neo4j Python Driver.
  • Fetch the data using Cypher queries and load it into a Pandas DataFrame.
  • Convert the DataFrame to a PyArrow Table.
  • Save the PyArrow Table in Parquet format using the PyArrow library.
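A minimal sketch, assuming a hypothetical graph with Person and Order nodes; the URI, credentials, and Cypher query are placeholders:

```python
from neo4j import GraphDatabase
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Connect to Neo4j (placeholder URI and credentials)
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    result = session.run(
        "MATCH (p:Person)-[:PLACED]->(o:Order) "
        "RETURN p.name AS customer, o.id AS order_id, o.total AS total"
    )
    # record.data() yields a plain dict keyed by the RETURN aliases
    df = pd.DataFrame([record.data() for record in result])

driver.close()

table = pa.Table.from_pandas(df)
pq.write_table(table, "orders.parquet")
```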

Apache Hive

Workflow:

  • Apache Hive has native support for Parquet format, so no additional libraries are needed.
  • Use Hive queries to write the data as Parquet files, for example CREATE TABLE ... STORED AS PARQUET AS SELECT ..., or an INSERT OVERWRITE into a table declared with STORED AS PARQUET.

Apache HBase

Libraries: Apache Phoenix, Pandas, PyArrow

Workflow:

  • Use Apache Phoenix to connect to the Apache HBase cluster.
  • Query the required data using SQL-like queries and load it into a Pandas DataFrame.
  • Convert the DataFrame to a PyArrow Table.
  • Save the PyArrow Table in Parquet format using the PyArrow library.
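A minimal sketch, assuming a running Phoenix Query Server and using the phoenixdb Python adapter (one possible client choice; others work as well); the host, port, and table name are placeholders:

```python
import phoenixdb
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Connect to the Phoenix Query Server (placeholder URL)
conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)

try:
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM ORDERS")
    # Recover column names from the DB-API cursor description
    columns = [desc[0] for desc in cursor.description]
    df = pd.DataFrame(cursor.fetchall(), columns=columns)
finally:
    conn.close()

table = pa.Table.from_pandas(df)
pq.write_table(table, "orders.parquet")
```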

Apache Derby

Libraries: JDBC Driver for Apache Derby, Pandas, PyArrow

Workflow:

  • Connect to the Apache Derby database using the JDBC Driver.
  • Retrieve the data using SQL queries and load it into a Pandas DataFrame.
  • Convert the DataFrame to a PyArrow Table.
  • Export the PyArrow Table to Parquet format using the PyArrow library.
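Python has no native Derby driver, so one option is to go through JDBC with the JayDeBeApi bridge (an assumption on my part; any JDBC-capable client works, and a local JVM is required). The driver class, URL, jar path, and credentials below are placeholders:

```python
import jaydebeapi
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Connect through the Derby network client JDBC driver (placeholder values)
conn = jaydebeapi.connect(
    "org.apache.derby.jdbc.ClientDriver",
    "jdbc:derby://localhost:1527/shopdb",
    ["user", "secret"],
    "/path/to/derbyclient.jar",
)

try:
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM ORDERS")
    # Recover column names from the DB-API cursor description
    columns = [desc[0] for desc in cursor.description]
    df = pd.DataFrame(cursor.fetchall(), columns=columns)
finally:
    conn.close()

table = pa.Table.from_pandas(df)
pq.write_table(table, "orders.parquet")
```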

Apache Ignite

Tools: Apache Ignite SQL/JDBC interface or the ignite-spark integration, Pandas, PyArrow

Workflow:

  • Export the data from Apache Ignite, for example through its SQL/JDBC interface or by reading Ignite tables into Apache Spark via the ignite-spark integration.
  • Load the exported data into a DataFrame using pandas (or write it out directly from Spark, which has a built-in Parquet writer).
  • Convert the DataFrame to a PyArrow Table.
  • Save the PyArrow Table in Parquet format using the PyArrow library.

In this blog post, we explored feasible workflows for exporting data to Parquet format across various databases. We discussed the necessary libraries and provided step-by-step instructions for each database. Leverage these workflows to unlock the benefits of Parquet’s columnar storage and optimize your data processing and analytics pipelines.

Note: The specific libraries and tools mentioned are popular choices, but other alternatives may exist based on individual requirements and preferences.
