Skip to content

How to Convert .parquet Files to Excel With Python

What is Parquet & Why

First, let’s start with what the Parquet format and why is it used. Parquet is an Apache open-source storage format used for fast querying. It’s fast because it uses columnar formats designed to provide efficient data compression and encoding schemes, which makes it useful for handling data in bulk. To learn more about Parquet, check out this link.

I have several Python scripts that query data out of a source system, export it as Parquet files, and then load the data into Snowflake. Please note, I inherited them and did not create them. There are other ways to move data around, but that isn’t the point of this article. Now, when I’m having trouble with the process, I sometimes need to confirm what is in the Parquet file, and you can’t just open the file and read it like a text file. When I open the file, it looks like this, so I need to convert it to Excel.

The Conversion Code

1 import os
2 import pyarrow.parquet as pq
3
4 def convert_parquet_to_csv(input_folder, output_folder):
5    # Ensure the output folder exists; if not create
6    os.makedirs(output_folder, exist_ok=True)
7
8    # Iterate over all files in the input folder
9    for filename in os.listdir(input_folder):
10        if filename.endswith(".parquet"):
11           # Read the Parquet file
12            parquet_file = os.path.join(input_folder, filename)
13            table = pq.read_table(parquet_file)
14            df = table.to_pandas()
15
16            # Define the output CSV file path
17            csv_file = os.path.join(output_folder, filename.replace(".parquet", ".csv"))
18
19            # Write the DataFrame to a CSV file
20            df.to_csv(csv_file, index=False)
21            print(f"Converted {parquet_file} to {csv_file}")
22
23 #Specify file locations
24 input_folder ="C:/Users/blah blah/09 Dumpster/QA/Input"
25 output_folder = "C:/Users/blah blah/09 Dumpster/QA/Output"
26 
27 #Convert parquet to csv
28 convert_parquet_to_csv(input_folder, output_folder)

High Level Code Explanation

Lines 1 – 2 Import modules and packages needed.

Lines 4 – 21 Define the function that will convert the .parquet files to .xlsx files.

Lines 23 – 25 Specify the input and output folder locations.

Lines 27 – 28 Call/run the function.

Detailed Code Explanation

Lines 1 and 2, import the pyarrow package and os module. Pyarrow will do the conversion. The os module is built-in to Python and is used to interact with the operating system. It does things like creates and manages files and directories, environment variables, and a whole bunch of other stuff. You can find out more here.

Lines 4 – 21 define a function called convert_parquet_to_csv with 2 arguments. The function uses the os module to make sure the output folder exists. If it doesn’t exist, the “exist_ok=True” parameter prevents an error. You could also modify this to check the input and output folder. Presumably, the inputs folder exists. Sometimes, we forget about the output folder.

Then, a for loop iterates over all the files in the folder. If the file is a .parquet files, it reads the file and puts it in a pandas data frame. All other file types are ignored.

Next, the function defines the output file path as an object and writes the data frame to a .csv file.

Now, keep in mind that lines 4 – 21 just define the function. We haven’t run it yet.

Lines 23 – 25 specify the details of the input and output file paths.

Finally, lines 27 and 28 actually call the function and it runs.

But Wait, There’s More!

Since I’m newer to Python, I send all my posts to a coworker. After reading my original content, he came back and gave me a way to do the same thing with just pandas in fewer lines of code. Granted, this way doesn’t loop, but you can always make it loop for multiple files.

import pandas as pd

file_path = r"\\blah\blah\enersight_vols.parquet"

df = pd.read_parquet(file_path)
df.to_csv(r"C:\users\blah\blah\enersight_vols.csv", index=False)

Conclusion

And that’s it. That’s how you convert .parquet to .csv files. The cool thing about this script is that it can be used for other file types. Just change the file type in line 10.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.