How to use cx_Oracle to fish your data effectively from SQL tables

Fishing data from SQL

For the last two weeks, I have been working on some financial data to predict customers’ behavior towards payments. Now, what would be the biggest problem from a Data Scientist’s point of view? No, it’s not about choosing the right algorithm for prediction but about choosing what data to use for that prediction. To give you the magnitude of the data I was dealing with: it consisted of 14 SQL tables, approximately 1,000 columns and about 257 million rows combined. I couldn’t possibly pull the entire dataset into HDFS and start doing analysis on top of it. I needed to strategize my approach here.

We talked to the business team to understand the entire scenario and figured out some tables of interest, some candidate columns across different tables and the right queries to pull this data out and start analyzing it in Python. Most of the features were not directly usable but needed to be combined with other features to become useful, and a lot of table joins were needed to achieve all of it.

The next challenge was that there was nobody to help us extract this data. Folks suggested using SQL Developer to run the queries, exporting the required data from there into various CSV files and then combining it all in Python for processing. We tried following this approach, and fetching a million-odd records from one query easily took about 2-3 hours. Doing the same for the roughly 15 different queries we had was quite a manual task, not to mention how slow the entire process was.

Connecting to DB From Python

Fortunately for us, there is a library called cx_Oracle which can connect to an Oracle database, so we can run our queries directly from Python, extract CSVs, combine them and process everything without doing anything manually. Here’s a simple snippet showing how to connect using this library.

import cx_Oracle

try:
    # connect string format: user/password@host:port/service_name
    con = cx_Oracle.connect('USER_NAME/password@localhost:8000/ORCL')
except cx_Oracle.DatabaseError as er:
    print('There is an error in the Oracle database:', er)
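
If you’d rather not pack everything into one connect string, cx_Oracle also offers makedsn to build the DSN from its parts. A minimal sketch, reusing the same placeholder host, port and service name as above:

import cx_Oracle

# build the DSN from its parts (host, port and service name are placeholders)
dsn = cx_Oracle.makedsn('localhost', 8000, service_name='ORCL')
con = cx_Oracle.connect(user='USER_NAME', password='password', dsn=dsn)

# ... run your queries ...
con.close()  # release the connection when you are done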

Reading Data from SQL tables

One can then read the queries from a file, or better, create a variable for each query and collect these query variables in a list. Each query can get its own output file name, and those can be kept in a list of file names as well.

import pandas as pd

for i, query in enumerate(queries):
    # chunksize makes read_sql_query return an iterator of DataFrames
    ref = pd.read_sql_query(query, con=con, chunksize=1000)
    cdfs_list = [chunk for chunk in ref]           # pull all chunks into a list
    cdfs = pd.concat(cdfs_list).reset_index(drop=True)
    cdfs.to_csv(path + file_list[i], index=False)
    print(f'file {file_list[i]} saved to {path}')

Here, queries and file_list are defined as

queries = [query1, query2, query3, query4, query5, query6]
file_list = [file1, file2, file3, file4, file5, file6]

One can store the file names directly without using variables, or read them (and the queries themselves) from an external file. There are many possible approaches; we chose what was simplest to use.
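
For instance, if each query lived in its own .sql file, they could be loaded like this (a minimal sketch; the file names here are hypothetical):

query_files = ['customers.sql', 'payments.sql', 'invoices.sql']  # hypothetical file names

queries = []
for fname in query_files:
    with open(fname) as f:        # each file holds one SQL query
        queries.append(f.read())

# derive the output CSV names from the query file names
file_list = [fname.replace('.sql', '.csv') for fname in query_files]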

The tables are actually read in two lines: pd.read_sql_query, and then the line right after it, in chunks of 1,000 rows at a time. One can increase or decrease the chunk size; we chose a modest 1,000 as it seemed to be a good balance between round trips to the database and memory usage. You can also leave chunksize as None, in which case pandas returns all rows of the result at once, which takes significantly longer and more memory. With a chunk size set, it returns an iterator, and using that iterator we can fetch further rows.
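
If memory is really tight, there is no need to concatenate all the chunks before writing; each chunk can be appended to the CSV as it arrives. A sketch under the same assumptions (queries, file_list and path as defined above):

import pandas as pd

for i, query in enumerate(queries):
    # stream the result set, appending each 1000-row chunk straight to disk
    for n, chunk in enumerate(pd.read_sql_query(query, con=con, chunksize=1000)):
        chunk.to_csv(path + file_list[i], index=False,
                     mode='w' if n == 0 else 'a',  # overwrite on the first chunk, append after
                     header=(n == 0))              # write the header row only once
    print(f'file {file_list[i]} saved to {path}')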
Here’s an overview of the entire process, adapted from a Stack Overflow post; a short sketch of both modes follows the list.

Some thoughts on using Chunk size while querying SQL

  1. chunksize is None (default value):
    • pandas passes query to database
    • database executes query
    • pandas checks and sees that chunksize is None
    • pandas tells database that it wants to receive all rows of the result table at once
    • database returns all rows of the result table
    • pandas stores the result table in memory and wraps it into a data frame
    • now you can use the data frame
  2. chunksize is not None:
    • pandas passes query to database
    • database executes query
    • pandas checks and sees that chunksize has some value
    • it then creates a query iterator (the usual ‘while True’ loop, which breaks when the database says there is no more data left) and iterates over it each time you want the next chunk of the result table
    • pandas tells database that it wants to receive chunksize rows
    • database returns the next chunksize rows from the result table
    • pandas stores the next chunksize rows in memory and wraps it into a data frame
    • now you can use the data frame
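
To make the difference concrete, here is a minimal sketch of both modes (the table name payments is made up):

import pandas as pd

# chunksize=None: one shot -- the whole result table lands in memory as a DataFrame
df = pd.read_sql_query('SELECT * FROM payments', con=con)

# chunksize=1000: an iterator -- rows are fetched 1000 at a time, on demand
for chunk in pd.read_sql_query('SELECT * FROM payments', con=con, chunksize=1000):
    print(len(chunk))  # each chunk is a DataFrame of up to 1000 rows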

Conclusion

Finding the right data is a priority, but how you find it and how efficiently you can get there matters just as much. Don’t shy away from writing code to extract data from tables. It’s effective, fast, manageable, scalable and increases your productivity as well.
