
Saturday, 19 January 2019

Time SQL Execution with Python

I've said before in this blog that I find Python very useful for doing various things, including processing data to or from an Oracle database. Here is another example where a relatively simple and straightforward piece of Python code can deliver something very useful - in this case measuring the elapsed time of SQL queries executed on an Oracle database. As part of a tuning or monitoring exercise we often need to execute a given SQL query and see how long it takes to complete, but running it via SQL*Plus or something else can tie up a session for an extended period of time and produce large amounts of output that are spooled to the screen. Conversely we don't want to run the SQL query and stop when only the first data row is returned, as that is not a "full" execution of the query. We can use Python to achieve this - execute a query to completion, consuming all the data it produces but not outputting any of it, and measure the elapsed time. And we can execute this outside of any SQL*Plus or SQL Developer or other window we might be using.

The key requirements for this utility are:
  • Connect to an Oracle database as a specified user
  • Execute the given SQL query, unmodified, so the execution plan used should match that when the same SQL is submitted by the real application
  • Consume all the rows of data produced, for a true time to complete measurement
This covers most scenarios but does leave a few out - no bind variables can be used, for instance.

There are really three parts to achieving this:
  • Connect to the Oracle database using the supplied credentials
  • Read in the SQL query text from a file
  • Execute against an Oracle database connection and consume the results noting the start and end time of this
Each of these parts involves relatively few lines of Python code. Part of the reason for this is Python's dynamic typing of variables - you don't have to declare the data types the SQL query produces in advance of retrieving the data, which means the utility can handle almost any query producing any number of values per output row.

The code I will now present is not "perfect" in an absolute sense - there is very limited error handling for instance - but it is complete and does work. I've used it many times to execute a SQL query pulled out from an application that is performing poorly, to verify its execution and the impact of any changes made to try and improve performance.

Python Code

Connecting to an Oracle database
I use cx_Oracle as the database access module for Oracle, and connection is relatively simple. This assumes that the database credentials are in variables named "user_name", "user_password" and "database_name".
import cx_Oracle

# Other code here, such as parsing command line arguments to get values into necessary variables

db_conn = cx_Oracle.connect (user_name, user_password, database_name)

Reading in the SQL query from a file
Assume that the name of the file containing the query is in the string variable "SQL_File_Name".
sql_query = open (SQL_File_Name, "r").read ().rstrip (' ;\n') # Strip any semi-colon off from the end
This does several things in one combined statement:
  • Opens the file for reading
  • Reads in the content of the file returning it as a single string
  • Strips off certain trailing characters at the end of the string - space, semi-colon or new line character.
We now have in the "sql_query" variable the text string of the SQL query with no trailing semi-colon (if there was one). This is now ready for use to execute against an Oracle database.
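For clarity, the same steps can be written out over several statements, using a "with" block so that the file is closed promptly:
# Equivalent multi-step version of the one-liner above
with open (SQL_File_Name, "r") as sql_file :
    sql_query = sql_file.read ()
sql_query = sql_query.rstrip (' ;\n')  # Strip trailing spaces, semi-colons and newlines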

Executing the query and timing it
For now assume that a database connection has been made and is referenced by the object named "db_conn".

To execute the SQL query we first need to get a cursor from the database connection, then execute the SQL using it, retrieve all the rows of data, and report the elapsed time:
import logging
import time

try :
    curs = db_conn.cursor ()  
    start_time = time.time ()  
    curs.execute (sql_query) 
    # Now retrieve the rows of data produced by the query
    row_count = 0  
    rows = curs.fetchmany (1000)  
    # Empty list will be False so can test it directly  
    while rows :  
        row_count += len (rows)  
        rows = curs.fetchmany (1000)  
    end_time = time.time ()  
    print ("Query Complete - Elapsed {0:4.1f}s for {1} rows".format ((end_time - start_time), row_count))  
except Exception as ex :
    logging.critical ("ERROR - Exception raised during processing")
    logging.critical ("Exception - %s" % (ex))
I've included exception handling in case there is something wrong with the SQL query or its execution, but there is not a lot we can do about such errors, so it just logs some information about the exception that was caught. Note the extra "import logging" at the top to support this.

More details
I actually allow all the necessary values to be passed in on the command line to this utility program, and I use the Python "argparse" module to process these and map them to corresponding variables. So you can either assign the results from "argparse" to the discrete variables I have used in the code examples, or replace those variable references with the results produced by "argparse" - the results will be the same either way.

The following code can be placed before the code given so far in the final Python program:
import argparse
import sys

# Parse command line arguments
parser = argparse.ArgumentParser (description="Time execution of a SQL query")
# Add command line argument definitions one by one into the parser
parser.add_argument ("-u", "--user",      help="User name", required=False, default=None)
parser.add_argument ("-p", "--password",  help="Password", required=False, default=None)
parser.add_argument ("-d", "--database",  help="Database name", required=False, default=None)
parser.add_argument ("SQL_File", help="Name of SQL File to execute", default=None)
args = parser.parse_args ()
# First check whether we did get a user name, password etc on the command line
# If not, prompt the user for what we are missing
if args.database is None :
    print ("Enter database", end=": ", flush=True)
    database_name = sys.stdin.readline ().rstrip ()
else :
    database_name = args.database
if args.user is None :
    print ("Enter username", end=": ", flush=True)
    user_name = sys.stdin.readline ().rstrip ()
else :
    user_name = args.user
if args.password is None :
    print ("Enter password", end=": ", flush=True)
    user_password = sys.stdin.readline ().rstrip ()
else :
    user_password = args.password
SQL_File_Name = args.SQL_File

# Get Database Connection ...

# Execute SQL and consume results ...
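Putting the pieces together, a typical invocation from the command line might look like the following - the program file name "time_sql.py" is just my illustrative choice here:
python time_sql.py -u scott -p tiger -d orcl query.sql
Any of the connection options left off the command line will be prompted for, and "query.sql" is the file containing the SQL query to be timed.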

Summary

This relatively short and straightforward Python program gives me a utility to time the execution of any SQL query against an Oracle database, including consuming all of the data rows the query produces. For me this has many advantages over other ways of executing SQL queries when investigating performance problems, and is something I use whenever needed. One advantage is that because it executes the original SQL query with no modifications, it will use the same execution plan as when that SQL is submitted by the real application. In turn this means that if I do modify the SQL in some way and manage to "improve" its performance, the new execution plan should be the same one used when the improved SQL is executed from the application code.

Monday, 14 December 2015

Python for the DBA (2) - Data Unloading

I've mentioned before that I like the Python programming language. Not only is it a "clean" language for writing programs in with a rich set of data types and structures for manipulating data, it also has a standard API for database access making it "database neutral". This makes it relatively easy to get data that is in a database into and out of your Python code where you can use the power of Python for whatever kind of data manipulation you need.

To show how easy it is to get data into a Python program I'll show a little utility to unload data from one table in a database to a flat, text file as one record per line with each data field separated by special delimiters.

Assuming the following:
  • Your Python program already has a database connection
    • This is relatively easy, but does vary by specific database driver being used
  • And it has opened the output file to unload to, i.e. it has an open file handle rather than just the file name
  • And it has the name of the table, the field delimiter characters and the record terminator characters in local variables
Then you can write a simple function to call that will unload the data from that table to the specified file:
def unload_table_to_file (db_conn, table_name, out_data_file, field_separator, record_terminator) :
    # Preparation steps before doing the real work
    # Replace literals of special characters in the separator and terminator
    record_terminator = record_terminator.replace ('\\n', '\n')
    field_separator = field_separator.replace ('\\t', '\t')
    
    # Simplest SQL to get the data, assume column ordering is correct
    select_SQL = "select * from " + table_name
    
    # Now do the work - read in all records and output them
    print ("=== Unloading . . .")
    select_cursor = db_conn.cursor ()
    select_cursor.execute (select_SQL)
    for data_row in select_cursor.fetchall () :
        # Convert Python native data types to simple string representations
        output_fields = []
        for data_field in data_row :
            output_fields.append (str(data_field))
        out_data_file.write (field_separator.join (output_fields) + record_terminator)
    print ("=== Finished")
The main lines are:
  • First make sure that if any "escaped characters" are in the delimiters then they are the real escape characters
    • "\t" means the TAB character, and "\n" means the NEWLINE character
  • We just use a "select * from table" to get all the data into the Python code
    • The String concatenate operator is just the "+" operator i.e. "+" is overloaded depending on data type
  • And we execute this "select" via a Cursor created from the database connection passed in
  • The real work happens in 2 nested loops - for each data row, for each data field in each data row
    • The fields from one record are put into a list (essentially equivalent to an array)
    • This is so we can use a single action at the end to concatenate them all together with each data field value separated by the "field separator"
    • Then we "write" out this concatenated list of fields to the output file
  • The "str" function will return the String representation of a data value e.g. convert an integer to its String representation
    • This assumes that there exists a data type specific "str" method for each possible data type
  • The "join" method of a String variable iterates over a list of data, and returns one String of the list contents concatenated together with the value of the String variable between each one.
I think this shows some of the power and elegance of Python that reading all the data from a table can be done with so few lines of code, and that the code itself is so readable. And this code is flexible enough to deal with different values for the field separator characters and the record terminator characters. You could just do a classic comma separated list (',' as the field separator) or something more complicated to avoid issues where a comma may appear in the data itself e.g. "|__|" as the field separator (vertical bar, underscore, underscore, vertical bar). And this flexibility is all handled by this Python code.
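As a usage illustration, a hypothetical call to this function might look like the following, reusing the cx_Oracle connection style from the previous post (the table name and output file name here are made up):
import cx_Oracle

db_conn = cx_Oracle.connect ("scott", "tiger", "orcl")
with open ("atable.unl", "w") as out_data_file :
    # "\\n" arrives as a literal backslash-n, which the function converts to a real newline
    unload_table_to_file (db_conn, "atable", out_data_file, "|__|", "\\n")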

Caveats

This is probably the simplest Python code that achieves the goal of reading all the data in a table and writing it out to a flat, text file. There are several things it does not do at all, and others that could be enhanced.
  • It is assumed that all data types being used can be cleanly converted to valid String representations using "str()"
    • This may not be true for some data types, or you may wish to handle some data types differently e.g. Dates
  • Dates should probably be handled differently, to ensure the full date and time values are output
    • You could execute an "ALTER SESSION" SQL statement to set the value of NLS_DATE_FORMAT to something suitable
    • Python has a native Date / Time data type, "datetime", which Oracle DATE columns are mapped to, and you could test for this (see the sketch after this list)
  • You could add extra functionality such as debugging / tracing messages (use the "logging" module) or other progress messages
    • You may wish to add extra counters to count the number of records and fields per record
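As an illustration of the date handling caveat above, here is a minimal sketch of a type-aware conversion helper - the name "field_to_str" is my own - that could be used in place of the bare "str" call inside the unload loop:
import datetime

def field_to_str (data_field) :
    # Format dates explicitly so the full date and time values are always output
    if isinstance (data_field, datetime.datetime) :
        return data_field.strftime ("%Y-%m-%d %H:%M:%S")
    elif data_field is None :
        # Represent NULL as an empty field rather than the string "None"
        return ""
    else :
        return str (data_field)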

Monday, 26 October 2015

JSON, Python & Trello

JSON is a simple data storage format intended to allow easy data interchange between programs and systems. Data is stored in a text format as objects of "name : value" pairs. And that's about it, other than that objects can be nested (a value can be another whole object) and multiple objects can occur one after another in the data set (a list or array of such objects). This makes it slightly database-like, because you have records (objects) made up of named fields holding data values, with nested records inside other records for any kind of relationship or hierarchy.

The "name : value" storage structure is often termed a Dictionary where a given name is associated with its value - other terms such as Hash Map are used in other programming languages. This makes a JSON data set an almost perfect match for processing in Python which has almost directly corresponding data types of dict for the "name : value" pairs and list for multiple objects of the same type. And Python offers a standard "json" module (library) for importing and exporting such JSON data sets into and out of Python objects (the corresponding methods are called "load" and "dump" for consistency with the naming in the existing "pickle" module). Generally speaking when you load in a JSON data set it is converted to a corresponding Python dict object, that may in turn contain other embedded list and dict objects according to the structure of the JSON data being loaded in.

Trello

A real example of this in use is with Trello, where you can export your To Do lists to a local file as a JSON data set. You could then open this file and read in the JSON data set in Python, and process it one way or another. Note that you can also directly access the live data in a Trello List using a suitable URL - Google the Trello API for more on this.

Trello structures its data as follows in a hierarchy where each thing has a name and an internal, normally hidden identifier:
  • A Board is the top level structure - each Board is separate from each other
  • A Board contains multiple Lists
  • A List contains Cards - normally these are the "actions" you want to do
  • A Card can contain a few subsidiary other data items, such as a Checklist, or Labels (categories)
With the data from one Trello Board now read into a Python dict object it becomes very easy to navigate through this and pull out individual members of a list for further processing. An obvious example is to do a simple reformat of the lists for printing purposes - Trello only lets you print the whole board or an individual card, and a board can run to many, many pages when you have lots of lists and cards (items) in it given the default spacing between items.

Trello does not actually store the data in a strict hierarchy, but instead more in a relational format, where child items will contain the identifier ("id") of the parent item they are part of. Thus the top level JSON data for a Trello Board contains data of:
  • name - name of the board
  • id - identifier of this board, a long hexadecimal number
  • desc - description of the board
  • lists - all the Lists in this Board
  • cards - all the individual Cards in all the Lists
  • checklists - all the Checklists used in different Cards
i.e. most things are stored directly linked off a Board, and not stored nested within each other. This is clearly to allow easy moving of Cards between Lists - only the parent identifier within the Card needs to be updated, not the Lists involved or the Board itself.

The main data sets of lists, cards and checklists come into Python as a list (array) of multiple members, each member being a dict. Each data record for these child data sets contains an "id" field for the parent i.e. "idBoard" in a List, and "idList" in a Card (Checklists are slightly different as the Card does contain a list of the Checklist identifiers).

Example 1 - Printing the names of the lists in a board

We just need to iterate over the "lists" data set of the Board, printing out the "name" of each List. We can also check if each list is open i.e. not closed. Note that I am using Python 3, hence "print" is a true function now.
import json
import sys

NAME = "name"
ID = "id"
DESC = "desc"
CLOSED = "closed"
LISTS = "lists"

json_file = open (sys.argv [1])
trello_board = json.load (json_file)
print ("Board: " + trello_board [NAME])
print ("Lists:")
# trello_board [LISTS] is a Python list, each member being a Python dict for each List
# Loop through all the Lists i.e. one List at a time
for trello_list in trello_board [LISTS] :
    if (not trello_list [CLOSED]) :
        print (trello_list [NAME])
This assumes that the JSON file name is given as the first argument on the command line to the Python program.

Example 2 - Printing out the cards in one list

Assuming that the second command line argument is the first part of the name of a List in the Board, then we can use this to find that List and then print out just its Cards.
import json
import sys

NAME = "name"
ID = "id"
DESC = "desc"
CLOSED = "closed"
LISTS = "lists"
CARDS = "cards"
CHECKLISTS = "checklists"
IDBOARD = "idBoard"
IDLIST = "idList"
IDCHECKLISTS = "idChecklists"

json_file = open (sys.argv [1])
trello_board = json.load (json_file)
print ("Board: " + trello_board [NAME])

# Loop through all Lists in the Board, comparing its name against the input name
for trello_list in trello_board [LISTS] :
    # Only do a partial match on the leading part of the name
    if (trello_list [NAME] [:len(sys.argv [2])] == sys.argv [2] and not trello_list [CLOSED]) :
        print ("List: " + trello_list [NAME])
        # Loop through all Cards in all Lists, checking the parent ID of the Card
        for trello_card in trello_board [CARDS] :
            if (trello_card [IDLIST] == trello_list [ID] and not trello_card [CLOSED]) :
                print (trello_card [NAME])
                if (trello_card [DESC]) :
                    print (trello_card [DESC])
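Extending this to show the Checklists on each Card is a small step, since each Card holds the identifiers of its Checklists as described earlier. A sketch of the extra loop, which would sit inside the Card loop above after the Card details are printed:
# Match each of the Card's Checklist identifiers against the board-level checklists
for checklist_id in trello_card [IDCHECKLISTS] :
    for trello_checklist in trello_board [CHECKLISTS] :
        if (trello_checklist [ID] == checklist_id) :
            print ("  Checklist: " + trello_checklist [NAME])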

Note that in reality you would have extra code to check that the command line arguments were present, and better error handling for things like the file not existing.

Also I actually used "textwrap" in Python to word wrap long lines properly in the output, and indent wrapped lines for better readability - I've just used "print" directly in these examples to keep it simple.

Summary

That's it for JSON and Python, using Trello as an example. You can load a JSON data set from a file into a corresponding Python data structure with just one method call (json.load), and it is then very easy to traverse that data structure, finding and processing the data elements you want by using the "field name" as the index into the dictionary of "name : value" pairs in the Python data structure.

Monday, 17 February 2014

Packaging Python With My Application

To try and keep a longer story as short as possible I needed to package up the Python run time environment to ship along with a Python based application I had written. And in this case the target platform was Windows, though the solution will also work for Linux or any other platform (however most Linux distributions will already have Python on them). I needed to ship Python itself with my application to guarantee that the application would be able to run (had everything it needed), and to avoid complications of requiring the customer to download and install Python themselves (potential issues over version compatibility).

Through a number of blog posts by other people about different "packaging" techniques (see References later) I came up with the following solution that works. This is not the only method of packaging a Python application, and indeed it is quite surprising how many different techniques there are. But this worked for me, and was what I wanted i.e. including Python itself with my Python based application. One of the neat benefits of this for me is that the whole Python run time I need is only 7.5 MB in size, and the main ZIP file of the run time environment is only 2.5 MB in size, which shows how compressible it all is.

Packaging Python with my application

First I create a directory (folder for Windows people) for my application, and put all my application's Python files in there.

Then I create a sub-directory in this to put Python itself into e.g. Python33_Win.

Into this I put the following files:
_socket.pyd
cx_Oracle.pyd
msvcr100.dll
pyexpat.pyd
python.exe
python33.dll
python33.zip
LICENSE.txt
Note that "cx_Oracle.pyd" is needed because my application makes a connection to an Oracle database to do its work. Also "msvcr100.dll" is technically a Microsoft DLL that is needed by programs written in C, which the Python interpreter is. Microsoft allows this DLL to be copied for the purpose of running such C based programs.

The "python33.zip" is something I created, and into which I put the rest of the necessary Python run time files. There are quite a lot of these, all taken from the directory where you installed Python on your own system:
  • All the ".py" files in the top level Python folder
  • The following folders including any sub-folders:-
    • collections
    • concurrent
    • ctypes
    • curses
    • dbm
    • distutils
    • email
    • encodings
    • html
    • http
    • importlib
    • logging
    • pydoc_data
    • site-packages
    • unittest
    • urllib
    • venv
    • wsgiref
    • xml
Then I wrote a wrapper script to run my application via the Python run time included. In this case it is a Windows batch script, and it exists in the folder above my application source code. My application needs two command line arguments provided e.g. user name and password.
MyAppName\Python33_Win\python MyAppName\myappname.py %1 %2
That's it, and it works.

How does it work?

The functionality to dynamically load the various libraries it needs at run time is clearly built into the Python interpreter, i.e. into "python.exe". Examples of these are the "*.pyd" files explicitly included in the Python directory. However, it also has the functionality to open up a ZIP file and look inside that for the libraries it needs. Thus we can take most of the Python run time environment files and put them into a PYTHON33.ZIP file, and Python will look in there to find the files it needs. This makes packaging up Python pretty simple.

The exceptions to this are "python33.dll", the Microsoft C DLL, and a few PYD files (which are Python dynamic libraries, a bit like Windows DLLs). These will not be found in the ZIP file, as they seem to be needed before Python gets around to opening such a ZIP file.
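The ZIP searching behaviour is standard Python and easy to demonstrate in miniature - any ZIP archive of ".py" files placed on the module search path becomes importable (the archive and module names below are purely illustrative):
import sys

# Any ZIP archive on sys.path is searched for importable modules
sys.path.insert (0, "mymodules.zip")  # hypothetical archive containing extra .py files
import mymodule                       # found inside the ZIP archive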

Further Notes

  • These files and directories are what I needed to make my application work. The particular set needed can be different for each Python based application, depending on what modules you import into your program at run time.
  • I tried using the "modulefinder" module to report on which modules were actually being referenced when my application was run. This helped reduce the total number of files I needed to include in the ZIP file (see the sketch after this list).
  • The ZIP file is named "python33.zip" because I am using Python version 3.3, and its DLL is named "python33.dll" i.e. it looks for a ZIP file with a matching name of its version.
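For reference, the basic "modulefinder" usage is only a few lines - the script name below is just illustrative:
from modulefinder import ModuleFinder

finder = ModuleFinder ()
finder.run_script ("myappname.py")  # hypothetical: the application's main script
# Print the names of all the modules the script imports, directly or indirectly
for module_name in sorted (finder.modules.keys ()) :
    print (module_name)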

References

The main articles I found that helped me and pointed out this way of packaging up Python were:
  • Distributing a Python Embedding Program which states that everything can be put into a ZIP file, except for some special files, and gives an example where those files are listed.
  • How to Distribute Commercial Python Applications which describes the different packaging options available for Python based applications, and concludes that "For complex scenarios, my recommendation is to go with bundling CPython yourself and steer clear of the freezers".
    • It does not however tell you how to "bundle CPython yourself" - it just states that it is possible to do it.

Tuesday, 17 July 2012

Python for the DBA (1)

I really like the Python programming language for general purpose programming and quickly putting together little utilities. There are similarities to Perl in terms of being an interpreted language with powerful data processing capabilities, but there the direct comparisons end. Python is a more modern language than Perl, and has a relatively clean design (nothing is ever perfect though, and Python itself has evolved over the years). The main reasons I prefer Python over Perl are:
  • Very clean syntax - no need for obscure characters in front of variable names and so on
  • Very readable code - it is generally clear what a piece of code is doing
  • Minimal syntax dressing overhead - Python avoids the need for begin / end block statement markers
  • True support for functions
    • Perl just pushes all the arguments into a single list that you must disassemble
  • Strongly typed - Python supports multiple data types and checks at run time that operations are valid
  • Rich set of types - number, string, list, dictionaries, plus others e.g. set, tuple
  • Supports classic "function" based programming - just like 'C' does
    • Very simple to write code to directly do what you want, and modularise common code into functions
  • Also supports full "object oriented programming" - full support for Classes as first level objects
    • Perl does not truly do classes as first level objects
I'm not trying to say that Python is better than Perl in all cases, or other programming languages. I'm just saying that Python is a very good and usable programming language, and that I prefer it over Perl now. With Perl I find it gets very confusing when you try and use any level of complexity and the syntax is not obvious or consistent, whereas with Python I don't get any of this because it is such a cleanly designed language. With Python I can easily build a small application by just putting together the necessary source code while using functions for modularity (what I call "just build and run"), avoiding the overheads of defining classes with fully object oriented programming. However, if I have a more complex set of requirements then I can do a full blown object based solution using classes within Python (too complex to just build, so some initial design is needed, probably with test cases too).

Python is very useful for a variety of different scenarios:
  • building a small application or utility by just writing the code - no "compile and link" steps as it is interpreted
  • building something iteratively (top down design & development) as your code is always runnable (being interpreted)
  • prototyping object based code quicker and easier than compiled languages such as Java
  • ability to extend Python with your own libraries - it is written in C and can call your own compiled libraries
  • quickly prototyping real Java code that uses standard Java Classes in the JVM via Jython 
    • Jython is a Python interpreter written in Java
    • this also uses far fewer lines of source code than would be needed in Java, because of Python's fundamentally different design
Again, some of these capabilities are not unique to Python, but the combination of them all together make it a winner for me.

Python Overview

Python should be straightforward enough to understand on reading it. The only major difference to other languages is that Python does not have begin / end statement block markers and instead relies on statement indenting to determine which statements are part of the same block. You will see that statements with embedded statement blocks in them have a colon (':') at the end of the first line, and the following lines are indented. Initially it takes some getting used to, but it does end up with less typing on your part and a consistent code layout.
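A trivial sketch of this:
# The colon introduces a block, and indentation alone marks its extent
for i in range (3) :
    if i > 0 :
        print ("inside both blocks", i)
print ("outside the loop")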

Although Python supports what I call direct coding (write some statements and then just run them), it is really fully object oriented behind the scenes. Thus you see many standard functions actually returning objects, against which you invoke one of their methods (see Database Example below). The syntax of "object.method (arguments)" occurs frequently, rather than the non-object way of "function (object, other-arguments)".

Python Development

You could edit and run Python programs from the command line ("python filename.py" - the ".py" extension is just a common convention), or you could use an Integrated Development Environment, such as Eclipse with the PyDev plugin. The nice thing about this is that it does syntax checking for you as you type, so you can easily spot various errors before you run your code. And you can run your Python application from within the development environment, so you don't have to leave the editor each time you want to test something.

Python Database Access

Like Perl, Python defines a standard API for database access. The intention is that developers can write code based on this standard API, and a variety of drivers can be provided for access to different databases. By changing the driver used you can connect to different databases, but the bulk of your code remains the same.

There are a bunch of drivers out there for Oracle, but the most common one seems to be cx_Oracle.

Python Database Examples

Using Python is relatively straightforward. I won't describe the database API in detail, as it is obvious when you use it. In summary, you call a "connect" library function with connection details and get back a connection handle (really an object). Using this you can then create a new statement handle, execute it, and fetch back any data or other results. One neat thing is that Python has a "fetchall" method that fetches back all data rows for a SELECT into a single list (actually a list of rows, each row being a list of columns). This then lets you traverse the data using Python logic, with no more fetch calls. I assume that there are efficiencies with the "fetchall" call over how the data is transferred back over the network, but I have not done any tests about this. It will of course end up using more memory within your Python program to hold all the results fetched.

Example

Connect to an Oracle database and execute a SELECT from a table, printing out the data fetched.
from cx_Oracle import connect

conn = connect ("scott", "tiger", "orcl")
# Or: conn = connect('scott/tiger@orcl')
curs = conn.cursor()
curs.execute("SELECT name, value FROM atable")
rows = curs.fetchall()
for i in range(len(rows)):
    print "Row", i, "name", rows[i][0], "value", rows[i][1]
Notes:
  • len is a built in function that returns the number of entries in a list
  • range is a built in function that produces a list of numbers from 0 to one less than the supplied value
  • List members can be accessed using array like notation
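For comparison, the same loop can be written by iterating over the rows directly, which avoids the explicit index and is the more common Python idiom:
for name, value in rows:
    print("name", name, "value", value)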