The Data Charmer: SQL

Showing posts with label SQL. Show all posts

Sunday, January 15, 2017

MySQL group replication: installation with Docker

Overview

MySQL Group Replication was released as GA with MySQL 5.7.17. It is essentially a plugin that, when enabled, allows users to set replication with this new way.

There has been some confusion about the stability and usability of this release. Until recently, MySQL Group Replication (MGR) was only available in the Labs, which traditionally denotes a preview or an use-at-your-own-risk feature. Several months ago we saw the release of Group Replication as a Docker image, which allowed users to deploy a peer-to-peer cluster (every node is a master.) However, about one month after such release, word came from Oracle discouraging this setup, and inviting users to use Group Replicator in Single Primary mode which is functionally equivalent to traditional replication, with just some synchronous component more. There hasn't been an update of MGR for Docker since.

BTW, some more confusion came from the use of "synchronous replication" to refer to Group Replication operations. In reality, what in many presentations was called synchronous replication is only a synchronous transfer of binary logs data. The replication itself, i.e. the operation that makes a node able to retrieve the data inserted in the master, is completed asynchronously. Therefore, if you looked at MGR as a way of using multiple masters without conflicts, this is not the solution.

What we have is a way of replicating from a node that is the Primary in the group, with some features designed to facilitate high availability solutions. And all eyes are on the next product, which is based on MGR, named MySQL InnoDB Cluster which is MGR + an hormone pumped MySQL Shell (released with the same version number 1.0.5 in two different packages,) and MySQL-Router.

MGR has several limitations, mostly related to multi-primary mode.

Another thing that users should know is that the performance of MGR is inferior to that of asynchronous replication, even in Single-Primary mode. As an example, loading the test employees database takes 92 seconds in MGR, against 49 seconds in asynchronous replication (same O.S., same MySQL version, same server setup.)

Installing MySQL Group Replication

One of the biggest issue with MGR has been the quality of its documentation, which for a while was just lack of documentation altogether. What we have now has a set of instructions that refers to installing group replication in three nodes on the same host. You know, sandboxes, although without the benefit of using a tool to simplify operations. It's just three servers on the same host, and you drive with stick shift.

What we'll see in this post is how to set group replication using three servers in Docker. The advantage of using this approach is that the servers look and feel like real ones. Since the instructions assume that you are only playing with sandboxes (an odd assumption for a GA product) we lack the instructions for a real world setup. The closest thing to a useful manual is the tutorial given by Frédéric Descamps and Kenny Gryp at PerconaLive Amsterdam in October. The instructions, however, are muddled up by the fact that they were using the still unreliable InnoDB Cluster instead of a bare bones Group Replicator. What follows is my own expansion of the sandboxed rules as applied to distinct servers.

The environment:

I am using Docker 1.12.6 on Linux, and the image for mysql/mysql-server:5.7.17. I deploy three containers, with a customized my.cnf containing the bare minimum options to run Group Replication. Here's the template for the configuration files:

$ cat my-template.cnf
[mysqld]
user=mysql
server_id=_SERVER_ID_
gtid_mode=ON
enforce_gtid_consistency=ON
master_info_repository=TABLE
relay_log_info_repository=TABLE
binlog_checksum=NONE
log_slave_updates=ON
log_bin=mysql-bin
relay-log=relay
binlog_format=ROW
log-error=mysqld.err

transaction_write_set_extraction=XXHASH64
loose-group_replication_group_name="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
loose-group_replication_start_on_boot=off
loose-group_replication_local_address= "172.19.0._IP_END_:6606"
loose-group_replication_group_seeds= "172.19.0.2:6606,172.19.0.3:6606,172.19.0.4:6606"
loose-group_replication_ip_whitelist="172.19.0.2,172.19.0.3,172.19.0.4,127.0.0.1"
loose-group_replication_bootstrap_group= off

Here I take a shortcut. Recent versions of Docker assign a predictable IP address to new containers. To make sure I get the right IPs, I use a private network to deploy the containers. In a perfect world, I should use the container names for this purpose, but the manual lacks the instructions to set up the cluster progressively. For now, this method requires full knowledge about the IPs of the nodes, and I play along with what I have.

This is the deployment script:

#!/bin/bash
exists_net=$(docker network ls | grep -w group1 )
if [ -z "$exists_net" ]
then
    docker network create group1
fi
docker network ls

for node in 1 2 3
do
    export SERVERID=$node
    export IPEND=$(($SERVERID+1))
    perl -pe 's/_SERVER_ID_/$ENV{SERVERID}/;s/_IP_END_/$ENV{IPEND}/' my-template.cnf > my${node}.cnf
    datadir=ddnode${node}
    if [ ! -d $datadir ]
    then
        mkdir $datadir
    fi
    unset SERVERID
    docker run -d --name=node$node --net=group1 --hostname=node$node \
        -v $PWD/my${node}.cnf:/etc/my.cnf \
        -v $PWD/data:/data \
        -v $PWD/$datadir:/var/lib/mysql \
        -e MYSQL_ROOT_PASSWORD=secret \
        mysql/mysql-server:5.7.17

    ip=$(docker inspect --format '{{ .NetworkSettings.Networks.group1.IPAddress}}' node${node})
    echo "${node} $ip"
done

This script deploys three nodes, called node1, node2, and node3. For each one, the template is modified to use a different server ID. They use an external data directory created on the current directory (see Customizing MYSQL in Docker for more details on this technique.) Moreover, each node can access the folder /data, which contains this set of SQL commands:

reset master;
SET SQL_LOG_BIN=0;
CREATE USER rpl_user@'%';
GRANT REPLICATION SLAVE ON *.* TO rpl_user@'%' IDENTIFIED BY 'rpl_pass';
SET SQL_LOG_BIN=1;
CHANGE MASTER TO MASTER_USER='rpl_user', MASTER_PASSWORD='rpl_pass' FOR CHANNEL 'group_replication_recovery';
INSTALL PLUGIN group_replication SONAME 'group_replication.so';

Operations

After deploying the containers using the above script, I wait a few seconds to give time to the servers to be ready. I can peek at the error logs, which are in the directories ddnode1, ddnode2, and ddnode3, as defined in the installation command. Then I run the SQL code:

$ for N in 1 2 3; do docker exec -ti node$N bash -c 'mysql -psecret < /data/user.sql' ; done

At this stage, the plugin is installed in all three nodes. I can start the cluster:

$ docker exec -ti node1 mysql -psecret
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 5
Server version: 5.7.17-log MySQL Community Server (GPL)

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> SET GLOBAL group_replication_bootstrap_group=ON;
Query OK, 0 rows affected (0.00 sec)

mysql> START GROUP_REPLICATION;
Query OK, 0 rows affected (1.14 sec)

mysql>SET GLOBAL group_replication_bootstrap_group=OFF;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | ecba1582-db68-11e6-a492-0242ac130002 | node1       |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
1 row in set (0.00 sec)

The above operations have started the replication with the bootstrap, an operation that must be executed only once, and that defines the primary node.

After setting the replication, I can enter some data, and then see what happens in the other nodes:

mysql> create schema test;
Query OK, 1 row affected (0.01 sec)

mysql> use test
Database changed
mysql> create table t1 (id int not null primary key, msg varchar(20));
Query OK, 0 rows affected (0.06 sec)

mysql> insert into t1 values (1, 'hello from node1');
Query OK, 1 row affected (0.01 sec)

mysql> show binlog events;
+------------------+------+----------------+-----------+-------------+----------------------------------------------------------------------------+
| Log_name         | Pos  | Event_type     | Server_id | End_log_pos | Info                                                                       |
+------------------+------+----------------+-----------+-------------+----------------------------------------------------------------------------+
| mysql-bin.000001 |    4 | Format_desc    |         1 |         123 | Server ver: 5.7.17-log, Binlog ver: 4                                      |
| mysql-bin.000001 |  123 | Previous_gtids |         1 |         150 |                                                                            |
| mysql-bin.000001 |  150 | Gtid           |         1 |         211 | SET @@SESSION.GTID_NEXT= 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:1'          |
| mysql-bin.000001 |  211 | Query          |         1 |         270 | BEGIN                                                                      |
| mysql-bin.000001 |  270 | View_change    |         1 |         369 | view_id=14845163185775300:1                                                |
| mysql-bin.000001 |  369 | Query          |         1 |         434 | COMMIT                                                                     |
| mysql-bin.000001 |  434 | Gtid           |         1 |         495 | SET @@SESSION.GTID_NEXT= 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:2'          |
| mysql-bin.000001 |  495 | Query          |         1 |         554 | BEGIN                                                                      |
| mysql-bin.000001 |  554 | View_change    |         1 |         693 | view_id=14845163185775300:2                                                |
| mysql-bin.000001 |  693 | Query          |         1 |         758 | COMMIT                                                                     |
| mysql-bin.000001 |  758 | Gtid           |         1 |         819 | SET @@SESSION.GTID_NEXT= 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:3'          |
| mysql-bin.000001 |  819 | Query          |         1 |         912 | create schema test                                                         |
| mysql-bin.000001 |  912 | Gtid           |         1 |         973 | SET @@SESSION.GTID_NEXT= 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:4'          |
| mysql-bin.000001 |  973 | Query          |         1 |        1110 | use `test`; create table t1 (id int not null primary key, msg varchar(20)) |
| mysql-bin.000001 | 1110 | Gtid           |         1 |        1171 | SET @@SESSION.GTID_NEXT= 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:5'          |
| mysql-bin.000001 | 1171 | Query          |         1 |        1244 | BEGIN                                                                      |
| mysql-bin.000001 | 1244 | Table_map      |         1 |        1288 | table_id: 219 (test.t1)                                                    |
| mysql-bin.000001 | 1288 | Write_rows     |         1 |        1341 | table_id: 219 flags: STMT_END_F                                            |
| mysql-bin.000001 | 1341 | Xid            |         1 |        1368 | COMMIT /* xid=144 */                                                       |
+------------------+------+----------------+-----------+-------------+----------------------------------------------------------------------------+
19 rows in set (0.00 sec)

The binary log events show that we are replicating using the ID of the group, instead of the ID of the single server.

In the other two nodes I run the operation a bit differently:

$ docker exec -ti node2 mysql -psecret
mysql> select * from performance_schema.global_variables where variable_name in ('read_only', 'super_read_only');
+-----------------+----------------+
| VARIABLE_NAME   | VARIABLE_VALUE |
+-----------------+----------------+
| read_only       | OFF            |
| super_read_only | OFF            |
+-----------------+----------------+
2 rows in set (0.01 sec)

mysql>  START GROUP_REPLICATION;
Query OK, 0 rows affected (5.62 sec)

mysql> select * from performance_schema.global_variables where variable_name in ('read_only', 'super_read_only');
+-----------------+----------------+
| VARIABLE_NAME   | VARIABLE_VALUE |
+-----------------+----------------+
| read_only       | ON             |
| super_read_only | ON             |
+-----------------+----------------+
2 rows in set (0.01 sec)

mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | ecba1582-db68-11e6-a492-0242ac130002 | node1       |        3306 | ONLINE       |
| group_replication_applier | ecf2eae5-db68-11e6-a492-0242ac130003 | node2       |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
2 rows in set (0.01 sec)

Now the cluster has two nodes, and I've seen that the nodes are automatically defined as read-only. I can repeat the same operation in the third one.

$ docker exec -ti node2 mysql -psecret
mysql> START GROUP_REPLICATION;
Query OK, 0 rows affected (2.35 sec)

mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | ecba1582-db68-11e6-a492-0242ac130002 | node1       |        3306 | ONLINE       |
| group_replication_applier | ecf2eae5-db68-11e6-a492-0242ac130003 | node2       |        3306 | ONLINE       |
| group_replication_applier | ed259dfc-db68-11e6-a4a6-0242ac130004 | node3       |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)

What about the data? It's been replicated:

mysql> show schemas;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sys                |
| test               |
+--------------------+
5 rows in set (0.00 sec)

mysql> show tables from test;
+----------------+
| Tables_in_test |
+----------------+
| t1             |
+----------------+
1 row in set (0.01 sec)

Monitoring

In this flavor of replication there is no SHOW SLAVE STATUS. Everything I've got is in performanceschema tables and in mysql.slavemasterinfo and mysql.slaverelayloginfo, and sadly it is not a lot.

mysql> select * from replication_group_member_stats\G
*************************** 1. row ***************************
                      CHANNEL_NAME: group_replication_applier
                           VIEW_ID: 14845163185775300:3
                         MEMBER_ID: ecba1582-db68-11e6-a492-0242ac130002
       COUNT_TRANSACTIONS_IN_QUEUE: 0
        COUNT_TRANSACTIONS_CHECKED: 3
          COUNT_CONFLICTS_DETECTED: 0
COUNT_TRANSACTIONS_ROWS_VALIDATING: 0
TRANSACTIONS_COMMITTED_ALL_MEMBERS: aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:1-6
    LAST_CONFLICT_FREE_TRANSACTION: aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:5
1 row in set (0.00 sec)


mysql> select * from replication_connection_status\G
*************************** 1. row ***************************
             CHANNEL_NAME: group_replication_recovery
               GROUP_NAME:
              SOURCE_UUID:
                THREAD_ID: NULL
            SERVICE_STATE: OFF
COUNT_RECEIVED_HEARTBEATS: 0
 LAST_HEARTBEAT_TIMESTAMP: 0000-00-00 00:00:00
 RECEIVED_TRANSACTION_SET:
        LAST_ERROR_NUMBER: 0
       LAST_ERROR_MESSAGE:
     LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 2. row ***************************
             CHANNEL_NAME: group_replication_applier
               GROUP_NAME: aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
              SOURCE_UUID: aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
                THREAD_ID: NULL
            SERVICE_STATE: ON
COUNT_RECEIVED_HEARTBEATS: 0
 LAST_HEARTBEAT_TIMESTAMP: 0000-00-00 00:00:00
 RECEIVED_TRANSACTION_SET: aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:1-6
        LAST_ERROR_NUMBER: 0
       LAST_ERROR_MESSAGE:
     LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
2 rows in set (0.00 sec)

Compared to regular replication, we lose the ID of the node where the data was originated. Instead, we get the ID of the group replication (which we set in the configuration file.) This is useful for a smoother operation of replacing the primary node (a.k.a. the master) with another node, but we have lost some valuable information that could have been added to the output rather than simply being replaced. Another valuable piece of information that is missing is the transactions that were executed (we only see RECEIVED_TRANSACTION_SET.) As in regular replication, we can get this information with "SHOW MASTER STATUS" or "SELECT @@global.gtid_executed", but as mentioned in improving the design of MySQL replication there are several flaws in this paradigm. What we see in MGR is a reduction of replication monitoring data, while we would have expected some improvement, given the complexity of the operations for this new technology.

Summing up

MySQL Group Replication is an interesting technology. If we consider it in the framework of a component for high availability (which will be completed when the InnoDB Cluster is released) it might improve the workflow of many database users.

As it is now, however, it gives the feeling of being a rushed up piece of software that does not offer any noticeable advantage to users, especially considering that the documentation released with it is far below the standards of other MySQL products.

Thursday, September 23, 2010

Book review : SQL Antipatterns

SQL Antipatterns, by Bill Karwin

I remember that when I finished reading The Lord Of The Rings, I felt a pang of disappointment. "What? Already finished? What am I going to read now? What can give me the same pleasure and sense of accomplishment that these wonderful pages have given me?"
That's how I felt when I came to the last page of SQL Antipatterns. And, no, Bill Karwin doesn't tell imaginary tales from a fictitious world. This book is full of very real and very practical advice, but all the material is presented with such grace and verve that I could not put it down until the very end. I read it cover to cover in just a few hours, and I savored every page.

What is this Antipatterns, anyway? The title may deceive a casual bookshop browser into believing that it's about some philosophical database theory. Digging further, you realize that it's a collection of best practice in database handling, but told from the side of the bad practice. Every chapter starts with the story of a disaster, followed by an explanation of why it happened, and how to recognize the symptoms. Along the way, you learn what people do to fix the problem when they don't understand where exactly the trouble comes from. And then comes the best practice section, where you are told of several ways of fixing the problem for good.
It's a pleasure from the start. When I read the introductory stories, I recognized mistakes that I have witnessed in my past experience and some that I made myself. It was a fascinating sequence of "look at this mess!", "Here is what you shouldn't do", "It happens when ...", and "this is what you should do".
The more I read, the more I wanted to read. And after a few chapters, a pattern appeared among the Antipatterns. The stories are not a casual collection of database mistakes. There is a thread. Most of the stories tell you what happens when you violate the relational theory principles, and the best practice solutions focus on how to apply relational principles in real life.
The relationship between the horror stories and the relational database theory becomes apparent when you get to the appendix. In other books, you can often skip the appendixes as a reminder of something that you know already. Not here. If you skip this one, you will miss a sort of Ellery Queen ending, where the detective puts together all the pieces of the mystery for the amazed audience.

I feel I would do a disservice to Bill if I revealed more about the book. I may have already spoiled some of the surprise by disclosing the inner structure of the book. But certainly I won't give away any of the juicy stories that make this book an invaluable tool for every database professional.
I can only say this. When I read a technical book, I usually find something wrong, or debatable, or some technical solution that I could improve upon. Not in this book. I completely agree with the theory and the practice that is suggested from start to finish. My QA ego suffers at the idea of not having found anything wrong, but this slight disappointment is more than compensated by the pleasure of reading such a beautiful book. Well done, Bill! I am already queuing to buy your next one!

Monday, May 31, 2010

OpenSQLCamp EU 2010 - Call for participation

The European OpenSQLCamp 2010 will take place in parallel to the Free and Open Source Conference 2010 (FrOSCon) on Saturday 21st and Sunday 22nd August at the Fachhochschule Bonn-Rhein-Sieg in St. Augustin, Germany. St. Augustin is located close to Bonn and Cologne.
The Call for Participation is now online.

The event is organized by yours truly and Felix Schupp, and we are open to cooperation from other volunteers.
Specifically, we need help to beat the drum. Even if you can't participate, we will appreciate your help in making the Call for Participation known.
OpenSQLCamp2010 will use the FrOSCon's Pentabarf conference coordination system to collect talk submissions and perform the organizing and scheduling of the talks.
Please create an account there, if you don't have one already. Once you have activated your account via the email address you provided, please log into the system and create a new event. Make sure to select track OpenSQLCamp for your submission!

IMPORTANT! - FrOSCon uses CA certificates. If you browser does not recognize them, then you need to Import the CAcert Root Certificate before using the CfP pages.

The deadline for submitting your proposal is Sunday, July 11th, 2010 (12:00pm PST).

Thursday, August 20, 2009

MySQL Sandbox and Spider at FrOSCon and OpenSQLCamp

FrOSCon and the OpenSQLCamp are about to start.
I am packing for Sankt Augustin, where I will attend the fourth edition of FrOSCon and the second OpenSQLCamp. I will have two sessions, Sharding for the masses, about the Spider storage engine and MySQL Sandbox 3, about one of my favorite tools.

The program is very rich. There will be several tracks in the main event and in the associated conferences. If you have any involvement or simply some curiority in open source matters, You will find something interesting at FrOSCon.

Wednesday, July 22, 2009

OpenSQLCamp democracy

We have seen this before. Actually, we got the idea from Drupal, where talk proposals are public, and the most voted ones get in the schedule. Nonetheless, it's a pleasure to see that a transparent voting system is accepted and used.

The OpenSQLCamp 2009 European edition, is under scrutiny. There are 27 session proposals, from which we will need to get 12 in the schedule.
The open voting is done via Twitter or the mailing list.
I have a good feeling about it. Since I am proposing a public talk, I must be prepared for public scrutiny. It's only fair that my proposal is evaluated by the same audience that will later decide to come see my presentation. It's an incentive to write a clear and compelling proposal. I can't get away with "I am well known. I will say something about replication."
If you are attending OpenSQLCamp at FrOSCon, let us know what you would like to see. Let the best proposals win.

Tuesday, June 09, 2009

Speaking at FrOSCon 2009 and getting ready to OpenSQLCamp-Europe

For the fourth time in a row, I will be speaking at FrOSCon, one of the most charming open source events in Europe.
Hosted in the bright environment of the Department of Computer Science of the University of Applied Sciences Bonn-Rhein-Sieg, this event will get you hooked from the beginning. The organization is done by volunteers, who have always done an amazing job, with even better results than more expensive and famous conferences.
This year, there will be some more action than ever before. In addition to the main event, the organizers have given away a few developers rooms, to let some projects build their own event within the main one. There will be a Java subconference, and, closer to my interests, the European edition of the OpenSQLCamp 2009, which applies to all open database related projects, such as MySQL, PostgreSQL, JavaDB, and many others that are listed in the announcements page. The principal organizer is Lenz Grimmer, who is looking for volunteers to share the burden.
The organizers have also launched a creative contest, which most geeks will find irresistible. Check it out!
FrOSCon started as one of the many LAMP events, and in its fourth year it has emerged as an example for others. I warmly recommend it to all people interested in open source.

Tuesday, October 07, 2008

Preparing for Open SQL Camp

The Open SQL Camp will take place in Charlottesville, VA, USA, on November 14, 15, and 16.
Attendees are requested to register in the event's Wiki, and if you are interested in presenting something, there is a mailing list to discuss your intended topics.

I have proposed a topic about the MySQL community driven replication monitoring project, a hackish solution to the old problem of replication monitoring, implemented entirely on the server side, without using system applications.

Wednesday, July 30, 2008

Don't guess. Test! - A sample database with test suite

Some time ago, with the help of Patrick Crews, I built a sample database for testing.
Now this database is published as a stand-alone project on Launchpad.

What's special about it?
Unlike the previous databases used in MySQL documentation and tutorials, this database has some weight. The total data is over 160 MB, distributed across 6 table, for a total of about 4 million records. It is not huge, but it is large enough to be non-trivial.
The second important feature is that this database comes with a test suite. This will allow you to make sure that you have loaded the right data.

Getting started

Using the sample database is trivial.
Open the test db downloads page and get the current full archive (25 MB) containing the installation script, the dump files, and the test suite. Then, expand it.

$ tar -xjf $HOME/Downloads/employees_db-full-1.0.4.tar.bz2
$ cd employees_db/
$ vim employees.sql

Edit the employee.sql script, and uncomment the storage engine that you want to use for your tests. By default, it uses InnoDB.

   set storage_engine = InnoDB;
-- set storage_engine = MyISAM;
-- set storage_engine = Falcon;
-- set storage_engine = PBXT;
-- set storage_engine = Maria;

Then, load the files.

$ mysql -t < employees.sql
+-----------------------------+
| INFO                        |
+-----------------------------+
| CREATING DATABASE STRUCTURE |
+-----------------------------+
+------------------------+
| INFO                   |
+------------------------+
| storage engine: MyISAM |
+------------------------+
+---------------------+
| INFO                |
+---------------------+
| LOADING departments |
+---------------------+
+-------------------+
| INFO              |
+-------------------+
| LOADING employees |
+-------------------+
+------------------+
| INFO             |
+------------------+
| LOADING dept_emp |
+------------------+
+----------------------+
| INFO                 |
+----------------------+
| LOADING dept_manager |
+----------------------+
+----------------+
| INFO           |
+----------------+
| LOADING titles |
+----------------+
+------------------+
| INFO             |
+------------------+
| LOADING salaries |
+------------------+

Finally, test that your data is what it is supposed to be. The test suite include two methods, one with MD5 and one with SHA1.

$ time mysql -t < test_employees_sha.sql
+----------------------+
| INFO                 |
+----------------------+
| TESTING INSTALLATION |
+----------------------+
+--------------+------------------+------------------------------------------+
| table_name   | expected_records | expected_crc                             |
+--------------+------------------+------------------------------------------+
| employees    |           300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments  |                9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager |               24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp     |           331603 | f16f6ce609d032d6b1b34748421e9195c5083da8 |
| titles       |           443308 | d12d5f746b88f07e69b9e36675b6067abb01b60e |
| salaries     |          2844047 | b5a1785c27d75e33a4173aaa22ccf41ebd7d4a9f |
+--------------+------------------+------------------------------------------+
+--------------+------------------+------------------------------------------+
| table_name   | found_records    | found_crc                                |
+--------------+------------------+------------------------------------------+
| employees    |           300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments  |                9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager |               24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp     |           331603 | f16f6ce609d032d6b1b34748421e9195c5083da8 |
| titles       |           443308 | d12d5f746b88f07e69b9e36675b6067abb01b60e |
| salaries     |          2844047 | b5a1785c27d75e33a4173aaa22ccf41ebd7d4a9f |
+--------------+------------------+------------------------------------------+
+--------------+---------------+-----------+
| table_name   | records_match | crc_match |
+--------------+---------------+-----------+
| employees    | OK            | ok        |
| departments  | OK            | ok        |
| dept_manager | OK            | ok        |
| dept_emp     | OK            | ok        |
| titles       | OK            | ok        |
| salaries     | OK            | ok        |
+--------------+---------------+-----------+

real 0m37.067s
user 0m0.007s
sys 0m0.009s

$ time mysql -t < test_employees_md5.sql
+----------------------+
| INFO                 |
+----------------------+
| TESTING INSTALLATION |
+----------------------+
+--------------+------------------+----------------------------------+
| table_name   | expected_records | expected_crc                     |
+--------------+------------------+----------------------------------+
| employees    |           300024 | 4ec56ab5ba37218d187cf6ab09ce1aa1 |
| departments  |                9 | d1af5e170d2d1591d776d5638d71fc5f |
| dept_manager |               24 | 8720e2f0853ac9096b689c14664f847e |
| dept_emp     |           331603 | c2c4fc7f0506e50959a6c67ad55cac31 |
| titles       |           443308 | bfa016c472df68e70a03facafa1bc0a8 |
| salaries     |          2844047 | fd220654e95aea1b169624ffe3fca934 |
+--------------+------------------+----------------------------------+
+--------------+------------------+----------------------------------+
| table_name   | found_records    | found_crc                        |
+--------------+------------------+----------------------------------+
| employees    |           300024 | 4ec56ab5ba37218d187cf6ab09ce1aa1 |
| departments  |                9 | d1af5e170d2d1591d776d5638d71fc5f |
| dept_manager |               24 | 8720e2f0853ac9096b689c14664f847e |
| dept_emp     |           331603 | c2c4fc7f0506e50959a6c67ad55cac31 |
| titles       |           443308 | bfa016c472df68e70a03facafa1bc0a8 |
| salaries     |          2844047 | fd220654e95aea1b169624ffe3fca934 |
+--------------+------------------+----------------------------------+
+--------------+---------------+-----------+
| table_name   | records_match | crc_match |
+--------------+---------------+-----------+
| employees    | OK            | ok        |
| departments  | OK            | ok        |
| dept_manager | OK            | ok        |
| dept_emp     | OK            | ok        |
| titles       | OK            | ok        |
| salaries     | OK            | ok        |
+--------------+---------------+-----------+

real 0m33.453s
user 0m0.007s
sys 0m0.009s

Now we're ready to use it.
The test-db project is open. Participate!

Wednesday, March 19, 2008

SQL injection and bad programming practice

I live in a town that is the nemesis of e-commerce applications. The name of my town is Quartu Sant'Elena
Notice that the name contains an apostrophe, which for all practical purposes is represented by a single quote.
Single quotes have a bad reputation, because they may be a symptom of SQL injection. Whenever I enter the name of my town in a web form to buy something, I hold my breath, because I dread what comes next.
The smartest applications have a Ajax interface with online completion, and take the name of the town without problems. The less advanced ones show a multiple choice list containing my town name.
The bad ones refuse the name of the town as invalid, and force me to enter an alternate spelling (Quartu S.Elena), which is recognized by most Italians as being equivalent.
The very bad ones, after forcing me to change the name of the town, refuse my credit card as invalid. The reason? The billing address of my credit card does not match with the one I entered in the web form.
The very terrible ones check the credit card billing address some days after the transaction was completed. I bought a domain name from a big registrar company. They accepted the credit card and assigned me the domain, which I started using immediately. Three days later, I got a message saying that my credit card charge was not being accepted. The domain was blocked, so I could not register it with another company, and the company did not solve my problem, despite hours of phone calls. All for a lazy programming practice!
Real SQL injection prevention is achieved by input checking and string escaping, not by blindly rejecting everything that looks like a quote.

The Data Charmer