
Thursday, July 12, 2018

Should I Run Postgres on Kubernetes? Part II

Having covered some of the reasons why you would want to run PostgreSQL on Kubernetes, I'm now going to cover some of the reasons why you wouldn't want to.  This should help you make a balanced judgment about whether or not it's the right next move for you.

Not a Kubernetes Shop


Of course, everything I said about an integrated environment applies in reverse; if you're not already running other services on Kubernetes, or planning to migrate them, then moving your databases to Kubernetes is a lot of effort with minimal reward.  I'd even go so far as to say that your databases should probably be the last thing you move. Stateful services are harder, and require a greater mastery of Kubernetes. Start with the easy stuff.

Also, and perhaps more importantly, nothing in Kubernetes will remove from you (as admin) the necessity of having knowledge of the database platform you're deploying.  While it does make some things (like HA) easier to automate, it doesn't change the basics of database management.  You still need to have someone on staff who knows PostgreSQL; it's just that that person can manage a lot more databases than they could before.

Everything is Alpha


Like everything in the new stacks, Kubernetes is under heavy development and many of the tools and features you're going to want to use are alpha, beta, or pre-1.0 software.  For folks used to PostgreSQL, where even our daily builds are runnable in production, you're going to find the world of containerish stuff very alpha indeed.

This means that moving your databases (or anything else) to Kubernetes at this stage means embracing the requirement to develop both some Kubernetes expertise, and a tolerance for risk.  Realistically, this means that you're only going to do it today if you have pressing needs to scale your database support and admin team that you can't meet otherwise. Folks in regulated environments with very low tolerance for mistakes should probably be the last to move.

In a year or so, this will change; there is more work today on making things stable in Kubernetes than there is on new features.  But right now you need to balance risk and reward.

Performance


Linux containers are just processes, and as a result perform largely like applications on the host OS would perform.  However, there are some areas where running on Kubernetes can add overhead.  Like any other abstraction layer, it's going to be problematic if your database is already your primary performance bottleneck.  These overhead sources include:

  • Extra network hops from connection redirection, leading to increased latency;
  • Performance bottlenecks caused by the various network overlays used for virtual networking;
  • Storage latency caused by storage abstractions, depending on your storage system.

That said, this overhead is not any worse (and sometimes considerably better) than running a database on AWS, where the majority of PostgreSQL servers today are hosted.  So it's not a blocker for most users, just the ones who are already performance-constrained and maxing out their hardware.

Large Databases


However, large databases may face more performance and management issues than smaller ones.  Kubernetes scheduling, Linux cgroups, namespace memory management, and other components of the stack are built with the idea that each deployed microservice will use a minority of the resources on the physical nodes.  When you move to a database that needs to use the majority of a node's physical resources, you'll find yourself working around a lot of built-in functionality to make it work.

One particular issue is that, if you are using Linux containers, you cannot turn off the Out-Of-Memory Killer.  Veteran PostgreSQL admins will recognize the problem this is for large databases.

Like other limitations, this won't be true forever.  Among other things, every component in Kubernetes is swappable, so you can expect more specialized workload managers in the future.  But right now, if you're doing data warehousing, you might do it off Kubernetes or use a dedicated cluster with special settings.

Conclusion and Checklists


To wrap up, you can compare your database use-case with these pluses and minuses to decide whether you should scope out moving your PostgreSQL infrastructure to Kubernetes.  To the degree that automation, self-service, and zero-admin small databases are valuable to you, consider it.  But if stability, marginal performance, and big data analytics are your main requirements, it may not be time for your databases yet.

Wednesday, July 11, 2018

Should I Run Postgres on Kubernetes? Part I

In preparation for my workshop on running PostgreSQL on Kubernetes on Monday, I wanted to talk a bit about why you'd want to run your database there in the first place -- and why you wouldn't.

The container world, starting with Docker and then moving to Kubernetes, has focused on stateless services like web applications.  This has been largely because stateless services are simply easier to manage in new environments, and can be handled generically as well, allowing Kubernetes to be relatively platform-agnostic.  Once you get into services that require storage and other persistent resources, the idea of "run your containerized application on any platform" becomes much more challenging.

Somewhere along the way "stateful services are hard" morphed into "you shouldn't run stateful services" as a kind of mantra.  Considering how much work contributors have put into making stateful apps possible, that's simply wrong.  And, for that matter, when has running databases on any stack ever been easy?

Let's start with some of the reasons you would want to run Postgres (or other databases) on Kubernetes.  In tomorrow's post, I'll go into some of the reasons why you would not want to.

One Environment to Rule Them All


The biggest reason is to simplify your development and deployment picture by putting all application components on Kubernetes.  It supplies a whole set of scaffolding to make deploying and integrating applications and databases easier, including shared secrets, universal discovery, load balancing, service monitoring, and scaling.  While there are integration points, like the Service Catalog, that support treating external databases as Kubernetes services, it's always going to be harder to manage a database that has to be deployed by a completely different process from the application it supports.

As a small example, let's take database connection credentials. If both the database and the application are on Kubernetes, rotating/updating credentials is simple: you just update a Secrets object (or a plugin provider like KeyCloak), and both the application and the database will pick up the new logins.  If the database is external, this becomes two separate update processes, each of which uses different tools.

With everything located on Kubernetes, setting up other parts of your development infrastructure gets much easier.  Deploying a test application to dev or staging can work the same way as it does in prod, minimizing mistakes.  CI/CD can more easily include database updates and upgrades.

Do the DevOps


With this level of integration, and the right coworkers, Kubernetes then enables a "real DevOps workflow" where each development team owns their own database.  In dev and staging environments, and maybe even in production, developers can be given the ability to self-service database support, deploying small replication clusters using predefined templates.

This isn't theoretical; it's what online fashion retailer Zalando is actually doing, allowing them to manage a large company with hundreds of online applications and a handful of database staff.

In traditional -- or even VM cloud -- environments, this is much, much harder to do.  First, the cost and resource consumption of database servers (or virtual servers) requires ops gating their deployment.  Second, the challenge of setting up databases in a way that protects against data loss is a blocker for motivated dev teams.

Much Easier HA


I spent a long time working on fully automated HA systems for PostgreSQL.  I found that, for HA to work, I needed multiple services outside PostgreSQL itself:

  • a "source of truth" to prevent split-brain;
  • something to ensure that the minimum number of replicas were running;
  • routing to the current master that can be quickly changed;
  • a way to enforce shutdown of rogue nodes;
  • a UI to view all of the members of my cluster.

Kubernetes supplies all of the above for you using various objects and tools.  This means that the automatic HA code that needs to run in the Postgres containers can be very simple, and substantially more reliable.  Today, I can't imagine trying to implement database high availability without it.

Zero-Brain-Cell Admin for Small Databases


In a presentation, Peter van Hardenberg pointed out that the median size of Heroku's hundreds of thousands of PostgreSQL databases was less than 1GB.  Databases this size also usually support a single application.  This is the size of database you must automate; no human being should be spending time and effort administering an individual database that fits on a 2005 USB key.

In an orchestrated environment, it becomes much easier to treat the large number of tiny databases needed in your ecosystem as the replaceable cogs they are.  Yes, you can automate your database management entirely with configuration management systems, but it's both easier and more reliable to do it in a container cloud.  Kubernetes helps you to not think about your less critical databases, so that you can spend your time on the databases that really need mollycoddling.

Not for Everyone


Of course, as with any change of infrastructure, there are downsides to moving databases to Kubernetes.  These are serious and will be the reasons why some users and admins will stick to the techniques they know.  I'll explore those in my next post.

Tuesday, April 3, 2018

New Annotated Config Files for PostgreSQL 10


Teal Deer: The Annotated.conf has been updated for PostgreSQL 10, and it's a Github repo now.

13 years ago, for PostgreSQL 8.0, I released the first Annotated.conf because we had (at that time) over 140 configuration settings and most people had no idea how to set them.  Today Postgres has nearly twice as many (269), and the problem is twice as bad. More about that later.

The repository is intended to be a quick reference for figuring out how to set a lot of these parameters.  It includes my tutorial slides in both LibreOffice and PDF format, and a giant table of settings and recommendations in either CSV format or as a PostgreSQL 10 database dump.  Accompanying each setting are my notes on why you'd change it and, if so, to what.

Probably more usefully, I've included two sample .conf files.  postgresql.10.simple.conf contains the 20 most commonly changed settings with detailed advice on how to set them.  extra.10.conf has the next 20 most-likely-to-be-changed settings, with somewhat more limited advice.  The idea is for you to follow the instructions, and then drop one or both of these files into your include_dir configuration directory, giving you a more sensible configuration without messing with the preposterously long default configuration file that ships with most PostgreSQL packages.  Those 20 to 40 settings should cover 90% of the needs of 90% of users.
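
As a quick sketch of what that looks like (the directory name here is just an example; include_dir is resolved relative to wherever your postgresql.conf lives):

    # in postgresql.conf -- 'conf.d' is an illustrative name
    include_dir = 'conf.d'

    # then copy the edited sample file into that directory, e.g.
    # $PGDATA/conf.d/postgresql.10.simple.conf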

If you feel that my advice is inaccurate, incomplete, or out-of-date, then Github has both Issues and Pull Requests available for you to suggest replacements.

Now, I can't help but feel that the configuration situation with PostgreSQL is getting worse.  While the database has improved with some better sensible defaults, and some good auto-configuration code (for example, for transaction log size), overall both the number of settings and the number of settings most users have to care about keeps going up.

The worst part of this is settings added with absolutely no information on how to determine a reasonable level for the setting.  For example, these four data flushing settings were added in version 10:

#backend_flush_after
#bgwriter_flush_after
#checkpoint_flush_after
#wal_writer_flush_after


... but there is no concrete advice on how to set these. Yes, there are docs, but even as someone who has a solid grounding in database, memory, and filesystem performance, I'd hesitate to recommend any specific level for these for a specific problem.  I'd like to be optimistic and assume that an explanation is coming in Postgres 11 or so, but after four years there's still nothing on how to set vacuum_multixact_freeze_min_age.

In the meantime, hopefully annotated.conf gives you a good start on tinkering with your own PostgreSQL settings.



Wednesday, January 11, 2017

Retiring from the Core Team

Those of you in the PostgreSQL community will have noticed that I haven't been very active for the past year.  My new work on Linux containers and Kubernetes has been even more absorbing than I anticipated, and I just haven't had a lot of time for PostgreSQL work.

For that reason, as of today, I am stepping down from the PostgreSQL Core Team.

I joined the PostgreSQL Core Team in 2003.  I decided to take on project advocacy, with the goal of making PostgreSQL one of the top three databases in the world.  Thanks to the many contributions by both advocacy volunteers and developers -- as well as the efforts by companies like EnterpriseDB and Heroku -- we've achieved that goal.  Along the way, we proved that community ownership of an OSS project can compete with, and ultimately outlast, venture-funded startups.

Now we need new leadership who can take PostgreSQL to the next phase of world domination.  So I am joining Vadim, Jan, Thomas, and Marc in clearing the way for others.

I'll still be around and still contributing to PostgreSQL in various ways, mostly around running the database in container clouds.  It'll take a while for me to hand off all of my PR responsibilities for the project (assuming that I ever hand all of them off).

It's been a long, fun ride, and I'm proud of the PostgreSQL we have today: both the database, and the community.  Thank you for sharing it with me.

Wednesday, May 18, 2016

Changing PostgreSQL Version Numbering

Per yesterday's developer meeting, the PostgreSQL Project is contemplating a change to how we do version numbers.  First, let me explain how we do version numbers now.  Our current version number composition is:

9 . 5 . 3 
Major1 . Major2 . Minor


That is, the second number is the "major version" number, reflecting our annual release.  The third number is the update release number, reflecting cumulative patch releases.  Therefore "9.5.3" is the third update to version 9.5.

The problem is the first number: we have no clear criteria for when to advance it.  Historically, we've advanced it because of major milestones in feature development: crash-proofing for 7.0, the Windows port for 8.0, and in-core replication for 9.0.  However, as PostgreSQL's feature set matures, it has become less and less clear which milestones would be considered "first digit" releases.  The result is arguments about version numbering on the mailing lists every year, which waste time and irritate developers.

As a result, the PostgreSQL Project is proposing a version numbering change, to the following:

10 . 2
Major . Minor

Thus "10.2" would be the second update release for major version 10.   The version we release in 2017 would be "10" (instead of 10.0), and the version we release in 2018 will be "11".

The "sortable" version number available from the server, libpq, and elsewhere would remain the same six digits, zero-filled in the middle.  So 10.2 would be 100002.

The idea is that this will both put an end to the annual arguments, as well as ending the need to explain to users that 9.5 to 9.6 is really a major version upgrade requiring downtime.

Obviously, there is potential for breakage of a lot of tools, scripts, automation, packaging and more in this.  That's one reason we're discussing this now, almost a year before 10 beta is due to come out.

The reason for this blog post is that I'm looking for feedback on what this version number change will break for you.  Particularly, I want to hear from driver authors, automation engineers, cloud owners, application stack owners, and other folks who are "downstream" of PostgreSQL.  Please let us know what technical problems this will cause for you, and how difficult it will be to resolve them in the next nine months.

We are not, at this point, interested in comments on how you feel about the version change or alternate version naming schemes.  That discussion has already happened, at length.  You can read it here, here, and here, as well as at the developer meeting.

Places to provide feedback:


Thanks for any feedback you can provide.

Note that the next release of PostgreSQL, due later this year, will be "9.6" regardless.  We're deciding what we do after that.




Thursday, April 28, 2016

Don't delete pg_xlog

This StackOverflow question reminded me of this old blog post, which is still relevant today:

pg_log, pg_xlog and pg_clog


A default $PGDATA directory, when you create it, contains three directories named "pg_*log".


pg_log


$PGDATA/pg_log is the default location for the database activity logs, which include error messages, query logging, and startup/shutdown messages.  This is where you should first look for information when PostgreSQL won't start.  Many Linux distributions and other packaging systems relocate this log directory to somewhere like /var/log/postgresql.

You can freely delete, rename, compress, and move files in pg_log without penalty, as long as the postgres user still has rights to write to the directory. If pg_log becomes bloated with many large files, you probably need to decrease the number of things you're logging by changing the settings in postgresql.conf.
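
For example, a sketch of dialing logging back in postgresql.conf (the values are illustrative, not recommendations):

    # example values only
    log_statement = 'none'                # don't log every statement
    log_min_duration_statement = 1000     # only log queries slower than 1 second
    log_connections = off
    log_disconnections = off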

Do note that if you "delete" the current log file on a Linux or Unix system, it may remain open but invisible, continuing to receive log messages (and consume disk space) until the log rotates.

pg_xlog 


$PGDATA/pg_xlog is the PostgreSQL transaction log.  This set of binary log files, with names like '00000001000000000000008E', contains images of the data from recent transactions.  These logs are also used for binary replication.

If replication, archiving, or PITR is failing, this directory can become bloated with gigabytes of logs the database server is saving for when archiving resumes.  This can cause you to run out of disk space.

Unlike pg_log, you may not freely delete, move, or compress files in this directory.  You may not even move the directory without symlinking it back to its original location.  Deleting pg_xlog files may result in unrecoverable database corruption.
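
On 9.4 and later, a quick way to check whether failed archiving is what's piling up files is the archiver statistics view (a sketch):

    -- is the archiver failing, and how recently?
    SELECT archived_count, failed_count, last_failed_wal, last_failed_time
    FROM pg_stat_archiver;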

If you find yourself in a situation where you've got 100GB of files in pg_xlog and the database won't start, and you've already disabled archiving/replication and tried clearing disk space every other way, then please take two steps:
  1. Move files from pg_xlog to a backup disk or shared network drive, don't delete them, and
  2. Move only a few of the oldest files, enough to allow PostgreSQL to start again.

pg_clog


$PGDATA/pg_clog contains a log of transaction metadata.   This log tells PostgreSQL which transactions completed and which did not.  The clog is small and never has any reason to become bloated, so you should never have any reason to touch it.

Should you ever delete files from pg_clog, you might as well delete the entire database directory. There is no recovery from a missing clog.

Note that this means, if you back up the files in a $PGDATA directory, you should make sure to include the pg_clog and pg_xlog as well, or you may find that your backup is not usable.

Tuesday, April 26, 2016

Join us for the 3rd pgCon User Unconference

This year, we're continuing to experiment with new formats for the pgCon unconference.  In 2013 and 2014 we had an Unconference on the Saturday of pgCon.  In 2015 we had a limited Developer Unconference on Wednesday.

This year, we will have a Developer Unconference on Wednesday, and a User Unconference on Saturday.  We're doing this because people were disappointed that we didn't do the User Unconference last year, and asked us to bring it back.  So, hopefully you planned to stay over Saturday!

The User Unconference has several purposes:

  • to give various teams and special interest groups an opportunity to schedule something
  • to give folks whose technology was released too late for the CfP another chance to present something
  • to continue discussions started around talks in the main program
So, please join us!  And if you have ideas for User Unconference sessions which you want to make sure get on the program, please list them on the wiki page using the template provided.  Note that final sessions will be chosen at 10am Saturday morning, though.


Thursday, March 31, 2016

9.5.2 update release and corrupt indexes

We've released an off-schedule update release today, because of a bug in one of 9.5's features which has forced us to partially disable the feature.  This is, obviously, not the sort of thing we do lightly.

One of the performance features in 9.5 was an optimization, contributed by Peter Geoghegan, which sped up sorting of text and numeric values.  This was an awesome feature which sped up sorts across the board by 50% to 2000%, and since databases do a lot of sorting, it was an overall speed increase for PostgreSQL.  It was especially effective in speeding up index builds.

That feature depends on a built-in glibc function, strxfrm(), which can be used to create a sortable key for strings.  Now, the POSIX standard says that strxfrm() + strcmp() should produce sorting results identical to the strcoll() function.  And in our general tests, it did.

However, there are dozens of versions of glibc in the field, and hundreds of collations, and it's computationally unreasonable to test all combinations.  Which is how we missed the problem until a user reported it.  It turns out that for certain releases of glibc (particularly anything before 2.22 on Linux or BSD), with certain collations, strxfrm() and strcoll() return different results due to bugs, which can result in an index lookup failing to find rows that are actually there.  In the bug report, for example, an index on a German text column on RedHat Enterprise Linux 6.5 would fail to find many rows in a "between" search.

As a result, we've disabled the feature in 9.5.2 for all indexes which are on collations other than the simplified "C" collation.  This sucks.

Also, if you're on 9.5.0 or 9.5.1 and you have indexes on columns with real collations (i.e. not "C" collation), then you should REINDEX each of those indexes (or CREATE INDEX CONCURRENTLY a replacement, then DROP INDEX CONCURRENTLY the old one).  Which really sucks.
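
A minimal sketch of the rebuild-without-blocking approach (the table and index names here are made up):

    -- rebuild an affected index without taking an exclusive lock
    CREATE INDEX CONCURRENTLY orders_customer_idx_new ON orders (customer_name);
    DROP INDEX CONCURRENTLY orders_customer_idx;
    ALTER INDEX orders_customer_idx_new RENAME TO orders_customer_idx;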

Of course we're discussing ways to bring back the feature, but nobody has a solution yet. In the meantime, you can read more about the problem on the wiki page.

Friday, February 19, 2016

JSONB, PostgreSQL and Go

Just ran across this excellent pair of blog posts on using JSONB with Go application design to simplify your data model.  Since that blog is not syndicated on Planet Postgres, reblogging it so that more people see it.


If you find other great PostgreSQL content on the web, make sure folks know about it.

Friday, January 8, 2016

Configuration changes in 9.5: transaction log size

If you downloaded 9.5 after yesterday's release, you might have noticed some changes to postgresql.conf, especially if you copied over your favorite 9.4 config and it refused to start up.  Particularly, the parameter checkpoint_segments is gone, and has been replaced by min_wal_size and max_wal_size.  What does this mean?

Well, checkpoint_segments was the old way we had for users to determine how "big" the transaction log should be.  Users with a large server or a lot of transactions per second would set it high, and users with small VMs and a slow database could set it low.  This had some problems, though:

  1. The default setting worked for pretty much nobody, so you always had to adjust it.
  2. The WAL*  always used the maximum space available, even if it wasn't needed.
  3. Figuring out the largest size your WAL could be required some math and a knowledge of version-specific PostgreSQL internals.
The last was the most confusing part for users; the calculation for maximum WAL size was:

   ( ( checkpoint_segments * 2 ) + 1 ) * 16MB  +  ( wal_keep_segments * 16MB )

... which meant that people generally sized it by picking an arbitrary number and then adjusting up or down based on feedback.
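
To make that math concrete, here's a worked example with arbitrary (but plausible) values:

    # with checkpoint_segments = 32 and wal_keep_segments = 64:
    ( ( 32 * 2 ) + 1 ) * 16MB  +  ( 64 * 16MB )  =  1040MB + 1024MB  ≈  2GB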

The new parameters are way simpler:
  • min_wal_size: the minimum size the transaction log will be;
  • max_wal_size: the maximum size the transaction log will be (but see below)
This means that your transaction log on disk shouldn't ever be larger than ( max_wal_size + wal_keep_segments * 16MB ).  It is a "soft" limit, though; if PostgreSQL gets really behind, or if archiving is failing, it will get larger than max_wal_size.

However, that isn't the really cool part.  Heikki did an amazing thing, in that the WAL is sized dynamically based on how much was used during the previous cycles.  So you can set max_wal_size to some high value (default is 1GB),  and not worry about PostgreSQL using a bunch of extra disk space if it's not needed.  This means that we can set a default which will be "good enough" for 80% of our users, and we have.  This makes me do a little happy dance.

The other 20% may still want to tune, though, so here are some guidelines (with a starting-point sketch after the list):
  • if you know your database write traffic is "bursty", with long periods of inactivity followed by furious writes, increase min_wal_size;
  • if you do bulk loads larger than 1GB, increase max_wal_size to the size of a bulk load;
  • if you write more than 1GB of data every 10 minutes, increase max_wal_size;
  • if you get "checkpoints occurring to frequently" error messages, try increasing both parameters incrementally.
Most users won't need to touch those parameters at all, though.  Which is as it should be.

* WAL == Write Ahead Log == Transaction Log == XLog



Wednesday, December 23, 2015

A Christmas present: PostgreSQL 9.5 RC1

It's been much-delayed, but PostgreSQL 9.5 RC1 is finally available.  This release contains not only fixes for all known issues in 9.5 features, it also contains an "ultimate solution" to multixact truncation handling.  Multixacts were the cause of multiple bugfix releases in versions 9.3 and 9.4, and several hackers have taken the opportunity in 9.5 to implement a new design for this.  This has been a large part of the delay in releasing 9.5.

So: if you plan to upgrade to 9.5, download and test now.  Final will be released in early January, barring discovery of new ghastly bugs in the RC.

To make it easier for you to test, I've updated the postgres95-test Docker image, ready for you to download and try out whatever you want.

9.5 brings us a lot of cool features which are likely to justify upgrading for you:
  • UPSERT
  • Row Level Security
  • Grouping Sets/CUBE/ROLLUP
  • Faster sorts for text
  • FDW push-down and partitioning support
  • BRIN indexes
If you want to read up on the features so you know what to test:

Tuesday, December 8, 2015

Meet the newest member of the PostgreSQL community: MongoDB, Inc.

Well, this story is better told by the guy who broke it, John De Goes.

The TL;DR is that MongoDB will be shipping PostgreSQL as its "legacy BI connector" in version 3.2, using PostgreSQL+Multicorn FDW as a wrapper to connect SQL-based BI systems to Mongo.

I have two thoughts about this:
  1. Welcome to the PostgreSQL community, MongoDB staff!
  2. Multicorn?  I mean, not that Multicorn isn't great, but don't you think you should be using a C-based FDW driver for performance?
Anyway, hopefully the folks at MongoDB will put in a talk at pgNYC about this.  We're always keen to hear about creative uses of FDWs, and if this isn't one, I don't know what is.

PostgreSQL, The Center Of Your Dataverse™

(even if you're using Mongo)


Thursday, November 12, 2015

PostgreSQL link round-up

First, in case you somehow missed it, PostgreSQL 9.5 Beta 2 is now out.  Time for another round of testing!  There are fewer and fewer bugs being found, so we're not far from a final release.  I don't know about anyone else, but we're going into production on Beta 2 in a couple of places.  Can't wait any longer for Upsert and RLS.

Second, pgConfSV has announced its keynotes, from CitusData, Pivotal, and -- as a bit of a surprise -- from streaming data thing Kafka (before you ask, I didn't pick the keynotes). I believe registration is still open, so you can still go to the conference next week if you act quickly.

Thirdly, I have a roundup of some cool blogs covering PostgreSQL which aren't on Planet and you may have missed:

Compose.io did a terrific two-part article on why PostgreSQL is the best open source database.  Read it here: Part I  Part II

They also covered using JSON in Postgres with Python.

In stranger news, there's an amazingly strange story about Soylent, PostgreSQL, and auto-responding to poisoning accusations.  Yes, Soylent Inc. uses PostgreSQL, why not?  Read the whole weird tale here on Zapier's blog.

That's it for my round up ... if you have some good links, post them in the comments.

Tuesday, November 10, 2015

Ready to be an Accidental DBA at pgConfSV?

My AccidentalDBA tutorial materials have been updated for pgConfSV next week.  If you are planning to take this tutorial, please read and follow the setup instructions before you get to the conference.

I created this tutorial in 2009 because of the number of people these days who find themselves in charge of a PostgreSQL server ... or many servers ... without any DBA background, training, or desire to be ops.  The tutorial is intended to teach you the minimum you need to keep your PostgreSQL server from falling over so you can get back to whatever your main job is.

You can take it self-paced if you can't make it to pgConfSV.  The tutorial includes all materials and notes.  In this version, I've added the option of a Docker container (preferred), pg_stat_statements, and a restore-to-point-in-time exercise.

If you are going to attend this tutorial, note that it requires advance setup which will be harder to perform once you get to your hotel, and near-impossible if you wait until 10 minutes before it starts.  So read up and install now.

And ... I believe there's still a few registrations left for the tutorial day, so consider signing up.  You have nothing to lose but your fear of database admin!

Monday, August 31, 2015

Lock-polling script for ALTER TABLE

One of PostgreSQL's advantages over many other SQL databases is letting users modify database objects with a minimum of locking and fuss, and do it in a transaction so it can be rolled back.  However, a minimum of locking isn't no locking, so one still has to plan for getting a lock to push through the database migration.

For example, say I wanted to add two new columns to a table.  I already know that by making the columns nullable, and not setting a default, I can add them with a metadata-only change which requires only an update to the system catalogs.  However, this table change does require an ACCESS EXCLUSIVE lock for a few milliseconds to go through, which can be tricky.  See, an access exclusive lock blocks everything, including reads and autovacuum, which can be a bit messy if your timing is bad.  You can end up waiting for that lock behind a long-running query, and then all the other database traffic can pile up behind you.

Now, if we had ALTER TABLE CONCURRENTLY it would take care of this for you.  But we don't (yet), so you need to fake it with a DO script.  Here's an example DO script which polls for a lock every 2 seconds until it can get one, then pushes through the table change:

    DO $f$
    DECLARE ntries INT := 10;
        sleepytime INT := 2;
    BEGIN

    FOR get_lock IN 1 .. ntries LOOP
            BEGIN
                    LOCK TABLE mytable
                    IN ACCESS EXCLUSIVE MODE NOWAIT;
                    ALTER TABLE mytable ADD COLUMN new_col1 INT,
                            ADD COLUMN new_col2 VARCHAR;
                    RAISE INFO 'table updated';
                    RETURN;
            EXCEPTION
                    WHEN lock_not_available THEN
                            PERFORM pg_sleep(sleepytime);
            END;
    END LOOP;

    RAISE INFO 'unable to obtain lock after % tries', ntries;

    END;$f$;


The idea here is that you keep trying to LOCK ... NOWAIT, which will throw an error if it can't get the lock immediately.  Then it sleeps and tries again 2 seconds later.  If you're using 9.3 or later, you can take an even better approach, using lock_timeout:

    DO $f$
    DECLARE ntries INT := 10;
        sleepytime INT := 2;
    BEGIN
    SET lock_timeout = '100ms';
    FOR get_lock IN 1 .. ntries LOOP
            BEGIN
                    LOCK TABLE mytable IN ACCESS EXCLUSIVE MODE;
                    ALTER TABLE mytable ADD COLUMN a int,
                            ADD COLUMN b INT,
                            ADD COLUMN c INT;
                    RAISE INFO 'table updated';
                    RETURN;
            EXCEPTION
                    WHEN lock_not_available THEN
                            PERFORM pg_sleep(sleepytime);
            END;
    END LOOP;

    RAISE INFO 'unable to obtain lock after % tries', ntries;

    END;$f$;


The advantage of this is that it doesn't have to get the lock immediately; it waits 100ms, then gives up on getting the lock.  That helps in situations where the table is never completely free of read queries, without risking a serious pile-up of requests.

You should be able to adapt this approach to the migrations you actually need to push through.

Friday, August 28, 2015

Stupid Hacks: Dictionary replace function

I write a lot more PL/pgSQL than I'd like to.  Not that I don't like SQL, but as a language PL/pgSQL really shows its thrown-together origin; it's like 80's primitive.  One thing that PL/pgSQL lacks is good string manipulation tools, which is particularly tragic given that the #1 thing to do in PL/pgSQL is to generate queries from parameters and run EXECUTE.

Postgres has two built-in ways to do string substitution: concatenation and format().  Both have drawbacks.  Let me give you an example:

EXECUTE 'SELECT ' || col1 || ', ' || col2 || ' FROM ' || userschema ||
  '.accounts WHERE ' || filterclause || ' ORDER BY ' || col1 || ',' || col2;

EXECUTE format('SELECT %s, %s FROM %s.accounts WHERE %s ORDER BY %s, %s', col1, col2, userschema, filterclause, col1, col2);

You can see the problem here.  Both formats are hard to read and hard to maintain.  Python and Perl have a good fix for this: dictionary/hash-based string substitution, where you can swap in the dictionary keys for the values.  So I wrote up a quick hack to do this in PL/pgSQL.

Here's the dict-replace function.
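
Since the function itself lives behind that link, here's a minimal sketch of the idea, assuming the dictionary comes in as a JSON object (this is my illustration, not necessarily the linked implementation):

    CREATE OR REPLACE FUNCTION replace_vars(template text, vardict json)
    RETURNS text
    LANGUAGE plpgsql AS $f$
    DECLARE
        result text := template;
        key text;
    BEGIN
        -- swap each ${key} placeholder for the matching dictionary value
        FOR key IN SELECT json_object_keys(vardict) LOOP
            result := replace(result, '${' || key || '}', vardict ->> key);
        END LOOP;
        RETURN result;
    END;
    $f$;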

Using it, you'd replace the above with:

EXECUTE replace_vars('SELECT ${col1}, ${col2} FROM ${userschema}.accounts
  WHERE ${filterclause} ORDER BY ${col1}, ${col2}', vardict);

Of course, you need to first set up the vardict as a JSON value, in the form:

vardict := '{ "col1" : "username", 'col2' : "branchname", ...}'

Still, much more readable, eh?  No refcounting, no repeating variables, no string breaks.  I used Bash's variable substitution syntax of ${var} because it seemed like the thing least likely to conflict with user text, unlike anything involving %.

There are some caveats, though: it's not performant, and you could probably find strings which will break it, particularly if you're swapping in JSON values.  It's not SQL-injection safe, so don't use it for stuff handling user input.  You still have to do your own quoting of strings.  And if you have access to PL/Perl or PL/Python you don't need this nonsense.

But for the minimal case, it should help.

Friday, August 14, 2015

Most of the talks chosen for pgConfSV

Since we're still working on the schedule, I wanted to give folks a preview of the talks we've picked for pgConfSV.  We still have a few more talks to pick and the tutorials to finalize.  Regardless, it's looking like a really exciting lineup! 
We'll have folks from Heap, Wanelo, TreasureData, Rackspace, Turnitin.com, EMC, Joyent, Square, and more.   The sad part will be missing three talks in each timeslot.

Note that this set of talks is not final; we're still swapping a couple of things, and some speakers have not confirmed.

pgConfSV is November 17th and 18th in South San Francisco.  If you're going to AnsibleCon, that's the day after, so you can do both!

Wednesday, August 12, 2015

Two Great Tastes That Taste Great Together: cstore + Pipeline

cstore_fdw, the column-store extension for PostgreSQL by CitusData, is a really good way to add compressed storage to your application for archival data and for analytic data intended to be aggregated.  Because it's a column store, though, cstore wants new data added in batches, the bigger the better.  This means that you need to find some way to batch up incoming data, preferably one able to accept a lot of new data rapidly.

This is where PipelineDB comes in.  Since PipelineDB is open source now, and based on 9.4.4, I can add extensions to it, including cstore_fdw.  I've done so with the PipelineDB 0.7.7a Docker container, so if you use Docker it's simply available.

As a demonstration of this, I'll set up some fake clickstream data, archived to a cstore table hourly.  First, I wrote a quick python script to generate it continuously and push it to a Pipeline stream.

Then I created the stream and continuous view in PipelineDB:

    CREATE STREAM seenstream ( user_id int, page_id int, ts timestamptz );

    CREATE CONTINUOUS VIEW seenfeed as select user_id, page_id, ts
    FROM seenstream
    WHERE arrival_timestamp > ( clock_timestamp() - interval '90 minutes' );


Next, I created the cstore table:

    CREATE EXTENSION cstore_fdw;

    CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

    CREATE FOREIGN TABLE seen (
        user_id int,
        page_id int,
        ts timestamptz
    )
    SERVER cstore_server
    OPTIONS (compression 'pglz');


Finally, I added a simple script which ran the following query once per hour:

    INSERT INTO seen
    SELECT user_id, page_id, ts
    FROM seenfeed
    WHERE ts >= ( now() - interval '1 hour' )
    ORDER BY user_id, page_id, ts;


... and then I started everything in motion.

Now, for cstore the ORDER BY is vitally important; it determines how the blocks you create for the column store are organized.  In this particular case, I knew that I would be doing more analysis by user.  But most users would do ORDER BY ts instead.

After a few hours, I checked back, and now I can run some analytical queries on the cstore table.  For example, user activity:

    select user_id, count(distinct page_id), max(ts) 

    from seen group by user_id;

    user_id | count |              max             
    --------+-------+-------------------------------
          1 |    92 | 2015-08-11 22:59:51.504777+00
          2 |    86 | 2015-08-11 22:54:09.77318+00
          3 |    89 | 2015-08-11 22:59:14.574697+00


page activity:

    select page_id, count(distinct user_id) as duv, max(ts) as last
    from seen group by page_id order by duv desc;

    page_id | duv |             last             
    --------+-----+-------------------------------
         71 |  96 | 2015-08-11 22:59:38.690743+00
         99 |  96 | 2015-08-11 22:58:43.004618+00
          4 |  96 | 2015-08-11 22:57:45.95007+00


... and more. 

Now, in a production environment, you'll want to do more than that.  Depending on traffic, you might batch inserts daily instead of hourly.  You might want to use several continuous views to store different summaries of the data instead of raw data.  But the above should be enough to show you how well these two tools go together.

Tuesday, August 11, 2015

Please, security test our code!

Since Oracle was so nice as to remind everyone what software security is really like with closed-source software, I wanted to remind people how finding and reporting security issues works in PostgreSQL:
  1. Feel free to "reverse engineer" the code.  In fact, here it is on github if you want to scrutinize it.
  2. We generally credit security researchers who find real security holes (with limitations for duplicates, etc.).  Over the last few years, some of the most critical bugs in PostgreSQL were found by professional security researchers doing things like fuzz testing.
  3. If you think you've found a security issue, please report it to security@postgresql.org. If it turns out to be a non-issue, we'll tell you, and you can report it as a regular bug.
  4. Be prepared to answer questions about your report.  We showed you our code, you can show us yours.
Our open approach to security is the reason why PostgreSQL was rated by the Database Hacker's Handbook as "the most secure by default": more secure than Oracle.  And why for five Defcons in a row, security hackers have been unable to crack PostgreSQL in the annual Schemaverse competition.

And please ... update your servers! We send out those incremental updates for a reason, and as often as not, there are security patches.   More information is on our Security Page.

Note: the above is my personal opinion and is not the opinion of The PostgreSQL Project or any other organization.

Friday, August 7, 2015

Understanding Unintuitive TABLESAMPLE Results

In 9.5 Alpha 2:

create table mil ( id int, val text );
insert into mil select i, i::text || '-val'  from
generate_series(1,1000000) as gs(i);
analyze;

postgres=# select * from mil tablesample system ( 0.04 );
 id | val
----+-----
(0 rows)


Huh, what?

So, what I didn't understand here is the way rows are selected for TABLESAMPLE SYSTEM.  Since SYSTEM is page-based, I thought that we computed the requested % of pages and then picked that many pages at random.  Since this table had exactly 185 rows per page, it should return 370 rows every time (2 pages).  But that's not what happened. In fact, running the following query I got a variety of counts:

SELECT count(*) FROM (select * from mil tablesample system ( 0.04 ) ) as a;
370
370
370
555
555
185
0
925
 
925?  0?  What the hey?

What's really happening is that pages for SYSTEM are selected in a different way: each page is checked against the probability once.  This means that, while on average you'll get the number of pages you're expecting, the numbers will vary from request to request quite a bit.

This also means that SYSTEM is a bad choice for really small sample sizes, like 0.01%.  BERNOULLI is better, because it'll be checking by row, and therefore the size of the return sample will be much more predictable.  It will still have a bit of variation, though; in my testing, +/- 10% on a few hundred rows.
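
For comparison, here's the same count with BERNOULLI at the same sample percentage; 0.04% of 1,000,000 rows works out to an expected 400 rows, and the actual counts should cluster around that:

    SELECT count(*) FROM (select * from mil tablesample bernoulli ( 0.04 ) ) as a;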

Gulcin Yildirim has a great explanation of this on the 2nd Quadrant blog.

So, what if you need TABLESAMPLE to get a very specific number of rows for you?  Well, that's why Petr Jelinek wrote the optional (loadable) SYSTEM_ROWS sampling method.  This can be loaded as the tsm_system_rows extension in 9.5.
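
Usage is straightforward (a sketch against the same test table as above):

    -- 9.5: load the extension, then ask for an exact number of rows
    CREATE EXTENSION tsm_system_rows;
    SELECT * FROM mil TABLESAMPLE system_rows ( 100 );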

Hopefully that helps folks be less confused than me about how TABLESAMPLE works.