Discussion:
Design ? table vs. view?
John McKown
2014-07-15 12:33:56 UTC
Permalink
I have a table which has some "raw" data in it. By "raw", I mean it is
minimally processed from a log file. Every week, I update this table by
processing the weekly log using awk to create a "psql script" file which
looks similar to:

COPY rawdata FROM STDIN;
.... lines created by awk script
\.

The table schema is:
Table "jobrun.rawdata"
Column | Type |
----------+---------------+
lpar | character(4) |
yyddd | character(5) |
timedata | character(11) |
jobid | character(8) |
msgid | character(7) |
jobname | character(8) |

Now, this data is not really very useful in its raw form. So I "process" it
via a view:

View "jobrun.rundata"
Column | Type |
---------+--------------------------+
lpar | character(4) |
msgid | character(7) |
jobname | character(8) |
jobid | character(8) |
msgtime | timestamp with time zone |
View definition:
SELECT rawdata.lpar,
rawdata.msgid,
rawdata.jobname,
rawdata.jobid,
to_timestamp((rawdata.yyddd::text || ' '::text) ||
rawdata.timedata::text, 'YYDDD HH24:MI:SS.MS'::text) AS msgtime
FROM rawdata;

My question is this: If I do a number of SELECTs on the "rundata" table.
So, would it be worth while to make this a table in itself? The plus of a
view is that I don't need to worry about updates. And I still have the
"raw" data around. In reality, this is just the first VIEW. I create three
other views. Two views are to "subset" the data based on the contents of
the "msgid" value (there are only 2 possibilities at present: 'START' and
'END'). The final view, which is my actual information is a FULL OUTER JOIN
of the START and END subset, based on lpar,jobname, and jobid:

View "jobrun.runinfo"
Column | Type |
----------+--------------------------+
lpar | character(4) |
jobname | character(8) |
jobid | character(8) |
runstart | timestamp with time zone |
runend | timestamp with time zone |
View definition:
SELECT COALESCE(a.lpar, b.lpar) AS lpar,
COALESCE(a.jobname, b.jobname) AS jobname,
COALESCE(a.jobid, b.jobid) AS jobid,
a.msgtime AS runstart,
b.msgtime AS runend
FROM runstart a
FULL JOIN runend b ON a.lpar = b.lpar AND a.jobname = b.jobname AND
a.jobid = b.jobid;

So the overhead may be quite high, because to SELECT from RUNINFO,
PostgreSQL must realize all four views.

I appreciate your thoughts on if this is OK, given that performance is
currently acceptable. Mainly because this work is basically only done one a
week, on Sundays. And I don't do it myself, it is done via a scheduler (not
cron, but similar) which runs some scripts.
--
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan

Maranatha! <><
John McKown
David G Johnston
2014-07-15 13:46:05 UTC
Permalink
Post by John McKown
I have a table which has some "raw" data in it. By "raw", I mean it is
minimally processed from a log file. Every week, I update this table by
processing the weekly log using awk to create a "psql script" file which
COPY rawdata FROM STDIN;
.... lines created by awk script
\.
Table "jobrun.rawdata"
Column | Type |
----------+---------------+
lpar | character(4) |
yyddd | character(5) |
timedata | character(11) |
jobid | character(8) |
msgid | character(7) |
jobname | character(8) |
Now, this data is not really very useful in its raw form. So I "process" it
View "jobrun.rundata"
Column | Type |
---------+--------------------------+
lpar | character(4) |
msgid | character(7) |
jobname | character(8) |
jobid | character(8) |
msgtime | timestamp with time zone |
SELECT rawdata.lpar,
rawdata.msgid,
rawdata.jobname,
rawdata.jobid,
to_timestamp((rawdata.yyddd::text || ' '::text) ||
rawdata.timedata::text, 'YYDDD HH24:MI:SS.MS'::text) AS msgtime
FROM rawdata;
My question is this: If I do a number of SELECTs on the "rundata" table.
So, would it be worth while to make this a table in itself? The plus of a
view is that I don't need to worry about updates. And I still have the
"raw" data around. In reality, this is just the first VIEW. I create three
other views. Two views are to "subset" the data based on the contents of
the "msgid" value (there are only 2 possibilities at present: 'START' and
'END'). The final view, which is my actual information is a FULL OUTER JOIN
View "jobrun.runinfo"
Column | Type |
----------+--------------------------+
lpar | character(4) |
jobname | character(8) |
jobid | character(8) |
runstart | timestamp with time zone |
runend | timestamp with time zone |
SELECT COALESCE(a.lpar, b.lpar) AS lpar,
COALESCE(a.jobname, b.jobname) AS jobname,
COALESCE(a.jobid, b.jobid) AS jobid,
a.msgtime AS runstart,
b.msgtime AS runend
FROM runstart a
FULL JOIN runend b ON a.lpar = b.lpar AND a.jobname = b.jobname AND
a.jobid = b.jobid;
So the overhead may be quite high, because to SELECT from RUNINFO,
PostgreSQL must realize all four views.
I appreciate your thoughts on if this is OK, given that performance is
currently acceptable. Mainly because this work is basically only done one a
week, on Sundays. And I don't do it myself, it is done via a scheduler (not
cron, but similar) which runs some scripts.
I would likely make "jobrun.runinfo" into a table while leaving
"jobrun.rawdata" as-is. I would have a function that populates "runinfo"
from "rawdata" that I would call after performing the copy to "rawdata".
There would be no views - unless you desire a view interface over "runinfo"
for API or permission reasons.

In 9.4 you can (probably) make "runinfo" an explicit MATERIALIZED VIEW and
perform REFRESH command to accomplish the same thing - though I am not
particularly familiar with the mechanics of that feature.

David J.




--
View this message in context: http://postgresql.1045698.n5.nabble.com/Design-table-vs-view-tp5811577p5811589.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
John McKown
2014-07-16 01:42:59 UTC
Permalink
On Tue, Jul 15, 2014 at 8:46 AM, David G Johnston <
Post by John McKown
I have a table which has some "raw" data in it. By "raw", I mean it is
minimally processed from a log file. Every week, I update this table by
processing the weekly log using awk to create a "psql script" file which
<snip>
Post by John McKown
So the overhead may be quite high, because to SELECT from RUNINFO,
PostgreSQL must realize all four views.
I appreciate your thoughts on if this is OK, given that performance is
currently acceptable. Mainly because this work is basically only done one a
week, on Sundays. And I don't do it myself, it is done via a scheduler (not
cron, but similar) which runs some scripts.
I would likely make "jobrun.runinfo" into a table while leaving
"jobrun.rawdata" as-is. I would have a function that populates "runinfo"
from "rawdata" that I would call after performing the copy to "rawdata".
There would be no views - unless you desire a view interface over "runinfo"
for API or permission reasons.
In 9.4 you can (probably) make "runinfo" an explicit MATERIALIZED VIEW and
perform REFRESH command to accomplish the same thing - though I am not
particularly familiar with the mechanics of that feature.
David J.
Being the indecisive nut that I am, I am going to do both <grin/>. I will
keep the current view. But when I update the rawdata, what I will then do
is:

drop table runinfo_table;
create table runinfo_table as select distinct * from runinfo;

I am fairly confident that there cannot be any duplicates in runinfo. But,
being paranoid as well, I will do the DISTINCT just to be sure. I may
change the VIEW to do that in the future, and remove it from the
preceeding. Since the process which updates the rawdata table is automated
and runs on a Sunday, the time needed to recreate runinfo_table is not
relevant to me. So I get what I want, unless I update rawdata off schedule.
I cannot imagine why I would do that since the logs from which I create it
are generally only available after 17:00 local time on Sunday. Getting the
iogs-to-date information for the time since the last dump is basically a
PITA and my current use is not critical. Actually, it is more a
"skunkworks" project of my own to produce a set of nice graphs, using R,
which _might_ turn out to be interesting to management, but the production
of which _will_ help me learn PostgreSQL and R better (hopefully).

Many thanks.
--
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan

Maranatha! <><
John McKown
Rémi Cura
2014-07-16 08:02:31 UTC
Permalink
Hey,
I guess you know all about PL/R,
the R language extension for postgres .
It is very convenient, though be carefull as sometime it crashed my server.

Cheers,
Rémi-C
Post by John McKown
On Tue, Jul 15, 2014 at 8:46 AM, David G Johnston <
Post by John McKown
I have a table which has some "raw" data in it. By "raw", I mean it is
minimally processed from a log file. Every week, I update this table by
processing the weekly log using awk to create a "psql script" file which
<snip>
Post by John McKown
So the overhead may be quite high, because to SELECT from RUNINFO,
PostgreSQL must realize all four views.
I appreciate your thoughts on if this is OK, given that performance is
currently acceptable. Mainly because this work is basically only done
one
Post by John McKown
a
week, on Sundays. And I don't do it myself, it is done via a scheduler (not
cron, but similar) which runs some scripts.
I would likely make "jobrun.runinfo" into a table while leaving
"jobrun.rawdata" as-is. I would have a function that populates "runinfo"
from "rawdata" that I would call after performing the copy to "rawdata".
There would be no views - unless you desire a view interface over "runinfo"
for API or permission reasons.
In 9.4 you can (probably) make "runinfo" an explicit MATERIALIZED VIEW and
perform REFRESH command to accomplish the same thing - though I am not
particularly familiar with the mechanics of that feature.
David J.
Being the indecisive nut that I am, I am going to do both <grin/>. I will
keep the current view. But when I update the rawdata, what I will then do
drop table runinfo_table;
create table runinfo_table as select distinct * from runinfo;
I am fairly confident that there cannot be any duplicates in runinfo. But,
being paranoid as well, I will do the DISTINCT just to be sure. I may
change the VIEW to do that in the future, and remove it from the
preceeding. Since the process which updates the rawdata table is automated
and runs on a Sunday, the time needed to recreate runinfo_table is not
relevant to me. So I get what I want, unless I update rawdata off schedule.
I cannot imagine why I would do that since the logs from which I create it
are generally only available after 17:00 local time on Sunday. Getting the
iogs-to-date information for the time since the last dump is basically a
PITA and my current use is not critical. Actually, it is more a
"skunkworks" project of my own to produce a set of nice graphs, using R,
which _might_ turn out to be interesting to management, but the production
of which _will_ help me learn PostgreSQL and R better (hopefully).
Many thanks.
--
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan
Maranatha! <><
John McKown
Loading...