Performance Issue
Hey everyone! I've got a situation that could really use some feedback. It's a little bit off subject, though PHP is certainly involved.
Let's say there's this Linux box. It runs a daily cron job that fires off a Perl script, which checks the file size of the current Apache log file for each web server user account. If the file size exceeds a preset amount (about 15 MB), the script appends the current date stamp to the filename, compresses the file with the 'gzip' utility, and starts a new log file.
My job is to write a PHP script to run on this server on a per-account basis that, on request, will go through the account's log files and compile a set of traffic statistics for a specified date range based on certain patterns in the data. After working through the logic of the process, I've determined that I have no accurate way of determining the contents of any file other than decompressing and reading it. If I do that, though, the log has to be recompressed when I'm done with it.
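One thing worth noting here: PHP's zlib extension can stream a .gz file line by line with gzopen()/gzgets(), so no decompressed copy ever touches the disk and there is nothing to recompress afterwards. A minimal, self-contained sketch (the filename and log contents below are made up for illustration):

```php
<?php
// Build a tiny sample .gz log so the example is self-contained
// (filename and contents are hypothetical).
$path = '/tmp/sample_access_log.gz';
$gz = gzopen($path, 'w9');
gzwrite($gz, "1.2.3.4 - - [01/Jun/2004:00:00:01] \"GET / HTTP/1.1\" 200 512\n");
gzwrite($gz, "1.2.3.4 - - [01/Jun/2004:00:00:02] \"GET /a HTTP/1.1\" 200 64\n");
gzclose($gz);

// Read it back one decompressed line at a time -- the file on disk
// stays compressed the whole time, so no recompression step is needed.
$fp = gzopen($path, 'r');
$lines = 0;
while (!gzeof($fp)) {
    if (gzgets($fp, 4096) !== false) {
        $lines++;
    }
}
gzclose($fp);
echo "Read $lines lines\n";
```

Since only one line is held in memory at a time, this also keeps memory use flat regardless of how large the log is.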
Here's my current logic. I've written a function, getLogHits(), that gets a list of files in the 'logs' directory for the account in question. For each file in the list, the function decompresses the file and sends the filename to another function, analyzeLogFile(), which in turn reads the data from the file line by line and stores any hits within the date range in an array. Once it finds a hit that is outside the date range, it closes the file and returns the array. getLogHits() then recompresses the file and moves to the next file in the list.
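For reference, the flow described above might be sketched roughly like this. The function names follow the post; in_range() is a hypothetical stand-in for the date-range test, and error handling is omitted:

```php
<?php
// Hypothetical date-range test: extract the timestamp from a log
// line and compare it against the requested range. Stubbed here.
function in_range($line, $start, $end) {
    return true;
}

// Read one decompressed log file line by line; collect lines inside
// the range, and stop at the first line past it (this assumes the
// log is in chronological order).
function analyzeLogFile($filename, $start, $end) {
    $hits = array();
    $fp = fopen($filename, 'r');
    while (($line = fgets($fp, 4096)) !== false) {
        if (!in_range($line, $start, $end)) {
            break;   // past the date range: stop reading this file
        }
        $hits[] = $line;
    }
    fclose($fp);
    return $hits;
}

// Decompress each .gz log via the gzip utility, analyze it, then
// recompress it -- the open-and-close cycle described above.
function getLogHits($logDir, $start, $end) {
    $allHits = array();
    foreach (glob("$logDir/*.gz") as $gzFile) {
        exec('gzip -d ' . escapeshellarg($gzFile));    // decompress
        $plain = substr($gzFile, 0, -3);               // strip ".gz"
        $allHits = array_merge($allHits, analyzeLogFile($plain, $start, $end));
        exec('gzip ' . escapeshellarg($plain));        // recompress
    }
    return $allHits;
}
```

One easy win with this structure: since the rotated filenames carry date stamps, the loop in getLogHits() could skip whole files by filename before ever decompressing them.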
After some experimentation with the PHP extensions for decompressing and recompressing the log files, I determined that calling the 'gzip' utility via the exec() function was faster. Decompression usually takes under a second; recompression takes one to two seconds. Overall, the best case is that a file contains no relevant data and takes an average of 2 seconds for an open-and-close analysis, so I don't think compression is the bottleneck in my script. For the analysis, I'm using ereg() to search for whatever pattern is necessary. Each line in each log file (up until the first line past the date range) is checked against the pattern and added to an array if it matches.
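On the matching side, preg_match() is generally noticeably faster than ereg() for this kind of per-line scan, so it may be worth benchmarking the two. A sketch of pulling the date stamp out of a line, assuming the standard Apache log layout (the sample line is made up):

```php
<?php
// Per-line matching with preg_match() instead of ereg(). The line
// format and pattern assume the common Apache access-log layout.
$line = '1.2.3.4 - - [01/Jun/2004:12:34:56 -0500] "GET /index.php HTTP/1.1" 200 1024';

if (preg_match('#\[(\d{2})/(\w{3})/(\d{4}):#', $line, $m)) {
    echo "day={$m[1]} month={$m[2]} year={$m[3]}\n";
}
```

Another advantage of preg_match() here is the $matches array, which hands you the captured date pieces directly instead of requiring a second parsing pass.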
Part of my problem is that, because I have to check every file, generating a report that spans even just the past 90 days becomes expensive if the site has been on the server for several years. At the moment, the site I'm testing on produces the error below before the process finishes. I've included the address of the script itself and the address that displays the source code for the test I'm running. Any and all comments and criticism would be greatly appreciated.
Fatal error: Allowed memory size of 104857600 bytes exhausted (tried to allocate 35 bytes) in /home/ccc10k/web/get_log_hits.php on line 74
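That error pattern (100 MB exhausted while allocating a mere 35 bytes) usually means something is growing without bound, and the prime suspect here is the array that stores every matching line. One way out is to aggregate as you go instead of keeping the raw lines, e.g. one counter per day. A sketch, with sample lines and a pattern assumed from the common Apache log format:

```php
<?php
// Instead of appending every matching line to an array, keep one
// counter per day -- memory stays constant per distinct day no
// matter how many lines match. Sample data is made up.
$sampleLines = array(
    '1.2.3.4 - - [01/Jun/2004:00:00:01 -0500] "GET / HTTP/1.1" 200 512',
    '1.2.3.4 - - [01/Jun/2004:00:00:02 -0500] "GET /a HTTP/1.1" 200 64',
    '5.6.7.8 - - [02/Jun/2004:00:00:03 -0500] "GET /b HTTP/1.1" 200 128',
);

$hitsPerDay = array();
foreach ($sampleLines as $line) {
    if (preg_match('#\[(\d{2}/\w{3}/\d{4}):#', $line, $m)) {
        $day = $m[1];
        if (!isset($hitsPerDay[$day])) {
            $hitsPerDay[$day] = 0;
        }
        $hitsPerDay[$day]++;
    }
}

print_r($hitsPerDay);
```

If the report really does need the individual lines rather than totals, writing matches to a temporary file (or a database table) as they are found would sidestep the memory_limit ceiling the same way.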
Executable test file
Test file source code
Function file source code