I am new to Python 3 and to working with text files. I am trying to extract all filenames from a log file that end in the JavaScript (.js) extension. The file also contains other file extensions. I want to return only the filename and not the path, sort the output alphabetically, and display only unique values, as there are repeats in the log entries.

Examples from the log file are:

72.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /include/jquery.js HTTP/1.1" 200 25139

22.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /include/jquery.jshowoff.js HTTP/1.1" 200 25139

In this case I just want to return jquery.js and jquery.jshowoff.js and not the HTTP request and other log data.

This is my code so far:

filepath = '/home/user/Documents/access_log.txt'
with open(filepath, 'r') as access_log:
    contents = access_log.readlines()
    for line in contents:
        if ".js" in line:
            print(line)

My output does return only the lines that contain .js, but I don't know how to extract just the filename from them. I have tried to use regex to match, but have not been successful as I'm also new to using that. Any help would be greatly appreciated.

2 Answers

This can be done with regex, but I figured I'd give a pure Python solution.

The approach I took was to split each line on the path separator, /. The paths in an access log are URL paths from the HTTP request, so they use / regardless of the operating system the script runs on. Splitting gives a list. Then we search each element in the list for ".js " (with a trailing space). In these log lines the space is always there, because the filename is followed by the protocol (HTTP/1.1). The element with the filename will have extra text after the filename, so just split on ".js " and keep only the first element of that split. I commented these pieces in the code too.

filepath = '/home/user/Documents/access_log.txt'  # same path as in the question

with open(filepath, 'r') as access_log:
    contents = access_log.readlines()
    log_filenames = []
    for line in contents:
        # request paths in the log use /, so split on that, then search for the filename
        for fragment in line.split('/'):
            if ".js " in fragment:
                # there will be text after .js, so remove it
                frags = fragment.split('.js ')
                # split on ".js " gives us the base filename as the first element of the list
                basename = frags[0]
                filename = basename + '.js'
                log_filenames.append(filename)
    # get unique values
    log_filenames = list(set(log_filenames))
    # sort
    log_filenames.sort()
    print('\n'.join(log_filenames))

Outputs:

jquery.js
jquery.jshowoff.js

Note: In getting unique values I converted the set back to a list just in case you're not used to working with sets.
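Since regex was mentioned as an option, here is a minimal sketch of that route for comparison. It assumes each log line contains a quoted request of the form "METHOD path HTTP/x.y", as in the two sample lines from the question. The names REQUEST_RE and js_filenames are made up for this example, and the sample list stands in for the real file.

```python
import posixpath
import re

# Assumed line shape: ... "GET /include/jquery.js HTTP/1.1" ...
# The group captures the requested path (no spaces allowed in it).
REQUEST_RE = re.compile(r'"[A-Z]+ (\S+) HTTP/[\d.]+"')


def js_filenames(lines):
    """Return the sorted, unique .js filenames requested in the given log lines."""
    names = set()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # line has no recognisable request
        path = match.group(1).split("?", 1)[0]  # drop any query string
        name = posixpath.basename(path)  # URL paths always use "/"
        if name.endswith(".js"):
            names.add(name)
    return sorted(names)


sample = [
    '72.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /include/jquery.js HTTP/1.1" 200 25139',
    '22.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /include/jquery.jshowoff.js HTTP/1.1" 200 25139',
    '72.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /include/jquery.js HTTP/1.1" 200 25139',
    '72.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /include/index.html HTTP/1.1" 200 25139',
]

print("\n".join(js_filenames(sample)))
```

To run it against the real log, pass the open file object instead of the list, for example js_filenames(access_log) inside the with block.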

2 Comments

@Sturat Thank you for your help. I've slightly adjusted your code, as your snippet only output jquery.js, but there are other files with longer names that also include jquery.js in them. In frags = fragment.split('.js ') I removed the .js and left just a space, then returned the basename without appending .js, and the output looks to be what I'm after. Now I need to sort the output and make sure no duplicates are returned.
@Atreyu I forgot about removing duplicates and sorting; I've fixed that. But the code works for me. If you copied and pasted it, try typing it instead (mainly the .split('.js ') part). If it was only returning jquery.js, that means there wasn't a space in the .split('.js ').

Here is another pure-Python solution, using the following logfile.txt as my input:

72.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /include/jquery.js HTTP/1.1" 200 25139
22.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /include/jquery.jshowoff.js HTTP/1.1" 200 25139
72.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /2468.js HTTP/1.1" 200 25139
72.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /Abcd.js HTTP/1.1" 200 25139
22.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /abcd.js HTTP/1.1" 200 25139
72.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /aBcd.js HTTP/1.1" 200 25139
22.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET / asd.js HTTP/1.1" 200 25139
72.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /include/index.html HTTP/1.1" 200 25139
72.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /include/login.jsp HTTP/1.1" 200 25139

All the JavaScript filenames are stored in a set, since you only want unique values. Before being printed, they are sorted alphabetically.

The code iterates over each line and uses rfind to locate the last occurrence of .js in the line. It then uses rfind again, limited to the part of the line before that point, to locate the last / that precedes the filename.

The line is sliced using these 2 indexes to give us the filename. If .js is not found, rfind returns -1, which doesn't matter because we check at the end if the filename ends with .js before adding it to the set. You could use rindex, but you would need to handle the ValueError for lines that don't have .js.

filenames = set()

with open(r"C:\Users\Old Joe\Desktop\logfile.txt") as f:
    for line in f:
        end = line.rfind(".js") + 3 # 3 = len(".js")
        start = line.rfind("/", 0, end) + 1 # 1 = len("/")
        filename = line[start:end]
        if filename.endswith(".js"):
            filenames.add(filename)


for filename in sorted(filenames, key=str.lower):
    print(filename)

Output:

 asd.js
2468.js
aBcd.js
abcd.js
Abcd.js
jquery.js
jquery.jshowoff.js
login.js
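Note that the output above includes login.js, which comes from the login.jsp entry: rfind(".js") also matches the first three characters of ".jsp". A possible variation, sketched below, splits each line on whitespace and keeps only tokens that actually end in .js. It assumes the requested path contains no spaces, so "GET / asd.js" yields asd.js without the leading space. The names js_name and unique_js_names are made up for this example, and LOG is the sample logfile.txt from this answer.

```python
LOG = '''\
72.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /include/jquery.js HTTP/1.1" 200 25139
22.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /include/jquery.jshowoff.js HTTP/1.1" 200 25139
72.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /2468.js HTTP/1.1" 200 25139
72.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /Abcd.js HTTP/1.1" 200 25139
22.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /abcd.js HTTP/1.1" 200 25139
72.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /aBcd.js HTTP/1.1" 200 25139
22.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET / asd.js HTTP/1.1" 200 25139
72.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /include/index.html HTTP/1.1" 200 25139
72.133.47.242 - - [25/Apr/2013:15:45:28 -0700] "GET /include/login.jsp HTTP/1.1" 200 25139
'''


def js_name(line):
    """Return the .js filename found in one log line, or None if there is none."""
    for token in line.split():
        # endswith rejects login.jsp, which a plain substring search would accept
        if token.endswith(".js"):
            # keep only the part after the last "/" (the whole token if there is no "/")
            return token.rsplit("/", 1)[-1]
    return None


def unique_js_names(lines):
    """Collect unique .js filenames, sorted case-insensitively."""
    names = {name for name in map(js_name, lines) if name is not None}
    # the second key element makes the order of Abcd/aBcd/abcd deterministic
    return sorted(names, key=lambda s: (s.lower(), s))


for name in unique_js_names(LOG.splitlines()):
    print(name)
```

With a real file, unique_js_names(f) can be called on the open file object, since it only needs an iterable of lines.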
