T∃∀m 13

Inspiration

The web might be huge, but the part of it we see most often isn't necessarily the most interesting. I firmly believe that the best bits of the internet are the indie websites: personal blogs, miscellaneous trinkets that people publish and forget about, and the constant variety they bring.

Also the Marginalia search engine is cool.

What it does

Rummage is a search engine for the small web*. It seeks out small, interesting, independent websites and shows them to you. The aim is to help you discover the depths of the internet that you never would have known existed otherwise.

* well - it would be if it worked.

How I built it

The entire codebase, from the web scraping and indexing to the web UI, is built in Go with SQLite3 as the storage engine. The crawler, the indexer and the search logic/frontend are three separate components that work in tandem to make up the whole system.

Challenges I ran into

  • Resolving URLs is hard - turning a relative anchor on a webpage into an absolute URL is more challenging than you would casually expect
  • Efficiency - due to poor design choices that were made to optimise for speed (e.g. just dumping every crawled page into a file in a directory), indexing took a LONG time - about 5 hours to get through 16% of the scraped data.
    • This also impacted the performance of the final product
  • Writing a good ranking algorithm is not straightforward, especially when you don't really know what you're doing
    • Nothing in the search results really relates that well to the search query
  • Ran out of time - which is why it doesn't filter for the small web at all
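
To illustrate the URL resolution problem from the list above, here is a minimal sketch using Go's standard net/url package. It is not Rummage's actual code: the function name resolveLink and its parameters are made up for this example, and dropping fragments and non-HTTP schemes is an assumption about what a crawler would want.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveLink turns an href found on a page into an absolute URL.
// pageURL is the address the page was fetched from and href is the raw
// attribute value. Both names are hypothetical, not taken from Rummage.
func resolveLink(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", fmt.Errorf("parse base %q: %w", pageURL, err)
	}
	ref, err := url.Parse(strings.TrimSpace(href))
	if err != nil {
		return "", fmt.Errorf("parse href %q: %w", href, err)
	}
	// ResolveReference applies the RFC 3986 rules for "../", "//host/..."
	// and query-only references.
	abs := base.ResolveReference(ref)
	// Assumption: a crawler only wants links it can fetch over HTTP(S),
	// so mailto:, javascript: and friends are rejected.
	if abs.Scheme != "http" && abs.Scheme != "https" {
		return "", fmt.Errorf("unsupported scheme %q", abs.Scheme)
	}
	// A #fragment points inside the same document, so drop it to avoid
	// crawling the same page twice.
	abs.Fragment = ""
	return abs.String(), nil
}

func main() {
	page := "https://example.com/blog/post.html"
	for _, href := range []string{"../about#me", "//cdn.example.org/x.png", "mailto:me@example.com"} {
		abs, err := resolveLink(page, href)
		fmt.Println(href, "->", abs, err)
	}
}
```

Most of the difficulty sits in the edge cases (protocol-relative links, query-only links, odd schemes), which is why leaning on the standard library is usually safer than string concatenation.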

Accomplishments that I'm proud of

It works! It's technically a functional search engine that shows you places on the internet!

What I learned

  • You should think more about how best to store data before you go and download 30GB of websites into flat files
  • Ranking algorithms exist! And they do things!!
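
For a sense of what a ranking algorithm does, here is a minimal TF-IDF sketch in Go. It is a common textbook baseline and not the ranking Rummage used; the names tokenize and rank, and the exact weighting formula, are assumptions chosen for this example.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
	"unicode"
)

// tokenize lowercases text and splits it on anything that is not a
// letter or a digit.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// rank scores every document against the query with TF-IDF and returns
// the indices of matching documents, best match first. Documents that
// score zero are left out.
func rank(docs []string, query string) []int {
	tf := make([]map[string]int, len(docs)) // term counts per document
	lengths := make([]int, len(docs))       // token count per document
	df := map[string]int{}                  // number of documents containing each term
	for i, d := range docs {
		tf[i] = map[string]int{}
		toks := tokenize(d)
		lengths[i] = len(toks)
		for _, t := range toks {
			if tf[i][t] == 0 {
				df[t]++
			}
			tf[i][t]++
		}
	}

	scores := make([]float64, len(docs))
	for _, q := range tokenize(query) {
		if df[q] == 0 {
			continue // term appears nowhere in the corpus
		}
		// Rare terms get a higher weight than common ones.
		idf := math.Log(1 + float64(len(docs))/float64(df[q]))
		for i := range docs {
			if lengths[i] > 0 {
				scores[i] += float64(tf[i][q]) / float64(lengths[i]) * idf
			}
		}
	}

	var order []int
	for i, s := range scores {
		if s > 0 {
			order = append(order, i)
		}
	}
	sort.SliceStable(order, func(a, b int) bool {
		return scores[order[a]] > scores[order[b]]
	})
	return order
}

func main() {
	docs := []string{
		"a blog about gardening and tomatoes",
		"a search engine for the small web",
		"small web small blogs small sites",
	}
	fmt.Println(rank(docs, "small web"))
}
```

A scheme like this only looks at word overlap, so it says nothing about whether a site is small or independent; that filtering would have to be a separate signal.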

What's next for Rummage

I'm going to delete everything and try again, but this time as something that's sustainable, functional and cool(er). And on a day I've had more than 2 hours of sleep.
