Organizations are complex. In an ideal world, every team would communicate perfectly and fulfill its individual responsibilities, but we don’t live in an ideal world. So in larger companies maintaining lots of content across various data sources and destinations, holistic site search can be a real pain in the neck. If the teams aren’t working together, and there is no simple, standard database of all the searchable information, how do you gather all that data together to create a search index?
We’ve wrestled with that question before, so we took it upon ourselves to solve it by introducing Algolia Crawler. It’s an AI-powered web crawler that can automatically scan your site and create search indices for you, one for each type of content it finds.
To demonstrate just how simple this is to set up, I’ve put together a small project with the help of a Vercel blog template and ChatGPT-generated articles. Here’s the repo if you’re curious, but the gist of it is that we’re hosting several articles praising assorted vegetables. You can see the articles on the live demo here. What we’d like to do is catalogue these articles in a search index so that our users can search through them. Here’s the thing: to mimic the real-life scenario where we may not have all of the information at hand, we’re going to do that without pulling structured data on these blog posts from a CMS or wherever the posts actually live. We’re going to generate the index from the live, production blog URL using Algolia Crawler.
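To make that goal concrete, here’s roughly the shape of record we’re hoping to end up with for each post. The attribute names below are my own sketch, not a schema the Crawler promises; it decides the final structure based on what it extracts from each page.

```ts
// A rough sketch of the record we'd like per blog post. These attribute
// names are hypothetical; the Crawler determines the actual schema.
interface BlogPostRecord {
  objectID: string;    // unique record ID, typically derived from the URL
  url: string;         // canonical URL of the post
  title: string;       // e.g. "In Praise of the Humble Carrot"
  description: string; // short summary pulled from the page metadata
  content: string;     // the article body, for full-text search
}
```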
First, click on the Data Sources tab in the menu on the bottom left of the screen. It’s that icon that looks like a silo. In there, you can visit the Crawler tab and click the “Add new crawler” button.
It’ll walk you through verifying that you own the domain you’re about to crawl, which proves you’re authorized to crawl it and keeps Algolia Crawler from being used maliciously. If you’re curious about the ethics of web scraping, check out this wonderful article from Towards Data Science.
Once you run through the process, it’ll tell you that your data is ready in several indices.
Head to the Configuration page and update whatever you’d like. Here you can set the Crawler to run on a schedule, handle authentication, and a smattering of other useful stuff. If you change any of this, make sure to press the button to rerun the Crawler so the indices it creates are updated.
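To give you a feel for what’s configurable, here’s a minimal sketch of a configuration in the Crawler’s JavaScript editor. The URLs, index name, schedule, and CSS selectors are assumptions for our vegetable blog, not the exact config Algolia generated for us; see the Crawler docs for the full set of options.

```ts
// A minimal, hypothetical Crawler configuration. The `Crawler` constructor
// is provided by the dashboard's config editor.
new Crawler({
  appId: 'YOUR_APP_ID',
  apiKey: 'YOUR_CRAWLER_API_KEY',
  schedule: 'every 1 day', // rerun automatically so the indices stay fresh
  startUrls: ['https://vegetable-blog-demo.vercel.app/'],
  actions: [
    {
      indexName: 'vegetable_articles',
      pathsToMatch: ['https://vegetable-blog-demo.vercel.app/posts/**'],
      // Turn each matching page into one search record.
      recordExtractor: ({ url, $ }) => [
        {
          objectID: url.href,
          url: url.href,
          title: $('h1').first().text().trim(),
          content: $('article p').text().trim(),
        },
      ],
    },
  ],
});
```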
Under the URL Inspector page, you can find which URLs the Crawler found and crawled. If you’re testing with a smaller set of pages, it’s a good idea to make sure it’s picking up everything you want. The Crawler is very advanced, but it can’t pick up random URLs that aren’t linked to anything. This testing stage can help you confirm that all the URLs you want to crawl are discoverable.
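If some pages genuinely aren’t reachable by links, you can point the Crawler at them explicitly in the configuration. Assuming the site publishes a sitemap, something along these lines should work (the URLs here are hypothetical, continuing the sketch above):

```ts
// Hypothetical additions to the config sketched above, helping the
// Crawler discover pages that aren't linked from anywhere on the site.
new Crawler({
  // ...same appId, apiKey, and actions as before...
  startUrls: ['https://vegetable-blog-demo.vercel.app/'],
  sitemaps: ['https://vegetable-blog-demo.vercel.app/sitemap.xml'],
  extraUrls: ['https://vegetable-blog-demo.vercel.app/posts/secret-turnip'],
});
```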
Lastly, check the Data Analysis tab. It’ll flag any issues with the data the Crawler is bringing back. The run you’re seeing in the screenshots here (the one over GPT-generated vegetable articles) didn’t have any issues, and yours likely won’t either. This is just a way to surface potential hiccups before they ever mess with your data.
If you head over to the index you’ve created, you should see all your crawled data.
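If you want to sanity-check it programmatically, you can query the index with Algolia’s JavaScript API client (v4 shown here); the index name and credentials below are placeholders for our demo:

```ts
// Quick sanity check: query the crawled index with the JS API client (v4).
import algoliasearch from 'algoliasearch';

const client = algoliasearch('YOUR_APP_ID', 'YOUR_SEARCH_ONLY_API_KEY');
const index = client.initIndex('vegetable_articles');

index.search('carrot').then(({ hits }) => {
  // Each hit is one crawled blog post record.
  console.log(hits.length, 'posts mention carrots');
});
```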
Now you can use this data to build a search interface just like you would with any other index! We don’t need to go into too much detail because we’ve got great instructional guides on what you can do with indices already available for you here, here, and here.
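For instance, a bare-bones InstantSearch.js interface over our crawled index could look something like this. Again, the index name and credentials are placeholders, and it assumes the page has #searchbox and #hits containers:

```ts
// A bare-bones InstantSearch.js interface over the crawled index.
import algoliasearch from 'algoliasearch/lite';
import instantsearch from 'instantsearch.js';
import { searchBox, hits } from 'instantsearch.js/es/widgets';

const search = instantsearch({
  indexName: 'vegetable_articles',
  searchClient: algoliasearch('YOUR_APP_ID', 'YOUR_SEARCH_ONLY_API_KEY'),
});

search.addWidgets([
  searchBox({ container: '#searchbox' }),
  hits({
    container: '#hits',
    templates: {
      // Render each crawled post as a simple link.
      item: (hit) => `<a href="${hit.url}">${hit.title}</a>`,
    },
  }),
]);

search.start();
```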
If you’re interested in using Algolia Crawler, chances are you’re working on projects with lots of stakeholders. If that’s the case, a UI demo is a simple, three-click way to build a prototype with your newly structured data. Under the UI Demos tab in the index page, just click either InstantSearch or Autocomplete under the Get started dropdown to generate a new UI demo.
It’ll take you to a page where you can search through the data and demonstrate the speed and flexibility to stakeholders.
Here’s the UI Demo I created for our sample demonstration.
You can get far more elaborate with the setup to use more specific features if you wish, but you can get a simple prototype up and running without any friction in just a few minutes. The biggest annoyance of the whole process was getting ChatGPT to write the articles we’re scanning. So if you’re looking to index the content on your company’s site without investing tons of time and effort, why not check out Algolia Crawler? And if you create something cool with it, be sure to let us know using the chat bubble at the bottom right. Maybe your project or story will end up featured here on the blog.
Jaden Baptista
Freelance Writer at Author's Collective