You work in a company that has many sources of information. There’s a wiki maintained by the support team, and a SharePoint intranet used by the marketing team. There’s also a shared folder where anyone drops Word, Excel and PDF files. The developers have their own JIRA bug tracker. Looking for a piece of information is a nightmare. You looked for a tool that would index all of this data and make it searchable from a single place, and eventually found SearchBlox. Built on top of the Apache Lucene search engine, SearchBlox offers a really nice, easy-to-use interface. There is even a free edition! But soon you hit a problem: how do you index a website that requires authentication?
You configured SearchBlox by creating some “collections”, each pointing to a website or a folder to index. But your intranet website requires you to authenticate before you can access it, and so must SearchBlox. If your website is protected by simple HTML form authentication, here is the procedure to follow:
As an example, let’s assume that you want to index the content of Jira, the Atlassian bug tracker. Authentication is done using an HTML form.
- In your SearchBlox admin page, edit or create your HTTP collection. In the “Paths” tab, set “Root URLs” to your website’s base URL (e.g. https://jira.mydomain.com/jira).
- Open your browser (ideally Chrome) and go to your website’s authentication page (e.g. https://jira.mydomain.com/jira/login.jsp). Using the browser’s developer tools, locate the HTML form used to send the authentication information (username and password), so that you can identify which fields are sent when the form is submitted. The form fields are the <input> tags. Write down the value of the “name” attribute of each of these fields. In the screenshot below, which shows the Jira authentication form, the “os_username” field is visible. We’ll also identify the “os_password”, “os_destination” and “login” fields.
- Also identify the HTTP method used by the form (POST or GET). It is given by the “method” attribute of the <form> HTML tag.
- Now, in the “Settings” tab of your SearchBlox collection, locate the “Form authentication” section:
Fill it in with the data you collected: Form URL, Form Action, and the form name/value pairs. Add as many name/value settings as there are fields in your authentication form. Enter your username and password in the “Value” of the corresponding “Name”:
- Finally, start the indexing (in the “Index” tab), and TADAAAM! You’re done. You can start searching your website.
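If you would rather not hunt for the fields by hand in the developer tools, the same information can be pulled out of the login page’s HTML with a few lines of Python. This is only a minimal sketch using the standard library: the `LoginFormParser` class, the `describe_forms` helper and the sample markup are made up for illustration (the markup is loosely modelled on the Jira fields named above, not copied from a real Jira page), so paste in the HTML of your own login page instead.

```python
from html.parser import HTMLParser


class LoginFormParser(HTMLParser):
    """Collects the method, action and input names of each <form> in a page."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per <form> encountered
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                # Browsers default to GET when no method attribute is given.
                "method": (attrs.get("method") or "get").upper(),
                "action": attrs.get("action") or "",
                "fields": {},
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None and attrs.get("name"):
            # Keep the field name and its default value (often empty).
            self._current["fields"][attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def describe_forms(html):
    """Return a list of {method, action, fields} dicts, one per form."""
    parser = LoginFormParser()
    parser.feed(html)
    return parser.forms


# Hypothetical markup, for illustration only.
SAMPLE_HTML = """
<form method="post" action="/jira/login.jsp">
  <input type="text" name="os_username">
  <input type="password" name="os_password">
  <input type="hidden" name="os_destination" value="">
  <input type="submit" name="login" value="Log In">
</form>
"""

if __name__ == "__main__":
    for form in describe_forms(SAMPLE_HTML):
        print(form["method"], form["action"], list(form["fields"]))
```

The method, action and field names it prints are the values to compare with what you wrote down from the developer tools before filling in the “Form authentication” section.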
That’s a good start, you will say, but what do I do if my website doesn’t use HTML form authentication? What if it’s a Microsoft SharePoint website using NTLM authentication (i.e. Windows/Active Directory/LDAP authentication)? Unfortunately, you can’t index an NTLM-authenticated website. SearchBlox indicates on its website that only Basic HTTP authentication and form-based authentication are supported. Their support team confirmed that the feature is not available, but that it may be added. Meanwhile, if you have any workaround suggestions, please let me know!
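In the meantime, a quick way to tell which case a site falls into is to look at the `WWW-Authenticate` headers of the 401 response it returns to an anonymous request (they are visible in the Network panel of the browser’s developer tools). The sketch below is only an illustration: the function names are mine, the `SUPPORTED_SCHEMES` set simply restates the Basic-only limitation described above, and it assumes each header value carries a single challenge, which is a simplification of what HTTP allows.

```python
# Header-based schemes the crawler can handle, per the limitation described
# in this article (assumption: Basic only; form login is not header-based).
SUPPORTED_SCHEMES = {"basic"}


def advertised_schemes(www_authenticate_values):
    """Return the lowercase scheme names found in WWW-Authenticate values.

    Assumes one challenge per header value, e.g. 'Basic realm="intranet"'
    or 'NTLM'. Comma-separated multi-challenge values are not split.
    """
    schemes = []
    for value in www_authenticate_values:
        parts = value.strip().split(None, 1)
        if parts:
            scheme = parts[0].lower()
            if scheme not in schemes:
                schemes.append(scheme)
    return schemes


def indexable_with_http_auth(www_authenticate_values):
    """True if at least one advertised scheme is in SUPPORTED_SCHEMES."""
    return any(
        scheme in SUPPORTED_SCHEMES
        for scheme in advertised_schemes(www_authenticate_values)
    )


if __name__ == "__main__":
    # Typical values for a Windows-authenticated site vs. a Basic-auth site.
    print(indexable_with_http_auth(["Negotiate", "NTLM"]))
    print(indexable_with_http_auth(['Basic realm="intranet"']))
```

A site that only advertises `NTLM` or `Negotiate` falls into the unsupported case; a site that returns no 401 at all and shows a login page instead is most likely using a form.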