PATCH /1/crawlers/{id}/config

Authorizations

Authorization
string
header
required

Basic authentication header of the form Basic <encoded-value>, where <encoded-value> is the base64-encoded string username:password.

Path Parameters

id
string
required

Crawler ID.

Body

application/json

Crawler configuration to update. You can only update top-level configuration properties. To update a nested property, such as actions.recordExtractor, you must provide the complete top-level object, such as the full actions array.
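A minimal sketch of this request in Python, using the requests library. It assumes that the Basic credentials are your Crawler user ID and Crawler API key and that the API is served from https://crawler.algolia.com/api; all placeholder values are hypothetical. The example updates a single top-level property, rateLimit.

  import base64

  import requests

  # Hypothetical placeholders: replace with your Crawler user ID, Crawler API key,
  # and the ID of the crawler to update.
  CRAWLER_USER_ID = "YOUR_CRAWLER_USER_ID"
  CRAWLER_API_KEY = "YOUR_CRAWLER_API_KEY"
  CRAWLER_ID = "YOUR_CRAWLER_ID"

  # Basic authentication header: base64-encoded "username:password".
  token = base64.b64encode(f"{CRAWLER_USER_ID}:{CRAWLER_API_KEY}".encode()).decode()

  # Update a single top-level property (rateLimit).
  response = requests.patch(
      f"https://crawler.algolia.com/api/1/crawlers/{CRAWLER_ID}/config",
      headers={"Authorization": f"Basic {token}"},
      json={"rateLimit": 4},
      timeout=30,
  )
  response.raise_for_status()
  print(response.json()["taskId"])  # UUID of the configuration update task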

actions
object[]
required

Instructions about how to process crawled URLs.

Each action defines:

  • The targeted subset of URLs it processes.
  • What information to extract from the web pages.
  • The Algolia indices where the extracted records will be stored.

A single web page can match multiple actions. In this case, the crawler produces one record for each matched action.
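Because nested properties can't be patched on their own, changing anything inside an action, such as actions.recordExtractor, means sending the complete actions array. A sketch, reusing token and CRAWLER_ID from the request above; indexName, pathsToMatch, and recordExtractor are crawler configuration fields, but the "__type"/"source" wrapper shown for the function value is an assumption about the REST serialization, so verify it against the configuration reference.

  # Reuses token and CRAWLER_ID from the sketch above. The "__type"/"source"
  # wrapper for recordExtractor is an assumption about how function values are
  # serialized over the REST API.
  new_actions = [
      {
          "indexName": "docs",  # combined with indexPrefix into the full index name
          "pathsToMatch": ["https://www.example.com/docs/**"],
          "recordExtractor": {
              "__type": "function",
              "source": "({ url, $ }) => [{ objectID: url.href, title: $('title').text() }]",
          },
      }
  ]

  response = requests.patch(
      f"https://crawler.algolia.com/api/1/crawlers/{CRAWLER_ID}/config",
      headers={"Authorization": f"Basic {token}"},
      json={"actions": new_actions},  # the complete array, not just the changed field
      timeout=30,
  )
  response.raise_for_status()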

appId
string
required

Algolia application ID where the crawler creates and updates indices. The Crawler add-on must be enabled for this application.

rateLimit
number
required

Number of concurrent tasks per second.

If processing each URL takes n seconds, your crawler can process rateLimit / n URLs per second.

Higher numbers mean faster crawls but they also increase your bandwidth and server load.

Required range: 1 < x < 100
apiKey
string

Algolia API key for indexing the records.

The API key must:

  • Have the following access control list (ACL) permissions: search, browse, listIndexes, addObject, deleteObject, deleteIndex, settings, editSettings.
  • Not be the admin API key of the application.
  • Have access to create the indices that the crawler will use. For example, if indexPrefix is crawler_, the API key must have access to all crawler_* indices.
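One way to create a key with these permissions is the Search API's create-key endpoint. The sketch below is an illustration with hypothetical credentials, restricting the key to the crawler_* indices from the example above.

  import requests

  APP_ID = "YOUR_APP_ID"                # hypothetical Algolia application ID
  ADMIN_API_KEY = "YOUR_ADMIN_API_KEY"  # used only to create the key; never give it to the crawler

  response = requests.post(
      f"https://{APP_ID}.algolia.net/1/keys",
      headers={
          "X-Algolia-Application-Id": APP_ID,
          "X-Algolia-API-Key": ADMIN_API_KEY,
      },
      json={
          "acl": [
              "search", "browse", "listIndexes", "addObject",
              "deleteObject", "deleteIndex", "settings", "editSettings",
          ],
          "indexes": ["crawler_*"],  # matches the crawler_ indexPrefix example above
          "description": "API key for the Algolia Crawler",
      },
      timeout=30,
  )
  response.raise_for_status()
  print(response.json()["key"])  # use this value as apiKey in the crawler configuration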

exclusionPatterns
string[]

URLs to exclude from crawling.

externalData
string[]

References to external data sources for enriching the extracted records.

For more information, see Enrich extracted records with external data.

extraUrls
string[]

URLs from which to start crawling.

The crawler treats these the same as startUrls. URLs you crawl manually can be added to extraUrls.

ignoreCanonicalTo

Whether to ignore canonical redirects.

If true, canonical URLs for pages are ignored.

ignoreNoFollowTo
boolean

Whether to ignore the nofollow meta tag or link attribute. If true, links with the rel="nofollow" attribute or links on pages with the nofollow robots meta tag will be crawled.

ignoreNoIndex
boolean

Whether to ignore the noindex robots meta tag. If true, pages with this meta tag will be crawled.

ignoreQueryParams
string[]

Query parameters to ignore while crawling.

All URLs with the matching query parameters are treated as identical. This prevents indexing duplicate URLs that differ only by their query parameters.
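For example, to treat URLs that differ only by common tracking parameters as the same page (the parameter names are illustrative):

  # https://www.example.com/page?utm_source=newsletter and https://www.example.com/page
  # are then treated as the same URL.
  config_update = {"ignoreQueryParams": ["utm_source", "utm_medium", "ref"]}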

ignoreRobotsTxtRules
boolean

Whether to ignore rules defined in your robots.txt file.

indexPrefix
string

A prefix for all indices created by this crawler. It's combined with the indexName for each action to form the complete index name.

Maximum length: 64
initialIndexSettings
object

Initial index settings, one settings object per index.

These settings are only applied when the index is first created; they aren't re-applied afterwards. This prevents the crawler from overriding any settings changes you make after the index was created.
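A sketch of per-index settings, assuming a hypothetical crawler_docs index; searchableAttributes and customRanking are standard Algolia index settings chosen for illustration.

  # Applied only when crawler_docs is first created; later settings changes
  # made in the dashboard or through the API aren't overwritten.
  config_update = {
      "initialIndexSettings": {
          "crawler_docs": {
              "searchableAttributes": ["title", "content"],
              "customRanking": ["desc(popularity)"],
          }
      }
  }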

linkExtractor

Function for extracting URLs from links found on crawled pages.

login
object

Authorization method and credentials for crawling protected content.

maxDepth
number

Maximum path depth of crawled URLs. For example, if maxDepth is 2, https://example.com/foo/bar is crawled, but https://example.com/foo/bar/baz isn't. Trailing slashes increase the URL depth.

Required range: 1 < x < 100
maxUrls
number

Maximum number of crawled URLs.

Setting maxUrls doesn't guarantee consistency between crawls because the crawler processes URLs in parallel.

Required range: 1 < x < 15000000
renderJavaScript

Crawl JavaScript-rendered pages by rendering them with a headless browser.

Rendering JavaScript-based pages is slower than crawling regular HTML pages.

requestOptions
object

Options to add to all HTTP requests made by the crawler.

safetyChecks
object

Checks to ensure the crawl was successful.

saveBackup
boolean

Whether to back up your index before the crawler overwrites it with new records.

schedule
string

Schedule for running the crawl, expressed in Later.js syntax. If omitted, you must start crawls manually. For an example value, see the sketch after the following list.

  • The interval between two scheduled crawls must be at least 24 hours.
  • Times are in UTC.
  • Minutes must be explicit: write at 3:00 pm, not at 3 pm.
  • Write everyday as every 1 day.
  • Write midnight as at 12:00 pm.
  • If you omit the time, a crawl might start any time after midnight UTC.
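A sketch of a schedule value assembled from the rules above (Later.js text syntax); the specific time is illustrative.

  # Crawl once a day at 15:00 UTC; minutes are written out explicitly.
  config_update = {"schedule": "every 1 day at 3:00 pm"}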
sitemaps
string[]

Sitemaps with URLs from which to start crawling.

startUrls
string[]

URLs from which to start crawling.

Response

200 - application/json
taskId
string
required

Universally unique identifier (UUID) of the task.