Test crawling a URL
Tests a URL with the crawler’s configuration and shows the extracted records.
You can override parts of the configuration to test your changes before updating the configuration.
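The request below is a minimal sketch of how you might call this endpoint from TypeScript (Node 18+). The base URL and path (https://crawler.algolia.com/api/1/crawlers/{id}/test) aren't shown in this excerpt and are assumptions, so verify them against the endpoint definition; the credentials are the crawler user ID and API key described under Authorizations.

```ts
// Minimal sketch: test-crawl a single URL.
// Assumption: the endpoint is POST https://crawler.algolia.com/api/1/crawlers/{id}/test.
const crawlerId = "YOUR_CRAWLER_ID";
const credentials = Buffer.from("CRAWLER_USER_ID:CRAWLER_API_KEY").toString("base64");

const response = await fetch(
  `https://crawler.algolia.com/api/1/crawlers/${crawlerId}/test`,
  {
    method: "POST",
    headers: {
      Authorization: `Basic ${credentials}`,
      "Content-Type": "application/json",
    },
    // The URL to test; a configuration override can also be sent (see Body below).
    body: JSON.stringify({ url: "https://www.example.com/docs/getting-started" }),
  },
);
console.log(await response.json()); // extracted records, links, and logs
```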
Authorizations
Basic authentication header of the form Basic <encoded-value>, where <encoded-value> is the base64-encoded string username:password.
Path Parameters
Crawler ID.
Body
URL to test.
Crawler configuration to update.
You can only update top-level configuration properties.
To update a nested configuration, such as actions.recordExtractor, you must provide the complete top-level object, such as actions.
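As an illustration of this rule, the sketch below overrides actions.recordExtractor for a test by sending the complete actions array. The body field names (url, config) and the action shape are assumptions for this example, not confirmed by this excerpt.

```ts
// Hypothetical override body for the test request. Because only top-level
// properties can be replaced, the complete `actions` array is provided,
// even though only `recordExtractor` changes.
const body = {
  url: "https://www.example.com/docs/getting-started",
  config: {
    actions: [
      {
        indexName: "docs",
        pathsToMatch: ["https://www.example.com/docs/**"],
        // Per this reference, the extractor is sent as a JavaScript function in string form.
        recordExtractor: `({ url, $ }) => [{ objectID: url.href, title: $("title").text() }]`,
      },
    ],
  },
};
```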
Instructions about how to process crawled URLs.
Each action defines:
- The targeted subset of URLs it processes.
- What information to extract from the web pages.
- The Algolia indices where the extracted records will be stored.
A single web page can match multiple actions. In this case, the crawler produces one record for each matched action.
Whether to generate objectID properties for each extracted record.
If false, you must manually add objectID properties to the extracted records.
Whether the crawler should cache crawled pages.
With caching, the crawler only crawls changed pages.
To detect changed pages, the crawler makes HTTP conditional requests to your pages.
The crawler uses the ETag and Last-Modified response headers returned by your web server during the previous crawl.
The crawler sends this information in the If-None-Match and If-Modified-Since request headers.
If your web server responds with 304 Not Modified to the conditional request, the crawler reuses the records from the previous crawl.
Caching is ignored in these cases:
- If your crawler configuration changed between two crawls.
- If externalData changed between two crawls.
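For illustration only (this isn't the crawler's internal code), the following sketch shows the conditional-request mechanism described above: a request carrying If-None-Match and If-Modified-Since can be answered with 304 Not Modified, in which case previously extracted records can be reused.

```ts
// Illustration of an HTTP conditional request, as used for crawler caching.
const previous = {
  etag: '"abc123"',                              // ETag from the last crawl
  lastModified: "Tue, 07 May 2024 10:00:00 GMT", // Last-Modified from the last crawl
};

const res = await fetch("https://www.example.com/docs/getting-started", {
  headers: {
    "If-None-Match": previous.etag,
    "If-Modified-Since": previous.lastModified,
  },
});

if (res.status === 304) {
  // Page unchanged: the records from the previous crawl can be reused.
} else {
  // Page changed: extract records again from the response body.
}
```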
Whether the crawler cache is active.
Patterns for additional pages to visit to find links without extracting records.
The crawler looks for matching pages and crawls them for links, but doesn't extract records from the (intermediate) pages themselves.
File types for crawling non-HTML documents.
Non-HTML documents are first converted to HTML by an Apache Tika server.
Crawling non-HTML documents has the following limitations:
- It's slower than crawling HTML documents.
- PDFs must include the used fonts.
- The produced HTML pages might not be semantic. This makes achieving good relevance more difficult.
- Natural language detection isn't supported.
- Extracted metadata might vary between files produced by different programs and versions.
Available options: doc, email, html, odp, ods, odt, pdf, ppt, xls
Key-value pairs to replace matching hostnames found in a sitemap, on a page, in canonical links, or redirects.
The crawler continues from the transformed URLs.
The mapping doesn't transform URLs listed in the startUrls, siteMaps, pathsToMatch, and other settings.
The mapping also doesn't replace hostnames found in extracted text.
Hostname that should be used in the records.
Name of the index where the extracted records from this action are stored.
The name is combined with the prefix you specified in the indexPrefix option.
256
Unique identifier for the action. This option is required if schedule is set.
Key-value pairs to replace matching paths with new values.
The crawl continues from the transformed URLs.
The mapping doesn't transform URLs listed in the startUrls, siteMaps, pathsToMatch, and other settings.
The mapping also doesn't replace paths found in extracted text.
Patterns for URLs to which this action should apply.
Function for extracting information from a crawled page and transforming it into Algolia records for indexing.
function
JavaScript function (as a string) for extracting information from a crawled page and transforming it into Algolia records for indexing.
The Crawler dashboard has an editor with autocomplete and validation, which makes editing the recordExtractor property easier.
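As a hedged sketch of such a function: the example below assumes the extractor receives the crawled page's URL and a Cheerio-like $ selector and returns the array of records to index. Field names in the returned record are illustrative.

```ts
// A minimal recordExtractor sketch, as you might write it in the dashboard editor.
const recordExtractor = ({ url, $ }: { url: URL; $: any }) => [
  {
    objectID: url.href,                                         // unique record identifier
    title: $("head > title").text(),                            // page title
    description: $('meta[name="description"]').attr("content") || "",
    content: $("article p").text(),                             // main body text
  },
];
```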
DOM selectors for nodes that must be present on the page to be processed. If the page doesn't match any of the selectors, it's ignored.
Algolia application ID where the crawler creates and updates indices. The Crawler add-on must be enabled for this application.
Number of concurrent tasks per second.
If processing each URL takes n seconds, your crawler can process rateLimit / n URLs per second. For example, with a rateLimit of 8 and pages that take 2 seconds to process, the crawler handles about 4 URLs per second.
Higher numbers mean faster crawls, but they also increase your bandwidth and server load.
1 < x < 100
Algolia API key for indexing the records.
The API key must have the following access control list (ACL) permissions:
search, browse, listIndexes, addObject, deleteObject, deleteIndex, settings, editSettings.
The API key must not be the admin API key of the application.
The API key must have access to create the indices that the crawler will use.
For example, if indexPrefix is crawler_, the API key must have access to all crawler_* indices.
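If you need to create such a restricted key, the sketch below uses Algolia's key-management endpoint (POST /1/keys) with the ACLs listed above and limits it to crawler_* indices, matching the indexPrefix example. APP_ID and ADMIN_API_KEY are placeholders; the admin key is used only for this one-time setup call.

```ts
// Sketch: create a dedicated (non-admin) key with the required ACLs.
const response = await fetch("https://APP_ID.algolia.net/1/keys", {
  method: "POST",
  headers: {
    "X-Algolia-Application-Id": "APP_ID",
    "X-Algolia-API-Key": "ADMIN_API_KEY", // admin key used only to create the new key
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    description: "Crawler indexing key",
    acl: ["search", "browse", "listIndexes", "addObject", "deleteObject",
          "deleteIndex", "settings", "editSettings"],
    indexes: ["crawler_*"], // matches the indexPrefix example above
  }),
});
console.log(await response.json()); // contains the generated key
```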
URLs to exclude from crawling.
References to external data sources for enriching the extracted records.
For more information, see Enrich extracted records with external data.
URLs from where to start crawling.
These are the same as startUrls.
URLs you crawl manually can be added to extraUrls.
Whether to ignore canonical redirects.
If true, canonical URLs for pages are ignored.
Whether to ignore the nofollow meta tag or link attribute.
If true, links with the rel="nofollow" attribute or links on pages with the nofollow robots meta tag will be crawled.
Whether to ignore the noindex robots meta tag.
If true, pages with this meta tag will be crawled.
Query parameters to ignore while crawling.
All URLs with the matching query parameters are treated as identical. This prevents indexing duplicate URLs that differ only by their query parameters.
Whether to ignore rules defined in your robots.txt file.
A prefix for all indices created by this crawler. It's combined with the indexName for each action to form the complete index name.
64
Initial index settings, one settings object per index.
This setting is applied only when the index is first created; it isn't re-applied later, so it won't override settings changes made after the index was created.
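A minimal, illustrative initialIndexSettings object might look like the following; the index name and individual settings are example values, keyed by the full index name (prefix included).

```ts
// Illustrative initialIndexSettings: one settings object per full index name.
// Applied only when the index is first created.
const initialIndexSettings = {
  crawler_docs: {
    searchableAttributes: ["unordered(title)", "description", "content"],
    attributesForFaceting: ["filterOnly(section)"],
    customRanking: ["desc(popularity)"],
  },
};
```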
Index settings.
Attributes used for faceting.
Facets are attributes that let you categorize search results. They can be used for filtering search results. By default, no attribute is used for faceting. Attribute names are case-sensitive.
Modifiers
- filterOnly("ATTRIBUTE"). Allows the attribute to be used as a filter but doesn't evaluate the facet values.
- searchable("ATTRIBUTE"). Allows searching for facet values.
- afterDistinct("ATTRIBUTE"). Evaluates the facet count after deduplication with distinct. This ensures accurate facet counts. You can apply this modifier to searchable facets: afterDistinct(searchable(ATTRIBUTE)).
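For example, an attributesForFaceting array combining these modifiers could look like this (attribute names are illustrative):

```ts
// Example attributesForFaceting combining the modifiers described above.
const attributesForFaceting = [
  "category",                          // regular facet: counts and filtering
  "filterOnly(brand)",                 // filtering only, no facet value counts
  "searchable(author)",                // facet values can be searched
  "afterDistinct(searchable(color))",  // counts computed after deduplication
];
```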
Creates replica indices.
Replicas are copies of a primary index with the same records but different settings, synonyms, or rules. If you want to offer a different ranking or sorting of your search results, you'll use replica indices. All index operations on a primary index are automatically forwarded to its replicas. To add a replica index, you must provide the complete set of replicas to this parameter. If you omit a replica from this list, the replica turns into a regular, standalone index that will no longer be synced with the primary index.
Modifier
- virtual("REPLICA"). Creates a virtual replica. Virtual replicas don't increase the number of records and are optimized for Relevant sorting.
Maximum number of search results that can be obtained through pagination.
Higher pagination limits might slow down your search. For pagination limits above 1,000, the sorting of results beyond the 1,000th hit can't be guaranteed.
x < 20000
Attributes that can't be retrieved at query time.
This can be useful if you want to use an attribute for ranking or to restrict access, but don't want to include it in the search results. Attribute names are case-sensitive.
Creates a list of words which require exact matches. This also turns off word splitting and concatenation for the specified words.
Attributes for which you want to support Japanese transliteration.
Transliteration supports searching in any of the Japanese writing systems. To support transliteration, you must set the indexing language to Japanese. Attribute names are case-sensitive.
Attributes for which to split camel case words. Attribute names are case-sensitive.
Searchable attributes to which Algolia should apply word segmentation (decompounding). Attribute names are case-sensitive.
Compound words are formed by combining two or more individual words, and are particularly prevalent in Germanic languages—for example, "firefighter". With decompounding, the individual components are indexed separately.
You can specify different lists for different languages.
Decompounding is supported for these languages: Dutch (nl), German (de), Finnish (fi), Danish (da), Swedish (sv), and Norwegian (no).
Decompounding doesn't work for words with non-spacing mark Unicode characters.
For example, Gartenstühle won't be decompounded if the ü consists of u (U+0075) and ◌̈ (U+0308).
Languages for language-specific processing steps, such as word detection and dictionary settings.
You should always specify an indexing language.
If you don't specify an indexing language, the search engine uses all supported languages,
or the languages you specified with the ignorePlurals or removeStopWords parameters.
This can lead to unexpected search results.
For more information, see Language-specific configuration.
Available options: af, ar, az, bg, bn, ca, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, he, hi, hu, hy, id, is, it, ja, ka, kk, ko, ku, ky, lt, lv, mi, mn, mr, ms, mt, nb, nl, no, ns, pl, ps, pt, pt-br, qu, ro, ru, sk, sq, sv, sw, ta, te, th, tl, tn, tr, tt, uk, ur, uz, zh
Searchable attributes for which you want to turn off prefix matching. Attribute names are case-sensitive.
Whether arrays with exclusively non-negative integers should be compressed for better performance. If true, the compressed arrays may be reordered.
Numeric attributes that can be used as numerical filters. Attribute names are case-sensitive.
By default, all numeric attributes are available as numerical filters. For faster indexing, reduce the number of numeric attributes.
To turn off filtering for all numeric attributes, specify an attribute that doesn't exist in your index, such as NO_NUMERIC_FILTERING.
Modifier
- equalOnly("ATTRIBUTE"). Support only filtering based on the equality comparisons = and !=.
Control which non-alphanumeric characters are indexed.
By default, Algolia ignores non-alphanumeric characters like hyphen (-), plus (+), and parentheses ((, )).
To include such characters, define them with separatorsToIndex.
Separators are all non-letter characters except spaces and currency characters, such as $€£¥.
With separatorsToIndex, Algolia treats separator characters as separate words.
For example, in a search for "Disney+", Algolia considers "Disney" and "+" as two separate words.
Attributes used for searching. Attribute names are case-sensitive.
By default, all attributes are searchable and the Attribute ranking criterion is turned off.
With a non-empty list, Algolia only returns results with matches in the selected attributes.
In addition, the Attribute ranking criterion is turned on: matches in attributes that are higher in the list of searchableAttributes rank first.
To make matches in two attributes rank equally, include them in a comma-separated string, such as "title,alternate_title".
Attributes with the same priority are always unordered.
For more information, see Searchable attributes.
Modifier
- unordered("ATTRIBUTE"). Ignore the position of a match within the attribute. Without a modifier, matches at the beginning of an attribute rank higher than matches at the end.
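A short illustrative searchableAttributes list applying these rules (attribute names are examples):

```ts
// Example searchableAttributes: order sets priority, a comma-separated string gives
// two attributes equal priority, and unordered() ignores the match position.
const searchableAttributes = [
  "title,alternate_title",   // equal priority
  "unordered(description)",  // position inside the attribute doesn't matter
  "content",                 // lowest priority
];
```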
An object with custom data.
You can store up to 32kB as custom data.
Characters and their normalized replacements. This overrides Algolia's default normalization.
Attribute that should be used to establish groups of results. Attribute names are case-sensitive.
All records with the same value for this attribute are considered a group.
You can combine attributeForDistinct with the distinct search parameter to control how many items per group are included in the search results.
If you want to also use the same attribute for faceting, use the afterDistinct modifier of the attributesForFaceting setting.
This applies faceting after deduplication, which will result in accurate facet counts.
Maximum number of facet values to return when searching for facet values.
x < 100
Attributes to include in the API response.
To reduce the size of your response, you can retrieve only some of the attributes. Attribute names are case-sensitive.
- * retrieves all attributes, except attributes included in the customRanking and unretrievableAttributes settings.
- To retrieve all attributes except a specific one, prefix the attribute with a dash and combine it with the *: ["*", "-ATTRIBUTE"].
- The objectID attribute is always included.
Determines the order in which Algolia returns your results.
By default, each entry corresponds to a ranking criterion. The tie-breaking algorithm sequentially applies each criterion in the order they're specified. If you configure a replica index for sorting by an attribute, put the sorting attribute at the top of the list.
Modifiers
- asc("ATTRIBUTE"). Sort the index by the values of an attribute, in ascending order.
- desc("ATTRIBUTE"). Sort the index by the values of an attribute, in descending order.
Before you modify the default setting, you should test your changes in the dashboard, and by A/B testing.
Attributes to use as custom ranking. Attribute names are case-sensitive.
The custom ranking attributes decide which items are shown first if the other ranking criteria are equal.
Records with missing values for your selected custom ranking attributes are always sorted last. Boolean attributes are sorted based on their alphabetical order.
Modifiers
- asc("ATTRIBUTE"). Sort the index by the values of an attribute, in ascending order.
- desc("ATTRIBUTE"). Sort the index by the values of an attribute, in descending order.
If you use two or more custom ranking attributes, reduce the precision of your first attributes, or the other attributes will never be applied.
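As an illustration of that precision advice, the sketch below puts a reduced-precision attribute first so that the second criterion can still break ties; both attribute names are hypothetical.

```ts
// Hypothetical customRanking: a date rounded to the day goes first so that ties are
// frequent enough for the second criterion (likes) to matter.
const customRanking = [
  "desc(publish_date_rounded_to_day)",
  "desc(likes)",
];
```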
Relevancy threshold below which less relevant results aren't included in the results.
You can only set relevancyStrictness on virtual replica indices.
Use this setting to strike a balance between the relevance and number of returned results.
Attributes to highlight.
By default, all searchable attributes are highlighted.
Use * to highlight all attributes or use an empty array [] to turn off highlighting.
Attribute names are case-sensitive.
With highlighting, strings that match the search query are surrounded by HTML tags defined by highlightPreTag and highlightPostTag.
You can use this to visually highlight matching parts of a search query in your UI.
For more information, see Highlighting and snippeting.
Attributes for which to enable snippets. Attribute names are case-sensitive.
Snippets provide additional context to matched words.
If you enable snippets, they include 10 words, including the matched word.
The matched word will also be wrapped by HTML tags for highlighting.
You can adjust the number of words with the following notation: ATTRIBUTE:NUMBER, where NUMBER is the number of words to be extracted.
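For example, an attributesToSnippet list using this notation might look like this (attribute names are illustrative):

```ts
// attributesToSnippet using the ATTRIBUTE:NUMBER notation described above.
const attributesToSnippet = [
  "content:20",   // snippet of at most 20 words
  "description",  // default snippet length (10 words)
];
```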
HTML tag to insert before the highlighted parts in all highlighted results and snippets.
HTML tag to insert after the highlighted parts in all highlighted results and snippets.
String used as an ellipsis indicator when a snippet is truncated.
Whether to restrict highlighting and snippeting to items that at least partially matched the search query. By default, all items are highlighted and snippeted.
Number of hits per page.
1 < x < 1000
Minimum number of characters a word in the search query must contain to accept matches with one typo.
Minimum number of characters a word in the search query must contain to accept matches with two typos.
Whether typo tolerance is enabled and how it is applied.
If typo tolerance is true, min, or strict, word splitting and concatenation are also active.
Whether to allow typos on numbers in the search query.
Turn off this setting to reduce the number of irrelevant matches when searching in large sets of similar numbers.
Attributes for which you want to turn off typo tolerance. Attribute names are case-sensitive.
Returning only exact matches can help when:
- Searching in hyphenated attributes.
- Reducing the number of matches when you have too many. This can happen with attributes that are long blocks of text, such as product descriptions.
Consider alternatives such as disableTypoToleranceOnWords or adding synonyms if your attributes have intentional unusual spellings that might look like typos.
Treat singular, plural, and other declension forms as equivalent. You should only use this feature for the languages used in your index.
Available options: af, ar, az, bg, bn, ca, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, he, hi, hu, hy, id, is, it, ja, ka, kk, ko, ku, ky, lt, lv, mi, mn, mr, ms, mt, nb, nl, no, ns, pl, ps, pt, pt-br, qu, ro, ru, sk, sq, sv, sw, ta, te, th, tl, tn, tr, tt, uk, ur, uz, zh
Removes stop words from the search query.
Stop words are common words like articles, conjunctions, prepositions, or pronouns that have little or no meaning on their own. In English, "the", "a", or "and" are stop words.
You should only use this feature for the languages used in your index.
Available options: af, ar, az, bg, bn, ca, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, he, hi, hu, hy, id, is, it, ja, ka, kk, ko, ku, ky, lt, lv, mi, mn, mr, ms, mt, nb, nl, no, ns, pl, ps, pt, pt-br, qu, ro, ru, sk, sq, sv, sw, ta, te, th, tl, tn, tr, tt, uk, ur, uz, zh
Characters for which diacritics should be preserved.
By default, Algolia removes diacritics from letters.
For example, é becomes e. If this causes issues in your search, you can specify characters that should keep their diacritics.
Languages for language-specific query processing steps such as plurals, stop-word removal, and word-detection dictionaries.
This setting sets a default list of languages used by the removeStopWords and ignorePlurals settings.
This setting also sets a dictionary for word detection in the logogram-based CJK languages.
To support this, you must place the CJK language first.
You should always specify a query language.
If you don't specify a query language, the search engine uses all supported languages, or the languages you specified with the ignorePlurals or removeStopWords parameters.
This can lead to unexpected search results.
For more information, see Language-specific configuration.
Available options: af, ar, az, bg, bn, ca, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, he, hi, hu, hy, id, is, it, ja, ka, kk, ko, ku, ky, lt, lv, mi, mn, mr, ms, mt, nb, nl, no, ns, pl, ps, pt, pt-br, qu, ro, ru, sk, sq, sv, sw, ta, te, th, tl, tn, tr, tt, uk, ur, uz, zh
Whether to split compound words in the query into their building blocks.
For more information, see Word segmentation.
Word segmentation is supported for these languages: German, Dutch, Finnish, Swedish, and Norwegian.
Decompounding doesn't work for words with non-spacing mark Unicode characters.
For example, Gartenstühle won't be decompounded if the ü consists of u (U+0075) and ◌̈ (U+0308).
Whether to enable rules.
Whether to enable Personalization.
Determines if and how query words are interpreted as prefixes.
By default, only the last query word is treated as a prefix (prefixLast).
To turn off prefix search, use prefixNone.
Avoid prefixAll, which treats all query words as prefixes. This might lead to counterintuitive results and makes your search slower.
For more information, see Prefix searching.
Available options: prefixLast, prefixAll, prefixNone
Strategy for removing words from the query when it doesn't return any results. This helps to avoid returning empty search results.
- none. No words are removed when a query doesn't return results.
- lastWords. Treat the last (then second to last, then third to last) word as optional, until there are results or at most 5 words have been removed.
- firstWords. Treat the first (then second, then third) word as optional, until there are results or at most 5 words have been removed.
- allOptional. Treat all words as optional.
For more information, see Remove words to improve results.
Available options: none, lastWords, firstWords, allOptional
Search mode the index will use to query for results.
This setting only applies to indices for which Algolia has enabled NeuralSearch for you.
Available options: neuralSearch, keywordSearch
Settings for the semantic search part of NeuralSearch.
Only used when mode is neuralSearch.
Indices from which to collect click and conversion events.
If null, the current index and all its replicas are used.
Whether to support phrase matching and excluding words from search queries.
Use the advancedSyntaxFeatures parameter to control which feature is supported.
A string, null, or an array of optional words.
Searchable attributes for which you want to turn off the Exact ranking criterion. Attribute names are case-sensitive.
This can be useful for attributes with long values, where the likelihood of an exact match is high, such as product descriptions. Turning off the Exact ranking criterion for these attributes favors exact matching on other attributes. This reduces the impact of individual attributes with a lot of content on ranking.
Determines how the Exact ranking criterion is computed when the search query has only one word.
- attribute. The Exact ranking criterion is 1 if the query word and attribute value are the same. For example, a search for "road" will match the value "road", but not "road trip".
- none. The Exact ranking criterion is ignored on single-word searches.
- word. The Exact ranking criterion is 1 if the query word is found in the attribute value. The query word must have at least 3 characters and must not be a stop word. Only exact matches will be highlighted, partial and prefix matches won't.
Available options: attribute, none, word
Determines which plurals and synonyms should be considered exact matches.
By default, Algolia treats singular and plural forms of a word, and single-word synonyms, as exact matches when searching. For example:
- "swimsuit" and "swimsuits" are treated the same.
- "swimsuit" and "swimwear" are treated the same (if they are synonyms).

- ignorePlurals. Plurals and similar declensions added by the ignorePlurals setting are considered exact matches.
- singleWordSynonym. Single-word synonyms, such as "NY" = "NYC", are considered exact matches.
- multiWordsSynonym. Multi-word synonyms, such as "NY" = "New York", are considered exact matches.
Available options: ignorePlurals, singleWordSynonym, multiWordsSynonym
Advanced search syntax features you want to support.
- exactPhrase. Phrases in quotes must match exactly. For example, sparkly blue "iPhone case" only returns records with the exact string "iPhone case".
- excludeWords. Query words prefixed with a - must not occur in a record. For example, search -engine matches records that contain "search" but not "engine".
This setting only has an effect if advancedSyntax is true.
Available options: exactPhrase, excludeWords
Determines how many records of a group are included in the search results.
Records with the same value for the attributeForDistinct attribute are considered a group.
The distinct setting controls how many members of the group are returned. This is useful for deduplication and grouping.
The distinct setting is ignored if attributeForDistinct is not set.
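A small illustrative combination of the two settings (the attribute name is an example):

```ts
// Records sharing the same `series` value form a group; distinct: 1 keeps only the
// best-ranked record of each group in the results.
const settings = {
  attributeForDistinct: "series",
  distinct: 1,
};
```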
Whether to replace a highlighted word with the matched synonym.
By default, the original words are highlighted even if a synonym matches.
For example, with home as a synonym for house and a search for home, records matching either "home" or "house" are included in the search results, and either "home" or "house" are highlighted.
With replaceSynonymsInHighlight set to true, a search for home still matches the same records, but all occurrences of "house" are replaced by "home" in the highlighted response.
Minimum proximity score for two matching words.
This adjusts the Proximity ranking criterion by equally scoring matches that are farther apart.
For example, if minProximity is 2, neighboring matches and matches with one word between them would have the same score.
1 < x < 7
Properties to include in the API response of search and browse requests.
By default, all response properties are included. To reduce the response size, you can select which properties should be included.
You can't exclude these properties: message, warning, cursor, serverUsed, indexUsed, abTestVariantID, parsedQuery, or any property triggered by the getRankingInfo parameter.
Don't exclude properties that you might need in your search UI.
Maximum number of facet values to return for each facet.
x < 1000
Order in which to retrieve facet values.
- count. Facet values are retrieved by decreasing count. The count is the number of matching records containing this facet value.
- alpha. Retrieve facet values alphabetically.
This setting doesn't influence how facet values are displayed in your UI (see renderingContent).
For more information, see facet value display.
Whether the best matching attribute should be determined by minimum proximity.
This setting only affects ranking if the Attribute ranking criterion comes before Proximity in the ranking setting.
If true, the best matching attribute is selected based on the minimum proximity of multiple matches.
Otherwise, the best matching attribute is determined by the order in the searchableAttributes setting.
Extra data that can be used in the search UI.
You can use this to control aspects of your search UI, such as the order of facet names and values without changing your frontend code.
Order of facet names and facet values in your UI.
The redirect rule container.
Widgets returned from any rules that are applied to the current search.
Whether this search will use Dynamic Re-Ranking.
This setting only has an effect if you activated Dynamic Re-Ranking for this index in the Algolia dashboard.
Filter applied during the re-ranking process.
If null, no filter is applied.
Function for extracting URLs for links found on crawled pages.
function
JavaScript function (as a string) for extracting URLs for links found on crawled pages.
By default, all URLs that comply with the pathsToMatch, fileTypesToMatch, and exclusions settings are added to the crawl.
The Crawler dashboard has an editor with autocomplete and validation, which makes editing the linkExtractor property easier.
Authorization method and credentials for crawling protected content.
URL with your login form.
Options for the HTTP request for logging in.
HTTP method for sending the request.
Headers to add to all requests.
Preferred natural language and locale.
Basic authentication header.
Cookie. The header will be replaced by the cookie retrieved when logging in.
Form content.
Timeout for the request.
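A hypothetical login configuration assembled from the fields described above; the exact property names (fetchRequest, requestOptions, and so on) are assumptions for this sketch and may differ from your crawler configuration:

```ts
// Hypothetical login configuration for crawling protected content.
const login = {
  fetchRequest: {
    url: "https://www.example.com/login",          // URL with your login form
    requestOptions: {
      method: "POST",                              // HTTP method for the login request
      headers: { "Content-Type": "application/x-www-form-urlencoded" },
      body: "username=crawler%40example.com&password=SECRET", // form content
      timeout: 10000,                              // request timeout (ms)
    },
  },
};
```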
Maximum path depth of crawled URLs.
For example, if maxDepth is 2, https://example.com/foo/bar is crawled, but https://example.com/foo/bar/baz isn't.
Trailing slashes increase the URL depth.
1 < x < 100
Maximum number of crawled URLs.
Setting maxUrls doesn't guarantee consistency between crawls because the crawler processes URLs in parallel.
1 < x < 15000000
Crawl JavaScript-rendered pages by rendering them with a headless browser.
Rendering JavaScript-based pages is slower than crawling regular HTML pages.
Options to add to all HTTP requests made by the crawler.
Proxy for all crawler requests.
Timeout in milliseconds for the crawl.
Maximum number of retries to crawl one URL.
Headers to add to all requests.
Checks to ensure the crawl was successful.
These checks are triggered after the crawl finishes but before the records are added to the Algolia index.
Maximum allowed difference, in percent, between the number of records produced by two consecutive crawls.
If the current crawl results in fewer than 1 - maxLostPercentage records compared to the previous crawl, the current crawling task is stopped with a SafeReindexingError.
The crawler will be blocked until you cancel the blocking task.
1 < x < 100
Stops the crawler if a specified number of pages fail to crawl. If undefined, the crawler won't stop if it encounters such errors.
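An illustrative safetyChecks object combining both checks; the property names (beforeIndexPublishing, maxLostRecordsPercentage, maxFailedUrls) are assumptions based on the descriptions above:

```ts
// Illustrative safetyChecks: stop and block reindexing if the new crawl loses more
// than 10% of records compared with the previous crawl, or if too many URLs fail.
const safetyChecks = {
  beforeIndexPublishing: {
    maxLostRecordsPercentage: 10,
    maxFailedUrls: 50,
  },
};
```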
Whether to back up your index before the crawler overwrites it with new records.
Schedule for running the crawl, expressed in Later.js syntax. If omitted, you must start crawls manually.
- The interval between two scheduled crawls must be at least 24 hours.
- Times are in UTC.
- Minutes must be explicit: at 3:00 pm, not at 3 pm.
- To crawl every day, use every 1 day.
- For midnight, use at 12:00 pm.
- If you omit the time, a crawl might start any time after midnight UTC.
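For example, combining the fragments above into Later.js text expressions:

```ts
// Example schedules in Later.js text syntax.
const daily = "every 1 day at 3:00 pm";      // every day at 15:00 UTC
const nightly = "every 1 day at 12:00 pm";   // every day at midnight, per the convention above
```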
Sitemaps with URLs from where to start crawling.
URLs from where to start crawling.
Response
Date and time when the test crawl started, in RFC 3339 format.
Date and time when the test crawl finished, in RFC 3339 format.
Logs from the record extraction.
Extracted records from the URL.
Name of the index where this record will be stored.
Extracted records.
Links found on the page, which match the configuration and would be processed.
External data associated with the tested URL. External data is refreshed automatically at the beginning of the crawl.
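To make the shape of the response easier to picture, here is a hypothetical TypeScript interface assembled from the field descriptions above; the property names are assumptions and may differ from the actual response:

```ts
// Hypothetical shape of a successful response from this endpoint.
interface TestUrlResponse {
  startDate: string;                        // RFC 3339, when the test crawl started
  endDate: string;                          // RFC 3339, when the test crawl finished
  logs: string[][];                         // logs from the record extraction
  records: Array<{
    indexName: string;                      // index where these records would be stored
    records: Record<string, unknown>[];     // extracted records
  }>;
  links: string[];                          // matching links found on the page
  externalData?: Record<string, unknown>;   // external data associated with the URL
}
```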