Commit Graph

648 Commits

Author SHA1 Message Date
Nicolas b36406e465 Nick: log scrpaers 2024-07-03 17:28:53 -03:00
Eric Ciarla 2d0d5ac392 Update for llm-extraction-from-raw-html 2024-07-02 14:05:42 -04:00
rafaelsideguide 0175152577 Fixed PDF match custom scraping
Now it's working for both `https://getgc.ai/privacy` and `https://prairie.cards/products/wood-designs` usecases.
2024-07-02 11:25:17 -03:00
rafaelsideguide 96de948d6b Update index.test.ts 2024-07-02 11:04:09 -03:00
rafaelsideguide 7b7154ba1e bugfixed pageStatusCode 2024-07-02 10:51:35 -03:00
Rafael Miller 50eecf04a9 Update licence pyproject.toml
Closes #345
2024-07-02 10:01:49 -03:00
Rafael Miller f0f449fe51 Merge pull request #336 from snippet/allow-external-content-links
[Proposal] new feature allowExternalContentLinks
2024-07-02 09:45:21 -03:00
rafaelsideguide db4a743365 Added e2e test 2024-07-02 09:44:08 -03:00
Eric Ciarla 0821017f5b Update README.md 2024-07-02 07:08:46 -04:00
Nicolas 42cd58a679 Merge pull request #332 from mendableai/feat/rawHtmlExtraction
Adds pageOptions.includeRawHtml and new extraction mode "llm-extraction-from-raw-html"
2024-07-01 18:23:26 -03:00
Nicolas c4f423981f Update pnpm-lock.yaml 2024-07-01 18:22:22 -03:00
rafaelsideguide 16aac7f8c5 Update single_url.ts 2024-07-01 18:21:15 -03:00
Nicolas 6d0c7a9ccd Merge pull request #323 from mendableai/tests/crawl-limit-unit-tests
[Tests] Added crawl limit unit test
2024-07-01 17:56:04 -03:00
rafaelsideguide 4d6e25619b minor spacing and comment stuff 2024-07-01 16:05:34 -03:00
Eric Ciarla e1af815f8c Update scrape.ts 2024-07-01 08:48:21 -04:00
Eric Ciarla 7ae195bacc Update index.test.ts 2024-06-29 10:13:12 -04:00
Eric Ciarla 837b446390 Update index.test.ts 2024-06-29 08:48:42 -04:00
Eric Ciarla fe6e3aeadc Update index.test.ts 2024-06-29 08:44:21 -04:00
Eric Ciarla 6c9f0dfc91 Add tests 2024-06-29 08:32:20 -04:00
Jeff Pereira a5fb45988c new feature allowExternalContentLinks 2024-06-28 17:23:40 -07:00
Eric Ciarla 87b54488d3 update to includeRawHtml 2024-06-28 17:07:47 -04:00
Eric Ciarla 70fcf2ce03 init 2024-06-28 16:39:09 -04:00
Nicolas 9bf74bc774 Update single_url.ts 2024-06-28 15:51:18 -03:00
Nicolas 7e17498bcf Update single_url.ts 2024-06-28 15:45:16 -03:00
Nicolas 9e7298945c Update openapi.json 2024-06-26 21:25:38 -03:00
Nicolas 1ec0bf8adf Update openapi.json 2024-06-26 21:22:46 -03:00
Nicolas 042f81ddf2 Update removeUnwantedElements.test.ts 2024-06-26 21:20:11 -03:00
Nicolas 388ce3cbce Nick: small changes 2024-06-26 21:15:42 -03:00
Nicolas 1d4907acc9 Nick: 2024-06-26 21:02:58 -03:00
Nicolas 3b92fb8433 Merge pull request #322 from mendableai/tests/metadata
[Test] Added E2E tests for checking metadata values
2024-06-26 12:09:18 -03:00
rafaelsideguide 67d7650cf3 Added to e2e_noAuth 2024-06-26 12:07:55 -03:00
rafaelsideguide 009df6c930 Added crawl limit unit test
I think this test is over relying on mocks but I have no idea on how to fix this without changing the code arch structure
2024-06-26 09:54:25 -03:00
rafaelsideguide 05eaa3c68d Update index.test.ts 2024-06-26 09:32:02 -03:00
rafaelsideguide 4381109dd8 added default values and fixed pdf bug 2024-06-26 09:00:54 -03:00
Nicolas 45f2765601 Merge pull request #316 from snippet/types-webscraper
add some types
2024-06-25 22:03:21 -03:00
Nicolas 768a131b5c Merge pull request #318 from mendableai/bug/fix-custom-scrape-pdf-google-drive
[Bug] Fixed the regex test for google drive pdf files
2024-06-25 18:27:11 -03:00
rafaelsideguide 5f69fc7677 Fixed the regex test 2024-06-25 18:24:01 -03:00
rafaelsideguide d02829d335 fixed clean jobs 2024-06-25 17:49:29 -03:00
Jeff Pereira 199cbe8bcb add some types 2024-06-25 12:20:25 -07:00
Nicolas 749b0c05dc Merge branch 'main' of https://github.com/mendableai/firecrawl 2024-06-25 15:21:15 -03:00
Nicolas e7be17db92 Nick: metadata fixes and lock duration for bull decreased to 2 hrs 2024-06-25 15:21:14 -03:00
Nicolas f84fb4b331 Merge pull request #313 from snippet/google-search-term-fix
fix multi-word search term issue: /search (w/o Serp)
2024-06-24 19:24:58 -03:00
Jeff Pereira 6ddf3a58a1 fix multi-word search term issue: /search (w/o Serp) 2024-06-24 14:21:52 -07:00
Nicolas 90b7fff366 Update crawler.ts 2024-06-24 16:52:01 -03:00
Nicolas 08c1fa799b Update queue-worker.ts 2024-06-24 16:51:32 -03:00
rafaelsideguide 3ebdf93342 removed console.logs 2024-06-24 16:43:12 -03:00
Nicolas 56d42d9c9b Nick: 2024-06-24 16:33:07 -03:00
rafaelsideguide 21d29de819 testing crawl with new.abb.com case
many unnecessary console.logs for tracing the code execution
2024-06-24 16:25:07 -03:00
Nicolas 3c7b7e7242 NIck: fixes fallback 2024-06-23 18:59:08 -03:00
Caleb Peffer e59ba758f5 Caleb: changed posthog logging so that It associates jobs with a group. No 2024-06-18 17:42:21 -07:00