How do you crawl data from a website in Python?
Web crawling is a powerful technique for collecting data from the web by finding all the URLs for one or multiple domains. Python has several popular web crawling libraries and frameworks.
In this article, we will first introduce different crawling strategies and use cases. Then we will build a simple web crawler from scratch in Python using two libraries: requests and Beautiful Soup. Next, we will see why it's better to use a web crawling framework like Scrapy. Finally, we will build an example crawler with Scrapy to collect film metadata from IMDb and see how Scrapy scales to websites with several million pages.

What is a web crawler?

Web crawling and web scraping are two different but related concepts. Web crawling is a component of web scraping: the crawler logic finds URLs to be processed by the scraper code. A web crawler starts with a list of URLs to visit, called the seed. For each URL, the crawler finds links in the HTML, filters those links based on some criteria and adds the new links to a queue. All the HTML, or some specific information, is then extracted to be processed by a different pipeline.

Web crawling strategies

In practice, web crawlers only visit a subset of pages depending on the crawl budget, which can be a maximum number of pages per domain, a maximum depth or a maximum execution time. Most popular websites provide a robots.txt file to indicate which areas of the website each user agent is disallowed from crawling. The counterpart of the robots file is the sitemap.xml file, which lists the pages that can be crawled. Popular web crawler use cases include:

- Search engines, which crawl a large part of the web and index the collected HTML to make it searchable.
- SEO analytics tools, which collect metadata such as response times and response statuses to detect broken pages, and follow links across domains to collect backlinks.
- Price monitoring tools, which crawl e-commerce websites to find product pages and extract metadata, notably the price.
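As a quick illustration of the robots.txt convention, Python's standard library can parse the file and answer whether a given user agent may fetch a given URL. A minimal sketch (the user agent string and the checked URL are placeholders chosen here for illustration):

```python
from urllib import robotparser

# Download and parse the robots.txt file of the target site.
rp = robotparser.RobotFileParser()
rp.set_url('https://www.imdb.com/robots.txt')
rp.read()

# Ask whether a hypothetical crawler may fetch a specific page.
print(rp.can_fetch('MyCrawler', 'https://www.imdb.com/title/'))
```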
Next, we will compare three different strategies for building a web crawler in Python: first, using only standard libraries; then, third-party libraries for making HTTP requests and parsing HTML; and finally, a web crawling framework.

Building a simple web crawler in Python from scratch

To build a simple web crawler in Python we need at least one library to download the HTML from a URL and an HTML parsing library to extract links. Python provides the standard libraries urllib for making HTTP requests and html.parser for parsing HTML. An example Python crawler built only with standard libraries can be found on GitHub. The standard Python libraries for requests and HTML parsing are not very developer-friendly. Other popular libraries, like requests, branded as HTTP for humans, and Beautiful Soup, provide a better developer experience. If you want to learn more, you can check this guide about the best Python HTTP client. You can install the two libraries locally, for example with pip install requests beautifulsoup4. A basic crawler can be built following the previous architecture diagram, as sketched below.
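A minimal sketch of such a crawler, assuming requests and Beautiful Soup are installed (the seed URL is IMDb, the site crawled later in this article):

```python
import logging
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

logging.basicConfig(
    format='%(asctime)s %(levelname)s:%(message)s', level=logging.INFO)


class Crawler:
    def __init__(self, urls=None):
        self.visited_urls = []
        self.urls_to_visit = list(urls or [])

    def download_url(self, url):
        # Fetch the raw HTML of a page.
        return requests.get(url).text

    def get_linked_urls(self, url, html):
        # Parse the HTML and yield every link, resolved to an absolute URL.
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a'):
            path = link.get('href')
            if path:
                yield urljoin(url, path)

    def add_url_to_visit(self, url):
        # Skip URLs that were already visited or are already queued.
        if url not in self.visited_urls and url not in self.urls_to_visit:
            self.urls_to_visit.append(url)

    def crawl(self, url):
        html = self.download_url(url)
        for linked_url in self.get_linked_urls(url, html):
            self.add_url_to_visit(linked_url)

    def run(self):
        while self.urls_to_visit:
            url = self.urls_to_visit.pop(0)
            logging.info('Crawling: %s', url)
            try:
                self.crawl(url)
            except Exception:
                logging.exception('Failed to crawl: %s', url)
            finally:
                self.visited_urls.append(url)


if __name__ == '__main__':
    Crawler(urls=['https://www.imdb.com/']).run()
```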
The code above defines a Crawler class with helper methods: download_url using the requests library, get_linked_urls using the Beautiful Soup library, and add_url_to_visit to filter URLs. The URLs to visit and the visited URLs are stored in two separate lists. You can run the crawler in your terminal; it logs one line for each visited URL.
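Assuming the class is saved in a file named crawler.py (a filename chosen here for illustration):

```
python crawler.py
```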
The code is very simple, but there are many performance and usability issues to solve before successfully crawling a complete website: the crawler is slow because it processes one URL at a time and does no work while waiting for a response, it has no retry mechanism for failed downloads, its URL queue is not efficient at a high number of URLs, it does not normalize or filter URLs, and it neither identifies itself nor respects the robots.txt file.
Next, we will see how Scrapy provides all these functionalities and makes it easy to extend for your custom crawls.

Web crawling with Scrapy

Scrapy is the most popular web scraping and crawling Python framework, with around 40k stars on GitHub. One of the advantages of Scrapy is that requests are scheduled and handled asynchronously. This means that Scrapy can send another request before the previous one is completed, or do some other work in between. Scrapy can handle many concurrent requests but can also be configured to respect websites with custom settings, as we'll see later.

Scrapy has a multi-component architecture. Normally, you will implement at least two different classes: Spider and Pipeline. Web scraping can be thought of as an ETL process where you extract data from the web and load it into your own storage. Spiders extract the data and pipelines load it into the storage. Transformation can happen in both spiders and pipelines, but I recommend setting a custom Scrapy pipeline to transform each item independently of the others. This way, failing to process an item has no effect on other items. On top of all that, you can add spider and downloader middlewares between components, as can be seen in the diagram below.

[Figure: Scrapy Architecture Overview]

If you have used Scrapy before, you know that a web scraper is defined as a class that inherits from the base Spider class and implements a parse method to handle each response. If you are new to Scrapy, you can read this article for easy scraping with Scrapy.
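A rough sketch of that spider/pipeline split (the class names, URL and selector here are illustrative, not the article's code):

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # Extract: the spider downloads pages and yields raw items.
    name = 'example'
    start_urls = ['https://example.com/']

    def parse(self, response):
        yield {'url': response.url,
               'title': response.css('title::text').get()}


class ExamplePipeline:
    # Transform/Load: each item is processed independently, so a
    # failure on one item has no effect on the others.
    def process_item(self, item, spider):
        item['title'] = (item.get('title') or '').strip()
        return item
```

The pipeline is enabled by adding it to the ITEM_PIPELINES dictionary in the project settings.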
Scrapy also provides several generic spider classes: CrawlSpider, XMLFeedSpider, CSVFeedSpider and SitemapSpider. The CrawlSpider class inherits from the base Spider class and provides an extra rules attribute to define how to crawl a website. Each rule uses a LinkExtractor to specify which links are extracted from each page. Next, we will see how to use each one of them by building a crawler for IMDb, the Internet Movie Database.

Building an example Scrapy crawler for IMDb

Before trying to crawl IMDb, I checked IMDb's robots.txt file to see which URL paths are allowed. The robots file only disallows 26 paths for all user agents. Scrapy reads the robots.txt file beforehand and respects it when the ROBOTSTXT_OBEY setting is set to True. This is the case for all projects generated with the Scrapy command startproject.
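The project used in the rest of the article is named scrapy_crawler, as the spider path below suggests, so the generating command is presumably:

```
scrapy startproject scrapy_crawler
```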
This command creates a new project with the default Scrapy project folder structure.
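For reference, the generated layout looks roughly like this:

```
scrapy_crawler/
├── scrapy.cfg
└── scrapy_crawler/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```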
Then you can create a spider in scrapy_crawler/spiders/imdb.py with a rule to extract all links.
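A minimal sketch of such a spider, using CrawlSpider with a single catch-all rule (the class name is an assumption):

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ImdbCrawler(CrawlSpider):
    name = 'imdb'
    allowed_domains = ['www.imdb.com']
    start_urls = ['https://www.imdb.com/']
    # A Rule with a default LinkExtractor follows every link found.
    rules = (Rule(LinkExtractor()),)
```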
You can launch the crawler in the terminal.
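From the project root, assuming the spider is named imdb as above:

```
scrapy crawl imdb --logfile imdb.log
```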
You will get lots of logs, including one log line for each request. Exploring the logs, I noticed that even though we set allowed_domains to only crawl web pages under https://www.imdb.com, there were requests to external domains, such as amazon.com.
IMDb redirects from URL paths under whitelist-offsite and whitelist to external domains. There is an open Scrapy GitHub issue showing that external URLs don't get filtered out when the OffsiteMiddleware is applied before the RedirectMiddleware. To fix this issue, we can configure the link extractor to deny URLs starting with two regular expressions.
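A sketch of that configuration, denying the two offending path prefixes named above:

```python
import re

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ImdbCrawler(CrawlSpider):
    name = 'imdb'
    allowed_domains = ['www.imdb.com']
    start_urls = ['https://www.imdb.com/']
    rules = (
        Rule(LinkExtractor(
            # Deny the redirecting paths before the requests are made.
            deny=[
                re.escape('https://www.imdb.com/whitelist-offsite'),
                re.escape('https://www.imdb.com/whitelist'),
            ],
        )),
    )
```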
Rule and LinkExtractor classes support several arguments to filter out URLs. For example, you can ignore specific URL extensions and reduce the number of duplicate URLs by sorting query strings. If you don't find a specific argument for your use case, you can pass a custom function to process_links in Rule or to process_value in LinkExtractor. For example, IMDb has two different URLs with the same content:

https://www.imdb.com/name/nm1156914/
https://www.imdb.com/name/nm1156914/?mode=desktop&ref_=m_ft_dsk

To limit the number of crawled URLs, we can remove all query strings from URLs with the url_query_cleaner function from the w3lib library and use it in process_links.
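A sketch of that filter (w3lib is a Scrapy dependency; called with only the URL, url_query_cleaner strips every query parameter):

```python
from w3lib.url import url_query_cleaner
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


def process_links(links):
    # Drop the query string of every extracted link, so URL variants
    # of the same page collapse into a single request.
    for link in links:
        link.url = url_query_cleaner(link.url)
        yield link


class ImdbCrawler(CrawlSpider):
    name = 'imdb'
    allowed_domains = ['www.imdb.com']
    start_urls = ['https://www.imdb.com/']
    rules = (
        Rule(LinkExtractor(), process_links=process_links),
    )
```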
Now that we have limited the number of requests to process, we can add a parse_item method to extract data from each page and pass it to a pipeline to store it. For example, we can either extract the whole response.text to process it in a different pipeline or select the HTML metadata. To select the HTML metadata in the header tag we could write our own XPaths, but I find it better to use a library, extruct, that extracts all metadata from an HTML page. You can install it with pip install extruct.
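A sketch of the spider with a parse_item callback (the rule keeps following links, as explained next):

```python
import extruct
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ImdbCrawler(CrawlSpider):
    name = 'imdb'
    allowed_domains = ['www.imdb.com']
    start_urls = ['https://www.imdb.com/']
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        return {
            'url': response.url,
            # Extract only Open Graph and JSON-LD metadata from the page.
            'metadata': extruct.extract(
                response.text,
                response.url,
                syntaxes=['opengraph', 'json-ld'],
            ),
        }
```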
I set the follow attribute to True so that Scrapy still follows all links from each response, even though we provided a custom parse_item callback. I also configured extruct to extract only Open Graph metadata and JSON-LD, a popular method for encoding linked data using JSON on the web, used by IMDb. You can run the crawler and store items in JSON Lines format in a file.
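With the .jl extension, Scrapy infers the JSON Lines feed format:

```
scrapy crawl imdb --logfile imdb.log -o imdb.jl
```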
The output file imdb.jl contains one line for each crawled item. For example, the extracted Open Graph metadata for a movie, taken from the meta tags in the HTML, looks like this.
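The exact output depends on the page; a trimmed item has roughly this shape (the values here are illustrative placeholders, not actual crawler output):

```json
{
  "url": "https://www.imdb.com/title/tt0111161/",
  "metadata": {"opengraph": [{
    "namespace": {"og": "http://ogp.me/ns#"},
    "properties": [
      ["og:url", "https://www.imdb.com/title/tt0111161/"],
      ["og:type", "video.movie"],
      ["og:title", "The Shawshank Redemption (1994) - IMDb"],
      ["og:site_name", "IMDb"]
    ]
  }]}
}
```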
The JSON-LD for a single item is too long to be included in the article; here is a sample of what Scrapy extracts from a title page.
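An illustrative fragment of the json-ld block extruct returns for a movie page (shape only; the values are placeholders):

```json
{"json-ld": [{
  "@context": "https://schema.org",
  "@type": "Movie",
  "name": "The Shawshank Redemption",
  "genre": ["Drama"],
  "datePublished": "1994-10-14"
}]}
```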