Loading an HTML file with Cheerio

Long gone are the days when you manually collected and processed the data helping you kickstart your projects. Whether it was an e-commerce website or a lead generation algorithm, one thing is sure: the data-gathering process was tedious and time-consuming.

In this article, you will learn how Cheerio can come to your help with its extensive features of parsing markup languages, first with some trivial examples and then with a real-life use case

“But what is Cheerio?” you may ask yourself. Well, to clarify a common misconception, I will start with what Cheerio is not: a browser.

The confusion may start from the fact that Cheerio parses documents written in a markup language and then offers an API to help you manipulate the resulting data structure. But unlike a browser, Cheerio will not visually render the document, load CSS files or execute JavaScript.

So basically what Cheerio does is receive an HTML or XML input, parse the string and return the API. This makes it incredibly fast and easy to use, hence its popularity among Node.js developers.

Now, let’s see some hands-on examples of what Cheerio can do. First things first, you need to make sure your environment is all set up

It goes without saying that you must have Node.js installed on your machine. If you don’t, simply follow the instructions from their official website, according to your operating system.

Make sure to download the Long Term Support (LTS) version and don’t forget about the Node.js Package Manager (NPM). You can run these commands to make sure the installation went okay:

node -v
npm -v

Each command should print the version number that was installed.

Now, regarding the IDE debate: for this tutorial, I will be using Visual Studio Code, as it is pretty flexible and easy to use, but you are welcome to use any IDE you prefer.

Just create a folder that will hold your little project and open a terminal. Run the following command to set up a Node.js project:

npm init -y

This will create a default version of the package.json file, which can be modified at any time.

Next step: I will install TypeScript along with the type definitions for Node.js.

npm install typescript @types/node --save-dev

I chose TypeScript in this tutorial for its optional static typing to JavaScript objects, which makes the code more bulletproof when it comes to type errors.  

This is the same advantage that steadily increased its popularity among the JavaScript community, according to a recent CircleCI survey regarding the most popular programming languages

To verify the correct installation of the previous command, you can run

npx tsc --version

Now I will create the tsconfig.json configuration file at the root of the project directory, which should define the compiler options. If you want a better understanding of this file and its properties, the official TypeScript documentation has your back.

If not, then simply copy and paste the following

{
  "compilerOptions": {
    "module": "commonjs",
    "esModuleInterop": true,
    "target": "es6",
    "moduleResolution": "node",
    "sourceMap": true,
    "outDir": "dist"
  },
  "lib": ["es2015"]
}

Almost done. Now you have to install Cheerio (obviously):

npm install cheerio

Last but not least, create the src directory which will hold the code files. And speaking of the code file, create and place the index.ts file in the src directory.

Perfect. Now you can get started.

For the moment I will illustrate some basic Cheerio features using a static HTML document. Simply copy and paste the content below into a new static.html file within your project:

Page Name - Static HTML Example
Page Heading
Last Updated: Friday, September 23, 2022

Next, you have to serve the HTML file as input to Cheerio, which will then return the resulting API

import fs from 'fs'
import * as cheerio from 'cheerio'

const staticHTML = fs.readFileSync('static.html')
const $ = cheerio.load(staticHTML)

If you receive an error at this step, make sure that the input file contains a valid HTML document, as from Cheerio version 1.0.0 this criterion is verified as well.

Now you can start experimenting with what Cheerio has to offer. The NPM package is well-known for its jQuery-like syntax and the use of CSS selectors to extract the nodes you are looking for. You can check out their official documentation to get a better idea

Let’s say you want to extract the page title

const title = $('title').text()
console.log('Page title:', title)

We should test this, right? You are using TypeScript, so you have to compile the code, which will create the dist directory, and then execute the resulting index.js file. For simplicity, I will define the following script in the package.json file:

"scripts": {
  "test": "npx tsc && node dist/index.js"
}

This way, all I have to do is to run

npm run test

and the script will handle both steps that I just described

Okay, but what if the selector matches more than one HTML element? Cheerio then returns all the matches, and you can iterate over them. Let’s try extracting the name and the stock value of the items presented in the unordered list.
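As a rough sketch of how this iteration could look: the snippet below assumes that each list item holds its name in a link and its stock in a paragraph with a stock class. That markup and the item values are hypothetical stand-ins, not the content of static.html, so adjust the selectors to your own document.

```typescript
import * as cheerio from 'cheerio'

// Hypothetical stand-in markup: the "stock" class and the item values
// are assumptions made for this sketch, not taken from static.html.
const sampleHTML = `
<ul>
  <li><a href="#">Item 1</a><p class="stock">12</p></li>
  <li><a href="#">Item 2</a><p class="stock">0</p></li>
</ul>`

interface Item {
  name: string
  stock: number
}

const $ = cheerio.load(sampleHTML)

// A selector that matches several elements returns all of them.
// .map() visits each match and .get() converts the result to a plain array.
const items: Item[] = $('ul li')
  .map((_index, element) => ({
    name: $(element).find('a').text().trim(),
    stock: Number($(element).find('p.stock').text()),
  }))
  .get()

items.forEach((item) => console.log(`${item.name}: ${item.stock} in stock`))
```

Wrapping each matched element in $(element) gives you the same jQuery-like API for the children of that element, which keeps the name and the stock of one item together.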

Now run the shortcut script again, and your terminal should print the name and the stock value of each item in the list.

So that was basically the tip of the iceberg. Cheerio is also capable of parsing XML documents, extracting the style of the HTML elements and even altering the nodes’ attributes
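To make these three features a little more concrete, here is a small sketch. Both input strings are invented for the illustration, and the xml option is used on the assumption that you are on a recent Cheerio release (1.0.0 or one of its release candidates).

```typescript
import * as cheerio from 'cheerio'

// Hypothetical XML input, invented for this sketch.
const xml = '<catalog><drum id="d1"><name>Snare</name></drum></catalog>'

// The xml option tells Cheerio to parse the input as XML instead of HTML.
const $xml = cheerio.load(xml, { xml: true })
const drumName = $xml('drum > name').text()

// Hypothetical HTML input with an inline style and an attribute to change.
const $html = cheerio.load('<a href="/old" style="color: red">link</a>')

// .css() reads from the element's inline style attribute only;
// Cheerio does not load stylesheets.
const color = $html('a').css('color')

// .attr() with two arguments rewrites the attribute in the parsed document.
$html('a').attr('href', '/new')
const href = $html('a').attr('href')

console.log(drumName, color, href)
```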

But how can Cheerio help in a real use case?

Let’s say that we want to gather some data to train a Machine Learning model for a future bigger project. Usually, you would search Google for some training files and download them, or use the website’s API

But what do you do when you can’t find any relevant files, or the website you are looking at does not provide an API, rate-limits access to its data or does not expose all the data that you see on a page?

Well, this is where web scraping comes in handy. If you are curious about more practical use cases of web scraping, you can check out this well-written article from our blog

Back to the matter at hand. For the sake of the example, let’s assume that we are in exactly this situation: we want data, and data is nowhere to be found. Remember that Cheerio does not handle fetching the HTML, loading CSS or executing JS.

Thus, in our tutorial, I am using Puppeteer to navigate to the website, grab the HTML and save it to a file. Then I will repeat the process from the previous section

To be more specific, I want to gather some public opinions from Reddit on a popular drum module and centralise the data in a single file that will be further fed to a potential ML training model. What happens next can also vary: sentiment analysis, market research and the list can go on.

Let’s see how this use case will look put into code. First, you need to install the Puppeteer NPM package

npm install puppeteer

I will also create a new file, reddit.ts, to keep the project more organised, and define a new script in the package.json file that follows the same pattern as the test script: compile with npx tsc, then run dist/reddit.js with node.

To get the HTML document, I will define a function that opens the target URL in Puppeteer’s headless browser, grabs the page’s HTML and saves it to a file.
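A minimal sketch of such a function is shown below. The function name getHTML, its two parameters and the networkidle2 wait condition are assumptions made for this illustration, so treat it as a starting point rather than a definitive implementation.

```typescript
import * as fs from 'fs'
import puppeteer from 'puppeteer'

// Opens the URL in a headless browser and saves the rendered HTML to disk.
// The name and both parameters are hypothetical choices for this sketch.
async function getHTML(url: string, outputPath: string): Promise<void> {
  const browser = await puppeteer.launch()
  try {
    const page = await browser.newPage()
    // Wait until network activity has mostly settled before reading the page.
    await page.goto(url, { waitUntil: 'networkidle2' })
    const html = await page.content()
    fs.writeFileSync(outputPath, html)
  } finally {
    // Close the browser even if navigation fails, so no process is left behind.
    await browser.close()
  }
}
```

Calling getHTML with the address of the Reddit thread and 'reddit.html' as the output path would then produce the file that Cheerio parses in the next step.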

To quickly test this, add an entry point to your code that calls the function with the target URL, then run the script.

A reddit.html file should appear in your project tree, which will contain the HTML document we want.

Now, a bit of a more challenging part: you have to identify the nodes that are of interest for our use case. Go back to your browser (the real one) and navigate to the target URL. Hover your mouse cursor over the comments section, right-click and then choose the “Inspect” option.

The Developer Tools tab will open, showing you the exact same HTML document you previously saved on your machine


To extract only the comments, you have to identify the selectors unique to this section of the page. You can notice that the whole list of comments is within a div container with a sitetable nestedlisting class

Going deeper, each individual comment has a form element as a parent, with the usertext warn-on-unload class. Then, at the bottom, you can see that every comment’s text is divided between multiple p elements

Let’s see how this works in code: select every comment’s form inside the container, then join the text of its p elements.
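The selectors described above could be combined along these lines. This is a sketch: the function name parseComments is an assumption, and the class names are the ones observed in the Developer Tools, so they may differ if Reddit changes its markup.

```typescript
import * as cheerio from 'cheerio'

// Returns one string per comment, joining the <p> elements that belong
// to the same comment. The function name is a hypothetical choice.
function parseComments(html: string): string[] {
  const $ = cheerio.load(html)
  return $('div.sitetable.nestedlisting form.usertext.warn-on-unload')
    .map((_index, form) =>
      $(form)
        .find('p')
        .map((_i, paragraph) => $(paragraph).text().trim())
        .get()
        .join(' ')
    )
    .get()
    // Drop comments that contained no text at all.
    .filter((comment) => comment.length > 0)
}
```

Passing the contents of reddit.html to parseComments, for example via fs.readFileSync('reddit.html', 'utf8'), would give you an array with one entry per comment.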

Alright, and now let’s update the entry point so that it calls the newly defined function on the saved reddit.html and writes the extracted comments to comments.json, to see how this code works together.

Execute the code with npm run, followed by the name of the script you defined for reddit.ts.

It will take around 5 to 10 seconds for the headless browser to open and navigate to our target URL. If you are more curious about it, you can add timestamps at the beginning and the end of each of our functions to really see how fast Cheerio is
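One simple way to add those timestamps is a small wrapper around each call. The helper names below are assumptions for this sketch, and Date.now() only gives millisecond resolution, which is enough to compare a browser launch with a parsing step.

```typescript
// Runs a synchronous function, logs how long it took and returns its result.
// The helper names "timed" and "timedAsync" are hypothetical.
function timed<T>(label: string, fn: () => T): T {
  const start = Date.now()
  const result = fn()
  console.log(`${label} took ${Date.now() - start} ms`)
  return result
}

// Same idea for functions that return a promise, such as a Puppeteer call.
async function timedAsync<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now()
  const result = await fn()
  console.log(`${label} took ${Date.now() - start} ms`)
  return result
}
```

You would wrap the parsing step with timed and the browser step with timedAsync, then compare the two numbers printed in the terminal.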

The comments.json file should be our final result.
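The exact shape of that file is up to you. As a minimal sketch, assuming the comments are kept as a plain array of strings, the writing step could look like this (the function name saveComments is hypothetical):

```typescript
import * as fs from 'fs'

// Writes the comments to disk as a pretty-printed JSON array.
// The array-of-strings shape is an assumption made for this sketch.
function saveComments(comments: string[], outputPath: string): void {
  fs.writeFileSync(outputPath, JSON.stringify(comments, null, 2))
}
```

A JSON array keeps one comment per entry, which makes the file easy to load again later from whatever tool consumes the data.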


This use case can be easily extended to parsing the number of upvotes and downvotes for every comment, or to get the comments’ nested replies. The possibilities are endless

Thank you for making it to the end of this tutorial. I hope you grasped how indispensable Cheerio is to the process of data extraction and how to quickly integrate it into your next scraping project

We are also using Cheerio in our product, WebScrapingAPI. If you ever find yourself tangled by the many challenges encountered in web scraping (IP blocks, bot detection, etc.), consider giving it a try.

How to use Cheerio in HTML?

Start by importing Cheerio (import * as cheerio from 'cheerio') and then call cheerio.load() with your markup. All you need to do is pass an HTML string into Cheerio's load method.

How to get data from HTML in node JS?

Just install Express and you are good to go.
Step 1. Install Express. Create a new folder and initialize a new Node project.
Step 2. Use the sendFile() function.
Step 3. Render HTML in Express.
Step 4. Render dynamic HTML using a templating engine.

Does Cheerio run JavaScript?

Cheerio is a fast, lean implementation of core jQuery. It helps in traversing the DOM using a friendly and familiar API and works both in the browser and the server. It simply parses the HTML and XML and does not execute any JavaScript in the document or load any external resources.

Does Cheerio use jQuery?

Cheerio is a server-side implementation of jQuery. The Crawler uses it to expose the page's DOM so you can extract the content you want using Cheerio's Selectors API.