Python fastest XML parser guide

Question

It looks to me as if you do not need any DOM capabilities from your program. I would second the use of the (c)ElementTree library. If you use the iterparse function of the cElementTree module, you can work your way through the XML and deal with the events as they occur.


Note, however, Fredrik's advice on using the cElementTree iterparse function:

To parse large files, you can get rid of elements as soon as you've processed them:

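from xml.etree.ElementTree import iterparse

# `source` is a filename or file object containing the XML.
for event, elem in iterparse(source):
    if elem.tag == "record":
        # ... process record elements ...
        elem.clear()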

The above pattern has one drawback: it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events and save a reference to the first element in a variable:

# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()

lxml.iterparse() does not allow this.

The previous snippet does not work on Python 3.7; consider the following way to get the first element instead.

import xml.etree.ElementTree as ET

# Get an iterable.
context = ET.iterparse(source, events=("start", "end"))
    
for index, (event, elem) in enumerate(context):
    # Get the root element.
    if index == 0:
        root = elem
    if event == "end" and elem.tag == "record":
        # ... process record elements ...
        root.clear()
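
To try the pattern end to end, here is a minimal self-contained sketch. The sample data and the count_records name are made up for illustration; a real source would be a file path or file object, and the processing step would do more than count.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for a large file.
SAMPLE = b"<records><record id='1'/><record id='2'/><record id='3'/></records>"


def count_records(source):
    """Stream through `source`, counting <record> elements and freeing memory."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    # The first event is the "start" of the root element.
    _, root = next(context)
    count = 0
    for event, elem in context:
        if event == "end" and elem.tag == "record":
            count += 1      # ... process the record here ...
            root.clear()    # drop already-processed children from the root
    return count


record_count = count_records(io.BytesIO(SAMPLE))
```

Using next() on the iterator replaces the Python 2 only context.next() call from the older snippet.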

If you've ever tried to parse an XML document in Python before, then you know how surprisingly difficult such a task can be. On the one hand, the Zen of Python promises only one obvious way to achieve your goal. At the same time, the standard library follows the batteries included motto by letting you choose from not one but several XML parsers. Luckily, the Python community solved this surplus problem by creating even more XML parsing libraries.

Jokes aside, all XML parsers have their place in a world full of smaller or bigger challenges. It's worthwhile to familiarize yourself with the available tools.

In this tutorial, you'll learn how to:

Choose the right XML parsing model
Use the XML parsers in the standard library
Use major XML parsing libraries
Parse XML documents declaratively using data binding
Use safe XML parsers to eliminate security vulnerabilities

You can use this tutorial as a roadmap to guide you through the confusing world of XML parsers in Python. By the end of it, you'll be able to pick the right XML parser for a given problem. To get the most out of this tutorial, you should already be familiar with XML and its building blocks, as well as how to work with files in Python.

Choose the Right XML Parsing Model

It turns out that you can process XML documents using a few language-agnostic strategies. Each demonstrates different memory and speed trade-offs, which can partially justify the wide range of XML parsers available in Python. In the following section, you'll find out their differences and strengths.

Document Object Model (DOM)

The DOM defines standard operations for traversing and modifying document elements arranged in a hierarchy of objects. An abstract representation of the entire document tree is stored in memory, giving you random access to the individual elements.

While the DOM allows for fast and omnidirectional navigation, building its abstract representation in the first place can be time-consuming. Moreover, the XML gets parsed at once, as a whole, so it has to be reasonably small to fit the available memory. This renders the DOM suitable only for moderately large configuration files rather than multi-gigabyte XML databases.

Use a DOM parser when convenience is more important than processing time and when memory is not an issue. Some typical use cases are when you need to parse a relatively small document or when you only need to do the parsing infrequently.
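
As a rough illustration of the DOM model, the sketch below parses a small document with the standard library's xml.dom.minidom and then moves around the in-memory tree. The configuration document and its element names are hypothetical.

```python
from xml.dom.minidom import parseString

# Hypothetical configuration document, small enough to fit in memory.
CONFIG = "<config><db host='localhost' port='5432'/><cache ttl='60'/></config>"

document = parseString(CONFIG)

# Random access: jump straight to any element in the in-memory tree.
db = document.getElementsByTagName("db")[0]
cache = document.getElementsByTagName("cache")[0]

# Navigation works in every direction, e.g. from a child back to its parent.
parent_name = cache.parentNode.tagName

# The tree can also be modified in place.
db.setAttribute("port", "5433")
db_port = db.getAttribute("port")
```

The whole document is parsed before the first lookup happens, which is the memory cost described above.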

Simple API for XML (SAX)

To address the shortcomings of the DOM, the Java community came up with a library through a collaborative effort, which then became an alternative model for parsing XML in other languages. There was no formal specification, only organic discussions on a mailing list. The end result was an event-based streaming API that operates sequentially on individual elements rather than the whole tree.

Elements are processed from top to bottom in the same order they appear in the document. The parser triggers user-defined callbacks to handle specific XML nodes as it finds them in the document. This approach is known as "push" parsing because elements are pushed to your functions by the parser.

SAX also lets you discard elements if you're not interested in them. This means it has a much lower memory footprint than DOM and can deal with arbitrarily large files, which is great for single-pass processing such as indexing, conversion to other formats, and so on.

However, finding or modifying random tree nodes is cumbersome because it usually requires multiple passes on the document and tracking the visited nodes. SAX is also inconvenient for handling deeply nested elements. Finally, the SAX model only allows for read-only parsing.

In short, SAX is cheap in terms of space and time but more difficult to use than DOM in most cases. It works well for parsing very large documents or parsing incoming XML data in real time.
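
The push model can be sketched with the standard library's xml.sax module. The TagCounter class and the sample document are hypothetical; the point is only that the parser calls your handler, not the other way around.

```python
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Counts elements by tag name as the parser pushes events to it."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called by the parser for every opening tag it encounters.
        self.counts[name] = self.counts.get(name, 0) + 1


# Hypothetical sample document.
SAMPLE = b"<log><entry/><entry/><note/></log>"

handler = TagCounter()
xml.sax.parseString(SAMPLE, handler)
tag_counts = handler.counts
```

Nothing is kept in memory except what the handler chooses to store, here a small dictionary of counts.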

Streaming API for XML (StAX)

Although somewhat less popular in Python, this third approach to parsing XML builds on top of SAX. It extends the idea of streaming but uses a "pull" parsing model instead, which gives you more control. You can think of StAX as an iterator advancing a cursor object through an XML document, where custom handlers call the parser on demand and not the other way around.

Using StAX gives you more control over the parsing process and allows for more convenient state management. The events in the stream are only consumed when requested, enabling lazy evaluation. Other than that, its performance should be on par with SAX, depending on the parser implementation.
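
In Python, a pull-style parser is available as xml.dom.pulldom. The following is a minimal sketch with a made-up log document: your loop asks for the next event, and only the elements you care about are expanded into DOM nodes.

```python
from xml.dom import pulldom

# Hypothetical sample document.
SAMPLE = (
    "<log>"
    "<entry level='info'>started</entry>"
    "<entry level='error'>failed</entry>"
    "</log>"
)

errors = []
event_stream = pulldom.parseString(SAMPLE)
for event, node in event_stream:
    # Events are pulled one at a time by this loop.
    if event == pulldom.START_ELEMENT and node.tagName == "entry":
        if node.getAttribute("level") == "error":
            # Build a DOM subtree for this element only.
            event_stream.expandNode(node)
            errors.append(node.firstChild.data)
```

Because the loop drives the parser, keeping state such as the errors list is straightforward compared with spreading it across SAX callbacks.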

Learn About XML Parsers in Python's Standard Library

In this section, you'll take a look at Python's built-in XML parsers, which are available to you in nearly every Python distribution. You'll compare those parsers against a sample Scalable Vector Graphics (SVG) image, which is an XML-based format. By processing the same document with different parsers, you'll be able to choose the one that suits you best.

The sample image, which you can save in a local file for reference, depicts a smiley face. It consists of the following XML content:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd" [
    <!ENTITY custom_entity "Hello">
]>
<svg xmlns="http://www.w3.org/2000/svg"
  xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
  viewBox="-105 -100 210 270" width="210" height="270">
  <inkscape:custom x="42" inkscape:z="555">Some value</inkscape:custom>
  <defs>
    <linearGradient id="skin" x1="0" x2="0" y1="0" y2="1">
      <stop offset="0%" stop-color="yellow" stop-opacity="1.0"/>
      <stop offset="75%" stop-color="gold" stop-opacity="1.0"/>
      <stop offset="100%" stop-color="orange" stop-opacity="1"/>
    </linearGradient>
  </defs>
  <g id="smiley" inkscape:groupmode="layer" inkscape:label="Smiley">
    <!-- Head -->
    <circle cx="0" cy="0" r="50"
      fill="url(#skin)" stroke="orange" stroke-width="2"/>
    <!-- Eyes -->
    <ellipse cx="-20" cy="-10" rx="6" ry="8" fill="black" stroke="none"/>
    <ellipse cx="20" cy="-10" rx="6" ry="8" fill="black" stroke="none"/>
    <!-- Mouth -->
    <path d="M-20 20 A25 25 0 0 0 20 20"
      fill="white" stroke="black" stroke-width="3"/>
  </g>
  <text x="-40" y="75">&custom_entity; &lt;svg&gt;!</text>
</svg>

It starts with an XML declaration, followed by a Document Type Definition (DTD) and the <svg> root element. The DTD is optional, but it can help validate your document structure if you decide to use an XML validator. The root element specifies the default namespace xmlns as well as a prefixed namespace xmlns:inkscape for editor-specific elements and attributes. The document also contains:

Nested elements
Attributes
Comments
Character data (CDATA)
Predefined and custom entities

Go ahead, save the XML in a file named smiley.svg, and open it using a modern web browser, which will run the JavaScript snippet present at the end.

The code adds an interactive component to the image. When you hover the mouse over the smiley face, it blinks its eyes. If you want to edit the smiley face using a convenient graphical user interface (GUI), then you can open the file using a vector graphics editor such as Adobe Illustrator or Inkscape.

It's worth noting that Python's standard library defines abstract interfaces for parsing XML documents while letting you supply a concrete parser implementation. In practice, you rarely do that because Python bundles a binding for the Expat library, which is a widely used open-source XML parser written in C. All of the following Python modules in the standard library use Expat under the hood by default.

Unfortunately, while the Expat parser can tell you if your document is well-formed, it can't validate the structure of your documents against an XML Schema Definition (XSD) or a Document Type Definition (DTD). For that, you'll have to use one of the third-party libraries discussed later.

xml.dom.minidom: Minimal DOM Implementation

Considering that parsing XML documents using the DOM is arguably the most straightforward, you won't be that surprised to find a DOM parser in the Python standard library. What is surprising, though, is that there are actually two DOM parsers.

The xml.dom package houses two modules to work with DOM in Python:

xml.dom.minidom
xml.dom.pulldom

The first is a stripped-down implementation of the DOM interface conforming to a relatively old version of the W3C specification. It provides common objects defined by the DOM API such as Document, Element, and Attr. This module is poorly documented and has quite limited usefulness, as you're about to find out.

The second module has a slightly misleading name because it defines a streaming pull parser, which can optionally produce a DOM representation of the current node in the document tree. You'll find more information about the pulldom parser later.

There are two functions in minidom that let you parse XML data from various data sources. One accepts either a filename or a file object, while the other expects a Python string:

>>> from xml.dom.minidom import parse, parseString

>>> # Parse XML from a filename
>>> document = parse("smiley.svg")

>>> # Parse XML from a file object
>>> with open("smiley.svg") as file:
...     document = parse(file)
...

>>> # Parse XML from a Python string
>>> document = parseString("""\
... <svg viewBox="-105 -100 210 270">
...   <!-- More content goes here... -->
... </svg>
... """)

The triple-quoted string helps embed a multiline string literal without using the continuation character (\) at the end of each line. In any case, you'll end up with a Document instance, which exhibits the familiar DOM interface, letting you traverse the tree.

Apart from that, you'll be able to access the XML declaration, DTD, and the root element:

>>> document = parse("smiley.svg")

>>> # XML Declaration
>>> document.version, document.encoding, document.standalone
('1.0', 'UTF-8', False)

>>> # Document Type Definition (DTD)
>>> dtd = document.doctype
>>> dtd.entities["custom_entity"].childNodes
[<DOM Text node "'Hello'">]

>>> # Document Root
>>> document.documentElement
<DOM Element: svg at 0x...>

As you can see, even though the default XML parser in Python can't validate documents, it still lets you inspect .doctype, the DTD, if it's present. Note that the XML declaration and DTD are optional. If the XML declaration or a given XML attribute is missing, then the corresponding Python attributes will be None.

To find an element by ID, you must use the Document instance rather than a specific parent Element. The sample SVG image has two nodes with an id attribute, but you can't find either of them:

>>> document.getElementById("skin") is None
True
>>> document.getElementById("smiley") is None
True

That might be surprising for someone who has only worked with HTML and JavaScript but hasn't worked with XML before. While HTML defines the semantics for certain elements and for attributes such as id, XML doesn't attach any meaning to its building blocks. You need to mark an attribute as an ID explicitly using DTD or by calling .setIdAttribute() in Python, for example:

Definition Style    Implementation
DTD                 <!ATTLIST linearGradient id ID #IMPLIED>
Python              linearGradient.setIdAttribute("id")

Once the id attributes have been marked as IDs, the same lookups return the matching elements:

>>> document.getElementById("skin")
<DOM Element: linearGradient at 0x...>

>>> document.getElementById("smiley")
<DOM Element: g at 0x...>
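
One way to mark the attributes in Python is to walk the tree and call .setIdAttribute() on every element that carries an id. The sketch below is only an illustration: the mark_id_attributes helper and the trimmed-down sample document are assumptions, not code from the original tutorial.

```python
from xml.dom.minidom import Node, parseString

# Hypothetical, trimmed-down version of the sample image.
SAMPLE = """\
<svg xmlns="http://www.w3.org/2000/svg">
  <defs><linearGradient id="skin"/></defs>
  <g id="smiley"/>
</svg>
"""


def mark_id_attributes(parent, attribute_name="id"):
    """Hypothetical helper: flag `attribute_name` as an ID on every element."""
    if parent.nodeType == Node.ELEMENT_NODE and parent.hasAttribute(attribute_name):
        parent.setIdAttribute(attribute_name)
    # Recurse, because the tree can only be visited node by node.
    for child in parent.childNodes:
        mark_id_attributes(child, attribute_name)


document = parseString(SAMPLE)
before = document.getElementById("skin")   # nothing is marked as an ID yet
mark_id_attributes(document)
after = document.getElementById("skin")    # now resolves to the element
```

The recursion starts at the Document node, which is skipped by the node type check, and continues through text nodes, which simply have no children.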

You can also find a collection of similar elements by their tag name. Unlike the .getElementById() method, you can call .getElementsByTagName() on the document or a particular parent element to reduce the search scope:

>>> document.getElementsByTagName("ellipse")
[
    <DOM Element: ellipse at 0x...>,
    <DOM Element: ellipse at 0x...>
]

>>> root = document.documentElement
>>> root.getElementsByTagName("ellipse")
[
    <DOM Element: ellipse at 0x...>,
    <DOM Element: ellipse at 0x...>
]

Note that elements prefixed with a namespace identifier won't be included. They must be searched using .getElementsByTagNameNS(), which expects different arguments: the namespace URI followed by the local tag name.
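
A namespaced lookup might look like the sketch below. The sample document and the inkscape:custom element name are hypothetical; only the Inkscape namespace URI is taken from the sample image.

```python
from xml.dom.minidom import parseString

INKSCAPE_NS = "http://www.inkscape.org/namespaces/inkscape"

# Hypothetical document with one prefixed element.
SAMPLE = f"""\
<svg xmlns="http://www.w3.org/2000/svg" xmlns:inkscape="{INKSCAPE_NS}">
  <inkscape:custom>Some value</inkscape:custom>
  <g/>
</svg>
"""

document = parseString(SAMPLE)

# The plain lookup compares against the qualified name, so it finds nothing.
plain = document.getElementsByTagName("custom")

# The namespace-aware lookup takes the namespace URI and the local name.
namespaced = document.getElementsByTagNameNS(INKSCAPE_NS, "custom")
```

The prefix used in the document is irrelevant to the second call; only the namespace URI it maps to matters.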

Another thing to keep in mind is how minidom handles whitespace characters between elements: newlines and indentation end up in the tree as separate text nodes.
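
To see the whitespace handling in action, consider this small sketch with a made-up, indented fragment. It only counts node types, so it makes no claim about how the original tutorial displayed them.

```python
from xml.dom.minidom import Node, parseString

# Hypothetical fragment: two elements separated by newlines and indentation.
document = parseString("<g>\n  <circle/>\n  <ellipse/>\n</g>")
root = document.documentElement

# Each run of whitespace between tags becomes its own text node.
node_types = [child.nodeType for child in root.childNodes]
text_count = sum(1 for node_type in node_types if node_type == Node.TEXT_NODE)
element_count = sum(1 for node_type in node_types if node_type == Node.ELEMENT_NODE)
```

Two elements produce five child nodes here, because the whitespace before, between, and after them is preserved.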

Note that you also have to .normalize() the document to combine adjacent text nodes. Otherwise, you may end up with a bunch of redundant nodes that contain only whitespace. Again, recursion is the only way to visit tree elements because you can't iterate over the document and its elements with a loop. Once that's done, you should get the expected result.
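
A possible cleanup routine is sketched below. It removes whitespace-only text nodes outright, which is a simpler variant than collapsing whitespace inside them; the strip_whitespace_nodes helper and the sample fragment are assumptions made for illustration.

```python
from xml.dom.minidom import Node, parseString


def strip_whitespace_nodes(node):
    """Hypothetical helper: drop whitespace-only text nodes recursively."""
    for child in list(node.childNodes):  # copy, because the list is mutated
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace_nodes(child)


document = parseString("<g>\n  <circle/>\n  <text>Hi</text>\n</g>")
document.normalize()              # merge adjacent text nodes first
strip_whitespace_nodes(document)

remaining = [child.tagName for child in document.documentElement.childNodes]
label = document.getElementsByTagName("text")[0].firstChild.data
```

Text that carries real content, such as the label above, is left untouched.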

Elements expose a few helpful methods and attributes that let you query their details.

For instance, you can check an element's namespace, tag name, or attributes. If you ask for a missing attribute, then you'll get an empty string (`''`).
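As a rough sketch of such queries, assuming a tiny SVG-like string made up for this example rather than the tutorial's own file, you might write:

```python
from xml.dom.minidom import parseString

# Hypothetical sample; only the x attribute is present on <text>.
XML = '<svg xmlns="http://www.w3.org/2000/svg"><text x="42">Some value</text></svg>'

document = parseString(XML)
element = document.getElementsByTagName("text")[0]

namespace = element.namespaceURI        # namespace inherited from <svg>
tag = element.tagName                   # tag name of the element
present = element.getAttribute("x")     # existing attribute
missing = element.getAttribute("y")     # missing attribute gives ""
has_y = element.hasAttribute("y")       # explicit presence check
```

Because a missing attribute and an empty attribute both come back as an empty string, `.hasAttribute()` is the safer way to tell them apart.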

Dealing with namespaced attributes isn't much different. You just have to remember to prefix the attribute name accordingly or provide the namespace's domain name.

Strangely enough, the wildcard character (`*`) doesn't work here as it did with the element lookup method used earlier.
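The two lookup styles, and the wildcard that doesn't help, could look something like the sketch below. The shortened XML string is an assumption for illustration, although the `inkscape:z="555"` attribute mirrors the sample SVG.

```python
from xml.dom.minidom import parseString

INKSCAPE_NS = "http://www.inkscape.org/namespaces/inkscape"

# Hypothetical, shortened version of the sample document.
XML = (
    '<svg xmlns="http://www.w3.org/2000/svg" '
    f'xmlns:inkscape="{INKSCAPE_NS}">'
    '<text x="42" inkscape:z="555">Some value</text></svg>'
)

element = parseString(XML).getElementsByTagName("text")[0]

by_prefix = element.getAttribute("inkscape:z")           # prefixed name
by_namespace = element.getAttributeNS(INKSCAPE_NS, "z")  # domain name + local name
by_wildcard = element.getAttributeNS("*", "z")           # wildcard is not matched
```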

Since this tutorial is only about XML parsing, you'll need to check the `xml.dom.minidom` documentation for methods that modify the DOM tree. They mostly follow the W3C specification.

As you can see, the `xml.dom.minidom` module isn't terribly convenient. Its main advantage comes from being part of the standard library, which means you don't have to install any external dependencies in your project to work with the DOM.

xml.sax: The SAX Interface for Python

To start working with SAX in Python, you can use the same `parse()` and `parseString()` convenience functions as before, but from the `xml.sax` package instead. You also have to provide at least one more required argument, which must be a content handler instance. In the spirit of Java, you provide one by subclassing a specific base class.

The content handler receives a stream of events corresponding to elements in your document as it's being parsed. A handler class with an empty body won't do anything useful yet. To make it work, you'll need to overload one or more callback methods from the superclass.
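A minimal sketch of that starting point, assuming a hypothetical class name `SVGHandler` and a made-up one-line document, might be:

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler

class SVGHandler(ContentHandler):
    """Empty on purpose: every inherited callback is a no-op."""

handler = SVGHandler()
# The handler instance is the extra required argument.
result = parseString(b"<svg><text>Hello</text></svg>", handler)
```

The call parses the whole document, yet nothing observable happens because no callback has been overridden.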

Fire up your favorite editor and save a modified content handler in its own Python file. A handler that overrides the `.startElement()`, `.endElement()`, and `.characters()` callbacks can print a few events onto the standard output. The SAX parser calls these three methods for you in response to finding the start tag, the end tag, and some text between them. Once you open an interactive session of the Python interpreter, import your content handler and give it a test drive to watch the events arrive.
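One way such a handler could look is sketched below. The class name, the message format, and the extra `events` list (kept only so the output can be inspected afterwards) are assumptions for this example.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler

class PrintingHandler(ContentHandler):
    """Hypothetical handler that records each event and prints it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def _emit(self, message):
        self.events.append(message)
        print(message)

    def startElement(self, name, attrs):
        # attrs is a mapping, so dict() gives a plain dictionary
        self._emit(f"BEGIN <{name}> {dict(attrs)}")

    def endElement(self, name):
        self._emit(f"END </{name}>")

    def characters(self, content):
        # Text may arrive in several pieces; skip whitespace-only ones
        if content.strip():
            self._emit(f"TEXT {content!r}")

handler = PrintingHandler()
parseString(b'<svg><text x="42">Some value</text></svg>', handler)
```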

That's essentially the observer design pattern, which lets you translate XML into another hierarchical format incrementally. Say you wanted to convert that SVG file into a simplified JSON representation. First, you'll want to store your content handler object in a separate variable so that you can extract information from it later.

Since the SAX parser emits events without providing any context about the element it has found, you need to keep track of where you are in the tree. Therefore, it makes sense to push and pop the current element onto a stack, which you can simulate with a regular Python list. You may also define a helper property that returns the last element placed on top of the stack.

When the SAX parser finds a new element, you can immediately capture its tag name and attributes while making placeholders for child elements and the value, both of which are optional. For now, you can store every element as a `dict` object. Replace your existing `.startElement()` method with a new implementation.

The SAX parser gives you attributes as a mapping that you can convert to a plain Python dictionary with a call to the `dict()` function. The element value is often spread over multiple pieces that you can concatenate using the plus operator (`+`) or the corresponding augmented assignment statement.

Aggregating text in such a way ensures that multiline content ends up in the current element. For example, the `<script>` tag in the sample SVG file contains six lines of JavaScript code, which trigger separate calls to the `.characters()` callback.

Finally, once the parser stumbles on the closing tag, you can pop the current element from the stack and append it to its parent's children. If there's only one element left, then it's your document's root, which you should keep for later. Other than that, you might want to clean the current element by removing keys with empty values.

Note that the cleanup helper is a function defined outside of the class body. The cleanup must be done at the end because there's no way of knowing up front how many text pieces there might be to concatenate.

Now it's time to put everything to the test by parsing the XML, extracting the root element from your content handler, and dumping it to a JSON string.
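Putting the steps above together, a complete handler might look roughly like this sketch. The names `DictHandler`, `element_stack`, `current_element`, and `clean()` are assumptions chosen for illustration, and the input is a shortened stand-in for the SVG file.

```python
import json
from xml.sax import parseString
from xml.sax.handler import ContentHandler

def clean(element):
    """Hypothetical helper: strip the text and drop keys with empty values."""
    element["value"] = element["value"].strip()
    return {key: value for key, value in element.items() if value}

class DictHandler(ContentHandler):
    def __init__(self):
        super().__init__()
        self.element_stack = []  # a plain list used as a stack
        self.root = None

    @property
    def current_element(self):
        return self.element_stack[-1]

    def startElement(self, name, attrs):
        self.element_stack.append({
            "name": name,
            "attributes": dict(attrs),
            "children": [],
            "value": "",
        })

    def characters(self, content):
        # Text can arrive in several pieces, so concatenate them
        self.current_element["value"] += content

    def endElement(self, name):
        element = clean(self.element_stack.pop())
        if self.element_stack:
            self.current_element["children"].append(element)
        else:
            self.root = element  # nothing left on the stack: document root

handler = DictHandler()
parseString(b'<svg><text x="42">Some value</text></svg>', handler)
as_json = json.dumps(handler.root)
```

Cleaning happens in `.endElement()` because only then is the element's text known to be complete.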

It's worth noting that this implementation has no memory gain over DOM because it builds an abstract representation of the whole document, just as before. The difference is that you've made a custom dictionary representation instead of the standard DOM tree. However, you could imagine writing directly to a file or a database instead of memory while receiving SAX events. That would effectively lift your computer's memory limit.
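To illustrate the streaming idea, here's a sketch in which each start tag is written to a sink as soon as it arrives. The `StreamingHandler` class is hypothetical, and an in-memory `io.StringIO` object stands in for a real file or database connection.

```python
import io
from xml.sax import parseString
from xml.sax.handler import ContentHandler

class StreamingHandler(ContentHandler):
    """Write one indented line per start tag instead of storing elements."""

    def __init__(self, sink):
        super().__init__()
        self.sink = sink
        self.depth = 0

    def startElement(self, name, attrs):
        self.sink.write(f"{'  ' * self.depth}{name}\n")
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1

sink = io.StringIO()  # stand-in for an open file or a database writer
parseString(b"<svg><g><ellipse/></g><text/></svg>", StreamingHandler(sink))
outline = sink.getvalue()
```

The only state the handler keeps is the current depth, so memory use doesn't grow with the size of the document.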

If you'd like to parse XML namespaces, then you'll need to create and configure the SAX parser yourself with a bit of boilerplate code and also implement slightly different callbacks.

These callbacks receive additional parameters about the element's namespace. To make the SAX parser actually trigger those callbacks instead of some of the earlier ones, you must explicitly enable XML namespace support.

Setting this feature turns the element `name` into a tuple comprised of the namespace's domain name and the tag name.
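A sketch of the namespace-aware setup could look like the following. The `NamespaceHandler` class and the one-line document are assumptions for this example, while `make_parser()`, `feature_namespaces`, and `.startElementNS()` come from the standard `xml.sax` package.

```python
from io import BytesIO
from xml.sax import make_parser
from xml.sax.handler import ContentHandler, feature_namespaces

class NamespaceHandler(ContentHandler):
    """Hypothetical handler that collects namespace-qualified names."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElementNS(self, name, qname, attrs):
        # name is a (namespace URI, local tag name) tuple
        self.names.append(name)

handler = NamespaceHandler()
parser = make_parser()
parser.setFeature(feature_namespaces, True)  # enable XML namespace support
parser.setContentHandler(handler)
parser.parse(BytesIO(b'<svg xmlns="http://www.w3.org/2000/svg"><text/></svg>'))
```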

The `xml.sax` package offers a decent event-based XML parser interface modeled after the original Java API. It's somewhat limited compared to the DOM but should be enough to implement a basic XML streaming parser without resorting to third-party libraries. With this in mind, there's a less verbose pull parser available in Python, which you'll explore next.


xml.dom.pulldom: Streaming Pull Parser

The parsers in the Python standard library often work together. For example, the `xml.dom.pulldom` module wraps the parser from `xml.sax` to take advantage of buffering and read the document in chunks. At the same time, it uses the default DOM implementation from `xml.dom.minidom` to represent document elements. However, those elements are processed one at a time, without any relationship to each other, until you ask for it explicitly.

While the SAX model follows the observer pattern, you can think of StAX as the iterator design pattern, which lets you loop over a flat stream of events. Once again, you can call the familiar `parse()` or `parseString()` functions imported from the module to parse the SVG image.
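As a rough sketch, assuming a shortened stand-in for the SVG image, looping over the event stream might look like this. The call to `.expandNode()` shows how to ask explicitly for an element's children.

```python
from xml.dom.pulldom import START_ELEMENT, parseString

# Hypothetical, shortened stand-in for the sample SVG image.
XML = '<svg><g id="smiley"><ellipse cx="0"/></g><text>Hello</text></svg>'

event_stream = parseString(XML)
started = []
expanded = None

for event, node in event_stream:
    if event == START_ELEMENT:
        started.append(node.tagName)
        if node.tagName == "g":
            # Pull the whole subtree of <g> into a regular DOM node
            event_stream.expandNode(node)
            expanded = node.toxml()
```

Expanding a node consumes the events of its descendants, which is why the nested ellipse never shows up as a separate start event in this loop.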

It takes only a few lines of code to parse the document. The most striking difference between `xml.sax` and `xml.dom.pulldom` is the lack of callbacks, since you drive the whole process. You have a lot more freedom in structuring your code, and you don't need to use classes if you don't want to.

The ElementTree API is a lightweight, efficient, elegant, and feature-rich interface that even some third-party libraries build on. To get started with it, you must import the `xml.etree.ElementTree` module, which is a bit of a mouthful. Therefore, it's customary to define an alias like this: `import xml.etree.ElementTree as ET`.

In slightly older code, you may have seen the `cElementTree` module imported instead. It was an implementation of the same interface written in C that ran several times faster. Today, the regular module uses the fast implementation whenever possible, so you don't need to bother anymore.

You can use the ElementTree API by employing different parsing strategies:

	Non-incremental	Incremental (Blocking)	Incremental (Non-blocking)
`ET.parse()`	✔		
`ET.fromstring()`	✔		
`ET.iterparse()`		✔	
`ET.XMLPullParser`			✔
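The two incremental strategies can be contrasted with a short sketch. The sample bytes and the chunk boundary are arbitrary assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET

XML = b"<svg><g><ellipse/></g><text>Hello</text></svg>"

# Blocking: iterparse() reads from a file-like object while you iterate.
blocking = [
    (event, elem.tag)
    for event, elem in ET.iterparse(io.BytesIO(XML), events=("start",))
]

# Non-blocking: feed XMLPullParser whatever chunks happen to be available,
# then collect the events that are ready so far.
parser = ET.XMLPullParser(events=("end",))
non_blocking = []
for chunk in (XML[:10], XML[10:]):
    parser.feed(chunk)
    non_blocking.extend((event, elem.tag) for event, elem in parser.read_events())
parser.close()
non_blocking.extend((event, elem.tag) for event, elem in parser.read_events())
```

With `XMLPullParser`, your code decides when to supply data, which suits input arriving over a network or from another asynchronous source.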

# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()

54DOM-like fashion. There are two appropriately named functions in the module that allow for parsing a file or a Python string with XML content:

# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()

55



  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd" [
    
]>
 xmlns="http://www.w3.org/2000/svg"
  xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
  viewBox="-105 -100 210 270" width="210" height="270">
   x="42" inkscape:z="555">Some value
  
     id="skin" x1="0" x2="0" y1="0" y2="1">
       offset="0%" stop-color="yellow" stop-opacity="1.0"/>
       offset="75%" stop-color="gold" stop-opacity="1.0"/>
       offset="100%" stop-color="orange" stop-opacity="1"/>
    
  
   id="smiley" inkscape:groupmode="layer" inkscape:label="Smiley">
    
     cx="0" cy="0" r="50"
      fill="url(#skin)" stroke="orange" stroke-width="2"/>
    
     cx="-20" cy="-10" rx="6" ry="8" fill="black" stroke="none"/>
     cx="20" cy="-10" rx="6" ry="8" fill="black" stroke="none"/>
    
     d="M-20 20 A25 25 0 0 0 20 20"
      fill="white" stroke="black" stroke-width="3"/>
  
   x="-40" y="75">&custom_entity; <svg>!

2

# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()

56

Chiến lược không tăng tải toàn bộ tài liệu vào bộ nhớ theo kiểu giống như Dom. Có hai hàm được đặt tên phù hợp trong mô -đun cho phép phân tích tệp hoặc chuỗi Python có nội dung XML:pull parser, which yields a sequence of events and elements:

# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()

55



  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd" [
    
]>
 xmlns="http://www.w3.org/2000/svg"
  xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
  viewBox="-105 -100 210 270" width="210" height="270">
   x="42" inkscape:z="555">Some value
  
     id="skin" x1="0" x2="0" y1="0" y2="1">
       offset="0%" stop-color="yellow" stop-opacity="1.0"/>
       offset="75%" stop-color="gold" stop-opacity="1.0"/>
       offset="100%" stop-color="orange" stop-opacity="1"/>
    
  
   id="smiley" inkscape:groupmode="layer" inkscape:label="Smiley">
    
     cx="0" cy="0" r="50"
      fill="url(#skin)" stroke="orange" stroke-width="2"/>
    
     cx="-20" cy="-10" rx="6" ry="8" fill="black" stroke="none"/>
     cx="20" cy="-10" rx="6" ry="8" fill="black" stroke="none"/>
    
     d="M-20 20 A25 25 0 0 0 20 20"
      fill="white" stroke="black" stroke-width="3"/>
  
   x="-40" y="75">&custom_entity; <svg>!


The non-incremental strategy loads the entire document into memory in a DOM-like fashion. The module provides two appropriately named functions for parsing a file or a Python string with XML content.

Parsing a file object or a filename with parse() returns an instance of the ElementTree class, which represents the whole element hierarchy. On the other hand, parsing a string with fromstring() returns the root Element directly.

Alternatively, you can read the XML document incrementally with a streaming pull parser, which yields a sequence of events and elements.

By default, iterparse() emits only the end events associated with closing XML tags. However, you can subscribe to other events as well, which are identified by string constants.

Here's a list of all the available event types:

start: Start of an element
end: End of an element
comment: Comment element
pi: Processing instruction, as in XSL
start-ns: Start of a namespace
end-ns: End of a namespace
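The two strategies described above can be sketched side by side. This is only an illustration, assuming the standard library's xml.etree.ElementTree module (imported as ET) and a tiny made-up document held in memory instead of a real file:

```python
import io
import xml.etree.ElementTree as ET

# A tiny, made-up document used only for illustration.
XML_BYTES = b"<root><record id='1'>first</record><record id='2'>second</record></root>"

# Non-incremental: parse() takes a filename or file object and returns an ElementTree.
tree = ET.parse(io.BytesIO(XML_BYTES))
root_from_file = tree.getroot()

# Non-incremental: fromstring() takes XML text and returns the root Element.
root_from_string = ET.fromstring(XML_BYTES)

# Incremental: iterparse() yields (event, element) pairs while reading the source.
seen = []
for event, elem in ET.iterparse(io.BytesIO(XML_BYTES), events=("start", "end")):
    seen.append((event, elem.tag))
    if event == "end" and elem.tag == "record":
        elem.clear()  # Drop the record's content once it has been handled.
```

Subscribing to both start and end makes the nesting visible in the collected events, while clearing each finished record keeps memory use flat on larger inputs.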

The downside of iterparse() is that it uses blocking calls to read the next chunk of data, which might be unsuitable for asynchronous code running on a single thread of execution. To alleviate that, you can look into XMLPullParser, which is a little bit more verbose.

This non-blocking incremental parsing strategy allows for truly concurrent parsing of multiple XML documents on the fly while you download them.
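As a rough sketch of the non-blocking approach, the snippet below feeds XMLPullParser with a few hypothetical chunks that stand in for data arriving over a network connection. The chunk boundaries and the drain() helper are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical chunks, split at arbitrary points as a network stream might be.
chunks = ["<root><rec", "ord>first</record>", "<record>second</record></root>"]

parser = ET.XMLPullParser(events=("start", "end"))
events = []


def drain(pull_parser, sink):
    """Collect whatever events the parser has buffered so far."""
    for event, elem in pull_parser.read_events():
        sink.append((event, elem.tag))


for chunk in chunks:
    parser.feed(chunk)  # Only consumes data that is already available.
    drain(parser, events)

parser.close()
drain(parser, events)  # Pick up anything the parser was still holding back.
```

Because feed() never waits for more input, the same loop could interleave several parsers, one per download.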


Elements support the sequence protocol, letting you iterate over their direct children with a loop.
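A minimal sketch of that behavior, using a small made-up document rather than the tutorial's SVG file:

```python
import xml.etree.ElementTree as ET

# Made-up markup; only the nesting matters here.
root = ET.fromstring("<svg><defs/><g><circle/><ellipse/></g><text>Hello</text></svg>")

# Iterating over an element visits its direct children only.
child_tags = [child.tag for child in root]

# Elements also support indexing and len(), like other sequences.
first_child = root[0]
number_of_children = len(root)
```

Note that the nested circle and ellipse don't show up, because they are grandchildren of the root.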


You can also walk over all descendants with .iter(), filtering only specific tag names using an optional tag argument.
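Here's a short sketch of descendant iteration, assuming Element.iter() from the standard library and a made-up document without namespaces:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<svg><g><ellipse id='a'/><g><ellipse id='b'/></g></g><circle/></svg>"
)

# Without an argument, .iter() walks the element itself and all its descendants.
all_tags = [elem.tag for elem in root.iter()]

# With a tag name, only matching elements are visited, at any depth.
ellipse_ids = [elem.get("id") for elem in root.iter("ellipse")]
```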


Remember to include the XML namespace, such as {http://www.w3.org/2000/svg}, in your tag name, as long as it's been defined. Otherwise, if you only provide the tag name without the right namespace, you could end up with fewer or more descendant elements than initially anticipated.
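To see the difference, consider this small sketch. The document is made up, but it declares the same default SVG namespace as the sample file:

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
root = ET.fromstring(f'<svg xmlns="{SVG_NS}"><g><ellipse/><ellipse/></g></svg>')

# ElementTree stores qualified names as "{namespace-uri}local-name".
root_tag = root.tag

# A bare tag name doesn't match elements that live in a namespace.
without_namespace = list(root.iter("ellipse"))

# Spelling out the namespace in curly braces finds them.
with_namespace = list(root.iter(f"{{{SVG_NS}}}ellipse"))
```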

Alternatively, you can map namespace prefixes to their URIs with a dictionary. To specify the default namespace, you can leave the key blank or assign an arbitrary prefix, which must be used in the tag name later.
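Both options might look as follows. This is a sketch on a made-up document; the svg prefix is an arbitrary choice, and the blank key relies on behavior available in Python 3.8 and later:

```python
import xml.etree.ElementTree as ET

XML_TEXT = (
    '<svg xmlns="http://www.w3.org/2000/svg" '
    'xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape">'
    '<g inkscape:label="Smiley"><ellipse/></g></svg>'
)
root = ET.fromstring(XML_TEXT)

# Option 1: an arbitrary prefix, which then has to appear in the tag name.
prefixed = {"svg": "http://www.w3.org/2000/svg"}
groups = root.findall("svg:g", prefixed)

# Option 2: a blank key, which applies to unprefixed names (Python 3.8+).
default_only = {"": "http://www.w3.org/2000/svg"}
groups_default = root.findall("g", default_only)

# Namespaced attributes are looked up by their fully qualified name.
label = groups[0].get("{http://www.inkscape.org/namespaces/inkscape}label")
```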


Keep in mind that the search methods take an XPath expression rather than a simple tag name.

Coincidentally, the tag name used earlier happens to be a valid path relative to the current element, which is why the function returned a non-empty result before. However, to find the ellipses nested one level deeper in the XML hierarchy, you need a longer path expression.
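The effect of the path's depth can be sketched like this, again on a made-up document and with an arbitrarily chosen svg prefix:

```python
import xml.etree.ElementTree as ET

ns = {"svg": "http://www.w3.org/2000/svg"}
root = ET.fromstring(
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<g id="smiley"><circle/><ellipse rx="6"/><ellipse rx="6"/></g></svg>'
)

# A single step only looks at the direct children of the current element.
direct = root.findall("svg:ellipse", ns)

# A longer path descends one level through the group element.
nested = root.findall("svg:g/svg:ellipse", ns)

# The .// prefix searches the whole subtree, whatever the depth.
anywhere = root.findall(".//svg:ellipse", ns)
```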

ElementTree has limited support for the XPath mini-language, which you can use to query elements in XML, similar to CSS selectors in HTML. Other methods accept such an expression as well.

While .iterfind() yields matching elements lazily, .findall() returns a list, and .find() returns only the first matching element. Similarly, you can extract the text enclosed between the opening and closing tags of elements using .findtext() or get the inner text of the entire document with .itertext().
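The following sketch puts the five methods next to each other on a made-up document, so the element names and text are illustrative only:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<doc><title>Smiley</title><body><p>Hello</p><p>World</p></body></doc>"
)

lazy_matches = root.iterfind("body/p")  # Matches are produced on demand.
all_matches = root.findall("body/p")    # A list with every match.
first_match = root.find("body/p")       # The first match, or None.

title = root.findtext("title")          # Text of the first matching element.
all_text = "".join(root.itertext())     # Inner text of the whole document.
```

When no element matches, .find() returns None and .findtext() returns its default value, so check the result before using it.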

You look for text embedded in a specific XML element first, then everywhere in the whole document. Searching by text is a powerful feature of the ElementTree API. It's possible to replicate it using other built-in parsers, but at the cost of increased code complexity and less convenience.

The ElementTree API is probably the most intuitive of them all. It's Pythonic, efficient, robust, and universal. Unless you have a specific reason to use DOM or SAX, this should be your default choice.

Explore Third-Party XML Parser Libraries

Sometimes, reaching for the XML parsers in the standard library might feel like picking up a sledgehammer to crack a nut. At other times, it's the opposite, and you wish for a parser that could do much more. For example, you might want to validate XML against a schema or use advanced XPath expressions. In those situations, it's best to check out the external libraries available on PyPI.

Below, you'll find a selection of external libraries with varying degrees of complexity and sophistication.

untangle: Convert XML to Python Objects

If you're looking for a one-liner that can turn your XML document into a Python object, then look no further. Although it hasn't been updated in a few years, the untangle library might soon become your favorite way of parsing XML in Python. There's only one function to remember, and it accepts a URL, a filename, a file object, or an XML string.

In each case, it returns an instance of the Element class. You can use the dot operator to access its children and the square bracket syntax to get XML attributes or one of the child nodes by index. To get the document's root element, for example, you can access it as if it were the object's property. To get one of the element's XML attributes, you may pass its name as a dictionary key.

There are no function or method names to remember. Instead, each parsed object is unique, so you really need to know the underlying XML document's structure to traverse it with untangle.

To find out the root element's name, call dir() on the document.

This reveals the names of the element's immediate children. Note that untangle redefines the meaning of dir() for its parsed documents. Usually, you call this built-in function to inspect a Python class or module. The default implementation would return a list of attribute names rather than the child elements of an XML document.

If there's more than one child with the given tag name, then you can iterate over them with a loop or refer to one by index.

You might have noticed that one of the elements was renamed. Unfortunately, the library can't handle XML namespaces well, so if that's something you need to rely on, then you must look elsewhere.

Because of the dot notation, element names in XML documents must be valid Python identifiers. If they aren't, then untangle automatically rewrites their names by replacing forbidden characters with an underscore.

Child tag names aren't the only object attributes you can access. Elements also have a few predefined object attributes of their own.

Behind the scenes, the library suffers from rather poor performance. While it's intended for reading tiny documents, you can still combine it with another approach to read multi-gigabyte XML files.

Here's how. If you head over to the Wikipedia archives, you can download one of their compressed XML files. The one at the top should contain a snapshot of the articles' abstracts.

It's over 6 GB in size after download, which is perfect for this exercise. The idea is to scan the file for consecutive opening and closing tags of a given element and then parse the XML fragment between them using untangle for convenience.

The built-in mmap module lets you create a virtual view of the file contents, even when it doesn't fit the available memory. This gives an impression of working with a huge string of bytes that supports searching and the regular slicing syntax. If you're interested in how to encapsulate this logic in a Python class and take advantage of a generator for lazy evaluation, then expand the collapsible section below.

The class is a custom context manager that implements the iterator protocol as an inline generator function. The resulting generator object loops over the XML document as if it were a long stream of characters.
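The description above could be sketched roughly as follows. This is not the original class: the name `TagStream`, its constructor parameters, and the sample document are hypothetical, and the sketch assumes `mmap` for the underlying view and the standard library's ElementTree for parsing each fragment.

```python
import mmap
import os
import tempfile
import xml.etree.ElementTree as ET


class TagStream:
    """Hypothetical helper: lazily yield raw fragments enclosed by a tag."""

    def __init__(self, path, tag_name, encoding="utf-8"):
        self.path = path
        # Only attribute-free opening tags are matched in this sketch.
        self.start_tag = f"<{tag_name}>".encode(encoding)
        self.end_tag = f"</{tag_name}>".encode(encoding)

    def __enter__(self):
        self.file = open(self.path, "rb")
        self.view = mmap.mmap(self.file.fileno(), 0, access=mmap.ACCESS_READ)
        return self

    def __exit__(self, *exc_info):
        self.view.close()
        self.file.close()

    def __iter__(self):
        # Inline generator: each fragment is produced only when requested.
        position = 0
        while True:
            start = self.view.find(self.start_tag, position)
            if start == -1:
                return
            end = self.view.find(self.end_tag, start)
            if end == -1:
                return
            position = end + len(self.end_tag)
            yield self.view[start:position]


# Usage with a tiny stand-in file instead of a multi-gigabyte dump.
path = os.path.join(tempfile.mkdtemp(), "dump.xml")
with open(path, "wb") as file:
    file.write(b"<root><doc><title>A</title></doc><doc><title>B</title></doc></root>")

with TagStream(path, "doc") as stream:
    titles = [ET.fromstring(fragment).findtext("title") for fragment in stream]

print(titles)
```

Because the byte search knows nothing about XML, this sketch would be confused by nested elements of the same name or by the tag name appearing inside comments or CDATA sections.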

Note that the loop takes advantage of fairly new Python syntax, the walrus operator (`:=`), to simplify the code. You can use this operator in assignment expressions, where an expression is evaluated and assigned to a variable at the same time.
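As a small illustration of an assignment expression inside a loop condition, the following sketch collects the offsets of a byte pattern. The sample data and variable names are made up for this example, and the syntax requires Python 3.8 or later.

```python
data = b"<a/><b/><a/><a/>"
needle = b"<a/>"

offsets = []
position = 0
# The assignment expression stores the search result and tests it in one step.
while (found := data.find(needle, position)) != -1:
    offsets.append(found)
    position = found + len(needle)

print(offsets)
```

Without the walrus operator, you would need to call `find()` once before the loop and again at the end of each iteration.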

Without going into the nitty-gritty details, you can use this custom class to skim through a large XML file quickly while inspecting specific elements more thoroughly with a regular parser.

First, you open a file for reading and indicate the tag name that you want to find. Then, you iterate over those elements and receive a parsed fragment of the XML document. It's almost like looking through a tiny window that moves over an infinitely long sheet of paper. This is a relatively superficial example that ignores a few details, but it should give you a general idea of how to use such a hybrid parsing strategy.

Converting XML to a Python Dictionary

If you like JSON but you're not a fan of XML, then check out a library that tries to bridge the gap between the two data formats. It can parse an XML document and represent it as a Python dictionary, which also happens to be the target data type for JSON documents in Python. This makes conversion between XML and JSON possible.
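To illustrate the general idea with nothing but the standard library, here is a rough sketch that turns an ElementTree element into nested dictionaries and then dumps the result as JSON. It is not the library's implementation: the helper `element_to_dict()`, the `@` prefix for attributes, the `#text` key, and the sample SVG string are all assumptions made for this example.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Hypothetical helper: convert an element into nested dictionaries."""
    # Attributes get a prefix so they can't clash with child tag names.
    node = {f"@{name}": value for name, value in element.attrib.items()}
    for child in element:
        value = element_to_dict(child)
        if child.tag in node:
            # Repeated tags are collected into a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (element.text or "").strip()
    if not node:
        return text  # leaf element without attributes: just its text
    if text:
        node["#text"] = text
    return node


root = ET.fromstring('<svg width="10"><title>Smiley</title><g/><g/></svg>')
document = {root.tag: element_to_dict(root)}
print(json.dumps(document))
```

A dedicated library handles many more cases, such as namespaces, mixed content, and streaming, so treat this only as a picture of the mapping between the two formats.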

Unlike the rest of the XML parsers so far, this one expects either a Python string or a file-like object opened for reading in binary mode.

The parsed result preserves element order. However, starting from Python 3.6, plain dictionaries also keep the insertion order. If you'd like to work with regular dictionaries instead, then pass the built-in dictionary type as the corresponding argument to the parsing function.

To avoid name conflicts between XML elements and their attributes, the library automatically prefixes the latter with a special character. You may also ignore attributes completely by setting the corresponding flag appropriately.

XML namespace declarations are treated like regular attributes, while the corresponding prefixes become part of the tag name. However, you can expand, rename, or skip some of the namespaces if you want to.

Once the document has been converted to a dictionary, you can dump it to another format such as JSON or YAML.


lxml is a Python binding for the C libraries libxml2 and libxslt, which support several standards, including XPath, XML Schema, and XSLT.

The library is compatible with Python's ElementTree API, which you learned about earlier in this tutorial. That means you can reuse your existing code by replacing only a single import statement.
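The single-import swap can be sketched as follows. This assumes that lxml exposes its ElementTree-compatible API as `lxml.etree`; the fallback to the standard library is added here only so that the snippet runs where lxml isn't installed, and the sample SVG string is made up for illustration.

```python
try:
    # Third-party drop-in replacement, if it's installed.
    import lxml.etree as ET
except ImportError:
    # Fall back to the standard library implementation.
    import xml.etree.ElementTree as ET

# The rest of the code stays the same regardless of which module was imported.
root = ET.fromstring("<svg><title>Smiley</title></svg>")
title = root.findtext("title")
print(title)
```

The two implementations aren't identical in every detail, so it's worth rerunning your tests after the swap.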

This will give you a great performance boost. On top of that, the lxml library comes with an extensive set of features and provides different ways of using them. For example, it lets you validate your XML documents against several schema languages, one of which is the XML Schema Definition.

None of the XML parsers in Python's standard library is capable of validating documents. Meanwhile, lxml lets you define a schema object and run documents through it while remaining largely compatible with the ElementTree API.

Besides the ElementTree API, lxml supports an alternative `lxml.objectify` interface, which you'll cover later in the data binding section.

Dealing With Malformed XML

You won't typically use the last library in this comparison to parse XML, because you'll mostly encounter it when web scraping HTML documents. That said, it's also capable of parsing XML. Beautiful Soup comes with a pluggable architecture that lets you choose the underlying parser. The lxml one described earlier is actually recommended by the official documentation and is currently the only XML parser supported by the library.

Depending on the kind of documents that you want to parse, the desired efficiency, and feature availability, you can select one of these parsers:

Document Type	Parser Name	Python Library	Speed
HTML	"html.parser"	-	Moderate
HTML	"html5lib"	html5lib	Slow
HTML	"lxml"	lxml	Fast
XML	"lxml-xml" or "xml"	lxml	Fast

Other than speed, there are noticeable differences between the individual parsers. For example, some of them are more forgiving than others when it comes to malformed elements, while others emulate web browsers better.

Assuming you've already installed the lxml and Beautiful Soup libraries into your active virtual environment, you can start parsing XML documents right away. You only need a single import.

If you accidentally specified a different parser, say the lxml HTML one, then the library would add missing HTML tags to the parsed document for you. That's probably not what you intended in this case, so be careful when specifying the parser name.

Beautiful Soup is a powerful tool for parsing XML documents because it can handle invalid content and it has a rich API for extracting information. It copes with incorrectly nested tags, forbidden characters, and badly placed text.

A different parser would raise an exception and surrender as soon as it detected something wrong with the document. Beautiful Soup not only ignores the problems, but also figures out sensible ways to fix some of them. The elements are properly nested afterward, and there's no invalid content.
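For contrast, here is a short sketch of the strict behavior mentioned above, using the standard library's ElementTree as the "different parser". The malformed sample string is made up for illustration.

```python
import xml.etree.ElementTree as ET

malformed = "<root><b><i>text</b></i></root>"  # incorrectly nested tags

try:
    ET.fromstring(malformed)
except ET.ParseError as error:
    # A strict parser gives up at the first well-formedness violation.
    outcome = f"rejected: {error}"
else:
    outcome = "accepted"

print(outcome)
```

A lenient parser would instead try to repair the nesting and hand you a usable tree.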

A message sent for the Shift+2 key combination contains a specific keyboard event type, a timestamp, the key code and its Unicode character, as well as the modifier keys such as Alt, Ctrl, or Shift. The meta key is usually the Win or Cmd key, depending on your keyboard layout.

A mouse event message looks similar. However, instead of the key, it carries the mouse cursor position and a bit field encoding the mouse buttons pressed during the event. A bit field equal to zero indicates that no button was pressed.
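Decoding such a bit field could look like the sketch below. It assumes that the bits follow the browser's `MouseEvent.buttons` convention (1 for the primary button, 2 for the secondary one, 4 for the auxiliary one); the `Buttons` class and the `pressed()` helper are hypothetical names introduced for this example.

```python
from enum import IntFlag


class Buttons(IntFlag):
    """Assumed bit layout, following the DOM MouseEvent.buttons convention."""

    PRIMARY = 1
    SECONDARY = 2
    AUXILIARY = 4


def pressed(bit_field):
    """Return the names of the buttons whose bits are set."""
    return [button.name for button in Buttons if button & bit_field]


print(pressed(0))
print(pressed(5))
```

A value of zero yields an empty list, which matches the rule that a zero bit field means no button was pressed.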

As soon as the client makes the connection, it'll start flooding the server with messages. The protocol won't include any handshakes, heartbeats, graceful shutdowns, topic subscriptions, or control messages. You can code this in JavaScript by registering event handlers and creating a WebSocket object in less than fifty lines of code.

However, implementing the client isn't the point of this exercise. Since you don't need to understand it, just expand the collapsible section below to reveal the HTML code with embedded JavaScript and save it in a file named whatever you like.

The client connects to a local server listening on port 8000. Once you save the HTML code in a file, you'll be able to open it with your favorite web browser. But before you do that, you'll need to implement the server.

Python doesn't come with WebSocket support, but you can install a third-party WebSocket library into your active virtual environment. You'll also need lxml later, so it's a good moment to install both dependencies in one go.

Finally, you can scaffold a minimal asynchronous web server.

When you start the server and open the saved HTML file in a web browser, you should see XML messages appear in the standard output in response to your mouse movements and key presses. You can open the client in multiple tabs or even in multiple browsers at the same time!

Define Models With XPath Expressions

Right now, your messages arrive in a plain string format, which isn't very convenient to work with. Fortunately, you can turn them into compound Python objects with a single line of code using the `lxml.objectify` module.

As long as the XML parsing succeeds, you can inspect the usual properties of the root element, such as its tag name, attributes, inner text, and so on. You can use the dot operator to navigate deep into the element tree. In most cases, the library will recognize a suitable Python data type and convert the value for you.

After saving those changes and restarting the server, you'll need to reload the page in your web browser to make a new WebSocket connection.

Sometimes, the XML may contain tag names that aren't valid Python identifiers, or you might want to adjust the message structure to fit your data model. In such a case, an interesting option is to define custom model classes with descriptors that declare how to look up information using XPath expressions. That's the part that starts to resemble Django models or Pydantic schema definitions.

You'll use a custom descriptor and an accompanying base class, which provide reusable properties for your data models. The descriptor expects an XPath expression for looking up an element in the received message. The underlying implementation is a bit advanced, so feel free to copy the code from the collapsible section below.

Assuming you have the desired descriptor and the abstract base class in your module, you can use them to define your message types along with reusable building blocks to avoid repetition. There are many ways to do so.

The descriptor allows for lazy evaluation so that elements of the XML messages are looked up only when requested. More specifically, they're only looked up when you access a property on the event object. Moreover, the results are cached to avoid running the same XPath query more than once. The descriptor also respects type annotations and converts deserialized data to the right Python type automatically.
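The behavior described above (lazy lookup, caching, and conversion driven by type annotations) can be sketched with the standard library alone. This is not the implementation from the original tutorial, which isn't preserved here. The names XPath, Model, and MouseClick, as well as the sample message, are hypothetical, and the sketch relies on the limited XPath subset supported by xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET
from typing import get_type_hints


class XPath:
    """Hypothetical descriptor: looks up element text lazily and caches it."""

    def __init__(self, expression):
        self.expression = expression

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        # Runs only on first access. This is a non-data descriptor, so once
        # the value sits in the instance __dict__, Python finds it there
        # first and never calls __get__ again for that attribute.
        text = instance.element.findtext(self.expression)
        converter = get_type_hints(type(instance)).get(self.name, str)
        value = None if text is None else converter(text)
        instance.__dict__[self.name] = value
        return value


class Model:
    """Hypothetical base class that keeps the parsed element around."""

    def __init__(self, xml_text):
        self.element = ET.fromstring(xml_text)


class MouseClick(Model):
    # The annotation decides which type the text is converted to.
    x: int = XPath("./position/x")
    y: int = XPath("./position/y")
    button: str = XPath("./button")


event = MouseClick(
    "<click><position><x>3</x><y>14</y></position>"
    "<button>left</button></click>"
)
print(event.x + event.y)  # 17
print(event.button)       # left
```

Storing the result in the instance dictionary is one simple way to get caching for free. A data descriptor with an explicit cache would work too, at the cost of a little more code.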

Using those event objects isn't much different from using the ones generated automatically earlier.

There's an additional step of creating new objects of the specific event type. Other than that, this approach gives you more flexibility in structuring your model independently of the XML protocol. It also lets you derive new model attributes from the ones in the received messages and add more methods on top of them.

Generate Models From an XML Schema

Implementing model classes is a tedious and error-prone task. However, as long as your model mirrors the XML messages, you can take advantage of an automated tool to generate the necessary code for you based on an XML Schema. The downside of such code is that it's usually less readable than code written by hand.

One of the oldest third-party modules to allow that was PyXB, which mimics Java's popular JAXB library. Unfortunately, it was last released several years ago and targeted legacy Python versions. You can look into a similar yet actively maintained alternative, which also generates data structures from an XML Schema.

Let's say you have a schema file describing your message.

A schema tells the XML parser which elements to expect, in what order, and at which level of the tree. It also restricts the allowed values for XML attributes. Any discrepancy between these declarations and an actual XML document should render it invalid and make the parser reject the document.

Additionally, some tools can leverage this information to generate a piece of code that hides the details of XML parsing from you. After installing the library, you should be able to run its command-line tool in your active virtual environment.

It'll create a new file in the same directory containing the generated Python source code. You can then import that module and use it to parse the incoming messages.

It looks similar to the example shown earlier. The difference is that using data binding enforces compliance with the schema, whereas the earlier approach produces objects dynamically regardless of whether they're semantically correct.

Defuse the XML Bomb With Secure Parsers

XML parsers in Python's standard library are vulnerable to a host of security threats that can lead to denial of service (DoS) or data loss, at best. That's not their fault, to be fair. They just follow the specification of the XML standard, which is more complicated and powerful than most people realize.

One of the most common attacks is the XML bomb, also known as the billion laughs attack. The attack exploits entity expansion in the DTD to blow up the memory and occupy the CPU for as long as possible. A few lines of XML code are all you need to stop an unprotected web server from receiving new traffic.

A naive parser will try to resolve the custom entity placed in the document root by inspecting the DTD. However, that entity itself refers to another entity several times, which refers to yet another entity, and so on. Parsing such a document has a disturbing effect on your memory and processing unit.

The main memory and the swap partition can be depleted in just a few seconds while one of the CPUs works at 100% of its capacity. In the recorded demonstration, the recording stops abruptly when the system memory becomes full and then resumes after the Python process is killed.
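The nested expansion described above can be reproduced at a harmless scale. The sketch below builds a deliberately tiny document with three levels of four references each and parses it with xml.etree.ElementTree. The helper name build_nested_entities, the entity names, and the sizes are assumptions chosen for illustration. The commonly cited full-size variant uses ten references over nine nested levels, which the last line works out arithmetically without parsing anything.

```python
import xml.etree.ElementTree as ET


def build_nested_entities(levels, fanout, payload="lol"):
    """Build a small XML document whose entities refer to each other."""
    declarations = [f'<!ENTITY e0 "{payload}">']
    for level in range(1, levels + 1):
        # Each entity repeats the previous one `fanout` times.
        references = f"&e{level - 1};" * fanout
        declarations.append(f'<!ENTITY e{level} "{references}">')
    dtd = "\n  ".join(declarations)
    return f"<!DOCTYPE bomb [\n  {dtd}\n]>\n<bomb>&e{levels};</bomb>"


# Deliberately tiny: 3 levels with 4 references each.
document = build_nested_entities(levels=3, fanout=4)
expanded = ET.fromstring(document).text

print(len(document))  # size of the input in characters
print(len(expanded))  # 192, that is, 3 characters * 4**3 copies

# The same arithmetic for 9 levels of 10 references, without parsing:
print(len("lol") * 10**9)  # 3000000000
```

The input grows linearly with the number of levels, while the expanded text grows exponentially. That's the whole trick behind the attack, so keep the parameters small if you experiment with this.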

Another popular type of attack, known as XXE, takes advantage of general external entities to read local files and make network requests. Nevertheless, starting from Python 3.7.1, this feature has been disabled by default to increase security. If you trust your data, then you can tell the SAX parser to process external entities anyway.
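As a minimal sketch of that opt-in, assuming the default expat-based SAX parser from the standard library, you can query and toggle the feature_external_ges feature before parsing. Only do this for documents you fully trust.

```python
import xml.sax
from xml.sax.handler import feature_external_ges

parser = xml.sax.make_parser()

# On Python 3.7.1 and later, general external entities are off by default.
print(bool(parser.getFeature(feature_external_ges)))

# Opt in explicitly. Do this only for trusted input.
parser.setFeature(feature_external_ges, True)
print(bool(parser.getFeature(feature_external_ges)))  # True
```

The feature has to be set before parsing starts. Once enabled, the parser will resolve external entities declared in the DTD, which is exactly what an XXE payload relies on.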


A secure parsing library can act as a drop-in replacement for all the parsers in the standard library.


Choosing the right XML parser means finding the sweet spot between performance, security, compliance, and convenience.

This tutorial puts a detailed roadmap in your hand to navigate the confusing maze of XML parsers in Python. You know where to take the shortcuts and how to avoid dead ends, saving you lots of time.

In this tutorial, you've learned how to:

Choose the right XML parsing model
Use the XML parsers in the standard library
Use major XML parsing libraries
Parse XML documents declaratively using data binding
Use safe XML parsers to eliminate security vulnerabilities

Now, you understand the different strategies for parsing XML documents as well as their strengths and weaknesses. With this knowledge, you can choose the most suitable XML parser for your particular use case and even combine more than one to read multi-gigabyte XML files faster.
