How to find a string in a column in Python

Question

Python is a great language for data analysis, primarily because of its ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier.
The Pandas str.find() method searches for a substring in each string of a Series. If the substring is found, it returns the lowest index of its occurrence. If the substring is not found, it returns -1.
Start and end points can also be passed to restrict the search to a specific part of each string.


Syntax: Series.str.find(sub, start=0, end=None)
Parameters:
sub: String or character to be searched for in each text value of the Series
start: int value, start point of the search. Default is 0, which means from the beginning of the string
end: int value, end point where the search stops. Default is None.
Return type: Series with the index position of the substring occurrence
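As a minimal sketch of how the start and end parameters change the result, the snippet below runs str.find on a small made-up Series (the fruit names are an assumption for illustration, not data from this article).

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the parameters.
names = pd.Series(["banana", "cherry", "kiwi"])

# Default search: lowest index of "a" in each string, or -1 if absent.
first_a = names.str.find("a")

# Start the search at index position 2 of each string.
a_from_2 = names.str.find("a", start=2)

# Only look inside the slice [0:1] of each string.
a_in_first_char = names.str.find("a", start=0, end=1)

print(first_a.tolist())
print(a_from_2.tolist())
print(a_in_first_char.tolist())
```

The returned positions are always relative to the start of the full string, not to the start argument.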

In the following examples, the data frame used contains data on some NBA players.


Example #1: Finding a single character
In this example, a single character 'a' is searched for in each string of the Name column using the str.find() method. The start and end parameters are left at their defaults. The returned Series is stored in a new column so that the indexes can be compared directly. Before applying the method, null rows are dropped using .dropna() to avoid errors.
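The steps described in this example could look roughly like the sketch below. The original CSV is not reproduced here, so the snippet builds a tiny stand-in data frame; the names in it are invented and only the column name Name follows the text.

```python
import numpy as np
import pandas as pd

# Stand-in for the NBA data frame; these rows are hypothetical.
data = pd.DataFrame({"Name": ["Alan Smith", "Maria Lopez", np.nan, "Bob Young"]})

# Drop rows with a missing Name before using the string method.
data = data.dropna(subset=["Name"]).copy()

# Store the position of the first lowercase "a" in a new column.
data["Indexes"] = data["Name"].str.find("a")

print(data)
```

Note that "Alan Smith" reports the position of the lowercase 'a', not the leading 'A', because the search is case sensitive.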


Output: Each value in the Indexes column equals the position of the first occurrence of the character in the string. If the substring doesn't exist in the text, -1 is returned. The first row also shows that 'A' was not matched, which proves that this method is case sensitive.

Example #2: Searching for a substring (more than one character)
In this example, the substring 'er' is searched for in the Name column of the data frame. The start parameter is set to 2 so that the search starts from the 3rd character (index position 2).
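A rough sketch of this example follows. It again uses a small stand-in data frame instead of the original CSV: "Terry Rozier" is the name mentioned in the text, while the other two names are made up for illustration.

```python
import pandas as pd

# Stand-in data; only "Terry Rozier" comes from the article.
data = pd.DataFrame({"Name": ["Terry Rozier", "Peter Baker", "Anna Lee"]})

# Search for "er" starting at index position 2 of each name.
data["Indexes"] = data["Name"].str.find("er", start=2)

# For comparison: the same search with the default start of 0.
data["From start"] = data["Name"].str.find("er")

print(data)
```

Comparing the two columns shows how the start parameter skips the early 'er' in "Terry".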


Output: The lowest index at which the substring occurs, at or after the start position, is returned. In the case of Terry Rozier (row 9 in the data frame), 10 was returned instead of the position of the first 'er'. This is because the start parameter was set to 2 and the first 'er' occurs before that.

How do I select by partial string from a pandas DataFrame?

This post is meant for readers who want to

search for a substring in a string column (the simplest case)

search for multiple substrings

match a whole word from text (e.g., "blue" should match "the sky is blue" but not "bluejay")

match multiple whole words

understand the reason behind "ValueError: cannot index with vector containing NA / NaN values" and fix it with na=False

... and would like to know more about which methods should be preferred over others.

(P.S.: I've seen a lot of questions on similar topics, so I thought it would be good to leave this here.)

Friendly disclaimer: this post is long.

# setup
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})
df1

      col
0     foo
1  foobar
2     bar
3     baz

Basic substring search

str.contains can be used to perform either substring searches or regex-based searches. The search defaults to regex-based unless you explicitly disable it.

Here is an example of a regex-based search,

# find rows in `df1` which contain "foo" followed by something
df1[df1['col'].str.contains(r'foo(?!$)')]

      col
1  foobar

Sometimes a regex search is not required, so specify regex=False to disable it.

#select all rows containing "foo"
df1[df1['col'].str.contains('foo', regex=False)]
# same as df1[df1['col'].str.contains('foo')] but faster.
   
      col
0     foo
1  foobar

Performance-wise, regex search is slower than substring search:

df2 = pd.concat([df1] * 1000, ignore_index=True)

%timeit df2[df2['col'].str.contains('foo')]
%timeit df2[df2['col'].str.contains('foo', regex=False)]

6.31 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Avoid using regex-based search if you don't need it.
Addressing ValueErrors

Sometimes, performing a substring search and filtering on the result will result in

ValueError: cannot index with vector containing NA / NaN values

This is usually because of mixed data or NaNs in your object column,

s = pd.Series(['foo', 'foobar', np.nan, 'bar', 'baz', 123])
s.str.contains('foo|bar')

0     True
1     True
2      NaN
3     True
4    False
5      NaN
dtype: object


s[s.str.contains('foo|bar')]
# ---------------------------------------------------------------------------
# ValueError                                Traceback (most recent call last)

Anything that is not a string cannot have string methods applied on it, so the result is NaN (naturally). In this case, specify na=False to ignore non-string data,

s.str.contains('foo|bar', na=False)

0     True
1     True
2    False
3     True
4    False
5    False
dtype: bool
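To round off the fix, here is a small sketch that uses the na=False mask to actually filter the Series from this section. The variable names mask and filtered are just illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(['foo', 'foobar', np.nan, 'bar', 'baz', 123])

# With na=False the mask is purely boolean, so it can be used for indexing.
mask = s.str.contains('foo|bar', na=False)
filtered = s[mask]

print(filtered)
```

The NaN and the integer 123 are simply treated as non-matches and dropped from the result.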
How do I apply this to multiple columns at once?

The answer is in the question. Use DataFrame.apply:

# `apply` passes each column to the lambda (axis=0, the default).
df.apply(lambda col: col.str.contains('foo|bar', na=False))

       A      B
0   True   True
1   True  False
2  False   True
3   True  False
4  False  False
5  False  False

All of the solutions below can be "applied" to multiple columns using the column-wise apply method (which is OK in my book, as long as you don't have too many columns).

If you have a DataFrame with mixed columns and want to select only the object/string columns, take a look at select_dtypes.
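As a sketch of the multi-column case, the snippet below applies str.contains to two text columns of a mixed data frame and keeps the rows where any of them matches. The data frame df, its column names, and the explicit text_cols list are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed data frame: two text columns and one numeric column.
df = pd.DataFrame({
    "A": ["foo", "bar", "baz", np.nan],
    "B": ["xyz", "foobar", "abc", "bar"],
    "n": [1, 2, 3, 4],
})

# Restrict the search to the text columns; listed by hand here.
text_cols = ["A", "B"]

# One boolean column per searched column.
mask = df[text_cols].apply(lambda col: col.str.contains("foo|bar", na=False))

# Keep rows where at least one text column matched.
rows_any = df[mask.any(axis=1)]

print(rows_any)
```

Using mask.all(axis=1) instead would keep only rows where every searched column matches.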

Multiple substring search

This is most easily achieved through a regex search using the regex OR pipe.

# Slightly modified example.
df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})
df4

          col
0     foo abc
1  foobar xyz
2       bar32
3      baz 45

df4[df4['col'].str.contains(r'foo|baz')]

          col
0     foo abc
1  foobar xyz
3      baz 45

You can also create a list of terms, then join them:

terms = ['foo', 'baz']
df4[df4['col'].str.contains('|'.join(terms))]

          col
0     foo abc
1  foobar xyz
3      baz 45

Sometimes, it is wise to escape your terms in case they contain characters that can be interpreted as regex metacharacters. If your terms contain any of the following characters

. ^ $ * + ? { } [ ] \ | ( )

then you will need to use re.escape to escape them, so that they are treated literally.
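A minimal sketch of escaping terms before joining them is shown below. It reuses df4 and terms from this section; the prices Series is a made-up example added to show what goes wrong when a metacharacter is left unescaped.

```python
import re

import pandas as pd

df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})
terms = ['foo', 'baz']

# Escape every term, then join them with the regex OR pipe.
pattern = '|'.join(map(re.escape, terms))
result = df4[df4['col'].str.contains(pattern)]

# Hypothetical data: "." matches any character unless it is escaped.
prices = pd.Series(['1.5', '105', '2.50'])
unescaped = prices.str.contains('1.5')
escaped = prices.str.contains(re.escape('1.5'))

print(result)
print(unescaped.tolist(), escaped.tolist())
```

For plain alphanumeric terms such as 'foo' and 'baz', escaping changes nothing, so it is safe to apply it unconditionally.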

Matching entire word(s)

By default, the substring search looks for the specified substring/pattern regardless of whether it is a full word or not. To match only full words, we need to use regular expressions here. In particular, the pattern needs to specify word boundaries (\b). For example, "blue" should then match "the sky is blue" but not "bluejay".
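The difference between a plain substring search and a whole-word search can be sketched as follows. The data frame df3 and its two sentences are assumptions built around the "blue" versus "bluejay" example from the text.

```python
import pandas as pd

# Hypothetical data built around the "blue" / "bluejay" example.
df3 = pd.DataFrame({'col': ['the sky is blue', 'bluejay by the window']})

# Plain substring search: matches "blue" inside "bluejay" as well.
substring_hits = df3[df3['col'].str.contains('blue')]

# Whole-word search: \b requires a word boundary on both sides.
whole_word_hits = df3[df3['col'].str.contains(r'\bblue\b')]

print(substring_hits)
print(whole_word_hits)
```

The raw string prefix matters here, because '\b' without it would be interpreted by Python as a backspace character.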

Multiple whole word search

Similar to the above, except we add a word boundary (\b) to the joined pattern.
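One way to build such a joined pattern is sketched below, reusing df4 and terms from the multiple substring section. Wrapping the alternatives in a non-capturing group is a choice made for this sketch so that the word boundaries apply to every term.

```python
import re

import pandas as pd

df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})
terms = ['foo', 'baz']

# \b(?:foo|baz)\b : any of the terms, but only as a whole word.
p = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
result = df4[df4['col'].str.contains(p)]

print(p)
print(result)
```

Here 'foobar xyz' no longer matches, because "foo" is followed directly by another word character.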

A great alternative: use list comprehensions!

Because you can! And you should! They are usually a little faster than string methods, because string methods are hard to vectorise and usually have loopy implementations.

Instead of a plain substring search with str.contains, use the in operator inside a list comprehension.

Instead of a regex search with str.contains, use re.compile (to cache your regex) together with Pattern.search inside a list comprehension.

If "col" has NaNs, then instead of passing na=False to str.contains, the list comprehension has to handle the non-string values itself.
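The three cases above might be sketched like this. The helper try_search is a hypothetical name introduced for this illustration; it is one possible way to treat non-string values as non-matches.

```python
import re

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})

# Substring search with the `in` operator.
sub_hits = df1[['foo' in x for x in df1['col']]]

# Regex search with a pattern compiled once and reused for every row.
p = re.compile(r'foo(?!$)')
regex_hits = df1[[bool(p.search(x)) for x in df1['col']]]


def try_search(pattern, value):
    """Return True on a match; treat non-string values as non-matches."""
    try:
        return bool(pattern.search(value))
    except TypeError:
        return False


# Mixed data with a NaN and an integer, as in the ValueError section.
s = pd.Series(['foo', 'foobar', np.nan, 'bar', 'baz', 123])
nan_safe_hits = s[[try_search(p, x) for x in s]]

print(sub_hits, regex_hits, nan_safe_hits, sep='\n\n')
```

Whether this is faster than str.contains depends on the data, so it is worth timing both on your own columns.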

More options for partial string matching: np.char.find, np.vectorize, DataFrame.query.

In addition to str.contains and list comprehensions, you can also use the following alternatives.

np.char.find
Supports substring searches (read: no regex) only.

np.vectorize
This is a wrapper around a loop, but with less overhead than most pandas str methods. Regex solutions are possible.

DataFrame.query
Supports string methods through the python engine. This offers no visible performance benefits, but is nonetheless useful to know if you need to dynamically generate your queries.

More information on the query and eval family of methods can be found in "Dynamically evaluate an expression from a formula in pandas".
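The three alternatives could be used roughly as in the sketch below, which reuses df1 from the setup. The variable names are illustrative, and the exact calls are one reasonable way to use each function rather than the only one.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})

# np.char.find works on NumPy string arrays and returns -1 when absent.
values = df1['col'].to_numpy().astype(str)
char_hits = df1[np.char.find(values, 'foo') > -1]

# np.vectorize wraps a Python-level loop over the elements.
contains = np.vectorize(lambda haystack, needle: needle in haystack)
vec_hits = df1[contains(df1['col'].to_numpy(), 'foo')]

# DataFrame.query can call string methods when the python engine is used.
query_hits = df1.query('col.str.contains("foo")', engine='python')

print(char_hits, vec_hits, query_hits, sep='\n\n')
```

All three return the same rows as df1[df1['col'].str.contains('foo', regex=False)] on this data.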

Recommended usage precedence

(First) str.contains, for its simplicity and ease of handling NaNs and mixed data

List comprehensions, for their performance (especially if your data is purely strings)

np.vectorize

(Last) DataFrame.query
