# Pandas DataFrame Quick Tutorial

This tutorial introduces DataFrames, the central data structure in the pandas API. It is not a comprehensive DataFrames tutorial. Rather, this lab provides a very quick introduction to the parts of DataFrames required to do the other exercises in Machine Learning Crash Course.

A DataFrame is similar to an in-memory spreadsheet. Like a spreadsheet:

- A DataFrame stores data in cells.
- A DataFrame has named columns (usually) and numbered rows.

## Importing NumPy and pandas

Run the following code cell to import the NumPy and pandas modules.

```@CodeBlock
styleID: 0
codeblock:
  projectId: ''
  title: ''
  name: ''
  language: python_wasm
  code: |-
    import numpy as np
    import pandas as pd
  outputMode: autoParacraft
  output: ''
```

## Creating a DataFrame

The following code cell creates a simple DataFrame containing 10 cells organized as follows:

* 5 rows
* 2 columns, one named `temperature` and the other named `activity`

The code cell instantiates the `pd.DataFrame` class to generate a DataFrame. The call passes two arguments:

- The first argument (`data`) provides the data to populate the 10 cells. The code cell calls `np.array` to generate the 5x2 NumPy array.
- The second argument (`columns`) identifies the names of the two columns.

Note: Do not redefine variables in the following code cell. Subsequent code cells use these variables.

```@CodeBlock
styleID: 0
codeblock:
  projectId: ''
  title: ''
  name: ''
  language: python_wasm
  code: |-
    # Create and populate a 5x2 NumPy array.
    my_data = np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]])

    # Create a Python list that holds the names of the two columns.
    my_column_names = ['temperature', 'activity']

    # Create a DataFrame.
    my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names)

    # Print the entire DataFrame
    print(my_dataframe)
  outputMode: autoParacraft
  output: |2
       temperature  activity
    0            0         3
    1           10         7
    2           20         9
    3           30        14
    4           40        15
```

## Adding a new column to a DataFrame

You can add a new column to an existing pandas DataFrame by assigning values to a new column name. For example, the following code creates a third column named `adjusted` in `my_dataframe`:

```@CodeBlock
styleID: 0
codeblock:
  projectId: ''
  title: ''
  name: ''
  language: python_wasm
  code: |-
    # Create a new column named adjusted.
    my_dataframe["adjusted"] = my_dataframe["activity"] + 2

    # Print the entire DataFrame
    print(my_dataframe)
  outputMode: autoParacraft
  output: |2
       temperature  activity  adjusted
    0            0         3         5
    1           10         7         9
    2           20         9        11
    3           30        14        16
    4           40        15        17
```

## Specifying a subset of a DataFrame

Pandas provides multiple ways to isolate specific rows, columns, slices, or cells in a DataFrame.

```@CodeBlock
styleID: 0
codeblock:
  projectId: ''
  title: ''
  name: ''
  language: python_wasm
  code: |-
    print("Rows #0, #1, and #2:")
    print(my_dataframe.head(3), '\n')

    print("Row #2:")
    print(my_dataframe.iloc[[2]], '\n')

    print("Rows #1, #2, and #3:")
    print(my_dataframe[1:4], '\n')

    print("Column 'temperature':")
    print(my_dataframe['temperature'])
  outputMode: autoParacraft
  output: |
    Rows #0, #1, and #2:
       temperature  activity  adjusted
    0            0         3         5
    1           10         7         9
    2           20         9        11

    Row #2:
       temperature  activity  adjusted
    2           20         9        11

    Rows #1, #2, and #3:
       temperature  activity  adjusted
    1           10         7         9
    2           20         9        11
    3           30        14        16

    Column 'temperature':
    0     0
    1    10
    2    20
    3    30
    4    40
    Name: temperature, dtype: int32
```

## Task 1: Create a DataFrame

Do the following:

1. Create a 3x4 (3 rows x 4 columns) pandas DataFrame in which the columns are named `Eleanor`, `Chidi`, `Tahani`, and `Jason`. Populate each of the 12 cells in the DataFrame with a random integer between 0 and 100, inclusive.
2. Output the following:
   * the entire DataFrame
   * the value in the cell of row #1 of the `Eleanor` column
3. Create a fifth column named `Janet`, which is populated with the row-by-row sums of `Tahani` and `Jason`.

To complete this task, it helps to know the NumPy basics covered in the NumPy UltraQuick Tutorial.

```@CodeBlock
styleID: 0
codeblock:
  projectId: ''
  title: ''
  name: ''
  language: python_wasm
  code: '# Write your code here.'
  outputMode: autoParacraft
  output: ''
```

A solution to Task 1:

```@CodeBlock
styleID: 0
codeblock:
  projectId: ''
  title: ''
  name: ''
  language: python_wasm
  code: |-
    # A solution to Task 1.

    # Create a Python list that holds the names of the four columns.
    my_column_names = ['Eleanor', 'Chidi', 'Tahani', 'Jason']

    # Create a 3x4 NumPy array, each cell populated with a random integer
    # between 0 and 100, inclusive (the upper bound `high` is exclusive).
    my_data = np.random.randint(low=0, high=101, size=(3, 4))

    # Create a DataFrame.
    df = pd.DataFrame(data=my_data, columns=my_column_names)

    # Print the entire DataFrame
    print(df)

    # Print the value in row #1 of the Eleanor column.
    print("\nSecond row of the Eleanor column: %d\n" % df['Eleanor'][1])

    # Create a column named Janet whose contents are the sum
    # of two other columns.
    df['Janet'] = df['Tahani'] + df['Jason']

    # Print the enhanced DataFrame
    print(df)
  outputMode: autoParacraft
  output: ''
```

## Copying a DataFrame (optional)

Pandas provides two different ways to duplicate a DataFrame:

* **Referencing.** If you assign a DataFrame to a new variable, both names refer to the same object, so any change made through one name is reflected in the other.
* **Copying.** If you call the `pd.DataFrame.copy` method, you create a true independent copy. Changes to the original DataFrame or to the copy will not be reflected in the other.

The difference is subtle, but important. The following code cell uses `df` from the Task 1 solution and `my_dataframe` from the earlier sections.

```@CodeBlock
styleID: 0
codeblock:
  projectId: ''
  title: ''
  name: ''
  language: jupyter_server_container
  code: |-
    # Create a reference by assigning df to a new variable.
    print("Experiment with a reference:")
    reference_to_df = df

    # Print the starting value of a particular cell.
    print(" Starting value of df: %d" % df['Jason'][1])
    print(" Starting value of reference_to_df: %d\n" % reference_to_df['Jason'][1])

    # Modify a cell in df.
    df.at[1, 'Jason'] = df['Jason'][1] + 5
    print(" Updated df: %d" % df['Jason'][1])
    print(" Updated reference_to_df: %d\n\n" % reference_to_df['Jason'][1])

    # Create a true copy of my_dataframe
    print("Experiment with a true copy:")
    copy_of_my_dataframe = my_dataframe.copy()

    # Print the starting value of a particular cell.
    print(" Starting value of my_dataframe: %d" % my_dataframe['activity'][1])
    print(" Starting value of copy_of_my_dataframe: %d\n" % copy_of_my_dataframe['activity'][1])

    # Modify a cell in my_dataframe.
    my_dataframe.at[1, 'activity'] = my_dataframe['activity'][1] + 3
    print(" Updated my_dataframe: %d" % my_dataframe['activity'][1])
    print(" copy_of_my_dataframe does not get updated: %d" % copy_of_my_dataframe['activity'][1])
  outputMode: autoParacraft
  output: ''
```
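The difference between a reference and a copy can also be checked directly with Python's `is` operator, which tests whether two names point to the same object. The following is a minimal sketch on a small made-up DataFrame. The names `original`, `alias`, and `independent` are hypothetical and are not part of the tutorial's variables.

```python
import pandas as pd

# Hypothetical example data, not taken from the tutorial.
original = pd.DataFrame({"activity": [3, 7, 9]})

alias = original               # a second name for the same object
independent = original.copy()  # a separate object with its own data

# `is` compares object identity, not cell values.
alias_is_same_object = alias is original
copy_is_same_object = independent is original

# Change one cell through the original name.
original.at[1, "activity"] = 100

# Read the same cell back through the other two names.
alias_value = int(alias.at[1, "activity"])
copy_value = int(independent.at[1, "activity"])

print(alias_is_same_object, copy_is_same_object)
print(alias_value, copy_value)
```

Here the alias sees the updated value because it is the same object as `original`, while the copy keeps the value it had when `copy()` was called.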
## Data Science
On this platform, you can harness the full power of popular Python libraries to analyze and visualize data. The code cell below uses NumPy to generate some random data and Matplotlib to visualize it. To edit the code, click the cell and start editing.

```@CodeBlock
styleID: 0
codeblock:
  projectId: ''
  title: ''
  name: ''
  language: python_wasm
  code: |-
    import numpy as np
    from matplotlib import pyplot as plt

    # 100 random samples centered on 200.
    ys = 200 + np.random.randn(100)
    x = np.arange(len(ys))

    plt.plot(x, ys, '-')
    # Shade the area between the curve and y = 195 wherever the curve is above 195.
    plt.fill_between(x, ys, 195, where=(ys > 195), facecolor='g', alpha=0.6)

    plt.title("Sample Visualization")
    plt.show()
  outputMode: autoParacraft
  output: ''
  output_image: ''
```