User Guide
The User Guide covers all of pandas by topic area. Each subsection introduces a topic (such as "working with missing data") and discusses how pandas approaches the problem, with many examples throughout.
Users brand-new to pandas should start with 10 minutes to pandas.
For a high-level summary of the pandas fundamentals, see Intro to data structures and Essential basic functionality.
Further information on any specific method can be found in the API reference.
How to read these guides
In these guides you will see input code inside code blocks such as:
import pandas as pd
pd.DataFrame({'A': [1, 2, 3]})
   A
0  1
1  2
2  3
or:
In [1]: import pandas as pd
In [2]: pd.DataFrame({'A': [1, 2, 3]})
   A
0  1
1  2
2  3
The first block is standard Python input, while in the second the In [1]: indicates the input is inside a notebook. In Jupyter Notebooks the value of the last line is printed and plots are shown inline.
For example:
In [3]: a = 1
In [4]: a
Out[4]: 1
is equivalent to:
a = 1
print(a)
Guides
10 minutes to pandas
This is a short introduction to pandas, geared mainly toward new users. You can find more complex recipes in the Cookbook.
Customarily, we import as follows:
In [1]: import numpy as np
In [2]: import pandas as pd
Basic data structures in pandas
pandas provides two types of classes for handling data:
Series: a one-dimensional labeled array holding data of any type, such as integers, strings, Python objects, etc.
DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.
Object creation
See the Intro to data structures section.
Creating a Series by passing a list of values, letting pandas create a default RangeIndex:
In [3]: s = pd.Series([1, 3, 5, np.nan, 6, 8])
In [4]: s
Out[4]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
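As a small sketch of why the Series above is reported as float64 even though most inputs are integers: NaN has no integer representation in NumPy, so pandas falls back to a floating-point dtype. The checks below only restate what the transcript shows.

```python
import numpy as np
import pandas as pd

# A list that contains np.nan cannot be stored as a NumPy integer array,
# so pandas infers float64 for the whole Series.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

print(type(s.index).__name__)  # RangeIndex (the default index)
print(s.dtype)                 # float64
print(int(s.isna().sum()))     # 1 (one missing value)
```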
Creating a DataFrame by passing a NumPy array with a datetime index, built with date_range(), and labeled columns:
In [5]: dates = pd.date_range("20130101", periods=6)
In [6]: dates
Out[6]: DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [7]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
In [8]: df
Out[8]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
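Because np.random.randn draws fresh values on every run, the numbers printed above will not reproduce exactly. The following is a minimal sketch of the same construction using a seeded generator; the seed value and the use of np.random.default_rng are assumptions made here for repeatability, not part of the original example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)

# Assumption: a seeded generator stands in for np.random.randn so that
# repeated runs produce the same values.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

print(df.shape)          # (6, 4)
print(list(df.columns))  # ['A', 'B', 'C', 'D']
print(df.index[0])       # 2013-01-01 00:00:00
```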
Creating a DataFrame by passing a dictionary of objects where the keys are the column labels and the values are the column values:
In [9]: df2 = pd.DataFrame(
...: {
...: "A": 1.0,
...: "B": pd.Timestamp("20130102"),
...: "C": pd.Series(1, index=list(range(4)), dtype="float32"),
...: "D": np.array([3] * 4, dtype="int32"),
...: "E": pd.Categorical(["test", "train", "test", "train"]),
...: "F": "foo",
...: }
...: )
...:
In [10]: df2
Out[10]:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
The columns of the resulting DataFrame have different dtypes:
In [11]: df2.dtypes
Out[11]:
A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object
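One way to work with per-column dtypes, sketched here as an illustration rather than as part of the original walkthrough, is DataFrame.select_dtypes, which filters columns by type. The example rebuilds df2 and keeps only the numeric columns; note how the scalar values for A, B and F are broadcast to all four rows.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {
        "A": 1.0,                                   # scalar, broadcast to 4 rows
        "B": pd.Timestamp("20130102"),              # scalar timestamp, broadcast
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",                                 # scalar string, broadcast
    }
)

# Keep only numeric columns; datetime, categorical and string columns drop out.
numeric = df2.select_dtypes(include="number")
print(list(numeric.columns))  # ['A', 'C', 'D']
```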
If you're using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here's a subset of the attributes that will be completed:
In [12]: df2.<TAB> # noqa: E225, E999
df2.A df2.bool
df2.abs df2.boxplot
df2.add df2.C
df2.add_prefix df2.clip
df2.add_suffix df2.columns
df2.align df2.copy
df2.all df2.combine
df2.append df2.D
df2.apply df2.describe
df2.applymap df2.diff
df2.B df2.duplicated
As you can see, the columns A, B, C, and D are automatically tab completed. E and F are there as well; the rest of the attributes have been truncated for brevity.
Viewing data
See the Essential basic functionality section.
Use DataFrame.head() and DataFrame.tail() to view the top and bottom rows of the frame respectively:
In [13]: df.head()
Out[13]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
In [14]: df.tail(3)
Out[14]:
                   A         B         C         D
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
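The transcript above relies on the default row count of head(). A minimal sketch with a hypothetical ten-row frame makes the defaults explicit: both head() and tail() return five rows unless a count is passed.

```python
import pandas as pd

# Hypothetical ten-row frame used only to illustrate the row counts.
frame = pd.DataFrame({"x": range(10)})

print(len(frame.head()))             # 5 (default n)
print(len(frame.tail(3)))            # 3
print(frame.tail(3).index.tolist())  # [7, 8, 9]
```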
Display the DataFrame.index or DataFrame.columns:
In [15]: df.index
Out[15]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
In [16]: df.columns
Out[16]: Index(['A', 'B', 'C', 'D'], dtype='object')
Return a NumPy representation of the underlying data with DataFrame.to_numpy(), without the index or column labels:
In [17]: df.to_numpy()
Out[17]: array([[ 0.4691, -0.2829, -1.5091, -1.1356],
[ 1.2121, -0.1732, 0.1192, -1.0442],
[-0.8618, -2.1046, -0.4949, 1.0718],
[ 0.7216, -0.7068, -1.0396, 0.2719],
[-0.425 , 0.567 , 0.2762, -1.0874],
[-0.6737, 0.1136, -1.4784, 0.525 ]])
Note
NumPy arrays have one dtype for the entire array while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. If the common data type is object, DataFrame.to_numpy() will require copying data.
In [18]: df2.dtypes
Out[18]:
A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object
In [19]: df2.to_numpy()
Out[19]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
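The dtype rule described in the note can be sketched with three small, hypothetical frames: all-float columns stay float64, integer and float columns share float64 as their common dtype, and a frame mixing numbers with strings falls back to object.

```python
import pandas as pd

# Hypothetical frames chosen to show the common-dtype lookup.
floats = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
int_and_float = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
mixed = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})

print(floats.to_numpy().dtype)         # float64
print(int_and_float.to_numpy().dtype)  # float64 (ints are upcast)
print(mixed.to_numpy().dtype)          # object (no common numeric dtype)
```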
describe() shows a quick statistical summary of your data:
In [20]: df.describe()
Out[20]:
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804
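To make the rows of the summary easier to interpret, here is a small sketch on a hypothetical four-value column whose statistics can be checked by hand. Note that std is the sample standard deviation (divisor n - 1).

```python
import pandas as pd

# Hypothetical data: 1, 2, 3, 4 so the statistics are easy to verify.
frame = pd.DataFrame({"v": [1.0, 2.0, 3.0, 4.0]})
summary = frame.describe()

print(summary.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(summary.loc["mean", "v"])  # 2.5
print(summary.loc["50%", "v"])   # 2.5 (the median)
```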
Transposing your data:
In [21]: df.T
Out[21]:
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988
DataFrame.sort_index() sorts by an axis:
In [22]: df.sort_index(axis=1, ascending=False)
Out[22]:
                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
2013-01-04  0.271860 -1.039575 -0.706771  0.721555
2013-01-05 -1.087401  0.276232  0.567020 -0.424972
2013-01-06  0.524988 -1.478427  0.113648 -0.673690
DataFrame.sort_values() sorts by values:
In [23]: df.sort_values(by="B")
Out[23]:
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
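The two sorting calls differ in what they order: sort_index() reorders by labels along an axis, while sort_values() reorders rows by the data in a column. A minimal sketch on a hypothetical three-row frame shows both, and that each returns a new object rather than modifying the original.

```python
import pandas as pd

# Hypothetical frame with string row labels for readability.
frame = pd.DataFrame({"A": [3, 1, 2], "B": [0.5, -1.0, 2.0]}, index=["x", "y", "z"])

by_columns = frame.sort_index(axis=1, ascending=False)  # order column labels Z..A
by_b = frame.sort_values(by="B")                        # order rows by column B

print(list(by_columns.columns))  # ['B', 'A']
print(list(by_b.index))          # ['y', 'x', 'z']
print(list(frame.columns))       # ['A', 'B'] (original is unchanged)
```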
Selection
Note
While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods DataFrame.at(), DataFrame.iat(), DataFrame.loc() and DataFrame.iloc().
See the indexing documentation Indexing and Selecting Data and MultiIndex / Advanced Indexing.
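The four accessors named in the note are used with square brackets. The sketch below, on a hypothetical three-row frame, shows one call to each: at and loc select by label, iat and iloc by integer position, and at/iat are limited to single scalar values.

```python
import pandas as pd

dates = pd.date_range("20130101", periods=3)
# Hypothetical values chosen so each lookup is easy to verify.
frame = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=dates)

print(frame.at[dates[0], "A"])       # 1.0  (scalar by label)
print(frame.iat[1, 1])               # 5.0  (scalar by position)
print(frame.loc[dates[1], "B"])      # 5.0  (label-based selection)
print(frame.iloc[0:2, 0].tolist())   # [1.0, 2.0]  (position-based slice)
```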
Getitem ([])
For a DataFrame, passing a single label selects a column and yields a Series equivalent to df.A:
In [24]: df["A"]
Out[24]:
2013-01-01    0.469112
2013-01-02    1.212112
2013-01-03   -0.861849
2013-01-04    0.721555
2013-01-05   -0.424972
2013-01-06   -0.673690
Freq: D, Name: A, dtype: float64
For a DataFrame, passing a slice : selects matching rows:
In [25]: df[0:3]
Out[25]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
In [26]: df["20130102":"20130104"]
Out[26]:
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
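The two slices above follow different endpoint rules, which is easy to miss in the output: an integer slice is positional and excludes the stop position, while a label slice on the datetime index includes both endpoints. A minimal sketch with a hypothetical single-column frame:

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Hypothetical frame; only the index matters for this illustration.
frame = pd.DataFrame({"A": range(6)}, index=dates)

by_position = frame[0:3]                    # rows 0, 1, 2 (stop excluded)
by_label = frame["20130102":"20130104"]     # Jan 2 through Jan 4 (stop included)

print(by_position.index[-1])  # 2013-01-03 00:00:00
print(by_label.index[-1])     # 2013-01-04 00:00:00
print(len(by_position), len(by_label))  # 3 3
```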
- Missing data
- Operations
- Merge
- Grouping
- Reshaping
- Time series
- Categoricals
- Plotting
- Importing and exporting data
- Gotchas
- Intro to data structures
- Series
- DataFrame
- Essential basic functionality
- Head and tail
- Attributes and underlying data
- Accelerated operations
- Flexible binary operations
- Descriptive statistics
- Function application
- Reindexing and altering labels
- Iteration
- .dt accessor
- Vectorized string methods
- Sorting
- Copying
- dtypes
- Selecting columns based on dtype
- IO tools (text, CSV, HDF5, …)
- CSV & text files
- JSON
- HTML
- LaTeX
- XML
- Excel files
- OpenDocument Spreadsheets
- Binary Excel (.xlsb) files
- Clipboard
- Pickling
- msgpack
- HDF5 (PyTables)
- Feather
- Parquet
- ORC
- SQL queries
- Google BigQuery
- Stata format
- SAS formats
- SPSS formats
- Other file formats
- Performance considerations
- PyArrow Functionality
- Data Structure Integration
- Operations
- I/O Reading
- Indexing and selecting data
- Different choices for indexing
- Basics
- Attribute access
- Slicing ranges
- Selection by label
- Selection by position
- Selection by callable
- Combining positional and label-based indexing
- Selecting random samples
- Setting with enlargement
- Fast scalar value getting and setting
- Boolean indexing
- Indexing with isin
- The where() Method and Masking
- Setting with enlargement conditionally using numpy()
- The query() Method
- Duplicate data
- Dictionary-like get() method
- Looking up values by index/column labels
- Index objects
- Set / reset index
- Returning a view versus a copy
- MultiIndex / advanced indexing
- Hierarchical indexing (MultiIndex)
- Advanced indexing with hierarchical index
- Sorting a MultiIndex
- Take methods
- Index types
- Miscellaneous indexing FAQ
- Copy-on-Write (CoW)
- Previous behavior
- Description
- Chained Assignment
- Read-only NumPy arrays
- Patterns to avoid
- Copy-on-Write optimizations
- How to enable CoW
- Merge, join, concatenate and compare
- Concatenating objects
- Database-style DataFrame or named Series joining/merging
- Timeseries friendly merging
- Comparing objects
- Reshaping and pivot tables
- pivot() and pivot_table()
- stack() and unstack()
- melt() and wide_to_long()
- get_dummies() and from_dummies()
- explode()
- crosstab()
- cut()
- factorize()
- Working with text data
- Text data types
- String methods
- Splitting and replacing strings
- Concatenation
- Indexing with .str
- Extracting substrings
- Testing for strings that match or contain a pattern
- Creating indicator variables
- Method summary
- Working with missing data
- Values considered “missing”
- Inserting missing data
- Calculations with missing data
- Sum/prod of empties/nans
- NA values in GroupBy
- Filling missing values: fillna
- Filling with a PandasObject
- Dropping axis labels with missing data: dropna
- Interpolation
- Replacing generic values
- String/regular expression replacement
- Numeric replacement
- Experimental NA scalar to denote missing values
- Duplicate Labels
- Consequences of Duplicate Labels
- Duplicate Label Detection
- Disallowing Duplicate Labels
- Categorical data
- Object creation
- CategoricalDtype
- Description
- Working with categories
- Sorting and order
- Comparisons
- Operations
- Data munging
- Getting data in/out
- Missing data
- Differences to R’s factor
- Gotchas
- Nullable integer data type
- Construction
- Operations
- Scalar NA Value
- Nullable Boolean data type
- Indexing with NA values
- Kleene logical operations
- Chart visualization
- Basic plotting: plot
- Other plots
- Plotting with missing data
- Plotting tools
- Plot formatting
- Plotting directly with Matplotlib
- Plotting backends
- Table Visualization
- Styler Object and Customising the Display
- Formatting the Display
- Styler Object and HTML
- Methods to Add Styles
- Table Styles
- Setting Classes and Linking to External CSS
- Styler Functions
- Tooltips and Captions
- Finer Control with Slicing
- Optimization
- Builtin Styles
- Sharing styles
- Limitations
- Other Fun and Useful Stuff
- Export to Excel
- Export to LaTeX
- More About CSS and HTML
- Extensibility
- Group by: split-apply-combine
- Splitting an object into groups
- Iterating through groups
- Selecting a group
- Aggregation
- Transformation
- Filtration
- Flexible apply
- Numba Accelerated Routines
- Other useful features
- Examples
- Windowing operations
- Overview
- Rolling window
- Weighted window
- Expanding window
- Exponentially weighted window
- Time series / date functionality
- Overview
- Timestamps vs. time spans
- Converting to timestamps
- Generating ranges of timestamps
- Timestamp limitations
- Indexing
- Time/date components
- DateOffset objects
- Time Series-related instance methods
- Resampling
- Time span representation
- Converting between representations
- Representing out-of-bounds spans
- Time zone handling
- Time deltas
- Parsing
- Operations
- Reductions
- Frequency conversion
- Attributes
- TimedeltaIndex
- Resampling
- Options and settings
- Overview
- Available options
- Getting and setting options
- Setting startup options in Python/IPython environment
- Frequently used options
- Number formatting
- Unicode formatting
- Table schema display
- Enhancing performance
- Cython (writing C extensions for pandas)
- Numba (JIT compilation)
- Expression evaluation via eval()
- Scaling to large datasets
- Load less data
- Use efficient datatypes
- Use chunking
- Use Dask
- Sparse data structures
- SparseArray
- SparseDtype
- Sparse accessor
- Sparse calculation
- Interaction with scipy.sparse
- Frequently Asked Questions (FAQ)
- DataFrame memory usage
- Using if/truth statements with pandas
- Mutating with User Defined Function (UDF) methods
- Missing value representation for NumPy types
- Differences with NumPy
- Thread-safety
- Byte-ordering issues
- Cookbook
- Idioms
- Selection
- Multiindexing
- Missing data
- Grouping
- Timeseries
- Merge
- Plotting
- Data in/out
- Computation
- Timedeltas
- Creating example data
- Constant series