import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

DataFrameTransformer
DataFrameTransformer (transformer=None, input_cols=None, output_cols=None, prev_step=None, append=False, print_input_cols=False, print_output_cols=False, print_out_df_cols=False)
Applies a transformer to a set of columns of a pandas DataFrame and returns a DataFrame.
X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})
X

| | city | title | expert_rating | user_rating |
|---|---|---|---|---|
| 0 | London | His Last Bow | 5 | 4 |
| 1 | London | How Watson Learned the Trick | 3 | 5 |
| 2 | Paris | A Moveable Feast | 4 | 4 |
| 3 | Sallisaw | The Grapes of Wrath | 5 | 3 |
OneHotEncoder expects a two-dimensional array as input, so we set input_cols to a list of columns. DataFrameTransformer uses the
enc_city = DataFrameTransformer(transformer=OneHotEncoder(dtype='int'),
                                input_cols=['city'],
                                append=True)
enc_city.fit_transform(X)

| | city | title | expert_rating | user_rating | city_London | city_Paris | city_Sallisaw |
|---|---|---|---|---|---|---|---|
| 0 | London | His Last Bow | 5 | 4 | 1 | 0 | 0 |
| 1 | London | How Watson Learned the Trick | 3 | 5 | 1 | 0 | 0 |
| 2 | Paris | A Moveable Feast | 4 | 4 | 0 | 1 | 0 |
| 3 | Sallisaw | The Grapes of Wrath | 5 | 3 | 0 | 0 | 1 |
CountVectorizer expects a one-dimensional array of strings as input, so we set input_cols to a single string, which retrieves a one-dimensional array from the input DataFrame.
enc_title = DataFrameTransformer(transformer=CountVectorizer(), input_cols='title', append=True)
enc_title.fit_transform(X)

| | city | title | expert_rating | user_rating | bow | feast | grapes | his | how | last | learned | moveable | of | the | trick | watson | wrath |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | London | His Last Bow | 5 | 4 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | London | How Watson Learned the Trick | 3 | 5 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
| 2 | Paris | A Moveable Feast | 4 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | Sallisaw | The Grapes of Wrath | 5 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
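We can chain these two transformers into a single Pipeline. Because append=True keeps the original columns in each step's output, enc_title can still find the title column after enc_city has run.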
pipe = Pipeline([('enc_city', enc_city), ('enc_title', enc_title)])
pipe.fit_transform(X)

| | city | title | expert_rating | user_rating | city_London | city_Paris | city_Sallisaw | bow | feast | grapes | his | how | last | learned | moveable | of | the | trick | watson | wrath |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | London | His Last Bow | 5 | 4 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | London | How Watson Learned the Trick | 3 | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
| 2 | Paris | A Moveable Feast | 4 | 4 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | Sallisaw | The Grapes of Wrath | 5 | 3 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |