Practical 4: Feature Selection¶


Text Mining, Transforming Text into Knowledge (202400006)¶

In this practical, we are going to learn about feature selection methods for text data.

We will use the following libraries, mainly from sklearn. Take care to have them installed!

In [1]:
from sklearn.datasets import load_files
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import mutual_info_classif
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt

# for reproducibility
random_state = 321

Let's get started!¶

1. Here we are going to use a news article data set originating from the BBC News website, provided for benchmarking machine learning algorithms. The BBC data set consists of 2,225 documents in 5 categories: business, entertainment, politics, sport, and tech. Download the bbc-fulltext.zip file and extract it into your data folder. Then use the code below to convert the resulting object to a dataframe.

In [2]:
# Load the dataset, handling encoding errors gracefully
data = load_files('data/bbc-fulltext/bbc', encoding='utf-8', decode_error='replace')

# Convert the data into a pandas DataFrame
df = pd.DataFrame(list(zip(data['data'], data['target'])), columns=['text', 'label'])

# Display the first few rows
print(df.head())
                                                text  label
0  Tate & Lyle boss bags top award\n\nTate & Lyle...      0
1  Halo 2 sells five million copies\n\nMicrosoft ...      4
2  MSPs hear renewed climate warning\n\nClimate c...      2
3  Pavey focuses on indoor success\n\nJo Pavey wi...      3
4  Tories reject rethink on axed MP\n\nSacked MP ...      2

2. Print the unique target names in your data and check the number of articles in each category. Then split your data into training (80%) and test (20%) sets.

In [3]:
labels, counts = np.unique(df['label'], return_counts=True) # np.unique(data.target, return_counts=True)
In [4]:
print(dict(zip(data.target_names, counts)))
{'business': np.int64(510), 'entertainment': np.int64(386), 'politics': np.int64(417), 'sport': np.int64(511), 'tech': np.int64(401)}
In [5]:
X_train, X_test, y_train, y_test = train_test_split(df["text"], df["label"], test_size=0.2, random_state=random_state)
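One refinement worth considering (not used above): passing stratify to train_test_split keeps the category proportions roughly equal in both splits, which matters with mildly imbalanced classes like these. A minimal sketch on toy data:

```python
# Sketch: stratified splitting preserves class proportions in train and test.
# The toy DataFrame here stands in for the BBC df.
import pandas as pd
from sklearn.model_selection import train_test_split

df_demo = pd.DataFrame({
    "text": list("abcdefghij"),
    "label": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
Xtr, Xte, ytr, yte = train_test_split(
    df_demo["text"], df_demo["label"],
    test_size=0.2, random_state=321, stratify=df_demo["label"])

print(sorted(yte.tolist()))  # one test example from each class: [0, 1]
```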

3. Use the CountVectorizer from sklearn to convert the text data into a document-term matrix. What is the difference between CountVectorizer and TfidfVectorizer(use_idf=False)?

In [6]:
# tokenizer to remove unwanted elements, such as symbols, from our data
token = RegexpTokenizer(r'[a-zA-Z0-9]+')

# Initialize the "CountVectorizer" object, which is scikit-learn's bag of words tool.
# If you have memory issues, reduce the max_features value so you can continue with the practical
vectorizer = CountVectorizer(lowercase=True,
                             tokenizer=token.tokenize,
                             stop_words='english',
                             ngram_range=(1, 2),
                             analyzer='word',
                             min_df=3,
                             max_features=None)

# fit_transform() does two functions: First, it fits the model and learns the vocabulary;
# second, it transforms our data into feature vectors.
# The input to fit_transform should be a list of strings.
bbc_dtm = vectorizer.fit_transform(X_train)
print(bbc_dtm.shape)
(1780, 25416)

The visible difference is that TfidfVectorizer(use_idf=False) returns floats while CountVectorizer() returns integer counts. The reason: with use_idf=False no idf reweighting is applied, but TfidfVectorizer still L2-normalizes each document vector by default (norm='l2'), so its values are scaled term frequencies rather than raw counts. If you also set norm=None, the two produce identical matrices.
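A tiny demonstration of this point on two toy documents:

```python
# Contrast raw counts with use_idf=False term-frequency vectors.
# TfidfVectorizer(use_idf=False) skips idf weighting but still
# L2-normalizes each row by default, hence the float values.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["cat cat dog", "dog bird"]
counts = CountVectorizer().fit_transform(docs)
tf = TfidfVectorizer(use_idf=False).fit_transform(docs)

# vocabulary is sorted alphabetically: bird, cat, dog
print(counts.toarray())        # [[0 2 1], [1 0 1]]
print(tf.toarray().round(2))   # each row divided by its L2 norm
```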

4. Print the top 20 most frequent words in the training set.

In [7]:
importance = np.argsort(np.asarray(bbc_dtm.sum(axis=0)).ravel())[::-1]
feature_names = np.array(vectorizer.get_feature_names_out())
feature_names[importance[:20]]
Out[7]:
array(['s', 'said', 'mr', 'year', 'people', 'new', 'time', 't', 'world',
       'government', 'uk', 'years', 'best', 'just', 'make', 'told',
       'game', 'like', '1', 'film'], dtype=object)
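Notice the bare 's' and 't' near the top of the list. They are an artifact of the [a-zA-Z0-9]+ tokenizer, which splits contractions and possessives at the apostrophe, leaving their tails as standalone tokens:

```python
# Why bare 's' and 't' rank so high: the alphanumeric regex splits
# contractions like "it's" and "don't" at the apostrophe.
from nltk.tokenize import RegexpTokenizer

token = RegexpTokenizer(r'[a-zA-Z0-9]+')
print(token.tokenize("it's the firm's plan, don't wait"))
# ['it', 's', 'the', 'firm', 's', 'plan', 'don', 't', 'wait']
```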

Filter-based feature selection¶

5. From the feature selection library in sklearn, load the SelectKBest function and apply it to the BBC dataset using the chi-squared method. Extract the top 20 features.

In [8]:
X_test_vectorized = vectorizer.transform(X_test)
In [9]:
ch2 = SelectKBest(chi2, k=20)
ch2.fit_transform(bbc_dtm, y_train)
Out[9]:
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 4402 stored elements and shape (1780, 20)>
In [10]:
feature_names_chi = [feature_names[i] for i
                         in ch2.get_support(indices=True)]
In [11]:
feature_names_chi
Out[11]:
['best',
 'blair',
 'computer',
 'digital',
 'election',
 'film',
 'government',
 'labour',
 'minister',
 'mobile',
 'mr',
 'mr blair',
 'music',
 'net',
 'online',
 'party',
 'people',
 'software',
 'technology',
 'users']

6. Repeat the analysis in Question 5 with the mutual information feature selection method. Do you get the same list of words as compared to the chi-squared method?

In [12]:
mutual_info = SelectKBest(mutual_info_classif, k=20)
mutual_info.fit_transform(bbc_dtm, y_train)
Out[12]:
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 6112 stored elements and shape (1780, 20)>
In [13]:
feature_names_mutual_info = [feature_names[i] for i
                         in mutual_info.get_support(indices=True)]
feature_names_mutual_info
Out[13]:
['blair',
 'computer',
 'election',
 'film',
 'game',
 'government',
 'labour',
 'market',
 'minister',
 'mr',
 'music',
 'party',
 'people',
 'said',
 'secretary',
 'software',
 'technology',
 'tory',
 'users',
 'win']
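The two lists overlap substantially but are not identical. A quick way to see exactly where they agree is plain set arithmetic on the word lists (the stand-in sets below are abbreviated so the snippet runs on its own; in the notebook you would use feature_names_chi and feature_names_mutual_info):

```python
# Compare two feature selections with set operations.
# Abbreviated stand-ins for feature_names_chi / feature_names_mutual_info.
chi_words = {'best', 'blair', 'computer', 'film', 'mr'}
mi_words = {'blair', 'computer', 'film', 'game', 'said'}

print(sorted(chi_words & mi_words))   # selected by both methods
print(sorted(chi_words - mi_words))   # chi-squared only
print(sorted(mi_words - chi_words))   # mutual information only
```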

Now you can build a classifier and train it using the output of these feature selection techniques. We are not going to do this right now, but if you are interested, you can transform your training and test sets using the selected features and continue with your classifier. Here are some tips:

In [14]:
# X_train_selected = mutual_info.transform(bbc_dtm)            # the selector was already fitted above
# X_test_selected = mutual_info.transform(X_test_vectorized)   # reuse the same fitted selector
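A minimal, self-contained sketch of that tip, with toy count matrices standing in for the BBC ones: fit the selector on the training matrix only, apply the same selection to the test matrix, then train a classifier on the reduced features.

```python
# Sketch: feature selection fitted on train data only, then a classifier.
# Random toy data replaces bbc_dtm / X_test_vectorized here.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

rng = np.random.default_rng(321)
X_tr = rng.integers(0, 5, size=(40, 10))
y_tr = (X_tr[:, 0] > 2).astype(int)        # class depends on feature 0
X_te = rng.integers(0, 5, size=(10, 10))
y_te = (X_te[:, 0] > 2).astype(int)

selector = SelectKBest(chi2, k=3)
X_tr_sel = selector.fit_transform(X_tr, y_tr)   # fit on training data only
X_te_sel = selector.transform(X_te)             # same selection on test data

clf = MultinomialNB().fit(X_tr_sel, y_tr)
print(metrics.accuracy_score(y_te, clf.predict(X_te_sel)))
```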

Embedded feature selection¶

7. One of the functions for embedded feature selection is the SelectFromModel function in sklearn. Use this function with an L1-penalized SVM and check how many non-zero coefficients are left in the model.

In [15]:
print("shape of the matrix before applying the embedded feature selection:", bbc_dtm.shape)

lsvc = LinearSVC(C=0.01, penalty="l1", dual=False)
model = SelectFromModel(lsvc).fit(bbc_dtm, y_train) # you can add threshold=0.18 as another argument to select features that have an importance of more than 0.18
X_new = model.transform(bbc_dtm)
print("shape of the matrix after applying the embedded feature selection:", X_new.shape)
shape of the matrix before applying the embedded feature selection: (1780, 25416)
shape of the matrix after applying the embedded feature selection: (1780, 151)
In [16]:
model
Out[16]:
SelectFromModel(estimator=LinearSVC(C=0.01, dual=False, penalty='l1'))
In [17]:
# you can also check the coefficient values
model.estimator_.coef_
Out[17]:
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], shape=(5, 25416))

8. What are the top features according to the SVM model? Tip: Use the function model.get_support() to find these features.

In [18]:
model.get_support()
Out[18]:
array([False, False, False, ..., False, False, False], shape=(25416,))
In [19]:
print("Features selected by SelectFromModel: ", feature_names[model.get_support()])
Features selected by SelectFromModel:  ['000' '1' '2' '2004' '6' 'actor' 'airline' 'airlines' 'album' 'analysts'
 'arsenal' 'athens' 'award' 'band' 'bank' 'bbc' 'best' 'blair' 'book'
 'britain' 'broadband' 'brown' 'business' 'champion' 'chart' 'chelsea'
 'chief' 'children' 'china' 'club' 'coach' 'comedy' 'companies' 'company'
 'computer' 'content' 'council' 'countries' 'cup' 'data' 'deal' 'deficit'
 'digital' 'dollar' 'doping' 'e' 'economic' 'economy' 'election' 'england'
 'european' 'euros' 'film' 'final' 'financial' 'firm' 'firms' 'game'
 'games' 'gaming' 'government' 'growth' 'health' 'high' 'home' 'iaaf'
 'information' 'injury' 'internet' 'jones' 'just' 'labour' 'like'
 'liverpool' 'lord' 'm' 'make' 'market' 'match' 'microsoft' 'million'
 'minister' 'mobile' 'months' 'mps' 'mr' 'music' 'musical' 'net' 'new'
 'number' 'o' 'oil' 'old' 'olympic' 'online' 'open' 'oscar' 'party'
 'people' 'plans' 'play' 'player' 'players' 'police' 'president' 'prices'
 'public' 'race' 'rights' 'rise' 'rugby' 's' 'said' 'said mr' 'sales'
 'say' 'says' 'season' 'secretary' 'series' 'service' 'set' 'shares'
 'singer' 'site' 'software' 'sony' 'star' 'state' 't' 'team' 'technology'
 'time' 'trade' 'tv' 'uk' 'united' 'use' 'users' 'using' 'video' 'virus'
 'wales' 'web' 'win' 'won' 'world' 'year' 'year old' 'years']
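get_support() pools the selected features across all classes. Because model.estimator_.coef_ has one row per class, you can also look at which words each individual class relies on by taking the largest positive weights per row. A sketch on a hypothetical toy coefficient matrix:

```python
# Sketch: reading per-class top features from an L1-SVM coefficient matrix.
# The names and weights below are hypothetical toy data, not BBC output.
import numpy as np

feature_names = np.array(['ball', 'bank', 'film', 'party', 'vote'])
coef = np.array([[0.0, 0.9, 0.0, 0.1, 0.0],     # e.g. the business row
                 [0.0, 0.0, 0.0, 0.8, 0.7]])    # e.g. the politics row

tops = []
for class_idx, row in enumerate(coef):
    top = np.argsort(row)[::-1][:2]             # indices of the 2 largest weights
    tops.append(feature_names[top].tolist())
    print(class_idx, tops[-1])
```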

Model comparison¶

9. Create a pipeline with the tf-idf representation and a random forest classifier.

In [20]:
clf1 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('feature_extraction', TfidfTransformer()),
    ('classification', RandomForestClassifier())
])

10. Fit the pipeline on the training set.

In [21]:
clf1.fit(X_train, y_train)
Out[21]:
Pipeline(steps=[('vectorizer', CountVectorizer()),
                ('feature_extraction', TfidfTransformer()),
                ('classification', RandomForestClassifier())])
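The natural next step (not required here) is to predict on the held-out set and report metrics, e.g. with metrics.classification_report(y_test, clf1.predict(X_test)). A self-contained sketch with toy texts standing in for the BBC splits:

```python
# Sketch: fit the same pipeline on toy texts and score held-out predictions.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics

train_texts = ["stocks fell today", "shares rose sharply",
               "the match ended in a draw", "the team won the cup"]
train_labels = [0, 0, 1, 1]          # 0 = business-like, 1 = sport-like

clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('feature_extraction', TfidfTransformer()),
    ('classification', RandomForestClassifier(random_state=321)),
]).fit(train_texts, train_labels)

preds = clf.predict(["shares fell", "the cup match"])
print(preds)                                     # one label per test document
print(metrics.accuracy_score([0, 1], preds))
```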
None

11. Use the pipeline to predict the outcome variable on your test set. Evaluate the performance of the pipeline using the classification_report function on the test subset. How do you interpret your results?

In [22]:
y_pred1 = clf1.predict(X_test)
print(metrics.classification_report(y_test, y_pred1, target_names=data.target_names))
               precision    recall  f1-score   support

     business       0.93      0.99      0.96       103
entertainment       1.00      0.94      0.97        85
     politics       0.99      0.95      0.97        86
        sport       0.96      1.00      0.98        92
         tech       0.97      0.94      0.95        79

     accuracy                           0.97       445
    macro avg       0.97      0.96      0.97       445
 weighted avg       0.97      0.97      0.97       445
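In the report above, macro avg is the unweighted mean of the per-class scores, while weighted avg weights each class by its support (the number of test documents per class). A minimal sketch on made-up labels (not the BBC data) confirms the two averaging rules:

```python
import numpy as np
from sklearn.metrics import f1_score

# made-up labels, purely to illustrate the averaging rules
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

per_class = f1_score(y_true, y_pred, average=None)   # [0.75, 0.5]
macro = f1_score(y_true, y_pred, average="macro")
weighted = f1_score(y_true, y_pred, average="weighted")

# macro: unweighted mean of the per-class F1 scores
assert np.isclose(macro, per_class.mean())
# weighted: per-class F1 averaged with class supports as weights
support = np.bincount(y_true)
assert np.isclose(weighted, np.average(per_class, weights=support))
print(macro, weighted)
```

Because the BBC classes are fairly balanced (79 to 103 documents each), the macro and weighted averages in the report barely differ; on a skewed test set they can diverge noticeably.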

12. Create your second pipeline with the tfidf representation and a random forest classifier, this time adding embedded feature selection via SelectFromModel with an L1-penalized linear SVM. Fit the pipeline on your training set and evaluate it on the test set. How does the performance change?

In [23]:
clf2 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('feature_extraction', TfidfTransformer()),
    ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    ('classification', RandomForestClassifier())
])
In [24]:
clf2.fit(X_train, y_train)
Out[24]:
Pipeline(steps=[('vectorizer', CountVectorizer()),
                ('feature_extraction', TfidfTransformer()),
                ('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'))),
                ('classification', RandomForestClassifier())])
In [25]:
y_pred2 = clf2.predict(X_test)
In [26]:
print(metrics.classification_report(y_test, y_pred2, target_names=data.target_names))
               precision    recall  f1-score   support

     business       0.93      0.96      0.94       103
entertainment       1.00      0.93      0.96        85
     politics       0.96      0.94      0.95        86
        sport       0.95      1.00      0.97        92
         tech       0.95      0.94      0.94        79

     accuracy                           0.96       445
    macro avg       0.96      0.95      0.96       445
 weighted avg       0.96      0.96      0.96       445

13. Create your third and fourth pipelines with the tfidf representation, a chi2 feature selection (with 20 and 200 features for clf3 and clf4, respectively), and a random forest classifier.

In [27]:
clf3 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('feature_extraction', TfidfTransformer()),
    ('feature_selection', SelectKBest(chi2, k=20)),
    ('classification', RandomForestClassifier())
])
In [28]:
clf3.fit(X_train, y_train)
Out[28]:
Pipeline(steps=[('vectorizer', CountVectorizer()),
                ('feature_extraction', TfidfTransformer()),
                ('feature_selection',
                 SelectKBest(k=20, score_func=<function chi2 at 0x1182a7f60>)),
                ('classification', RandomForestClassifier())])
such that min_n <= n <= max_n will be used. For example an
``ngram_range`` of ``(1, 1)`` means only unigrams, ``(1, 2)`` means
unigrams and bigrams, and ``(2, 2)`` means only bigrams.
Only applies if ``analyzer`` is not callable.
(1, ...)
analyzer analyzer: {'word', 'char', 'char_wb'} or callable, default='word'

Whether the feature should be made of word n-gram or character
n-grams.
Option 'char_wb' creates character n-grams only from text inside
word boundaries; n-grams at the edges of words are padded with space.

If a callable is passed it is used to extract the sequence of features
out of the raw, unprocessed input.

.. versionchanged:: 0.21

Since v0.21, if ``input`` is ``filename`` or ``file``, the data is
first read from the file and then passed to the given callable
analyzer.
'word'
max_df max_df: float in range [0.0, 1.0] or int, default=1.0

When building the vocabulary ignore terms that have a document
frequency strictly higher than the given threshold (corpus-specific
stop words).
If float, the parameter represents a proportion of documents, integer
absolute counts.
This parameter is ignored if vocabulary is not None.
1.0
min_df min_df: float in range [0.0, 1.0] or int, default=1

When building the vocabulary ignore terms that have a document
frequency strictly lower than the given threshold. This value is also
called cut-off in the literature.
If float, the parameter represents a proportion of documents, integer
absolute counts.
This parameter is ignored if vocabulary is not None.
1
max_features max_features: int, default=None

If not None, build a vocabulary that only consider the top
`max_features` ordered by term frequency across the corpus.
Otherwise, all features are used.

This parameter is ignored if vocabulary is not None.
None
vocabulary vocabulary: Mapping or iterable, default=None

Either a Mapping (e.g., a dict) where keys are terms and values are
indices in the feature matrix, or an iterable over terms. If not
given, a vocabulary is determined from the input documents. Indices
in the mapping should not be repeated and should not have any gap
between 0 and the largest index.
None
binary binary: bool, default=False

If True, all non zero counts are set to 1. This is useful for discrete
probabilistic models that model binary events rather than integer
counts.
False
dtype dtype: dtype, default=np.int64

Type of the matrix returned by fit_transform() or transform().
<class 'numpy.int64'>
Parameters
norm norm: {'l1', 'l2'} or None, default='l2'

Each output row will have unit norm, either:

- 'l2': Sum of squares of vector elements is 1. The cosine
similarity between two vectors is their dot product when l2 norm has
been applied.
- 'l1': Sum of absolute values of vector elements is 1.
See :func:`~sklearn.preprocessing.normalize`.
- None: No normalization.
'l2'
use_idf use_idf: bool, default=True

Enable inverse-document-frequency reweighting. If False, idf(t) = 1.
True
smooth_idf smooth_idf: bool, default=True

Smooth idf weights by adding one to document frequencies, as if an
extra document was seen containing every term in the collection
exactly once. Prevents zero divisions.
True
sublinear_tf sublinear_tf: bool, default=False

Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
False
Parameters
score_func score_func: callable, default=f_classif

Function taking two arrays X and y, and returning a pair of arrays
(scores, pvalues) or a single array with scores.
Default is f_classif (see below "See Also"). The default function only
works with classification tasks.

.. versionadded:: 0.18
<function chi2 at 0x1182a7f60>
k k: int or "all", default=10

Number of top features to select.
The "all" option bypasses selection, for use in a parameter search.
20
Parameters
n_estimators n_estimators: int, default=100

The number of trees in the forest.

.. versionchanged:: 0.22
The default value of ``n_estimators`` changed from 10 to 100
in 0.22.
100
criterion criterion: {"gini", "entropy", "log_loss"}, default="gini"

The function to measure the quality of a split. Supported criteria are
"gini" for the Gini impurity and "log_loss" and "entropy" both for the
Shannon information gain, see :ref:`tree_mathematical_formulation`.
Note: This parameter is tree-specific.
'gini'
max_depth max_depth: int, default=None

The maximum depth of the tree. If None, then nodes are expanded until
all leaves are pure or until all leaves contain less than
min_samples_split samples.
None
min_samples_split min_samples_split: int or float, default=2

The minimum number of samples required to split an internal node:

- If int, then consider `min_samples_split` as the minimum number.
- If float, then `min_samples_split` is a fraction and
`ceil(min_samples_split * n_samples)` are the minimum
number of samples for each split.

.. versionchanged:: 0.18
Added float values for fractions.
2
min_samples_leaf min_samples_leaf: int or float, default=1

The minimum number of samples required to be at a leaf node.
A split point at any depth will only be considered if it leaves at
least ``min_samples_leaf`` training samples in each of the left and
right branches. This may have the effect of smoothing the model,
especially in regression.

- If int, then consider `min_samples_leaf` as the minimum number.
- If float, then `min_samples_leaf` is a fraction and
`ceil(min_samples_leaf * n_samples)` are the minimum
number of samples for each node.

.. versionchanged:: 0.18
Added float values for fractions.
1
min_weight_fraction_leaf min_weight_fraction_leaf: float, default=0.0

The minimum weighted fraction of the sum total of weights (of all
the input samples) required to be at a leaf node. Samples have
equal weight when sample_weight is not provided.
0.0
max_features max_features: {"sqrt", "log2", None}, int or float, default="sqrt"

The number of features to consider when looking for the best split:

- If int, then consider `max_features` features at each split.
- If float, then `max_features` is a fraction and
`max(1, int(max_features * n_features_in_))` features are considered at each
split.
- If "sqrt", then `max_features=sqrt(n_features)`.
- If "log2", then `max_features=log2(n_features)`.
- If None, then `max_features=n_features`.

.. versionchanged:: 1.1
The default of `max_features` changed from `"auto"` to `"sqrt"`.

Note: the search for a split does not stop until at least one
valid partition of the node samples is found, even if it requires to
effectively inspect more than ``max_features`` features.
'sqrt'
max_leaf_nodes max_leaf_nodes: int, default=None

Grow trees with ``max_leaf_nodes`` in best-first fashion.
Best nodes are defined as relative reduction in impurity.
If None then unlimited number of leaf nodes.
None
min_impurity_decrease min_impurity_decrease: float, default=0.0

A node will be split if this split induces a decrease of the impurity
greater than or equal to this value.

The weighted impurity decrease equation is the following::

N_t / N * (impurity - N_t_R / N_t * right_impurity
- N_t_L / N_t * left_impurity)

where ``N`` is the total number of samples, ``N_t`` is the number of
samples at the current node, ``N_t_L`` is the number of samples in the
left child, and ``N_t_R`` is the number of samples in the right child.

``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum,
if ``sample_weight`` is passed.

.. versionadded:: 0.19
0.0
bootstrap bootstrap: bool, default=True

Whether bootstrap samples are used when building trees. If False, the
whole dataset is used to build each tree.
True
oob_score oob_score: bool or callable, default=False

Whether to use out-of-bag samples to estimate the generalization score.
By default, :func:`~sklearn.metrics.accuracy_score` is used.
Provide a callable with signature `metric(y_true, y_pred)` to use a
custom metric. Only available if `bootstrap=True`.

For an illustration of out-of-bag (OOB) error estimation, see the example
:ref:`sphx_glr_auto_examples_ensemble_plot_ensemble_oob.py`.
False
n_jobs n_jobs: int, default=None

The number of jobs to run in parallel. :meth:`fit`, :meth:`predict`,
:meth:`decision_path` and :meth:`apply` are all parallelized over the
trees. ``None`` means 1 unless in a :obj:`joblib.parallel_backend`
context. ``-1`` means using all processors. See :term:`Glossary
` for more details.
None
random_state random_state: int, RandomState instance or None, default=None

Controls both the randomness of the bootstrapping of the samples used
when building trees (if ``bootstrap=True``) and the sampling of the
features to consider when looking for the best split at each node
(if ``max_features < n_features``).
See :term:`Glossary ` for details.
None
verbose verbose: int, default=0

Controls the verbosity when fitting and predicting.
0
warm_start warm_start: bool, default=False

When set to ``True``, reuse the solution of the previous call to fit
and add more estimators to the ensemble, otherwise, just fit a whole
new forest. See :term:`Glossary ` and
:ref:`tree_ensemble_warm_start` for details.
False
class_weight class_weight: {"balanced", "balanced_subsample"}, dict or list of dicts, default=None

Weights associated with classes in the form ``{class_label: weight}``.
If not given, all classes are supposed to have weight one. For
multi-output problems, a list of dicts can be provided in the same
order as the columns of y.

Note that for multioutput (including multilabel) weights should be
defined for each class of every column in its own dict. For example,
for four-class multilabel classification weights should be
[{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of
[{1:1}, {2:5}, {3:1}, {4:1}].

The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data
as ``n_samples / (n_classes * np.bincount(y))``

The "balanced_subsample" mode is the same as "balanced" except that
weights are computed based on the bootstrap sample for every tree
grown.

For multi-output, the weights of each column of y will be multiplied.

Note that these weights will be multiplied with sample_weight (passed
through the fit method) if sample_weight is specified.
None
ccp_alpha ccp_alpha: non-negative float, default=0.0

Complexity parameter used for Minimal Cost-Complexity Pruning. The
subtree with the largest cost complexity that is smaller than
``ccp_alpha`` will be chosen. By default, no pruning is performed. See
:ref:`minimal_cost_complexity_pruning` for details. See
:ref:`sphx_glr_auto_examples_tree_plot_cost_complexity_pruning.py`
for an example of such pruning.

.. versionadded:: 0.22
0.0
max_samples max_samples: int or float, default=None

If bootstrap is True, the number of samples to draw from X
to train each base estimator.

- If None (default), then draw `X.shape[0]` samples.
- If int, then draw `max_samples` samples.
- If float, then draw `max(round(n_samples * max_samples), 1)` samples. Thus,
`max_samples` should be in the interval `(0.0, 1.0]`.

.. versionadded:: 0.22
None
monotonic_cst monotonic_cst: array-like of int of shape (n_features), default=None

Indicates the monotonicity constraint to enforce on each feature.
- 1: monotonic increase
- 0: no constraint
- -1: monotonic decrease

If monotonic_cst is None, no constraints are applied.

Monotonicity constraints are not supported for:
- multiclass classifications (i.e. when `n_classes > 2`),
- multioutput classifications (i.e. when `n_outputs_ > 1`),
- classifications trained on data with missing values.

The constraints hold over the probability of the positive class.

Read more in the :ref:`User Guide `.

.. versionadded:: 1.4
None
In [29]:
y_pred3 = clf3.predict(X_test)
print(metrics.classification_report(y_test, y_pred3, target_names=data.target_names))
               precision    recall  f1-score   support

     business       0.68      0.41      0.51       103
entertainment       0.92      0.78      0.84        85
     politics       0.79      0.78      0.78        86
        sport       0.61      1.00      0.76        92
         tech       0.89      0.85      0.87        79

     accuracy                           0.75       445
    macro avg       0.78      0.76      0.75       445
 weighted avg       0.77      0.75      0.74       445

In [30]:
clf4 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('feature_extraction', TfidfTransformer()),
    ('feature_selection', SelectKBest(chi2, k=200)),
    ('classification', RandomForestClassifier())
])
In [31]:
clf4.fit(X_train, y_train)
Out[31]:
Pipeline(steps=[('vectorizer', CountVectorizer()),
                ('feature_extraction', TfidfTransformer()),
                ('feature_selection',
                 SelectKBest(k=200, score_func=<function chi2 at 0x1182a7f60>)),
                ('classification', RandomForestClassifier())])
In [32]:
y_pred4 = clf4.predict(X_test)
print(metrics.classification_report(y_test, y_pred4, target_names=data.target_names))
               precision    recall  f1-score   support

     business       0.90      0.95      0.92       103
entertainment       0.99      0.96      0.98        85
     politics       0.99      0.92      0.95        86
        sport       0.93      0.99      0.96        92
         tech       0.97      0.92      0.95        79

     accuracy                           0.95       445
    macro avg       0.96      0.95      0.95       445
 weighted avg       0.95      0.95      0.95       445

14. We can change the learner simply by plugging a different classifier object into the pipeline. Create your fifth pipeline with an L1-regularized linear SVM (LinearSVC) as the feature selection method and naive Bayes as the classifier. Compare your results on the test set with those of the previous pipelines.

In [33]:
clf5 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('feature_extraction', TfidfTransformer()),
    ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    ('classification', MultinomialNB(alpha=0.01))
])
In [34]:
clf5.fit(X_train, y_train)
Out[34]:
Pipeline(steps=[('vectorizer', CountVectorizer()),
                ('feature_extraction', TfidfTransformer()),
                ('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'))),
                ('classification', MultinomialNB(alpha=0.01))])
In [35]:
y_pred5 = clf5.predict(X_test)
print(metrics.classification_report(y_test, y_pred5, target_names=data.target_names))
               precision    recall  f1-score   support

     business       0.97      0.97      0.97       103
entertainment       0.99      0.98      0.98        85
     politics       0.99      1.00      0.99        86
        sport       0.97      1.00      0.98        92
         tech       0.95      0.91      0.93        79

     accuracy                           0.97       445
    macro avg       0.97      0.97      0.97       445
 weighted avg       0.97      0.97      0.97       445