PMLB Classification Datasets

Loading classification datasets

First, load a trained agent and get the list of PMLB classification dataset names. PMLB offers well over a hundred classification datasets, so let's sample 10% of the list to demonstrate the agent's capabilities.

import random
import pmlb
from IPython.display import Markdown
from ostatslib.agents import PPOAgent

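# Fraction of the available classification datasets to analyze.
SAMPLE_FRACTION = 0.1
sample_size = int(len(pmlb.classification_dataset_names) * SAMPLE_FRACTION)
# Unseeded draw: the sampled names change on every run.
sampled_dataset_names = random.sample(pmlb.classification_dataset_names, sample_size)

# Path to the previously trained PPO model.
AGENT_FILE = '../trained_ppo_model.zip'
agent = PPOAgent(AGENT_FILE)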

Markdown(f'Sampled {sample_size} classification datasets: {", ".join(sampled_dataset_names)}.')

Sampled 16 classification datasets: mnist, clean2, agaricus_lepiota, satimage, parity5, corral, analcatdata_creditscore, lupus, schizo, australian, analcatdata_boxing2, balance_scale, biomed, GAMETES_Epistasis_3_Way_20atts_0.2H_EDM_1_1, mux6, hypothyroid.
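Because `random.sample` is called without a seed, rerunning the notebook will generally select a different set of datasets than the one listed above. The following is a minimal sketch of a reproducible draw, assuming a fixed seed is acceptable for the demonstration. The `dataset_names` list, the `SEED` value, and the `sample_names` helper are hypothetical stand-ins and are not part of PMLB or ostatslib.

```python
import random

# Hypothetical stand-in for pmlb.classification_dataset_names.
dataset_names = [f"dataset_{i}" for i in range(165)]

SAMPLE_FRACTION = 0.1
SEED = 42  # assumed value; any fixed integer gives a repeatable draw


def sample_names(names, fraction, seed):
    """Draw a reproducible sample of int(len(names) * fraction) names."""
    # A local generator keeps the global random state untouched.
    rng = random.Random(seed)
    sample_size = int(len(names) * fraction)
    return rng.sample(names, sample_size)


first = sample_names(dataset_names, SAMPLE_FRACTION, SEED)
second = sample_names(dataset_names, SAMPLE_FRACTION, SEED)
print(len(first), first == second)
```

Passing the real name list instead of the stand-in would make the sampled set stable across runs, as long as the list itself does not change between PMLB versions.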

Analyses

The next step is to fetch the data and analyze each selected dataset. PMLB provides a function to fetch data from its repository. The initial state must also specify which variable is the target.

%%capture
from ostatslib.states import State

results = []

for name in sampled_dataset_names:
    # Download the dataset, or reuse a copy cached by an earlier run.
    data = pmlb.fetch_data(name, local_cache_dir='.pmlb_cache/')
    # PMLB stores the label in a column named 'target'.
    initial_state = State()
    initial_state.set('response_variable_label', 'target')
    analysis = agent.analyze(data, initial_state)
    results.append({"name": name, "analysis": analysis})

Results

from IPython.display import display

for result in results:
    display(Markdown(f"### {result['name']}"))
    print(result['analysis'].summary())

mnist

Analysis executed at 2024-12-23 23:53:16.283468
Final status is Not Complete
Initial State known features:
response_variable_label           target
time_convertible_variable
response_unique_values_ratio      0.00014285714285714287
response_inferred_dtype           integer
is_response_discrete              1
is_response_positive_values_only  1
Steps:
  Order  Step                                Reward  State Change
-------  --------------------------------  --------  -----------------------------------------
      1  Is Response Positive Values Only       0.1
      2  Time Convertible Variable Search       0.1  time_convertible_variable
      3  Infer Response DType                   0.1  response_inferred_dtype  integer
      4  Is Response Discrete                   0.1  is_response_discrete  1
      5  Response Unique Values Ratio           0.1  response_unique_values_ratio  0.000142857
      6  Response Unique Values Ratio          -1
      7  Response Unique Values Ratio          -1
      8  Response Unique Values Ratio          -1
      9  Response Unique Values Ratio          -1
     10  Response Unique Values Ratio          -1
     11  Response Unique Values Ratio          -1
     12  Response Unique Values Ratio          -1
     13  Response Unique Values Ratio          -1
     14  Response Unique Values Ratio          -1
     15  Response Unique Values Ratio          -1
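The `response_unique_values_ratio` value reported above is consistent with the number of distinct target values divided by the number of rows: 10 digit classes over 70,000 rows gives roughly 0.000142857. The sketch below illustrates that reading of the feature. It is an assumption about what the feature measures, not the ostatslib implementation, and `unique_values_ratio` and the stand-in target lists are hypothetical.

```python
def unique_values_ratio(values):
    """Return the number of distinct values divided by the number of values."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    return len(set(values)) / len(values)


# Hypothetical stand-in for a target column with 10 classes over 70,000 rows.
mnist_like_target = [i % 10 for i in range(70_000)]
# Hypothetical stand-in for a binary target over 32 rows.
binary_target = [i % 2 for i in range(32)]

print(unique_values_ratio(mnist_like_target))
print(unique_values_ratio(binary_target))
```

Under this reading, a small ratio indicates few classes relative to the sample size, which is what one would expect of a classification target.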

clean2

Analysis executed at 2024-12-23 23:53:24.989608
Final status is Not Complete
Initial State known features:
response_variable_label           target
time_convertible_variable
response_unique_values_ratio      0.0003031221582297666
response_inferred_dtype           integer
is_response_discrete              1
is_response_positive_values_only  1
Steps:
  Order  Step                                Reward  State Change
-------  --------------------------------  --------  -----------------------------------------
      1  Is Response Positive Values Only       0.1
      2  Time Convertible Variable Search       0.1  time_convertible_variable
      3  Infer Response DType                   0.1  response_inferred_dtype  integer
      4  Is Response Discrete                   0.1  is_response_discrete  1
      5  Response Unique Values Ratio           0.1  response_unique_values_ratio  0.000303122
      6  Response Unique Values Ratio          -1
      7  Response Unique Values Ratio          -1
      8  Response Unique Values Ratio          -1
      9  Response Unique Values Ratio          -1
     10  Response Unique Values Ratio          -1
     11  Response Unique Values Ratio          -1
     12  Response Unique Values Ratio          -1
     13  Response Unique Values Ratio          -1
     14  Response Unique Values Ratio          -1
     15  Response Unique Values Ratio          -1

agaricus_lepiota

Analysis executed at 2024-12-23 23:53:26.109863
Final status is Not Complete
Initial State known features:
response_variable_label           target
time_convertible_variable
response_unique_values_ratio      0.00024554941682013506
response_inferred_dtype           integer
is_response_discrete              1
is_response_positive_values_only  1
Steps:
  Order  Step                                Reward  State Change
-------  --------------------------------  --------  -----------------------------------------
      1  Is Response Positive Values Only       0.1
      2  Time Convertible Variable Search       0.1  time_convertible_variable
      3  Infer Response DType                   0.1  response_inferred_dtype  integer
      4  Is Response Discrete                   0.1  is_response_discrete  1
      5  Response Unique Values Ratio           0.1  response_unique_values_ratio  0.000245549
      6  Response Unique Values Ratio          -1
      7  Response Unique Values Ratio          -1
      8  Response Unique Values Ratio          -1
      9  Response Unique Values Ratio          -1
     10  Response Unique Values Ratio          -1
     11  Response Unique Values Ratio          -1
     12  Response Unique Values Ratio          -1
     13  Response Unique Values Ratio          -1
     14  Response Unique Values Ratio          -1
     15  Response Unique Values Ratio          -1

satimage

Analysis executed at 2024-12-23 23:53:28.633382
Final status is Not Complete
Initial State known features:
response_variable_label           target
time_convertible_variable
response_unique_values_ratio      0.0009324009324009324
response_inferred_dtype           integer
is_response_discrete              1
is_response_positive_values_only  1
Steps:
  Order  Step                                Reward  State Change
-------  --------------------------------  --------  -----------------------------------------
      1  Is Response Positive Values Only       0.1
      2  Time Convertible Variable Search       0.1  time_convertible_variable
      3  Infer Response DType                   0.1  response_inferred_dtype  integer
      4  Is Response Discrete                   0.1  is_response_discrete  1
      5  Response Unique Values Ratio           0.1  response_unique_values_ratio  0.000932401
      6  Response Unique Values Ratio          -1
      7  Response Unique Values Ratio          -1
      8  Response Unique Values Ratio          -1
      9  Response Unique Values Ratio          -1
     10  Response Unique Values Ratio          -1
     11  Response Unique Values Ratio          -1
     12  Response Unique Values Ratio          -1
     13  Response Unique Values Ratio          -1
     14  Response Unique Values Ratio          -1
     15  Response Unique Values Ratio          -1

parity5

Analysis executed at 2024-12-23 23:53:29.586091
Final status is Not Complete
Initial State known features:
response_variable_label           target
time_convertible_variable
response_unique_values_ratio      0.0625
response_inferred_dtype           integer
is_response_discrete              1
is_response_positive_values_only  1
Steps:
  Order  Step                                Reward  State Change
-------  --------------------------------  --------  ------------------------------------
      1  Is Response Positive Values Only       0.1
      2  Time Convertible Variable Search       0.1  time_convertible_variable
      3  Infer Response DType                   0.1  response_inferred_dtype  integer
      4  Is Response Discrete                   0.1  is_response_discrete  1
      5  Response Unique Values Ratio           0.1  response_unique_values_ratio  0.0625
      6  Response Unique Values Ratio          -1
      7  Response Unique Values Ratio          -1
      8  Response Unique Values Ratio          -1
      9  Response Unique Values Ratio          -1
     10  Response Unique Values Ratio          -1
     11  Response Unique Values Ratio          -1
     12  Response Unique Values Ratio          -1
     13  Response Unique Values Ratio          -1
     14  Response Unique Values Ratio          -1
     15  Response Unique Values Ratio          -1

corral

Analysis executed at 2024-12-23 23:53:30.617077
Final status is Not Complete
Initial State known features:
response_variable_label           target
time_convertible_variable
response_unique_values_ratio      0.0125
response_inferred_dtype           integer
is_response_discrete              1
is_response_positive_values_only  1
Steps:
  Order  Step                                Reward  State Change
-------  --------------------------------  --------  ------------------------------------
      1  Is Response Positive Values Only       0.1
      2  Time Convertible Variable Search       0.1  time_convertible_variable
      3  Infer Response DType                   0.1  response_inferred_dtype  integer
      4  Is Response Discrete                   0.1  is_response_discrete  1
      5  Response Unique Values Ratio           0.1  response_unique_values_ratio  0.0125
      6  Response Unique Values Ratio          -1
      7  Response Unique Values Ratio          -1
      8  Response Unique Values Ratio          -1
      9  Response Unique Values Ratio          -1
     10  Response Unique Values Ratio          -1
     11  Response Unique Values Ratio          -1
     12  Response Unique Values Ratio          -1
     13  Response Unique Values Ratio          -1
     14  Response Unique Values Ratio          -1
     15  Response Unique Values Ratio          -1

analcatdata_creditscore

Analysis executed at 2024-12-23 23:53:33.525891
Final status is Not Complete
Initial State known features:
response_variable_label           target
time_convertible_variable
response_unique_values_ratio      0.02
response_inferred_dtype           integer
is_response_discrete              1
is_response_positive_values_only  1
Steps:
  Order  Step                                Reward  State Change
-------  --------------------------------  --------  ----------------------------------
      1  Is Response Positive Values Only       0.1
      2  Time Convertible Variable Search       0.1  time_convertible_variable
      3  Infer Response DType                   0.1  response_inferred_dtype  integer
      4  Is Response Discrete                   0.1  is_response_discrete  1
      5  Response Unique Values Ratio           0.1  response_unique_values_ratio  0.02
      6  Response Unique Values Ratio          -1
      7  Response Unique Values Ratio          -1
      8  Response Unique Values Ratio          -1
      9  Response Unique Values Ratio          -1
     10  Response Unique Values Ratio          -1
     11  Response Unique Values Ratio          -1
     12  Response Unique Values Ratio          -1
     13  Response Unique Values Ratio          -1
     14  Response Unique Values Ratio          -1
     15  Response Unique Values Ratio          -1

lupus

Analysis executed at 2024-12-23 23:53:36.827557
Final status is Not Complete
Initial State known features:
response_variable_label           target
time_convertible_variable
response_unique_values_ratio      0.022988505747126436
response_inferred_dtype           integer
is_response_discrete              1
is_response_positive_values_only  1
Steps:
  Order  Step                                Reward  State Change
-------  --------------------------------  --------  ---------------------------------------
      1  Is Response Positive Values Only       0.1
      2  Time Convertible Variable Search       0.1  time_convertible_variable
      3  Infer Response DType                   0.1  response_inferred_dtype  integer
      4  Is Response Discrete                   0.1  is_response_discrete  1
      5  Response Unique Values Ratio           0.1  response_unique_values_ratio  0.0229885
      6  Response Unique Values Ratio          -1
      7  Response Unique Values Ratio          -1
      8  Response Unique Values Ratio          -1
      9  Response Unique Values Ratio          -1
     10  Response Unique Values Ratio          -1
     11  Response Unique Values Ratio          -1
     12  Response Unique Values Ratio          -1
     13  Response Unique Values Ratio          -1
     14  Response Unique Values Ratio          -1
     15  Response Unique Values Ratio          -1

schizo

Analysis executed at 2024-12-23 23:53:38.080557
Final status is Not Complete
Initial State known features:
response_variable_label           target
time_convertible_variable
response_unique_values_ratio      0.008823529411764706
response_inferred_dtype           integer
is_response_discrete              1
is_response_positive_values_only  1
Steps:
  Order  Step                                Reward  State Change
-------  --------------------------------  --------  ----------------------------------------
      1  Is Response Positive Values Only       0.1
      2  Time Convertible Variable Search       0.1  time_convertible_variable
      3  Infer Response DType                   0.1  response_inferred_dtype  integer
      4  Is Response Discrete                   0.1  is_response_discrete  1
      5  Response Unique Values Ratio           0.1  response_unique_values_ratio  0.00882353
      6  Response Unique Values Ratio          -1
      7  Response Unique Values Ratio          -1
      8  Response Unique Values Ratio          -1
      9  Response Unique Values Ratio          -1
     10  Response Unique Values Ratio          -1
     11  Response Unique Values Ratio          -1
     12  Response Unique Values Ratio          -1
     13  Response Unique Values Ratio          -1
     14  Response Unique Values Ratio          -1
     15  Response Unique Values Ratio          -1

australian

Analysis executed at 2024-12-23 23:53:42.062402
Final status is Not Complete
Initial State known features:
response_variable_label           target
time_convertible_variable
response_unique_values_ratio      0.002898550724637681
response_inferred_dtype           integer
is_response_discrete              1
is_response_positive_values_only  1
Steps:
  Order  Step                                Reward  State Change
-------  --------------------------------  --------  ----------------------------------------
      1  Is Response Positive Values Only       0.1
      2  Time Convertible Variable Search       0.1  time_convertible_variable
      3  Infer Response DType                   0.1  response_inferred_dtype  integer
      4  Is Response Discrete                   0.1  is_response_discrete  1
      5  Response Unique Values Ratio           0.1  response_unique_values_ratio  0.00289855
      6  Response Unique Values Ratio          -1
      7  Response Unique Values Ratio          -1
      8  Response Unique Values Ratio          -1
      9  Response Unique Values Ratio          -1
     10  Response Unique Values Ratio          -1
     11  Response Unique Values Ratio          -1
     12  Response Unique Values Ratio          -1
     13  Response Unique Values Ratio          -1
     14  Response Unique Values Ratio          -1
     15  Response Unique Values Ratio          -1

analcatdata_boxing2

Analysis executed at 2024-12-23 23:53:44.237954
Final status is Not Complete
Initial State known features:
response_variable_label           target
time_convertible_variable
response_unique_values_ratio      0.015151515151515152
response_inferred_dtype           integer
is_response_discrete              1
is_response_positive_values_only  1
Steps:
  Order  Step                                Reward  State Change
-------  --------------------------------  --------  ---------------------------------------
      1  Is Response Positive Values Only       0.1
      2  Time Convertible Variable Search       0.1  time_convertible_variable
      3  Infer Response DType                   0.1  response_inferred_dtype  integer
      4  Is Response Discrete                   0.1  is_response_discrete  1
      5  Response Unique Values Ratio           0.1  response_unique_values_ratio  0.0151515
      6  Response Unique Values Ratio          -1
      7  Response Unique Values Ratio          -1
      8  Response Unique Values Ratio          -1
      9  Response Unique Values Ratio          -1
     10  Response Unique Values Ratio          -1
     11  Response Unique Values Ratio          -1
     12  Response Unique Values Ratio          -1
     13  Response Unique Values Ratio          -1
     14  Response Unique Values Ratio          -1
     15  Response Unique Values Ratio          -1

balance_scale

Analysis executed at 2024-12-23 23:53:45.504031
Final status is Not Complete
Initial State known features:
response_variable_label           target
time_convertible_variable
response_unique_values_ratio      0.0048
response_inferred_dtype           integer
is_response_discrete              1
is_response_positive_values_only  1
Steps:
  Order  Step                                Reward  State Change
-------  --------------------------------  --------  ------------------------------------
      1  Is Response Positive Values Only       0.1
      2  Time Convertible Variable Search       0.1  time_convertible_variable
      3  Infer Response DType                   0.1  response_inferred_dtype  integer
      4  Is Response Discrete                   0.1  is_response_discrete  1
      5  Response Unique Values Ratio           0.1  response_unique_values_ratio  0.0048
      6  Response Unique Values Ratio          -1
      7  Response Unique Values Ratio          -1
      8  Response Unique Values Ratio          -1
      9  Response Unique Values Ratio          -1
     10  Response Unique Values Ratio          -1
     11  Response Unique Values Ratio          -1
     12  Response Unique Values Ratio          -1
     13  Response Unique Values Ratio          -1
     14  Response Unique Values Ratio          -1
     15  Response Unique Values Ratio          -1

biomed

Analysis executed at 2024-12-23 23:53:49.798742
Final status is Not Complete
Initial State known features:
response_variable_label           target
time_convertible_variable
response_unique_values_ratio      0.009569377990430622
response_inferred_dtype           integer
is_response_discrete              1
is_response_positive_values_only  1
Steps:
  Order  Step                                Reward  State Change
-------  --------------------------------  --------  ----------------------------------------
      1  Is Response Positive Values Only       0.1
      2  Time Convertible Variable Search       0.1  time_convertible_variable
      3  Infer Response DType                   0.1  response_inferred_dtype  integer
      4  Is Response Discrete                   0.1  is_response_discrete  1
      5  Response Unique Values Ratio           0.1  response_unique_values_ratio  0.00956938
      6  Response Unique Values Ratio          -1
      7  Response Unique Values Ratio          -1
      8  Response Unique Values Ratio          -1
      9  Response Unique Values Ratio          -1
     10  Response Unique Values Ratio          -1
     11  Response Unique Values Ratio          -1
     12  Response Unique Values Ratio          -1
     13  Response Unique Values Ratio          -1
     14  Response Unique Values Ratio          -1
     15  Response Unique Values Ratio          -1

GAMETES_Epistasis_3_Way_20atts_0.2H_EDM_1_1

Analysis executed at 2024-12-23 23:53:51.130657
Final status is Not Complete
Initial State known features:
response_variable_label           target
time_convertible_variable
response_unique_values_ratio      0.00125
response_inferred_dtype           integer
is_response_discrete              1
is_response_positive_values_only  1
Steps:
  Order  Step                                Reward  State Change
-------  --------------------------------  --------  -------------------------------------
      1  Is Response Positive Values Only       0.1
      2  Time Convertible Variable Search       0.1  time_convertible_variable
      3  Infer Response DType                   0.1  response_inferred_dtype  integer
      4  Is Response Discrete                   0.1  is_response_discrete  1
      5  Response Unique Values Ratio           0.1  response_unique_values_ratio  0.00125
      6  Response Unique Values Ratio          -1
      7  Response Unique Values Ratio          -1
      8  Response Unique Values Ratio          -1
      9  Response Unique Values Ratio          -1
     10  Response Unique Values Ratio          -1
     11  Response Unique Values Ratio          -1
     12  Response Unique Values Ratio          -1
     13  Response Unique Values Ratio          -1
     14  Response Unique Values Ratio          -1
     15  Response Unique Values Ratio          -1

mux6

Analysis executed at 2024-12-23 23:53:51.651679
Final status is Not Complete
Initial State known features:
response_variable_label           target
time_convertible_variable
response_unique_values_ratio      0.015625
response_inferred_dtype           integer
is_response_discrete              1
is_response_positive_values_only  1
Steps:
  Order  Step                                Reward  State Change
-------  --------------------------------  --------  --------------------------------------
      1  Is Response Positive Values Only       0.1
      2  Time Convertible Variable Search       0.1  time_convertible_variable
      3  Infer Response DType                   0.1  response_inferred_dtype  integer
      4  Is Response Discrete                   0.1  is_response_discrete  1
      5  Response Unique Values Ratio           0.1  response_unique_values_ratio  0.015625
      6  Response Unique Values Ratio          -1
      7  Response Unique Values Ratio          -1
      8  Response Unique Values Ratio          -1
      9  Response Unique Values Ratio          -1
     10  Response Unique Values Ratio          -1
     11  Response Unique Values Ratio          -1
     12  Response Unique Values Ratio          -1
     13  Response Unique Values Ratio          -1
     14  Response Unique Values Ratio          -1
     15  Response Unique Values Ratio          -1

hypothyroid

Analysis executed at 2024-12-23 23:53:54.824063
Final status is Not Complete
Initial State known features:
response_variable_label           target
time_convertible_variable
response_unique_values_ratio      0.0006323110970597534
response_inferred_dtype           integer
is_response_discrete              1
is_response_positive_values_only  1
Steps:
  Order  Step                                Reward  State Change
-------  --------------------------------  --------  -----------------------------------------
      1  Is Response Positive Values Only       0.1
      2  Time Convertible Variable Search       0.1  time_convertible_variable
      3  Infer Response DType                   0.1  response_inferred_dtype  integer
      4  Is Response Discrete                   0.1  is_response_discrete  1
      5  Response Unique Values Ratio           0.1  response_unique_values_ratio  0.000632311
      6  Response Unique Values Ratio          -1
      7  Response Unique Values Ratio          -1
      8  Response Unique Values Ratio          -1
      9  Response Unique Values Ratio          -1
     10  Response Unique Values Ratio          -1
     11  Response Unique Values Ratio          -1
     12  Response Unique Values Ratio          -1
     13  Response Unique Values Ratio          -1
     14  Response Unique Values Ratio          -1
     15  Response Unique Values Ratio          -1
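All 16 summaries above show the same trace: five exploratory steps rewarded with 0.1 each, followed by ten repetitions of the Response Unique Values Ratio step at a reward of -1, and a final status of Not Complete. The sketch below tallies one such trace from its printed table. The `parse_steps` helper is hypothetical and works on the text of the summary only; it does not use any ostatslib API.

```python
import re
from collections import Counter

# Step rows copied from the mnist summary; the other traces differ only in the ratio value.
STEPS = """\
      1  Is Response Positive Values Only       0.1
      2  Time Convertible Variable Search       0.1  time_convertible_variable
      3  Infer Response DType                   0.1  response_inferred_dtype  integer
      4  Is Response Discrete                   0.1  is_response_discrete  1
      5  Response Unique Values Ratio           0.1  response_unique_values_ratio  0.000142857
      6  Response Unique Values Ratio          -1
      7  Response Unique Values Ratio          -1
      8  Response Unique Values Ratio          -1
      9  Response Unique Values Ratio          -1
     10  Response Unique Values Ratio          -1
     11  Response Unique Values Ratio          -1
     12  Response Unique Values Ratio          -1
     13  Response Unique Values Ratio          -1
     14  Response Unique Values Ratio          -1
     15  Response Unique Values Ratio          -1
"""


def parse_steps(table):
    """Parse (order, step name, reward) tuples from a printed steps table."""
    rows = []
    for line in table.splitlines():
        # Columns are separated by two or more spaces; step names use single spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3 and fields[0].isdigit():
            rows.append((int(fields[0]), fields[1], float(fields[2])))
    return rows


rows = parse_steps(STEPS)
total_reward = sum(reward for _, _, reward in rows)
penalized = Counter(step for _, step, reward in rows if reward < 0)

print(f"steps: {len(rows)}, total reward: {total_reward:.1f}")
print(f"penalized steps: {dict(penalized)}")
```

The tally makes the pattern explicit: the agent collects 0.5 in total from the first five steps, then loses 10 by repeating a step that no longer changes the state, and no model-fitting step is reached in any of the sampled datasets.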