data_manipulate

This module implements utility functions for manipulating data features and labels.

s3l.datasets.data_manipulate.inductive_split(X=None, y=None, instance_indexes=None, test_ratio=0.3, initial_label_rate=0.05, split_count=10, all_class=True, save_file=False, saving_path=None, name=None)[source]

Provide one of X, y or instance_indexes to execute the inductive split.

Returns the indexes of the train/test data, and of the labeled/unlabeled data within the training set, for each split. If X and y are both provided, they must have the same length.

Parameters:
  • X (array-like, optional) – Data matrix with [n_instances, n_features]
  • y (array-like, optional) – labels of given data [n_instances, n_labels] or [n_instances]
  • instance_indexes (list, optional (default=None)) – List of instance names (used for image datasets), or an index list supplied instead of the data matrix. At least one of [instance_indexes, X, y] must be provided.
  • test_ratio (float, optional (default=0.3)) – Ratio of test set
  • initial_label_rate (float, optional (default=0.05)) – Ratio of the initial labeled set within the training set, e.g. the labeled set holds about initial_label_rate*(1-test_ratio)*n_instances instances.
  • split_count (int, optional (default=10)) – Number of random splits to generate.
  • all_class (bool, optional (default=True)) – Whether each split must contain at least one instance of each class. If False, a totally random split is performed.
  • save_file (boolean, optional (default=False)) – A flag indicating whether to save the splits.
  • saving_path (str, optional (default=None)) – Pass None to disable saving.
  • name (str, optional (default=None)) – Dataset name.
Returns:

  • train_idx (list) – index of training set, shape like [n_split_count, n_training_indexes]
  • test_idx (list) – index of testing set, shape like [n_split_count, n_testing_indexes]
  • label_idx (list) – index of labeled set, shape like [n_split_count, n_labeling_indexes]
  • unlabel_idx (list) – index of unlabeled set, shape like [n_split_count, n_unlabeling_indexes]
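
The relationship between test_ratio, initial_label_rate and the four returned index lists can be sketched as follows. This is a minimal illustration in plain Python, not the s3l implementation: the function name sketch_inductive_split, the seed argument and the rounding rule are assumptions, and the all_class constraint and file saving are left out.

```python
import random


def sketch_inductive_split(n_instances, test_ratio=0.3,
                           initial_label_rate=0.05, split_count=10, seed=0):
    """Hypothetical stand-in for an inductive split over range(n_instances)."""
    rng = random.Random(seed)
    train_idx, test_idx, label_idx, unlabel_idx = [], [], [], []
    for _ in range(split_count):
        perm = list(range(n_instances))
        rng.shuffle(perm)
        # Assumption: set sizes are rounded to the nearest integer.
        n_test = round(test_ratio * n_instances)
        test, train = perm[:n_test], perm[n_test:]
        # The labeled set is drawn from the training part only.
        n_label = max(1, round(initial_label_rate * len(train)))
        train_idx.append(train)
        test_idx.append(test)
        label_idx.append(train[:n_label])
        unlabel_idx.append(train[n_label:])
    return train_idx, test_idx, label_idx, unlabel_idx


train_idx, test_idx, label_idx, unlabel_idx = sketch_inductive_split(200)
```

Each returned list has one entry per split, and within a split the labeled and unlabeled indexes together make up the training indexes.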

s3l.datasets.data_manipulate.ratio_split(X=None, y=None, instance_indexes=None, unlabel_ratio=0.3, split_count=10, all_class=True, save_file=False, saving_path=None, name=None)[source]

Split the data into labeled and unlabeled sets with the given ratio.

Provide one of X, y or instance_indexes to execute the transductive split. If X and y are both provided, they must have the same length. If X and instance_indexes are both provided, instance_indexes is used for the split.

Parameters:
  • X (array-like, optional) – Data matrix with [n_instances, n_features]
  • y (array-like, optional) – labels of given data [n_instances, n_labels] or [n_instances]
  • instance_indexes (list, optional (default=None)) – List of instance names (used for image datasets), or an index list supplied instead of the data matrix. At least one of [instance_indexes, X, y] must be provided.
  • unlabel_ratio (float, optional (default=0.3)) – Ratio of the unlabeled (test) set
  • split_count (int, optional (default=10)) – Number of random splits to generate.
  • all_class (bool, optional (default=True)) – Whether each split must contain at least one instance of each class. If False, a totally random split is performed.
  • save_file (boolean, optional (default=False)) – A flag indicating whether to save the splits.
  • saving_path (str, optional (default=None)) – Pass None to disable saving.
  • name (str, optional (default=None)) – Dataset name.
Returns:

  • train_idxs (list) – index of training set, shape like [n_split_count, n_training_indexes]
  • test_idxs (list) – index of testing set, shape like [n_split_count, n_testing_indexes]
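
A transductive ratio split over a list of instance names might look like the sketch below. It is an illustration only: sketch_ratio_split, the seed argument, the rounding rule and the example file names are assumptions, and the all_class constraint is not modeled.

```python
import random


def sketch_ratio_split(instance_indexes, unlabel_ratio=0.3,
                       split_count=10, seed=0):
    """Hypothetical stand-in: split names or indexes by a fixed ratio."""
    rng = random.Random(seed)
    train_idxs, test_idxs = [], []
    for _ in range(split_count):
        perm = list(instance_indexes)
        rng.shuffle(perm)
        # Assumption: the unlabeled set size is rounded to the nearest integer.
        n_unlabel = round(unlabel_ratio * len(perm))
        test_idxs.append(perm[:n_unlabel])
        train_idxs.append(perm[n_unlabel:])
    return train_idxs, test_idxs


# Hypothetical instance names, as one might use for an image dataset.
names = [f"img_{i:03d}.png" for i in range(50)]
train_idxs, test_idxs = sketch_ratio_split(names, unlabel_ratio=0.3, split_count=4)
```

Because the split operates on whatever instance_indexes contains, the returned lists hold names here rather than integer positions.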

s3l.datasets.data_manipulate.cv_split(X=None, y=None, instance_indexes=None, k=3, split_count=10, all_class=True, save_file=False, saving_path=None, name=None)[source]

Split the data into labeled and unlabeled sets using a k-fold split.

Provide one of X, y or instance_indexes to execute the transductive split. If instance_indexes is provided, it is used first.

Note

  1. For multi-label tasks, set all_class = False.
  2. For classification, the labels must not be of float type.
Parameters:
  • X (array-like, optional) – Data matrix with [n_instances, n_features]
  • y (array-like, optional) – labels of given data [n_instances, n_labels] or [n_instances]
  • instance_indexes (list, optional (default=None)) – An index list supplied instead of X. At least one of [instance_indexes, X, y] must be provided.
  • k (int, optional (default=3)) – Parameter for the k-fold split. k should be small when little labeled data is available.
  • split_count (int, optional (default=10)) – Number of random splits to generate.
  • all_class (bool, optional (default=True)) – Whether each split must contain at least one instance of each class. If False, a totally random split is performed.
  • save_file (boolean, optional (default=False)) – A flag indicating whether to save the splits.
  • saving_path (str, optional (default=None)) – Pass None to disable saving.
  • name (str, optional (default=None)) – Dataset name.
Returns:

  • train_idx (list) – index of training set, shape like [n_split_count, n_training_indexes]
  • test_idx (list) – index of testing set, shape like [n_split_count, n_testing_indexes]
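
For readers unfamiliar with k-fold splitting, a generic version is sketched below. It is not taken from s3l: the function name sketch_kfold and the seed argument are assumptions, and the documentation above does not say which fold plays the labeled role or how split_count combines with k, so the sketch simply holds out each fold once.

```python
import random


def sketch_kfold(n_instances, k=3, seed=0):
    """Generic k-fold partition; each fold is held out exactly once."""
    rng = random.Random(seed)
    perm = list(range(n_instances))
    rng.shuffle(perm)
    # Fold i takes every k-th shuffled element starting at offset i,
    # so fold sizes differ by at most one.
    folds = [perm[i::k] for i in range(k)]
    train_idx, test_idx = [], []
    for i in range(k):
        test_idx.append(folds[i])
        train_idx.append(
            [j for f, fold in enumerate(folds) if f != i for j in fold]
        )
    return train_idx, test_idx


train_idx, test_idx = sketch_kfold(10, k=3)
```

With 10 instances and k=3 the held-out folds have 4, 3 and 3 elements, which shows why a large k leaves very few instances per fold when data is scarce.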

s3l.datasets.data_manipulate.split_load(path, name)[source]

Load split from path.

Parameters:
  • path (str) – Absolute path to a directory that contains train_idx.txt, test_idx.txt, label_idx.txt, unlabel_idx.txt.
  • name (str) – The name of the dataset. The files are stored as ‘XXX_train/test_idx.txt/npy’
Returns:

  • train_idx (list) – index of training set, shape like [n_split_count, n_training_samples]
  • test_idx (list) – index of testing set, shape like [n_split_count, n_testing_samples]
  • label_idx (list) – index of labeled set, shape like [n_split_count, n_labeling_samples]
  • unlabel_idx (list) – index of unlabeled set, shape like [n_split_count, n_unlabeling_samples]

s3l.datasets.data_manipulate.check_y(y, binary=True)[source]

Transform a label vector into a probability matrix. Used for binary and multi-class tasks.

Parameters:
  • y (np.ndarray) – Original label vector.
  • binary (boolean (default=True)) – Indicates the task type: True for binary, False for multi-class.
Returns:

  • labels (1-D np.ndarray) – A vector storing the original labels. The labels are sorted as in y_t.
  • y_t (np.ndarray) – When binary == True, y_t is a 1-D vector with values in {1,-1}. When binary == False, y_t is a matrix of shape [n_samples, n_classes].
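
The two output shapes described above can be illustrated with a small sketch. It uses plain lists instead of np.ndarray and is not the s3l code: sketch_check_y is a hypothetical name, and both the choice of which label maps to +1 and the one-hot encoding for the multi-class case are assumptions.

```python
def sketch_check_y(y, binary=True):
    """Hypothetical stand-in: encode labels as {1,-1} or as a one-hot matrix."""
    labels = sorted(set(y))
    if binary:
        if len(labels) != 2:
            raise ValueError("binary=True expects exactly two distinct labels")
        # Assumption: the second label in sorted order maps to +1, the first to -1.
        y_t = [1 if v == labels[1] else -1 for v in y]
    else:
        position = {label: i for i, label in enumerate(labels)}
        # Assumption: one row per sample, one column per class, 1 at the true class.
        y_t = [[1 if position[v] == c else 0 for c in range(len(labels))]
               for v in y]
    return labels, y_t


labels, y_t = sketch_check_y(["cat", "dog", "dog", "cat"])
labels_mc, y_mc = sketch_check_y([3, 7, 5, 7], binary=False)
```

In the multi-class case, column c of the matrix corresponds to labels_mc[c], which is what lets the encoding be reversed later.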

s3l.datasets.data_manipulate.check_inputs(X, y, binary=True)[source]

Transform the input label vector into a probability matrix and encode string features.

Parameters:
  • X (np.ndarray) – Features
  • y (np.ndarray) – Labels

s3l.datasets.data_manipulate.modify_y(y, ind, n_labels, binary=True)[source]

This function is the inverse of check_y: it maps predictions from the internal representation back to the original labels.

Parameters:
  • y (np.ndarray) – Prediction
  • ind (np.ndarray) – Index
  • n_labels (1-D np.ndarray) – A vector storing the original labels. The labels are sorted as in y.
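
Mapping internal predictions back to the original labels could be sketched as below. This is an assumption-laden illustration rather than the library's behavior: sketch_modify_y is a hypothetical name, the ind parameter is omitted because its role is not described here, and the sign convention for the binary case is assumed.

```python
def sketch_modify_y(y, labels, binary=True):
    """Hypothetical stand-in: decode internal predictions to original labels."""
    if binary:
        # Assumption: positive values stand for labels[1], the rest for labels[0].
        return [labels[1] if v > 0 else labels[0] for v in y]
    # Multi-class: pick the column with the largest score in each row.
    return [labels[max(range(len(row)), key=row.__getitem__)] for row in y]


binary_pred = sketch_modify_y([-1, 1, 1], ["cat", "dog"])
multi_pred = sketch_modify_y(
    [[0.1, 0.7, 0.2], [0.9, 0.05, 0.05]], [3, 5, 7], binary=False
)
```

The label vector must be in the same sorted order that was used when encoding, otherwise the decoded labels will not match the originals.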