data_manipulate

This module implements utility functions for manipulating data features and labels.

s3l.datasets.data_manipulate.inductive_split(X=None, y=None, instance_indexes=None, test_ratio=0.3, initial_label_rate=0.05, split_count=10, all_class=True, save_file=False, saving_path=None, name=None)

Provide one of X, y or instance_indexes to execute the inductive split.

Returns the indexes of the train/test data, and of the labeled/unlabeled data within the training set, for each split. If both X and y are provided, they must have the same length.

Parameters:

X (array-like, optional) – Data matrix with [n_instances, n_features]
y (array-like, optional) – labels of given data [n_instances, n_labels] or [n_instances]
instance_indexes (list, optional (default=None)) – List of instance names (used for image datasets), or an index list supplied instead of the data matrix. At least one of [instance_indexes, X, y] must be provided.
test_ratio (float, optional (default=0.3)) – Ratio of the test set.
initial_label_rate (float, optional (default=0.05)) – Ratio of the initial labeled set within the training set, i.e. the labeled set holds about initial_label_rate * (1 - test_ratio) * n_instances instances.
split_count (int, optional (default=10)) – Number of random splits to generate.
all_class (bool, optional (default=True)) – Whether each split must contain at least one instance of every class. If False, a fully random split is performed.
save_file (boolean, optional (default=False)) – Whether to save the splits.
saving_path (str, optional (default=None)) – Path to save the splits to. Pass None to disable saving.
name (str, optional (default=None)) – Dataset name.

Returns:

train_idx (list) – index of training set, shape like [n_split_count, n_training_indexes]
test_idx (list) – index of testing set, shape like [n_split_count, n_testing_indexes]
label_idx (list) – index of labeled set, shape like [n_split_count, n_labeled_indexes]
unlabel_idx (list) – index of unlabeled set, shape like [n_split_count, n_unlabeled_indexes]
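
The two-stage split described above can be illustrated with a minimal, standard-library-only sketch. This is not the s3l implementation: the function name inductive_split_sketch, the seed argument and the rounding rule are assumptions made for illustration, and the all_class and saving options are left out.

```python
import random


def inductive_split_sketch(n_instances, test_ratio=0.3, initial_label_rate=0.05,
                           split_count=10, seed=0):
    """Hypothetical helper: returns four lists with one entry per split."""
    rng = random.Random(seed)
    train_idx, test_idx, label_idx, unlabel_idx = [], [], [], []
    for _ in range(split_count):
        perm = list(range(n_instances))
        rng.shuffle(perm)
        # Stage 1: carve the test set out of the shuffled indexes.
        n_test = round(n_instances * test_ratio)
        test, train = perm[:n_test], perm[n_test:]
        # Stage 2: the labeled instances are drawn from the training part only.
        n_label = max(1, round(len(train) * initial_label_rate))
        train_idx.append(train)
        test_idx.append(test)
        label_idx.append(train[:n_label])
        unlabel_idx.append(train[n_label:])
    return train_idx, test_idx, label_idx, unlabel_idx


train_idx, test_idx, label_idx, unlabel_idx = inductive_split_sketch(100)
```

Each returned list has split_count entries, and within every split the labeled and unlabeled indexes together make up the training indexes.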

s3l.datasets.data_manipulate.ratio_split(X=None, y=None, instance_indexes=None, unlabel_ratio=0.3, split_count=10, all_class=True, save_file=False, saving_path=None, name=None)

Split the data into labeled and unlabeled sets with a given ratio.

Provide one of X, y or instance_indexes to execute the transductive split. If both X and y are provided, they must have the same length. If both X and instance_indexes are provided, instance_indexes is used for the split.

Parameters:

X (array-like, optional) – Data matrix with [n_instances, n_features]
y (array-like, optional) – labels of given data [n_instances, n_labels] or [n_instances]
instance_indexes (list, optional (default=None)) – List of instance names (used for image datasets), or an index list supplied instead of the data matrix. At least one of [instance_indexes, X, y] must be provided.
unlabel_ratio (float, optional (default=0.3)) – Ratio of the unlabeled (test) set.
split_count (int, optional (default=10)) – Number of random splits to generate.
all_class (bool, optional (default=True)) – Whether each split must contain at least one instance of every class. If False, a fully random split is performed.
save_file (boolean, optional (default=False)) – Whether to save the splits.
saving_path (str, optional (default=None)) – Path to save the splits to. Pass None to disable saving.
name (str, optional (default=None)) – Dataset name.

Returns:

train_idxs (list) – index of training set, shape like [n_split_count, n_training_indexes]
test_idxs (list) – index of testing set, shape like [n_split_count, n_testing_indexes]
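
To make the all_class option concrete, here is a minimal sketch of a ratio split that reshuffles until every class appears in the labeled part. It is an assumption about how such a constraint can be met, not a description of what s3l does internally; ratio_split_sketch, seed and max_tries are hypothetical names.

```python
import random


def ratio_split_sketch(y, unlabel_ratio=0.3, split_count=10, all_class=True,
                       seed=0, max_tries=1000):
    """Hypothetical helper: returns (labeled_splits, unlabeled_splits)."""
    rng = random.Random(seed)
    n = len(y)
    n_unlabel = round(n * unlabel_ratio)
    classes = set(y)
    labeled_splits, unlabeled_splits = [], []
    for _ in range(split_count):
        for _ in range(max_tries):
            perm = list(range(n))
            rng.shuffle(perm)
            unlabeled, labeled = perm[:n_unlabel], perm[n_unlabel:]
            # Accept the shuffle if no class coverage is required,
            # or if every class is present among the labeled instances.
            if not all_class or {y[i] for i in labeled} == classes:
                break
        else:
            raise ValueError("could not cover every class in the labeled set")
        labeled_splits.append(labeled)
        unlabeled_splits.append(unlabeled)
    return labeled_splits, unlabeled_splits


y = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
labeled, unlabeled = ratio_split_sketch(y, unlabel_ratio=0.3, split_count=5)
```

Rejection sampling is the simplest way to honor the constraint; it can fail when a class has very few instances, which is why the sketch gives up after max_tries attempts.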

s3l.datasets.data_manipulate.cv_split(X=None, y=None, instance_indexes=None, k=3, split_count=10, all_class=True, save_file=False, saving_path=None, name=None)

Split the data into training and testing sets using a k-fold split.

Provide one of X, y or instance_indexes to execute the split. If instance_indexes is provided, it takes precedence.

Note

For multi-label tasks, set all_class = False.
For classification, the labels must not be of float type.

Parameters:

X (array-like, optional) – Data matrix with [n_instances, n_features]
y (array-like, optional) – labels of given data [n_instances, n_labels] or [n_instances]
instance_indexes (list, optional (default=None)) – Index list supplied instead of X. At least one of [instance_indexes, X, y] must be provided.
k (int, optional (default=3)) – Number of folds for the k-fold split. Keep k small when few labeled instances are available.
split_count (int, optional (default=10)) – Number of random splits to generate.
all_class (bool, optional (default=True)) – Whether each split must contain at least one instance of every class. If False, a fully random split is performed.
save_file (boolean, optional (default=False)) – Whether to save the splits.
saving_path (str, optional (default=None)) – Path to save the splits to. Pass None to disable saving.
name (str, optional (default=None)) – Dataset name.

Returns:

train_idx (list) – index of training set, shape like [n_split_count, n_training_indexes]
test_idx (list) – index of testing set, shape like [n_split_count, n_testing_indexes]
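
A repeated k-fold split can be sketched as follows. This is an illustration under stated assumptions, not the library's code: cv_split_sketch and seed are hypothetical names, and the sketch emits one train/test pair per fold per repetition, which may differ from how s3l arranges its output.

```python
import random


def cv_split_sketch(n_instances, k=3, split_count=10, seed=0):
    """Hypothetical helper: returns (train_idx, test_idx), one pair per fold."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for _ in range(split_count):
        perm = list(range(n_instances))
        rng.shuffle(perm)
        # Deal the shuffled indexes into k folds of near-equal size.
        folds = [perm[i::k] for i in range(k)]
        for i in range(k):
            # Fold i is held out for testing; the other folds form the training set.
            test_idx.append(folds[i])
            train_idx.append([j for f, fold in enumerate(folds) if f != i
                              for j in fold])
    return train_idx, test_idx


train_idx, test_idx = cv_split_sketch(9, k=3, split_count=2)
```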

s3l.datasets.data_manipulate.split_load(path, name)

Load split from path.

Parameters:

path (str) – Absolute path to a directory that contains train_idx.txt, test_idx.txt, label_idx.txt and unlabel_idx.txt.
name (str) – The name of the dataset. The files are stored as ‘XXX_train/test_idx.txt/npy’

Returns:

train_idx (list) – index of training set, shape like [n_split_count, n_training_samples]
test_idx (list) – index of testing set, shape like [n_split_count, n_testing_samples]
label_idx (list) – index of labeled set, shape like [n_split_count, n_labeled_samples]
unlabel_idx (list) – index of unlabeled set, shape like [n_split_count, n_unlabeled_samples]

s3l.datasets.data_manipulate.check_y(y, binary=True)

Transform a label vector into a probability matrix. Used for binary and multi-class tasks.

Parameters:

y (np.ndarray) – Original label vector.
binary (boolean (default=True)) – Whether the task is binary (True) or multi-class (False).

Returns:

labels (1-D np.ndarray) – A vector storing the original labels, ordered as in y_t.
y_t (np.ndarray) – When binary == True, y_t is a 1-D vector with values in {1, -1}. When binary == False, y_t is a matrix of shape [n_samples, n_classes].
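
The two output encodings can be illustrated with plain Python lists. This is a sketch of the transformation described above, not the s3l implementation: check_y_sketch is a hypothetical name, and mapping the larger of the two sorted labels to 1 (and using one-hot rows in the multi-class case) are assumptions.

```python
def check_y_sketch(y, binary=True):
    """Hypothetical helper: returns (labels, y_t) as plain lists."""
    labels = sorted(set(y))
    if binary:
        if len(labels) != 2:
            raise ValueError("binary encoding needs exactly two distinct labels")
        # Assumption: the second label in sorted order becomes +1, the first -1.
        y_t = [1 if v == labels[1] else -1 for v in y]
    else:
        index = {label: i for i, label in enumerate(labels)}
        # One row per sample, one column per class, in the order of `labels`.
        y_t = [[1.0 if index[v] == c else 0.0 for c in range(len(labels))]
               for v in y]
    return labels, y_t


bin_labels, bin_y = check_y_sketch(["cat", "dog", "cat"])
multi_labels, multi_y = check_y_sketch([2, 0, 1, 0], binary=False)
```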

s3l.datasets.data_manipulate.check_inputs(X, y, binary=True)

Transform the input label vector into a probability matrix and encode string-valued features.

Parameters:

X (np.ndarray) – Features
y (np.ndarray) – Labels

s3l.datasets.data_manipulate.modify_y(y, ind, n_labels, binary=True)

This function is the inverse of check_y: it maps predictions from the internal representation back to the original labels.

Parameters:

y (np.ndarray) – Prediction
ind (np.ndarray) – Index
n_labels (1-D np.ndarray) – A vector storing the original labels. The labels are sorted as in y.
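
The inverse mapping can be sketched as below. This is only an illustration of the idea: modify_y_sketch is a hypothetical name, the ind argument of the real function is not modeled because its role is not documented here, and the decoding rules (sign for binary, argmax for multi-class) are assumptions.

```python
def modify_y_sketch(y_pred, labels, binary=True):
    """Hypothetical helper: map internal predictions back to original labels."""
    if binary:
        # Assumption: positive scores map to the second label, others to the first.
        return [labels[1] if v > 0 else labels[0] for v in y_pred]
    # Multi-class: pick the label whose column holds the largest score in each row.
    return [labels[max(range(len(row)), key=row.__getitem__)] for row in y_pred]


binary_out = modify_y_sketch([-1, 1, -1], ["cat", "dog"])
multi_out = modify_y_sketch([[0.1, 0.2, 0.7], [0.8, 0.1, 0.1]], [0, 1, 2],
                            binary=False)
```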