data_manipulate
This module implements utility functions for manipulating data features and labels.
s3l.datasets.data_manipulate.inductive_split(X=None, y=None, instance_indexes=None, test_ratio=0.3, initial_label_rate=0.05, split_count=10, all_class=True, save_file=False, saving_path=None, name=None)
Provide one of X, y or instance_indexes to execute the inductive split.
Returns the indexes of the train/test data, and of the labeled/unlabeled data within the training set, for each split. If X and y are both provided, they must have the same length.
Parameters: - X (array-like, optional) – Data matrix of shape [n_instances, n_features].
- y (array-like, optional) – Labels of the given data, of shape [n_instances, n_labels] or [n_instances].
- instance_indexes (list, optional (default=None)) – List of instance names (used for image datasets), or an index list supplied instead of the data matrix. One of [instance_indexes, X, y] must be provided.
- test_ratio (float, optional (default=0.3)) – Ratio of the test set.
- initial_label_rate (float, optional (default=0.05)) – Ratio of the initial labeled set within the training data, i.e. about initial_label_rate * (1 - test_ratio) * n_instances instances are labeled initially.
- split_count (int, optional (default=10)) – Number of random splits to generate.
- all_class (bool, optional (default=True)) – Whether each split must contain at least one instance of each class. If False, a totally random split is performed.
- save_file (bool, optional (default=False)) – Whether to save the splits.
- saving_path (str, optional (default=None)) – Path for saving the splits. Pass None to disable saving.
- name (str, optional (default=None)) – Dataset name.
Returns: - train_idx (list) – Indexes of the training set, shape [n_split_count, n_training_indexes].
- test_idx (list) – Indexes of the testing set, shape [n_split_count, n_testing_indexes].
- label_idx (list) – Indexes of the labeled set, shape [n_split_count, n_labeling_indexes].
- unlabel_idx (list) – Indexes of the unlabeled set, shape [n_split_count, n_unlabeling_indexes].
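The index bookkeeping described above can be illustrated with plain NumPy. This is a minimal sketch, not the s3l implementation: the name inductive_split_sketch, the seed argument, and the rounding of set sizes are assumptions made for this example, and the all_class check and file saving are left out.

```python
import numpy as np


def inductive_split_sketch(n_instances, test_ratio=0.3, initial_label_rate=0.05,
                           split_count=10, seed=0):
    """Hypothetical stand-in: purely random splits over range(n_instances)."""
    rng = np.random.default_rng(seed)
    # Assumption: set sizes are rounded to the nearest integer.
    n_test = int(round(n_instances * test_ratio))
    n_train = n_instances - n_test
    # Keep at least one labeled instance so the labeled set is never empty.
    n_label = max(1, int(round(n_train * initial_label_rate)))

    train_idx, test_idx, label_idx, unlabel_idx = [], [], [], []
    for _ in range(split_count):
        perm = rng.permutation(n_instances)
        test, train = perm[:n_test], perm[n_test:]
        test_idx.append(test)
        train_idx.append(train)
        # The labeled and unlabeled sets partition the training set.
        label_idx.append(train[:n_label])
        unlabel_idx.append(train[n_label:])
    return train_idx, test_idx, label_idx, unlabel_idx


train_idx, test_idx, label_idx, unlabel_idx = inductive_split_sketch(100)
print(len(train_idx), len(train_idx[0]), len(test_idx[0]))
```

Each returned list holds one index array per split, so the outer length equals split_count and the labeled and unlabeled arrays of a split always add up to that split's training set.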
s3l.datasets.data_manipulate.ratio_split(X=None, y=None, instance_indexes=None, unlabel_ratio=0.3, split_count=10, all_class=True, save_file=False, saving_path=None, name=None)
Split the data into labeled and unlabeled sets with the given ratio.
Provide one of X, y or instance_indexes to execute the transductive split. If X and y are both provided, they must have the same length. If X and instance_indexes are both provided, instance_indexes is used for the split.
Parameters: - X (array-like, optional) – Data matrix of shape [n_instances, n_features].
- y (array-like, optional) – Labels of the given data, of shape [n_instances, n_labels] or [n_instances].
- instance_indexes (list, optional (default=None)) – List of instance names (used for image datasets), or an index list supplied instead of the data matrix. One of [instance_indexes, X, y] must be provided.
- unlabel_ratio (float, optional (default=0.3)) – Ratio of the unlabeled set.
- split_count (int, optional (default=10)) – Number of random splits to generate.
- all_class (bool, optional (default=True)) – Whether each split must contain at least one instance of each class. If False, a totally random split is performed.
- save_file (bool, optional (default=False)) – Whether to save the splits.
- saving_path (str, optional (default=None)) – Path for saving the splits. Pass None to disable saving.
- name (str, optional (default=None)) – Dataset name.
Returns: - train_idxs (list) – Indexes of the training set, shape [n_split_count, n_training_indexes].
- test_idxs (list) – Indexes of the testing set, shape [n_split_count, n_testing_indexes].
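One way to read the all_class option is as a constraint that every class must appear among the labeled instances of each split. The sketch below enforces it by redrawing a random split until the constraint holds. It is an illustration under stated assumptions, not the library's code: ratio_split_sketch, seed, max_tries, and the returned names label_idx and unlabel_idx are all hypothetical.

```python
import numpy as np


def ratio_split_sketch(y, unlabel_ratio=0.3, split_count=10, all_class=True,
                       seed=0, max_tries=1000):
    """Hypothetical stand-in: random labeled/unlabeled splits over len(y) instances."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = len(y)
    n_unlabel = int(round(n * unlabel_ratio))
    n_classes = len(np.unique(y))

    label_idx, unlabel_idx = [], []
    for _ in range(split_count):
        for attempt in range(max_tries):
            perm = rng.permutation(n)
            unlabeled, labeled = perm[:n_unlabel], perm[n_unlabel:]
            # Accept the draw if all_class is off or every class is labeled.
            if not all_class or len(np.unique(y[labeled])) == n_classes:
                break
        else:
            raise ValueError("could not find a split covering every class")
        label_idx.append(labeled)
        unlabel_idx.append(unlabeled)
    return label_idx, unlabel_idx


y = np.array([0] * 8 + [1] * 8 + [2] * 4)
label_idx, unlabel_idx = ratio_split_sketch(y, unlabel_ratio=0.3, split_count=5)
```

Redrawing is simple but can be slow when a class is very rare; a stratified draw that reserves one instance per class first would be an alternative design.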
s3l.datasets.data_manipulate.cv_split(X=None, y=None, instance_indexes=None, k=3, split_count=10, all_class=True, save_file=False, saving_path=None, name=None)
Split the data into training and testing sets using a k-fold split.
Provide one of X, y or instance_indexes to execute the transductive split. If instance_indexes is provided, it takes precedence.
Note
- For multi-label tasks, set all_class = False.
- For classification, the labels must not be of float type.
Parameters: - X (array-like, optional) – Data matrix of shape [n_instances, n_features].
- y (array-like, optional) – Labels of the given data, of shape [n_instances, n_labels] or [n_instances].
- instance_indexes (list, optional (default=None)) – Index list supplied instead of X. One of [instance_indexes, X, y] must be provided.
- k (int, optional (default=3)) – Number of folds for the k-fold split. Keep k small when few labeled data are available.
- split_count (int, optional (default=10)) – Number of random splits to generate.
- all_class (bool, optional (default=True)) – Whether each split must contain at least one instance of each class. If False, a totally random split is performed.
- save_file (bool, optional (default=False)) – Whether to save the splits.
- saving_path (str, optional (default=None)) – Path for saving the splits. Pass None to disable saving.
- name (str, optional (default=None)) – Dataset name.
Returns: - train_idx (list) – Indexes of the training set, shape [n_split_count, n_training_indexes].
- test_idx (list) – Indexes of the testing set, shape [n_split_count, n_testing_indexes].
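For readers unfamiliar with repeated k-fold splitting, the following sketch shows the basic idea with NumPy. It makes several assumptions that the documentation above does not confirm: the name cv_split_sketch and the seed argument are hypothetical, the all_class check is omitted, and each repetition contributes k train/test pairs to the output lists, which may differ from how s3l arranges its results.

```python
import numpy as np


def cv_split_sketch(n_instances, k=3, split_count=10, seed=0):
    """Hypothetical stand-in: split_count repetitions of a shuffled k-fold split."""
    if k < 2:
        raise ValueError("k must be at least 2")
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for _ in range(split_count):
        # Shuffle once per repetition, then cut into k nearly equal folds.
        folds = np.array_split(rng.permutation(n_instances), k)
        for i in range(k):
            # Fold i is held out; the remaining folds form the training set.
            test_idx.append(folds[i])
            train_idx.append(np.concatenate(folds[:i] + folds[i + 1:]))
    return train_idx, test_idx


train_idx, test_idx = cv_split_sketch(10, k=3, split_count=2)
```

Within one repetition the k test folds are disjoint and together cover every instance, which is why a smaller k leaves more data in each test fold.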
s3l.datasets.data_manipulate.split_load(path, name)
Load split from path.
Returns: - train_idx (list) – Indexes of the training set, shape [n_split_count, n_training_samples].
- test_idx (list) – Indexes of the testing set, shape [n_split_count, n_testing_samples].
- label_idx (list) – Indexes of the labeled set, shape [n_split_count, n_labeling_samples].
- unlabel_idx (list) – Indexes of the unlabeled set, shape [n_split_count, n_unlabeling_samples].
s3l.datasets.data_manipulate.check_y(y, binary=True)
Transform a label vector into a probability matrix. Used for binary and multi-class tasks.
Parameters: - y (np.ndarray) – Original label vector.
- binary (bool (default=True)) – Whether the task is binary (True) or multi-class (False).
Returns: - labels (1-D np.ndarray) – A vector storing the original labels, sorted in the same order as in y_t.
- y_t (np.ndarray) – When binary == True, y_t is a 1-D vector with values in {1, -1}. When binary == False, y_t is a matrix of shape [n_samples, n_classes].
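The two output forms can be illustrated with a small NumPy sketch. It is not the s3l code: check_y_sketch is a hypothetical name, and two details are assumptions, namely that the first sorted label maps to -1 in the binary case and that the multi-class matrix is a 0/1 indicator matrix.

```python
import numpy as np


def check_y_sketch(y, binary=True):
    """Hypothetical stand-in: encode a label vector as {-1, 1} or as an indicator matrix."""
    y = np.asarray(y).ravel()
    # labels holds the sorted distinct labels; codes indexes into it per sample.
    labels, codes = np.unique(y, return_inverse=True)
    if binary:
        if len(labels) != 2:
            raise ValueError("binary=True expects exactly two distinct labels")
        # Assumption: the first sorted label maps to -1, the second to +1.
        y_t = np.where(codes == 1, 1, -1)
    else:
        # One row per sample, one column per class, a single 1 per row.
        y_t = np.zeros((len(y), len(labels)), dtype=int)
        y_t[np.arange(len(y)), codes] = 1
    return labels, y_t


labels, y_t = check_y_sketch(["cat", "dog", "cat"])
mc_labels, mc_y_t = check_y_sketch([2, 0, 1, 2], binary=False)
```

Column j of the multi-class matrix corresponds to labels[j], which is what makes the original labels recoverable later.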
s3l.datasets.data_manipulate.check_inputs(X, y, binary=True)
Transform the input label vector into a probability matrix and encode string features.
Parameters: - X (np.ndarray) – Features
- y (np.ndarray) – Labels
s3l.datasets.data_manipulate.modify_y(y, ind, n_labels, binary=True)
The inverse of check_y: maps predictions from the internal representation back to the original labels.
Parameters: - y (np.ndarray) – Prediction
- ind (np.ndarray) – Index
- n_labels (1-D np.ndarray) – A vector storing the original labels, sorted in the same order as in y.
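The reverse mapping can be sketched as a lookup into the vector of original labels. This example is an assumption-laden illustration rather than the library's modify_y: the name decode_labels_sketch is hypothetical, the ind argument is omitted because its meaning is not documented above, and the binary convention (-1 for the first label, +1 for the second) is assumed.

```python
import numpy as np


def decode_labels_sketch(y_pred, labels, binary=True):
    """Hypothetical stand-in: map internal predictions back to the original labels."""
    y_pred = np.asarray(y_pred)
    labels = np.asarray(labels)
    if binary:
        # Assumption: -1 maps to labels[0] and +1 maps to labels[1].
        codes = (y_pred > 0).astype(int)
    else:
        # Pick the highest-scoring class in each row of the score matrix.
        codes = y_pred.argmax(axis=1)
    return labels[codes]


labels = np.array(["cat", "dog", "fish"])
scores = np.array([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]])
decoded = decode_labels_sketch(scores, labels, binary=False)
binary_decoded = decode_labels_sketch([1, -1, 1], ["neg", "pos"])
```

Because the label vector keeps the same ordering used during encoding, decoding reduces to a single indexing operation.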