

class TextDataset(datasets._Dataset):
30- """Managed text dataset resource for Vertex AI."""
30+ """A managed text dataset resource for Vertex AI.
31+
32+ Use this class to work with a managed text dataset. To create a managed
33+ text dataset, you need a datasource file in CSV format and a schema file in
34+ YAML format. A schema is optional for a custom model. The CSV file and the
35+ schema are accessed in Cloud Storage buckets.
36+
37+ Use text data for the following objectives:
38+
39+ * Classification. For more information, see
40+ [Prepare text training data for classification](https://cloud.google.com/vertex-ai/docs/text-data/classification/prepare-data).
41+ * Entity extraction. For more information, see
42+ [Prepare text training data for entity extraction](https://cloud.google.com/vertex-ai/docs/text-data/entity-extraction/prepare-data).
43+ * Sentiment analysis. For more information, see
+ [Prepare text training data for sentiment analysis](https://cloud.google.com/vertex-ai/docs/text-data/sentiment-analysis/prepare-data).
+
+ The following code shows you how to create and import a text dataset with
+ a CSV datasource file and a YAML schema file. The schema file you use
+ depends on whether your text dataset is used for single-label
+ classification, multi-label classification, entity extraction, or
+ sentiment analysis.
+
+ ```py
+ my_dataset = aiplatform.TextDataset.create(
+ display_name="my-text-dataset",
+ gcs_source=['gs://path/to/my/text-dataset.csv'],
+ import_schema_uri='gs://path/to/my/schema.yaml',
+ )
+ ```
+ """

_supported_metadata_schema_uris: Optional[Tuple[str]] = (
schema.dataset.metadata.text,
@@ -49,91 +77,97 @@ def create(
sync: bool = True,
create_request_timeout: Optional[float] = None,
) -> "TextDataset":
52- """Creates a new text dataset and optionally imports data into dataset
53- when source and import_schema_uri are passed.
80+ """Creates a new text dataset.
81+
82+ Optionally imports data into this dataset when a source and
83+ `import_schema_uri` are passed in. The following is an example of how
84+ this method is used:
5485
55- Example Usage:
56- ds = aiplatform.TextDataset.create(
57- display_name='my-dataset',
58- gcs_source='gs://my-bucket/dataset.csv',
59- import_schema_uri=aiplatform.schema.dataset.ioformat.text.multi_label_classification
60- )
86+ ```py
87+ ds = aiplatform.TextDataset.create(
88+ display_name='my-dataset',
89+ gcs_source='gs://my-bucket/dataset.csv',
90+ import_schema_uri=aiplatform.schema.dataset.ioformat.text.multi_label_classification
91+ )
92+ ```
6193
Args:
display_name (str):
- Optional. The user-defined name of the Dataset.
- The name can be up to 128 characters long and can be consist
- of any UTF-8 characters.
+ Optional. The user-defined name of the dataset. The name must
+ contain 128 or fewer UTF-8 characters.
gcs_source (Union[str, Sequence[str]]):
- Google Cloud Storage URI(-s) to the
- input file(s).
-
- Examples:
- str: "gs://bucket/file.csv"
- Sequence[str]: ["gs://bucket/file1.csv", "gs://bucket/file2.csv"]
+ Optional. One or more Cloud Storage URIs to the input files
+ that contain your data. For example,
+ `"gs://bucket/file.csv"` or
+ `["gs://bucket/file1.csv", "gs://bucket/file2.csv"]`.
import_schema_uri (str):
- Points to a YAML file stored on Google Cloud
- Storage describing the import format. Validation will be
- done against the schema. The schema is defined as an
- `OpenAPI 3.0.2 Schema
- Object <https://tinyurl.com/y538mdwt>`__.
+ Optional. A URI for a YAML file stored in Cloud Storage that
+ describes the import schema used to validate the
+ dataset. The schema is an
+ [OpenAPI 3.0.2 Schema](https://tinyurl.com/y538mdwt) object.
data_item_labels (Dict):
- Labels that will be applied to newly imported DataItems. If
- an identical DataItem as one being imported already exists
- in the Dataset, then these labels will be appended to these
- of the already existing one, and if labels with identical
- key is imported before, the old label value will be
- overwritten. If two DataItems are identical in the same
- import data operation, the labels will be combined and if
- key collision happens in this case, one of the values will
- be picked randomly. Two DataItems are considered identical
- if their content bytes are identical (e.g. image bytes or
- pdf bytes). These labels will be overridden by Annotation
- labels specified inside index file referenced by
- ``import_schema_uri``,
- e.g. jsonl file.
+ Optional. A dictionary of labels that are applied to newly
+ imported data items. If a data item that's being imported is
+ identical to one that's already in the dataset, the labels in
+ this dictionary are appended to the labels of the existing
+ data item; if a label with the same key was imported before,
+ its old value is overwritten. If two identical data items are
+ imported in the same operation, their labels are combined, and
+ if their keys collide, one of the values is chosen at random.
+ Two data items are considered identical if their content bytes
+ are identical (for example, image bytes or PDF bytes). If
+ annotation labels are referenced in a schema specified by the
+ `import_schema_uri` parameter, then the labels in the
+ `data_item_labels` dictionary are overridden by those
+ annotations.
project (str):
- Project to upload this dataset to. Overrides project set in
- aiplatform.init.
+ Optional. The name of the Google Cloud project to which this
+ `TextDataset` is uploaded. This overrides the project that
+ was set by `aiplatform.init`.
location (str):
- Location to upload this dataset to. Overrides location set in
- aiplatform.init.
+ Optional. The Google Cloud region where this dataset is
+ uploaded. This region overrides the region that was set by
+ `aiplatform.init`.
credentials (auth_credentials.Credentials):
- Custom credentials to use to upload this dataset. Overrides
- credentials set in aiplatform.init.
+ Optional. The credentials that are used to upload the `TextDataset`.
+ These credentials override the credentials set by
+ `aiplatform.init`.
request_metadata (Sequence[Tuple[str, str]]):
- Strings which should be sent along with the request as metadata.
+ Optional. Strings that contain metadata that's sent with the request.
labels (Dict[str, str]):
- Optional. Labels with user-defined metadata to organize your Tensorboards.
- Label keys and values can be no longer than 64 characters
- (Unicode codepoints), can only contain lowercase letters, numeric
- characters, underscores and dashes. International characters are allowed.
- No more than 64 user labels can be associated with one Tensorboard
- (System labels are excluded).
- See https://goo.gl/xmQnxf for more information and examples of labels.
- System reserved label keys are prefixed with "aiplatform.googleapis.com/"
- and are immutable.
+ Optional. Labels with user-defined metadata to organize your
+ datasets. The maximum length of a key and of a
+ value is 64 unicode characters. Labels and keys can contain only
+ lowercase letters, numeric characters, underscores, and dashes.
+ International characters are allowed. No more than 64 user
+ labels can be associated with one dataset (system labels are
+ excluded). For more information and examples of using labels, see
+ [Using labels to organize Google Cloud Platform resources](https://goo.gl/xmQnxf).
+ System reserved label keys are prefixed with
+ `aiplatform.googleapis.com/` and are immutable.
encryption_spec_key_name (Optional[str]):
Optional. The Cloud KMS resource identifier of the customer
- managed encryption key used to protect the dataset. Has the
- form:
- ``projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key``.
+ managed encryption key that's used to protect the dataset. The
+ format of the key is
+ `projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key`.
The key needs to be in the same region as where the compute
resource is created.

- If set, this Dataset and all sub-resources of this Dataset will be secured by this key.
+ If `encryption_spec_key_name` is set, this `TextDataset` and
+ all of its sub-resources are secured by this key.

- Overrides encryption_spec_key_name set in aiplatform.init.
- create_request_timeout (float):
- Optional. The timeout for the create request in seconds.
+ This `encryption_spec_key_name` overrides the
+ `encryption_spec_key_name` set by `aiplatform.init`.
sync (bool):
- Whether to execute this method synchronously. If False, this method
- will be executed in concurrent Future and any downstream object will
- be immediately returned and synced when the Future has completed.
+ If `True`, the `create` method creates a text dataset
+ synchronously. If `False`, the method creates a text dataset
+ asynchronously in a concurrent `Future`; downstream objects are
+ returned immediately and sync when the `Future` completes.
+ create_request_timeout (float):
+ Optional. The number of seconds for the timeout of the create
+ request.

Returns:
text_dataset (TextDataset):
- Instantiated representation of the managed text dataset resource.
+ An instantiated representation of the managed `TextDataset`
+ resource.
"""
if not display_name:
display_name = cls._generate_display_name()
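The CSV datasource that `gcs_source` points to can be produced with Python's standard `csv` module before uploading it to Cloud Storage. The sketch below assumes the two-column `content,label` layout described in the classification prepare-data guide; check that guide for the exact columns your objective requires:

```python
import csv
import io

# Example rows: inline text content paired with a single classification label.
# The (content, label) layout is an assumption based on the classification
# prepare-data guide, not something this diff defines.
rows = [
    ("I loved this film", "positive"),
    ("The plot made no sense", "negative"),
]

# Write the rows as CSV; in practice you would write to a local file and
# upload it to a gs:// bucket before calling TextDataset.create.
buf = io.StringIO()
writer = csv.writer(buf)
for content, label in rows:
    writer.writerow([content, label])

csv_text = buf.getvalue()
print(csv_text)
```

The resulting text is what you would store at a URI such as `gs://my-bucket/dataset.csv` and pass to `gcs_source`.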