{"id":89673,"date":"2021-01-26T14:31:28","date_gmt":"2021-01-26T14:31:28","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=89673"},"modified":"2021-01-26T14:31:28","modified_gmt":"2021-01-26T14:31:28","slug":"handling-textual-unstructured-data-using-datastage","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/handling-textual-unstructured-data-using-datastage\/","title":{"rendered":"Handling Textual Unstructured Data using DataStage"},"content":{"rendered":"<h2>What is Unstructured Data?<\/h2>\n<p>Structured data is represented by tables, rows, columns, attributes mostly like relational database. Many of us might have spent the better part of our careers working on structured data and it is heavily used for analytical purposes.<\/p>\n<p>Unstructured data is often best described in contrast to structured data. It is an information that does not have a predefined data model and does not fit well into relational tables. It is classified into two types Non &#8211; Textual unstructured data and Textual unstructured data.<\/p>\n<p>We going to discuss how to handle Textual unstructured data using DataStage.<\/p>\n<h2>What is Textual Unstructured Data?<\/h2>\n<p>Textual Unstructured Data is a textual data found in reports, emails, financial records, medical records, spreadsheets etc.<\/p>\n<h2>Why can\u2019t we use Sequential file stage to handle Unstructured Data?<\/h2>\n<p>Flat flies are commonly used as source\/target file and using sequential file stage, we process flat flies. We can use sequential file stage to read the data in excel file, but only if the excel contains data in a single sheet.<\/p>\n<p>We can use Unstructured Data stage to pull data from excel file, when the data is spread across multiple sheets or even when excel contains data in a single sheet<\/p>\n<h2>Implementation in ETL (DataStage)<\/h2>\n<p>Unstructured Data stage can be used only in parallel job not in server or sequence job. This stage can be found under \u201cFile\u201d section in Palette in the Designer. Data can be written or read from this stage<\/p>\n<h3>Step 1:<\/h3>\n<ul>\n<li>Select New Parallel Job from Toolbar<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"841\" height=\"364\" class=\"wp-image-89674\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/01\/word-image-81.png\" \/><\/p>\n<h3>Step 2:<\/h3>\n<ul>\n<li>Select and drag an Unstructured Data stage from the File section of Palette and a target stage to write data and link source and target stages with transformer stage.<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"844\" height=\"564\" class=\"wp-image-89675\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/01\/word-image-82.png\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"237\" height=\"202\" class=\"wp-image-89676\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/01\/word-image-83.png\" \/><\/p>\n<p><strong>Source File:<\/strong> Excel spreadsheet with loan information, \u201cSBA_PPP_MasterList.xlsx\u201d<\/p>\n<p><strong>Sheet 1:<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1218\" height=\"289\" class=\"wp-image-89677\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/01\/word-image-84.png\" \/><\/p>\n<h3>Step 3:<\/h3>\n<ul>\n<li>Define properties to extract data. Below is the snapshot of properties tab.<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"846\" height=\"757\" class=\"wp-image-89678\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/01\/word-image-85.png\" \/><\/p>\n<p>By clicking on the Configure button highlighted in the above snapshot, we will get the properties window of configuration.<\/p>\n<p>Below is the snapshot for configure properties<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1246\" height=\"937\" class=\"wp-image-89679\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/01\/word-image-86.png\" \/><\/p>\n<p>Give file name of the source file along with the path in the File name field.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"302\" height=\"312\" class=\"wp-image-89680\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/01\/word-image-87.png\" \/><\/p>\n<p>In the Range expression field from the Read options, give the range of the data in the sheet<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"280\" height=\"309\" class=\"wp-image-89681\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/01\/word-image-88.png\" \/><\/p>\n<p>Before selecting Range expression, you need to select an option in Range option. The Range option property of Unstructured Data stage allows you to specify the data range either by selecting the \u201cSpecify the start row\u201d option or the \u201cSpecify the entire data range\u201d option.<\/p>\n<ul>\n<li>By selecting the Specify the start row, you need to identify the start row and Data stage identifies the end row of the range.<\/li>\n<li>By selecting the Specify the entire data range, you must specify the start and end rows of the range to be extracted.<\/li>\n<\/ul>\n<p>Using Sheet names to skip, we can skip the sheets which are not required.<\/p>\n<p>Once we select the Range option, Range expression, Sheet names to skip and Column header, click on the load button for importing the metadata.<\/p>\n<p>Click on Import to load metadata into Map tab<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"646\" height=\"379\" class=\"wp-image-89682\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/01\/word-image-89.png\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"636\" height=\"384\" class=\"wp-image-89683\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/01\/word-image-90.png\" \/><\/p>\n<h3>Step 4:<\/h3>\n<ul>\n<li>I have used the Transformer stage connected to the Unstructured Data stage to apply conditions\/transformations and to rename\/drop columns if needed and mapped to target files.<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1047\" height=\"567\" class=\"wp-image-89684\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/01\/word-image-91.png\" \/><\/p>\n<p>I have filtered the records based on \u201cDisbursement_Type\u201d, \u201cDocuSign_Status\u201d and \u201cWithdraws\u201d. You can find constraints in the above snapshot.<\/p>\n<h3>Step 5:<\/h3>\n<ul>\n<li>There are five target systems for this job design. Out of these five, two are Sequential stages and three are Unstructured Data stages.<\/li>\n<\/ul>\n<p>Job Design:<\/p>\n<p>Here DataStage job has been designed with seven stages.<\/p>\n<ul>\n<li>Unstructured Data Stage (Source)<\/li>\n<li>Transformer<\/li>\n<li>Sequential File Stage (Target)<\/li>\n<li>Unstructured Data Stage (Target)<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"966\" height=\"786\" class=\"wp-image-89685\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/01\/word-image-92.png\" \/><\/p>\n<ul>\n<li>Load Data into target Sequential File stage<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"919\" height=\"706\" class=\"wp-image-89686\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/01\/word-image-93.png\" \/><\/p>\n<ul>\n<li>Load Data into Unstructured Data stage<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"853\" height=\"802\" class=\"wp-image-89687\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/01\/word-image-94.png\" \/><\/p>\n<p>Select an Option from \u201cModify an existing file\u201d or \u201cCreate a file\u201d from Write mode in Properties tab. Then click on \u201cConfigure\u201d to select a file name.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"292\" class=\"wp-image-89688\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/01\/word-image-95.png\" \/><\/p>\n<p>Select either \u201cSpecific file\u201d or \u201cGenerate multiple files\u201d from Write method and \u201cOverwrite\u201d or \u201cCreate (Error if exists) from File update mode.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"304\" height=\"259\" class=\"wp-image-89689\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/01\/word-image-96.png\" \/> <img loading=\"lazy\" decoding=\"async\" width=\"297\" height=\"274\" class=\"wp-image-89690\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/01\/word-image-97.png\" \/><\/p>\n<p>You can set \u201cYes\u201d or \u201cNo\u201d to Adjust column width from Properties tab in Configuration Window and Set the same property for all links. You can also, rename the Sheet name in Sheet order tab<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"901\" height=\"867\" class=\"wp-image-89691\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/01\/word-image-98.png\" \/><\/p>\n<p>An Unstructured Data stage as a target system, allows multiple input links. This property allows us to write data into multiple sheets in one or more MS Excel spreadsheets.<\/p>\n<p>This process can be widely used to process enterprise data especially in financial domain like Banks etc., to join transactional or accounts data from multiple sheets or to load data into multiple sheets for reporting purposes<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>What is Unstructured Data? Structured data is represented by tables, rows, columns, attributes mostly like relational database. Many of us might have spent the better part of our careers working on structured data and it is heavily used for analytical purposes. Unstructured data is often best described in contrast to structured data. It is an&#8230;&hellip;<\/p>\n","protected":false},"author":335417,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2],"tags":[],"coauthors":[136303],"class_list":["post-89673","post","type-post","status-publish","format-standard","hentry","category-blogs"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/89673","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/335417"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=89673"}],"version-history":[{"count":3,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/89673\/revisions"}],"predecessor-version":[{"id":89693,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/89673\/revisions\/89693"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=89673"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=89673"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=89673"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=89673"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}