IBM Watson™ Ideas

Ability to split paragraphs inside a document

Several of our documents use a structure in which a single section holds a number of unrelated paragraphs. We'd like to split documents on paragraph marks at ingestion time.

This could be accomplished by allowing other tags (not only H1, H2, Hx...) to split a document.
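
To make the request concrete, here is a minimal sketch of what splitting on a configurable set of tags could look like. It is a client-side illustration only, written with Python's standard `html.parser`, and is not how Watson Discovery implements segmentation. The names `TagSplitter`, `split_html`, and `split_tags` are hypothetical.

```python
from html.parser import HTMLParser


class TagSplitter(HTMLParser):
    """Hypothetical helper: start a new segment at every tag in split_tags."""

    def __init__(self, split_tags):
        super().__init__()
        self.split_tags = {t.lower() for t in split_tags}
        self.segments = []
        self._current = []

    def _flush(self):
        # Join the collected text runs and collapse whitespace.
        # Joining with a space keeps a heading apart from the text that
        # follows it, but it also splits words that contain inline markup.
        text = " ".join(" ".join(self._current).split())
        if text:
            self.segments.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag in self.split_tags:
            self._flush()

    def handle_data(self, data):
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()  # emit whatever is left after the last split tag


def split_html(html, split_tags=("h1", "h2", "p")):
    """Return the text segments of html, split at the given tags."""
    parser = TagSplitter(split_tags)
    parser.feed(html)
    parser.close()
    return parser.segments


sample = "<h1>Title</h1><p>First para.</p><p>Second\n para.</p>"
print(split_html(sample))                      # splits on h1, h2 and p
print(split_html(sample, split_tags=("h1",)))  # heading-only splitting
```

With heading-only splitting the two paragraphs stay in one segment, which is the behavior this idea asks to make configurable.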

  • Renato dos Santos Leal
  • Apr 13 2018
  • Percy Shi commented
    April 20, 2018 23:34

    Using predefined tags as an option to define the desired boundary of a paragraph (passage) would be very helpful for getting a self-explanatory answer from WDS.

    The current passage-level function seems to be based largely on the standard HTML tag (<p>) and on guessing the paragraph/passage boundary from the trailing \r\n and the leading space of the following text line. This approach cannot preserve context in "malformatted" documents (most technology manuals, troubleshooting guidelines, administration guidelines, etc.), where natural language and computer language are intermingled and a common paragraph-boundary format therefore cannot be relied on.
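
The boundary heuristic described in this comment can be sketched as follows. This is only an illustration of the commenter's guess (a line break followed by an indented line starts a new paragraph), not the actual WDS implementation. The function name `guess_paragraphs` is hypothetical.

```python
import re


def guess_paragraphs(text):
    """Split text wherever a line break is followed by an indented line."""
    # The lookahead keeps the indented line itself in the next part.
    parts = re.split(r"\r?\n(?=[ \t]+\S)", text)
    # Collapse the remaining line breaks and runs of spaces in each part.
    return [" ".join(p.split()) for p in parts if p.strip()]


prose = "First line\r\ncontinues here.\r\n  Second paragraph\r\nstill second."
print(guess_paragraphs(prose))

# A manual that mixes prose with indented commands is cut in the wrong places:
manual = "Run the command:\n    ls -l\n    cd /tmp\nThen check."
print(guess_paragraphs(manual))
```

In the second example each indented command becomes its own passage and the closing sentence is merged into the last command, which is the loss of context described above.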

  • Admin
    Phil Anderson commented
    May 9, 2018 05:20

    This feature already exists: https://console.bluemix.net/docs/services/discovery/building.html#performing-segmentation

  • Renato dos Santos Leal commented
    May 9, 2018 14:13

    Hi Phil, it does exist, but only for H1 to H6. It would be helpful to support some other HTML tags as well.

  • Deepak Sekar commented
    May 15, 2018 06:14

    We need paragraph splitting based on certain break rules, similar to the segmentation available in the Watson Content Analytics custom annotation pipeline or the PDF-to-paragraphs converter in WEX. The limit of 250 is also a problem for clients with huge documents. In our use case, the largest document has 15,000 paragraphs.