This project develops an application that extracts structured data (text, images, and tables) from website URLs and PDF files. It applies several parsing techniques, stores the extracted data securely, and displays it in a standardized format for consistency. The implementation integrates open-source and enterprise tools, along with document-linguistic approaches, to assess their compatibility and performance. The prototype serves as a scalable framework for testing and validating data extraction across diverse input formats.
GitHub Repository: ‣
Application: https://webpdfdataextractiontool.streamlit.app/
Hosted APIs (Deployed on Google Cloud Run): https://fastapi-service-rhtrkfwlfq-uc.a.run.app/
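
As an illustration of the "standardized format" described in the overview, the following is a minimal sketch of how a single extraction result might be represented, whether the input is a web page or a PDF. The names `ExtractionResult` and `detect_source_type` and all field names are hypothetical, chosen for this example only. They are not taken from the project's code or its hosted API.

```python
from dataclasses import dataclass, field, asdict
from urllib.parse import urlparse


@dataclass
class ExtractionResult:
    """Hypothetical standardized record for one processed source."""
    source: str                      # URL or file path that was processed
    source_type: str                 # "pdf" or "web"
    text: str = ""                   # extracted plain text
    images: list[str] = field(default_factory=list)              # image locations
    tables: list[list[list[str]]] = field(default_factory=list)  # tables as rows of cells


def detect_source_type(source: str) -> str:
    """Classify an input as a PDF or a web page (assumed rule: file extension, then scheme)."""
    parsed = urlparse(source)
    if parsed.path.lower().endswith(".pdf"):
        return "pdf"
    if parsed.scheme in ("http", "https"):
        return "web"
    raise ValueError(f"Unsupported source: {source!r}")


if __name__ == "__main__":
    src = "https://example.com/report.pdf"  # placeholder URL, not a project endpoint
    result = ExtractionResult(source=src, source_type=detect_source_type(src))
    print(asdict(result))
```

Using one record shape for both input types means the storage and display layers would not need to know which parser produced the data. That is the kind of consistency the overview refers to, though the actual schema used by the application may differ.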
