How to extract text from a PDF file using iTextSharp with C#

This tutorial explains how to extract text from a PDF file using iTextSharp with C# in ASP.NET. The steps are described below.

Creating ASP.NET Empty Application

Create an empty ASP.NET Web Forms project as shown below.
Go to File → New → Project. A new window will open as shown below.
Now go to Web → Visual Studio 2012 → select .NET Framework 4.5 → select ASP.NET Empty Web Application, give the project a name, and click OK.

Creating an ASP.NET 4.5 empty project

An empty ASP.NET project will now be created. Add a new web form to the application.

Installing iTextSharp

The next step is to add a reference to iTextSharp to your application. There are two ways to do this.
First: Download from the Internet
Click the link below to download the DLL.
https://github.com/itext/itextsharp
Once the file is downloaded, extract it; you will find 6 more .rar files. Extract itextsharp-dll-core.rar as well, then add a reference to itextsharp.dll in your project.
Second: NuGet Package Manager
Go to TOOLS → Library Package Manager → Manage NuGet Packages for Solution... and a new window will open. Search for iTextSharp and click the Install button as shown below. Once it is installed successfully, you can see iTextSharp in the References folder.

Adding iTextSharp
Installing iTextSharp

You can also install it using the Package Manager Console.
Go to TOOLS → Library Package Manager → Package Manager Console, type Install-Package iTextSharp, and press Enter. This installs iTextSharp in the application.

In aspx file

In the designer file, create two button controls: the first generates the PDF file and the second extracts text from it. Add one TextBox control to display the extracted text. The designer file looks as shown below.

Aspx designer file
<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="WebForm1.aspx.cs" Inherits="WebApplication1.WebForm1" %>
 
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
<title></title>
</head>
<body>
<form id="form1" runat="server">
<div>
    <table>
        <tr>
            <td><b>Extract Text from PDF file using iTextSharp</b></td>
        </tr>
        <tr>
            <td>
                <asp:Button ID="btnGeneratePDF" runat="server" Text="Generate PDF File" OnClick="btnGeneratePDF_Click" />
            </td>
        </tr>
        <tr>
            <td>
                <asp:Button ID="btnExtract" runat="server" Text="Extract Text From PDF File" OnClick="btnExtract_Click" />
            </td>
        </tr>
        <tr>
            <td>
                <asp:TextBox ID="TextBox1" runat="server" TextMode="MultiLine" Style="width: 500px; min-height: 150px;"> 
                </asp:TextBox>
            </td>
        </tr>
    </table>
</div>
</form>
</body>
</html>

C# Code

The complete C# code is given below.

using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text; // needed for StringBuilder in the commented snippet below
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
 
namespace WebApplication1
{
public partial class WebForm1 : System.Web.UI.Page
{
protected void Page_Load(object sender, EventArgs e)
{
 
}
 
protected void btnGeneratePDF_Click(object sender, EventArgs e)
{
    // Create the PDF file next to this page (the application root in this project).
    // FileMode.Create overwrites an existing file, so no explicit delete is needed.
    FileStream fs = new FileStream(Server.MapPath("Example.pdf"), FileMode.Create);
 
    Document doc = new Document();
 
    PdfWriter.GetInstance(doc, fs);
 
    doc.Open();
 
    Paragraph page = new Paragraph("This is first page (page number 1)");
    doc.Add(page);
 
    Paragraph para1 = new Paragraph();
    Chunk c1 = new Chunk(@"This is first paragraph. This is first paragraph. This is first paragraph. This is first paragraph. This is first paragraph. This is first paragraph. This is first paragraph. This is first paragraph. This is first paragraph.");
    c1.SetBackground(BaseColor.YELLOW);
    para1.Add(c1);
    doc.Add(para1);
 
    Paragraph para2 = new Paragraph();
    Chunk c2 = new Chunk(@"This is second paragraph. This is second paragraph. This is second paragraph. This is second paragraph. This is second paragraph. This is second paragraph. This is second paragraph. This is second paragraph. This is second paragraph.");
    c2.SetBackground(BaseColor.GREEN);
    para2.Add(c2);
    doc.Add(para2);
 
    doc.Close();
}
 
protected void btnExtract_Click(object sender, EventArgs e)
{
    //string FilePath = @"H:\Demo\WebApplication1\WebApplication1\Example.pdf";
 
    string FilePath = Server.MapPath("Example.pdf");
 
    if (File.Exists(FilePath))
    {
        string ExtractedData = string.Empty;
 
        using (PdfReader reader = new PdfReader(FilePath))
        {
            ITextExtractionStrategy strategy = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
 
            // 1. if the PDF document has only one page
            // the second parameter is the 1-based PDF page number
            ExtractedData = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
 
 
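            /*// 2. if the PDF document has more than one page
            // iterate through all pages and append each page's text;
            // a new strategy per page avoids repeating text from earlier pages
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                ExtractedData += PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy()) + "\n";
            }*/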
 
 
            /*// if a single PDF page has more than one paragraph,
            // split the extracted text into lines using newline
            // (StringBuilder lives in the System.Text namespace)
            ExtractedData = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
            string[] lines = ExtractedData.Split('\n');
            StringBuilder sb = new StringBuilder();
            foreach (string line in lines)
            {
                // process each line here, e.g. sb.AppendLine(line.Trim());
            }*/
 
        }
        TextBox1.Text = ExtractedData;
    }
}
}
}

When you click the Generate PDF File button, a PDF is generated and saved in the root directory of the application. When you open the PDF file, you will see three paragraphs as shown below.

PDF file generated using iTextSharp

Now when you click Extract Text From PDF File, all the text from page one is extracted and displayed in the TextBox. You can iterate through all the pages using a for loop; that code is included, commented out, in the listing above.

Extract text from PDF using iTextSharp
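
The two commented-out ideas in the code-behind (looping over every page, and splitting the extracted text on newlines) can be pulled into a small helper. The following is only a minimal sketch: the class name PdfTextHelper and its method names are hypothetical and are not part of iTextSharp. It assumes iTextSharp 5.x, and it creates a new LocationTextExtractionStrategy for each page because a reused strategy keeps the text it collected from earlier pages.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class (the name is an assumption, not an iTextSharp type).
public static class PdfTextHelper
{
    // Returns the text of every page, in page order, one page after another.
    public static string ExtractAllText(string filePath)
    {
        StringBuilder sb = new StringBuilder();

        using (PdfReader reader = new PdfReader(filePath))
        {
            // PDF page numbers start at 1
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // New strategy per page so text from earlier pages is not repeated
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }

        return sb.ToString();
    }

    // Splits extracted text on newlines, trimming each line and dropping empty ones.
    public static List<string> SplitIntoLines(string extractedText)
    {
        List<string> result = new List<string>();
        if (string.IsNullOrEmpty(extractedText))
        {
            return result;
        }

        string[] lines = extractedText.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            string trimmed = line.Trim(); // also removes a trailing '\r'
            if (trimmed.Length > 0)
            {
                result.Add(trimmed);
            }
        }

        return result;
    }
}
```

In btnExtract_Click this could be used as TextBox1.Text = PdfTextHelper.ExtractAllText(Server.MapPath("Example.pdf"));. Note that the extractor returns the lines as they are laid out on the page, so a long paragraph that wraps in the PDF will probably come back as several lines rather than one.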

