[Maven] How to use Tesseract

2022. 4. 8. 11:47

1. First, get the Tesseract library (Tess4J, a Java wrapper for Tesseract)

I went to the Maven repository site and picked the latest version available as of today.

https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j/5.2.0

2. Add the dependency to pom.xml.

<dependency>
	<groupId>net.sourceforge.tess4j</groupId>
	<artifactId>tess4j</artifactId>
	<version>5.2.0</version>
</dependency>

3. Create a class for testing!

import java.io.File;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class OcrTest {

    public static void main(String[] args) {
        try {
            // Load the image to read.
            File image = new File("path-omitted/res.jpg");

            Tesseract tesseract = new Tesseract();
            // Folder that contains the trained data (*.traineddata) files.
            tesseract.setDatapath("path-omitted/tessdata");
            tesseract.setLanguage("eng"); // OCR language
            tesseract.setPageSegMode(1); // page segmentation mode: 1 = automatic with OSD
            tesseract.setOcrEngineMode(1); // OCR engine mode: 1 = LSTM only
            // tesseract.setHocr(true); // flag to get the result as hOCR (HTML)
            String result = tesseract.doOCR(image);
            System.out.println(result);
        } catch (TesseractException e) {
            e.printStackTrace();
        }
    }
}

The tesseract.setDatapath(...) line points to the folder holding the trained data files that Tesseract compares against when it reads an image.

The repository linked below already provides trained data, so download the file for the language you need and put it in that folder.

https://github.com/tesseract-ocr/tessdata
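
Before running OCR, it can help to confirm that the trained data file is actually in the folder passed to setDatapath. The following is only a minimal sketch using the JDK, not part of Tess4J: the class name TessdataCheck, the method hasTrainedData, and the default folder name "tessdata" are all hypothetical names chosen for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TessdataCheck {

    // Returns true if <tessdataDir>/<language>.traineddata exists as a regular file.
    public static boolean hasTrainedData(Path tessdataDir, String language) {
        return Files.isRegularFile(tessdataDir.resolve(language + ".traineddata"));
    }

    public static void main(String[] args) {
        // Hypothetical defaults; pass the real tessdata folder and language as arguments.
        Path tessdataDir = Paths.get(args.length > 0 ? args[0] : "tessdata");
        String language = args.length > 1 ? args[1] : "eng";

        if (hasTrainedData(tessdataDir, language)) {
            System.out.println("Found " + language + ".traineddata in " + tessdataDir);
        } else {
            System.out.println("Missing " + language + ".traineddata in " + tessdataDir);
        }
    }
}
```

Running a check like this first turns a missing file into a clear message instead of an error raised from inside doOCR.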

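Opening the link shows files like the ones below; the leading part of each file name indicates the language (for example, eng.traineddata for English).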
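
Because the language code is the file name without the .traineddata suffix, the codes available in a local folder can be listed by stripping that suffix. This is a rough sketch under that assumption, using only the JDK: the class name InstalledLanguages and the default folder name "tessdata" are hypothetical. Note that non-language files following the same naming pattern (osd.traineddata, for instance) would be listed as well.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class InstalledLanguages {

    private static final String SUFFIX = ".traineddata";

    // Collects the part of each "*.traineddata" file name before the suffix, sorted.
    public static List<String> list(Path tessdataDir) throws IOException {
        try (Stream<Path> files = Files.list(tessdataDir)) {
            return files
                    .map(p -> p.getFileName().toString())
                    .filter(name -> name.endsWith(SUFFIX))
                    .map(name -> name.substring(0, name.length() - SUFFIX.length()))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default; pass the real tessdata folder as the first argument.
        Path tessdataDir = Paths.get(args.length > 0 ? args[0] : "tessdata");
        if (!Files.isDirectory(tessdataDir)) {
            System.out.println("Not a directory: " + tessdataDir);
            return;
        }
        System.out.println(list(tessdataDir));
    }
}
```

Any code printed this way is a value that can be passed to setLanguage.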