Announcing VobSub2SRT: convert .sub/.idx to .srt on Linux


ruediger.s
3rd October 2010, 22:33
VobSub2SRT is a Linux command-line tool that converts VobSub (.idx/.sub) subtitles into the .srt subtitle format. It is based on MPlayer's (http://www.mplayerhq.hu/) VobSub code and uses Tesseract (http://code.google.com/p/tesseract-ocr/) for the OCR part.

You can get the source and manual at http://github.com/ruediger/VobSub2SRT

The quality of the OCR depends heavily on the quality of the subtitles. I'm currently planning to add some preprocessing features (like rescaling) to improve the OCR accuracy.

I'm developing VobSub2SRT on Kubuntu (currently 10.04), but it should work on other Linux systems as well (and maybe even on Mac OS X).

To build vobsub2srt on Ubuntu, use:

sudo apt-get install libavutil-dev tesseract-ocr-dev tesseract-ocr-eng build-essential cmake
./configure
make
sudo make install


To convert subtitles, call:

vobsub2srt Filename

where Filename is the file name of the subtitles without the .idx/.sub extension. The result is written to Filename.srt. To get more information, pass --verbose as a parameter. With --lang langcode you can select the language stream (make sure you have the tesseract data for that language installed).
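Since vobsub2srt takes the base name without an extension, converting a whole directory means stripping .idx from each file name first. The following is only a rough sketch of such a loop, not something that ships with the tool. The function names and the VOBSUB2SRT override variable are made up for illustration; only the --verbose and --lang options come from the post above.

```shell
#!/bin/sh
# Sketch: run vobsub2srt on every .idx/.sub pair in a directory.
# VOBSUB2SRT is a hypothetical override (not part of the tool) so the
# loop can be dry-run, e.g. with VOBSUB2SRT=echo.
: "${VOBSUB2SRT:=vobsub2srt}"

# Strip the .idx extension, because vobsub2srt expects the bare base name.
basename_without_idx() {
    printf '%s\n' "${1%.idx}"
}

# Convert all pairs found in directory $1, selecting language code $2.
convert_dir() {
    dir=$1
    lang=$2
    for idx in "$dir"/*.idx; do
        [ -e "$idx" ] || continue        # glob matched nothing
        base=$(basename_without_idx "$idx")
        [ -e "$base.sub" ] || continue   # skip an .idx without its .sub
        "$VOBSUB2SRT" --verbose --lang "$lang" "$base"
    done
}
```

For example, after sourcing the file, convert_dir . de would attempt every pair in the current directory with the German stream selected; each result should then appear next to its source as Filename.srt.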

I hope this tool is useful to you and please give me some feedback.

Selur
22nd December 2010, 10:14
Nice! Thanks! I really like the idea of having a command-line VobSub to SRT converter. (If someone finds the motivation and time to make a Windows port of this, I'll integrate it into Hybrid.)

Cu Selur

bjrnfrdnnd
26th January 2011, 18:13
Hi,

I am running Mac OS X 10.6.6 on a MacBook Pro (early 2008).
I have installed MacPorts and have installed tesseract from its repositories.
The version of tesseract seems to be 3.00:

cn-b204-2:ruediger-VobSub2SRT-e46e81a bn$ tesseract -v
tesseract 3.00


The version of your code seems to be e46e81a, judging by the name of the download.

When configuring, I get the following error:

cn-b204-2:ruediger-VobSub2SRT-e46e81a bn$ ./configure
-- The C compiler identification is GNU
-- The CXX compiler identification is GNU
-- Checking whether C compiler has -isysroot
-- Checking whether C compiler has -isysroot - yes
-- Checking whether C compiler supports OSX deployment target flag
-- Checking whether C compiler supports OSX deployment target flag - yes
-- Check for working C compiler: /opt/local/bin/gcc
-- Check for working C compiler: /opt/local/bin/gcc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Checking whether CXX compiler has -isysroot
-- Checking whether CXX compiler has -isysroot - yes
-- Checking whether CXX compiler supports OSX deployment target flag
-- Checking whether CXX compiler supports OSX deployment target flag - yes
-- Check for working CXX compiler: /opt/local/bin/c++
-- Check for working CXX compiler: /opt/local/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Source: /Users/bn/Movies/ruediger-VobSub2SRT-e46e81a
-- Binary: /Users/bn/Movies/ruediger-VobSub2SRT-e46e81a/build
-- Build type: Debug
-- checking for module 'libavutil'
-- found libavutil, version 50.15.1
-- Found Tesseract: Tesseract_LIBRARIES-NOTFOUND;/opt/local/lib/libtiff.dylib
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
Tesseract_LIBRARIES (ADVANCED)
linked by target "vobsub2srt" in directory /Users/bn/Movies/ruediger-VobSub2SRT-e46e81a/src

-- Configuring incomplete, errors occurred!


I do not know what the tesseract libraries are. The tesseract installation did, however, come with training data for English. On my machine, this data is located at
/opt/local/share/tessdata
in a file named eng.traineddata.

Is there any way to get this compiled on my MacBook?
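The CMake output above reports Tesseract_LIBRARIES as NOTFOUND, so a reasonable first step is to check whether any Tesseract library files exist under the MacPorts prefix at all. This is only a diagnostic sketch: the function name is made up, and the libtesseract* file name pattern is an assumption about how the library is named on that system.

```shell
#!/bin/sh
# Sketch: list files that look like Tesseract libraries under a prefix.
# The "libtesseract*" pattern is an assumption, not taken from the thread.
find_tesseract_libs() {
    prefix=$1
    for f in "$prefix"/lib/libtesseract*; do
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

Running find_tesseract_libs /opt/local would print any matches. If something turns up, CMake itself accepts cache variables on its command line in the form -DTesseract_LIBRARIES=<path>; whether the project's ./configure wrapper forwards such options is not known from this thread.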

bjrnfrdnnd
26th January 2011, 22:36
Hi,
I am also trying to use your program to convert German subtitles.
This is on a Linux amd64 machine running Ubuntu 10.10;
the version of tesseract is 2.04-2.
The German language files are installed and can be found under
/usr/share/tesseract-ocr/tessdata

These files start with deu.

When running vobsub2srt with --lang de, I get the following:

vobsub2srt --lang de --verbose output

VobSub: Can't open IFO file
vobsub: ignoring size: 720x576
vobsub: ignoring palette: bbe20c, 0ba7cc, 101010, eaeaea, 438143, ec14ed, ebff0b, 0d617a, 7b7b7b, d1d1d1, 7b2a0e, 0d790d, 0ce60b, eaeaea, bc5a38, bbd838
vobsub: ignoring forced subs: OFF
[vobsub] subtitle (vobsubid): 0 language de
Index Count: 1
Id: 0 Lang: <no id>
Selected VOBSUB language: 0 language: de
Unable to load unicharset file /usr/share/tesseract-ocr/tessdata/ger.unicharset


For some reason, the program is looking for files starting with ger.
What can I do to make it work? I already tried installing tesseract 3.00, but that seems to be incompatible with vobsub2srt.