Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion.

Old 30th June 2009, 16:44   #1  |  Link
Daveyboyc
Registered User
 
Join Date: Feb 2009
Posts: 11
DVB Subtitles to Text Format

Hi,

I'm trying to rip subtitles from DVB .ts recordings into text.
The reason I need them as text is that I want to build a database whereby I can search through different TV programs using keyword search terms.

I can rip subpics with ProjectX and have tried SubRip, VobSub and DVDSubEdit. However, I can never OCR the files seamlessly.

The fonts on all the recordings appear the same so it should be possible to do this. Can anyone offer any advice like training a certain program to read the font?
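Since the font is the same on every recording, one way to picture "training" is a lookup table from glyph bitmaps to characters. The following is only a minimal sketch of that idea, not how any of the tools named above actually work. It assumes the subpic has already been reduced to rows of "#" (ink) and "." (background); the names split_glyphs, ocr_line and space_gap are made up for illustration.

```python
from typing import Dict, List, Tuple

Glyph = Tuple[str, ...]  # the rows of one character, e.g. ("#.", "#.", "##")


def split_glyphs(rows: List[str], blank: str = ".") -> List[Tuple[int, int]]:
    """Return (start, end) column ranges of glyphs separated by blank columns."""
    width = len(rows[0])
    spans = []
    start = None
    for x in range(width):
        is_blank = all(row[x] == blank for row in rows)
        if not is_blank and start is None:
            start = x                      # a glyph begins here
        elif is_blank and start is not None:
            spans.append((start, x))       # the glyph ended on the previous column
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


def ocr_line(rows: List[str], table: Dict[Glyph, str],
             space_gap: int = 3, blank: str = ".") -> str:
    """Look each glyph up in the table; unknown glyphs become '?'.

    A gap of at least space_gap blank columns is treated as a word space.
    The threshold is an assumption and would need tuning per font.
    """
    text = []
    prev_end = None
    for start, end in split_glyphs(rows, blank):
        if prev_end is not None and start - prev_end >= space_gap:
            text.append(" ")
        glyph = tuple(row[start:end] for row in rows)
        text.append(table.get(glyph, "?"))
        prev_end = end
    return "".join(text)
```

Every "?" in the output marks a glyph that still has to be added to the table by hand, and a badly chosen space_gap either glues words together or splits them apart.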

Regards

Dave
Old 30th June 2009, 16:51   #2  |  Link
Guest
Guest
 
Join Date: Jan 2002
Posts: 21,901
You should be able to decode them from the data. You shouldn't need to do OCR.
Old 1st July 2009, 18:48   #3  |  Link
Ghitulescu
Registered User
 
Join Date: Mar 2009
Location: Germany
Posts: 5,773
ProjectX should give you DVD-compliant subtitles. However, you also have the option of saving them as text rather than compiled (SUP).
Old 2nd July 2009, 16:39   #4  |  Link
Daveyboyc
Registered User
 
Join Date: Feb 2009
Posts: 11
Unfortunately ProjectX can't save them to text. I tried.

I can decode the files, but the reason I need to OCR is that I need the files in text format. This is essential to what I'm trying to do.
Old 2nd July 2009, 20:08   #5  |  Link
Ghitulescu
Registered User
 
Join Date: Mar 2009
Location: Germany
Posts: 5,773
ProjectX can definitely save the teletext as text. I have used ProjectX for some years now and this is exactly how I "work" Arte, which has 2 languages (FR+DE) and 4 subtitles (2 FR + 2 DE).

I cannot tell you exactly how, because my internet PC is far, far away from my video PC, which has no internet connection.

There are also subtitles in DVB format, which, like the DVD format, are bitmaps. If that's the case then do TS->DVD and rip the subtitles with SubRip.
Old 3rd July 2009, 10:26   #6  |  Link
multimediaman
MPlayer addict
 
Join Date: Dec 2008
Posts: 33
Avidemux has a TS --> SRT (OCR) function. Just find the PID for it.

DVB subtitles are not text-based, but teletext (text-based) subs can be used as well.
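For anyone unsure what "find the PID" involves: every 188-byte transport stream packet starts with the sync byte 0x47 and carries a 13-bit PID in the next two bytes. The sketch below only lists which PIDs occur in a recording and how often. It does not say which one carries the subtitles (that is declared in the PMT), and the function name count_pids is just for illustration.

```python
from collections import Counter

TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47


def count_pids(data: bytes) -> Counter:
    """Count packets per PID in a buffer of back-to-back 188-byte TS packets."""
    counts = Counter()
    for offset in range(0, len(data) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        packet = data[offset:offset + TS_PACKET_SIZE]
        if packet[0] != SYNC_BYTE:
            continue  # out of sync; a real tool would search for the next 0x47
        # The PID is the low 5 bits of byte 1 followed by all 8 bits of byte 2.
        pid = ((packet[1] & 0x1F) << 8) | packet[2]
        counts[pid] += 1
    return counts
```

A subtitle PID will usually show far fewer packets than the video or audio PIDs, which can help narrow down the candidates before trying them in Avidemux.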
Old 3rd July 2009, 11:40   #7  |  Link
Daveyboyc
Registered User
 
Join Date: Feb 2009
Posts: 11
Yeah, I've been trying to use SubRip but it's not without its problems. The fonts and subpics should be totally OCR-able as they look fine in programs such as subview etc., but SubRip sometimes can't read the lines properly.

Maybe there's a setting or something to make it work, as it does work on about half the subs.

I'm not recording teletext subs at the moment as they are analogue and soon to be obsolete. Avidemux I don't know much about; I tried doing the .ts to .srt earlier and it wouldn't let me.

any ideas?
Old 3rd July 2009, 12:17   #8  |  Link
multimediaman
MPlayer addict
 
Join Date: Dec 2008
Posts: 33
You could also convert DVB subs into VobSubs with the latest (development version) ProjectX
http://www.oozoon.de/main_en.html
Just check the VobSub export box.

VobSubs can be converted to SRT with Avidemux
Old 6th July 2009, 13:35   #9  |  Link
Daveyboyc
Registered User
 
Join Date: Feb 2009
Posts: 11
Just tried using Avidemux after using the latest ProjectX and checking idx/VobSub.

It doesn't load the damn subs. I'm selecting the vobsub->srt tool. Am I doing something wrong? What should happen?
Old 7th July 2009, 10:44   #10  |  Link
Ghitulescu
Registered User
 
Join Date: Mar 2009
Location: Germany
Posts: 5,773
For the last time:

There are two types of subtitles in a TS (DVB-S, DVB-T, DVB-C etc.): subtitles that appear within teletext (you select page 150, 888 etc.) or DVB-specific subtitles that are DVD compatible.

In the first case you already have them in text form (the font plays no role, you know, like in Notepad, and it's absurd and impossible to OCR a text unless you make a BMP of Notepad and use your OCR software on it). Use ProjectX for this.

In the second case you use one of the methods listed above, since DVB/DVD subtitles are bitmaps and they need to be OCRed.

Just test it for yourself how they are displayed on your TV to know their type.
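The type can also be read from the stream itself. As far as I know, the DVB SI specification (EN 300 468) marks a teletext stream in the PMT with descriptor tag 0x56 and a DVB bitmap subtitle stream with tag 0x59. The sketch below assumes you already have the raw descriptor bytes of one elementary stream from the PMT (extracting them is not shown), and the function name subtitle_kinds is invented for illustration.

```python
from typing import List

TELETEXT_DESCRIPTOR = 0x56    # teletext (may include subtitle pages)
SUBTITLING_DESCRIPTOR = 0x59  # DVB bitmap subtitles


def subtitle_kinds(descriptors: bytes) -> List[str]:
    """Walk a tag/length/payload descriptor loop and report subtitle-related tags."""
    kinds = []
    pos = 0
    while pos + 2 <= len(descriptors):
        tag = descriptors[pos]
        length = descriptors[pos + 1]
        if tag == TELETEXT_DESCRIPTOR:
            kinds.append("teletext")
        elif tag == SUBTITLING_DESCRIPTOR:
            kinds.append("dvb-bitmap")
        pos += 2 + length  # skip the two header bytes and the payload
    return kinds
```

A "teletext" result only means that teletext is present on that PID; whether a subtitle page is actually transmitted still has to be checked separately.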
Old 7th July 2009, 11:14   #11  |  Link
buzzqw
HDConvertToX author
 
Join Date: Nov 2003
Location: Cesena,Italy
Posts: 6,552
you can always upload a sample of ts with subs

BHH
__________________
HDConvertToX: your tool for BD backup
MultiX264: The quick gui for x264
AutoMen: The Mencoder GUI
AutoWebM: supporting WebM/VP8
Old 8th July 2009, 15:06   #12  |  Link
Daveyboyc
Registered User
 
Join Date: Feb 2009
Posts: 11
I know there are 2 types of subtitles (teletext and DVB bitmap). Teletext is pretty much soon to be dead in the UK so I'm not really interested in it. I can rip DVB subs with ProjectX as bitmaps (.sup and sup/idx) just fine. The problem lies in the OCR; I haven't found the right solution for that yet (with SubRip or anything else). The fonts on the subs are all the same for all the recordings I have made so it should be very possible. Here is a link to some .ts files.

http://www.mediafire.com/?sharekey=8...eada0a1ae8665a
Old 8th July 2009, 15:53   #13  |  Link
Ghitulescu
Registered User
 
Join Date: Mar 2009
Location: Germany
Posts: 5,773
Quote:
Originally Posted by Daveyboyc View Post
I know there are 2 types of subtitles (teletext and DVB bitmap). Teletext is pretty much soon to be dead in the UK so I'm not really interested in it. I can rip DVB subs with ProjectX as bitmaps (.sup and sup/idx) just fine. The problem lies in the OCR; I haven't found the right solution for that yet (with SubRip or anything else). The fonts on the subs are all the same for all the recordings I have made so it should be very possible. Here is a link to some .ts files.

http://www.mediafire.com/?sharekey=8...eada0a1ae8665a

You still haven't confirmed the type of subtitles you're dealing with, but I safely assume it's the DVB style, i.e. bitmaps.

In this case, if I understand your question correctly, I assume you have never had anything else OCRed; otherwise you'd know by now that every OCR process needs a helping hand from the human operator once in a while. No algorithm is perfect, not to mention that errors in transmission may affect the bitmap image the very same way they do the video image or audio track.

I cannot test the files now, only at the weekend.
Old 8th July 2009, 17:34   #14  |  Link
Daveyboyc
Registered User
 
Join Date: Feb 2009
Posts: 11
Yeah, they're DVB subtitles (bitmaps).
I have actually tried OCR and I can get it to work on some of the subpics but not all, and I don't know why. The best solution at the moment is a version of DVDSubEdit which reads pretty much all the characters but still doesn't do a perfect job, as it leaves words with too many spaces in between the lettering (e.g. "w h ere i s Jo hn Smith" etc.). The author of the program is looking into this for me, but I want to know if anybody else has a solution, as creating text files from digital TV recordings is something that I really need for what I'm trying to achieve. There must be a way!
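Until the spacing is fixed in the OCR tool itself, one possible workaround is to throw the spaces away and re-split the letters against a word list. This is only a sketch of that idea and has nothing to do with how DVDSubEdit works internally. It assumes a lowercase vocabulary is available, ignores punctuation, and the function name resegment is made up.

```python
from typing import List, Optional, Set


def resegment(text: str, vocab: Set[str]) -> Optional[str]:
    """Drop all spaces, then split the letters into the fewest vocabulary words.

    Returns None when the letters cannot be covered by words from vocab.
    """
    letters = text.replace(" ", "")
    n = len(letters)
    # best[i] is the shortest word list covering letters[:i], or None if none exists.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            word = letters[start:end]
            if prefix is not None and word.lower() in vocab:
                candidate = prefix + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return " ".join(best[n]) if best[n] is not None else None


vocab = {"where", "is", "john", "smith"}
print(resegment("w h ere i s Jo hn Smith", vocab))  # where is John Smith
```

Names and unusual words are the weak spot: anything missing from the word list makes the whole line come back as None, so such lines would still need checking by hand.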
Old 19th July 2009, 01:04   #15  |  Link
Daveyboyc
Registered User
 
Join Date: Feb 2009
Posts: 11
Guys, if any of you actually know how I can go about solving this problem of getting DVB (bitmap) subtitles into text format please let me know. Teletext is now being scrapped earlier than anticipated in the UK due to loss of revenue and will be decommissioned next year. This makes it even more important that I find a way of working with digital subtitles.

dave.
Old 22nd July 2009, 12:58   #16  |  Link
Ghitulescu
Registered User
 
Join Date: Mar 2009
Location: Germany
Posts: 5,773
If your keyword database doesn't include special words like and, or, hey, you, what, because and the like, I think you can spare yourself the effort and enter the relevant ones yourself.

Unless, of course, you'd like to search for whole phrases.
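To make the keyword idea concrete, here is a rough sketch of an index that skips exactly those words and maps every remaining word to the programmes it appears in. The stop-word list is just the examples given above, and the (programme, text) input format and the name build_index are assumptions for illustration. It does not cover whole-phrase search.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# The "special words" mentioned above; a real list would be much longer.
STOP_WORDS = {"and", "or", "hey", "you", "what", "because"}


def build_index(lines: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each non-stop-word to the set of programmes whose subtitles contain it."""
    index = defaultdict(set)
    for programme, text in lines:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(programme)
    return dict(index)
```

Looking up a keyword is then a plain dictionary access, for example index.get("london", set()).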