How to handle huge file processing in video retrieval applications?

Rouhi · 5th September 2011, 07:44

I am researching video similarity detection. I have written a program in C++ that works on the TRECVID videos, a collection of 8300 videos. The program performs a large amount of file I/O (about 26000 files averaging 100 KB each) and processes the data inside the files. A PC cannot handle it efficiently; the program is very slow when I run it.
I have noticed that file I/O is not the bottleneck, because the HDD LED does not blink during processing. When I look at Task Manager in Windows, I see that the program uses only one CPU core: it reaches 100% on one core while the others remain unused. What's your suggestion? Cluster programming, multi-threaded multi-core programming, or Hadoop? Do you have any ideas? Have you had the same problem?

LoRd_MuldeR · 5th September 2011, 13:22

If your "similarity detection" is not I/O-bound but CPU-bound and you are not using multiple threads yet (CPU load on one single core only), then you should be able to speed things up via multi-threading. In your scenario there should be two ways, and maybe a combination of both is possible: Either you can do "coarse grained" multi-threading by processing multiple videos in parallel (given that the videos can be processed independently), or you can do "fine grained" multi-threading by parallelizing the similarity detection algorithm itself. It's impossible to give any advice on the latter without knowing the algorithm in detail, but generally you can think about processing multiple frames in parallel and/or dividing each frame into multiple partitions that can be processed in parallel...
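As a rough illustration of the "fine grained" variant, here is a minimal sketch that divides one frame into horizontal strips and processes each strip in its own thread. It assumes a compiler with C++11 `std::thread`. The `Frame` type and the functions `sumRows` and `sumFrameParallel` are made up for this example, and summing pixel values is only a stand-in for whatever per-partition work the real algorithm does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical frame type: 8-bit grayscale pixels in row-major order.
struct Frame {
    std::size_t width;
    std::size_t height;
    std::vector<std::uint8_t> pixels; // width * height entries
};

// Stand-in for the real per-partition work: sums the pixels of rows [rowBegin, rowEnd).
std::uint64_t sumRows(const Frame& f, std::size_t rowBegin, std::size_t rowEnd) {
    std::uint64_t sum = 0;
    for (std::size_t i = rowBegin * f.width; i < rowEnd * f.width; ++i)
        sum += f.pixels[i];
    return sum;
}

// Splits the frame into numParts horizontal strips and runs one thread per strip.
std::uint64_t sumFrameParallel(const Frame& f, std::size_t numParts) {
    assert(numParts > 0 && f.pixels.size() == f.width * f.height);
    std::vector<std::uint64_t> partial(numParts, 0);
    std::vector<std::thread> workers;
    for (std::size_t p = 0; p < numParts; ++p) {
        const std::size_t begin = f.height * p / numParts;
        const std::size_t end = f.height * (p + 1) / numParts;
        workers.emplace_back([&f, &partial, p, begin, end] {
            partial[p] = sumRows(f, begin, end); // each thread writes only its own slot
        });
    }
    for (auto& w : workers) w.join();

    // Combine the per-strip results after all threads have finished.
    std::uint64_t total = 0;
    for (std::uint64_t s : partial) total += s;
    return total;
}
```

Because every thread only reads the frame and writes to its own `partial[p]` slot, no locking is needed. Whether this pays off depends on how much work there is per strip compared to the cost of starting the threads.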

Rouhi · 6th September 2011, 04:18

I extract the indexes offline, so the time for extracting the indexes from each video is not critical for me (at the moment). The problem is in the search and in finding sequence matches. TRECVID has more than 8000 videos and more than 1600 queries. Handling the similarity detection of each query against all 8000 videos is very time-consuming. I am looking for a solution to this problem.
Thanks for mentioning coarse-grained vs. fine-grained threading, but my problem is that I am very new to multi-threaded programming. Is there any shortcut for upgrading a sequential program to a multi-threaded one? If there is none, what do you suggest for getting started with multithreading in C++?

7ekno · 6th September 2011, 10:23

As above, only you know your algorithm for the search; nobody can help you multi-thread that without knowing it in detail themselves ...

The first step to multi-threading is identifying the parts of your algorithm that are truly independent of any other operations, the parts that are partially independent (whose results might need to be stored in a buffer/variable) and the parts that are completely reliant on other parts ...

7ek

LoRd_MuldeR · 7th September 2011, 01:27

So if I understand correctly, you first extract some kind of "features" from the videos in your database. As this is done beforehand, it is not time-critical.

Then you have a bunch of "queries" and for each query you need to find the videos that match the query - which is done by comparing the query to the feature vectors extracted before.

Are your feature vectors stored in a "flat" structure and you simply compare each query to every feature vector?

If so, you might be able to speed this up easily by processing several queries in parallel, because they can be processed independently. Each thread would simply process one query.
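A minimal sketch of that idea, assuming C++11 `std::thread` and a flat list of feature vectors. The names `Feature`, `distance2`, `bestMatch` and `matchAll` are hypothetical, and the squared Euclidean distance is only a placeholder for the real similarity measure. Instead of starting one thread per query (which would mean over 1600 threads), a fixed number of worker threads each take one query at a time from a shared counter:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <limits>
#include <thread>
#include <vector>

using Feature = std::vector<float>; // hypothetical feature vector type

// Squared Euclidean distance; a placeholder for the real similarity measure.
float distance2(const Feature& a, const Feature& b) {
    assert(a.size() == b.size());
    float d = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Sequential search for one query: index of the closest database entry.
std::size_t bestMatch(const Feature& query, const std::vector<Feature>& db) {
    assert(!db.empty());
    std::size_t best = 0;
    float bestDist = std::numeric_limits<float>::max();
    for (std::size_t i = 0; i < db.size(); ++i) {
        const float d = distance2(query, db[i]);
        if (d < bestDist) { bestDist = d; best = i; }
    }
    return best;
}

// Runs all queries on numThreads workers; each worker handles one query at a time.
std::vector<std::size_t> matchAll(const std::vector<Feature>& queries,
                                  const std::vector<Feature>& db,
                                  unsigned numThreads) {
    assert(numThreads > 0);
    std::vector<std::size_t> result(queries.size());
    std::atomic<std::size_t> next(0); // index of the next unprocessed query
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&] {
            // fetch_add hands every query index to exactly one worker.
            for (std::size_t q = next.fetch_add(1); q < queries.size(); q = next.fetch_add(1))
                result[q] = bestMatch(queries[q], db); // db is read-only, result[q] has one writer
        });
    }
    for (auto& w : workers) w.join();
    return result;
}
```

A reasonable value for `numThreads` is `std::thread::hardware_concurrency()`, with a fallback to 1 in case it returns 0. The sequential `bestMatch` is left untouched, which is what makes this the easier route for someone new to multi-threading.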

Another option would be parallelizing the compare step itself: Divide the list of feature vectors into n sub-lists. Then, for each query, start n threads. Each thread will handle one sub-list.
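The sub-list variant could look roughly like this. Again it is only a sketch under assumptions: C++11 `std::thread`, a hypothetical `Feature` type, and squared Euclidean distance standing in for the real comparison. Each thread scans its own slice of the list and reports its local best match, and the local results are merged at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <thread>
#include <vector>

using Feature = std::vector<float>; // hypothetical feature vector type

struct Match {
    std::size_t index; // position in the feature list
    float dist;        // distance to the query
};

// Squared Euclidean distance; a placeholder for the real similarity measure.
float distance2(const Feature& a, const Feature& b) {
    assert(a.size() == b.size());
    float d = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Scans the sub-list db[begin, end) and returns its best match.
Match scanRange(const Feature& query, const std::vector<Feature>& db,
                std::size_t begin, std::size_t end) {
    Match best{begin, std::numeric_limits<float>::max()};
    for (std::size_t i = begin; i < end; ++i) {
        const float d = distance2(query, db[i]);
        if (d < best.dist) { best.index = i; best.dist = d; }
    }
    return best;
}

// Divides db into n sub-lists, scans each in its own thread, then merges.
Match bestMatchParallel(const Feature& query, const std::vector<Feature>& db, std::size_t n) {
    assert(n > 0 && !db.empty());
    std::vector<Match> partial(n, Match{0, std::numeric_limits<float>::max()});
    std::vector<std::thread> workers;
    for (std::size_t p = 0; p < n; ++p) {
        const std::size_t begin = db.size() * p / n;
        const std::size_t end = db.size() * (p + 1) / n;
        workers.emplace_back([&query, &db, &partial, p, begin, end] {
            partial[p] = scanRange(query, db, begin, end); // one writer per slot
        });
    }
    for (auto& w : workers) w.join();

    // Merge step: the overall best is the best of the per-thread results.
    Match best = partial[0];
    for (std::size_t p = 1; p < n; ++p)
        if (partial[p].dist < best.dist) best = partial[p];
    return best;
}
```

Starting n threads for every single query has a cost of its own, so with many queries and a modest list size, running whole queries in parallel is likely the better first attempt.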

But, regardless of multi-threading, you should think about organizing your feature vectors in a more "optimized" structure, so you don't have to compare the query against all of them!

I read a paper that suggests mapping the features into a "metric space" and then aggregating the individual feature vectors (indices) into a number of so-called "clusters".

For each "cluster" exactly one feature vector (index) is chosen that represents the cluster best. That one is called a "buoy".

Once you have this structure, you can first compare the query to the buoys to find the clusters that contain "suitable" feature vectors. Then do the "in-depth" search in these clusters only...
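To make the two-stage search concrete, here is a small single-threaded sketch. It is not taken from the paper: the `Cluster` layout, the function `searchClusters`, the `probe` parameter (how many of the nearest clusters get searched in depth) and the squared Euclidean distance are all assumptions made for this example. How the clusters and buoys are built in the first place is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

using Feature = std::vector<float>; // hypothetical feature vector type

struct Cluster {
    Feature buoy;                 // the feature vector that represents the cluster
    std::vector<Feature> members; // all feature vectors assigned to the cluster
};

// Squared Euclidean distance; a placeholder for the real metric.
float distance2(const Feature& a, const Feature& b) {
    assert(a.size() == b.size());
    float d = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Returns the closest member found in the `probe` clusters whose buoys are
// nearest to the query, or nullptr if those clusters have no members.
const Feature* searchClusters(const Feature& query,
                              const std::vector<Cluster>& clusters,
                              std::size_t probe) {
    assert(probe > 0);
    // Stage 1: compare the query against the buoys only.
    std::vector<std::pair<float, std::size_t>> ranked;
    for (std::size_t i = 0; i < clusters.size(); ++i)
        ranked.emplace_back(distance2(query, clusters[i].buoy), i);
    probe = std::min(probe, ranked.size());
    std::partial_sort(ranked.begin(),
                      ranked.begin() + static_cast<std::ptrdiff_t>(probe),
                      ranked.end());

    // Stage 2: in-depth search inside the selected clusters only.
    const Feature* best = nullptr;
    float bestDist = std::numeric_limits<float>::max();
    for (std::size_t k = 0; k < probe; ++k) {
        for (const Feature& m : clusters[ranked[k].second].members) {
            const float d = distance2(query, m);
            if (d < bestDist) { bestDist = d; best = &m; }
        }
    }
    return best;
}
```

Note that a search like this is approximate: the true nearest feature vector may sit in a cluster whose buoy was not among the `probe` nearest ones. A larger `probe` reduces that risk at the cost of more comparisons.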

