SIFT Accordion: A Space-Time Descriptor Applied to Human Action Recognition


Bibliographic Details
Main Authors: Olfa Ben Ahmed, Mahmoud Mejdoub, Chokri Ben Amar
Format: Text
Language: English
Published: Zenodo 2011
Subjects:
Online Access:https://dx.doi.org/10.5281/zenodo.1082072
https://zenodo.org/record/1082072
Description
Summary: Recognizing human actions from videos is an active field of research in computer vision and pattern recognition. Human activity recognition has many potential applications, such as video surveillance, human-machine interaction, sports video retrieval, and robot navigation. Currently, local descriptors and bag-of-visual-words models achieve state-of-the-art performance for human action recognition. The main challenge in feature description is how to represent local motion information efficiently. Most previous works focus on extending 2D local descriptors to 3D ones in order to describe the local information around every interest point. In this paper, we propose a new spatio-temporal descriptor based on a space-time description of moving points. Our description is built on an Accordion representation of video, which is well suited to recognizing human actions from 2D local descriptors without the need for 3D extensions. We use the bag-of-words approach to represent videos. We quantize a 2D local descriptor that captures both temporal and spatial features, achieving a good compromise between computational complexity and action recognition rates. We have reached impressive results on a publicly available action dataset.
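The core idea of the Accordion representation is to map the 3D video volume onto a single 2D image so that temporally adjacent pixels become spatially adjacent, letting an ordinary 2D descriptor such as SIFT capture motion. The sketch below is a minimal illustration of one simple variant of such a 3D-to-2D geometric transform, assuming grayscale frames and a column-wise regrouping; the transform published by the authors (and by Ouni et al., whom they build on) may differ in details such as column ordering or fold direction. The function name `accordion` is chosen here for illustration.

```python
import numpy as np

def accordion(video: np.ndarray) -> np.ndarray:
    """Map a (T, H, W) grayscale video onto a single (H, W*T) image.

    Simplified variant for illustration: for each spatial column index x,
    the column x of every frame is placed side by side, so temporal
    neighbors in the video become spatial neighbors in the output image.
    """
    T, H, W = video.shape
    out = np.empty((H, W * T), dtype=video.dtype)
    for x in range(W):
        # video[:, :, x] has shape (T, H); transpose it to (H, T) and
        # drop it into T adjacent output columns.
        out[:, x * T:(x + 1) * T] = video[:, :, x].T
    return out
```

A 2D descriptor applied to this image then responds to intensity changes along the temporal axis exactly as it would to spatial edges, which is what lets the method avoid 3D descriptor extensions.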