Author(s):
- Tian Gan
- Yongkang Wong
- Bappaditya Mandal
- Vijay Chandrasekhar
- Mohan S. Kankanhalli
Abstract:
Presentations have been an effective means of delivering information to groups for ages. Over the past few decades, technological advancements have revolutionized the way humans deliver presentations. Despite this, the quality of presentations varies widely and is affected by a variety of factors. Conventional presentation evaluation usually requires painstaking manual analysis by experts. Although expert feedback can certainly help speakers improve their presentation skills, manual evaluation is costly and often inaccessible to most people. In this work, we propose a novel multi-sensor self-quantification framework for presentations. Utilizing conventional ambient sensors (i.e., static cameras and a Kinect sensor) and emerging wearable egocentric sensors (i.e., Google Glass), we first analyze the efficacy of each type of sensor against various nonverbal assessment rubrics, and then present our multi-sensor presentation analytics framework. The proposed framework is evaluated on a new presentation dataset, namely the NUS Multi-Sensor Presentation (NUSMSP) dataset, which consists of 51 presentations covering a diverse set of topics. The dataset was recorded with ambient static cameras, a Kinect sensor, and Google Glass. In addition to the multi-sensor analytics, we conducted a user study with the speakers to verify the effectiveness of our system-generated analytics, which received positive and promising feedback.
Documentation: https://doi.org/10.1145/2733373.2806252
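The abstract describes combining per-sensor analytics (static cameras, Kinect, Google Glass) into a single set of rubric-level assessments. As a rough illustration only, the sketch below shows one common way such a combination could be done: weighted late fusion of per-sensor, per-rubric scores. This is not the authors' implementation; the class, function, rubric names, and sensor weights are all hypothetical and assume each sensor already outputs normalized scores in [0, 1].

```python
# Hypothetical sketch of weighted late fusion of per-sensor rubric scores.
# Not the paper's actual method; names and weights are illustrative only.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SensorScores:
    """Per-rubric scores (assumed normalized to [0, 1]) from one sensor."""
    sensor: str                # e.g. "static_camera", "kinect", "glass"
    scores: Dict[str, float]   # rubric name -> normalized score


def fuse_rubric_scores(per_sensor: List[SensorScores],
                       weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted average of each rubric over the sensors that observe it,
    where `weights` encodes an assumed per-sensor reliability."""
    fused: Dict[str, float] = {}
    rubrics = {r for s in per_sensor for r in s.scores}
    for rubric in rubrics:
        num, den = 0.0, 0.0
        for s in per_sensor:
            if rubric in s.scores:
                w = weights.get(s.sensor, 1.0)
                num += w * s.scores[rubric]
                den += w
        fused[rubric] = num / den if den else 0.0
    return fused


if __name__ == "__main__":
    # Toy example: three sensors, two hypothetical nonverbal rubrics.
    readings = [
        SensorScores("static_camera", {"body_gesture": 0.7, "audience_attention": 0.4}),
        SensorScores("kinect", {"body_gesture": 0.8}),
        SensorScores("glass", {"audience_attention": 0.6}),
    ]
    weights = {"static_camera": 1.0, "kinect": 1.5, "glass": 1.0}  # assumed reliabilities
    print(fuse_rubric_scores(readings, weights))
```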