Publications

2023
Abstract: Orcas are highly intelligent animals with a complex communication and population structure. Due to food scarcity, environmental pollution, ship noise and other factors, orca populations such as the Southern Resident Killer Whale are threatened with extinction. It is therefore important to study orcas with a focus on communication, localization and social interaction. In the past, marine biologists have listened to hundreds of hours of underwater recordings to detect possible orca calls. The goal of this work is the development and analysis of robust methods for the automatic detection of orca sounds. In this work, deep neural networks are trained on labelled audio files provided by Orcalab. For the first time, methods for augmentation, regularization, data representation and model architectures are investigated on this dataset. In addition, a novel data preprocessing method is presented that makes better use of the labelled orca sounds. Furthermore, semi-supervised methods from the image domain are transferred to the audio classification domain and adapted, making use of previously untapped unlabelled data. The best model improved the F1 score of the baseline model from 0.69 to 0.90 on the test set provided by Orcalab. On the Orca Activity sub-challenge dataset of the Interspeech ComParE Challenge 2019, the model achieved an AUC of 0.903 on the test set, an improvement of 0.037 over the baseline model.
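Example (Python): a minimal, hypothetical sketch of a spectrogram-based detector of the kind described above; the log-mel front-end with a small CNN, the sample rate and the layer sizes are illustrative assumptions, not the thesis architecture.
import torch
import torch.nn as nn
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=44100, n_fft=1024,
                                           hop_length=512, n_mels=64)

class OrcaDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1))  # single logit: orca call vs. background

    def forward(self, waveform):                 # (batch, samples)
        spec = torch.log1p(mel(waveform))        # (batch, mels, frames)
        return self.net(spec.unsqueeze(1))       # add channel dimension

logits = OrcaDetector()(torch.randn(2, 44100))   # two one-second clips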
BibTeX:
@inproceedings{Bohnhof2023daga,
  author = {Bohnhof, Nils and Perschewski, Jan-Ole and Stober, Sebastian},
  title = {Analyse von Deep Learning Methoden für eine Orca Geräusch Erkennung},
  booktitle = {49. Jahrestagung f\"{u}r Akustik DAGA 2023, Hamburg},
  address = {Hamburg, Germany},
  month = {Mar},
  publisher = {German Acoustical Society (DEGA)},
  year = {2023},
  pages = {1350--1353},
  note = {in German},
  url = {https://pub.dega-akustik.de/DAGA_2023/data/articles/000489.pdf}
}
Abstract: Voice conversion (VC) transforms an utterance to sound like another person without changing the linguistic content. A recently proposed generative adversarial network-based VC method, StarGANv2-VC, is very successful in generating natural-sounding conversions. However, the method fails to preserve the emotion of the source speaker in the converted samples. Emotion preservation is necessary for natural human-computer interaction. In this paper, we show that StarGANv2-VC fails to disentangle the speaker and emotion representations, pertinent to preserve emotion. Specifically, there is an emotion leakage from the reference audio used to capture the speaker embeddings while training. To counter the problem, we propose novel emotion-aware losses and an unsupervised method which exploits emotion supervision through latent emotion representations. The objective and subjective evaluations prove the efficacy of the proposed strategy over diverse datasets, emotions, gender, etc.
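Example (Python): a minimal sketch of what an emotion-aware loss in this spirit could look like; emotion_encoder is a hypothetical pretrained network mapping mel spectrograms to latent emotion embeddings, not the paper's exact formulation.
import torch.nn.functional as F

def emotion_consistency_loss(emotion_encoder, mel_source, mel_converted):
    e_src = emotion_encoder(mel_source)       # (batch, emb_dim)
    e_cvt = emotion_encoder(mel_converted)    # (batch, emb_dim)
    # penalize drift of the converted sample's emotion away from the source
    return 1.0 - F.cosine_similarity(e_src, e_cvt, dim=-1).mean()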
BibTeX:
@inproceedings{Das2023,
  author = {Das, Arnab and Ghosh, Suhita and Polzehl, Tim and Siegert, Ingo and Stober, Sebastian},
  title = {StarGAN-VC++: Towards Emotion Preserving Voice Conversion Using Deep Embeddings},
  booktitle = {12th ISCA Speech Synthesis Workshop (SSW2023)},
  month = {August},
  publisher = {ISCA},
  year = {2023},
  series = {ssw_2023},
  doi = {10.21437/ssw.2023-13}
}
Abstract: The detailed images produced by Magnetic Resonance Imaging (MRI) provide life-critical information for the diagnosis and treatment of prostate cancer. To provide standardized acquisition, interpretation and usage of the complex MRI images, the PI-RADS v2 guideline was proposed. An automated segmentation following the guideline facilitates consistent and precise lesion detection, staging and treatment. The guideline recommends a division of the prostate into four zones, PZ (peripheral zone), TZ (transition zone), DPU (distal prostatic urethra) and AFS (anterior fibromuscular stroma). Not every zone shares a boundary with the others, nor is every zone present in every slice. Further, the representations captured by a single model might not suffice for all zones. This motivated us to design a dual-branch convolutional neural network (CNN), where each branch captures the representations of the connected zones separately. Further, the representations from different branches complement each other at the second stage of training, where they are fine-tuned through an unsupervised loss. The loss penalises the difference in predictions from the two branches for the same class. We also incorporate multi-task learning in our framework to further improve the segmentation accuracy. The proposed approach improves the segmentation accuracy of the baseline (mean absolute symmetric distance) by 7.56%, 11.00%, 58.43% and 19.67% for PZ, TZ, DPU and AFS zones respectively.
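Example (Python): a minimal sketch of the second-stage unsupervised loss described above, penalizing disagreement between the two branches' predictions for the same class; tensor shapes and class indices are illustrative assumptions.
import torch.nn.functional as F

def branch_consistency_loss(logits_a, logits_b, shared_classes=(0,)):
    # logits_*: (batch, classes, H, W) segmentation outputs of the branches
    p_a = F.softmax(logits_a, dim=1)[:, list(shared_classes)]
    p_b = F.softmax(logits_b, dim=1)[:, list(shared_classes)]
    return F.mse_loss(p_a, p_b)  # difference in predictions, same class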
BibTeX:
@article{Das2023a,
  author = {Das, Arnab and Ghosh, Suhita and Stober, Sebastian},
  title = {PI-RADS v2 Compliant Automated Segmentation of Prostate Zones Using co-training Motivated Multi-task Dual-Path CNN},
  publisher = {arXiv},
  year = {2023},
  doi = {10.48550/ARXIV.2309.12970}
}
Abstract: The use of predictive models in education promises individual support and personalization for students. To develop trustworthy models, we need to understand what factors and causes contribute to a prediction. Thus, it is necessary to develop models that are not only accurate but also explainable. Moreover, we need to conduct holistic model evaluations that also quantify explainability or other metrics next to established performance metrics. This paper explores the use of Explainable Boosting Machines (EBMs) for the task of academic risk prediction. EBMs are an extension of Generalized Additive Models and promise state-of-the-art performance on tabular datasets while being inherently interpretable. We demonstrate the benefits of using EBMs in the context of academic risk prediction trained on online learning behavior data and show the explainability of the model. Our study shows that EBMs are equally accurate as other state-of-the-art approaches while being competitive on relevant metrics for trustworthy academic risk prediction such as earliness, stability, fairness, and faithfulness of explanations. The results encourage the broader use of EBMs for other Artificial Intelligence in education tasks.
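Example (Python): a minimal sketch of fitting an Explainable Boosting Machine with the interpret library; the synthetic data merely stands in for the online learning behavior features used in the study.
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
ebm = ExplainableBoostingClassifier(random_state=0)
ebm.fit(X_train, y_train)
print(ebm.score(X_test, y_test))  # accuracy on held-out data
# ebm.explain_global() exposes the per-feature shape functions for inspection.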
BibTeX:
@inbook{Dsilva2023,
  author = {Dsilva, Vegenshanti and Schleiss, Johannes and Stober, Sebastian},
  title = {Trustworthy Academic Risk Prediction with Explainable Boosting Machines},
  booktitle = {Artificial Intelligence in Education},
  publisher = {Springer Nature Switzerland},
  year = {2023},
  pages = {463--475},
  isbn = {9783031362729},
  doi = {10.1007/978-3-031-36272-9_38}
}
Abstract: Automatic chord recognition (ACR) is a popular task in the field of music information retrieval. The available research on ACR indicates that symbolic data receives considerably less attention than audio data. One of the main reasons for this underrepresentation is that few symbolic music datasets with adequate annotations are available. To tackle this issue, it is possible to use unsupervised techniques on datasets without chord labels for pre-training generalized input embeddings. In this paper, we use the Harmony Transformer (HT) architecture by Chen and Su in its recent version from 2021. We propose to exploit skip-grams of pitches as an unsupervised embedding technique instead of learning the input embedding as part of the network. This improves the HT such that it can make use of large amounts of unlabeled data. We conduct our experiments on the Lakh MIDI dataset and on the BPS-FH dataset, which was used in the original Harmony Transformer paper, allowing us to compare results. We also propose to use Explainable Artificial Intelligence (XAI) techniques to interpret how the model performs the chord recognition task, for example, by identifying prediction-relevant features in the input data.
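Example (Python): a minimal sketch of pre-training skip-gram embeddings over symbolic music with gensim; encoding each slice of sounding pitch classes as a token such as "0_4_7" is an illustrative assumption, not necessarily the paper's exact vocabulary.
from gensim.models import Word2Vec

# each piece is a sequence of slice tokens, e.g. "0_4_7" for a C major triad
pieces = [["0_4_7", "0_4_7", "5_9_0", "7_11_2", "0_4_7"],
          ["2_5_9", "7_11_2", "0_4_7"]]
model = Word2Vec(sentences=pieces, vector_size=32, window=4,
                 min_count=1, sg=1)  # sg=1 selects the skip-gram objective
embedding = model.wv["0_4_7"]        # fixed input embedding for the HT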
BibTeX:
@inproceedings{Ebrahimzadeh2023daga,
  author = {Ebrahimzadeh, Maral and Krug, Valerie and Stober, Sebastian},
  title = {Transformer-Based Chord Recognition with Unsupervised Pre-training of Input Embeddings},
  booktitle = {49. Jahrestagung f\"{u}r Akustik DAGA 2023, Hamburg},
  address = {Hamburg, Germany},
  month = {Mar},
  publisher = {German Acoustical Society (DEGA)},
  year = {2023},
  pages = {1374--1377},
  url = {https://pub.dega-akustik.de/DAGA_2023/data/articles/000512.pdf}
}
Abstract: Learning the harmonic structure of music is crucial for various music information retrieval (MIR) tasks. Word2vec skip-gram, a well-established technique in natural language processing, has been found to effectively learn harmonic concepts in music. It represents music slices in a vector space, preserving meaningful geometric relationships. These embeddings hold great promise as inputs for MIR tasks, particularly automatic chord recognition (ACR). However, ACR research predominantly focuses on audio data due to the limited availability of well-annotated symbolic music datasets. In this work, we propose an innovative approach utilizing the Harmony Transformer (HT) architecture by Chen and Su. Instead of incorporating input embedding within the network, we leverage skip-gram as an unsupervised embedding technique. Our experiments show that this unsupervised method produces embeddings that adeptly capture harmonic concepts. We also introduce a novel visualization method to assess the fidelity of these embeddings in representing harmonic musical concepts. We perform our experiments on the Lakh MIDI and the BPS-FH dataset.
BibTeX:
@inproceedings{Ebrahimzadeh2023ismirlbd,
  author = {Ebrahimzadeh, Maral and Krug, Valerie and Stober, Sebastian},
  title = {Improving Embeddings in Harmony Transformer},
  booktitle = {24th International Society for Music Information Retrieval Conference (ISMIR'23) - Late Breaking \& Demo Papers},
  year = {2023},
  url = {https://ismir2023program.ismir.net/lbd_346.html}
}
Abstract: Purpose: High noise levels due to low X-ray dose are a challenge in digital breast tomosynthesis (DBT) reconstruction. Deep learning algorithms show promise in reducing this noise. However, these algorithms can be complex and biased toward certain patient groups if the training data are not representative. It is important to thoroughly evaluate deep learning-based denoising algorithms before they are applied in the medical field to ensure their effectiveness and fairness. In this work, we present a deep learning-based denoising algorithm and examine potential biases with respect to breast density, thickness, and noise level. Approach: We use physics-driven data augmentation to generate low-dose images from full field digital mammography and train an encoder-decoder network. The rectified linear unit (ReLU)-loss, specifically designed for mammographic denoising, is utilized as the objective function. To evaluate our algorithm for potential biases, we tested it on both clinical and simulated data generated with the virtual imaging clinical trial for regulatory evaluation pipeline. Simulated data allowed us to generate X-ray dose distributions not present in clinical data, enabling us to separate the influence of breast types and X-ray dose on the denoising performance. Results: Our results show that the denoising performance is proportional to the noise level. We found a bias toward certain breast groups on simulated data; however, on clinical data, our algorithm denoises different breast types equally well with respect to the structural similarity index. Conclusions: We propose a robust deep learning-based denoising algorithm that reduces DBT projection noise levels and subject it to an extensive test that provides information about its strengths and weaknesses.
BibTeX:
@article{Eckert2023,
  author = {Dominik Eckert and Julia Wicklein and Magdalena Herbst and Stephan Dwars and Ludwig Ritschl and Steffen Kappler and Sebastian Stober},
  title = {{Deep learning based tomosynthesis denoising: a bias investigation across different breast types}},
  journal = {Journal of Medical Imaging},
  publisher = {SPIE},
  year = {2023},
  volume = {10},
  number = {6},
  pages = {064003},
  url = {https://doi.org/10.1117/1.JMI.10.6.064003},
  doi = {10.1117/1.JMI.10.6.064003}
}
Abstract: Speech anonymisation prevents misuse of spoken data by removing any personal identifier while preserving at least linguistic content. However, emotion preservation is crucial for natural human-computer interaction. The well-known voice conversion technique StarGANv2-VC achieves anonymisation but fails to preserve emotion. This work presents an any-to-many semi-supervised StarGANv2-VC variant trained on partially emotion-labelled non-parallel data. We propose emotion-aware losses computed on the emotion embeddings and acoustic features correlated to emotion. Additionally, we use an emotion classifier to provide direct emotion supervision. Objective and subjective evaluations show that the proposed approach significantly improves emotion preservation over the vanilla StarGANv2-VC. This considerable improvement is seen over diverse datasets, emotions, target speakers, and inter-group conversions without compromising intelligibility and anonymisation.
BibTeX:
@inproceedings{Ghosh2023,
  author = {Ghosh, Suhita and Das, Arnab and Sinha, Yamini and Siegert, Ingo and Polzehl, Tim and Stober, Sebastian},
  title = {Emo-StarGAN: A Semi-Supervised Any-to-Many Non-Parallel Emotion-Preserving Voice Conversion},
  booktitle = {INTERSPEECH 2023},
  month = {August},
  publisher = {ISCA},
  year = {2023},
  series = {interspeech_2023},
  doi = {10.21437/interspeech.2023-191}
}
Abstract: In this paper, we analyze and incorporate acoustic features in a deep learning based speaker anonymization model. Speaker anonymization aims at suppressing personally identifiable information while preserving the prosody and linguistic content. In this work, a StarGAN-based voice conversion model is used for anonymization, where a source speaker’s voice is transformed into that of a target speaker. It has typically been observed that the quality of the converted voice varies across target speakers, especially when certain acoustic properties such as pitch are very different between the source and target speakers. Choosing a target speaker dissimilar to the source speaker may lead to successful anonymization. However, it has been observed that choosing a very dissimilar target speaker often leads to a low-quality voice conversion. Therefore, we aim to improve the overall quality of the converted voice by introducing perceptual losses based on stress- and intonation-related acoustic features such as the power envelope, F0, etc. This facilitates improved anonymization and voice quality for all target speakers.
BibTeX:
@inproceedings{Ghosh2023daga,
  author = {Ghosh, Suhita and Sinha, Yamini and Siegert, Ingo and Stober, Sebastian},
  title = {Improving voice conversion for dissimilar speakers using perceptual losses},
  booktitle = {49. Jahrestagung f\"{u}r Akustik DAGA 2023, Hamburg},
  address = {Hamburg, Germany},
  month = {Mar},
  publisher = {German Acoustical Society (DEGA)},
  year = {2023},
  pages = {1358--1361},
  url = {https://pub.dega-akustik.de/DAGA_2023/data/articles/000469.pdf}
}
Abstract: Concealing the identity through speaker anonymization is essential in various situations. This study focuses on investigating how stuttering affects the anonymization process. Two scenarios are considered: preserving the pathology in the diagnostic/remote treatment context and obfuscating the pathology. The paper examines the effectiveness of three state-of-the-art approaches in achieving high anonymization, as well as the preservation of dysfluencies. The findings indicate that while a speaker conversion method may not achieve perfect anonymization (Baseline 27.25% EER and F0 Delta 32.63% EER), it does preserve the pathology. This effect was objectively evaluated by performing a stuttering classification. Although this solution may be useful in a remote treatment scenario for speech pathologies, it presents a vulnerability in anonymization. To address this issue, we propose an alternative approach that uses automatic speech recognition and text-based speech synthesis to avoid re-identification (48.27% EER).
BibTeX:
@inproceedings{Hintz2023,
  author = {Hintz, Jan and Bayerl, Sebastian and Sinha, Yamini and Ghosh, Suhita and Schubert, Martha and Stober, Sebastian and Riedhammer, Korbinian and Siegert, Ingo},
  title = {Anonymization of Stuttered Speech -- Removing Speaker Information while Preserving the Utterance},
  booktitle = {3rd Symposium on Security and Privacy in Speech Communication},
  month = {August},
  publisher = {ISCA},
  year = {2023},
  series = {spsc_2023},
  doi = {10.21437/spsc.2023-7}
}
Abstract: Musical source separation addresses the problem of decomposing a piece of music into separate tracks, for example vocals and accompaniment. Current approaches are usually based on neural networks that are trained end-to-end and applied uniformly to arbitrary pieces of music. Freely available systems such as Spleeter or Demucs already deliver good results and are usable by laypersons. However, the results sometimes contain audible artifacts. This holds in particular for musical styles that are underrepresented in the training data. We present an approach to improve the results specifically for so-called riddim albums, a tradition of Jamaican reggae in which different artists sing their own lyrics over the same musical accompaniment. We introduce methods that exploit this knowledge to achieve a better separation into vocals and accompaniment than existing systems. First, the tracks of an album are separated with conventional methods. Then, differences between the resulting accompaniments are smoothed out. One challenge is that the accompaniments used are not always exactly identical but may differ slightly. A qualitative user study finds that participants prefer our approach over the direct application of Spleeter or Demucs. This shows that application-specific improvements in source separation performance are achievable even over state-of-the-art systems.
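Example (Python): a minimal sketch of the album-level smoothing idea described above, replacing each separated accompaniment's magnitude spectrogram by the per-bin median across the album; temporal alignment of the riddim versions is assumed and not handled here.
import numpy as np
import librosa

def smooth_accompaniments(accompaniments, n_fft=2048):
    specs = [librosa.stft(y, n_fft=n_fft) for y in accompaniments]
    frames = min(s.shape[1] for s in specs)
    mags = np.stack([np.abs(s[:, :frames]) for s in specs])
    median_mag = np.median(mags, axis=0)  # consensus accompaniment magnitude
    # reuse each track's own phase with the smoothed magnitude
    return [librosa.istft(median_mag * np.exp(1j * np.angle(s[:, :frames])))
            for s in specs]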
BibTeX:
@inproceedings{Johannsmeier2023daga,
  author = {Johannsmeier, Jens and Allan, Kenneth and Stober, Sebastian},
  title = {Verbesserte Singing Voice Separation f{\"u}r Riddim-Alben},
  booktitle = {49. Jahrestagung f\"{u}r Akustik DAGA 2023, Hamburg},
  address = {Hamburg, Germany},
  month = {Mar},
  publisher = {German Acoustical Society (DEGA)},
  year = {2023},
  pages = {930--933},
  note = {in German},
  url = {https://pub.dega-akustik.de/DAGA_2023/data/articles/000517.pdf}
}
Abstract: Deep Neural Networks (DNNs) are successful but work as black-boxes. Elucidating their inner workings is crucial as DNNs are prone to reproducing data biases and potentially harm underrepresented or historically discriminated demographic groups. In this work, we demonstrate an approach for visualizing DNN activations that facilitates the visual detection of biases in learned representations. This approach displays activations as topographic maps, similar to common visualizations of brain activity. In addition to visual inspection of activations, we evaluate different measures to quantify the quality of the topographic maps. With visualization and measurement of quality, we provide qualitative and quantitative means for investigating bias in representations and demonstrate this for activations of a pre-trained image recognition model when processing images of peoples’ faces. We find biases for different sensitive variables, particularly in deeper layers of the investigated DNN, and support the subjective evaluation with a quantitative measure of visual quality.
BibTeX:
@inproceedings{krug2023aequitas,
  author = {Krug, Valerie and Olson, Christopher and Stober, Sebastian},
  title = {Visualizing Bias in Activations of Deep Neural Networks as Topographic Maps},
  booktitle = {Proceedings of the 1st Workshop on Fairness and Bias in AI (AEQUITAS 2023) co-located with 26th European Conference on Artificial Intelligence (ECAI 2023), Kraków, Poland},
  publisher = {CEUR-WS},
  year = {2023},
  url = {http://ceur-ws.org/Vol-3523/}
}
Abstract: Deep Neural Networks (DNNs) are successful but work as black-boxes. Elucidating their inner workings is crucial but a difficult task. In this work, we investigate how activity and confidence of a DNN relate in a simple Multi-Layer Perceptron. Further, we observe how activity, confidence and their relation develop during model training. For ease of visual comparison, we use a technique to display DNN activity as topographic maps, similar to common visualization of brain activity. Our results indicate that activity becomes stronger and distinguished both with training time and confidence.
BibTeX:
@inproceedings{krug2023ecml,
  author = {Krug, Valerie and Olson, Christopher and Stober, Sebastian},
  title = {Relation of Activity and Confidence when Training Deep Neural Networks},
  booktitle = {Uncertainty meets Explainability, Workshop at ECML-PKDD 2023, Torino, Italy},
  year = {2023}
}
BibTeX:
@inbook{Krug2023hhai,
  author = {Krug, Valerie and Ratul, Raihan Kabir and Olson, Christopher and Stober, Sebastian},
  title = {Visualizing Deep Neural Networks with Topographic Activation Maps},
  booktitle = {HHAI 2023: Augmenting Human Intellect},
  month = {June},
  publisher = {IOS Press},
  year = {2023},
  isbn = {9781643683959},
  doi = {10.3233/faia230080}
}
Abstract: Machine Learning with Deep Neural Networks (DNNs) has become successful in solving tasks across various fields of application. However, the complexity of DNNs makes it difficult to understand how they solve their learned task. We research techniques to lay out neurons of DNNs in a two-dimensional space such that neurons of similar activity are in the vicinity of each other. This makes it possible to visualize DNN activations as topographic maps, similar to how brain activity is commonly displayed. Our novel visualization technique improves the transparency of DNN-based decision-making systems and is interpretable without expert knowledge in Machine Learning.
BibTeX:
@inproceedings{krug2023verilearn,
  author = {Krug, Valerie and Ratul, Raihan Kabir and Olson, Christopher and Stober, Sebastian},
  title = {Visualizing Deep Neural Networks with Topographic Activation Maps},
  booktitle = {VeriLearn 2023: Workshop on Verifying Learning AI Systems, co-located with 26th European Conference on Artificial Intelligence (ECAI 2023), Kraków, Poland},
  year = {2023},
  url = {https://dtai.cs.kuleuven.be/events/VeriLearn2023/papers/VeriLearn23_paper_12.pdf}
}
Abstract: Music classification algorithms use signal processing and machine learning approaches to extract and enrich metadata for audio recordings in music archives. Common tasks include music genre classification, where each song is assigned a single label (such as Rock, Pop, or Jazz), and musical instrument classification. Since music metadata can be ambiguous, classification algorithms cannot always achieve fully accurate predictions. Therefore, our focus extends beyond the correctly estimated class labels to include realistic confidence values for each potential genre or instrument label. In practice, many state-of-the-art classification algorithms based on deep neural networks exhibit overconfident predictions, complicating the interpretation of the final output values. In this work, we examine whether the issue of overconfident predictions and, consequently, non-representative confidence values is also relevant to music genre classification and musical instrument classification. Moreover, we describe techniques to mitigate this behavior and assess the impact of deep ensembles and temperature scaling in generating more realistic confidence outputs, which can be directly employed in real-world music tagging applications.
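Example (Python): a minimal sketch of temperature scaling, one of the calibration techniques assessed in the paper; a single temperature is fitted on held-out validation logits and then divides the test logits.
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels):
    log_t = torch.zeros(1, requires_grad=True)  # learn log(T), so T > 0
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=100)
    def nll():
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss
    opt.step(nll)
    return log_t.exp().item()

# calibrated confidences: F.softmax(test_logits / T, dim=-1) with fitted T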
BibTeX:
@inproceedings{Lukashevich2023,
  author = {Lukashevich, Hanna and Grollmisch, Sascha and Abeßer, Jakob and Stober, Sebastian and Bös, Joachim},
  title = {How reliable are posterior class probabilities in automatic music classification?},
  booktitle = {Proceedings of the 18th International Audio Mostly Conference},
  month = {August},
  publisher = {ACM},
  year = {2023},
  series = {AM ’23},
  doi = {10.1145/3616195.3616228}
}
Abstract: PURPOSE. Albinism is a congenital disorder affecting pigmentation levels, structure, and function of the visual system. The identification of anatomical changes typical for people with albinism (PWA), such as optic chiasm malformations, could become an important component of diagnostics. Here, we tested an application of convolutional neural networks (CNNs) for this purpose. METHODS. We established and evaluated a CNN, referred to as CHIASM-Net, for the detection of chiasmal malformations from anatomic magnetic resonance (MR) images of the brain. CHIASM-Net, composed of encoding and classification modules, was developed using MR images of controls (n = 1708) and PWA (n = 32). Evaluation involved 8-fold cross validation involving accuracy, precision, recall, and F1-score metrics and was performed on a subset of controls and PWA samples excluded from the training. In addition to quantitative metrics, we used Explainable AI (XAI) methods that granted insights into factors driving the predictions of CHIASM-Net. RESULTS. The results for the scenario indicated an accuracy of 85 ± 14%, precision of 90 ± 14% and recall of 81 ± 18%. XAI methods revealed that the predictions of CHIASM-Net are driven by optic-chiasm white matter and by the optic tracts. CONCLUSIONS. CHIASM-Net was demonstrated to use relevant regions of the optic chiasm for albinism detection from magnetic resonance imaging (MRI) brain anatomies. This indicates the strong potential of CNN-based approaches for visual pathway analysis and ultimately diagnostics.
BibTeX:
@article{Puzniak2023,
  author = {Puzniak, Robert J. and Prabhakaran, Gokulraj T. and McLean, Rebecca J. and Stober, Sebastian and Ather, Sarim and Proudlock, Frank A. and Gottlob, Irene and Dineen, Robert A. and Hoffmann, Michael B.},
  title = {CHIASM-Net: Artificial Intelligence-Based Direct Identification of Chiasmal Abnormalities in Albinism},
  month = {October},
  journal = {Investigative Ophthalmology \& Visual Science},
  publisher = {Association for Research in Vision and Ophthalmology (ARVO)},
  year = {2023},
  volume = {64},
  number = {13},
  pages = {14},
  doi = {10.1167/iovs.64.13.14}
}
Abstract: The use of artificial intelligence (AI) is becoming increasingly important in various domains, making education about AI a necessity. The interdisciplinary nature of AI and the relevance of AI in various fields require that university instructors and course developers integrate AI topics into the classroom and create so-called domain-specific AI courses. In this paper, we introduce the “AI Course Design Planning Framework” as a course planning framework to structure the development of domain-specific AI courses at the university level. The tool builds on generic course planning frameworks and adapts them to the context of domain-specific AI education. Following a design-based research approach, we evaluated a first prototype of the tool with instructors in the field of AI education who are developing domain-specific courses in this area. The results of our evaluation indicate that the tool allows instructors to create domain-specific AI courses in an efficient and comprehensible way. In general, instructors rated the tool as useful and user-friendly and made recommendations to improve its usability. Future research will focus on testing the application of the tool for domain-specific AI course developments in different domain contexts and examine the influence of using the tool on AI course quality and learning outcomes.
BibTeX:
@article{Schleiss2023,
  author = {Schleiss, Johannes and Laupichler, Matthias Carl and Raupach, Tobias and Stober, Sebastian},
  title = {AI Course Design Planning Framework: Developing Domain-Specific AI Education Courses},
  month = {September},
  journal = {Education Sciences},
  publisher = {MDPI AG},
  year = {2023},
  volume = {13},
  number = {9},
  pages = {954},
  doi = {10.3390/educsci13090954}
}
Abstract: The integration of tools and methods of Artificial Intelligence (AI) into the engineering domain has become increasingly important, and with it comes a shift in required competencies. As a result, engineering education should now incorporate AI competencies into its courses and curricula. While interdisciplinary education at a subject level has already been explored, the development of interdisciplinary curricula often presents a challenge. This paper investigates the use of the curriculum workshop method for developing interdisciplinary, competence-oriented curricula. Using a case study of a newly developed interdisciplinary Bachelor program for AI in Engineering, the study evaluates the instrument of the curriculum workshop. The communicative methods of the tool and various aspects of its implementation through self-evaluation procedures and surveys of workshop participants are discussed. The results show that the structure and competence orientation of the method facilitate alignment among participants from different disciplinary backgrounds. However, it is also important to consolidate the mutually developed broad ideas for the curriculum design into concrete outcomes, such as a competence profile. Interdisciplinary curriculum development needs to take into account different perspectives and demands towards the curriculum, which increases complexity and requires a more structured design process. The findings of the paper highlight the importance of interdisciplinary curriculum design in engineering education and provide practical insights into the application of tools for the creation of competence-oriented curricula in curriculum workshops, thereby contributing to the development of future engineers.
BibTeX:
@article{Schleiss2023b,
  author = {Schleiss, Johannes and Manukjan, Anke and Bieber, Michelle Ines and Pohlenz, Philipp and Stober, Sebastian},
  title = {Curriculum Workshops As A Method Of Interdisciplinary Curriculum Development: A Case Study For Artificial Intelligence In Engineering},
  publisher = {European Society for Engineering Education (SEFI)},
  year = {2023},
  doi = {10.21427/XTAE-AS48}
}
Abstract: As Artificial Intelligence (AI) becomes increasingly important in engineering, instructors need to incorporate AI concepts into their subject-specific courses. However, many teachers may lack the expertise to do so effectively or don’t know where to start. To address this challenge, we have developed the AI Course Design Planning Framework to help instructors structure their teaching of domain-specific AI skills. This workshop aimed to equip participants with an understanding of the framework and its application to their courses. The workshop was designed for instructors in engineering education who are interested in interdisciplinary teaching and teaching about AI in the context of their domain. Throughout the workshop, participants worked hands-on in groups with the framework, applied it to their intended courses and reflected on the use. The workshop revealed challenges in defining domain-specific AI use cases and assessing learners’ skills and instructors’ competencies. At the same time, participants found the framework effective in early course development. Overall, the results of the workshop highlight the need for AI integration in engineering education and equipping educators with effective tools and training. It is clear that further efforts are needed to fully embrace AI in engineering education.
BibTeX:
@inproceedings{Schleiss2023a,
  author = {Schleiss, Johannes and Stober, Sebastian},
  title = {Planning Interdisciplinary Artificial Intelligence Courses For Engineering Students},
  booktitle = {European Society for Engineering Education (SEFI) 2023 Annual Conference},
  publisher = {European Society for Engineering Education (SEFI)},
  year = {2023},
  doi = {10.21427/V4ZV-HR52}
}
2022
Abstract: Over the past few years, deep Artificial Neural Networks (ANNs) have become more popular due to their great success in various tasks. However, these improvements have made them more capable but less interpretable. To overcome this issue, some introspection techniques have been proposed. Since ANNs are inspired by human brains, we adapt techniques from cognitive neuroscience to interpret them more easily. Our approach first computes characteristic network responses for groups of input examples, for example, relating to a specific error. We then use these to compare network responses between different groups. To this end, we compute representational similarity and we visualize the activations as topographic activation maps. In this work, we present a graphical user interface called CogXAI ANNalyzer to easily apply our techniques to trained ANNs and to interpret their results. Further, we demonstrate our tool using an audio ANN for speech recognition.
BibTeX:
@inproceedings{Ebrahimzadeh2022ismirlbd,
  author = {Ebrahimzadeh, Maral and Krug, Valerie and Stober, Sebastian},
  title = {CogXAI ANNalyzer: Cognitive Neuroscience Inspired Techniques for eXplainable AI},
  booktitle = {23rd International Society for Music Information Retrieval Conference (ISMIR'22) - Late Breaking \& Demo Papers},
  year = {2022},
  url = {https://archives.ismir.net/ismir2022/latebreaking/000050.pdf}
}
Abstract: Digital Breast Tomosynthesis (DBT) is becoming increasingly popular for breast cancer screening because of its high depth resolution. It uses a set of low-dose x-ray images called raw projections to reconstruct an arbitrary number of planes. These are typically used in further processing steps like backprojection to generate DBT slices or synthetic mammography images. Because of their low x-ray dose, a high amount of noise is present in the projections. In this study, the possibility of using deep learning for the removal of noise in raw projections is investigated. The impact of loss functions on detail preservation is analyzed in particular. For that purpose, training data is augmented following the physics-driven approach of Eckert et al. [1]. In this method, an x-ray dose reduction is simulated. First, pixel intensities are converted to the number of photons at the detector. Secondly, Poisson noise is enhanced in the x-ray image by simulating a decrease in the mean photon arrival rate. The Anscombe Transformation [2] is then applied to construct signal-independent white Gaussian noise. The augmented data is then used to train a neural network to estimate the noise. For training, several loss functions are considered, including the mean square error (MSE), the structural similarity index (SSIM) [3] and the perceptual loss [4]. Furthermore, the ReLU-Loss [1] is investigated, which is especially designed for mammogram denoising and prevents the network from noise overestimation. The denoising performance is then compared with respect to the preservation of small microcalcifications. Based on our current measurements, we demonstrate that the ReLU-Loss in combination with SSIM improves the denoising results.
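Example (Python): a minimal sketch of the dose-reduction augmentation as described; the gain factor converting pixel intensities to photon counts is an assumption for illustration.
import numpy as np

def simulate_low_dose(image, dose_factor=0.25, gain=1000.0, rng=None):
    rng = rng or np.random.default_rng()
    photons = image * gain                         # intensities -> photon counts
    noisy = rng.poisson(photons * dose_factor) / dose_factor
    return noisy / gain                            # back to intensity scale

def anscombe(x):
    # variance-stabilizing transform: Poisson noise -> approximately
    # signal-independent, unit-variance Gaussian noise
    return 2.0 * np.sqrt(x + 3.0 / 8.0)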
BibTeX:
@inproceedings{Eckert2022,
  author = {Dominik Eckert and Ludwig Ritschl and Magdalena Herbst and Julia Wicklein and Sulaiman Vesal and Steffen Kappler and Andreas Maier and Sebastian Stober},
  title = {Deep learning based denoising of mammographic x-ray images: an investigation of loss functions and their detail-preserving properties},
  editor = {Wei Zhao and Lifeng Yu},
  booktitle = {Medical Imaging 2022: Physics of Medical Imaging},
  month = {apr},
  publisher = {{SPIE}},
  year = {2022},
  doi = {10.1117/12.2612403}
}
Abstract: Purpose: The coronary artery calcification (CAC) score is an independent marker for the risk of cardiovascular events. Automatic methods for quantifying CAC could reduce workload and assist radiologists in clinical decision-making. However, large annotated datasets are needed for training to achieve very good model performance, which is an expensive process and requires expert knowledge. The amount of training data required can be reduced in an active learning scenario, which requires only the most informative samples to be labeled. Multitask learning techniques can improve model performance by joint learning of multiple related tasks and extraction of shared informative features. Methods: We propose an uncertainty-weighted multitask learning model for coronary calcium scoring in electrocardiogram-gated (ECG-gated), noncontrast-enhanced cardiac calcium scoring CT. The model was trained to solve the two tasks of coronary artery region segmentation (weak labels) and coronary artery calcification segmentation (strong labels) simultaneously in an active learning scenario to improve model performance and reduce the number of samples needed for training. We compared our model with a single-task U-Net and a sequential-task model as well as other state-of-the-art methods. The model was evaluated on 1275 individual patients in three different datasets (DISCHARGE, CADMAN, orCaScore), and the relationship between model performance and various influencing factors (image noise, metal artifacts, motion artifacts, image quality) was analyzed. Results: Joint learning of multiclass coronary artery region segmentation and binary coronary calcium segmentation improved calcium scoring performance. Since shared information can be learned from both tasks for complementary purposes, the model reached optimal performance with only 12% of the training data and one-third of the labeling time in an active learning scenario. We identified image noise as one of the most important factors influencing model performance along with anatomical abnormalities and metal artifacts. Conclusions: Our multitask learning approach with uncertainty-weighted loss improves calcium scoring performance by joint learning of shared features and reduces labeling costs when trained in an active learning scenario.
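Example (Python): a minimal sketch of an uncertainty-weighted two-task loss in the style of homoscedastic task weighting; whether this matches the paper's exact weighting scheme is an assumption.
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    def __init__(self, n_tasks=2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))  # learned per task

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            # down-weight noisy tasks, with a penalty against ignoring them
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total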
BibTeX:
@article{Foellmer2022,
  author = {Bernhard Föllmer and Federico Biavati and Christian Wald and Sebastian Stober and Jackie Ma and Marc Dewey and Wojciech Samek},
  title = {Active multitask learning with uncertainty-weighted loss for coronary calcium scoring},
  month = {aug},
  journal = {Medical Physics},
  publisher = {Wiley},
  year = {2022},
  doi = {10.1002/mp.15870}
}
Abstract: The C-arm Cone-Beam Computed Tomography (CBCT) increasingly plays a major role in interventions and radiotherapy. However, the slow data acquisition and high dose hinder its predominance in the clinical routine. To overcome the high-dose issue, various protocols such as sparse-view have been proposed, where a subset of projections is acquired over increased angular steps. However, applying the standard reconstruction algorithms to datasets obtained from such protocols results in volumes with severe streaking artifacts. Further, the presence of surgical instruments worsens the quality and makes the reconstructions clinically useless. High-quality pre-operative CT scans are usually acquired for diagnosis and intervention planning, which contain the required high-resolution details of the body part. In this work, we propose a deep learning-based method that incorporates the planning CT along with the sparse-sampled interventional CBCT of the same subject to produce high-quality reconstructions containing the surgical instrument. We also propose a perception-aware loss for the task, which helps the model capture the surgical instrument precisely, compared to the pixelwise mean squared error (MSE). The model using the planning CT and trained with pixelwise MSE loss improves the soft-tissue contrast and the reconstruction quality (mean PSNR) from 27.46 dB to 32.89 dB. The perception-aware loss further improves the reconstruction quality statistically significantly to 36.14 dB.
BibTeX:
@inproceedings{Ghosh2022,
  author = {Suhita Ghosh and Philipp Ernst and Georg Rose and Andreas Nürnberger and Sebastian Stober},
  title = {Towards Patient Specific Reconstruction Using Perception-Aware {CNN} and Planning {CT} as Prior},
  booktitle = {2022 {IEEE} 19th International Symposium on Biomedical Imaging ({ISBI})},
  month = {mar},
  publisher = {{IEEE}},
  year = {2022},
  doi = {10.1109/isbi52829.2022.9761462}
}
Abstract: This paper presents the ongoing efforts on voice anonymization with the purpose to securely anonymize a speaker’s identity in a hotline call scenario. Our hotline seeks to provide help by remote assessment, treatment and prevention against child sexual abuse in Germany. The presented work originates from the joint contribution to the VoicePrivacy Challenge 2022 and the Symposium on Security and Privacy in Speech Communication in 2022. Having analyzed in depth the results of the first instantiation of the Voice Privacy Challenge in 2020, the current experiments aim to improve the robustness of two distinct components of the challenge baseline. First, we analyze ASR embeddings, in order to present a more precise and resistant representation of the source speech that is used in the challenge baseline GAN. First experiments using wav2vec show promising results. Second, to alleviate modeling and matching of source and target speaker characteristics, we propose to exchange the baseline x-vector speaker identity features with the more robust ECAPA-TDNN embedding, in order to leverage its higher-resolution multi-scale architecture. Also, improving on ECAPA-TDNN, we propose to extend the model architecture by integrating SE-Res2NeXt units, with the expectation that, by representing features at various scales using a cutting-edge building block for CNNs, the latter will perform better than the SE-Res2Net block, which creates hierarchical residual-like connections within a single residual block to represent features at multiple scales. This expands the range of receptive fields for each network layer and depicts multi-scale features at a finer level. Ultimately, when including a more precise speaker identity embedding, we expect to reach improvements for future anonymization for various application cases.
BibTeX:
@inproceedings{Khamsehashari2022,
  author = {Khamsehashari, Razieh and Sinha, Yamini and Hintz, Jan and Ghosh, Suhita and Polzehl, Tim and Franzreb, Carlos and Stober, Sebastian and Siegert, Ingo},
  title = {Voice Privacy - leveraging multi-scale blocks with ECAPA-TDNN SE-Res2NeXt extension for speaker anonymization},
  booktitle = {2nd Symposium on Security and Privacy in Speech Communication},
  month = {September},
  publisher = {ISCA},
  year = {2022},
  series = {spsc_2022},
  doi = {10.21437/spsc.2022-8}
}
Abstract: Machine Learning with Deep Neural Networks (DNNs) has become a successful tool in solving tasks across various fields of application. The success of DNNs is strongly connected to their high complexity in terms of the number of network layers or of neurons in each layer, which severely complicates understanding how DNNs solve their learned task. To improve the explainability of DNNs, we adapt methods from neuroscience because this field has a rich experience in analyzing complex and opaque systems. In this work, we draw inspiration from how neuroscience uses topographic maps to visualize the activity of the brain when it performs certain tasks. Transferring this approach to DNNs can help to visualize and understand their internal processes more intuitively, too. However, the inner structures of brains and DNNs differ substantially. Therefore, to be able to visualize activations of neurons in DNNs as topographic maps, we research techniques to lay out the neurons in a two-dimensional space in which neurons of similar activity are in the vicinity of each other. In this work, we introduce and compare different methods to obtain a topographic layout of the neurons in a network layer. Moreover, we demonstrate how to use the resulting topographic activation maps to identify errors or encoded biases in DNNs or data sets. Our novel visualization technique improves the transparency of DNN-based algorithmic decision-making systems and is accessible to a broad audience because topographic maps are intuitive to interpret without expert knowledge in Machine Learning.
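Example (Python): one possible way to obtain such a layout, treating each neuron's activation pattern over a set of inputs as its feature vector and embedding the neurons into 2D with t-SNE; the paper compares several layouting methods, for which t-SNE is only a stand-in here.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

activations = np.random.rand(500, 64)  # (inputs, neurons), placeholder data
coords = TSNE(n_components=2, perplexity=15,
              random_state=0).fit_transform(activations.T)  # neurons in 2D
# color the layout by each neuron's activation for one particular input
plt.scatter(coords[:, 0], coords[:, 1], c=activations[0], cmap="coolwarm")
plt.title("Topographic activation map (input 0)")
plt.show()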
BibTeX:
@article{Krug2022,
  author = {Andreas Krug and Raihan Kabir Ratul and Sebastian Stober},
  title = {Visualizing Deep Neural Networks with Topographic Activation Maps},
  month = {April},
  journal = {arXiv preprint arXiv:2204.03528},
  year = {2022},
  url = {http://arxiv.org/abs/2204.03528}
}
Abstract: Predictive coding networks (PCNs) have an inherent degree of biological plausibility and can perform approximate backpropagation of error in supervised learning settings. However, it is less clear how predictive coding compares to state-of-the-art architectures, such as VAEs, in unsupervised and probabilistic settings. We propose a PCN that, inspired by generalized predictive coding in neuroscience, parameterizes hierarchical distributions of latent states under the Laplace approximation and maximises model evidence via iterative inference using locally computed error signals. Unlike its inspiration, it uses multi-layer neural networks with nonlinearities between latent distributions. We compare our model to VAE and VLAE baselines on three different image datasets and find that generalized predictive coding shows performance comparable to variational autoencoders trained with exact error backpropagation. Finally, we investigate the possibility of learning temporal dynamics via static prediction by encoding sequential observations in generalized coordinates of motion.
BibTeX:
@inproceedings{ofner2022svrhm,
  author = {Ofner, Andr{\'e} and Millidge, Beren and Stober, Sebastian},
  title = {Generalized Predictive Coding: Bayesian Inference in Static and Dynamic Models},
  booktitle = {NeurIPS 2022 Workshop on Shared Visual Representations in Human \& Machine Intelligence (SVRHM'22)},
  year = {2022},
  url = {https://openreview.net/forum?id=qaT_CByg1X5}
}
BibTeX:
@inbook{Ofner2022,
  author = {Ofner, André and Stober, Sebastian},
  title = {Deep Neural Networks and Auditory Imagery},
  booktitle = {Music and Mental Imagery},
  month = {November},
  publisher = {Routledge},
  year = {2022},
  pages = {112--122},
  isbn = {9780429330070},
  doi = {10.4324/9780429330070-12}
}
Abstract: Most deep learning models are known to be black-box models due to their overwhelming complexity. One approach to make models more interpretable is to reduce the representations to a finite number of objects. This can be achieved by clustering latent spaces or training models which include quantization by design, such as the Vector Quantised-Variational AutoEncoder (VQ-VAE). However, if the architecture is not chosen carefully, a phenomenon called index collapse can be observed. Here, a large part of the codebook containing the prototypes is left unused, which decreases the achievable performance. Approaches to circumvent this either rely on data-dependent initialization or on decreasing the dimensionality of the codebook vectors. In this paper, we present a novel variant of the VQ-VAE, the Neural-Gas VAE, which adapts the codebook loss inspired by neural gas to avoid index collapse. We show that the Neural-Gas VAE achieves competitive performance on CIFAR and Speech Commands for different codebook sizes and dimensions. Moreover, we show that the resulting architecture learns a meaningful latent space and topology for both features and objects.
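Example (Python): a minimal sketch of codebook quantization plus a rank-weighted, neural-gas-style adaptation in which every code is updated with a strength that decays with its distance rank, so no code can starve; this gestures at the idea but is not the paper's exact codebook loss.
import torch

def quantize(z, codebook):
    # z: (batch, dim); codebook: (K, dim) -> nearest prototype per input
    idx = torch.cdist(z, codebook).argmin(dim=1)
    return codebook[idx], idx

def neural_gas_update(z, codebook, lr=0.05, lam=1.0):
    with torch.no_grad():
        for x in z:
            dists = torch.cdist(codebook, x[None]).squeeze(1)  # (K,)
            ranks = dists.argsort().argsort().float()          # 0 = closest
            weights = torch.exp(-ranks / lam)                  # rank decay
            codebook += lr * weights[:, None] * (x[None] - codebook)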
BibTeX:
@inproceedings{Perschewski2022,
  author = {Jan-Ole Perschewski and Sebastian Stober},
  title = {Neural-Gas {VAE}},
  booktitle = {International Conference on Artificial Neural Networks (ICANN'22)},
  publisher = {Springer International Publishing},
  year = {2022},
  series = {LNCS},
  volume = {13529},
  pages = {292--303},
  doi = {10.1007/978-3-031-15919-0_25}
}
BibTeX:
@inproceedings{Schleiss2022b,
  author = {Johannes Schleiss and Michelle Ines Bieber and Anke Manukjan and Lars Kellner and Sebastian Stober},
  title = {An Interdisciplinary Competence Profile for AI in Engineering},
  booktitle = {Proceedings of the 50th European Society for Engineering Education (SEFI) Annual Conference},
  year = {2022}
}
BibTeX:
@inbook{Schleiss2022a,
  author = {Schleiss, Johannes and Brockhoff, Robert and Stober, Sebastian},
  title = {Projektseminar "Künstliche Intelligenz in den Neurowissenschaften" – interdisziplinäre und anwendungsnahe Lehre umsetzen},
  editor = {Mah, D.-K. and Torner, C.},
  booktitle = {Anwendungsorientierte Hochschullehre zu Künstlicher Intelligenz. Impulse aus dem Fellowship-Programm zur Integration von KI-Campus-Lernangeboten},
  address = {Berlin},
  publisher = {KI-Campus},
  year = {2022},
  pages = {23--31},
  doi = {10.5281/zenodo.7319832}
}
Abstract: The rise of Artificial Intelligence in Education opens up new possibilities for analysis of student data. However, the protection of private data in these applications is a major challenge. According to data regulations, the application designer is responsible for technical and organizational measures to ensure privacy. This paper aims to guide developers of educational platforms to make informed decisions about their use of privacy-preserving ML and, therefore, protect their student data.
BibTeX:
@incollection{Schleiss2022,
  author = {Johannes Schleiss and Kolja Günther and Sebastian Stober},
  title = {Protecting Student Data in~{ML} Pipelines: An Overview of~Privacy-Preserving {ML}},
  booktitle = {Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners' and Doctoral Consortium},
  publisher = {Springer International Publishing},
  year = {2022},
  pages = {532--536},
  doi = {10.1007/978-3-031-11647-6_109}
}
BibTeX:
@inproceedings{Schleiss2022c,
  author = {Johannes Schleiss and Julia Hense and Andreas Kist and Jörn Schlingensiepen and Sebastian Stober},
  title = {Teaching AI Competencies in Engineering using Projects and Open Educational Resources},
  booktitle = {Proceedings of the 50th European Society for Engineering Education (SEFI) Annual Conference},
  year = {2022}
}
Abstract: Nearly all state-of-the-art vision models are sensitive to image rotations. Existing methods often compensate for missing inductive biases by using augmented training data to learn pseudo-invariances. Alongside the resource-demanding data inflation process, predictions often poorly generalize. The inductive biases inherent to convolutional neural networks allow for translation equivariance through kernels acting parallel to the horizontal and vertical axes of the pixel grid. This inductive bias, however, does not allow for rotation equivariance. We propose a radial beam sampling strategy along with radial kernels operating on these beams to inherently incorporate center-rotation covariance. Together with an angle distance loss, we present a radial beam-based image canonicalization model, short BIC. Our model allows for maximal continuous angle regression and canonicalizes arbitrary center-rotated input images. As a pre-processing model, this enables rotation-invariant vision pipelines with model-agnostic rotation-sensitive downstream predictions. We show that our end-to-end trained angle regressor is able to predict continuous rotation angles on several vision datasets, i.e. FashionMNIST, CIFAR10, COIL100, and LFW.
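Example (Python): a minimal sketch of radial beam sampling, reading pixel values along beams that radiate from the image center so that the sampled representation rotates with the image; beam and point counts are illustrative.
import numpy as np

def radial_beams(image, n_beams=32, n_points=24):
    h, w = image.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    radius = min(cy, cx)                    # stay inside the image
    angles = np.linspace(0.0, 2.0 * np.pi, n_beams, endpoint=False)
    rs = np.linspace(0.0, radius, n_points)
    ys = np.rint(cy + np.outer(np.sin(angles), rs)).astype(int)
    xs = np.rint(cx + np.outer(np.cos(angles), rs)).astype(int)
    return image[ys, xs]                    # (n_beams, n_points[, channels])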
BibTeX:
@article{Schmidt2022,
  author = {Johann Schmidt and Sebastian Stober},
  title = {Learning Continuous Rotation Canonicalization with Radial Beam Sampling},
  month = {June},
  journal = {arXiv preprint arXiv:2206.10690},
  year = {2022},
  url = {http://arxiv.org/abs/2206.10690},
  doi = {10.48550/ARXIV.2206.10690}
}
Abstract: Goal-directed actions frequently require a balance between antagonistic processes (e.g., executing and inhibiting a response), often showing an interdependency concerning what constitutes goal-directed behavior. While an interdependency of antagonistic actions is well described at a behavioral level, a possible interdependency of underlying processes at a neuronal level is still enigmatic. However, if there is an interdependency, it should be possible to predict the neurophysiological processes underlying inhibitory control based on the neural processes underlying speeded automatic responses. Based on that rationale, we applied artificial intelligence and source localization methods to human EEG recordings from N = 255 participants undergoing a response inhibition experiment (Go/Nogo task). We show that the amplitude and timing of scalp potentials and their functional neuroanatomical sources during inhibitory control can be inferred by conditional generative adversarial networks (cGANs) using neurophysiological data recorded during response execution. We provide insights into possible limitations in the use of cGANs to delineate the interdependency of antagonistic actions on a neurophysiological level. Nevertheless, artificial intelligence methods can provide information about interdependencies between opposing cognitive processes on a neurophysiological level with relevance for cognitive theory.
BibTeX:
@article{Vahid2022,
  author = {Amirali Vahid and Moritz Mückschel and Sebastian Stober and Ann-Kathrin Stock and Christian Beste},
  title = {Conditional generative adversarial networks applied to {EEG} data can inform about the inter-relation of antagonistic behaviors on a neural level},
  month = {feb},
  journal = {Communications Biology},
  publisher = {Springer Science and Business Media {LLC}},
  year = {2022},
  volume = {5},
  number = {1},
  doi = {10.1038/s42003-022-03091-8}
}
2021
Abstract: The concerns over radiation-related health risks associated with the increasing use of computed tomography (CT) have accelerated the development of low-dose strategies. There is a higher need for low dosage in interventional applications as repeated scanning is performed. However, using the noisier and under-sampled low-dose datasets, the standard reconstruction algorithms produce low-resolution images with severe streaking artifacts. This adversely affects CT-assisted interventions. Recently, variational autoencoders (VAEs) have achieved state-of-the-art results for the reconstruction of high-fidelity images. The existing VAE approaches typically use the mean squared error (MSE) as the loss, because it is convex and differentiable. However, pixel-wise MSE does not capture the perceptual quality difference between the target and model predictions. In this work, we propose two simple but effective MSE-based perception-aware losses, which facilitate a better reconstruction quality. The proposed losses are motivated by perceptual fidelity measures used in image quality assessment. One of the losses involves calculation of the MSE in the spectral domain. The other involves calculation of the MSE in the pixel space and the Laplacian of Gaussian transformed domain. We use a hierarchical vector-quantized VAE equipped with the perception-aware losses for the artifact removal task. The best performing perception-aware loss improves the structural similarity index measure (SSIM) from 0.74 to 0.80. Further, we provide an analysis of the role of the pertinent components of the architecture in the denoising and artifact removal task.
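Example (Python): minimal sketches of the two loss variants described above, an MSE between FFT magnitudes and a pixel MSE plus an MSE in a Laplacian-filtered domain; the 3x3 Laplacian kernel is a crude stand-in for the Laplacian of Gaussian used in the paper.
import torch
import torch.nn.functional as F

def spectral_mse(pred, target):
    # MSE between FFT magnitudes; pred/target: (batch, 1, H, W)
    return F.mse_loss(torch.fft.fft2(pred).abs(), torch.fft.fft2(target).abs())

LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def pixel_plus_log_mse(pred, target):
    lap = lambda x: F.conv2d(x, LAPLACIAN, padding=1)
    return F.mse_loss(pred, target) + F.mse_loss(lap(pred), lap(target))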
BibTeX:
@inproceedings{Ghosh2021,
  author = {Suhita Ghosh and Andreas Krug and Georg Rose and Sebastian Stober},
  title = {Perception-Aware Losses Facilitate {CT} Denoising and Artifact Removal},
  booktitle = {2021 {IEEE} 2nd International Conference on Human-Machine Systems ({ICHMS})},
  month = {sep},
  publisher = {{IEEE}},
  year = {2021},
  doi = {10.1109/ichms53169.2021.9582444}
}
BibTeX:
@techreport{johannsmeier2021dcase,
  author = {Johannsmeier, Jens and Stober, Sebastian},
  title = {Few-shot bioacoustic event detection via segmentation using prototypical networks},
  school = {Detection and Classification of Acoustic Scenes and Events (DCASE)},
  year = {2021},
  url = {https://dcase.community/documents/challenge2021/technical_reports/DCASE2021_Johannsmeier_69_task5.pdf}
}
Abstract: Deep Learning-based Automatic Speech Recognition (ASR) models are very successful, but hard to interpret. To gain a better understanding of how Artificial Neural Networks (ANNs) accomplish their tasks, several introspection methods have been proposed. However, established introspection techniques are mostly designed for computer vision tasks and rely on the data being visually interpretable, which limits their usefulness for understanding speech recognition models. To overcome this limitation, we developed a novel neuroscience-inspired technique for visualizing and understanding ANNs, called Saliency-Adjusted Neuron Activation Profiles (SNAPs). SNAPs are a flexible framework to analyze and visualize Deep Neural Networks that does not depend on visually interpretable data. In this work, we demonstrate how to utilize SNAPs for understanding fully-convolutional ASR models. This includes visualizing acoustic concepts learned by the model and the comparative analysis of their representations in the model layers.
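Example (Python): the core of a plain Neuron Activation Profile, the characteristic (here: mean) activation of each neuron over a group of inputs, which can then be compared between groups; the saliency adjustment that distinguishes SNAPs is omitted.
import numpy as np

def activation_profile(activations):
    # activations: (n_inputs_in_group, n_neurons) for one layer
    return activations.mean(axis=0)

group_a = np.random.rand(100, 256)  # placeholder layer activations
group_b = np.random.rand(120, 256)
difference = activation_profile(group_a) - activation_profile(group_b)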
BibTeX:
@article{Krug2021,
  author = {Andreas Krug and Maral Ebrahimzadeh and Jost Alemann and Jens Johannsmeier and Sebastian Stober},
  title = {Analyzing and Visualizing Deep Neural Networks for Speech Recognition with Saliency-Adjusted Neuron Activation Profiles},
  month = {jun},
  journal = {Electronics},
  publisher = {{MDPI} {AG}},
  year = {2021},
  volume = {10},
  number = {11},
  pages = {1350},
  doi = {10.3390/electronics10111350}
}
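The core SNAP computation can be pictured as activations weighted by a gradient-based saliency and averaged over a group of inputs. The following PyTorch sketch is a simplified reading of that idea; the hook-based capture is standard, but the alignment and normalisation steps of the published method are omitted:

import torch

def snap_profile(model, layer, inputs, targets, loss_fn):
    # Capture the layer's activations during the forward pass.
    store = {}
    handle = layer.register_forward_hook(lambda m, i, o: store.update(a=o))
    loss = loss_fn(model(inputs), targets)
    # Saliency of every activation with respect to the prediction loss.
    grads = torch.autograd.grad(loss, store["a"])[0]
    handle.remove()
    # Weight activations by absolute saliency, then average over the group.
    return (store["a"] * grads.abs()).mean(dim=0).detach()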
Abstract: Various convolutional neural network (CNN) based concepts have been introduced for the prostate’s automatic segmentation and its coarse subdivision into transition zone (TZ) and peripheral zone (PZ). However, when targeting a fine-grained segmentation of TZ, PZ, distal prostatic urethra (DPU) and the anterior fibromuscular stroma (AFS), the task becomes more challenging and has not yet been solved at the level of human performance. One reason might be the insufficient amount of labeled data for supervised training. Therefore, we propose to apply a semi-supervised learning (SSL) technique named uncertainty-aware temporal self-learning (UATS) to overcome the expensive and time-consuming manual ground truth labeling. We combine the SSL techniques temporal ensembling and uncertainty-guided self-learning to benefit from unlabeled images, which are often readily available. Our method significantly outperforms the supervised baseline, obtaining a Dice coefficient (DC) of up to 78.9%, 87.3%, 75.3% and 50.6% for TZ, PZ, DPU and AFS, respectively. The obtained results are in the range of human inter-rater performance for all structures. Moreover, we investigate the method’s robustness against noise and demonstrate its generalization capability for varying ratios of labeled data and on other challenging tasks, namely hippocampus and skin lesion segmentation. UATS achieved superior segmentation quality compared to the supervised baseline, particularly for minimal amounts of labeled data.
BibTeX:
@article{Meyer2021,
  author = {Anneke Meyer and Suhita Ghosh and Daniel Schindele and Martin Schostak and Sebastian Stober and Christian Hansen and Marko Rak},
  title = {Uncertainty-aware temporal self-learning ({UATS}): Semi-supervised learning for segmentation of prostate zones and beyond},
  month = {jun},
  journal = {Artificial Intelligence in Medicine},
  publisher = {Elsevier {BV}},
  year = {2021},
  volume = {116},
  pages = {102073},
  doi = {10.1016/j.artmed.2021.102073}
}
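The two SSL ingredients named in the abstract, temporal ensembling and uncertainty-guided self-learning, can be sketched as follows (PyTorch-style; the EMA weight and entropy threshold are illustrative assumptions):

import torch

def update_ensemble(ensemble_probs, epoch_probs, alpha=0.6):
    # Temporal ensembling: exponential moving average over training epochs.
    return alpha * ensemble_probs + (1.0 - alpha) * epoch_probs

def uncertainty_gated_pseudo_labels(ensemble_probs, max_entropy=0.1):
    # probs has shape (batch, classes, ...) and sums to 1 over dim 1;
    # keep only voxels whose predictive entropy is below the threshold.
    eps = 1e-8
    entropy = -(ensemble_probs * (ensemble_probs + eps).log()).sum(dim=1)
    mask = entropy < max_entropy          # confident voxels only
    return ensemble_probs.argmax(dim=1), mask  # use mask in the SSL loss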
Abstract: There is an increasing convergence between biologically plausible computational models of inference and learning with local update rules and the global gradient-based optimization of neural network models employed in machine learning. One particularly exciting connection is the correspondence between the locally informed optimization in predictive coding networks and the error backpropagation algorithm that is used to train state-of-the-art deep artificial neural networks. Here we focus on the related, but still largely under-explored connection between precision weighting in predictive coding networks and the Natural Gradient Descent algorithm for deep neural networks. Precision-weighted predictive coding is an interesting candidate for scaling up uncertainty-aware optimization — particularly for models with large parameter spaces — due to its distributed nature of the optimization process and the underlying local approximation of the Fisher information metric, the adaptive learning rate that is central to Natural Gradient Descent. Here, we show that hierarchical predictive coding networks with learnable precision indeed are able to solve various supervised and unsupervised learning tasks with performance comparable to global backpropagation with natural gradients and outperform their classical gradient descent counterpart on tasks where high amounts of noise are embedded in data or label inputs. When applied to unsupervised auto-encoding of image inputs, the deterministic network produces hierarchically organized and disentangled embeddings, hinting at the close connections between predictive coding and hierarchical variational inference.
BibTeX:
@article{Ofner2021,
  author = {Andre Ofner and Raihan Kabir Ratul and Suhita Ghosh and Sebastian Stober},
  title = {Predictive coding, precision and natural gradients},
  month = {November},
  journal = {arXiv preprint arXiv:2111.06942},
  year = {2021},
  url = {http://arxiv.org/abs/2111.06942}
}
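The claimed correspondence can be made concrete with the standard precision-weighted free energy; the notation below is generic textbook predictive coding, not copied from the paper:

\epsilon_\ell = x_\ell - g_\ell(x_{\ell+1}), \qquad
F = \tfrac{1}{2} \sum_\ell \left( \epsilon_\ell^{\top} \Pi_\ell \, \epsilon_\ell - \log \lvert \Pi_\ell \rvert \right)

Minimising F rescales each local error gradient by the learned precision \Pi_\ell, which acts as a local approximation of the Fisher information metric and hence plays the role of the adaptive rescaling in Natural Gradient Descent.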
Abstract: Humans efficiently extract relevant information from complex auditory stimuli. Oftentimes, the interpretation of the signal is ambiguous and musical meaning is derived from the subjective context. Predictive processing interpretations of brain function describe subjective music experience as driven by hierarchical precision-weighted expectations. Efficient and structurally interpretable machine learning models operating on audio that feature such biological plausibility are still lacking. We therefore propose a bio-plausible predictive coding model that analyses auditory signals in comparison to a continuously updated differentiable generative model. For this, we discuss and build upon the connections between Infinite Impulse Response filters, Kalman filters, and the inference in predictive coding with local update rules. Our results show that such gradient-based predictive coding is useful for classical digital signal processing applications like audio filtering. We test the model's capability on beat tracking and audio filtering tasks and conclude by showing how top-down expectations modulate the activity on lower layers during prediction.
BibTeX:
@inproceedings{Ofner2021cmmr,
  author = {Ofner, André and Schleiss, Johannes and Stober, Sebastian},
  title = {Hierarchical {Predictive} {Coding} and {Interpretable} {Audio} {Analysis}-{Synthesis}},
  booktitle = {15th International Symposium on Computer Music Multidisciplinary Research (CMMR'21)},
  year = {2021},
  url = {https://cmmr2021.github.io/proceedings/pdffiles/cmmr2021_25.pdf}
}
Abstract: We present PredProp, a method for bidirectional, parallel and local optimisation of weights, activities and precision in neural networks. PredProp jointly addresses inference and learning, scales learning rates dynamically and weights gradients by the curvature of the loss function by optimizing prediction error precision. PredProp optimizes network parameters with Stochastic Gradient Descent and error forward propagation based strictly on prediction errors and variables locally available to each layer. Neighboring layers optimise shared activity variables so that prediction errors can propagate forward in the network, while predictions propagate backwards. This process minimises the negative Free Energy, or evidence lower bound, of the entire network. We show that networks trained with PredProp resemble gradient-based predictive coding when the number of weights between neighboring activity variables is one. In contrast to related work, PredProp generalizes towards backward connections of arbitrary depth and optimizes precision for any deep network architecture. Due to the analogy between prediction error precision and the Fisher information for each layer, PredProp implements a form of Natural Gradient Descent. When optimizing DNN models, layer-wise PredProp renders the model a bidirectional predictive coding network. Alternatively, DNNs can parameterize the weights between two activity variables. We evaluate PredProp for dense DNNs on simple inference, learning and combined tasks. We show that, without an explicit sampling step in the network, PredProp implements a form of variational inference that allows learning disentangled embeddings from small amounts of data; evaluation on more complex tasks and datasets is left to future work.
BibTeX:
@article{Ofner2021a,
  author = {André Ofner and Sebastian Stober},
  title = {PredProp: Bidirectional Stochastic Optimization with Precision Weighted Predictive Coding},
  month = {November},
  journal = {arXiv preprint arXiv:2111.08792},
  year = {2021},
  url = {http://arxiv.org/abs/2111.08792}
}
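A single layer-local PredProp-style update might look as follows in PyTorch; the diagonal log-precision parameterisation and the plain SGD step are simplifying assumptions for illustration:

import torch

def predprop_layer_step(x_below, x_above, weight, log_prec, lr=0.01):
    # x_above, weight and log_prec are leaf tensors with requires_grad=True;
    # x_below is the (fixed) activity of the layer below.
    pred = x_above @ weight            # top-down prediction of the layer below
    err = x_below - pred               # local prediction error
    # Local free-energy term: precision-weighted error plus log-precision.
    loss = 0.5 * ((err ** 2 * log_prec.exp()).sum() - log_prec.sum())
    grads = torch.autograd.grad(loss, [x_above, weight, log_prec])
    with torch.no_grad():              # plain SGD on all three quantities
        for p, g in zip([x_above, weight, log_prec], grads):
            p -= lr * g
    return loss.item()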
Abstract: This paper deals with differentiable dynamical models congruent with neural process theories that cast brain function as the hierarchical refinement of an internal generative model explaining observations. Our work extends existing implementations of gradient-based predictive coding with automatic differentiation and allows integrating deep neural networks for non-linear state parameterization. Gradient-based predictive coding optimises inferred states and weights locally for each layer by optimising precision-weighted prediction errors that propagate from stimuli towards latent states. Predictions flow backwards, from latent states towards lower layers. The model suggested here optimises hierarchical and dynamical predictions of latent states. Hierarchical predictions encode expected content and hierarchical structure. Dynamical predictions capture changes in the encoded content along with higher-order derivatives. Hierarchical and dynamical predictions interact and address different aspects of the same latent states. We apply the model to various perception and planning tasks on sequential data and show their mutual dependence. In particular, we demonstrate how learning sampling distances in parallel addresses meaningful locations in data sampled at discrete time steps. We discuss possibilities to relax the assumption of linear hierarchies in favor of more flexible graph structures with emergent properties. We compare the granular structure of the model with canonical microcircuits describing predictive coding in biological networks and review the connection to Markov Blankets as a tool to characterize modularity. A final section sketches out ideas for efficient perception and planning in nested spatio-temporal hierarchies.
BibTeX:
@article{Ofner2021b,
  author = {André Ofner and Sebastian Stober},
  title = {Differentiable Generalised Predictive Coding},
  month = {December},
  journal = {arXiv preprint arXiv:2112.0337},
  year = {2021},
  url = {http://arxiv.org/abs/2112.0337}
}
Abstract: The assessment of laboratory animal behavior is of central interest in modern neuroscience research. Behavior is typically studied in terms of pose changes, which are ideally captured in three dimensions. This requires triangulation over a multi-camera system which views the animal from different angles. However, this is challenging in realistic laboratory setups due to occlusions and other technical constraints. Here we propose the usage of lift-pose models that allow for robust 3D pose estimation of freely moving rodents from a single camera view. To obtain high-quality training data for the pose lifting, we first perform geometric calibration in a camera setup involving bottom as well as side views of the behaving animal. We then evaluate the performance of two previously proposed model architectures under given inference perspectives and conclude that reliable 3D pose inference can be obtained using temporal convolutions. With this work we would like to contribute to a more robust and diverse behavior tracking of freely moving rodents for a wide range of experiments and setups in the neuroscience community.
BibTeX:
@inproceedings{Sarkar2021,
  author = {Indrani Sarkar and Indranil Maji and Charitha Omprakash and Sebastian Stober and Sanja Mikulovic and Pavol Bauer},
  title = {Evaluation of deep lift pose models for 3D rodent pose estimation based on geometrically triangulated data},
  booktitle = {CVPR 2021 CV4Animals Workshop},
  year = {2021},
  url = {http://arxiv.org/abs/2106.12993}
}
Abstract: Scheduling still constitutes a challenging problem, especially for complex problem settings involving due dates and sequence-dependent setups. The majority of existing approaches use heuristics or meta-heuristics, like Genetic Algorithms or Reinforcement Learning. We show that a supervised learning framework can learn and generalize from generated optimal target schedules, which amplifies convergence compared to unsupervised methods. We present a deep hybrid greedy framework, which can predict near-optimal schedules by utilizing the following key mechanisms: (i) Through the interplay between heuristics and a deep neural network, our hybrid model can combine the benefits of both: complex patterns from optimal schedules are learned by the neural network, while trivial decisions are outsourced to heuristics, which reduces the computational costs and allows consistent decisions during training. (ii) The problem complexity can be reduced by employing a greedy prediction scheme, where one job at a time is predicted. (iii) We propose a re-scheduling mechanism for idle jobs, which enables long-term cost reduction and renders the framework reactive and dynamic. Through the heuristics and the neural network, our model is real-time capable during inference. We compare our model against prevailing scheduling heuristics; it outperformed one of them in terms of makespan and lateness minimization. The key purpose of this work is to give a proof of concept that supervised learning is applicable to complex scheduling problems.
BibTeX:
@article{Schmidt2021,
  author = {Johann Schmidt and Sebastian Stober},
  title = {Approaching Scheduling Problems via a Deep Hybrid Greedy Model and Supervised Learning},
  journal = {{IFAC}-{PapersOnLine}},
  publisher = {Elsevier {BV}},
  year = {2021},
  volume = {54},
  number = {1},
  pages = {805--810},
  doi = {10.1016/j.ifacol.2021.08.095}
}
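The greedy one-job-at-a-time prediction scheme can be sketched in a few lines of Python; score_jobs stands in for the trained network, and the trivial-decision shortcut is a simplified placeholder for the heuristics mentioned in the abstract:

def greedy_schedule(jobs, score_jobs):
    schedule = []
    pending = list(jobs)
    while pending:
        if len(pending) == 1:                    # trivial decision: no choice left
            schedule.append(pending.pop())
            continue
        scores = score_jobs(pending, schedule)   # network ranks the pending jobs
        best = max(range(len(pending)), key=lambda i: scores[i])
        schedule.append(pending.pop(best))       # commit one job at a time
    return schedule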
2020
Abstract: Magnetic resonance imaging (MRI) provides detailed anatomical images of the prostate and its zones. It has a crucial role for many diagnostic applications. Automatic segmentation such as that of the prostate and prostate zones from MR images facilitates many diagnostic and therapeutic applications. However, the lack of a clear prostate boundary, prostate tissue heterogeneity, and the wide interindividual variety of prostate shapes make this a very challenging task. To address this problem, we propose a new neural network to automatically segment the prostate and its zones. We term this algorithm Dense U-net as it is inspired by the two existing state-of-the-art tools, DenseNet and U-net. We trained the algorithm on 141 patient datasets and tested it on 47 patient datasets using axial T2-weighted images in a four-fold cross-validation fashion. The networks were trained and tested on weakly and accurately annotated masks separately to test the hypothesis that the network can learn even when the labels are not accurate. The network successfully detects the prostate region and segments the gland and its zones. Compared with U-net, the second version of our algorithm, Dense-2 U-net, achieved an average Dice score for the whole prostate of 92.1 ± 0.8% vs. 90.7 ± 2%, for the central zone of 89.5 ± 2% vs. 89.1 ± 2.2%, and for the peripheral zone of 78.1 ± 2.5% vs. 75 ± 3%. Our initial results show Dense-2 U-net to be more accurate than state-of-the-art U-net for automatic segmentation of the prostate and prostate zones.
BibTeX:
@article{Aldoj2020,
  author = {Nader Aldoj and Federico Biavati and Florian Michallek and Sebastian Stober and Marc Dewey},
  title = {Automatic prostate and prostate zones segmentation of magnetic resonance images using {DenseNet}-like U-net},
  month = {aug},
  journal = {Scientific Reports},
  publisher = {Springer Science and Business Media {LLC}},
  year = {2020},
  volume = {10},
  number = {1},
  url = {https://www.nature.com/articles/s41598-020-71080-0},
  doi = {10.1038/s41598-020-71080-0}
}
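The DenseNet ingredient that distinguishes Dense U-net from a plain U-net is the dense connectivity inside a block, as in the PyTorch sketch below (layer count and growth rate are illustrative, not the paper's configuration):

import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_ch, growth=16, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth, growth, 3, padding=1),
            ))

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            # Each layer sees the concatenation of all previous outputs.
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)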
Abstract: The outbreak of COVID-19 has shocked the entire world with its fairly rapid spread and has challenged different sectors. One of the most effective ways to limit its spread is the early and accurate diagnosis of infected patients. Medical imaging such as X-ray and Computed Tomography (CT) combined with the potential of Artificial Intelligence (AI) plays an essential role in supporting the medical staff in the diagnosis process. In this paper, five different deep learning models (ResNet18, ResNet34, InceptionV3, InceptionResNetV2, and DenseNet161) and their ensemble have been used to classify COVID-19, pneumonia and healthy subjects using chest X-rays. Multi-label classification was performed to predict multiple pathologies for each patient, if present. Furthermore, the interpretability of each of the networks was thoroughly studied using techniques like occlusion, saliency, input X gradient, guided backpropagation, integrated gradients, and DeepLIFT. The mean Micro-F1 score of the models for COVID-19 classification ranges from 0.66 to 0.875, and is 0.89 for the ensemble of the network models. The qualitative results showed the ResNets to be the most interpretable models.
BibTeX:
@article{Chatterjee2020,
  author = {Soumick Chatterjee and Fatima Saad and Chompunuch Sarasaen and Suhita Ghosh and Rupali Khatun and Petia Radeva and Georg Rose and Sebastian Stober and Oliver Speck and Andreas Nürnberger},
  title = {Exploration of Interpretability Techniques for Deep COVID-19 Classification using Chest X-ray Images},
  month = {June},
  journal = {arXiv preprint arXiv:2006.02570},
  year = {2020},
  url = {http://arxiv.org/abs/2006.02570}
}
Abstract: Cooperative Systems promise increased performance by enriching environmental perception through shared data. Conversely, the entailed openness of the individual system architectures threatens their safety. Recent works focus on a runtime safety assessment to address this threat and thereby aim for high abstractions to provide general interfaces. On the other hand, uncertainty models of shared data, which are necessary inputs to such approaches, aim for low abstractions to provide detailed representations. The present work addresses the resulting incompatibilities by proposing a Lyapunov-based method to estimate so-called Regions of Safety. We show that these enable analyzing low-level uncertainty models to interface with state-of-the-art run-time safety assessment methods and thereby facilitate self-adaptivity and guaranteed safety of cooperative systems. The approach is evaluated in the simulated scenario of Cooperative Adaptive Cruise Control.
BibTeX:
@inproceedings{Jager2020,
  author = {Georg J\"{a}ger and Johannes Schleiss and Sasiporn Usanavasin and Sebastian Stober and Sebastian Zug},
  title = {Analyzing Regions of Safety for Handling Shared Data in Cooperative Systems},
  booktitle = {2020 25th {IEEE} International Conference on Emerging Technologies and Factory Automation ({ETFA})},
  month = {sep},
  publisher = {{IEEE}},
  year = {2020},
  doi = {10.1109/etfa46521.2020.9211932}
}
Abstract: Deep Learning based Automatic Speech Recognition (ASR) models are very successful, but hard to interpret. To gain a better understanding of how Artificial Neural Networks (ANNs) accomplish their tasks, introspection methods have been proposed. Adapting such techniques from computer vision to speech recognition is not straightforward, because speech data is more complex and less interpretable than image data. In this work, we introduce Gradient-adjusted Neuron Activation Profiles (GradNAPs) as a means to interpret features and representations in Deep Neural Networks. GradNAPs are characteristic responses of ANNs to particular groups of inputs, which incorporate the relevance of neurons for prediction. We show how to utilize GradNAPs to gain insight into how data is processed in ANNs. This includes different ways of visualizing features and clustering of GradNAPs to compare embeddings of different groups of inputs in any layer of a given network. We demonstrate our proposed techniques using a fully-convolutional ASR model.
BibTeX:
@article{Krug2020,
  author = {Andreas Krug and Sebastian Stober},
  title = {Gradient-Adjusted Neuron Activation Profiles for Comprehensive Introspection of Convolutional Speech Recognition Models},
  month = {February},
  journal = {arXiv preprint arXiv:2002.08125},
  year = {2020},
  url = {http://arxiv.org/abs/2002.08125}
}
Abstract: The human response to music combines low-level expectations that are driven by the perceptual characteristics of audio with high-level expectations from the context and the listener’s expertise. This paper discusses surprisal-based music representation learning with a hierarchical predictive neural network. In order to inspect the cognitive validity of the network’s predictions along their time-scales, we use the network’s prediction error to segment electroencephalograms (EEG) based on the audio signal. For this, we investigate the unsupervised segmentation of audio and EEG into events using the NMED-T dataset on passive natural music listening. The conducted exploratory analysis of EEG at locations connected to peaks in the network’s prediction error allowed us to visualize auditory evoked potentials connected to local and global musical structures. This indicates the potential of unsupervised predictive learning with deep neural networks as a means to retrieve musical structure from audio and as a basis to uncover the corresponding cognitive processes in the human brain.
BibTeX:
@inproceedings{ofner2020ismir,
  author = {André Ofner and Sebastian Stober},
  title = {Modeling perception with hierarchical prediction: Auditory segmentation with deep predictive coding locates candidate evoked potentials in {EEG}},
  booktitle = {21st International Society for Music Information Retrieval Conference (ISMIR'20)},
  year = {2020},
  url = {https://program.ismir2020.net/static/final_papers/219.pdf},
  doi = {10.5281/zenodo.4245495}
}
BibTeX:
@inproceedings{ofner2020smc,
  author = {André Ofner and Sebastian Stober},
  title = {Balancing Active Inference and Active Learning with Deep Variational Predictive Coding for {EEG}},
  booktitle = {IEEE International Conference on Systems, Man, and Cybernetics (SMC 2020)},
  year = {2020},
  doi = {10.1109/SMC42975.2020.9283147}
}
Abstract: PredNet, a deep predictive coding network developed by Lotter et al., combines a biologically inspired architecture based on the propagation of prediction error with self-supervised representation learning in video. While the architecture has drawn a lot of attention and various extensions of the model exist, there is a lack of a critical analysis. We fill in the gap by evaluating PredNet both as an implementation of the predictive coding theory and as a self-supervised video prediction model using a challenging video action classification dataset. We design an extended model to test if conditioning future frame predictions on the action class of the video improves the model performance. We show that PredNet does not yet completely follow the principles of predictive coding. The proposed top-down conditioning leads to a performance gain on synthetic data, but does not scale up to the more complex real-world action classification dataset. Our analysis is aimed at guiding future research on similar architectures based on the predictive coding theory.
BibTeX:
@inproceedings{rane2020icmr,
  author = {Rane, Roshan Prakash and Sz\"{u}gyi, Edit and Saxena, Vageesh and Ofner, Andr\'{e} and Stober, Sebastian},
  title = {PredNet and Predictive Coding: A Critical Review},
  booktitle = {Proceedings of the 2020 International Conference on Multimedia Retrieval},
  address = {New York, NY, USA},
  publisher = {Association for Computing Machinery},
  year = {2020},
  series = {ICMR '20},
  pages = {233--241},
  isbn = {9781450370875},
  url = {https://doi.org/10.1145/3372278.3390694},
  doi = {10.1145/3372278.3390694}
}
BibTeX:
@article{vahid2020combiol,
  author = {Amirali Vahid and Moritz Mückschel and Sebastian Stober and Ann-Kathrin Stock and Christian Beste},
  title = {Applying deep learning to single-trial {EEG} data provides evidence for complementary theories on action control},
  journal = {Communications Biology},
  year = {2020},
  volume = {3},
  number = {112},
  doi = {10.1038/s42003-020-0846-z}
}
2019
Abstract: This paper describes a novel approach for the task of end-to-end argument labeling in shallow discourse parsing. Our method describes a decomposition of the overall labeling task into subtasks and a general distance-based aggregation procedure. For learning these subtasks, we train a recurrent neural network and gradually replace existing components of our baseline by our model. The model is trained and evaluated on the Penn Discourse Treebank 2 corpus. While it is not as good as knowledge-intensive approaches, it clearly outperforms other models that are also trained without additional linguistic features.
BibTeX:
@inproceedings{knaebel2019conll,
  author = {Knaebel, Ren{\'e} and Stede, Manfred and Stober, Sebastian},
  title = {Window-Based Neural Tagging for Shallow Discourse Argument Labeling},
  booktitle = {Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)},
  address = {Hong Kong, China},
  month = {November},
  publisher = {Association for Computational Linguistics},
  year = {2019},
  pages = {768--777},
  url = {https://www.aclweb.org/anthology/K19-1072},
  doi = {10.18653/v1/K19-1072}
}
Abstract: Ever since the first International Symposium on Music Information Retrieval in 2000, the proceedings have been made publicly available to interested researchers. After 20 years of annual conferences and workshops, this collection has grown to almost 2,000 papers. When restricted to linear search and retrieval, it becomes inherently hard to identify topics, related work and trends in a document collection of this size. Therefore, this paper presents and evaluates a map-based user interface for exploring 20 years of ISMIR publications. The interface visualizes k-nearest neighbor subsets of semantically similar papers. Users may jump from one neighborhood to the next by selecting another paper from the current subset. Through animated transitions between local k-nn maps, the interface creates the impression of panning a large global map. Evaluation results of a small user study suggest that users are able to discover interesting links between papers. Due to its generic approach, the interface is easily applicable to other document collections as well. The search interface and its source code are made publicly available.
BibTeX:
@inproceedings{stober2019ismir,
  author = {Thomas Low and Christian Hentschel and Sayantan Polley and Anustup Das and Harald Sack and Andreas N\"{u}rnberger and Sebastian Stober},
  title = {The ISMIR Explorer - A Visual Interface for Exploring 20 Years of ISMIR Publications},
  booktitle = {20th International Society for Music Information Retrieval Conference (ISMIR'19)},
  year = {2019},
  pages = {392--399},
  url = {http://archives.ismir.net/ismir2019/paper/000092.pdf}
}
Abstract: In recent decades, Predictive Coding has emerged as a unifying theory of human cognition. Related theories in cognitive neuroscience, such as Active Inference and Free Energy Minimization, have demonstrated that Predictive Coding can account for many aspects of human perception and action. However, little work has been done to explore the Predictive Coding framework in practical domains like computer vision or robotics. A popular implementation in the field of computer vision that is inspired by Predictive Coding is called the ‘PredNet’. PredNet is trained on videos to perform future frame prediction. In a purely perceptual setup like this, Predictive Coding is defined as a hierarchical generative model that dynamically infers low-dimensional causes from high-dimensional perceptual stimuli. The architecture is trained at each level of its hierarchy to learn low-dimensional causal factors from temporal visual data by actively generating top-down predictions or hypotheses and testing them against bottom-up incoming frames or sensory evidence. In our recent work, we inspected the PredNet architecture and found that it fails to emulate and therefore benefit from many core ideas of Predictive Coding. We will highlight these conceptual limitations of PredNet and present preliminary results from our improved Predictive Coding architecture. Even though our architecture is inspired by PredNet, it differs from it in three main ways: (1) it is designed to perform semantic segmentation, an important vision task for autonomous driving, in which pixels of an image are classified as belonging to a semantic category like drivable road, pedestrian or car; (2) the top-down predictions represent semantic class maps and not pixel values; and (3) it performs not just short-term but also long-term predictions along its hierarchy. Finally, we compare our architecture's performance against contemporary deep learning methods for the autonomous driving vision task. We assess the semantic segmentation accuracy with an emphasis on computational efficiency, including the model size, the amount of training data it needs and the run-time. We also inspect the ability of the model to adjust to differing visual contexts like daytime, nighttime and different weather conditions like rain or snow.
BibTeX:
@inproceedings{rane2019comco,
  author = {Roshan Prakash Rane and André Ofner and Shreyas Gite and Sebastian Stober},
  title = {Predictive Coding Based Vision For Autonomous Cars},
  booktitle = {Computational Cognition 2019 Workshop},
  year = {2019},
  url = {http://www.comco2019.com/abstracts/day1_rane.pdf}
}
Abstract: Predictive coding offers a comprehensive explanation of human brain function through prediction error minimisation. This idea has found traction in machine learning, where deterministic and stochastic inference allow efficient representation of sensory signals. Recently, these artificial predictive coding networks have been coupled with the brain as its natural counterpart to develop co-adaptive brain-computer interfaces based on predictive coding as a shared principle. However, it remains unclear how differences in prior knowledge affect information transfer between the coupled predictive coding networks. To address this question, this study introduces a sequential and hierarchical stochastic predictive coding model where predictions about future sensory states are conditioned on past states and top-down predictive signal for each layer. Using synthetic visual stimuli, we demonstrate the model's capacity to incorporate knowledge from a coupled network by comparing the generated prediction error signature with the corresponding stimulus. Our results show that information from the coupled network aids the functional differentiation and can be used to encode aspects of the stimuli that are not visible to the model itself.
BibTeX:
@inproceedings{ofner2019bc,
  author = {André Ofner and Sebastian Stober},
  title = {Knowledge transfer in coupled predictive coding networks},
  booktitle = {Bernstein Conference 2019},
  year = {2019},
  doi = {10.12751/nncn.bc2019.0073}
}
Abstract: The uninformative ordering of artificial neurons in Deep Neural Networks complicates visualizing activations in deeper layers. This is one reason why the internal structure of such models is very unintuitive. In neuroscience, activity of real brains can be visualized by highlighting active regions. Inspired by those techniques, we train a convolutional speech recognition model where filters are arranged in a 2D grid and neighboring filters are similar to each other. We show how these topographic filter maps visualize artificial neuron activations more intuitively. Moreover, we investigate whether this causes phoneme-responsive neurons to be grouped in certain regions of the topographic map.
BibTeX:
@inproceedings{krug2019blackboxnlp,
  author = {Andreas Krug and Sebastian Stober},
  title = {Visualizing Deep Neural Networks for Speech Recognition with Learned Topographic Filter Maps},
  booktitle = {Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP},
  year = {2019}
}
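One plausible way to obtain such a topographic arrangement is a regulariser that pulls neighbouring filters on the 2D grid towards each other; the PyTorch sketch below is an assumption about the mechanism, not the paper's exact formulation:

import torch

def topographic_penalty(filters, grid_h, grid_w):
    # filters: (grid_h * grid_w, ...) conv kernels, one per grid cell.
    f = filters.view(grid_h, grid_w, -1)
    # Squared differences between horizontal and vertical grid neighbours.
    dh = (f[1:, :] - f[:-1, :]).pow(2).sum()
    dw = (f[:, 1:] - f[:, :-1]).pow(2).sum()
    return dh + dw  # add, suitably scaled, to the task loss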
BibTeX:
@inproceedings{krug2019nawik,
  author = {Krug, Andreas and Stober, Sebastian},
  title = {Siri visualisiert},
  booktitle = {Proceedings of the 2019 NaWik Symposium Karlsruhe},
  year = {2019},
  pages = {24--25}
}
Abstract: Attention Deficit Hyperactivity Disorder (ADHD) is one of the most prevalent neuropsychiatric disorders in childhood and adolescence and its diagnosis is based on clinical interviews, symptom questionnaires, and neuropsychological testing. Much research effort has been undertaken to evaluate the usefulness of neurophysiological (EEG) data to aid this diagnostic process. In the current study, we applied deep learning methods on event-related EEG data to examine whether it is possible to distinguish ADHD patients from healthy controls using purely neurophysiological measures. The same was done to distinguish between ADHD subtypes. The results show that the applied deep learning model (“EEGNet”) was able to distinguish between both ADHD subtypes and healthy controls with an accuracy of up to 83%. However, a significant fraction of individuals could not be classified correctly. It is shown that neurophysiological processes indicating attentional selection associated with superior parietal cortical areas were the most important for that. Using the applied deep learning method, it was not possible to distinguish ADHD subtypes from each other. This is the first study showing that deep learning methods applied to EEG data are able to dissociate between ADHD patients and healthy controls. The results show that the applied method reflects a promising means to support clinical diagnosis in ADHD. However, more work needs to be done to increase the reliability of the taken approach.
BibTeX:
@article{vahid2019jcm,
  author = {Vahid, Amirali and Bluschke, Annet and Roessner, Veit and Stober, Sebastian and Beste, Christian},
  title = {Deep Learning Based on Event-Related EEG Differentiates Children with ADHD from Healthy Controls},
  journal = {Journal of Clinical Medicine},
  year = {2019},
  volume = {8},
  number = {7},
  url = {https://www.mdpi.com/2077-0383/8/7/1055},
  doi = {10.3390/jcm8071055}
}
Abstract: Predictive coding and its generalization to active inference offer a unified theory of brain function. The underlying predictive processing paradigm has gained significant attention in artificial intelligence research for its representation learning and predictive capacity. Here, we suggest that it is possible to integrate human and artificial generative models with a predictive coding network that processes sensations simultaneously with the signature of predictive coding found in human neuroimaging data. We propose a recurrent hierarchical predictive coding model that predicts low-dimensional representations of stimuli, electroencephalogram and physiological signals with variational inference. We suggest that in a shared environment, such hybrid predictive coding networks learn to incorporate the human predictive model in order to reduce prediction error. We evaluate the model on a publicly available EEG dataset of subjects watching one-minute long video excerpts. Our initial results indicate that the model can be trained to predict visual properties such as the amount, distance and motion of human subjects in videos.
BibTeX:
@article{ofner2019alife,
  author = {Ofner, André and Stober, Sebastian},
  title = {Hybrid Variational Predictive Coding as a Bridge between Human and Artificial Cognition},
  journal = {The 2019 Conference on Artificial Life},
  year = {2019},
  number = {31},
  pages = {68-69},
  url = {https://www.mitpressjournals.org/doi/abs/10.1162/isal_a_00142},
  doi = {10.1162/isal\_a\_00142}
}
Abstract: Magnetic resonance imaging (MRI) provides detailed anatomical images of the prostate (PR) and its zones. The importance of segmenting the prostate and the prostate zones, such as the central zone (CZ) and the peripheral zone (PZ) lies in the fact that the diagnostic guidelines differ depending on in which zone the lesion is located. Thus, automatic prostate and prostate zone segmentation from MR images is an important topic for many diagnostic and therapeutic purposes. However, the prostate tissue heterogeneity and the huge varieties of prostate shapes among patients make this task very challenging. Therefore, we propose a new neural network named Dense U-net inspired by the state-of-the-art DenseNet and U-net to automatically segment prostate and prostate zones. It was trained on 141 patient datasets and tested on 47 patient datasets with axial T2-weighted images in four-fold cross-validation manner. The network can successfully segment the gland and its subsequent zones. This Dense U-net compared with the state-of-the-art U-net achieved an average dice score for the whole prostate of 91.2± 0.8% vs. 89.2 ± 0.8%, for CZ of 89.2± 0.2% vs. 87.4 ± 0.2%, and for PZ of 76.4± 0.2% vs. 74.0± 0.2%. The experimental results show that the developed Dense U-net was more accurate than the state-of-the-art U-net for prostate and prostate zone segmentation.
BibTeX:
@inproceedings{aldoj2019midl,
  author = {Nader Aldoj and Federico Biavati and Miriam Rutz and Florian Michallek and Sebastian Stober and Marc Dewey},
  title = {Automatic prostate and prostate zones segmentation of magnetic resonance images using convolutional neural networks},
  booktitle = {Proceedings of International Conference on Medical Imaging with Deep Learning (MIDL'19)},
  year = {2019},
  url = {https://openreview.net/forum?id=HJgDulS7cV}
}
2018
Abstract: Predictive coding and its generalization to active inference offer a unified theory of brain function. The underlying predictive processing paradigm has gained significant attention within the machine learning community for its representation learning and predictive capacity. Here, we suggest that it is possible to integrate human and artificial predictive models with an artificial neural network that learns to predict sensations simultaneously with their representation in the brain. Guided by the principles of active inference, we propose a recurrent hierarchical predictive coding model that jointly predicts stimuli, electroencephalogram and physiological signals under variational inference. We suggest that in a shared environment, the artificial inference process can learn to predict and exploit the human generative model. We evaluate the model on a publicly available dataset of subjects watching one-minute long video excerpts and show that the model can be trained to predict physical properties such as the amount, distance and motion of human subjects in future frames of the videos. Our results hint at the possibility of bi-directional active inference across human and machine.
BibTeX:
@inproceedings{ofner2018hpc,
  author = {André Ofner and Sebastian Stober},
  title = {Towards Bridging Human and Artificial Cognition: Hybrid Variational Predictive Coding of the Physical World, the Body and the Brain},
  booktitle = {NeurIPS 2018 Workshop on Modeling the Physical World},
  year = {2018}
}
Abstract: The increasing complexity of deep Artificial Neural Networks (ANNs) allows to solve complex tasks in various applications. This comes with less understanding of decision processes in ANNs. Therefore, introspection techniques have been proposed to interpret how the network accomplishes its task. Those methods mostly visualize their results in the input domain and often only process single samples. For images, highlighting important features or creating inputs which activate certain neurons is intuitively interpretable. The same introspection for speech is much harder to interpret. In this paper, we propose an alternative method which analyzes neuron activations for whole data sets. Its generality allows application to complex data like speech. We introduce time-independent Neuron Activation Profiles (NAPs) as characteristic network responses to certain groups of inputs. By clustering those time-independent NAPs, we reveal that layers are specific to certain groups. We demonstrate our method for a fully-convolutional speech recognizer. There, we investigate whether phonemes are implicitly learned as an intermediate representation for predicting graphemes. We show that our method reveals, which layers encode phonemes and graphemes and that similarities between phonetic categories are reflected in the clustering of time-independent NAPs.
BibTeX:
@inproceedings{krug2018irasl,
  author = {Andreas Krug and René Knaebel and Sebastian Stober},
  title = {Neuron Activation Profiles for Interpreting Convolutional Speech Recognition Models},
  booktitle = {NeurIPS 2018 Interpretability and Robustness for Audio, Speech and Language Workshop (IRASL'18)},
  year = {2018}
}
Abstract: We describe a framework of hybrid cognition by formulating a hybrid cognitive agent that performs hierarchical active inference across a human and a machine part. We suggest that, in addition to enhancing human cognitive functions with an intelligent and adaptive interface, integrated cognitive processing could accelerate emergent properties within artificial intelligence. To establish this, a machine learning part learns to integrate into human cognition by explaining away multi-modal sensory measurements from the environment and physiology simultaneously with the brain signal. With ongoing training, the amount of predictable brain signal increases. This lends the agent the ability to self-supervise on increasingly high levels of cognitive processing in order to further minimize surprise in predicting the brain signal. Furthermore, with increasing level of integration, the access to sensory information about environment and physiology is substituted with access to their representation in the brain. While integrating into a joint embodiment of human and machine, human action and perception are treated as the machine’s own. The framework can be implemented with invasive as well as non-invasive sensors for environment, body and brain interfacing. Online and offline training with different machine learning approaches are conceivable. Building on previous research on shared representation learning, we suggest a first implementation leading towards hybrid active inference with non-invasive brain interfacing and state-of-the-art probabilistic deep learning methods. We further discuss how the implementation might affect the meta-cognitive abilities of the described agent and suggest that, with an adequate implementation, the machine part can continue to execute and build upon the learned cognitive processes autonomously.
BibTeX:
@article{ofner2018hai,
  author = {André Ofner and Sebastian Stober},
  title = {Hybrid Active Inference},
  journal = {arXiv preprint arXiv:1810.02647},
  year = {2018},
  url = {https://arxiv.org/abs/1810.02647}
}
Abstract: Artificial Neural Networks (ANNs) have experienced great success in the past few years. The increasing complexity of these models leads to less understanding about their decision processes. Therefore, introspection techniques have been proposed, mostly for images as input data. Patterns or relevant regions in images can be intuitively interpreted by a human observer. This is not the case for more complex data like speech recordings. In this work, we investigate the application of common introspection techniques from computer vision to an Automatic Speech Recognition (ASR) task. To this end, we use a model similar to image classification, which predicts letters from spectrograms. We show difficulties in applying image introspection to ASR. To tackle these problems, we propose normalized averaging of aligned inputs (NAvAI): a data-driven method to reveal learned patterns for prediction of specific classes. Our method integrates information from many data examples through local introspection techniques for Convolutional Neural Networks (CNNs). We demonstrate that our method provides better interpretability of letter-specific patterns than existing methods.
BibTeX:
@inproceedings{krug2018introspection,
  author = {Andreas Krug and Sebastian Stober},
  title = {Introspection for Convolutional Automatic Speech Recognition},
  booktitle = {Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP},
  year = {2018},
  pages = {187--199},
  url = {http://www.aclweb.org/anthology/W18-5421}
}
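As a rough illustration of the align-then-average idea behind NAvAI, the NumPy sketch below aligns examples by cross-correlation against a reference before averaging; the published method aligns at predicted letter positions and normalizes differently, so this is only a stand-in:

import numpy as np

def align_and_average(examples, reference):
    # examples: list of equal-length 1D arrays (e.g. spectrogram frames over time).
    aligned = []
    for x in examples:
        corr = np.correlate(x, reference, mode="full")
        shift = corr.argmax() - (len(reference) - 1)
        aligned.append(np.roll(x, -shift))  # circular shift as a simplification
    return np.mean(aligned, axis=0)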
Abstract: Retrieving music information from brain activity is a challenging and still largely unexplored research problem. In this paper we investigate the possibility to reconstruct perceived and imagined musical stimuli from electroencephalography (EEG) recordings based on two datasets. One dataset contains multi-channel EEG of subjects listening to and imagining rhythmical patterns presented both as sine wave tones and short looped spoken utterances. These utterances leverage the well-known speech-to-song illusory transformation which results in very catchy and easy to reproduce motifs. A second dataset provides EEG recordings for the perception of 10 full length songs. Using a multi-view deep generative model we demonstrate the feasibility of learning a shared latent representation of brain activity and auditory concepts, such as rhythmical motifs appearing across different instrumentations. Introspection of the model trained on the rhythm dataset reveals disentangled rhythmical and timbral features within and across subjects. The model allows continuous interpolation between representations of different observed variants of the presented stimuli. By decoding the learned embeddings we were able to reconstruct both perceived and imagined music. Stimulus complexity and the choice of training data show a strong effect on the reconstruction quality.
BibTeX:
@inproceedings{ofner2018ismir,
  author = {André Ofner and Sebastian Stober},
  title = {Shared Generative Representation of Auditory Concepts and EEG to Reconstruct Perceived and Imagined Music},
  booktitle = {19th International Society for Music Information Retrieval Conference (ISMIR'18)},
  year = {2018},
  pages = {392--399},
  url = {http://ismir2018.ircam.fr/doc/pdfs/101_Paper.pdf}
}
Abstract: Relationships between neuroimaging measures and behavior provide important clues about brain function and cognition in healthy and clinical populations. While electroencephalography (EEG) provides a portable, low cost measure of brain dynamics, it has been somewhat underrepresented in the emerging field of model-based inference. We seek to address this gap in this article by highlighting the utility of linking EEG and behavior, with an emphasis on approaches for EEG analysis that move beyond focusing on peaks or “components” derived from averaging EEG responses across trials and subjects (generating the event-related potential (ERP)). First, we review methods for deriving features from EEG in order to enhance the signal within single-trials. These methods include filtering based on user-defined features (i.e. frequency decomposition, time-frequency decomposition), filtering based on data-driven properties (i.e. blind source separation (BSS)), and generating more abstract representations of data (e.g. using deep learning). We then review cognitive models which extract latent variables from experimental tasks, including the drift diffusion model (DDM) and reinforcement learning approaches. Next, we discuss ways to assess associations among these measures, including statistical models, data-driven joint models, and cognitive joint modeling using hierarchical Bayesian models (HBM). We think that these methodological tools are likely to contribute to theoretical advancements, and will help inform our understanding of brain dynamics that contribute to moment-to-moment cognitive function.
BibTeX:
@article{frontiers2018,
  author = {David A. Bridwell and James F. Cavanagh and Anne G.E. Collins and Michael D. Nunez and Ramesh Srinivasan and Sebastian Stober and Vince D. Calhoun},
  title = {Moving Beyond {ERP} Components: A Selective Review of Approaches to Integrate {EEG} and Behavior},
  journal = {Frontiers in Neuroscience},
  year = {2018},
  volume = {12},
  pages = {106},
  url = {https://www.frontiersin.org/article/10.3389/fnhum.2018.00106},
  doi = {10.3389/fnhum.2018.00106}
}
Abstract: Deep learning is a sub-field of machine learning that has recently gained substantial popularity in various domains such as computer vision, automatic speech recognition, natural language processing, and bioinformatics. Deep learning techniques are able to learn complex feature representations from raw signals and thus also have potential to improve signal processing in the context of brain-computer interfaces (BCIs). However, they typically require large amounts of data for training – much more than what can often be provided with reasonable effort when working with brain activity recordings of any kind. In order to still leverage the power of deep learning techniques with limited available data, special care needs to be taken when designing the BCI task, defining the structure of the deep model, and choosing the training method. This chapter presents example approaches for the specific scenario of music-based brain-computer interaction through electroencephalography (EEG) – in the hope that these will prove to be valuable in different settings as well. We explain important decisions for the design of the BCI task and their impact on the models and training techniques that can be used. Furthermore, we present and compare various pre-training techniques that aim to improve the signal-to-noise ratio. Finally, we discuss approaches to interpret the trained models.
BibTeX:
@inbook{stober2018bcibook,
  author = {Sebastian Stober and Avital Sternin},
  editor = {Toshihisa Tanaka and Mahnaz Arvaneh},
  booktitle = {Signal Processing and Machine Learning for Brain-Machine Interfaces},
  chapter = {Decoding Music Perception and Imagination using Deep Learning Techniques},
  publisher = {IET},
  year = {2018},
  pages = {271--299},
  doi = {10.1049/PBCE114E}
}
2017
Abstract: The target group of search engine users in the Internet is very wide and heterogeneous. The users differ in background, knowledge, experience, etc. That is why, in order to find relevant information, such search systems not only have to retrieve web documents related to the search query but also have to consider and adapt to the user’s interests, skills, preferences and context. In addition, numerous user studies have revealed that the search process itself can be very complex, in particular if the user is not providing well-defined queries to find a specific piece of information, but is exploring the information space. This is very often the case if the user is not completely familiar with the search topic and is trying to get an overview of or learn about the topic at hand. Especially in this scenario, user- and task-specific adaptations might lead to a significant increase in retrieval performance and user experience. In order to analyze and characterize the complexity of the search process, different models for information(-seeking) behavior and information activities have been developed. In this chapter, we discuss selected models, with a focus on models that have been designed to cover the needs of individual users. Furthermore, an aggregated framework is proposed to address different levels of information(-seeking) behavior and to motivate approaches for adaptive search systems. To enable Companion-Systems to support users during information exploration, the proposed models provide solid and suitable frameworks to allow cooperative and competent assistance.
BibTeX:
@inbook{kotzybaGSN2017companion,
  author = {Michael Kotzyba and Tatiana Gossen and Sebastian Stober and Andreas Nürnberger},
  title = {Companion {Technology} - {A} {Paradigm} {Shift} in {Human}-{Technology} {Interaction}},
  editor = {Susanne Biundo and Andreas Wendemuth},
  chapter = {Model-Based Frameworks for User Adapted Information Exploration: An Overview},
  publisher = {Springer International Publishing},
  year = {2017},
  series = {Cognitive Technologies},
  pages = {37--56},
  url = {http://www.springer.com/de/book/9783319436647},
  doi = {10.1007/978-3-319-43665-4_3}
}
Abstract: As an emerging sub-field of music information retrieval (MIR), music imagery information retrieval (MIIR) aims to retrieve information from brain activity recorded during music cognition – such as listening to or imagining music pieces. This is a highly inter-disciplinary endeavor that requires expertise in MIR as well as cognitive neuroscience and psychology. The OpenMIIR initiative strives to foster collaborations between these fields to advance the state of the art in MIIR. As a first step, electroencephalography (EEG) recordings of music perception and imagination have been made publicly available, enabling MIR researchers to easily test and adapt their existing approaches for music analysis like fingerprinting, beat tracking or tempo estimation on this new kind of data. This paper reports on first results of MIIR experiments using these OpenMIIR datasets and points out how these findings could drive new research in cognitive neuroscience.
BibTeX:
@article{stober2017frontiers,
  author = {Stober, Sebastian},
  title = {Towards {Studying} {Music} {Cognition} with {Information} {Retrieval} {Techniques}: {Lessons} {Learned} from the {OpenMIIR} {Initiative}},
  journal = {Frontiers in Psychology},
  year = {2017},
  volume = {8},
  url = {http://journal.frontiersin.org/article/10.3389/fpsyg.2017.01255/abstract},
  doi = {10.3389/fpsyg.2017.01255}
}
Abstract: The increase in complexity of Artificial Neural Nets (ANNs) results in difficulties in understanding what they have learned and how they accomplish their goal. As their complexity approaches that of the human brain, neuroscientific techniques could facilitate their analysis. This paper investigates an adaptation of the Event-Related Potential (ERP) technique for analyzing ANNs, demonstrated for a speech recognizer. Our adaptation involves deriving a large number of recordings (trials) for the same word and averaging the resulting neuron activations. This allows for a systematic analysis of neuron activations to reveal their function in detecting specific letters. We compare those observations between an English and a German speech recognizer.
BibTeX:
@inproceedings{krug2017ccn,
  author = {Andreas Krug and Sebastian Stober},
  title = {Adaptation of the Event-Related Potential Technique for Analyzing Artificial Neural Nets},
  booktitle = {Conference on Cognitive Computational Neuroscience (CCN'17)},
  year = {2017}
}
Abstract: End-to-end training of automated speech recognition (ASR) systems requires massive data and compute resources. We explore transfer learning based on model adaptation as an approach for training ASR models under constrained GPU memory, throughput and training data. We conduct several systematic experiments adapting a Wav2Letter convolutional neural network originally trained for English ASR to the German language. We show that this technique allows faster training on consumer-grade resources while requiring less training data in order to achieve the same accuracy, thereby lowering the cost of training ASR models in other languages. Model introspection revealed that small adaptations to the network’s weights were sufficient for good performance, especially for inner layers.
BibTeX:
@inproceedings{kunzeKKKJS2017acl,
  author = {Kunze, Julius and Kirsch, Louis and Kurenkov, Ilia and Krug, Andreas and Johannsmeier, Jens and Stober, Sebastian},
  title = {Transfer Learning for Speech Recognition on a Budget},
  booktitle = {2nd Workshop on Representation Learning for NLP at the Annual Meeting of the Association for Computational Linguistics (ACL'17)},
  year = {2017},
  url = {https://arxiv.org/abs/1706.00290}
}
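In practice, such model adaptation amounts to reusing the pre-trained weights and fine-tuning only selected layers. The sketch below shows one way to do this in PyTorch; which layers to unfreeze is an assumption here, not the paper's recipe:

def adapt_pretrained_model(model, trainable_prefixes=("conv1", "output")):
    # Freeze everything except parameters whose names match the prefixes.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefixes)
    return [p for p in model.parameters() if p.requires_grad]

# The returned list feeds the optimiser, e.g.
#   optimizer = torch.optim.Adam(adapt_pretrained_model(model), lr=1e-4)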
Abstract: This paper introduces a pre-training technique for learning discriminative features from electroencephalography (EEG) recordings using deep neural networks. EEG data are generally only available in small quantities, they are high-dimensional with a poor signal-to-noise ratio, and there is considerable variability between individual subjects and recording sessions. Similarity-constraint encoders as introduced in this paper specifically address these challenges for feature learning. They learn features that allow distinguishing between classes by demanding that encodings of two trials from the same class are more similar to each other than to encoded trials from other classes. This tuple-based training approach is especially suitable for small datasets. The proposed technique is evaluated using the publicly available OpenMIIR dataset of EEG recordings taken while participants listened to and imagined music. For this dataset, a simple convolutional filter can be learned that significantly improves the signal-to-noise ratio while aggregating the 64 EEG channels into a single waveform.
BibTeX:
@inproceedings{stober2017icassp,
  author = {Sebastian Stober},
  title = {Learning Discriminative Features from Electroencephalography Recordings by Encoding Similarity Constraints},
  booktitle = {Proceedings of 42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'17)},
  year = {2017},
  pages = {6175--6179}
}
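A sketch of the tuple-based training objective, assuming a PyTorch setup. The single-convolution encoder mirrors the aggregation of 64 EEG channels into one waveform mentioned above; batch size, trial length and margin are made-up values:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Encoder as described above: one convolution aggregating 64 EEG channels
# into a single waveform.
encoder = nn.Conv1d(in_channels=64, out_channels=1, kernel_size=1, bias=False)

def similarity_constraint_loss(a, p, n, margin=1.0):
    """Encoded trial a must be closer to same-class p than to other-class n."""
    ea, ep, en = encoder(a), encoder(p), encoder(n)
    d_same = F.mse_loss(ea, ep, reduction='none').mean(dim=(1, 2))
    d_diff = F.mse_loss(ea, en, reduction='none').mean(dim=(1, 2))
    return F.relu(d_same - d_diff + margin).mean()

# Random stand-in tuples: batch of 8 trials, 64 channels, 512 samples each.
a, p, n = (torch.randn(8, 64, 512) for _ in range(3))
similarity_constraint_loss(a, p, n).backward()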
Abstract: We compare Visual Berrypicking, an interactive approach allowing users to explore large and highly faceted information spaces using similarity-based two-dimensional maps, with traditional browsing techniques. For large datasets, current projection methods used to generate map-like overviews suffer from increased computational costs and a loss of accuracy, resulting in inconsistent visualizations. We propose to interactively align inexpensive small maps, showing local neighborhoods only, which ideally creates the impression of panning a large map. For evaluation, we designed a web-based prototype for movie exploration and compared it to the web interface of The Movie Database (TMDb) in an online user study. Results suggest that users are able to effectively explore large movie collections by hopping from one neighborhood to the next.
BibTeX:
@inproceedings{low2017mmm,
  author = {Thomas Low and Christian Hentschel and Sebastian Stober and Harald Sack and Andreas N{\"u}rnberger},
  title = {Exploring Large Movie Collections: Comparing Visual Berrypicking and Traditional Browsing},
  booktitle = {Proceedings of the 23rd International Conference on MultiMedia Modeling (MMM'17)},
  year = {2017},
  pages = {198--208},
  doi = {10.1007/978-3-319-51814-5_17}
}
2016
BibTeX:
@inbook{stober2016mirbook-chapter,
  author = {Sebastian Stober},
  title = {Music Data Analysis: Foundations and Applications},
  editor = {Claus Weihs and Dietmar Jannach and Igor Vatolkin and G\"{u}nter Rudolph},
  chapter = {Similarity-based Organization of Music Collections},
  publisher = {CRC Press},
  year = {2016},
  isbn = {9781498719568},
  url = {https://www.crcpress.com/Music-Data-Analysis-Foundations-and-Applications/Weihs-Jannach-Vatolkin-Rudolph/p/book/9781498719568}
}
Abstract: This work introduces a pre-training technique for learning discriminative features from electroencephalography (EEG) recordings using deep artificial neural networks. EEG data are generally only available in small quantities, they are high-dimensional with a poor signal-to-noise ratio, and there is considerable variability between individual subjects and recording sessions. Similarity-constraint encoders as introduced here specifically address these challenges for feature learning. They learn features that allow to distinguish between classes by demanding that encodings of two trials from the same class are more similar to each other than to encoded trials from other classes. This tuple-based training approach is especially suitable for small datasets. The proposed technique is evaluated using the publicly available OpenMIIR dataset of EEG recordings taken while 9 subjects listened to 12 short music pieces. For this dataset, a simple convolutional filter can be learned that is stable across subjects and significantly improves the signal-to-noise ratio while aggregating the 64 EEG channels into a single waveform. With this filter, a neural network classifier can be trained that is simple enough to allow for interpretation of the learned parameters by domain experts and facilitate findings about the cognitive processes. Further, a cross-subject classification accuracy of 27% is obtained with values above 40% for individual subjects.
BibTeX:
@inproceedings{stober2016bc,
  author = {Sebastian Stober},
  title = {Learning Discriminative Features from Electroencephalography Recordings by Encoding Similarity Constraints},
  booktitle = {Bernstein Conference 2016},
  year = {2016},
  doi = {10.12751/nncn.bc2016.0223}
}
Abstract: This paper addresses the question how music information retrieval techniques originally developed to process audio recordings can be adapted for the analysis of corresponding brain activity data. In particular, we conducted a case study applying beat tracking techniques to extract the tempo from electroencephalography (EEG) recordings obtained from people listening to music stimuli. We point out similarities and differences in processing audio and EEG data and show to which extent the tempo can be successfully extracted from EEG signals. Furthermore, we demonstrate how the tempo extraction from EEG signals can be stabilized by applying different fusion approaches on the mid-level tempogram features.
BibTeX:
@inproceedings{stober2016ismir,
  author = {Sebastian Stober and Thomas Pr\"{a}tzlich and Meinard M\"{u}ller},
  title = {Brain Beats: Tempo Extraction from EEG Data},
  booktitle = {17th International Society for Music Information Retrieval Conference (ISMIR'16)},
  year = {2016},
  url = {https://wp.nyu.edu/ismir2016/wp-content/uploads/sites/2294/2016/07/022_Paper.pdf}
}
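A rough sketch of transferring an audio tempo pipeline to EEG with librosa. The sampling rate, hop size and the rectified-signal novelty curve are assumptions made for illustration; the paper's exact preprocessing is not reproduced here:

import numpy as np
import librosa

sr, hop = 512, 64                        # assumed EEG sampling rate and hop size
eeg = np.random.randn(sr * 30)           # stand-in for a 30 s preprocessed trial

# Frame-level novelty curve: audio pipelines use an onset-strength envelope
# here; for EEG we simply take the rectified, downsampled signal.
novelty = np.abs(eeg).reshape(-1, hop).mean(axis=1)

# Mid-level tempogram feature, fused (averaged) over time before picking
# the dominant tempo, mirroring the fusion idea described above.
tgram = librosa.feature.tempogram(onset_envelope=novelty, sr=sr,
                                  hop_length=hop, win_length=128)
bpm = librosa.tempo_frequencies(tgram.shape[0], sr=sr, hop_length=hop)
fused = tgram.mean(axis=1)
print("estimated tempo (BPM):", bpm[np.argmax(fused[1:]) + 1])  # skip bin 0 (inf)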
2015
Abstract: We introduce and compare several strategies for learning discriminative features from electroencephalography (EEG) recordings using deep learning techniques. EEG data are generally only available in small quantities, they are high-dimensional with a poor signal-to-noise ratio, and there is considerable variability between individual subjects and recording sessions. Our proposed techniques specifically address these challenges for feature learning. Cross-trial encoding forces auto-encoders to focus on features that are stable across trials. Similarity-constraint encoders learn features that allow to distinguish between classes by demanding that two trials from the same class are more similar to each other than to trials from other classes. This tuple-based training approach is especially suitable for small datasets. Hydra-nets allow for separate processing pathways adapting to subsets of a dataset and thus combine the advantages of individual feature learning (better adaptation of early, low-level processing) with group model training (better generalization of higher-level processing in deeper layers). This way, models can, for instance, adapt to each subject individually to compensate for differences in spatial patterns due to anatomical differences or variance in electrode positions. The different techniques are evaluated using the publicly available OpenMIIR dataset of EEG recordings taken while participants listened to and imagined music.
BibTeX:
@article{stober2015arXiv:1511.04306,
  author = {Sebastian Stober and Avital Sternin and Adrian M. Owen and Jessica A. Grahn},
  title = {Deep Feature Learning for {EEG} Recordings},
  journal = {arXiv preprint arXiv:1511.04306},
  year = {2015},
  note = {submitted as conference paper for ICLR 2016},
  url = {http://arxiv.org/abs/1511.04306}
}
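Of the three strategies, the hydra-net idea is the easiest to sketch: subject-specific first layers feed a shared deeper network. The PyTorch version below is hypothetical; all shapes and layer choices are assumptions:

import torch
import torch.nn as nn

class HydraNet(nn.Module):
    def __init__(self, n_subjects, n_channels=64, n_classes=12):
        super().__init__()
        # One low-level pathway per subject compensates for individual
        # electrode placement and anatomy ...
        self.heads = nn.ModuleList(
            nn.Conv1d(n_channels, 16, kernel_size=7, padding=3)
            for _ in range(n_subjects))
        # ... while the deeper layers are shared across the group.
        self.shared = nn.Sequential(
            nn.ReLU(), nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(16, n_classes))

    def forward(self, x, subject):
        return self.shared(self.heads[subject](x))

model = HydraNet(n_subjects=9)
logits = model(torch.randn(4, 64, 512), subject=3)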
Abstract: The ISMIR Paper Explorer allows users to browse all papers published at ISMIR using a map-based interface where similar papers are close together. The web-based user interface creates the impression of panning a large (global) map by aligning inexpensive small maps showing local neighborhoods. By directed hopping from one neighborhood to the next, the user is able to explore the whole ISMIR paper collection.
BibTeX:
@inproceedings{stober2015ismirlbd,
  author = {Sebastian Stober and Thomas Low and Christian Hentschel and Harald Sack and Andreas N\"{u}rnberger},
  title = {The {ISMIR} Paper Explorer: A Map-Based Interface for {MIR} Literature Research},
  booktitle = {16th International Society for Music Information Retrieval Conference (ISMIR'15) - Late Breaking \& Demo Papers},
  year = {2015},
  url = {http://ismir2015.uma.es/LBD/LBD41.pdf}
}
Abstract: Music imagery information retrieval (MIIR) systems may one day be able to recognize a song from only our thoughts. As a step towards such technology, we are presenting a public domain dataset of electroencephalography (EEG) recordings taken during music perception and imagination. We acquired this data during an ongoing study that so far comprises 10 subjects listening to and imagining 12 short music fragments – each 7-16s long – taken from well-known pieces. These stimuli were selected from different genres and systematically vary along musical dimensions such as meter, tempo and the presence of lyrics. This way, various retrieval scenarios can be addressed and the success of classifying based on specific dimensions can be tested. The dataset is aimed to enable music information retrieval researchers interested in these new MIIR challenges to easily test and adapt their existing approaches for music analysis like fingerprinting, beat tracking, or tempo estimation on EEG data.
BibTeX:
@inproceedings{stober2015ismir,
  author = {Sebastian Stober and Avital Sternin and Adrian M. Owen and Jessica A. Grahn},
  title = {Towards Music Imagery Information Retrieval: Introducing the OpenMIIR Dataset of {EEG} Recordings from Music Perception and Imagination},
  booktitle = {16th International Society for Music Information Retrieval Conference (ISMIR'15)},
  year = {2015},
  pages = {763--769},
  url = {http://ismir2015.uma.es/articles/224_Paper.pdf}
}
Abstract: The neural processes involved in the perception of music are also involved in imagination. This overlap can be exploited by techniques that attempt to classify the contents of imagination from neural signals, such as signals recorded by EEG. Successful EEG-based classification of what an individual is imagining could pave the way for novel communication technologies, such as brain-computer interfaces. Our study explored whether we could accurately classify perceived and imagined musical stimuli from EEG data. To determine what characteristics of music resulted in the most distinct, and therefore most classifiable, EEG activity, we systematically varied properties of the music. These properties included time signature (3/4 versus 4/4), lyrics (music with lyrics versus music without), tempo (slow versus fast), and instrumentation. Our primary goal was to reliably distinguish between groups of stimuli based on these properties. We recorded EEG with a 64-channel BioSemi system while participants heard or imagined the different musical stimuli. We hypothesized that we would be able to classify which piece was being heard, or being imagined, from the EEG data. Using principal components analysis, we identified components common to both the perception and imagination conditions. Preliminary analyses show that the time courses of these components are unique to each stimulus and may be used for classification. To investigate other features of the EEG recordings that correlate with stimuli and thus enable accurate classification, we applied a machine learning approach, using deep learning techniques including sparse auto-encoders and convolutional neural networks. This approach has shown promising initial results: we were able to classify stimuli at above chance levels based on their time signature and to estimate the tempo of perceived and imagined music from EEG data. Our findings may ultimately lead to the development of a music-based brain-computer interface.
BibTeX:
@inproceedings{sternin2015smpc,
  author = {Avital Sternin and Sebastian Stober and Adrian M. Owen and Jessica A. Grahn},
  title = {Classifying Perception and Imagination of Music from EEG},
  booktitle = {Society for Music Perception \& Cognition Conference (SMPC'15)},
  year = {2015},
  note = {abstract/poster}
}
Abstract: Electroencephalography (EEG) recordings taken during the perception and the imagination of music contain enough information to estimate the tempo of a musical piece. Five participants listened to and imagined 12 short clips taken from familiar musical pieces — each 7s-16s long. Basic EEG preprocessing techniques were used to remove artifacts and a dynamic beat tracker was used to estimate average tempo. Autocorrelation curves were computed to investigate the periodicity seen in the average EEG waveforms, and the peaks from these curves were found to be proportional to stimulus measure length. As the tempo at which participants imagine may vary over time we used an aggregation technique that allowed us to estimate an accurate tempo over the course of an entire trial. We propose future directions involving convolutional neural networks (CNNs) that will allow us to apply our results to build a brain-computer interface.
BibTeX:
@inproceedings{sternin2015bcmi,
  author = {Avital Sternin and Sebastian Stober and Jessica A. Grahn and Adrian M. Owen},
  title = {Tempo Estimation from the EEG Signal during Perception and Imagination of Music},
  booktitle = {1st International Workshop on Brain-Computer Music Interfacing / 11th International Symposium on Computer Music Multidisciplinary Research (BCMI/CMMR'15)},
  year = {2015}
}
Abstract: In this paper we describe a novel concept of a search history visualization that is primarily designed for children. We propose to visualize the search history as a treasure map: The treasure map shows a landscape of islands. Each island represents the context of a user query. We visualize visited and unvisited relevant results and bookmarked documents for an issued query on an island. We argue that the treasure map may offer several advantages over the existing history mechanisms such as context awareness, appropriate metaphor for children, looping visualization, smaller cognitive load and higher efficiency in refinding information. We discuss design decisions that are important to build such map and interact with it and present the first prototype of the map.
BibTeX:
@inproceedings{gossen2015treasuremap,
  author = {Tatiana Gossen and Sebastian Stober and Andreas N\"{u}rnberger},
  title = {Treasure Map: Search History for Young Users},
  booktitle = {5th Workshop on Context-awareness in Retrieval and Recommendation (CaRR'15) in conjunction with the 37th European Conference on Information Retrieval (ECIR'15)},
  year = {2015}
}
2014
Abstract: Electroencephalography (EEG) recordings of rhythm perception might contain enough information to distinguish different rhythm types/genres or even identify the rhythms themselves. We apply convolutional neural networks (CNNs) to analyze and classify EEG data recorded within a rhythm perception study in Kigali, Rwanda which comprises 12 East African and 12 Western rhythmic stimuli – each presented in a loop for 32 seconds to 13 participants. We investigate the impact of the data representation and the pre-processing steps for this classification task and compare different network structures. Using CNNs, we are able to recognize individual rhythms from the EEG with a mean classification accuracy of 24.4% (chance level 4.17%) over all subjects by looking at less than three seconds from a single channel. Aggregating predictions for multiple channels, a mean accuracy of up to 50% can be achieved for individual subjects.
BibTeX:
@inproceedings{stober2014nips,
  author = {Sebastian Stober and Daniel J. Cameron and Jessica A. Grahn},
  title = {Using Convolutional Neural Networks to Recognize Rhythm Stimuli from Electroencephalography Recordings},
  booktitle = {Advances in Neural Information Processing Systems 27 (NIPS'14)},
  year = {2014},
  pages = {1449--1457},
  url = {http://papers.nips.cc/paper/5272-using-convolutional-neural-networks-to-recognize-rhythm-stimuli-from-electroencephalography-recordings}
}
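The aggregation step at the end of the abstract amounts to averaging per-channel posteriors. A toy NumPy sketch, where the Dirichlet draws stand in for the CNN's per-channel softmax outputs:

import numpy as np

rng = np.random.default_rng(1)
n_channels, n_classes = 64, 24            # 24 rhythm stimuli, chance ~4.17%

# Stand-in for the trained CNN's per-channel class probabilities on one trial.
channel_probs = rng.dirichlet(np.ones(n_classes), size=n_channels)

# Aggregate by averaging the per-channel posteriors, then decide.
aggregated = channel_probs.mean(axis=0)
print("predicted rhythm:", int(np.argmax(aggregated)))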
Abstract: Exploring image collections using similarity-based two-dimensional maps is an ongoing research area that faces two main challenges: with increasing size of the collection and complexity of the similarity metric projection accuracy rapidly degrades and computational costs prevent online map generation. We propose a prototype that creates the impression of panning a large (global) map by aligning inexpensive small maps showing local neighborhoods. By directed hopping from one neighborhood to the next the user is able to explore the whole image collection. Additionally, the similarity metric can be adapted by weighting image features and thus users benefit from a more informed navigation.
BibTeX:
@inproceedings{low2014nordichi,
  author = {Thomas Low and Christian Hentschel and Sebastian Stober and Harald Sack and Andreas N{\"u}rnberger},
  title = {Visual Berrypicking in Large Image Collections},
  booktitle = {Proceedings of the 8th Nordic Conference on Human-Computer Interaction: Fun, Fast, Foundational (NordiCHI'14)},
  year = {2014},
  pages = {1043--1046},
  url = {http://doi.acm.org/10.1145/2639189.2670271},
  doi = {10.1145/2639189.2670271}
}
Abstract: Music imagery information retrieval (MIIR) systems may one day be able to recognize a song just as we think of it. As one step towards such technology, we investigate whether rhythms can be identified from an electroencephalography (EEG) recording taken directly after their auditory presentation. The EEG data has been collected during a rhythm perception study in Kigali, Rwanda and comprises 12 East African and 12 Western rhythmic stimuli presented to 13 participants. Each stimulus was presented as a loop for 32 seconds followed by a break of four seconds before the next one started. Using convolutional neural networks (CNNs), we are able to recognize individual rhythms with a mean accuracy of 22.9% over all subjects by just looking at the EEG recorded during the silence between the stimuli.
BibTeX:
@inproceedings{stober2014audiomostly,
  author = {Sebastian Stober and Daniel J. Cameron and Jessica A. Grahn},
  title = {Does the Beat go on? -- Identifying Rhythms from Brain Waves Recorded after Their Auditory Presentation},
  booktitle = {Proceedings of the 9th Audio Mostly: A Conference on Interaction With Sound (AM'14)},
  year = {2014},
  pages = {23:1--23:8},
  url = {http://doi.acm.org/10.1145/2636879.2636904},
  doi = {10.1145/2636879.2636904}
}
BibTeX:
@inproceedings{stober2014ucnc,
  author = {Sebastian Stober},
  title = {Using Deep Learning Techniques to Analyze and Classify {EEG} Recordings},
  booktitle = {Computational Neuroscience Workshop at Unconventional Computation and Natural Computation Conference (UCNC'14)},
  year = {2014},
  note = {abstract/poster}
}
Abstract: Electroencephalography (EEG) recordings of rhythm perception might contain enough information to distinguish different rhythm types/genres or even identify the rhythms themselves. In this paper, we present first classification results using deep learning techniques on EEG data recorded within a rhythm perception study in Kigali, Rwanda. We tested 13 adults, mean age 21, who performed three behavioral tasks using rhythmic tone sequences derived from either East African or Western music. For the EEG testing, 24 rhythms – half East African and half Western with identical tempo and based on a 2-bar 12/8 scheme – were each repeated for 32 seconds. During presentation, the participants’ brain waves were recorded via 14 EEG channels. We applied stacked denoising autoencoders and convolutional neural networks on the collected data to distinguish African and Western rhythms on a group and individual participant level. Furthermore, we investigated how far these techniques can be used to recognize the individual rhythms.
BibTeX:
@inproceedings{stober2014ismir,
  author = {Sebastian Stober and Daniel J. Cameron and Jessica A. Grahn},
  title = {Classifying {EEG} Recordings of Rhythm Perception},
  booktitle = {15th International Society for Music Information Retrieval Conference (ISMIR'14)},
  year = {2014},
  pages = {649--654},
  url = {http://www.terasoft.com.tw/conf/ismir2014/proceedings/T117_317_Paper.pdf}
}
Abstract: In this paper, we explore alternative ways to visualize search results for children. We propose a novel search result visualization using characters. The main idea is to represent each web document as a character where a character visually provides clues about the webpage’s content. We focused on children between six and twelve as a target user group. Following the user-centered development approach, we conducted a preliminary user study to determine how children would represent a webpage as a sketch based on a given template of a character. Using the study results the first prototype of a search engine was developed. We evaluated the search interface on a touchpad and a touch table in a second user study and analyzed users' satisfaction and preferences.
BibTeX:
@inproceedings{gossen2014idc,
  author = {Gossen, Tatiana and M\"{u}ller, Rene and Stober, Sebastian and N\"{u}rnberger, Andreas},
  title = {Search Result Visualization with Characters for Children},
  booktitle = {Proceedings of the 2014 Conference on Interaction Design and Children},
  address = {New York, NY, USA},
  publisher = {ACM},
  year = {2014},
  series = {IDC '14},
  pages = {125--134},
  isbn = {9781450322720},
  url = {http://doi.acm.org/10.1145/2593968.2593983},
  doi = {10.1145/2593968.2593983}
}
BibTeX:
@book{amr2012proceedings,
  title = {Adaptive Multimedia Retrieval: Semantics, Context, and Adaptation},
  editor = {Andreas N\"{u}rnberger and Sebastian Stober and Birger Larsen and Marcin Detyniecki},
  publisher = {Springer International Publishing},
  year = {2014},
  series = {LNCS},
  volume = {8382},
  url = {http://link.springer.com/book/10.1007%2F978-3-319-12093-5},
  doi = {10.1007/978-3-319-12093-5}
}
2013
BibTeX:
@proceedings{mit2013,
  title = {Tagungsband der Magdeburger-Informatik-Tage, 2. Doktorandentagung 2013, MIT 2013},
  editor = {Robert Buchholz and Georg Krempl and Claudia Krull and Eike Schallehn and Sebastian Stober and Frank Ortmeier and Sebastian Zug},
  publisher = {Magdeburg University},
  year = {2013},
  isbn = {9783940961969}
}
Abstract: Map-based visualizations — sometimes also called projections — are a popular means for exploring music collections. But how useful are they if the collection is not static but grows over time? Ideally, a map that a user is already familiar with should be altered as little as possible and only as much as necessary to reflect the changes of the underlying collection. This paper demonstrates to what extent existing approaches are able to incrementally integrate new songs into existing maps and discusses their technical limitations. To this end, Growing Self-Organizing Maps, (Landmark) Multidimensional Scaling, Stochastic Neighbor Embedding, and the Neighbor Retrieval Visualizer are considered. The different algorithms are experimentally compared based on objective quality measurements as well as in a user study with an interactive user interface. In the experiments, the well-known Beatles corpus comprising the 180 songs from the twelve official albums is used — adding one album at a time to the collection.
BibTeX:
@inproceedings{stober2013ismir,
  author = {Sebastian Stober and Thomas Low and Tatiana Gossen and Andreas N\"{u}rnberger},
  title = {Incremental Visualization of Growing Music Collections},
  booktitle = {14th International Conference on Music Information Retrieval (ISMIR'13)},
  year = {2013},
  pages = {433--438},
  url = {http://www.ppgia.pucpr.br/ismir2013/wp-content/uploads/2013/09/40_Paper.pdf}
}
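One simple baseline for such incremental integration is to place each new song by minimizing stress against the already-placed songs while keeping their positions fixed, so the familiar layout is not disturbed. The sketch below only illustrates this idea; it is not one of the four algorithms compared in the paper:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
placed_xy = rng.uniform(size=(180, 2))       # existing map positions (stand-in)
d_new = rng.uniform(0.1, 1.0, size=180)      # distances of the new song to them

def stress(xy):
    d_map = np.linalg.norm(placed_xy - xy, axis=1)
    return np.sum((d_map - d_new) ** 2)

# Start at the position of the most similar placed song; existing positions
# stay fixed, so the user's mental map is preserved.
x0 = placed_xy[np.argmin(d_new)]
new_xy = minimize(stress, x0).x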
Abstract: Children need advanced support during web search or related interactions with computer systems. Here, a voice-controlled search engine offers several benefits. Children who have difficulties in writing will not make spelling errors when using voice control. Voice control is a natural input method and is supposed to be easier for children to use than a keyboard or mouse. To integrate suitable voice control into search engines, it is necessary to understand children’s behavior. Therefore, we investigate children’s speech patterns and interaction tactics during a web search using a voice-controlled search engine. A user study in the form of a Wizard-of-Oz experiment was conducted, and we found that children are motivated to use voice-controlled search engines. However, voice control in combination with touch interaction should be possible as well. Furthermore, the analysis of the speech patterns suggests that it is possible to build a speech recognition program. The results of this study can serve as a foundation for developing voice-controlled search dialogues for young users.
BibTeX:
@inproceedings{gossen2013hcir,
  author = {Tatiana Gossen and Michael Kotzyba and Sebastian Stober and Andreas N\"{u}rnberger},
  title = {Voice-Controlled Search User Interfaces for Young Users},
  booktitle = {7th annual Symposium on Human-Computer Interaction and Information Retrieval},
  address = {New York, NY, USA},
  year = {2013}
}
BibTeX:
@book{amr2011proceedings,
  title = {Adaptive Multimedia Retrieval. Large-Scale Multimedia Retrieval and Evaluation},
  editor = {Marcin Detyniecki and Ana Garc\'{\i}a-Serrano and Andreas N{\"u}rnberger and Sebastian Stober},
  address = {Berlin / Heidelberg},
  publisher = {Springer Verlag},
  year = {2013},
  series = {LNCS},
  volume = {7836},
  url = {http://www.springer.com/computer/database+management+%26+information+retrieval/book/978-3-642-37424-1},
  doi = {10.1007/978-3-642-37425-8}
}
BibTeX:
@article{Anglade:2013:RDL:2492334.2492343,
  author = {Anglade, Am{\'e}lie and Humphrey, Eric and Schmidt, Erik and Stober, Sebastian and Sordo, Mohamed},
  title = {Demos and Late-Breaking Session of the Thirteenth International Society for Music Information Retrieval Conference (ISMIR 2012)},
  address = {Cambridge, MA, USA},
  month = {Jun},
  journal = {Computer Music Journal},
  publisher = {MIT Press},
  year = {2013},
  volume = {37},
  number = {2},
  pages = {91--93},
  url = {http://dx.doi.org/10.1162/COMJ_r_00171},
  doi = {10.1162/COMJ_r_00171}
}
Abstract: In this work, we investigate techniques for voice-controlled interaction with search engines for young users. Voice control has many advantages for children. For example, the emotional state can be recognized from speech and used to support the search. In the following, the results of a Wizard-of-Oz experiment are presented in which children operated a search system via voice commands. The results of this study provide a basis for the development of voice-controlled search dialogues for children.
BibTeX:
@inproceedings{gossen2013gi,
  author = {Tatiana Gossen and Michael Kotzyba and Sebastian Stober and Andreas N\"{u}rnberger},
  title = {Sprachgesteuerte Benutzerschnittstellen zur Suche f\"{u}r junge Nutzer},
  booktitle = {43. Jahrestagung der Gesellschaft f\"{u}r Informatik},
  year = {2013}
}
Abstract: With the development of more and more sophisticated Music Information Retrieval (MIR) approaches, aspects of adaptivity are becoming an increasingly important research topic. Even though adaptive techniques have already found their way into MIR systems and contribute to robustness or user satisfaction, they are not always identified as such. This paper attempts a structured view on the last decade of MIR research from the perspective of adaptivity in order to increase awareness and promote the application and further development of adaptive techniques. To this end, different approaches from a wide range of application areas that share the common aspect of adaptivity are identified and systematically categorized.
BibTeX:
@article{stober2013mtap,
  author = {Sebastian Stober and Andreas N\"{u}rnberger},
  title = {Adaptive Music Retrieval - A State of the Art},
  journal = {Multimedia Tools and Applications},
  year = {2013},
  volume = {65},
  number = {3},
  pages = {467--494},
  url = {http://link.springer.com/article/10.1007%2Fs11042-012-1042-z},
  doi = {10.1007/s11042-012-1042-z}
}
Abstract: The hubness phenomenon, as it was recently described, consists in the observation that for increasing dimensionality of a data set the distribution of the number of times a data point occurs among the k nearest neighbors of other data points becomes increasingly skewed to the right. As a consequence, so-called hubs emerge, that is, data points that appear in the lists of the k nearest neighbors of other data points much more often than others. In this paper we challenge the hypothesis that the hubness phenomenon is an effect of the dimensionality of the data set and provide evidence that it is rather a boundary effect or, more generally, an effect of a density gradient. As such, it may be seen as an artifact that results from the process in which the data is generated that is used to demonstrate this phenomenon. We report experiments showing that the hubness phenomenon need not occur in high-dimensional data and can be made to occur in low-dimensional data.
BibTeX:
@inproceedings{low2013hubness,
  author = {Thomas Low and Christian Borgelt and Sebastian Stober and Andreas N\"{u}rnberger},
  title = {The Hubness Phenomenon: Fact or Artifact?},
  editor = {Christian Borgelt and Maria \'{A}ngeles Gil and Joao M.C. Sousa and Michel Verleysen},
  booktitle = {Towards Advanced Data Analysis by Combining Soft Computing and Statistics},
  publisher = {Springer Berlin / Heidelberg},
  year = {2013},
  series = {Studies in Fuzziness and Soft Computing},
  volume = {285},
  pages = {267--278},
  doi = {10.1007/978-3-642-30278-7_21}
}
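The k-occurrence skewness at the heart of the hubness phenomenon is easy to measure. A sketch with scikit-learn and SciPy, using uniform hypercube data as a stand-in (for such data the skew typically grows with dimensionality, the very observation whose explanation the paper challenges):

import numpy as np
from scipy.stats import skew
from sklearn.neighbors import NearestNeighbors

def k_occurrence_skewness(X, k=10):
    """Skewness of N_k, the number of times each point is a k-NN of others."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    counts = np.bincount(idx[:, 1:].ravel(), minlength=len(X))  # drop self-match
    return skew(counts)

rng = np.random.default_rng(0)
for dim in (3, 30, 300):
    X = rng.uniform(size=(2000, dim))
    print(dim, round(k_occurrence_skewness(X), 2))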
2012
Abstract: Most existing Music Information Retrieval (MIR) technologies require a user to use a query interface to search for a musical document. The mental image of the desired music is likely much richer than what the user is able to express through any query interface. This expressivity bottleneck could be circumvented if it was possible to directly read the music query from the user’s mind. To the authors’ knowledge, no such attempt has been made in the field of MIR so far. However, there have been recent advances in cognitive neuroscience that suggest such a system might be possible. Given these new insights, it seems promising to extend the focus of MIR by including music imagery – possibly forming a sub-discipline which could be called Music Imagery Information Retrieval (MIIR). As a first effort, there has been a dedicated session at the Late-Breaking & Demos event at the ISMIR 2012 conference. This paper aims to stimulate research in the field of MIIR by laying out a roadmap for future work.
BibTeX:
@inproceedings{ismir2012miir,
  author = {Sebastian Stober and Jessica Thompson},
  title = {Music Imagery Information Retrieval: Bringing the Song on Your Mind back to Your Ears},
  booktitle = {13th International Conference on Music Information Retrieval (ISMIR'12) - Late-Breaking \& Demo Papers},
  year = {2012}
}
Abstract: In order to support individual user perspectives and different retrieval tasks, music similarity can no longer be considered as a static element of Music Information Retrieval (MIR) systems. Various approaches have been proposed recently that allow dynamic adaptation of music similarity measures. This paper provides a systematic comparison of algorithms for metric learning and higher-level facet distance weighting on the MagnaTagATune dataset. A cross-validation variant taking into account clip availability is presented. Applied to user-generated similarity data, its effect on adaptation performance is analyzed. Special attention is paid to the amount of training data necessary for making similarity predictions on unknown data, the number of model parameters and the amount of information available about the music itself.
BibTeX:
@inproceedings{ismir2012stober,
  author = {Daniel Wolff and Sebastian Stober and Andreas N\"urnberger and Tillman Weyde},
  title = {A Systematic Comparison of Music Similarity Adaptation Approaches},
  booktitle = {13th International Conference on Music Information Retrieval (ISMIR'12)},
  year = {2012},
  pages = {103--108}
}
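The facet-weighting side of such comparisons can be sketched as gradient descent on hinge violations of relative constraints of the form "clip a is more similar to b than to c". All numbers below are synthetic stand-ins, not MagnaTagATune data:

import numpy as np

rng = np.random.default_rng(0)
n_facets, n_constraints = 5, 200

# d_ab[i], d_ac[i]: per-facet distances for the two pairs of constraint i.
d_ab = rng.uniform(size=(n_constraints, n_facets))
d_ac = d_ab + rng.normal(0.1, 0.2, size=(n_constraints, n_facets))

w = np.ones(n_facets) / n_facets
for _ in range(500):
    margin = d_ac @ w - d_ab @ w              # constraint satisfied if positive
    violated = margin < 0
    grad = (d_ab[violated] - d_ac[violated]).sum(axis=0)
    w = np.clip(w - 0.01 * grad, 0, None)     # keep facet weights non-negative
    w /= w.sum()                              # normalize

print("learned facet weights:", np.round(w, 3))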
Abstract: Music Information Retrieval (MIR) systems have to process multi-faceted information and at the same time be able to deal with heterogeneous users. Especially when the task is to organize a music collection, the diverse perspectives of users, caused by their different levels of expertise, musical background and taste, pose a great challenge. This challenge is addressed here by proposing adaptive methods for several elements of MIR systems: Data-adaptive feature extraction techniques are described that aim to improve the quality and robustness of the information extracted from audio recordings. The classical genre classification problem is treated from a novel user-centric perspective, building on the idea of idiosyncratic genres that better reflect a user's personal listening habits. An adaptive visualization technique for the exploration and organization of music collections is developed that specifically addresses projection errors, a widespread and unavoidable problem of dimensionality reduction techniques. Furthermore, it is outlined how this technique can be employed to improve the interestingness of music recommendations and to enable novel gaze-based interaction techniques. Finally, a general approach for adaptive music similarity is presented, which serves as the core of a multitude of adaptive MIR applications. The applicability of the described methods is demonstrated with several application prototypes.
BibTeX:
@inproceedings{gi-diss2011stober,
  author = {Sebastian Stober},
  title = {Adaptive Verfahren zur nutzerzentrierten Organisation von Musiksammlungen},
  editor = {Steffen H{\"o}lldobler and Abraham Bernstein and Klaus-Peter L{\"o}hr and Paul Molitor and Gustaf Neumann and R{\"u}diger Reischuk and Myra Spiliopoulou and Harald St{\"o}rrle and Dorothea Wagner},
  booktitle = {Ausgezeichnete Informatikdissertationen 2011},
  address = {Bonn},
  publisher = {Gesellschaft f{\"u}r Informatik},
  year = {2012},
  series = {Lecture Notes in Informatics (LNI)},
  volume = {D-12},
  pages = {211--220},
  isbn = {978-3-88579-416-5},
  note = {in German}
}
Abstract: Surprising a user with unexpected and fortunate recommendations is a key challenge for recommender systems. Motivated by the concept of bisociations, we propose ways to create an environment where such serendipitous recommendations become more likely. As application domain we focus on music recommendation using MusicGalaxy, an adaptive user-interface for exploring music collections. It leverages a non-linear multi-focus distortion technique that adaptively highlights related music tracks in a projection-based collection visualization depending on the current region of interest. While originally developed to alleviate the impact of inevitable projection errors, it can also adapt according to user-preferences. We discuss how using this technique beyond its original purpose can create distortions of the visualization that facilitate bisociative music discovery.
BibTeX:
@inproceedings{stober2012bison,
  author = {Sebastian Stober and Stefan Haun and Andreas N\"urnberger},
  title = {Bisociative Music Discovery and Recommendation},
  editor = {Michael R. Berthold},
  booktitle = {Bisociative Knowledge Discovery},
  publisher = {Springer Berlin / Heidelberg},
  year = {2012},
  series = {Lecture Notes in Computer Science},
  volume = {7250},
  pages = {472--483},
  isbn = {978-3-642-31829-0},
  doi = {10.1007/978-3-642-31830-6_33}
}
Abstract: Personalized and user-aware systems for retrieving multimedia items are becoming increasingly important as the amount of available multimedia data has been spiraling. A personalized system is one that incorporates information about the user into its data processing part (e.g., a particular user taste for a movie genre). A context-aware system, in contrast, takes into account dynamic aspects of the user context when processing the data (e.g., location and time where/when a user issues a query). Today’s user-adaptive systems often incorporate both aspects. Particularly focusing on the music domain, this article gives an overview of different aspects we deem important to build personalized music retrieval systems. In this vein, we first give an overview of factors that influence the human perception of music. We then propose and discuss various requirements for a personalized, user-aware music retrieval system. Eventually, the state-of-the-art in building such systems is reviewed, taking in particular aspects of “similarity” and “serendipity” into account.
BibTeX:
@incollection{schedl2012user-aware,
  author = {Markus Schedl and Sebastian Stober and Emilia G{\'o}mez and Nicola Orio and Cynthia C.S. Liem},
  title = {User-Aware Music Retrieval and Recommendation},
  editor = {Meinard M{\"u}ller and Masataka Goto and Markus Schedl},
  booktitle = {Multimodal Music Processing},
  address = {Dagstuhl, Germany},
  publisher = {Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
  year = {2012},
  series = {Dagstuhl Follow-Ups},
  volume = {3},
  pages = {135--156},
  isbn = {978-3-939897-37-8},
  url = {http://drops.dagstuhl.de/opus/volltexte/2012/3470},
  doi = {10.4230/DFU.Vol3.11041.135}
}
2011
Abstract: Music Information Retrieval (MIR) systems have to deal with multi-faceted music information and very heterogeneous users. Especially when the task is to organize a music collection, the diverse perspectives of users caused by their different level of expertise, musical background or taste pose a great challenge. This challenge is addressed in this book by proposing adaptive methods for several elements of MIR systems: Data-adaptive feature extraction techniques are described that aim to increase the quality and robustness of the information extracted from audio recordings. The classical genre classification problem is approached from a novel user-centric perspective – promoting the idea of idiosyncratic genres that better reflect a user’s personal listening habits. An adaptive visualization technique for exploration and organization of music collections is elaborated that especially addresses the common and inevitable problem of projection errors introduced by dimensionality reduction approaches. Furthermore, it is outlined how this technique can be applied to facilitate serendipitous music discoveries in a recommendation scenario and to enable novel gaze-supported interaction techniques. Finally, a general approach for adaptive music similarity is presented which serves as the core of many adaptive MIR applications. Application prototypes demonstrate the usability of the described approaches.
BibTeX:
@phdthesis{stober2011thesis,
  author = {Sebastian Stober},
  title = {Adaptive Methods for User-Centered Organization of Music Collections},
  type = {Dissertation},
  address = {Magdeburg, Germany},
  month = {Nov},
  school = {Otto-von-Guericke-University},
  year = {2011},
  note = {published by Dr. Hut Verlag, ISBN 978-3-8439-0229-8},
  url = {http://www.dr.hut-verlag.de/978-3-8439-0229-8.html}
}
BibTeX:
@book{amr2010proceedings,
  title = {Adaptive Multimedia Retrieval. Context, Exploration and Fusion},
  editor = {Detyniecki, Marcin and Knees, Peter and N\"{u}rnberger, Andreas and Schedl, Markus and Stober, Sebastian},
  address = {Berlin / Heidelberg},
  publisher = {Springer Verlag},
  year = {2011},
  series = {LNCS},
  volume = {6817},
  url = {http://www.springer.com/computer/database+management+%26+information+retrieval/book/978-3-642-27168-7},
  doi = {10.1007/978-3-642-27169-4}
}
Abstract: Surprising a user with unexpected and fortunate recommendations is a key challenge for recommender systems. Motivated by the concept of bisociations, we propose ways to create an environment where such serendipitous recommendations become more likely. As application domain we focus on music recommendation using MusicGalaxy, an adaptive user-interface for exploring music collections. It leverages a non-linear multi-focus distortion technique that adaptively highlights related music tracks in a projection-based collection visualization depending on the current region of interest. While originally developed to alleviate the impact of inevitable projection errors, it can also adapt according to user-preferences. We discuss how using this technique beyond its original purpose can create distortions of the visualization that facilitate bisociative music discovery.
BibTeX:
@inproceedings{audiomostly2011stober,
  author = {Sebastian Stober and Stefan Haun and Andreas N\"{u}rnberger},
  title = {Creating an Environment for Bisociative Music Discovery and Recommendation},
  booktitle = {Proceedings of Audio Mostly 2011 -- 6th Conference on Interaction with Sound -- Extended Abstracts},
  address = {Coimbra, Portugal},
  month = {Sep},
  year = {2011},
  pages = {1--6}
}
Abstract: Similarity plays an important role in many multimedia retrieval applications. However, it often has many facets and its perception is highly subjective — very much depending on a person’s background or retrieval goal. In previous work, we have developed various approaches for modeling and learning individual distance measures as a weighted linear combination of multiple facets in different application scenarios. Based on a generalized view of these approaches as an optimization problem guided by generic relative distance constraints, we describe ways to address the problem of constraint violations and finally compare the different approaches against each other. To this end, a comprehensive experiment using the Magnatagatune benchmark dataset is conducted.
BibTeX:
@inproceedings{amr2011stober,
  author = {Sebastian Stober and Andreas N\"{u}rnberger},
  title = {An Experimental Comparison of Similarity Adaptation Approaches},
  editor = {Marcin Detyniecki and Ana Garc\'{i}a-Serrano and Andreas N\"{u}rnberger and Sebastian Stober},
  booktitle = {Adaptive Multimedia Retrieval: Large Scale Multimedia Retrieval and Evaluation},
  address = {Berlin / Heidelberg},
  publisher = {Springer Verlag},
  year = {2011},
  series = {LNCS},
  volume = {7836},
  pages = {99--116},
  doi = {10.1007/978-3-642-37425-8_8}
}
Abstract: Music similarity plays an important role in many Music Information Retrieval applications. However, it has many facets and its perception is highly subjective – very much depending on a person’s background or task. This paper presents a generalized approach to modeling and learning individual distance measures for comparing music pieces based on multiple facets that can be weighted. The learning process is described as an optimization problem guided by generic distance constraints. Three application scenarios with different objectives exemplify how the proposed method can be employed in various contexts by deriving distance constraints either from domain-specific expert information or user actions in an interactive setting.
BibTeX:
@inproceedings{aes42i2011stober,
  author = {Sebastian Stober},
  title = {Adaptive Distance Measures for Exploration and Structuring of Music Collections},
  booktitle = {Proceedings of AES 42nd Conference on Semantic Audio},
  address = {Ilmenau, Germany},
  month = {Jul},
  year = {2011},
  pages = {275--284},
  url = {http://www.aes.org/e-lib/browse.cfm?elib=15952}
}
Abstract: Some popular algorithms used in Music Information Retrieval (MIR) such as Self-Organizing Maps (SOMs) require the objects they process to be represented as vectors, i.e. elements of a vector space. This is a rather severe restriction and if the data does not adhere to it, some means of vectorization is required. As a common practice, the full distance matrix is computed and each row of the matrix interpreted as an artificial feature vector. This paper empirically investigates the impact of this transformation. Further, an alternative approach for vectorization based on Multidimensional Scaling is proposed that is able to better preserve the actual distance relations of the objects which is essential for obtaining a good retrieval performance.
BibTeX:
@inproceedings{admire2011stober,
  author = {Sebastian Stober and Andreas N\"{u}rnberger},
  title = {Analyzing the Impact of Data Vectorization on Distance Relations},
  booktitle = {Multimedia and Expo (ICME), 2011 IEEE International Conference on},
  address = {Barcelona, Spain},
  month = {Jul},
  year = {2011},
  pages = {1--6},
  note = {part of Proceedings of 3rd International Workshop on Advances in Music Information Research (AdMIRe'11)},
  doi = {10.1109/ICME.2011.6012134}
}
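The comparison in this paper can be mimicked in a few lines with scikit-learn: vectorize once via distance-matrix rows and once via metric MDS on the precomputed distances, then check how well each variant preserves the original distance relations. Data and sizes below are arbitrary stand-ins:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))             # stand-in objects
D = squareform(pdist(X))                   # original pairwise distances

# (1) Common practice: each row of the distance matrix becomes a feature vector.
D_rows = squareform(pdist(D))

# (2) Proposed alternative: metric MDS on the precomputed distances.
emb = MDS(n_components=10, dissimilarity='precomputed',
          random_state=0).fit_transform(D)
D_mds = squareform(pdist(emb))

def fidelity(A, B):
    """Correlation between two sets of pairwise distances."""
    return np.corrcoef(squareform(A), squareform(B))[0, 1]

print("row-vectorization fidelity:", round(fidelity(D, D_rows), 3))
print("MDS fidelity:", round(fidelity(D, D_mds), 3))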
Abstract: While eye tracking is becoming more and more relevant as a promising input channel, diverse applications using gaze control in a more natural way are still rather limited. Though several researchers have indicated the particularly high potential of gaze-based interaction for pointing tasks, often gaze-only approaches are investigated. However, time-consuming dwell-time activations limit this potential. To overcome this, we present a gaze-supported fisheye lens in combination with (1) a keyboard and (2) a tilt-sensitive mobile multitouch device. In a user-centered design approach, we elicited how users would use the aforementioned input combinations. Based on the received feedback we designed a prototype system for the interaction with a remote display using gaze and a touch-and-tilt device. This eliminates gaze dwell-time activations and the well-known Midas Touch problem (unintentionally issuing an action via gaze). A formative user study testing our prototype provided further insights into how well the elaborated gaze-supported interaction techniques were experienced by users.
BibTeX:
@inproceedings{ngca2011stellmach,
  author = {Sophie Stellmach and Sebastian Stober and Raimund Dachselt and Andreas N\"{u}rnberger},
  title = {Designing Gaze-supported Multimodal Interactions for the Exploration of Large Image Collections},
  booktitle = {Proceedings of 1st International Conference on Novel Gaze-Controlled Applications (NGCA'11)},
  address = {Karlskrona, Sweden},
  month = {May},
  year = {2011},
  pages = {1--8},
  note = {Best Paper Award},
  doi = {10.1145/1983302.1983303}
}
Abstract: Sometimes users of a multimedia retrieval system are not able to explicitly state their information need. Exploratory retrieval tools support such search scenarios, where the retrieval goal cannot be stated explicitly as a query and users rather want to browse a collection in order to get an overview and to discover interesting content. In previous work, we have presented Adaptive SpringLens — an interactive visualization technique building upon popular neighborhood-preserving projections of multimedia collections. It uses a complex multi-focus fish-eye distortion of a projection to visualize neighborhoods, automatically adapted to the user’s current focus of interest. This paper investigates how far knowledge about the retrieval task collected during interaction can be used to adapt the underlying similarity measure that defines the neighborhoods.
BibTeX:
@inproceedings{amr2010stober,
  author = {Sebastian Stober and Andreas N\"{u}rnberger},
  title = {Similarity Adaptation in an Exploratory Retrieval Scenario},
  editor = {Detyniecki, Marcin and Knees, Peter and N\"{u}rnberger, Andreas and Schedl, Markus and Stober, Sebastian},
  booktitle = {Adaptive Multimedia Retrieval. Context, Exploration, and Fusion},
  publisher = {Springer Berlin / Heidelberg},
  year = {2011},
  series = {Lecture Notes in Computer Science},
  volume = {6817},
  pages = {144--158},
  isbn = {978-3-642-27168-7},
  doi = {10.1007/978-3-642-27169-4_11}
}
Abstract: A common way to support exploratory music retrieval scenarios is to give an overview using a neighborhood-preserving projection of the collection onto two dimensions. However, neighborhood cannot always be preserved in the projection because of the inherent dimensionality reduction. Furthermore, there is usually more than one way to look at a music collection and therefore different projections might be required depending on the current task and the user’s interests. We describe an adaptive zoomable interface for exploration that addresses both problems: It makes use of a complex non-linear multi-focal zoom lens that exploits the distorted neighborhood relations introduced by the projection. We further introduce the concept of facet distances representing different aspects of music similarity. User-specific weightings of these aspects allow an adaptation according to the user’s way of exploring the collection. Following a user-centered design approach with focus on usability, a prototype system has been created by iteratively alternating between development and evaluation phases. The results of an extensive user study including gaze analysis using an eye-tracker prove that the proposed interface is helpful while at the same time being easy and intuitive to use.
BibTeX:
@inproceedings{cmmrext2010stober,
  author = {Sebastian Stober and Andreas N\"{u}rnberger},
  title = {MusicGalaxy: A Multi-focus Zoomable Interface for Multi-facet Exploration of Music Collections},
  editor = {Ystad, S{\o}lvi and Aramaki, Mitsuko and Kronland-Martinet, Richard and Jensen, Kristoffer},
  booktitle = {Exploring Music Contents},
  address = {Berlin / Heidelberg},
  publisher = {Springer Verlag},
  year = {2011},
  series = {LNCS},
  volume = {6684},
  pages = {273--302},
  note = {extended paper for post-proceedings of 7th International Symposium on Computer Music Modeling and Retrieval (CMMR'10)},
  doi = {10.1007/978-3-642-23126-1_18}
}
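A much simplified stand-in for the multi-focus lens: a single radial fisheye that magnifies the projection around the current focus and smoothly returns to the identity at the lens border. The actual SpringLens distortion uses a mass-spring mesh with primary and secondary foci, so this sketch is illustrative only:

import numpy as np

def fisheye(points, focus, radius=0.3, magnification=2.0):
    """Magnify a 2-D neighborhood around `focus`; identity outside the lens."""
    v = points - focus
    r = np.linalg.norm(v, axis=1, keepdims=True)
    inside = r < radius
    # Remap radii inside the lens so that scale -> magnification at the
    # center and -> 1 exactly at the lens border (continuous transition).
    scale = np.where(
        inside, magnification / (1 + (magnification - 1) * r / radius), 1.0)
    return focus + v * scale

rng = np.random.default_rng(0)
proj = rng.uniform(-1, 1, size=(500, 2))     # stand-in 2-D projection
distorted = fisheye(proj, focus=np.array([0.0, 0.0]))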
2010
Abstract: Sometimes users of a multimedia retrieval system are not able to explicitly state their information need. They rather want to browse a collection in order to get an overview and to discover interesting content. In previous work, we have presented a novel interface implementing a fish-eye-based approach for browsing high-dimensional multimedia data that has been projected onto display space. The impact of projection errors is alleviated by introducing an adaptive non-linear multi-focus zoom lens. This work describes the evaluation of our approach in a user study where participants are asked to solve an exploratory image retrieval task using the SpringLens interface. As a baseline, the usability of the interface is compared to a common pan-and-zoom-based interface. The results of a survey and the analysis of recorded screencasts and eye tracking data are presented.
BibTeX:
@inproceedings{nordichi2010stober,
  author = {Sebastian Stober and Christian Hentschel and Andreas N\"{u}rnberger},
  title = {Evaluation of Adaptive SpringLens - A Multi-focus Interface for Exploring Multimedia Collections},
  booktitle = {Proceedings of 6th Nordic Conference on Human-Computer Interaction (NordiCHI'10)},
  address = {Reykjavik, Iceland},
  month = {Oct},
  year = {2010},
  pages = {785--788},
  doi = {10.1145/1868914.1869029}
}
Abstract: Aspects of individualization have so far been only a minor issue of research in the field of Music Information Retrieval (MIR). Often, it is assumed that all users of a MIR system compare music in the same (objective) manner. In order to ease access to steadily growing music collections, MIR systems should however be able to adapt to their users: e.g., an adaptive structuring of a collection becomes intuitively understandable and user-adaptive genre labels become more meaningful. In the first part of this talk, the general concept of adaptive systems is explained briefly. Afterwards, several approaches for incorporating adaptivity into MIR systems that are covered in the PhD project are pointed out. The second part of the talk focuses on MusicGalaxy — an adaptive visualization technique for the exploration of large music collections.
BibTeX:
@misc{dday2010stober,
  author = {Sebastian Stober},
  title = {Adaptive User-Centered Organization of Music Archives},
  month = {Jul},
  year = {2010},
  howpublished = {Talk at Doktorandentag, Faculty of Computer Science, Otto-von-Guericke-University Magdeburg}
}
Abstract: Sometimes users of a music retrieval system are not able to explicitly state what they are looking for. They rather want to browse a collection in order to get an overview and to discover interesting content. A common approach for browsing a collection relies on a similarity-preserving projection of objects (tracks, albums or artists) onto the (typically two-dimensional) display space. Inevitably, this implicates the use of dimension reduction techniques that cannot always preserve neighborhood and thus introduce distortions of the similarity space. MusicGalaxy is an interface for exploring large music collections (on the track level) using a galaxy metaphor that addresses the problem of distorted neighborhoods. Furthermore, the interface allows to adapt the underlying similarity measure to the user’s way of comparing tracks by weighting different facets of music similarity.
BibTeX:
@inproceedings{ismir2010stober,
  author = {Sebastian Stober and Andreas N\"{u}rnberger},
  title = {MusicGalaxy - An Adaptive User-Interface for Exploratory Music Retrieval},
  booktitle = {11th International Conference on Music Information Retrieval (ISMIR'10) - Late Breaking Demo Papers},
  address = {Utrecht, Netherlands},
  month = {Aug},
  year = {2010},
  url = {http://ismir2010.ismir.net/proceedings/late-breaking-demo-08.pdf}
}
Abstract: Visualization by projection or automatic structuring is one means to ease access to document collections, be it for exploration or organization. Of even greater help would be a presentation that adapts to the user’s individual way of structuring, which would be intuitively understandable. Meanwhile, several approaches have been proposed that try to support a user in this interactive organization and retrieval task. However, the evaluation of such approaches is still cumbersome and is usually done by expensive user studies. Therefore, we propose a framework for evaluation that simulates different kinds of structuring behavior of users, in order to evaluate the quality of the underlying adaptation algorithms.
BibTeX:
@inproceedings{simint2010stober,
  author = {Sebastian Stober and Andreas N\"{u}rnberger},
  title = {Automatic Evaluation of User Adaptive Interfaces for Information Organization and Exploration},
  booktitle = {SIGIR Workshop on the Simulation of Interaction (SimInt'10)},
  address = {Geneva, Switzerland},
  month = {Jul},
  year = {2010},
  pages = {33--34},
  url = {http://www.mansci.uwaterloo.ca/~msmucker/publications/simint10proceedings.pdf}
}
Abstract: Sometimes users of a music retrieval system are not able to explicitly state what they are looking for. They rather want to browse a collection in order to get an overview and to discover interesting content. A common approach for browsing a collection relies on a similarity-preserving projection of objects (tracks, albums or artists) onto the (typically two-dimensional) display space. Inevitably, this implicates the use of dimension reduction techniques that cannot always preserve neighborhood and thus introduce distortions of the similarity space. This paper describes ongoing work on MusicGalaxy — an interactive user-interface based on an adaptive non-linear multi-focus zoom lens that alleviates the impact of projection distortions. Furthermore, the interface allows manipulation of the neighborhoods as well as the projection by weighting different facets of music similarity. This way the visualization can be adapted to the user’s way of exploring the collection. Apart from the current interface prototype, findings from early evaluations are presented.
BibTeX:
@inproceedings{smc2010stober,
  author = {Sebastian Stober and Andreas N\"{u}rnberger},
  title = {{MusicGalaxy} - An Adaptive User-Interface for Exploratory Music Retrieval},
  booktitle = {Proceedings of 7th Sound and Music Computing Conference (SMC'10)},
  address = {Barcelona, Spain},
  month = {Jul},
  year = {2010},
  pages = {382--389}
}
Abstract: Many approaches for visualizing a music collection are based on techniques in which objects (music pieces, albums or artists) are projected from a high-dimensional feature space into 2- or 3-dimensional space for display. This inevitably distorts distances. As a consequence, neighboring objects may not be as similar as the display suggests, while objects far apart from each other may be very similar. This contribution presents an interactive visualization that provides a global view of a music collection and specifically addresses the described distortion problems with adaptive filter functions and a multi-focus zoom.
BibTeX:
@inproceedings{daga2010stober,
  author = {Sebastian Stober and Andreas N\"{u}rnberger},
  title = {Visualisierung von gro{\ss}en Musiksammlungen unter Ber\"{u}cksichtigung projektionsbedingter Verzerrungen},
  booktitle = {36. Jahrestagung f\"{u}r Akustik DAGA 2010, Berlin},
  address = {Berlin, Germany},
  month = {Mar},
  publisher = {German Acoustical Society (DEGA)},
  year = {2010},
  pages = {571--572},
  note = {in German}
}
Abstract: A common way to support exploratory music retrieval scenarios is to give an overview using a neighborhood-preserving projection of the collection onto two dimensions. However, neighborhood cannot always be preserved in the projection because of the dimensionality reduction. Furthermore, there is usually more than one way to look at a music collection and therefore different projections might be required depending on the current task and the user’s interests. We describe an adaptive zoomable interface for exploration that addresses both problems: It makes use of a complex non-linear multi-focal zoom lens that exploits the distorted neighborhood relations introduced by the projection. We further introduce the concept of facet distances representing different aspects of music similarity. Given user-specific weightings of these aspects, the system can adapt to the user’s way of exploring the collection by manipulation of the neighborhoods as well as the projection.
BibTeX:
@inproceedings{cmmr2010stober,
  author = {Sebastian Stober and Andreas N\"{u}rnberger},
  title = {A Multi-Focus Zoomable Interface for Multi-Facet Exploration of Music Collections},
  booktitle = {Proceedings of 7th International Symposium on Computer Music Modeling and Retrieval (CMMR'10)},
  address = {Malaga, Spain},
  month = {Jun},
  year = {2010},
  pages = {339--354}
}
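The facet-distance concept admits a compact statement: each facet contributes its own distance, and user-specific weights blend them into one adaptable measure. A minimal sketch follows; the facet names, feature encodings and weights are illustrative assumptions, not taken from the paper:

def facet_distance(a, b, facets, weights):
    # Weighted aggregation of per-facet distances into one measure;
    # the weights steer both neighborhood computation and projection.
    return sum(weights[name] * fn(a, b) for name, fn in facets.items())

# Illustrative usage with toy facet functions over feature dictionaries:
facets = {
    "timbre": lambda a, b: abs(a["brightness"] - b["brightness"]),
    "tempo":  lambda a, b: abs(a["bpm"] - b["bpm"]) / 200.0,
}
weights = {"timbre": 0.7, "tempo": 0.3}
d = facet_distance({"brightness": 0.4, "bpm": 120},
                   {"brightness": 0.9, "bpm": 128}, facets, weights)

Re-weighting the facets then adapts both the neighborhoods used by the lens and the projection itself, as described in the abstract.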
Abstract: Sometimes it is not possible for a user to state a retrieval goal explicitly a priori. One common way to support such exploratory retrieval scenarios is to give an overview using a neighborhood-preserving projection of the collection onto two dimensions. However, neighborhood cannot always be preserved in the projection because of the dimensionality reduction. Further, there is usually more than one way to look at a collection of images — and diversity grows with the number of features that can be extracted. We describe an adaptive zoomable interface for exploration that addresses both problems: It makes use of a complex non-linear multi-focal zoom lens that exploits the distorted neighborhood relations introduced by the projection. We further introduce the concept of facet distances representing different aspects of image similarity. Given user-specific weightings of these aspects, the system can adapt to the user’s way of exploring the collection by manipulation of the neighborhoods as well as the projection.
BibTeX:
@inproceedings{wcci2010stober,
  author = {Sebastian Stober and Christian Hentschel and Andreas N\"{u}rnberger},
  title = {Multi-Facet Exploration of Image Collections with an Adaptive Multi-Focus Zoomable Interface},
  booktitle = {Proceedings of 2010 IEEE World Congress on Computational Intelligence (WCCI'10)},
  address = {Barcelona, Spain},
  month = {Jul},
  year = {2010},
  pages = {2780--2787},
  doi = {10.1109/IJCNN.2010.5596747}
}
Abstract: We present a prototype system for organization and exploration of music archives that adapts to the user’s way of structuring music collections. Initially, a growing self-organizing map is induced that clusters the music collection. The user can then change the location of songs on the map by simple drag-and-drop actions. Each movement of a song causes a change in the underlying similarity measure based on a quadratic optimization scheme. As a result, the location of other songs is modified as well. Experiments simulating user interaction with the system show that in this stepwise adaptation the similarity measure indeed converges to one that captures how the user compares songs. This ultimately leads to an individually adapted presentation that is intuitively understandable to the user and thus eases access to the database.
BibTeX:
@inproceedings{amr08stober,
  author = {Sebastian Stober and Andreas N\"{u}rnberger},
  title = {Towards User-Adaptive Structuring and Organization of Music Collections},
  editor = {Marcin Detyniecki and Ulrich Leiner and Andreas N\"{u}rnberger},
  booktitle = {Adaptive Multimedia Retrieval. Identifying, Summarizing, and Recommending Image and Music. 6th International Workshop, AMR 2008, Berlin, Germany, June 26-27, 2008. Revised Selected Papers},
  address = {Heidelberg / Berlin},
  publisher = {Springer Verlag},
  year = {2010},
  series = {LNCS},
  volume = {5811},
  pages = {53--65},
  doi = {10.1007/978-3-642-14758-6_5}
}
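The adaptation step can be sketched as a small quadratic program: keep the new weights as close as possible to the old ones (a quadratic objective) while satisfying the constraint implied by the drag-and-drop, namely that the moved song ends up at least as close to its drop target as to its previous location. This is a hedged Python/scipy sketch assuming weighted per-feature squared distances; variable names and the exact constraint form are illustrative, not the paper's formulation:

import numpy as np
from scipy.optimize import minimize

def adapt_weights(w_old, x_song, x_target, x_origin):
    # Per-feature squared distances of the moved song to its drop
    # target and to its previous location on the map.
    d_target = (np.asarray(x_song) - np.asarray(x_target)) ** 2
    d_origin = (np.asarray(x_song) - np.asarray(x_origin)) ** 2
    objective = lambda w: np.sum((w - w_old) ** 2)   # stay near the old measure
    constraints = [
        # the song must now be closer to the target than to the origin
        {"type": "ineq", "fun": lambda w: w @ (d_origin - d_target)},
        {"type": "eq", "fun": lambda w: w.sum() - 1.0},  # weights sum to one
    ]
    res = minimize(objective, w_old, method="SLSQP",
                   bounds=[(0.0, None)] * len(w_old), constraints=constraints)
    return res.x if res.success else np.asarray(w_old)

Because all songs are placed by the same weighted measure, solving this after every movement also relocates songs the user never touched, which is the convergence effect examined in the paper's experiments.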
2009
BibTeX:
@proceedings{lsas2009,
  title = {{Proceedings of the 3rd Workshop on Learning the Semantics of Audio Signals (LSAS)}},
  editor = {Stephan Baumann and Juan Jos\'{e} Burred and Andreas N\"{u}rnberger and Sebastian Stober},
  address = {Graz, Austria},
  month = {Dec},
  year = {2009},
  isbn = {978-3-940961-38-9},
  url = {http://lsas2009.dke-research.de/proceedings/lsas2009proceedings.pdf}
}
Abstract: In order to enrich music information retrieval applications with information about a user’s listening habits, it is possible to automatically record a large variety of information about the listening context. However, recording such information may violate the user’s privacy. This paper presents and discusses the results of a survey that has been conducted to assess the acceptance of listening context logging.
BibTeX:
@inproceedings{lsas2009stoberSteinbrecherNuernberger,
  author = {Sebastian Stober and Matthias Steinbrecher and Andreas N\"{u}rnberger},
  title = {A Survey on the Acceptance of Listening Context Logging for MIR Applications},
  editor = {Stephan Baumann and Juan Jos\'{e} Burred and Andreas N\"{u}rnberger and Sebastian Stober},
  booktitle = {Proceedings of the 3rd Workshop on Learning the Semantics of Audio Signals (LSAS)},
  address = {Graz, Austria},
  month = {Dec},
  year = {2009},
  pages = {45--57},
  url = {http://lsas2009.dke-research.de/proceedings/lsas2009stoberSteinbrecherNuernberger.pdf}
}
Abstract: In folk song research, appropriate similarity measures can be of great help, e.g. for classification of new tunes. Several measures have been developed so far. However, a particular musicological way of classifying songs is usually not directly reflected by just a single one of these measures. We show how a weighted linear combination of different basic similarity measures can be automatically adapted to a specific retrieval task by learning this metric based on a special type of constraints. Further, we describe how these constraints are derived from information provided by experts. In experiments on a folk song database, we show that the proposed approach outperforms the underlying basic similarity measures and study the effect of different levels of adaptation on the performance of the retrieval system.
BibTeX:
@inproceedings{ismir09stober,
  author = {Korinna Bade and J\"{o}rg Garbers and Sebastian Stober and Frans Wiering and Andreas N\"{u}rnberger},
  title = {Supporting Folk-Song Research by Automatic Metric Learning and Ranking},
  booktitle = {Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR'09)},
  address = {Kobe, Japan},
  month = {Oct},
  year = {2009},
  pages = {741--746},
  url = {http://ismir2009.ismir.net//proceedings/OS9-3.pdf}
}
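A weighted linear combination of basic similarity measures can be learned from relative constraints of the form "tune a should be closer to b than to c", as derived here from expert-provided information. The following is a minimal gradient sketch, not the paper's actual learning scheme; the array shapes and names are assumptions:

import numpy as np

def learn_measure_weights(D, constraints, lr=0.05, epochs=200, margin=0.1):
    # D: array of shape (n_measures, n_tunes, n_tunes), one precomputed
    #    distance matrix per basic measure.
    # constraints: triples (a, b, c) meaning d(a, b) < d(a, c).
    w = np.ones(D.shape[0]) / D.shape[0]
    for _ in range(epochs):
        grad = np.zeros_like(w)
        for a, b, c in constraints:
            # hinge: accumulate a gradient only for violated constraints
            if w @ D[:, a, b] + margin > w @ D[:, a, c]:
                grad += D[:, a, b] - D[:, a, c]
        w = np.clip(w - lr * grad / max(len(constraints), 1), 0.0, None)
        if w.sum() > 0:
            w /= w.sum()   # keep weights non-negative and normalized
    return w

Ranking tunes by the learned combined distance is what the paper's retrieval experiments evaluate against the individual basic measures.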
Abstract: Keeping one’s personal music collections well organized can be a very tedious task. Fortunately, today, many popular music players (such as AmaroK or iTunes) have an integrated library function that can automatically rename and tag music files and sort them into subdirectories. However, their common approach to stick with some hierarchy of genre, artist name, and album title barely represents the way a user would structure his collection manually. When it comes to organizing a music collection according to a user-specific hierarchy, three things are required: First, the music files have to be described by appropriate features beyond simple meta-tags. This includes content-based analysis but also incorporation of external information sources such as the web. Second, knowledge about the user’s structuring preferences must be available. And third, and most importantly, methods for learning personalized hierarchies that can integrate this knowledge are needed. We propose for this task a hierarchical constraint-based clustering approach that can weight the importance of different features according to the user-perceived similarity. A hierarchy based on this similarity measure reflects a user’s view on the collection.
BibTeX:
@inproceedings{daga09stober,
  author = {Korinna Bade and Andreas N\"{u}rnberger and Sebastian Stober},
  title = {Everything in its right place? Learning a user's view of a music collection},
  booktitle = {Proceedings of NAG/DAGA 2009, International Conference on Acoustics, Rotterdam},
  address = {Berlin, Germany},
  publisher = {German Acoustical Society (DEGA)},
  year = {2009},
  pages = {344--347}
}
Abstract: Automatic structuring is one means to ease access to large music collections — be it for organisation or exploration. The AUCOMA project (Adaptive User-Centered Organization of Music Archives) aims to find ways to make such a structuring intuitively understandable to a user through automatic adaptation. This article describes the motivation of the project, discusses related work in the field of music information retrieval and presents first project results.
BibTeX:
@article{ki09stober,
  author = {Sebastian Stober and Andreas N\"{u}rnberger},
  title = {User-Adaptive Music Information Retrieval},
  journal = {KI},
  year = {2009},
  volume = {23},
  number = {2},
  pages = {54--57},
  url = {http://www.kuenstliche-intelligenz.de/index.php?id=7778&tx_ki_pi1[showUid]=1800&cHash=5143a324cc}
}
2008
Abstract: This paper aims to motivate and demonstrate how widely available environmental data can be exploited to allow organization, structuring and exploration of music collections by personal listening contexts. We describe a logging plug-in for music players that automatically records data about the listening context and discuss possible extensions for more sophisticated context logging. Based on data collected in a small user experiment, we show how data mining techniques can be applied to reveal common usage patterns. Further, a prototype user interface based on elastic lists for browsing by listening context is presented.
BibTeX:
@inproceedings{lsas08stober,
  author = {Valentin Laube and Christian Moewes and Sebastian Stober},
  title = {Browsing Music by Usage Context},
  editor = {Juan J. Burred and Andreas N\"{u}rnberger and Geoffroy Peeters and Sebastian Stober},
  booktitle = {Proceedings of the 2nd Workshop on Learning the Semantics of Audio Signals (LSAS)},
  address = {Paris, France},
  month = {June},
  publisher = {IRCAM},
  year = {2008},
  pages = {19--29},
  url = {http://lsas2008.dke-research.de/proceedings/lsas2008_p19-29_LaubeMoewesStober.pdf}
}
BibTeX:
@proceedings{lsas2008,
  title = {{Proceedings of the 2nd Workshop on Learning the Semantics of Audio Signals (LSAS)}},
  editor = {Juan J. Burred and Andreas N\"{u}rnberger and Geoffroy Peeters and Sebastian Stober},
  address = {Paris, France},
  month = {June},
  publisher = {IRCAM},
  year = {2008},
  isbn = {978-3-9804874-7-4},
  url = {http://lsas2008.dke-research.de/proceedings/lsas2008_proceedings.pdf}
}
Abstract: The chord progression of a song is an important high-level feature which enables indexing as well as deeper analysis of musical recordings. Different approaches to chord recognition have been suggested in the past. Although their performance has increased, significant error rates still seem unavoidable. One way to improve accuracy is to try to correct possible misclassifications. In this paper, we propose a post-processing method based on considerations of musical harmony, assuming that the pool of chords used in a song is limited and that strong oscillations of chords are uncommon. We show that exploiting (uncertain) knowledge about the chord distribution in a chord’s neighbourhood can significantly improve chord detection accuracy by evaluating our proposed post-processing method for three baseline classifiers on two early Beatles albums.
BibTeX:
@inproceedings{cbmi08stober,
  author = {Johannes Reinhard and Sebastian Stober and Andreas N\"{u}rnberger},
  title = {Enhancing Chord Classification through Neighbourhood Histograms},
  booktitle = {Proceedings of the 6th International Workshop on Content-Based Multimedia Indexing (CBMI 2008)},
  address = {London, UK},
  year = {2008},
  pages = {33--40},
  url = {http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4564924},
  doi = {10.1109/CBMI.2008.4564924}
}
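As a loose illustration of histogram-based post-processing (not the authors' exact method), the sketch below relabels each frame by a vote over the chord histogram of its neighbourhood, which suppresses exactly the short, implausible oscillations the paper assumes to be uncommon. The window size and vote threshold are made-up parameters:

from collections import Counter

def smooth_chords(frames, window=7, min_share=0.5):
    # Relabel each frame with the dominant chord of the surrounding
    # window, but only when the window clearly agrees.
    half = window // 2
    out = []
    for i in range(len(frames)):
        hist = Counter(frames[max(0, i - half): i + half + 1])
        chord, count = hist.most_common(1)[0]
        out.append(chord if count / sum(hist.values()) >= min_share else frames[i])
    return out

# e.g. smooth_chords(["C", "C", "G", "C", "C"]) relabels the isolated "G".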
Abstract: Automatic structuring can considerably ease access to music archives, especially their exploration and organization. Even more helpful would be a presentation that adapts to the way the user structures music collections and is thus intuitively understandable. We present a prototype system that learns a personalized similarity measure from the user's interaction with a music collection. First, a growing self-organizing map (SOM) is trained that groups similar pieces of music in a user-independent manner. The user can then change the position of pieces of music on the map through simple drag-and-drop actions. Each movement triggers an automatic adaptation of the similarity measure underlying the map, which may cause other pieces to change their position as well.
BibTeX:
@inproceedings{daga08stober,
  author = {Sebastian Stober and Andreas N\"{u}rnberger},
  title = {{AUCOMA - Adaptive Nutzerzentrierte Organisation von Musikarchiven}},
  editor = {Ute Jekosch and R\"{u}diger Hoffmann},
  booktitle = {Fortschritte der Akustik: Plenarvortr\"{a}ge und Fachbeitr\"{a}ge der 34. Deutschen Jahrestagung f\"{u}r Akustik DAGA 2008, Dresden},
  address = {Berlin, Germany},
  month = {Mar},
  publisher = {German Acoustical Society (DEGA)},
  year = {2008},
  pages = {547--548},
  note = {in German}
}
Abstract: Recent approaches in Automatic Image Annotation (AIA) try to combine the expressiveness of natural language queries with approaches to minimize the manual effort for image annotation. The main idea is to infer the annotations of unseen images using a small set of manually annotated training examples. However, typically these approaches suffer from low correlation between the globally assigned annotations and the local features used to obtain annotations automatically. In this paper we propose a framework to support image annotations based on a visual dictionary that is created automatically using a set of locally annotated training images. We designed a segmentation and annotation interface to allow for easy annotation of the training data. In order to provide a framework that is easily extendable and reusable we make broad use of the MPEG-7 standard.
BibTeX:
@inproceedings{amr07hentschel,
  author = {Christian Hentschel and Sebastian Stober and Andreas N\"{u}rnberger and Marcin Detyniecki},
  title = {Automatic Image Annotation Using a Visual Dictionary Based on Reliable Image Segmentation},
  editor = {Marcin Detyniecki and Andreas N\"{u}rnberger},
  booktitle = {Adaptive Multimedial Retrieval: Retrieval, User, and Semantics. 5th International Workshop, AMR 2007, Paris, France, July 5-6, 2007, Revised Selected Papers},
  address = {Heidelberg / Berlin},
  publisher = {Springer Verlag},
  year = {2008},
  series = {LNCS},
  volume = {4918},
  pages = {45--56},
  doi = {10.1007/978-3-540-79860-6_4}
}
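The generic idea of a visual dictionary can be sketched independently of the MPEG-7 framework used in the paper: cluster local descriptors of training segments into visual words, then describe an image as a histogram over those words. A hedged Python/scikit-learn sketch; descriptor shapes and the word count are assumptions:

import numpy as np
from sklearn.cluster import KMeans

def build_visual_dictionary(local_descriptors, n_words=200):
    # local_descriptors: list of (n_i, d) arrays, one per training image,
    # e.g. color/texture vectors of locally annotated segments.
    km = KMeans(n_clusters=n_words, n_init=10, random_state=0)
    km.fit(np.vstack(local_descriptors))
    return km   # the cluster centers act as visual words

def bag_of_visual_words(km, descriptors):
    # Represent an image as a histogram over the dictionary.
    words = km.predict(descriptors)
    return np.bincount(words, minlength=km.n_clusters)

Propagating the annotations attached to training segments to unseen images via their nearest visual words is one way to realize the inference step the abstract refers to.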
Abstract: Automatic structuring is one means to ease access to document collections, be it for organization or for exploration. Of even greater help would be a presentation that adapts to the user’s way of structuring and thus is intuitively understandable. We extend an existing user-adaptive prototype system that is based on a growing self-organizing map and that learns a feature weighting scheme from a user’s interaction with the system, resulting in a personalized similarity measure. The proposed approach for adapting the feature weights targets certain problems of previously used heuristics. The revised adaptation method is based on quadratic optimization, which allows us to pose certain constraints on the derived weighting scheme. Moreover, it is guaranteed that an optimal weighting scheme is found if one exists. The proposed approach is evaluated by simulating user interaction with the system on two text datasets: an artificial data set that is used to analyze the performance for different user types and a real-world data set – a subset of the banksearch dataset – containing additional class information.
BibTeX:
@inproceedings{amr07stober,
  author = {Sebastian Stober and Andreas N\"{u}rnberger},
  title = {User Modelling for Interactive User-Adaptive Collection Structuring},
  editor = {Marcin Detyniecki and Andreas N\"{u}rnberger},
  booktitle = {Adaptive Multimedial Retrieval: Retrieval, User, and Semantics. 5th International Workshop, AMR 2007, Paris, France, July 5-6, 2007, Revised Selected Papers},
  address = {Heidelberg / Berlin},
  publisher = {Springer Verlag},
  year = {2008},
  series = {LNCS},
  volume = {4918},
  pages = {95--108},
  doi = {10.1007/978-3-540-79860-6_8}
}
2007
Abstract: Current work on Query-by-Singing/Humming (QBSH) focuses mainly on databases that contain MIDI files. Here, we present an approach that works on real audio recordings, which bring up additional challenges. To tackle the problem of extracting the melody of the lead vocals from recordings, we introduce a method inspired by the popular “karaoke effect”, exploiting information about the spatial arrangement of voices and instruments in the stereo mix. The extracted signal time series are aggregated into symbolic strings preserving the locally approximated values of a feature and revealing higher-level context patterns. This allows distance measures for string pattern matching to be applied in the matching process. A series of experiments is conducted to assess the discrimination and robustness of this representation. They show that the proposed approach provides a viable baseline for further development and point out several possibilities for improvement.
BibTeX:
@inproceedings{ismir07qbsh,
  author = {Alexander Duda and Andreas N\"{u}rnberger and Sebastian Stober},
  title = {Towards Query by Singing/Humming on Audio Databases},
  editor = {Simon Dixon and David Bainbridge and Rainer Typke},
  booktitle = {Proceedings of the 8th International Conference on Music Information Retrieval, ISMIR 2007},
  address = {Vienna, Austria},
  month = {September},
  publisher = {\"{O}CG},
  year = {2007},
  pages = {331--334},
  url = {http://ismir2007.ismir.net/proceedings/ISMIR2007_p331_duda.pdf}
}
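The stereo intuition behind the “karaoke effect” can be shown in a few lines: lead vocals are commonly mixed to the center, so a mid/side decomposition separates center-panned content from hard-panned accompaniment. This only illustrates the basic trick the paper takes inspiration from, not its actual extraction method:

import numpy as np

def mid_side(stereo):
    # stereo: float array of shape (n_samples, 2).
    left, right = stereo[:, 0], stereo[:, 1]
    mid = 0.5 * (left + right)    # center-panned content (often lead vocals)
    side = 0.5 * (left - right)   # panned content; playing back L - R is the
                                  # classic karaoke trick that cancels the center
    return mid, side

The melody line used for matching would then be estimated from the vocal-dominated mid signal rather than from the full mix.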
Abstract: Most of the currently existing image retrieval systems make use of either low-level features or semantic (textual) annotations. A combined usage during annotation and retrieval is rarely attempted. In this paper, we propose a standardized annotation framework that integrates semantic and feature based information about the content of images. The presented approach is based on the MPEG-7 standard with some minor extensions. The proposed annotation system SAFIRE (Semantic Annotation Framework for Image REtrieval) enables the combined use of low-level features and annotations that can be assigned to arbitrary hierarchically organized image segments. Besides the framework itself, we discuss query formalisms required for this unified retrieval approach.
BibTeX:
@inproceedings{amr06safire,
  author = {Christian Hentschel and Andreas N\"{u}rnberger and Ingo Schmitt and Sebastian Stober},
  title = {{SAFIRE: Towards Standardized Semantic Rich Image Annotation}},
  editor = {Marcin Detyniecki and Andreas N\"{u}rnberger and Eric Bruno and Stephane Marchand-Maillet},
  booktitle = {Adaptive Multimedia Retrieval: User, Context, and Feedback. 4th International Workshop, AMR 2006, Geneva, Switzerland, July, 27-28, 2006, Revised Selected Papers},
  address = {Berlin / Heidelberg},
  publisher = {Springer Verlag},
  year = {2007},
  series = {LNCS},
  volume = {4398},
  pages = {12--27},
  doi = {10.1007/978-3-540-71545-0_2}
}
2006
BibTeX:
@proceedings{lsas2006,
  title = {{Proceedings of the 1st Workshop on Learning the Semantics of Audio Signals (LSAS)}},
  editor = {Pedro Cano and Andreas N\"{u}rnberger and Sebastian Stober and George Tzanetakis},
  address = {Athens, Greece},
  month = {Dec},
  year = {2006},
  url = {http://irgroup.cs.uni-magdeburg.de/lsas2006/proceedings/LSAS06_Full.pdf}
}
Abstract: We have developed the system DAWN (direction anticipation in web navigation) that learns navigational patterns to help users navigate through the world wide web. In this paper, we present the prediction model and the algorithm for link recommendation of this system. Besides this main focus, we briefly outline the system architecture and further motivate the purpose of such a system and the approach taken. A first evaluation on real-world data gave promising results.
BibTeX:
@inproceedings{kes2006stober,
  author = {Sebastian Stober and Andreas N\"{u}rnberger},
  title = {{DAWN -- A System for Context-Based Link Recommendation in Web Navigation}},
  editor = {Bogdan Gabrys and Robert J. Howlett and Lakhmi C. Jain},
  booktitle = {Knowledge-Based Intelligent Information and Engineering Systems},
  address = {Berlin / Heidelberg},
  month = {Oct},
  publisher = {Springer Verlag},
  year = {2006},
  series = {LNAI},
  volume = {4251},
  pages = {763--770},
  isbn = {3-540-46535-9}
}
Abstract: In this paper, we present the system DAWN (direction anticipation in web navigation) that helps users to navigate through the world wide web. First, the purpose of such a system and the approach taken are motivated. We then point out relations to other approaches, describe the system and outline the underlying prediction model. An evaluation on real-world data gave promising results.
BibTeX:
@inproceedings{ah2006stober,
  author = {Sebastian Stober and Andreas N\"urnberger},
  title = {{Context-Based Navigational Support in Hypermedia}},
  editor = {Barry Smyth and Helen Ashman and Vincent Wade},
  booktitle = {Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems (AH 2006)},
  address = {Berlin / Heidelberg},
  month = {Jun},
  publisher = {Springer Verlag},
  year = {2006},
  series = {LNCS},
  volume = {4018},
  pages = {328--332},
  doi = {10.1007/11768012_43}
}
Abstract: Searching the Web and other local resources has become an everyday task for almost everybody. However, the currently available tools for searching still provide only very limited support with respect to categorization and visualization of search results as well as personalization. In this paper, we present a system for searching that can be used by an end user and also by researchers in order to develop and evaluate a variety of methods to support a user in searching. The CARSA system provides a very flexible architecture based on web services and XML. This includes the use of different search engines, categorization methods, visualization techniques, and user interfaces. The user has complete control over the features used. This system therefore provides a platform for evaluating the usefulness of different retrieval support methods and their combination.
BibTeX:
@inproceedings{amr05carsa,
  author = {Korinna Bade and Ernesto William {De Luca} and Andreas N\"{u}rnberger and Sebastian Stober},
  title = {{CARSA - An Architecture for the Development of Context Adaptive Retrieval Systems}},
  editor = {Keith van Rijsbergen and Andreas N\"{u}rnberger and Joemon M. Jose and Marcin Detyniecki},
  booktitle = {Adaptive Multimedia Retrieval: User, Context, and Feedback. 3rd International Workshop, AMR 2005, Glasgow, UK, July 28-29, 2005, Revised Selected Papers},
  address = {Berlin / Heidelberg},
  month = {Feb},
  publisher = {Springer Verlag},
  year = {2006},
  series = {LNCS},
  volume = {3877},
  pages = {91--101},
  doi = {10.1007/11670834_8}
}
2005
Abstract: In the constantly growing and changing sea of data that is the World Wide Web, users navigating from web page to web page are largely left to their own devices. This thesis presents an approach for predicting whether a user is likely to follow a given link. These predictions make it possible to highlight certain links and thereby support the user during navigation. The implemented method can be deployed on the client side and is therefore not restricted to particular areas of the World Wide Web. For prediction, a higher-order Markov model is learned from recorded browsing paths. Each browsing path is decomposed into a sequence of contexts, where each context is represented as a document vector with TF/iDF weights and corresponds, for example, to the text of a web page or of a paragraph. The set of contexts is clustered, which abstracts browsing paths into navigation patterns and reduces the size of the model learned from them. To learn the model, an algorithm developed by Borges and Levene for server-side use was extended and transferred to the client-side setting. Links are finally predicted by a specially developed procedure that, for a given browsing path, allows several similar navigation patterns in the model to be considered simultaneously. The entire method is parameterized. However, the influence of the various parameters and the quality of the predictions could only be examined on a small data collection, which conveys only a basic impression of how the system works. The system is embedded in a framework for client-side recording and analysis of user actions during browsing, which was also developed in the context of this thesis. This framework is a standalone, extensible system that can also be used for other projects and easily adapted to their respective requirements.
BibTeX:
@mastersthesis{stober05diploma,
  author = {Sebastian Stober},
  title = {Kontextbasierte Navigationsunterst\"{u}tzung mit Markov-Modellen},
  type = {Diploma Thesis},
  address = {Magdeburg, Germany},
  month = {Dec},
  school = {Otto-von-Guericke-University},
  year = {2005},
  note = {in German}
}
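The described pipeline (TF/iDF context vectors, clustering into navigation patterns, Markov transitions) can be outlined compactly. The sketch below is first-order rather than higher-order, uses scikit-learn instead of the thesis' extended Borges-Levene algorithm, and all names are illustrative:

from collections import Counter, defaultdict
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def build_model(paths, n_states=50):
    # paths: lists of context texts (page or paragraph contents) per
    # recorded browsing session; n_states must not exceed the total
    # number of contexts.
    texts = [ctx for path in paths for ctx in path]
    vec = TfidfVectorizer()
    km = KMeans(n_clusters=n_states, n_init=10, random_state=0)
    states = km.fit_predict(vec.fit_transform(texts))   # contexts -> clusters
    trans, i = defaultdict(Counter), 0
    for path in paths:                                  # re-split per session
        seq = states[i:i + len(path)]
        i += len(path)
        for s, t in zip(seq, seq[1:]):                  # count state transitions
            trans[s][t] += 1
    return vec, km, trans

def predict_next(vec, km, trans, context_text, top_k=3):
    # Rank likely follow-up states for the current context; links leading
    # into highly ranked states would be the ones highlighted for the user.
    s = km.predict(vec.transform([context_text]))[0]
    return trans[s].most_common(top_k)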
Abstract: This report refers to work completed during my internship with the Mechatronics Research Group at the department of Mechanical and Manufacturing Engineering at the University of Melbourne, Australia from September 5th, 2003 until March 5th, 2004. Recognition of three-dimensional objects in two-dimensional images is a key area of research in computer vision. One approach is to save multiple 2D views instead of a 3D object representation, thus reducing the problem to a 2D-to-2D matching problem. The Mechatronics Research Group is developing a novel system that focuses on artificial objects and further reduces the 2D views to symbolic descriptions. These descriptions are based on shape primitives: ellipses, rectangles and isosceles triangles. Evidence in support of a hypothesis for a certain object classification is collected through an active vision approach. This work deals with the design and implementation of a data structure that is capable of holding such a symbolic representation and an algorithm for comparison and matching. The chosen symbolic representation of an object view is rotation-, scaling- and translation-invariant. For the comparison and matching of two object views, a branch & bound algorithm based on problem-specific heuristics is used. Furthermore, a GA-based generalization operator is proposed to reduce the number of object views in the system database. Experiments show that the query performance scales linearly with the size of the database. For a database containing 10000 entries, a response time of less than a second is expected on an average system.
BibTeX:
@mastersthesis{stober05internship,
  author = {Sebastian Stober},
  title = {Design and Implementation of an Algorithm and Data Structure for Matching of Geometric Primitives in Visual Object Classification},
  type = {Internship Report},
  address = {Magdeburg, Germany},
  month = {Apr},
  school = {Otto-von-Guericke-University},
  year = {2005}
}
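The matching step can be illustrated with a generic branch & bound over an assignment problem: given pairwise dissimilarities between the primitives of two object views, search for the cheapest one-to-one assignment, pruning with an optimistic completion bound. This is a textbook sketch, not the report's problem-specific heuristics; the cost matrix is assumed to be precomputed from the rotation-, scaling- and translation-invariant descriptions:

import numpy as np

def branch_and_bound_match(cost):
    # cost: (n_query, n_db) dissimilarities; requires n_query <= n_db.
    n_q = cost.shape[0]
    row_min = cost.min(axis=1)           # optimistic per-row completion cost
    best = {"cost": np.inf, "assign": None}

    def search(i, used, acc, assign):
        if acc + row_min[i:].sum() >= best["cost"]:
            return                       # prune: cannot beat the incumbent
        if i == n_q:
            best["cost"], best["assign"] = acc, assign.copy()
            return
        for j in np.argsort(cost[i]):    # try cheapest columns first
            if int(j) in used:
                continue
            used.add(int(j)); assign.append(int(j))
            search(i + 1, used, acc + cost[i, j], assign)
            used.remove(int(j)); assign.pop()

    search(0, set(), 0.0, [])
    return best["assign"], best["cost"]

# Illustrative usage with a random dissimilarity matrix:
assign, total = branch_and_bound_match(np.random.default_rng(0).random((4, 6)))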